tape.ensemble#
Attributes#
Classes#
Ensemble object is a collection of light curve ids |
Module Contents#
- class Ensemble(client=False, **kwargs)[source]#
Ensemble object is a collection of light curve ids
- add_frame(frame, label)[source]#
Adds a new frame for the Ensemble to track.
- Parameters:
frame (tape.ensemble_frame.EnsembleFrame) – The frame object for the Ensemble to track.
label (str) – The label for the Ensemble to use to track the frame.
- Return type:
- Raises:
ValueError – if the label is “source”, “object”, or already tracked by the Ensemble.
- update_frame(frame)[source]#
Updates a frame tracked by the Ensemble or otherwise adds it to the Ensemble. The frame is tracked by its EnsembleFrame.label field.
- Parameters:
frame (tape.ensemble.EnsembleFrame) – The frame for the Ensemble to update. If not already tracked, it is added.
- Return type:
- Raises:
ValueError – if the frame.label is unpopulated, or if the frame is not a SourceFrame or ObjectFrame but uses the reserved labels.
- drop_frame(label)[source]#
Drops a frame tracked by the Ensemble.
- Parameters:
label (str) – The label of the frame to be dropped by the Ensemble.
- Return type:
- Raises:
ValueError – if the label is “source”, or “object”.
KeyError – if the label is not tracked by the Ensemble.
- select_frame(label)[source]#
Selects and returns frame tracked by the Ensemble.
- Parameters:
label (str) – The label of a frame tracked by the Ensemble to be selected.
- Return type:
tape.ensemble.EnsembleFrame
- Raises:
KeyError – if the label is not tracked by the Ensemble.
- frame_info(labels=None, verbose=True, memory_usage=True, **kwargs)[source]#
Wrapper for calling dask.dataframe.DataFrame.info() on frames tracked by the Ensemble.
- Parameters:
labels (list, optional) – A list of labels for Ensemble frames to summarize. If None, info is printed for all tracked frames.
verbose (bool, optional) – Whether to print the whole summary
memory_usage (bool, optional) – Specifies whether total memory usage of the DataFrame elements (including the index) should be displayed.
**kwargs – keyword arguments passed along to dask.dataframe.DataFrame.info()
- Return type:
None
- Raises:
KeyError – if a label in labels is not tracked by the Ensemble.
- insert_sources(obj_ids, bands, timestamps, fluxes, flux_errs=None, force_repartition=False, **kwargs)[source]#
Manually insert sources into the ensemble.
Requires, at a minimum, the object’s ID and the band, timestamp, and flux of the observation.
Note
This function is expensive and is provides mainly for testing purposes. Care should be used when incorporating it into the core of an analysis.
- Parameters:
obj_ids (list) – A list of the sources’ object ID.
bands (list) – A list of the bands of the observation.
timestamps (list) – A list of the times the sources were observed.
fluxes (list) – A list of the fluxes of the observations.
flux_errs (list, optional) – A list of the errors in the flux.
force_repartition (bool optional) – Do an immediate repartition of the dataframes.
- client_info()[source]#
Calls the Dask Client, which returns cluster information
- Parameters:
None
- Returns:
self.client – Dask Client information
- Return type:
distributed.client.Client
- info(verbose=True, memory_usage=True, **kwargs)[source]#
Wrapper for dask.dataframe.DataFrame.info() for the Source and Object tables
- Parameters:
verbose (bool, optional) – Whether to print the whole summary
memory_usage (bool, optional) – Specifies whether total memory usage of the DataFrame elements (including the index) should be displayed.
- Return type:
None
- check_sorted(table='object')[source]#
Checks to see if an Ensemble Dataframe is sorted (increasing) on the index.
- Parameters:
table (str, optional) – The table to check.
- Returns:
indicating whether the index is sorted (True) or not (False)
- Return type:
boolean
- check_lightcurve_cohesion()[source]#
Checks to see if lightcurves are split across multiple partitions.
With partitioned data, and source information represented by rows, it is possible that when loading data or manipulating it in some way (most likely a repartition) that the sources for a given object will be split among multiple partitions. This function will check to see if all lightcurves are “cohesive”, meaning the sources for that object only live in a single partition of the dataset.
- Returns:
indicates whether the sources tied to a given object are only found in a single partition (True), or if they are split across multiple partitions (False)
- Return type:
boolean
- sort_lightcurves(by_band=True)[source]#
Sorts each Source partition first by the indexed ID column and then by the time column, each in ascending order.
This allows for efficient access of lightcurves by their indexed object ID while still giving easy access to the sorted time series.
Note that if the lightcurves are split across multiple partitions, this operation only sorts on a per-partition basis, and the table will not be globally sorted.
You can check that no lightcurves are not split across multiple partitions by seeing if Ensemble.check_lightcurve_cohesion() is True.
- Parameters:
by_band (bool, optional) – If True, the lightcurves are still sorted first by the indexed ID column, but then by band and then by timestamp, all in ascending order.
- Return type:
- compute(table=None, **kwargs)[source]#
Wrapper for dask.dataframe.DataFrame.compute()
The compute operation performs the computations that had been lazily allocated and returns the results as an in-memory pandas data frame.
- Parameters:
table (str, optional) – The table to materialize.
- Returns:
A single pandas data frame for the specified table or a tuple of (object, source) data frames.
- Return type:
pd.Dataframe
- persist(**kwargs)[source]#
Wrapper for dask.dataframe.DataFrame.persist()
The compute operation performs the computations that had been lazily allocated, but does not bring the results into memory or return them. This is useful for preventing a Dask task graph from growing too large by performing part of the computation.
- sample(frac=None, replace=False, random_state=None)[source]#
Selects a random sample of objects (sampling each partition).
This sampling will be lazily applied to the SourceFrame as well. A new Ensemble object is created, and no additional EnsembleFrames will be carried into the new Ensemble object. Most of docstring copied from https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.sample.html.
- Parameters:
frac (float, optional) – Approximate fraction of objects to return. This sampling fraction is applied to all partitions equally. Note that this is an approximate fraction. You should not expect exactly len(df) * frac items to be returned, as the exact number of elements selected will depend on how your data is partitioned (but should be pretty close in practice).
replace (boolean, optional) – Sample with or without replacement. Default = False.
random_state (int or np.random.RandomState) – If an int, we create a new RandomState with this as the seed; Otherwise we draw from the passed RandomState.
- Returns:
ensemble – A new ensemble with the subset of data selected
- Return type:
tape.ensemble.Ensemble
- dropna(table='source', **kwargs)[source]#
Removes rows with a >=`threshold` nan values.
- Parameters:
table (str, optional) – A string indicating which table to filter. Should be one of “object” or “source”.
**kwargs – keyword arguments passed along to dask.dataframe.DataFrame.dropna
- Returns:
ensemble – The ensemble object with nans removed according to the threshold scheme
- Return type:
tape.ensemble.Ensemble
- select(columns, table='object')[source]#
Select a subset of columns. Modifies the ensemble in-place by dropping the unselected columns.
- Parameters:
columns (list) – A list of column labels to keep.
table (str, optional) – A string indicating which table to filter. Should be one of “object” or “source”.
- query(expr, table='object')[source]#
Keep only certain rows of a table based on an expression of what information to keep. Wraps Dask query.
- Parameters:
expr (str) – A string specifying the expression of what to keep.
table (str, optional) – A string indicating which table to filter. Should be one of “object” or “source”.
Examples
Keep sources with flux above 100.0:
ens.query("flux > 100", table="source")
Keep sources in the green band:
ens.query("band_col_name == 'g'", table="source")
Filtering on the flux column without knowing its name:
ens.query(f"{ens._flux_col} > 100", table="source")
- filter_from_series(keep_series, table='object')[source]#
Filter the tables based on a DaskSeries indicating which rows to keep.
- Parameters:
keep_series (dask.dataframe.Series) – A series mapping the table’s row to a Boolean indicating whether or not to keep the row.
table (str, optional) – A string indicating which table to filter. Should be one of “object” or “source”.
- assign(table='object', temporary=False, **kwargs)[source]#
Wrapper for dask.dataframe.DataFrame.assign()
- Parameters:
table (str, optional) – A string indicating which table to filter. Should be one of “object” or “source”.
kwargs (dict of {str: callable or Series}) – Each argument is the name of a new column to add and its value specifies how to fill it. A callable is called for each row and a series is copied in.
temporary ('bool', optional) – Dictates whether the resulting columns are flagged as “temporary” columns within the Ensemble. Temporary columns are dropped when table syncs are performed, as their information is often made invalid by future operations. For example, the number of observations information is made invalid by a filter on the source table. Defaults to False.
- Returns:
self – The ensemble object.
- Return type:
tape.ensemble.Ensemble
Examples
Direct assignment of my_series to a column named “new_column”:
ens.assign(table="object", new_column=my_series)
Subtract the value in “err” from the value in “flux”:
ens.assign(table="source", lower_bnd=lambda x: x["flux"] - 2.0 * x["err"])
- calc_nobs(by_band=False, label='nobs', temporary=True)[source]#
Calculates the number of observations per lightcurve.
- Parameters:
by_band (bool, optional) – If True, also calculates the number of observations for each band in addition to providing the number of observations in total
label (str, optional) – The label used to generate output columns. “_total” and the band labels (e.g. “_g”) are appended.
temporary ('bool', optional) – Dictates whether the resulting columns are flagged as “temporary” columns within the Ensemble. Temporary columns are dropped when table syncs are performed, as their information is often made invalid by future operations. For example, the number of observations information is made invalid by a filter on the source table. Defaults to True.
- Returns:
ensemble – The ensemble object with nobs columns added to the object table.
- Return type:
tape.ensemble.Ensemble
- prune(threshold=50, col_name=None)[source]#
remove objects with less observations than a given threshold
- Parameters:
threshold (int, optional) – The minimum number of observations needed to retain an object. Default is 50.
col_name (str, optional) – The name of the column to assess the threshold if available in the object table. If not specified, the ensemble will calculate the number of observations and filter on the total (sum across bands).
- Returns:
ensemble – The ensemble object with pruned rows removed
- Return type:
tape.ensemble.Ensemble
- find_day_gap_offset()[source]#
Finds an approximation of the MJD offset for noon at the observatory.
This function looks for the longest strecth of hours of the day with zero observations. This gap is treated as the daylight hours and the function returns the middle hour of the gap. This is used for automatically finding offsets for binning.
- Returns:
empty_hours – The estimated middle of the day as a floating point day. Returns -1.0 if no such time is found.
- Return type:
list
Note
Calls a compute on the source table.
- bin_sources(time_window=1.0, offset=0.0, custom_aggr=None, count_col=None, use_map=True, **kwargs)[source]#
Bin sources on within a given time range to improve the estimates.
- Parameters:
time_window (float, optional) – The time range (in days) over which to consider observations in the same bin. The default is 1.0 days.
offset (float, optional) – The offset in days to use for binning. This should correspond to the middle of the daylight hours for the observatory. Default is 0.0. This value can also be computed with find_day_gap_offset.
custom_aggr (dict, optional) – A dictionary mapping column name to aggregation method. This can be used to both include additional columns to aggregate OR overwrite the aggregation method for time, flux, or flux error by matching those column names. Example: {“my_value_1”: “mean”, “my_value_2”: “max”, “psFlux”: “sum”}
count_col (str, optional) – The name of the column in which to count the number of sources per bin. If None then it does not include this column.
use_map (boolean, optional) – Determines whether dask.dataframe.DataFrame.map_partitions is used (True). Using map_partitions is generally more efficient, but requires the data from each lightcurve is housed in a single partition. If False, a groupby will be performed instead.
- Returns:
ensemble – The ensemble object with pruned rows removed
- Return type:
tape.ensemble.Ensemble
Notes
This should only be used for slowly varying sources where we can treat the source as constant within time_window.
As a default the function only aggregates and keeps the id, band, time, flux, and flux error columns. Additional columns can be preserved by providing the mapping of column name to aggregation function with the additional_cols parameter.
- batch(func, *args, meta=None, by_band=False, use_map=True, on=None, label='', **kwargs)[source]#
Run a function from tape.TimeSeries on the available ids
- Parameters:
func (function) – A function to apply to all objects in the ensemble. The function could be a TAPE function, an initialized feature extractor from light-curve package or a user-defined function. In the least case the function must have the following signature: func(*cols, **kwargs), where the names of the cols are specified in args, kwargs are keyword arguments passed to the function, and the return value schema is described by meta. For TAPE and light-curve functions args, meta and on are populated automatically.
*args – Denotes the ensemble columns to use as inputs for a function, order must be correct for function. If passing a TAPE or light-curve function, these are populated automatically.
meta (pd.Series, pd.DataFrame, dict, or tuple-like) – Dask’s meta parameter, which lays down the expected structure of the results. Overridden by TAPE for TAPE and light-curve functions. If none, attempts to coerce the result to a pandas.Series.
by_band (boolean, optional) – If true, the lightcurves are split into separate inputs for each band and passed along to the function individually. If the band column is already specified in on then batch will ensure the band column is the final element in on. For all original columns outputted by func, by_band will generate a set of new columns per band (for example, a function with output column “result” will instead have “result_g” and “result_r” as columns if the data had g and r band data) If False (default), the full lightcurve is passed along to the function (assuming the band column in not already part of on)
use_map (boolean) – Determines whether dask.dataframe.DataFrame.map_partitions is used (True). Using map_partitions is generally more efficient, but requires the data from each lightcurve is housed in a single partition. This can be checked using Ensemble.check_lightcurve_cohesion. If False, a groupby will be performed instead.
on ('str' or 'list', optional) – Designates which column(s) to groupby. Columns may be from the source or object tables. If not specified, then the id column is used by default. For TAPE and light-curve functions this is populated automatically.
label ('str', optional) – If provided the ensemble will use this label to track the result dataframe. If not provided, a label of the from “result_{x}” where x is a monotonically increasing integer is generated. If None, the result frame will not be tracked.
**kwargs – Additional optional parameters passed for the selected function
- Returns:
result – Series of function results
- Return type:
Dask.Series
Examples
Run a TAPE function on the ensemble:
from tape.analysis.stetsonj import calc_stetson_J ens = Ensemble().from_dataset('rrlyr82') ensemble.batch(calc_stetson_J, band_to_calc='i')
Run a light-curve function on the ensemble:
from light_curve import EtaE ens.batch(EtaE(), band_to_calc='g')
Run a custom function on the ensemble:
def s2n_inter_quartile_range(flux, err): first, third = np.quantile(flux / err, [0.25, 0.75]) return third - first ens.batch(s2n_inter_quartile_range, ens._flux_col, ens._err_col)
Or even a numpy built-in function:
amplitudes = ens.batch(np.ptp, ens._flux_col)
- save_ensemble(path='.', dirname='ensemble', additional_frames=True, **kwargs)[source]#
Save the current ensemble frames to disk.
- Parameters:
path ('str' or path-like, optional) – A path to the desired location of the top-level save directory, by default this is the current working directory.
dirname ('str', optional) – The name of the saved ensemble directory, “ensemble” by default.
additional_frames (bool, or list, optional) – Controls whether EnsembleFrames beyond the Object and Source Frames are saved to disk. If True or False, this specifies whether all or none of the additional frames are saved. Alternatively, a list of EnsembleFrame names may be provided to specify which frames should be saved. Object and Source will always be added and do not need to be specified in the list. By default, all frames will be saved.
**kwargs – Additional kwargs passed along to EnsembleFrame.to_parquet()
- Return type:
None
Note
If the object frame has no columns, which is often the case when an Ensemble is constructed using only source files/dictionaries, then an object subdirectory will not be created. Ensemble.from_ensemble will know how to work with the directory whether or not the object subdirectory is present.
Be careful about repeated saves to the same directory name. This will not be a perfect overwrite, as any products produced by a previous save may not be deleted by successive saves if they are removed from the ensemble. For best results, delete the directory between saves or verify that the contents are what you would expect.
- from_ensemble(dirpath, additional_frames=True, column_mapper=None, **kwargs)[source]#
Load an ensemble from an on-disk ensemble.
- Parameters:
dirpath ('str' or path-like, optional) – A path to the top-level ensemble directory to load from.
additional_frames (bool, or list, optional) – Controls whether EnsembleFrames beyond the Object and Source Frames are loaded from disk. If True or False, this specifies whether all or none of the additional frames are loaded. Alternatively, a list of EnsembleFrame names may be provided to specify which frames should be loaded. Object and Source will always be added and do not need to be specified in the list. By default, all frames will be loaded.
column_mapper (Tape.ColumnMapper object, or None, optional) – Supplies a ColumnMapper to the Ensemble, if None (default) searches for a column_mapper.npy file in the directory, which should be created when the ensemble is saved.
- Returns:
ensemble – The ensemble object.
- Return type:
tape.ensemble.Ensemble
- from_pandas(source_frame, object_frame=None, column_mapper=None, sync_tables=True, npartitions=None, partition_size=None, **kwargs)[source]#
Read in Pandas dataframe(s) into an ensemble object
- Parameters:
source_frame ('pandas.Dataframe') – A Dask dataframe that contains source information to be read into the ensemble
object_frame ('pandas.Dataframe', optional) – If not specified, the object frame is generated from the source frame
column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.
sync_tables ('bool', optional) – In the case where an `object_frame`is provided, determines whether an initial sync is performed between the object and source tables. If not performed, dynamic information like the number of observations may be out of date until a sync is performed internally.
npartitions (int, optional) – If specified, attempts to repartition the ensemble to the specified number of partitions
partition_size (int, optional) – If specified, attempts to repartition the ensemble to partitions of size partition_size.
- Returns:
ensemble – The ensemble object with the Dask dataframe data loaded.
- Return type:
tape.ensemble.Ensemble
- from_dask_dataframe(source_frame, object_frame=None, column_mapper=None, sync_tables=True, npartitions=None, partition_size=None, sorted=False, sort=False, **kwargs)[source]#
Read in Dask dataframe(s) into an ensemble object
- Parameters:
source_frame ('dask.Dataframe') – A Dask dataframe that contains source information to be read into the ensemble
object_frame ('dask.Dataframe', optional) – If not specified, the object frame is generated from the source frame
column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.
sync_tables ('bool', optional) – In the case where an `object_frame`is provided, determines whether an initial sync is performed between the object and source tables.
npartitions (int, optional) – If specified, attempts to repartition the ensemble to the specified number of partitions
partition_size (int, optional) – If specified, attempts to repartition the ensemble to partitions of size partition_size.
sorted (bool, optional) – If the index column is already sorted in increasing order. Defaults to False
sort (bool, optional) – If True, sorts the DataFrame by the id column. Otherwise set the index on the individual existing partitions. Defaults to False.
- Returns:
ensemble – The ensemble object with the Dask dataframe data loaded.
- Return type:
tape.ensemble.Ensemble
- from_lsdb(source_catalog, object_catalog=None, column_mapper=None, sync_tables=False, sorted=True, sort=False)[source]#
Read in from LSDB catalog objects.
- Parameters:
source_catalog ('dask.Dataframe') – An LSDB catalog that contains source information to be read into the ensemble.
object_catalog ('dask.Dataframe', optional) – An LSDB catalog containing object information. If not specified, a minimal ObjectFrame is generated from the source catalog.
column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.
sync_tables ('bool', optional) – In the case where an `object_catalog`is provided, determines whether an initial sync is performed between the object and source tables. Defaults to False.
sorted (bool, optional) – If the index column is already sorted in increasing order. Defaults to True.
sort (bool, optional) – If True, sorts the DataFrame by the id column. Otherwise set the index on the individual existing partitions. Defaults to False.
- Returns:
ensemble – The ensemble object with the LSDB catalog data loaded.
- Return type:
tape.ensemble.Ensemble
- from_hipscat(source_path, object_path=None, column_mapper=None, source_index=None, object_index=None, sorted=True, sort=False)[source]#
Use LSDB to read from a hipscat directory.
This function utilizes LSDB for reading a hipscat directory into TAPE. In cases where a user would like to do operations on the LSDB catalog objects, it’s best to use LSDB itself first, and then load the result into TAPE using tape.Ensemble.from_lsdb. A join is performed between the two tables to modify the source table to use the object index, using object_index and source_index.
- Parameters:
source_path (str or Path) – A hipscat directory that contains source information to be read into the ensemble.
object_path (str or Path, optional) – A hipscat directory containing object information. If not specified, a minimal ObjectFrame is generated from the sources.
column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.
object_index ('str', optional) – The join index of the object table, should be the label for the object ID contained in the object table.
source_index ('str', optional) – The join index of the source table, should be the label for the object ID contained in the source table.
sorted (bool, optional) – If the index column is already sorted in increasing order. Defaults to True.
sort (bool, optional) – If True, sorts the DataFrame by the id column. Otherwise set the index on the individual existing partitions. Defaults to False.
- Returns:
ensemble – The ensemble object with the hipscat data loaded.
- Return type:
tape.ensemble.Ensemble
- make_column_map()[source]#
Returns the current column mapping.
- Returns:
result – A new column mapper representing the Ensemble’s current mappings.
- Return type:
tape.utils.ColumnMapper
- update_column_mapping(column_mapper=None, **kwargs)[source]#
Update the mapping of column names.
- Parameters:
column_mapper (tape.utils.ColumnMapper, optional) – An entirely new mapping of column names. If None then modifies the current mapping using kwargs.
kwargs – Individual column to name settings.
- Returns:
self
- Return type:
Ensemble
- _load_column_mapper(column_mapper, **kwargs)[source]#
Load a column mapper object.
- Parameters:
column_mapper (tape.utils.ColumnMapper or None) – The ColumnMapper to use. If None then the function creates a new one from kwargs.
kwargs (optional) – Individual column to name settings.
- Returns:
self
- Return type:
Ensemble
- Raises:
ValueError if a required column is missing. –
- from_parquet(source_file, object_file=None, column_mapper=None, sync_tables=True, additional_cols=True, npartitions=None, partition_size=None, sorted=False, sort=False, **kwargs)[source]#
Read in parquet file(s) into an ensemble object
- Parameters:
source_file ('str') – Path to a parquet file, or multiple parquet files that contain source information to be read into the ensemble
object_file ('str', optional) – Path to a parquet file, or multiple parquet files that contain object information. If not specified, it is generated from the source table
column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.
sync_tables ('bool', optional) – In the case where object files are loaded in, determines whether an initial sync is performed between the object and source tables. If not performed, dynamic information like the number of observations may be out of date until a sync is performed internally.
additional_cols ('bool', optional) – Boolean to indicate whether to carry in columns beyond the critical columns, true will, while false will only load the columns containing the critical quantities (id,time,flux,err,band)
npartitions (int, optional) – If specified, attempts to repartition the ensemble to the specified number of partitions
partition_size (int, optional) – If specified, attempts to repartition the ensemble to partitions of size partition_size.
sorted (bool, optional) – If the index column is already sorted in increasing order. Defaults to False
sort (bool, optional) – If True, sorts the DataFrame by the id column. Otherwise set the index on the individual existing partitions. Defaults to False.
- Returns:
ensemble – The ensemble object with parquet data loaded
- Return type:
tape.ensemble.Ensemble
- from_dataset(dataset, **kwargs)[source]#
Load the ensemble from a TAPE dataset.
- Parameters:
dataset ('str') – The name of the dataset to import
- Returns:
ensemble – The ensemble object with the dataset loaded
- Return type:
tape.ensemble.Ensemble
- available_datasets()[source]#
Retrieve descriptions of available TAPE datasets.
- Returns:
A dictionary of datasets with description information.
- Return type:
dict
- from_source_dict(source_dict, column_mapper=None, npartitions=1, sorted=False, sort=False, **kwargs)[source]#
Load the sources into an ensemble from a dictionary.
- Parameters:
source_dict ('dict') – The dictionary containing the source information.
column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.
npartitions (int, optional) – If specified, attempts to repartition the ensemble to the specified number of partitions
sorted (bool, optional) – If the index column is already sorted in increasing order. Defaults to False
sort (bool, optional) – If True, sorts the DataFrame by the id column. Otherwise set the index on the individual existing partitions. Defaults to False.
- Returns:
ensemble – The ensemble object with dictionary data loaded
- Return type:
tape.ensemble.Ensemble
- convert_flux_to_mag(zero_point, zp_form='mag', out_col_name=None, flux_col=None, err_col=None)[source]#
Converts a flux column into a magnitude column.
- Parameters:
zero_point ('str' or 'float') – The name of the ensemble column containing the zero point information for column transformation. Alternatively, a single float number to apply for all fluxes.
zp_form (str, optional) – The form of the zero point column, either “flux” or “magnitude”/”mag”. Determines how the zero point (zp) is applied in the conversion. If “flux”, then the function is applied as mag=-2.5*log10(flux/zp), or if “magnitude”, then mag=-2.5*log10(flux)+zp.
out_col_name ('str', optional) – The name of the output magnitude column, if None then the output is just the flux column name + “_mag”. The error column is also generated as the out_col_name + “_err”.
flux_col ('str', optional) – The name of the ensemble flux column to convert into magnitudes. Uses the Ensemble mapped flux column if not specified.
err_col ('str', optional) – The name of the ensemble column containing the errors to propagate. Errors are propagated using the following approximation: Err= (2.5/log(10))*(flux_error/flux), which holds mainly when the error in flux is much smaller than the flux. Uses the Ensemble mapped error column if not specified.
- Returns:
ensemble – The ensemble object with a new magnitude (and error) column.
- Return type:
tape.ensemble.Ensemble
- _lazy_sync_tables_from_frame(frame)[source]#
Call the sync operation for the frame only if the table being modified (frame) needs to be synced. Does nothing in the case that only the table to be modified is dirty or if it is not the object or source frame for this Ensemble.
- Parameters:
frame (tape.ensemble_frame.EnsembleFrame) – The frame being modified. Only an ObjectFrame or SourceFrame tracked by this `Ensemble may trigger a sync.
- _lazy_sync_tables(table='object')[source]#
Call the sync operation for the table only if the the table being modified (table) needs to be synced. Does nothing in the case that only the table to be modified is dirty.
- Parameters:
table (str, optional) – The table being modified. Should be one of “object”, “source”, or “all”
- _sync_tables()[source]#
Sync operation to align both tables.
Filtered objects are always removed from the source. But filtered sources may be kept in the object table is the Ensemble’s keep_empty_objects attribute is set to True.
- select_random_timeseries(seed=None)[source]#
Selects a random lightcurve from a random partition of the Ensemble.
- Parameters:
seed (int, or None) – Sets a seed to return the same object id on successive runs. None by default, in which case a seed is not set for the operation.
- Returns:
ts – Timeseries for a single object
- Return type:
TimeSeries
Note
This is not uniformly sampled. As a random partition is chosen first to avoid a search in full index space, and partitions may vary in the number of objects they contain. In other words, objects in smaller partitions will have a higher probability of being chosen than objects in larger partitions.
- to_timeseries(target, id_col=None, time_col=None, flux_col=None, err_col=None, band_col=None)[source]#
Construct a timeseries object from one target object_id, assumes that the result is a collection of lightcurves (output from query_ids)
- Parameters:
target (int) – Id of a source to be extracted
id_col ('str', optional) – Identifies which column contains the Object IDs
time_col ('str', optional) – Identifies which column contains the time information
flux_col ('str', optional) – Identifies which column contains the flux/magnitude information
err_col ('str', optional) – Identifies which column contains the error information
band_col ('str', optional) – Identifies which column contains the band information
- Returns:
ts – Timeseries for a single object
- Return type:
TimeSeries
Note
All _col parameters when not specified will use the appropriate columns determined on data ingest as critical columns.
- _build_index(obj_id, band)[source]#
Build pandas multiindex from object_ids and bands
- Parameters:
obj_id (np.array or list) – A list of object id for each row in the data.
band (np.array or list) – A list of the band for each row in the data.
- Returns:
index
- Return type:
pd.MultiIndex
- sf2(sf_method='basic', argument_container=None, use_map=True)[source]#
Wrapper interface for calling structurefunction2 on the ensemble
- Parameters:
sf_method ('str') – The structure function calculation method to be used, by default “basic”.
argument_container (StructureFunctionArgumentContainer, optional) – Container object for additional configuration options, by default None.
use_map (boolean) – Determines whether dask.dataframe.DataFrame.map_partitions is used (True). Using map_partitions is generally more efficient, but requires the data from each lightcurve is housed in a single partition. If False, a groupby will be performed instead.
- Returns:
result – Structure function squared for each of input bands.
- Return type:
pandas.DataFrame
Note
In case that no value for band_to_calc is passed, the function is executed on all available bands in band.
- _translate_meta(meta)[source]#
Translates Dask-style meta into a TapeFrame or TapeSeries object.
- Parameters:
meta (dict, tuple, list, pd.Series, pd.DataFrame, pd.Index, dtype, scalar)
- Returns:
result – The appropriate meta for Dask producing an tape.ensemble_frame.EnsembleFrame or Ensemble.EnsembleSeries respectively
- Return type:
ensemble.TapeFrame or ensemble.TapeSeries