tape

tape#

Subpackages#

Submodules#

Package Contents#

Classes#

AnalysisFunction

Base class for analysis functions.

FeatureExtractor

Apply light-curve package feature extractor to a light curve

LightCurve

This base class is meant to support various analysis routines and be extended as needed.

StetsonJ

Compute the StetsonJ statistic on data from one or several bands

StructureFunction2

Calculate structure function squared

EnsembleFrame

An extension for a Dask Dataframe for data used by a lightcurve Ensemble.

EnsembleSeries

A barebones extension of a Dask Series for Ensemble data.

ObjectFrame

A subclass of EnsembleFrame for Object data.

SourceFrame

A subclass of EnsembleFrame for Source data.

TapeFrame

A barebones extension of a Pandas frame to be used for underlying Ensemble data.

TapeObjectFrame

A barebones extension of a Pandas frame to be used for underlying Ensemble object data.

TapeSourceFrame

A barebones extension of a Pandas frame to be used for underlying Ensemble source data

TapeSeries

A barebones extension of a Pandas series to be used for underlying Ensemble data.

TimeSeries

Represent and analyze Rubin TimeSeries data

ColumnMapper

Maps columns from a given dataset into known ensemble columns

Ensemble

Ensemble object is a collection of light curve ids

Attributes#

QUERY_PLANNING_ON

calc_stetson_J

calc_sf2

SF_METHODS

SOURCE_FRAME_LABEL

OBJECT_FRAME_LABEL

DEFAULT_FRAME_LABEL

METADATA_FILENAME

QUERY_PLANNING_ON[source]#
class AnalysisFunction[source]#

Bases: abc.ABC, Callable

Base class for analysis functions.

Analysis functions are functions that take a few arrays representing an object and return a single pandas.Series representing the result.

cols(ens) List[str][source]#

Return the columns that the analysis function takes as input.

meta(ens) pd.DataFrame[source]#

Return the metadata pandas.DataFrame required by Dask to pre-build a computation graph. It is basically the schema for calculate() method output.

on(ens) List[str][source]#

Return the columns to group source table by. Typically, [ens._id_col].

__call__(*cols, **kwargs)[source]#

Calculate the analysis function.

abstract cols(ens: Ensemble) List[str][source]#

Return the column names that the analysis function takes as input.

Parameters:

ens (Ensemble) – The ensemble object, it could be required to get column names of the “special” columns like ens._time_col or ens._err_col.

Returns:

The column names to select and pass to .calculate() method. For example [ens._time_col, ens._flux_col].

Return type:

List[str]

abstract meta(ens: Ensemble)[source]#

Return the schema of the analysis function output.

Parameters:

ens (Ensemble) – The ensemble object.

Returns:

Dask meta, for example pd.DataFrame(columns=['x', 'y'], dtype=float).

Return type:

pd.DataFrame or (str, dtype) tuple or {str: dtype} dictionary

abstract on(ens: Ensemble) List[str][source]#

Return the columns to group source table by.

Parameters:

ens (Ensemble) – The ensemble object.

Returns:

The column names to group by. Typically, [ens._id_col].

Return type:

List[str]

abstract __call__(*cols, **kwargs)[source]#

Calculate the analysis function.

Parameters:
  • *cols (array_like) – The columns to calculate the analysis function on. It must be consistent with .cols(ens) output.

  • **kwargs – Additional keyword arguments.

Returns:

The result, it must be consistent with .meta() output.

Return type:

pd.Series or pd.DataFrame or array or value
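
The four methods above can be illustrated with a minimal sketch. It does not import tape: AnalysisFunctionSketch is a local stand-in that only mirrors the documented interface, MeanFlux is a hypothetical analysis function, and fake_ens is a placeholder object that carries just the _id_col and _flux_col attributes mentioned in the parameter descriptions.

```python
from abc import ABC, abstractmethod
from types import SimpleNamespace
from typing import List


class AnalysisFunctionSketch(ABC):
    """Local stand-in mirroring the documented interface (not tape's class)."""

    @abstractmethod
    def cols(self, ens) -> List[str]:
        """Columns to select and pass to __call__."""

    @abstractmethod
    def meta(self, ens):
        """Schema of the output, in a form Dask accepts as meta."""

    @abstractmethod
    def on(self, ens) -> List[str]:
        """Columns to group the source table by."""

    @abstractmethod
    def __call__(self, *cols, **kwargs):
        """Compute the result for one group."""


class MeanFlux(AnalysisFunctionSketch):
    """Hypothetical analysis function: mean flux of one light curve."""

    def cols(self, ens):
        return [ens._flux_col]

    def meta(self, ens):
        # (name, dtype) tuple, one of the documented meta forms
        return ("mean_flux", float)

    def on(self, ens):
        return [ens._id_col]

    def __call__(self, flux, **kwargs):
        return sum(flux) / len(flux)


# Placeholder for an Ensemble: only the attributes used above are provided.
fake_ens = SimpleNamespace(_id_col="object_id", _flux_col="flux")

func = MeanFlux()
selected = func.cols(fake_ens)   # columns passed to __call__
group_by = func.on(fake_ens)     # grouping columns
result = func([1.0, 2.0, 3.0])   # value for one light curve
```

The argument order of __call__ has to match the order of the names returned by cols(ens), since the selected columns are passed positionally.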

class FeatureExtractor(feature: light_curve.light_curve_ext._FeatureEvaluator)[source]#

Bases: tape.analysis.base.AnalysisFunction

Apply light-curve package feature extractor to a light curve

Parameters:

feature (light_curve.light_curve_ext._FeatureEvaluator) – Feature extractor to apply, see “light-curve” package for more details.

feature#

Feature extractor to apply, see “light-curve” package for more details.

Type:

light_curve.light_curve_ext._FeatureEvaluator

cols(ens: Ensemble) List[str][source]#

Return the column names that the analysis function takes as input.

Parameters:

ens (Ensemble) – The ensemble object, it could be required to get column names of the “special” columns like ens._time_col or ens._err_col.

Returns:

The column names to select and pass to .calculate() method. For example [ens._time_col, ens._flux_col].

Return type:

List[str]

meta(ens: Ensemble) pandas.DataFrame[source]#

Return the schema of the analysis function output.

It always returns a pandas.DataFrame with the same columns as self.feature.names and dtype np.float64. However, if input columns are all single precision floats then the output dtype will be np.float32.
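
The dtype rule described above can be sketched with NumPy alone. The helper name output_dtype is hypothetical and is not part of tape; it only restates the rule that the output is np.float32 when every input column is single precision, and np.float64 otherwise.

```python
import numpy as np


def output_dtype(*arrays):
    """Return float32 only if every input is float32, else float64.

    Illustrative helper (hypothetical name), not part of tape.
    """
    if all(np.asarray(a).dtype == np.float32 for a in arrays):
        return np.dtype(np.float32)
    return np.dtype(np.float64)


time32 = np.array([0.0, 1.0], dtype=np.float32)
flux32 = np.array([10.0, 11.0], dtype=np.float32)
flux64 = flux32.astype(np.float64)

all_single = output_dtype(time32, flux32)  # every input is float32
mixed = output_dtype(time32, flux64)       # one input is float64
```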

on(ens: Ensemble) List[str][source]#

Return the columns to group source table by.

Parameters:

ens (Ensemble) – The ensemble object.

Returns:

The column names to group by. Typically, [ens._id_col].

Return type:

List[str]

__call__(time, flux, err, band, *, band_to_calc: str, **kwargs) pandas.DataFrame[source]#

Apply a feature extractor to a light curve, concatenating the results over all bands.

Parameters:
  • time (numpy.ndarray) – Time values

  • flux (numpy.ndarray) – Brightness values, flux or magnitudes

  • err (numpy.ndarray) – Errors for “flux”

  • band (numpy.ndarray) – Passband names.

  • band_to_calc (str or int or None) – Name of the passband to calculate features for, usually a string like “g” or “r”, or an integer. If None, then features are calculated for all sources - band is ignored.

  • **kwargs (dict) – Additional keyword arguments to pass to the feature extractor.

Returns:

features – Feature values for each band, dtype is a common type for input arrays.

Return type:

pandas.DataFrame

class LightCurve(times: numpy.ndarray, fluxes: numpy.ndarray, errors: numpy.ndarray, minimum_observations: int = 0)[source]#

This base class is meant to support various analysis routines and be extended as needed. (Hence its location in the analysis package.)

The base class ensures that the data for a single light curve is well formed: the input arrays are all the same length, NaNs are removed, and there are enough observations to perform a given analysis.

_process_input_data()[source]#

Cleaning and validation occurs here, ideally by calling sub-methods for specific checks and cleaning tasks.

_filter_nans()[source]#

Mask out any NaN values from time, flux and error arrays

_check_input_data_size_is_equal()[source]#

Make sure that the three input np.arrays have the same size

_check_input_data_length_is_sufficient()[source]#

Make sure that we have enough data after cleaning and filtering to be able to perform Structure Function calculations.
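
The checks listed above can be approximated in a few lines of NumPy. This is a sketch under assumptions, not the class's actual code: the function name clean_light_curve is hypothetical, and the order of the steps (size check, then NaN filtering, then length check) is chosen here so that a single mask can be applied to all three arrays.

```python
import numpy as np


def clean_light_curve(times, fluxes, errors, minimum_observations=0):
    """Validate and clean one light curve (illustrative, not tape's code)."""
    times = np.asarray(times, dtype=float)
    fluxes = np.asarray(fluxes, dtype=float)
    errors = np.asarray(errors, dtype=float)

    # The three input arrays must have the same size.
    if not (times.size == fluxes.size == errors.size):
        raise ValueError("times, fluxes and errors must have the same size")

    # Drop every observation that has a NaN in any of the three arrays.
    keep = ~(np.isnan(times) | np.isnan(fluxes) | np.isnan(errors))
    times, fluxes, errors = times[keep], fluxes[keep], errors[keep]

    # Enough observations must remain after cleaning.
    if times.size < minimum_observations:
        raise ValueError(
            f"need at least {minimum_observations} observations, got {times.size}"
        )
    return times, fluxes, errors


t, f, e = clean_light_curve(
    [1.0, 2.0, 3.0], [10.0, np.nan, 12.0], [0.1, 0.1, 0.1], minimum_observations=2
)
```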

calc_stetson_J[source]#
class StetsonJ[source]#

Bases: tape.analysis.base.AnalysisFunction

Compute the StetsonJ statistic on data from one or several bands

cols(ens: Ensemble) List[str][source]#

Return the column names that the analysis function takes as input.

Parameters:

ens (Ensemble) – The ensemble object, it could be required to get column names of the “special” columns like ens._time_col or ens._err_col.

Returns:

The column names to select and pass to .calculate() method. For example [ens._time_col, ens._flux_col].

Return type:

List[str]

meta(ens: Ensemble)[source]#

Return the schema of the analysis function output.

Parameters:

ens (Ensemble) – The ensemble object.

Returns:

Dask meta, for example pd.DataFrame(columns=['x', 'y'], dtype=float).

Return type:

pd.DataFrame or (str, dtype) tuple or {str: dtype} dictionary

on(ens: Ensemble) List[str][source]#

Return the columns to group source table by.

Parameters:

ens (Ensemble) – The ensemble object.

Returns:

The column names to group by. Typically, [ens._id_col].

Return type:

List[str]

__call__(flux: numpy.ndarray, err: numpy.ndarray, band: numpy.ndarray, *, band_to_calc: str | Iterable[str] | None = None, check_nans: bool = False)[source]#

Compute the StetsonJ statistic on data from one or several bands

Parameters:
  • flux (numpy.ndarray (N,)) – Array of flux/magnitude measurements

  • err (numpy.ndarray (N,)) – Array of associated flux/magnitude errors

  • band (numpy.ndarray (N,)) – Array of associated band labels

  • band_to_calc (str or list of str) – Bands to calculate StetsonJ on. Single band descriptor, or list of such descriptors.

  • check_nans (bool) – Boolean to run a check for NaN values and filter them out.

Returns:

stetsonJ – StetsonJ statistic for each of input bands.

Return type:

dict

Note

If no value for band_to_calc is passed, the function is executed on all available bands in band.
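
The per-band behaviour can be sketched as follows. The statistic used here is the single-observation form of the Stetson J index with unit weights and a plain inverse-variance weighted mean; that choice is an assumption made for illustration, and tape's own implementation may estimate the mean differently. Both function names are hypothetical.

```python
import numpy as np


def stetson_j_single_band(flux, err):
    """Single-observation Stetson J with unit weights (illustrative)."""
    flux = np.asarray(flux, dtype=float)
    err = np.asarray(err, dtype=float)
    n = flux.size  # needs at least two observations
    weights = 1.0 / err**2
    mean = np.sum(weights * flux) / np.sum(weights)
    # Normalised residuals and the per-observation products P_k.
    delta = np.sqrt(n / (n - 1)) * (flux - mean) / err
    p_k = delta**2 - 1.0
    return float(np.mean(np.sign(p_k) * np.sqrt(np.abs(p_k))))


def stetson_j_by_band(flux, err, band, band_to_calc=None, check_nans=False):
    """Return {band: J}; all bands are used when band_to_calc is None."""
    flux = np.asarray(flux, dtype=float)
    err = np.asarray(err, dtype=float)
    band = np.asarray(band)
    if check_nans:
        keep = ~(np.isnan(flux) | np.isnan(err))
        flux, err, band = flux[keep], err[keep], band[keep]
    if band_to_calc is None:
        band_to_calc = np.unique(band)
    elif isinstance(band_to_calc, str):
        band_to_calc = [band_to_calc]
    return {
        str(b): stetson_j_single_band(flux[band == b], err[band == b])
        for b in band_to_calc
    }


result = stetson_j_by_band(
    flux=[1.0, 1.0, 1.0, 2.0, 4.0],
    err=[1.0, 1.0, 1.0, 1.0, 1.0],
    band=["g", "g", "g", "r", "r"],
)
```

In this form a perfectly constant light curve (band "g" in the example) gives J = -1, because every P_k equals -1.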

class StructureFunction2[source]#

Bases: tape.analysis.base.AnalysisFunction

Calculate structure function squared

cols(ens: Ensemble) List[str][source]#

Return the column names that the analysis function takes as input.

Parameters:

ens (Ensemble) – The ensemble object, it could be required to get column names of the “special” columns like ens._time_col or ens._err_col.

Returns:

The column names to select and pass to .calculate() method. For example [ens._time_col, ens._flux_col].

Return type:

List[str]

meta(ens: Ensemble) Dict[str, type][source]#

Return the schema of the analysis function output.

Parameters:

ens (Ensemble) – The ensemble object.

Returns:

Dask meta, for example pd.DataFrame(columns=['x', 'y'], dtype=float).

Return type:

pd.DataFrame or (str, dtype) tuple or {str: dtype} dictionary

on(ens: Ensemble) List[str][source]#

Return the columns to group source table by.

Parameters:

ens (Ensemble) – The ensemble object.

Returns:

The column names to group by. Typically, [ens._id_col].

Return type:

List[str]

__call__(time, flux, err=None, band=None, lc_id=None, *, sf_method='basic', argument_container=None) pandas.DataFrame[source]#

Calculate structure function squared using one of a variety of structure function calculation methods defined by the input argument sf_method, or in the argument container object.

Parameters:
  • time (numpy.ndarray (N,) or None) – Array of times when measurements were taken. If all array values are None or if a scalar None is provided, then equidistant time between measurements is assumed.

  • flux (numpy.ndarray (N,)) – Array of flux/magnitude measurements.

  • err (numpy.ndarray (N,), float, or None, optional) – Array of associated flux/magnitude errors. If a scalar value is provided we assume that error for all measurements. If None is provided, we assume all errors are 0. By default None

  • band (numpy.ndarray (N,), optional) – Array of associated band labels, by default None

  • lc_id (numpy.ndarray (N,), optional) – Array of lightcurve ids per data point. By default None

  • sf_method (str, optional) – The structure function calculation method to be used, by default “basic”.

  • argument_container (StructureFunctionArgumentContainer, optional) – Container object for additional configuration options, by default None.

Returns:

sf2 – Structure function squared for each of input bands.

Return type:

pandas.DataFrame

Notes

If no value for band_to_calc is passed, the function is executed on all available bands in band.
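
For orientation, a generic structure function squared can be sketched as the mean squared flux difference over all pairs of observations, grouped by time lag. This is a textbook-style illustration and not necessarily what sf_method='basic' computes: errors are ignored (matching the documented assumption for err=None), bins are equal-width in lag, and the names sf2_sketch and n_bins are hypothetical. It returns arrays rather than a pandas.DataFrame.

```python
import numpy as np


def sf2_sketch(time, flux, n_bins=2):
    """Mean squared flux difference per time-lag bin (illustrative)."""
    flux = np.asarray(flux, dtype=float)
    # A scalar None for time means equidistant measurements.
    if time is None:
        time = np.arange(flux.size, dtype=float)
    else:
        time = np.asarray(time, dtype=float)

    # All unique pairs (i < j) of observations.
    i, j = np.triu_indices(flux.size, k=1)
    lags = np.abs(time[j] - time[i])
    sq_diff = (flux[j] - flux[i]) ** 2

    # Equal-width lag bins; clip so the largest lag lands in the last bin.
    edges = np.linspace(lags.min(), lags.max(), n_bins + 1)
    which = np.clip(np.digitize(lags, edges), 1, n_bins) - 1
    centers = 0.5 * (edges[:-1] + edges[1:])
    sf2 = np.array(
        [
            sq_diff[which == b].mean() if np.any(which == b) else np.nan
            for b in range(n_bins)
        ]
    )
    return centers, sf2


centers, sf2 = sf2_sketch(None, [0.0, 1.0, 2.0], n_bins=2)
```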

calc_sf2[source]#
SF_METHODS[source]#
class EnsembleFrame(expr, label=None, ensemble=None)[source]#

Bases: _Frame, dask.dataframe.DataFrame

An extension for a Dask Dataframe for data used by a lightcurve Ensemble.

The underlying non-parallel dataframes are TapeFrames and TapeSeries which extend Pandas frames.

Examples

Instantiation:

import tape

ens = tape.Ensemble()
data = {...}  # Some data you want tracked by the Ensemble
# npartitions is a required argument in the from_dict signature
ensemble_frame = tape.EnsembleFrame.from_dict(data, npartitions=1, label="my_frame", ensemble=ens)
_partition_type#
__getitem__(key)[source]#
classmethod from_tapeframe(data, npartitions=None, chunksize=None, sort=True, label=None, ensemble=None)[source]#

Returns an EnsembleFrame constructed from a TapeFrame.

Parameters:
  • data (TapeFrame) – Frame containing the underlying data for the EnsembleFrame

  • npartitions (int, optional) – The number of partitions of the index to create. Note that depending on the size and index of the dataframe, the output may have fewer partitions than requested.

  • chunksize (int, optional) – Size of the individual chunks of data in non-parallel objects that make up Dask frames.

  • sort (bool, optional) – Whether to sort the frame by a default index.

  • label (str, optional) – The label used by the Ensemble to identify the frame.

  • ensemble (tape.Ensemble, optional) – A link to the Ensemble object that owns this frame.

Returns:

result – The constructed EnsembleFrame object.

Return type:

tape.EnsembleFrame

classmethod from_dask_dataframe(df, ensemble=None, label=None)[source]#

Returns an EnsembleFrame constructed from a Dask dataframe.

Parameters:
  • df (dask.dataframe.DataFrame or list) – a Dask dataframe to convert to an EnsembleFrame

  • ensemble (tape.ensemble.Ensemble, optional) – A link to the Ensemble object that owns this frame.

  • label (str, optional) – The label used by the Ensemble to identify the frame.

Returns:

result – The constructed EnsembleFrame object.

Return type:

tape.EnsembleFrame

update_ensemble()[source]#

Updates the Ensemble linked by the EnsembleFrame.ensemble property to track this frame.

Returns:

result – The Ensemble object which tracks this frame, None if no such Ensemble.

Return type:

tape.Ensemble

classmethod from_dict(data, npartitions, orient='columns', dtype=None, columns=None, label=None, ensemble=None)[source]#

Construct a Tape EnsembleFrame from a Python Dictionary

Parameters:
  • data (dict) – Of the form {field : array-like} or {field : dict}.

  • npartitions (int) – The number of partitions of the index to create. Note that depending on the size and index of the dataframe, the output may have fewer partitions than requested.

  • orient ({'columns', 'index', 'tight'}, default 'columns') – The “orientation” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’. If ‘tight’, assume a dict with keys [‘index’, ‘columns’, ‘data’, ‘index_names’, ‘column_names’].

  • dtype (dtype, optional) – Data type to force, otherwise infer.

  • columns (string, optional) – Column labels to use when orient='index'. Raises a ValueError if used with orient='columns' or orient='tight'.

  • label (str, optional) – The label used by the Ensemble to identify the frame.

  • ensemble (tape.ensemble.Ensemble, optional) – A link to the Ensemble object that owns this frame.

Returns:

result – The constructed EnsembleFrame object.

Return type:

tape.EnsembleFrame
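
The orient and columns descriptions above match pandas.DataFrame.from_dict, so the difference between the orientations can be shown with plain pandas. This sketch assumes that EnsembleFrame.from_dict interprets these two arguments the same way pandas does; it does not call tape.

```python
import pandas as pd

# orient="columns": dictionary keys become column names.
by_columns = pd.DataFrame.from_dict(
    {"time": [1.0, 2.0], "flux": [10.0, 11.0]}, orient="columns"
)

# orient="index": dictionary keys become row labels, and the
# column names are supplied separately through `columns`.
by_index = pd.DataFrame.from_dict(
    {"obs_a": [1.0, 10.0], "obs_b": [2.0, 11.0]},
    orient="index",
    columns=["time", "flux"],
)
```

Both frames hold the same values; only the way the dictionary is laid out differs.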

classmethod from_parquet(path, index=None, columns=None, label=None, ensemble=None, **kwargs)[source]#

Returns an EnsembleFrame constructed from loading a parquet file.

Parameters:
  • path (str or list) – Source directory for data, or path(s) to individual parquet files. Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.

  • index (str, list, False, optional) – Field name(s) to use as the output frame index. Default is None and index will be inferred from the pandas parquet file metadata, if present. Use False to read all fields as columns.

  • columns (str or list, optional) – Field name(s) to read in as columns in the output. By default all non-index fields will be read (as determined by the pandas parquet metadata, if present). Provide a single field name instead of a list to read in the data as a Series.

  • label (str, optional) – The label used by the Ensemble to identify the frame.

  • ensemble (tape.ensemble.Ensemble, optional) – A link to the Ensemble object that owns this frame.

Returns:

result – The constructed EnsembleFrame object.

Return type:

tape.EnsembleFrame

convert_flux_to_mag(flux_col, zero_point, err_col=None, zp_form='mag', out_col_name=None)[source]#

Converts this EnsembleFrame’s flux column into a magnitude column, returning a new EnsembleFrame.

Parameters:
  • flux_col ('str') – The name of the EnsembleFrame flux column to convert into magnitudes.

  • zero_point ('str') – The name of the EnsembleFrame column containing the zero point information for column transformation.

  • err_col ('str', optional) – The name of the EnsembleFrame column containing the errors to propagate. Errors are propagated using the following approximation: Err= (2.5/log(10))*(flux_error/flux), which holds mainly when the error in flux is much smaller than the flux.

  • zp_form (str, optional) – The form of the zero point column, either “flux” or “magnitude”/”mag”. Determines how the zero point (zp) is applied in the conversion. If “flux”, then the function is applied as mag=-2.5*log10(flux/zp), or if “magnitude”, then mag=-2.5*log10(flux)+zp.

  • out_col_name ('str', optional) – The name of the output magnitude column, if None then the output is just the flux column name + “_mag”. The error column is also generated as the out_col_name + “_err”.

Returns:

result – A new EnsembleFrame object with a new magnitude (and error) column.

Return type:

tape.EnsembleFrame
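
The two conversion formulas and the error approximation quoted above can be checked with a short NumPy sketch. The function flux_to_mag is a hypothetical stand-alone helper, not the method itself; it works on arrays rather than on EnsembleFrame columns, and log(10) in the error formula is taken to be the natural logarithm.

```python
import numpy as np


def flux_to_mag(flux, zero_point, flux_err=None, zp_form="mag"):
    """Convert flux to magnitude using the documented formulas (illustrative)."""
    flux = np.asarray(flux, dtype=float)
    zero_point = np.asarray(zero_point, dtype=float)
    if zp_form == "flux":
        mag = -2.5 * np.log10(flux / zero_point)
    elif zp_form in ("mag", "magnitude"):
        mag = -2.5 * np.log10(flux) + zero_point
    else:
        raise ValueError(f"unknown zp_form: {zp_form!r}")
    if flux_err is None:
        return mag, None
    # First-order error propagation, valid when flux_err << flux.
    mag_err = (2.5 / np.log(10)) * (np.asarray(flux_err, dtype=float) / flux)
    return mag, mag_err


flux = [1.0, 10.0, 100.0]
mag, mag_err = flux_to_mag(flux, 25.0, flux_err=[0.1, 0.1, 0.1])
mag_flux_zp, _ = flux_to_mag(flux, 100.0, zp_form="flux")
```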

coalesce(input_cols, output_col, drop_inputs=False)[source]#

Combines multiple input columns into a single output column, with values equal to the first non-nan value encountered in the input cols.

Parameters:
  • input_cols (list) – The list of column names to coalesce into a single column.

  • output_col (str, optional) – The name of the coalesced output column.

  • drop_inputs (bool, optional) – Determines whether the input columns are dropped or preserved. If a mapped column is an input and dropped, the output column is automatically assigned to replace that column mapping internally.

Returns:

ensemble – An ensemble object.

Return type:

tape.ensemble.Ensemble
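
The "first non-NaN value" rule can be reproduced on a plain pandas DataFrame. This is a sketch of the semantics only: coalesce_columns is a hypothetical helper, and it does not handle the column-mapping update that the drop_inputs description mentions.

```python
import numpy as np
import pandas as pd


def coalesce_columns(df, input_cols, output_col, drop_inputs=False):
    """Take the first non-NaN value across input_cols, row by row."""
    out = df.copy()
    # Back-fill along each row, then read the first column: it now holds
    # the first non-NaN value found in input_cols order.
    out[output_col] = df[input_cols].bfill(axis=1).iloc[:, 0]
    if drop_inputs:
        out = out.drop(columns=[c for c in input_cols if c != output_col])
    return out


df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan],
        "b": [5.0, 2.0, np.nan],
        "c": [7.0, 8.0, 3.0],
    }
)
kept = coalesce_columns(df, ["a", "b", "c"], "flux")
dropped = coalesce_columns(df, ["a", "b", "c"], "flux", drop_inputs=True)
```

The order of input_cols sets the priority: a value in "a" wins over "b", which wins over "c".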

class EnsembleSeries(expr, label=None, ensemble=None)[source]#

Bases: _Frame, dask.dataframe.Series

A barebones extension of a Dask Series for Ensemble data.

_partition_type#
class ObjectFrame(expr, ensemble=None)[source]#

Bases: EnsembleFrame

A subclass of EnsembleFrame for Object data.

_partition_type#
classmethod from_parquet(path, index=None, columns=None, ensemble=None)[source]#

Returns an ObjectFrame constructed from loading a parquet file.

classmethod from_dask_dataframe(df, ensemble=None)[source]#

Returns an ObjectFrame constructed from a Dask dataframe.

Parameters:
  • df (dask.dataframe.DataFrame or list) – a Dask dataframe to convert to an ObjectFrame

  • ensemble (tape.ensemble.Ensemble, optional) – A link to the Ensemble object that owns this frame.

Returns:

result – The constructed ObjectFrame object.

Return type:

tape.ObjectFrame

class SourceFrame(expr, ensemble=None)[source]#

Bases: EnsembleFrame

A subclass of EnsembleFrame for Source data.

_partition_type#
__getitem__(key)[source]#
classmethod from_parquet(path, index=None, columns=None, ensemble=None)[source]#

Returns a SourceFrame constructed from loading a parquet file.

classmethod from_dask_dataframe(df, ensemble=None)[source]#

Returns a SourceFrame constructed from a Dask dataframe.

Parameters:
  • df (dask.dataframe.DataFrame or list) – a Dask dataframe to convert to a SourceFrame

  • ensemble (tape.ensemble.Ensemble, optional) – A link to the Ensemble object that owns this frame.

Returns:

result – The constructed SourceFrame object.

Return type:

tape.SourceFrame

class TapeFrame(data=None, index: pandas._typing.Axes | None = None, columns: pandas._typing.Axes | None = None, dtype: pandas._typing.Dtype | None = None, copy: bool | None = None)[source]#

Bases: pandas.DataFrame

A barebones extension of a Pandas frame to be used for underlying Ensemble data.

See https://pandas.pydata.org/docs/development/extending.html#subclassing-pandas-data-structures

property _constructor#

Used when a manipulation result has the same dimensions as the original.

property _constructor_expanddim#
class TapeObjectFrame(data=None, index: pandas._typing.Axes | None = None, columns: pandas._typing.Axes | None = None, dtype: pandas._typing.Dtype | None = None, copy: bool | None = None)[source]#

Bases: TapeFrame

A barebones extension of a Pandas frame to be used for underlying Ensemble object data.

See https://pandas.pydata.org/docs/development/extending.html#subclassing-pandas-data-structures

property _constructor#

Used when a manipulation result has the same dimensions as the original.

property _constructor_expanddim#
class TapeSourceFrame(data=None, index: pandas._typing.Axes | None = None, columns: pandas._typing.Axes | None = None, dtype: pandas._typing.Dtype | None = None, copy: bool | None = None)[source]#

Bases: TapeFrame

A barebones extension of a Pandas frame to be used for underlying Ensemble source data

See https://pandas.pydata.org/docs/development/extending.html#subclassing-pandas-data-structures

property _constructor#

Used when a manipulation result has the same dimensions as the original.

property _constructor_expanddim#
class TapeSeries(data=None, index=None, dtype: pandas._typing.Dtype | None = None, name=None, copy: bool | None = None, fastpath: bool | pandas._libs.lib.NoDefault = lib.no_default)[source]#

Bases: pandas.Series

A barebones extension of a Pandas series to be used for underlying Ensemble data.

See https://pandas.pydata.org/docs/development/extending.html#subclassing-pandas-data-structures

property _constructor#

Used when a manipulation result has the same dimensions as the original.

property _constructor_sliced#
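
The constructor properties listed for the Tape frame and series classes follow the pandas subclassing mechanism linked above. The sketch below shows that mechanism with two hypothetical classes, SketchFrame and SketchSeries, wired the way the pandas guide describes; it is not a copy of tape's classes.

```python
import pandas as pd


class SketchSeries(pd.Series):
    """Hypothetical Series subclass."""

    @property
    def _constructor(self):
        # Used when a result has the same dimensions as the original.
        return SketchSeries

    @property
    def _constructor_expanddim(self):
        # Used when a Series result should become a DataFrame.
        return SketchFrame


class SketchFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass."""

    @property
    def _constructor(self):
        # Used when a result has the same dimensions as the original.
        return SketchFrame

    @property
    def _constructor_sliced(self):
        # Used when a DataFrame result should become a Series.
        return SketchSeries


frame = SketchFrame({"time": [1.0, 2.0, 3.0], "flux": [10.0, 11.0, 12.0]})
filtered = frame[frame["flux"] > 10.5]  # still a SketchFrame
column = frame["flux"]                  # a SketchSeries
```

Without these properties, operations such as filtering would fall back to plain pandas objects and the subclass type would be lost.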
class TimeSeries(data=None)[source]#

Represent and analyze Rubin TimeSeries data

property time#

Time values stored as a Pandas Series

property flux#

Flux values stored as a Pandas Series

property flux_err#

Flux error values stored as a Pandas Series

property band#

Band labels stored as a Pandas Index

from_dict(data_dict, time_label='time', flux_label='flux', err_label='flux_err', band_label='band')[source]#

Build dataframe from a python dictionary

Parameters:
  • data_dict (dict) – Dictionary containing the data.

  • time_label (str) – Name for column containing time information.

  • flux_label (str) – Name for column containing signal (flux, magnitude, etc) information.

  • err_label (str) – Name for column containing error information.

  • band_label (str) – Name for column containing filter information.

dropna(**kwargs)[source]#

Handle NaN values, wrapper for pandas.DataFrame.dropna

from_dataframe(data, object_id, time_label='time', flux_label='flux', err_label='flux_err', band_label='band')[source]#

Loader function for inputting data from a dataframe.

Parameters:
  • data (pandas.DataFrame) – The data for the time series.

  • object_id (str) – The ID of the current object.

  • time_label (str) – Name for column containing time information.

  • flux_label (str) – Name for column containing signal (flux, magnitude, etc) information.

  • err_label (str) – Name for column containing error information.

  • band_label (str) – Name for column containing filter information.

_build_index(band)[source]#

Build pandas multiindex from band array

stetson_J(band=None)[source]#

Compute the stetsonJ statistic on data from one or several bands

Parameters:

band (str or list of str) – Single band descriptor, or list of such descriptors.

Returns:

stetsonJ – StetsonJ statistic for each of input bands.

Return type:

dict

Note

If no value for band is passed, the function is executed on all available bands.

sf2(sf_method='basic', argument_container=None)[source]#

Compute the structure function squared statistic on data

Parameters:
  • bins (numpy.array or list) – Manually provided bins, if not provided then bins are computed using the method kwarg

  • band_to_calc (str or list of str) – Single band descriptor, or list of such descriptors.

  • method ('str') – The binning method to apply, choices of ‘size’; which seeks an even distribution of samples per bin using quantiles, ‘length’; which creates bins of equal length in time and ‘loglength’; which creates bins of equal length in log time.

  • sthresh ('int') – Target number of samples per bin.

Returns:

sf2 – Structure function squared statistic for each of input bands.

Return type:

dict

Note

If no value for band_to_calc is passed, the function is executed on all available bands.

class ColumnMapper(id_col=None, time_col=None, flux_col=None, err_col=None, band_col=None)[source]#

Maps columns from a given dataset into known ensemble columns

abstract _set_known_map(hipscat=True)[source]#

Must be defined in a known map class

use_known_map(map_id, hipscat=True)[source]#

Use a known mapping scheme

Parameters:
  • map_id ('str') – Identifies which mapping scheme to use

  • hipscat ('bool') – Indicates whether the data is in hipscat format or not, which will affect the chosen ID column (_hipscat_index will be used when hipscat is True). True by default.

Returns:

A ColumnMapper subclass object dependent on the map_id provided, for example ZTFColumnMapper in the case of “ZTF”.

is_ready(show_needed=False)[source]#

Shows whether the ColumnMapper has all critical columns assigned

Parameters:

show_needed ('bool', optional) – Indicates whether to also return a list of missing columns

Return type:

bool or tuple of (bool, list) dependent on show_needed parameter

assign(id_col=None, time_col=None, flux_col=None, err_col=None, band_col=None)[source]#

Updates a given set of columns

Parameters:
  • id_col ('str', optional) – Identifies which column contains the Object IDs

  • time_col ('str', optional) – Identifies which column contains the time information

  • flux_col ('str', optional) – Identifies which column contains the flux/magnitude information

  • err_col ('str', optional) – Identifies which column contains the flux/mag error information

  • band_col ('str', optional) – Identifies which column contains the band information

  • nobs_col (list of 'str', optional) – Identifies which columns contain number of observations for each band, if available in the input object file

  • nobs_tot_col ('str', optional) – Identifies which column contains the total number of observations, if available in the input object file
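
The interplay between assign and is_ready can be sketched with a small stand-in class. ColumnMapperSketch is hypothetical and does not import tape; in particular, treating the five constructor columns as the "critical" ones is an assumption made for this example.

```python
class ColumnMapperSketch:
    """Hypothetical stand-in for the documented ColumnMapper behaviour."""

    # Assumption: these five column roles are the "critical" ones.
    REQUIRED = ("id_col", "time_col", "flux_col", "err_col", "band_col")

    def __init__(self, **columns):
        self.map = {name: None for name in self.REQUIRED}
        self.assign(**columns)

    def assign(self, **columns):
        """Update only the column roles that are given."""
        for name, value in columns.items():
            if name not in self.map:
                raise KeyError(f"unknown column role: {name}")
            if value is not None:
                self.map[name] = value

    def is_ready(self, show_needed=False):
        """Return a bool, or (bool, missing roles) when show_needed is True."""
        needed = [name for name, value in self.map.items() if value is None]
        ready = not needed
        return (ready, needed) if show_needed else ready


mapper = ColumnMapperSketch(id_col="object_id", time_col="mjd")
before = mapper.is_ready(show_needed=True)
mapper.assign(flux_col="flux", err_col="flux_err", band_col="band")
after = mapper.is_ready()
```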

SOURCE_FRAME_LABEL = 'source'[source]#
OBJECT_FRAME_LABEL = 'object'[source]#
DEFAULT_FRAME_LABEL = 'result'[source]#
METADATA_FILENAME = 'ensemble_metadata.json'[source]#
class Ensemble(client=False, **kwargs)[source]#

Ensemble object is a collection of light curve ids

__enter__()[source]#
__exit__(exc_type, exc_value, traceback)[source]#
__del__()[source]#
add_frame(frame, label)[source]#

Adds a new frame for the Ensemble to track.

Parameters:
  • frame (tape.ensemble_frame.EnsembleFrame) – The frame object for the Ensemble to track.

  • label (str) – The label for the Ensemble to use to track the frame.

Return type:

Ensemble

Raises:

ValueError – if the label is “source”, “object”, or already tracked by the Ensemble.

update_frame(frame)[source]#

Updates a frame tracked by the Ensemble or otherwise adds it to the Ensemble. The frame is tracked by its EnsembleFrame.label field.

Parameters:

frame (tape.ensemble.EnsembleFrame) – The frame for the Ensemble to update. If not already tracked, it is added.

Return type:

Ensemble

Raises:

ValueError – if the frame.label is unpopulated, or if the frame is not a SourceFrame or ObjectFrame but uses the reserved labels.

drop_frame(label)[source]#

Drops a frame tracked by the Ensemble.

Parameters:

label (str) – The label of the frame to be dropped by the Ensemble.

Return type:

Ensemble

Raises:
  • ValueError – if the label is “source”, or “object”.

  • KeyError – if the label is not tracked by the Ensemble.

select_frame(label)[source]#

Selects and returns a frame tracked by the Ensemble.

Parameters:

label (str) – The label of a frame tracked by the Ensemble to be selected.

Return type:

tape.ensemble.EnsembleFrame

Raises:

KeyError – if the label is not tracked by the Ensemble.

frame_info(labels=None, verbose=True, memory_usage=True, **kwargs)[source]#

Wrapper for calling dask.dataframe.DataFrame.info() on frames tracked by the Ensemble.

Parameters:
  • labels (list, optional) – A list of labels for Ensemble frames to summarize. If None, info is printed for all tracked frames.

  • verbose (bool, optional) – Whether to print the whole summary

  • memory_usage (bool, optional) – Specifies whether total memory usage of the DataFrame elements (including the index) should be displayed.

  • **kwargs – keyword arguments passed along to dask.dataframe.DataFrame.info()

Return type:

None

Raises:

KeyError – if a label in labels is not tracked by the Ensemble.

_generate_frame_label()[source]#

Generates a new unique label for a result frame.

insert_sources(obj_ids, bands, timestamps, fluxes, flux_errs=None, force_repartition=False, **kwargs)[source]#

Manually insert sources into the ensemble.

Requires, at a minimum, the object’s ID and the band, timestamp, and flux of the observation.

Note

This function is expensive and is provided mainly for testing purposes. Care should be used when incorporating it into the core of an analysis.

Parameters:
  • obj_ids (list) – A list of the sources’ object IDs.

  • bands (list) – A list of the bands of the observation.

  • timestamps (list) – A list of the times the sources were observed.

  • fluxes (list) – A list of the fluxes of the observations.

  • flux_errs (list, optional) – A list of the errors in the flux.

  • force_repartition (bool, optional) – Do an immediate repartition of the dataframes.

client_info()[source]#

Calls the Dask Client, which returns cluster information

Parameters:

None

Returns:

self.client – Dask Client information

Return type:

distributed.client.Client

info(verbose=True, memory_usage=True, **kwargs)[source]#

Wrapper for dask.dataframe.DataFrame.info() for the Source and Object tables

Parameters:
  • verbose (bool, optional) – Whether to print the whole summary

  • memory_usage (bool, optional) – Specifies whether total memory usage of the DataFrame elements (including the index) should be displayed.

Return type:

None

check_sorted(table='object')[source]#

Checks to see if an Ensemble Dataframe is sorted (increasing) on the index.

Parameters:

table (str, optional) – The table to check.

Returns:

Indicates whether the index is sorted (True) or not (False).

Return type:

boolean

check_lightcurve_cohesion()[source]#

Checks to see if lightcurves are split across multiple partitions.

With partitioned data and source information represented by rows, loading or manipulating the data (most likely via a repartition) can split the sources for a given object among multiple partitions. This function checks whether all lightcurves are “cohesive”, meaning the sources for each object live in only a single partition of the dataset.

Returns:

indicates whether the sources tied to a given object are only found in a single partition (True), or if they are split across multiple partitions (False)

Return type:

boolean
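
As an illustration of the cohesion property only, the following plain-Python sketch models partitions as lists of object IDs, one entry per source row. The is_cohesive helper is hypothetical and is not TAPE's implementation.

```python
def is_cohesive(partitions):
    """Return True if every object ID appears in at most one partition.

    `partitions` is a list of lists; each inner list holds the object ID of
    every source row stored in that partition (an assumed stand-in for the
    partitions of the source table).
    """
    first_seen = {}  # object ID -> index of the partition where it first appeared
    for part_idx, ids in enumerate(partitions):
        for obj_id in ids:
            if first_seen.setdefault(obj_id, part_idx) != part_idx:
                return False  # this object's sources span two partitions
    return True


cohesive = is_cohesive([[1, 1, 2], [3, 3]])  # each object lives in one partition
split = is_cohesive([[1, 1, 2], [2, 3]])     # object 2 is split across partitions
```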

sort_lightcurves(by_band=True)[source]#

Sorts each Source partition first by the indexed ID column and then by the time column, each in ascending order.

This allows for efficient access of lightcurves by their indexed object ID while still giving easy access to the sorted time series.

Note that if the lightcurves are split across multiple partitions, this operation only sorts on a per-partition basis, and the table will not be globally sorted.

You can check that no lightcurves are split across multiple partitions by verifying that Ensemble.check_lightcurve_cohesion() returns True.

Parameters:

by_band (bool, optional) – If True, the lightcurves are still sorted first by the indexed ID column, but then by band and then by timestamp, all in ascending order.

Return type:

Ensemble
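
The two sort orders can be sketched on a single partition with plain Python. The row dictionaries and the sort_partition helper are hypothetical and only mirror the ordering described above.

```python
rows = [
    {"id": 2, "band": "g", "time": 5.0},
    {"id": 1, "band": "r", "time": 1.0},
    {"id": 1, "band": "g", "time": 3.0},
    {"id": 1, "band": "g", "time": 2.0},
]


def sort_partition(rows, by_band=True):
    """Sort one partition by ID, then (optionally) band, then time."""
    if by_band:
        return sorted(rows, key=lambda r: (r["id"], r["band"], r["time"]))
    return sorted(rows, key=lambda r: (r["id"], r["time"]))


def as_tuples(sorted_rows):
    return [(r["id"], r["band"], r["time"]) for r in sorted_rows]


by_band_order = as_tuples(sort_partition(rows))
time_order = as_tuples(sort_partition(rows, by_band=False))
```

With by_band=True each object's g-band points come before its r-band points; with by_band=False the bands are interleaved in time order.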

compute(table=None, **kwargs)[source]#

Wrapper for dask.dataframe.DataFrame.compute()

The compute operation performs the computations that had been lazily allocated and returns the results as an in-memory pandas data frame.

Parameters:

table (str, optional) – The table to materialize.

Returns:

A single pandas data frame for the specified table or a tuple of (object, source) data frames.

Return type:

pd.DataFrame

persist(**kwargs)[source]#

Wrapper for dask.dataframe.DataFrame.persist()

The persist operation performs the computations that had been lazily allocated, but does not bring the results into memory or return them. This is useful for preventing a Dask task graph from growing too large by performing part of the computation.

sample(frac=None, replace=False, random_state=None)[source]#

Selects a random sample of objects (sampling each partition).

This sampling will be lazily applied to the SourceFrame as well. A new Ensemble object is created, and no additional EnsembleFrames will be carried into the new Ensemble object. Most of this docstring is copied from https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.sample.html.

Parameters:
  • frac (float, optional) – Approximate fraction of objects to return. This sampling fraction is applied to all partitions equally. Note that this is an approximate fraction. You should not expect exactly len(df) * frac items to be returned, as the exact number of elements selected will depend on how your data is partitioned (but should be pretty close in practice).

  • replace (boolean, optional) – Sample with or without replacement. Default = False.

  • random_state (int or np.random.RandomState) – If an int, we create a new RandomState with this as the seed; Otherwise we draw from the passed RandomState.

Returns:

ensemble – A new ensemble with the subset of data selected

Return type:

tape.ensemble.Ensemble

columns(table='object')[source]#

Retrieve columns from dask dataframe

head(table='object', n=5, **kwargs)[source]#

Wrapper for dask.dataframe.DataFrame.head()

tail(table='object', n=5, **kwargs)[source]#

Wrapper for dask.dataframe.DataFrame.tail()

dropna(table='source', **kwargs)[source]#

Removes rows with >= `threshold` NaN values.

Parameters:
  • table (str, optional) – A string indicating which table to filter. Should be one of “object” or “source”.

  • **kwargs – keyword arguments passed along to dask.dataframe.DataFrame.dropna

Returns:

ensemble – The ensemble object with nans removed according to the threshold scheme

Return type:

tape.ensemble.Ensemble

select(columns, table='object')[source]#

Select a subset of columns. Modifies the ensemble in-place by dropping the unselected columns.

Parameters:
  • columns (list) – A list of column labels to keep.

  • table (str, optional) – A string indicating which table to filter. Should be one of “object” or “source”.

query(expr, table='object')[source]#

Keep only certain rows of a table based on an expression of what information to keep. Wraps Dask query.

Parameters:
  • expr (str) – A string specifying the expression of what to keep.

  • table (str, optional) – A string indicating which table to filter. Should be one of “object” or “source”.

Examples

Keep sources with flux above 100.0:

ens.query("flux > 100", table="source")

Keep sources in the green band:

ens.query("band_col_name == 'g'", table="source")

Filtering on the flux column without knowing its name:

ens.query(f"{ens._flux_col} > 100", table="source")
filter_from_series(keep_series, table='object')[source]#

Filter the tables based on a Dask Series indicating which rows to keep.

Parameters:
  • keep_series (dask.dataframe.Series) – A series mapping the table’s row to a Boolean indicating whether or not to keep the row.

  • table (str, optional) – A string indicating which table to filter. Should be one of “object” or “source”.

assign(table='object', temporary=False, **kwargs)[source]#

Wrapper for dask.dataframe.DataFrame.assign()

Parameters:
  • table (str, optional) – A string indicating which table to filter. Should be one of “object” or “source”.

  • kwargs (dict of {str: callable or Series}) – Each argument is the name of a new column to add and its value specifies how to fill it. A callable is called for each row and a series is copied in.

  • temporary ('bool', optional) – Dictates whether the resulting columns are flagged as “temporary” columns within the Ensemble. Temporary columns are dropped when table syncs are performed, as their information is often made invalid by future operations. For example, the number of observations information is made invalid by a filter on the source table. Defaults to False.

Returns:

self – The ensemble object.

Return type:

tape.ensemble.Ensemble

Examples

Direct assignment of my_series to a column named “new_column”:

ens.assign(table="object", new_column=my_series)

Subtract twice the value in “err” from the value in “flux”:

ens.assign(table="source", lower_bnd=lambda x: x["flux"] - 2.0 * x["err"])
calc_nobs(by_band=False, label='nobs', temporary=True)[source]#

Calculates the number of observations per lightcurve.

Parameters:
  • by_band (bool, optional) – If True, also calculates the number of observations for each band in addition to providing the number of observations in total

  • label (str, optional) – The label used to generate output columns. “_total” and the band labels (e.g. “_g”) are appended.

  • temporary ('bool', optional) – Dictates whether the resulting columns are flagged as “temporary” columns within the Ensemble. Temporary columns are dropped when table syncs are performed, as their information is often made invalid by future operations. For example, the number of observations information is made invalid by a filter on the source table. Defaults to True.

Returns:

ensemble – The ensemble object with nobs columns added to the object table.

Return type:

tape.ensemble.Ensemble
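
The column naming can be sketched with plain Python. The count_observations helper below is hypothetical; it only mirrors how the label, “_total”, and the band suffixes combine. Bands with no observations are simply absent in this sketch, which may differ from how TAPE fills them.

```python
from collections import Counter


def count_observations(sources, by_band=False, label="nobs"):
    """Count observations per object from (object_id, band) pairs."""
    counts = {}
    for obj_id, band in sources:
        row = counts.setdefault(obj_id, Counter())
        row[f"{label}_total"] += 1       # total across all bands
        if by_band:
            row[f"{label}_{band}"] += 1  # per-band count, e.g. "nobs_g"
    return {obj_id: dict(row) for obj_id, row in counts.items()}


sources = [(1, "g"), (1, "g"), (1, "r"), (2, "r")]
nobs = count_observations(sources, by_band=True)
```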

prune(threshold=50, col_name=None)[source]#

Remove objects with fewer observations than a given threshold.

Parameters:
  • threshold (int, optional) – The minimum number of observations needed to retain an object. Default is 50.

  • col_name (str, optional) – The name of the column to assess the threshold if available in the object table. If not specified, the ensemble will calculate the number of observations and filter on the total (sum across bands).

Returns:

ensemble – The ensemble object with pruned rows removed

Return type:

tape.ensemble.Ensemble
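
The threshold rule can be sketched in plain Python, assuming the total count per object is used (the col_name=None case). The prune_objects helper is hypothetical.

```python
from collections import Counter


def prune_objects(source_ids, threshold=50):
    """Return the IDs of objects with at least `threshold` observations."""
    counts = Counter(source_ids)  # one entry in source_ids per observation
    return sorted(obj_id for obj_id, n in counts.items() if n >= threshold)


# Object 1 has 60 observations, object 2 has 49, object 3 has exactly 50.
source_ids = [1] * 60 + [2] * 49 + [3] * 50
kept = prune_objects(source_ids)
```

Because the threshold is the minimum needed to retain an object, an object with exactly 50 observations is kept at the default setting in this sketch.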

find_day_gap_offset()[source]#

Finds an approximation of the MJD offset for noon at the observatory.

This function looks for the longest stretch of hours of the day with zero observations. This gap is treated as the daylight hours and the function returns the middle hour of the gap. This is used for automatically finding offsets for binning.

Returns:

empty_hours – The estimated middle of the day as a floating point day. Returns -1.0 if no such time is found.

Return type:

list

Note

Calls a compute on the source table.
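
One way to picture the search is the plain-Python sketch below. It is an assumption-laden illustration rather than TAPE's implementation: the day_gap_offset helper is hypothetical, times are taken to be MJD floats, hours are counted from the start of the MJD day, and the gap is allowed to wrap around midnight.

```python
import math


def day_gap_offset(times_mjd):
    """Estimate the middle of the longest empty stretch of hours, in days."""
    # Hours of the day (0-23) that contain at least one observation.
    observed = {int(math.floor((t % 1.0) * 24.0)) % 24 for t in times_mjd}
    if len(observed) == 0 or len(observed) == 24:
        return -1.0  # no data, or no empty hour to call daylight
    best_start, best_len = -1, 0
    for start in range(24):
        # Only begin counting at the first empty hour after an observed hour.
        if start in observed or (start - 1) % 24 not in observed:
            continue
        length = 0
        while (start + length) % 24 not in observed:
            length += 1
        if length > best_len:
            best_start, best_len = start, length
    middle_hour = (best_start + best_len / 2.0) % 24.0
    return middle_hour / 24.0


# Observations fall in hours 0 through 9, so hours 10 through 23 are empty.
times = [60000.0 + (hour + 0.5) / 24.0 for hour in range(10)]
offset = day_gap_offset(times)
```

A value obtained this way could then be supplied as the offset argument of bin_sources.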

bin_sources(time_window=1.0, offset=0.0, custom_aggr=None, count_col=None, use_map=True, **kwargs)[source]#

Bin sources within a given time range to improve the estimates.

Parameters:
  • time_window (float, optional) – The time range (in days) over which to consider observations in the same bin. The default is 1.0 days.

  • offset (float, optional) – The offset in days to use for binning. This should correspond to the middle of the daylight hours for the observatory. Default is 0.0. This value can also be computed with find_day_gap_offset.

  • custom_aggr (dict, optional) – A dictionary mapping column name to aggregation method. This can be used to both include additional columns to aggregate OR overwrite the aggregation method for time, flux, or flux error by matching those column names. Example: {“my_value_1”: “mean”, “my_value_2”: “max”, “psFlux”: “sum”}

  • count_col (str, optional) – The name of the column in which to count the number of sources per bin. If None then it does not include this column.

  • use_map (boolean, optional) – Determines whether dask.dataframe.DataFrame.map_partitions is used (True). Using map_partitions is generally more efficient, but requires that the data from each lightcurve be housed in a single partition. If False, a groupby will be performed instead.

Returns:

ensemble – The ensemble object with the sources binned

Return type:

tape.ensemble.Ensemble

Notes

  • This should only be used for slowly varying sources where we can treat the source as constant within time_window.

  • By default the function only aggregates and keeps the id, band, time, flux, and flux error columns. Additional columns can be preserved by providing the mapping of column name to aggregation function with the custom_aggr parameter.
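
How time_window and offset determine bin membership can be sketched in plain Python. The bin_sources_sketch helper is hypothetical, and the plain means used for aggregation are an assumption made for brevity; TAPE's default aggregations may differ.

```python
import math
from collections import defaultdict


def bin_sources_sketch(rows, time_window=1.0, offset=0.0):
    """Group (obj_id, band, time, flux) rows into time bins and average them."""
    groups = defaultdict(list)
    for obj_id, band, time, flux in rows:
        # Bin edges sit at offset + k * time_window for integer k.
        bin_idx = math.floor((time - offset) / time_window)
        groups[(obj_id, band, bin_idx)].append((time, flux))
    binned = []
    for (obj_id, band, _), members in sorted(groups.items()):
        n = len(members)
        mean_time = sum(t for t, _ in members) / n
        mean_flux = sum(f for _, f in members) / n
        binned.append((obj_id, band, mean_time, mean_flux, n))
    return binned


rows = [
    (1, "g", 10.25, 100.0),
    (1, "g", 10.75, 110.0),
    (1, "g", 11.25, 90.0),
    (1, "r", 10.25, 50.0),
]
# With offset=0.5 the g-band points at 10.75 and 11.25 share a bin.
binned = bin_sources_sketch(rows, time_window=1.0, offset=0.5)
```

With offset=0.0 the same data would instead pair 10.25 with 10.75, which is why the offset should sit in the daylight gap rather than in the middle of a night.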

batch(func, *args, meta=None, by_band=False, use_map=True, on=None, label='', **kwargs)[source]#

Run a function from tape.TimeSeries on the available ids

Parameters:
  • func (function) – A function to apply to all objects in the ensemble. The function could be a TAPE function, an initialized feature extractor from the light-curve package, or a user-defined function. At a minimum, the function must have the following signature: func(*cols, **kwargs), where the names of the cols are specified in args, kwargs are keyword arguments passed to the function, and the return value schema is described by meta. For TAPE and light-curve functions, args, meta and on are populated automatically.

  • *args – Denotes the ensemble columns to use as inputs for a function, order must be correct for function. If passing a TAPE or light-curve function, these are populated automatically.

  • meta (pd.Series, pd.DataFrame, dict, or tuple-like) – Dask’s meta parameter, which lays down the expected structure of the results. Overridden by TAPE for TAPE and light-curve functions. If None, attempts to coerce the result to a pandas.Series.

  • by_band (boolean, optional) – If True, the lightcurves are split into separate inputs for each band and passed along to the function individually. If the band column is already specified in on, then batch will ensure the band column is the final element in on. For all original columns outputted by func, by_band will generate a set of new columns per band (for example, a function with output column “result” will instead have “result_g” and “result_r” as columns if the data had g and r band data). If False (default), the full lightcurve is passed along to the function (assuming the band column is not already part of on).

  • use_map (boolean) – Determines whether dask.dataframe.DataFrame.map_partitions is used (True). Using map_partitions is generally more efficient, but requires that the data from each lightcurve be housed in a single partition. This can be checked using Ensemble.check_lightcurve_cohesion. If False, a groupby will be performed instead.

  • on ('str' or 'list', optional) – Designates which column(s) to groupby. Columns may be from the source or object tables. If not specified, then the id column is used by default. For TAPE and light-curve functions this is populated automatically.

  • label ('str', optional) – If provided, the ensemble will use this label to track the result dataframe. If not provided, a label of the form “result_{x}”, where x is a monotonically increasing integer, is generated. If None, the result frame will not be tracked.

  • **kwargs – Additional optional parameters passed for the selected function

Returns:

result – Series of function results

Return type:

Dask.Series

Examples

Run a TAPE function on the ensemble:

from tape import Ensemble
from tape.analysis.stetsonj import calc_stetson_J

ens = Ensemble().from_dataset('rrlyr82')
ens.batch(calc_stetson_J, band_to_calc='i')

Run a light-curve function on the ensemble:

from light_curve import EtaE
ens.batch(EtaE(), band_to_calc='g')

Run a custom function on the ensemble:

import numpy as np

def s2n_inter_quartile_range(flux, err):
    # Spread of the signal-to-noise ratio between its 25th and 75th percentiles
    first, third = np.quantile(flux / err, [0.25, 0.75])
    return third - first

ens.batch(s2n_inter_quartile_range, ens._flux_col, ens._err_col)

Or even a numpy built-in function:

amplitudes = ens.batch(np.ptp, ens._flux_col)
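
The per-band output naming described for by_band can be pictured without Dask. The batch_sketch helper below is a hypothetical plain-Python model, not the Ensemble.batch implementation; it applies a function once per object, or once per object and band, and names the output columns accordingly.

```python
from collections import defaultdict


def batch_sketch(rows, func, by_band=False, label="result"):
    """Apply func to the fluxes of each object (or each object and band)."""
    groups = defaultdict(list)
    for obj_id, band, flux in rows:
        key = (obj_id, band) if by_band else (obj_id, None)
        groups[key].append(flux)
    results = defaultdict(dict)
    for (obj_id, band), fluxes in groups.items():
        # by_band turns one output column into one column per band.
        column = f"{label}_{band}" if by_band else label
        results[obj_id][column] = func(fluxes)
    return dict(results)


def peak_to_peak(fluxes):
    return max(fluxes) - min(fluxes)


rows = [
    (1, "g", 1.0), (1, "g", 4.0), (1, "r", 2.0), (1, "r", 3.0),
    (2, "g", 5.0), (2, "g", 7.0),
]
full = batch_sketch(rows, peak_to_peak)
per_band = batch_sketch(rows, peak_to_peak, by_band=True)
```

Object 2 has no r-band data, so this sketch simply omits its “result_r” entry; how TAPE fills such gaps is not specified here.
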
_standardize_batch(batch, on, by_band)[source]#

Standardizes the output of a batch result.

save_ensemble(path='.', dirname='ensemble', additional_frames=True, **kwargs)[source]#

Save the current ensemble frames to disk.

Parameters:
  • path ('str' or path-like, optional) – A path to the desired location of the top-level save directory, by default this is the current working directory.

  • dirname ('str', optional) – The name of the saved ensemble directory, “ensemble” by default.

  • additional_frames (bool, or list, optional) – Controls whether EnsembleFrames beyond the Object and Source Frames are saved to disk. If True or False, this specifies whether all or none of the additional frames are saved. Alternatively, a list of EnsembleFrame names may be provided to specify which frames should be saved. Object and Source will always be added and do not need to be specified in the list. By default, all frames will be saved.

  • **kwargs – Additional kwargs passed along to EnsembleFrame.to_parquet()

Return type:

None

Note

If the object frame has no columns, which is often the case when an Ensemble is constructed using only source files/dictionaries, then an object subdirectory will not be created. Ensemble.from_ensemble will know how to work with the directory whether or not the object subdirectory is present.

Be careful about repeated saves to the same directory name. This will not be a perfect overwrite, as any products produced by a previous save may not be deleted by successive saves if they are removed from the ensemble. For best results, delete the directory between saves or verify that the contents are what you would expect.

from_ensemble(dirpath, additional_frames=True, column_mapper=None, **kwargs)[source]#

Load an ensemble from an on-disk ensemble.

Parameters:
  • dirpath ('str' or path-like, optional) – A path to the top-level ensemble directory to load from.

  • additional_frames (bool, or list, optional) – Controls whether EnsembleFrames beyond the Object and Source Frames are loaded from disk. If True or False, this specifies whether all or none of the additional frames are loaded. Alternatively, a list of EnsembleFrame names may be provided to specify which frames should be loaded. Object and Source will always be added and do not need to be specified in the list. By default, all frames will be loaded.

  • column_mapper (Tape.ColumnMapper object, or None, optional) – Supplies a ColumnMapper to the Ensemble, if None (default) searches for a column_mapper.npy file in the directory, which should be created when the ensemble is saved.

Returns:

ensemble – The ensemble object.

Return type:

tape.ensemble.Ensemble

from_pandas(source_frame, object_frame=None, column_mapper=None, sync_tables=True, npartitions=None, partition_size=None, **kwargs)[source]#

Read in Pandas dataframe(s) into an ensemble object

Parameters:
  • source_frame ('pandas.Dataframe') – A Pandas dataframe that contains source information to be read into the ensemble

  • object_frame ('pandas.Dataframe', optional) – If not specified, the object frame is generated from the source frame

  • column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.

  • sync_tables ('bool', optional) – In the case where an `object_frame` is provided, determines whether an initial sync is performed between the object and source tables. If not performed, dynamic information like the number of observations may be out of date until a sync is performed internally.

  • npartitions (int, optional) – If specified, attempts to repartition the ensemble to the specified number of partitions

  • partition_size (int, optional) – If specified, attempts to repartition the ensemble to partitions of size partition_size.

Returns:

ensemble – The ensemble object with the Pandas dataframe data loaded.

Return type:

tape.ensemble.Ensemble

from_dask_dataframe(source_frame, object_frame=None, column_mapper=None, sync_tables=True, npartitions=None, partition_size=None, sorted=False, sort=False, **kwargs)[source]#

Read in Dask dataframe(s) into an ensemble object

Parameters:
  • source_frame ('dask.Dataframe') – A Dask dataframe that contains source information to be read into the ensemble

  • object_frame ('dask.Dataframe', optional) – If not specified, the object frame is generated from the source frame

  • column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.

  • sync_tables ('bool', optional) – In the case where an `object_frame` is provided, determines whether an initial sync is performed between the object and source tables.

  • npartitions (int, optional) – If specified, attempts to repartition the ensemble to the specified number of partitions

  • partition_size (int, optional) – If specified, attempts to repartition the ensemble to partitions of size partition_size.

  • sorted (bool, optional) – If the index column is already sorted in increasing order. Defaults to False

  • sort (bool, optional) – If True, sorts the DataFrame by the id column. Otherwise set the index on the individual existing partitions. Defaults to False.

Returns:

ensemble – The ensemble object with the Dask dataframe data loaded.

Return type:

tape.ensemble.Ensemble

from_lsdb(source_catalog, object_catalog=None, column_mapper=None, sync_tables=False, sorted=True, sort=False)[source]#

Read in from LSDB catalog objects.

Parameters:
  • source_catalog ('dask.Dataframe') – An LSDB catalog that contains source information to be read into the ensemble.

  • object_catalog ('dask.Dataframe', optional) – An LSDB catalog containing object information. If not specified, a minimal ObjectFrame is generated from the source catalog.

  • column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.

  • sync_tables ('bool', optional) – In the case where an `object_catalog` is provided, determines whether an initial sync is performed between the object and source tables. Defaults to False.

  • sorted (bool, optional) – If the index column is already sorted in increasing order. Defaults to True.

  • sort (bool, optional) – If True, sorts the DataFrame by the id column. Otherwise set the index on the individual existing partitions. Defaults to False.

Returns:

ensemble – The ensemble object with the LSDB catalog data loaded.

Return type:

tape.ensemble.Ensemble

from_hipscat(source_path, object_path=None, column_mapper=None, source_index=None, object_index=None, sorted=True, sort=False)[source]#

Use LSDB to read from a hipscat directory.

This function utilizes LSDB for reading a hipscat directory into TAPE. In cases where a user would like to do operations on the LSDB catalog objects, it’s best to use LSDB itself first, and then load the result into TAPE using tape.Ensemble.from_lsdb. A join is performed between the two tables to modify the source table to use the object index, using object_index and source_index.

Parameters:
  • source_path (str or Path) – A hipscat directory that contains source information to be read into the ensemble.

  • object_path (str or Path, optional) – A hipscat directory containing object information. If not specified, a minimal ObjectFrame is generated from the sources.

  • column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.

  • object_index ('str', optional) – The join index of the object table, should be the label for the object ID contained in the object table.

  • source_index ('str', optional) – The join index of the source table, should be the label for the object ID contained in the source table.

  • sorted (bool, optional) – If the index column is already sorted in increasing order. Defaults to True.

  • sort (bool, optional) – If True, sorts the DataFrame by the id column. Otherwise set the index on the individual existing partitions. Defaults to False.

Returns:

ensemble – The ensemble object with the hipscat data loaded.

Return type:

tape.ensemble.Ensemble

make_column_map()[source]#

Returns the current column mapping.

Returns:

result – A new column mapper representing the Ensemble’s current mappings.

Return type:

tape.utils.ColumnMapper

update_column_mapping(column_mapper=None, **kwargs)[source]#

Update the mapping of column names.

Parameters:
  • column_mapper (tape.utils.ColumnMapper, optional) – An entirely new mapping of column names. If None then modifies the current mapping using kwargs.

  • kwargs – Individual column to name settings.

Returns:

self

Return type:

Ensemble

_load_column_mapper(column_mapper, **kwargs)[source]#

Load a column mapper object.

Parameters:
  • column_mapper (tape.utils.ColumnMapper or None) – The ColumnMapper to use. If None then the function creates a new one from kwargs.

  • kwargs (optional) – Individual column to name settings.

Returns:

self

Return type:

Ensemble

Raises:

ValueError if a required column is missing.

from_parquet(source_file, object_file=None, column_mapper=None, sync_tables=True, additional_cols=True, npartitions=None, partition_size=None, sorted=False, sort=False, **kwargs)[source]#

Read in parquet file(s) into an ensemble object

Parameters:
  • source_file ('str') – Path to a parquet file, or multiple parquet files that contain source information to be read into the ensemble

  • object_file ('str', optional) – Path to a parquet file, or multiple parquet files that contain object information. If not specified, it is generated from the source table

  • column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.

  • sync_tables ('bool', optional) – In the case where object files are loaded in, determines whether an initial sync is performed between the object and source tables. If not performed, dynamic information like the number of observations may be out of date until a sync is performed internally.

  • additional_cols ('bool', optional) – Indicates whether to carry in columns beyond the critical columns. If True, all columns are loaded; if False, only the columns containing the critical quantities (id, time, flux, err, band) are loaded.

  • npartitions (int, optional) – If specified, attempts to repartition the ensemble to the specified number of partitions

  • partition_size (int, optional) – If specified, attempts to repartition the ensemble to partitions of size partition_size.

  • sorted (bool, optional) – If the index column is already sorted in increasing order. Defaults to False

  • sort (bool, optional) – If True, sorts the DataFrame by the id column. Otherwise set the index on the individual existing partitions. Defaults to False.

Returns:

ensemble – The ensemble object with parquet data loaded

Return type:

tape.ensemble.Ensemble

from_dataset(dataset, **kwargs)[source]#

Load the ensemble from a TAPE dataset.

Parameters:

dataset ('str') – The name of the dataset to import

Returns:

ensemble – The ensemble object with the dataset loaded

Return type:

tape.ensemble.Ensemble

available_datasets()[source]#

Retrieve descriptions of available TAPE datasets.

Returns:

A dictionary of datasets with description information.

Return type:

dict

from_source_dict(source_dict, column_mapper=None, npartitions=1, sorted=False, sort=False, **kwargs)[source]#

Load the sources into an ensemble from a dictionary.

Parameters:
  • source_dict ('dict') – The dictionary containing the source information.

  • column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.

  • npartitions (int, optional) – If specified, attempts to repartition the ensemble to the specified number of partitions

  • sorted (bool, optional) – If the index column is already sorted in increasing order. Defaults to False

  • sort (bool, optional) – If True, sorts the DataFrame by the id column. Otherwise set the index on the individual existing partitions. Defaults to False.

Returns:

ensemble – The ensemble object with dictionary data loaded

Return type:

tape.ensemble.Ensemble

convert_flux_to_mag(zero_point, zp_form='mag', out_col_name=None, flux_col=None, err_col=None)[source]#

Converts a flux column into a magnitude column.

Parameters:
  • zero_point ('str' or 'float') – The name of the ensemble column containing the zero point information for column transformation. Alternatively, a single float number to apply for all fluxes.

  • zp_form (str, optional) – The form of the zero point column, either “flux” or “magnitude”/”mag”. Determines how the zero point (zp) is applied in the conversion. If “flux”, then the function is applied as mag=-2.5*log10(flux/zp), or if “magnitude”, then mag=-2.5*log10(flux)+zp.

  • out_col_name ('str', optional) – The name of the output magnitude column, if None then the output is just the flux column name + “_mag”. The error column is also generated as the out_col_name + “_err”.

  • flux_col ('str', optional) – The name of the ensemble flux column to convert into magnitudes. Uses the Ensemble mapped flux column if not specified.

  • err_col ('str', optional) – The name of the ensemble column containing the errors to propagate. Errors are propagated using the following approximation: Err= (2.5/log(10))*(flux_error/flux), which holds mainly when the error in flux is much smaller than the flux. Uses the Ensemble mapped error column if not specified.

Returns:

ensemble – The ensemble object with a new magnitude (and error) column.

Return type:

tape.ensemble.Ensemble
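
The two zero-point forms and the error approximation quoted above can be checked on a single measurement with a short sketch. The flux_to_mag helper and the numeric values are illustrative assumptions; only the formulas come from the parameter descriptions.

```python
import math


def flux_to_mag(flux, flux_err, zero_point, zp_form="mag"):
    """Convert one flux value to a magnitude and propagate its error."""
    if zp_form == "flux":
        # Zero point given as a flux: mag = -2.5 * log10(flux / zp)
        mag = -2.5 * math.log10(flux / zero_point)
    elif zp_form in ("mag", "magnitude"):
        # Zero point given as a magnitude: mag = -2.5 * log10(flux) + zp
        mag = -2.5 * math.log10(flux) + zero_point
    else:
        raise ValueError(f"unknown zp_form: {zp_form!r}")
    # First-order propagation, valid when flux_err is much smaller than flux.
    mag_err = (2.5 / math.log(10)) * (flux_err / flux)
    return mag, mag_err


mag, mag_err = flux_to_mag(100.0, 1.0, zero_point=31.4)
mag_from_flux_zp, _ = flux_to_mag(100.0, 1.0, zero_point=10.0, zp_form="flux")
```

A 1% flux error maps to roughly 0.011 mag regardless of the zero point, since the zero point only shifts the magnitude.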

_generate_object_table()[source]#

Generate an empty object table from the source table.

_lazy_sync_tables_from_frame(frame)[source]#

Call the sync operation for the frame only if the table being modified (frame) needs to be synced. Does nothing in the case that only the table to be modified is dirty or if it is not the object or source frame for this Ensemble.

Parameters:

frame (tape.ensemble_frame.EnsembleFrame) – The frame being modified. Only an ObjectFrame or SourceFrame tracked by this Ensemble may trigger a sync.

_lazy_sync_tables(table='object')[source]#

Call the sync operation for the table only if the table being modified (table) needs to be synced. Does nothing in the case that only the table to be modified is dirty.

Parameters:

table (str, optional) – The table being modified. Should be one of “object”, “source”, or “all”
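The lazy-sync rule described above can be sketched in plain Python. This is only an illustration of the stated behavior, not the library's implementation: the function name needs_sync and the two boolean dirty flags are hypothetical stand-ins for the Ensemble's internal state.

```python
def needs_sync(table, object_dirty, source_dirty):
    """Return True when modifying `table` should first trigger a sync.

    Hypothetical helper: `object_dirty` and `source_dirty` stand in for
    whatever dirty-tracking the Ensemble keeps for its two tables.
    """
    if table == "object":
        # If only the object table itself is dirty, nothing needs to happen;
        # a sync is needed when the *other* table has pending changes.
        return source_dirty
    if table == "source":
        return object_dirty
    if table == "all":
        return object_dirty or source_dirty
    raise ValueError(f"table must be 'object', 'source', or 'all', got {table!r}")
```

Under these assumptions, modifying the object table while only the object table is dirty does not trigger a sync, whereas modifying it while the source table is dirty does.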

_sync_tables()[source]#

Sync operation to align both tables.

Filtered objects are always removed from the source table, but filtered sources may be kept in the object table if the Ensemble’s keep_empty_objects attribute is set to True.
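One possible reading of this behavior, sketched with plain Python containers rather than Dask frames. The function sync_tables and its arguments are hypothetical; the sketch assumes that an object left without any sources is dropped unless keep_empty_objects is True.

```python
def sync_tables(object_ids, source_rows, keep_empty_objects=False):
    """Align a set of object ids with a list of (object_id, value) source rows.

    Illustrative only: plain sets and lists stand in for the object and
    source tables.
    """
    # Sources belonging to objects that were filtered out are always removed.
    kept_sources = [row for row in source_rows if row[0] in object_ids]

    if keep_empty_objects:
        # Objects survive even if none of their sources remain.
        kept_objects = set(object_ids)
    else:
        # Objects without any remaining source rows are dropped.
        ids_with_sources = {row[0] for row in kept_sources}
        kept_objects = {obj for obj in object_ids if obj in ids_with_sources}
    return kept_objects, kept_sources


objects = {1, 2}
sources = [(1, 0.5), (3, 0.7)]  # object 3 was filtered; object 2 has no sources
default_result = sync_tables(objects, sources)
keep_result = sync_tables(objects, sources, keep_empty_objects=True)
```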

select_random_timeseries(seed=None)[source]#

Selects a random lightcurve from a random partition of the Ensemble.

Parameters:

seed (int, or None) – Sets a seed to return the same object id on successive runs. None by default, in which case a seed is not set for the operation.

Returns:

ts – Timeseries for a single object

Return type:

TimeSeries

Note

The sampling is not uniform. A random partition is chosen first to avoid a search over the full index space, and partitions may vary in the number of objects they contain. In other words, objects in smaller partitions have a higher probability of being chosen than objects in larger partitions.
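The effect of choosing a partition first can be made concrete with a small calculation. The sketch below assumes the partition is drawn uniformly and the object is then drawn uniformly within it; the function name and the toy partitions are illustrative only.

```python
from fractions import Fraction


def selection_probabilities(partitions):
    """Probability of each object being picked under two-stage sampling.

    Assumes a uniformly random partition, then a uniformly random object
    inside that partition.
    """
    n_partitions = len(partitions)
    probabilities = {}
    for objects in partitions:
        for obj in objects:
            probabilities[obj] = Fraction(1, n_partitions) * Fraction(1, len(objects))
    return probabilities


# One small partition (2 objects) and one large partition (8 objects).
toy_partitions = [["a", "b"], ["c", "d", "e", "f", "g", "h", "i", "j"]]
probs = selection_probabilities(toy_partitions)
```

Here each object in the small partition has probability 1/4 of being chosen, while each object in the large partition has probability 1/16.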

to_timeseries(target, id_col=None, time_col=None, flux_col=None, err_col=None, band_col=None)[source]#

Construct a timeseries object from one target object_id. Assumes that the result is a collection of lightcurves (output from query_ids).

Parameters:
  • target (int) – Id of a source to be extracted

  • id_col ('str', optional) – Identifies which column contains the Object IDs

  • time_col ('str', optional) – Identifies which column contains the time information

  • flux_col ('str', optional) – Identifies which column contains the flux/magnitude information

  • err_col ('str', optional) – Identifies which column contains the error information

  • band_col ('str', optional) – Identifies which column contains the band information

Returns:

ts – Timeseries for a single object

Return type:

TimeSeries

Note

Any _col parameter that is not specified defaults to the corresponding critical column determined at data ingest.

_build_index(obj_id, band)[source]#

Build pandas multiindex from object_ids and bands

Parameters:
  • obj_id (np.array or list) – The object ID for each row in the data.

  • band (np.array or list) – A list of the band for each row in the data.

Returns:

index

Return type:

pd.MultiIndex
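As a rough illustration of what a MultiIndex built from per-row object ids and bands looks like, the sketch below uses pandas.MultiIndex.from_arrays. The level names and the exact construction are assumptions made for the example and may differ from what this method produces.

```python
import pandas as pd

# One entry per row of the data.
obj_id = [1, 1, 2]
band = ["g", "r", "g"]

# Level names are illustrative, not taken from the library.
index = pd.MultiIndex.from_arrays([obj_id, band], names=["object_id", "band"])
```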

sf2(sf_method='basic', argument_container=None, use_map=True)[source]#

Wrapper interface for calling structurefunction2 on the ensemble

Parameters:
  • sf_method ('str') – The structure function calculation method to be used, by default “basic”.

  • argument_container (StructureFunctionArgumentContainer, optional) – Container object for additional configuration options, by default None.

  • use_map (boolean) – Determines whether dask.dataframe.DataFrame.map_partitions is used (True). Using map_partitions is generally more efficient, but requires that the data from each lightcurve be housed in a single partition. If False, a groupby will be performed instead.

Returns:

result – Structure function squared for each of the input bands.

Return type:

pandas.DataFrame

Note

If no value for band_to_calc is passed, the function is executed on all bands available in the band column.
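The single-partition requirement for use_map=True can be illustrated without Dask. In the sketch below, plain lists stand in for partitions and a row count stands in for the per-lightcurve calculation; all names are hypothetical. When a lightcurve is split across partitions, a per-partition computation sees only part of its data.

```python
from collections import defaultdict


def counts_per_partition(partitions):
    """Apply a per-object reduction (a row count) to each partition independently."""
    results = []
    for rows in partitions:
        counts = defaultdict(int)
        for obj_id, _flux in rows:
            counts[obj_id] += 1
        results.append(dict(counts))
    return results


def counts_global(partitions):
    """Apply the same reduction after grouping rows across all partitions."""
    counts = defaultdict(int)
    for rows in partitions:
        for obj_id, _flux in rows:
            counts[obj_id] += 1
    return dict(counts)


# Each lightcurve sits in exactly one partition: per-partition results are complete.
aligned = [[("a", 1.0), ("a", 1.1)], [("b", 2.0)]]
# Lightcurve "a" is split across partitions: per-partition results are partial.
split = [[("a", 1.0)], [("a", 1.1), ("b", 2.0)]]

aligned_result = counts_per_partition(aligned)
split_result = counts_per_partition(split)
grouped_result = counts_global(split)
```

With the split layout, object "a" shows up twice with a partial count of 1 each time, while the grouped computation recovers the full count of 2.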

_translate_meta(meta)[source]#

Translates Dask-style meta into a TapeFrame or TapeSeries object.

Parameters:

meta (dict, tuple, list, pd.Series, pd.DataFrame, pd.Index, dtype, scalar)

Returns:

result – The appropriate meta for Dask producing a tape.ensemble_frame.EnsembleFrame or Ensemble.EnsembleSeries respectively

Return type:

ensemble.TapeFrame or ensemble.TapeSeries

class EnsembleFrame(expr, label=None, ensemble=None)[source]#

Bases: _Frame, dask.dataframe.DataFrame

An extension for a Dask Dataframe for data used by a lightcurve Ensemble.

The underlying non-parallel dataframes are TapeFrames and TapeSeries which extend Pandas frames.

Examples

Instantiation:

import tape
ens = tape.Ensemble()
data = {...} # Some data you want tracked by the Ensemble
ensemble_frame = tape.EnsembleFrame.from_dict(data, npartitions=1, label="my_frame", ensemble=ens)
_partition_type#
__getitem__(key)[source]#
classmethod from_tapeframe(data, npartitions=None, chunksize=None, sort=True, label=None, ensemble=None)[source]#

Returns an EnsembleFrame constructed from a TapeFrame.

Parameters:
  • data (TapeFrame) – Frame containing the underlying data for the EnsembleFrame.

  • npartitions (int, optional) – The number of partitions of the index to create. Note that depending on the size and index of the dataframe, the output may have fewer partitions than requested.

  • chunksize (int, optional) – Size of the individual chunks of data in non-parallel objects that make up Dask frames.

  • sort (bool, optional) – Whether to sort the frame by a default index.

  • label (str, optional) – The label used by the Ensemble to identify the frame.

  • ensemble (tape.Ensemble, optional) – A link to the Ensemble object that owns this frame.

Returns:

result – The constructed EnsembleFrame object.

Return type:

tape.EnsembleFrame

classmethod from_dask_dataframe(df, ensemble=None, label=None)[source]#

Returns an EnsembleFrame constructed from a Dask dataframe.

Parameters:
  • df (dask.dataframe.DataFrame or list) – a Dask dataframe to convert to an EnsembleFrame

  • ensemble (tape.ensemble.Ensemble, optional) – A link to the Ensemble object that owns this frame.

  • label (str, optional) – The label used by the Ensemble to identify the frame.

Returns:

result – The constructed EnsembleFrame object.

Return type:

tape.EnsembleFrame

update_ensemble()[source]#

Updates the Ensemble linked by the EnsembleFrame.ensemble property to track this frame.

Returns:

result – The Ensemble object which tracks this frame, or None if there is no such Ensemble.

Return type:

tape.Ensemble

classmethod from_dict(data, npartitions, orient='columns', dtype=None, columns=None, label=None, ensemble=None)[source]#

Construct a Tape EnsembleFrame from a Python Dictionary

Parameters:
  • data (dict) – Of the form {field : array-like} or {field : dict}.

  • npartitions (int) – The number of partitions of the index to create. Note that depending on the size and index of the dataframe, the output may have fewer partitions than requested.

  • orient ({'columns', 'index', 'tight'}, default 'columns') – The “orientation” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’. If ‘tight’, assume a dict with keys [‘index’, ‘columns’, ‘data’, ‘index_names’, ‘column_names’].

  • dtype (bool) – Data type to force, otherwise infer.

  • columns (string, optional) – Column labels to use when orient='index'. Raises a ValueError if used with orient='columns' or orient='tight'.

  • label (str, optional) – The label used by the Ensemble to identify the frame.

  • ensemble (tape.ensemble.Ensemble, optional) – A link to the Ensemble object that owns this frame.

Returns:

result – The constructed EnsembleFrame object.

Return type:

tape.EnsembleFrame

classmethod from_parquet(path, index=None, columns=None, label=None, ensemble=None, **kwargs)[source]#

Returns an EnsembleFrame constructed from loading a parquet file.

Parameters:
  • path (str or list) – Source directory for data, or path(s) to individual parquet files. Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.

  • index (str, list, False, optional) – Field name(s) to use as the output frame index. Default is None and index will be inferred from the pandas parquet file metadata, if present. Use False to read all fields as columns.

  • columns (str or list, optional) – Field name(s) to read in as columns in the output. By default all non-index fields will be read (as determined by the pandas parquet metadata, if present). Provide a single field name instead of a list to read in the data as a Series.

  • label (str, optional) – The label used by the Ensemble to identify the frame.

  • ensemble (tape.ensemble.Ensemble, optional) – A link to the Ensemble object that owns this frame.

Returns:

result – The constructed EnsembleFrame object.

Return type:

tape.EnsembleFrame

convert_flux_to_mag(flux_col, zero_point, err_col=None, zp_form='mag', out_col_name=None)[source]#

Converts this EnsembleFrame’s flux column into a magnitude column, returning a new EnsembleFrame.

Parameters:
  • flux_col ('str') – The name of the EnsembleFrame flux column to convert into magnitudes.

  • zero_point ('str') – The name of the EnsembleFrame column containing the zero point information for column transformation.

  • err_col ('str', optional) – The name of the EnsembleFrame column containing the errors to propagate. Errors are propagated using the following approximation: Err= (2.5/log(10))*(flux_error/flux), which holds mainly when the error in flux is much smaller than the flux.

  • zp_form (str, optional) – The form of the zero point column, either “flux” or “magnitude”/”mag”. Determines how the zero point (zp) is applied in the conversion. If “flux”, then the function is applied as mag=-2.5*log10(flux/zp), or if “magnitude”, then mag=-2.5*log10(flux)+zp.

  • out_col_name ('str', optional) – The name of the output magnitude column, if None then the output is just the flux column name + “_mag”. The error column is also generated as the out_col_name + “_err”.

Returns:

result – A new EnsembleFrame object with a new magnitude (and error) column.

Return type:

tape.EnsembleFrame
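The conversion and error-propagation formulas quoted in the parameter descriptions can be checked on single values. The following is a standalone sketch using only the standard library; the function flux_to_mag is hypothetical and is not part of the EnsembleFrame API, which operates on whole columns.

```python
import math


def flux_to_mag(flux, zero_point, flux_err=None, zp_form="mag"):
    """Scalar version of the documented flux-to-magnitude formulas."""
    if zp_form == "flux":
        # Zero point given as a flux: mag = -2.5 * log10(flux / zp)
        mag = -2.5 * math.log10(flux / zero_point)
    elif zp_form in ("mag", "magnitude"):
        # Zero point given as a magnitude: mag = -2.5 * log10(flux) + zp
        mag = -2.5 * math.log10(flux) + zero_point
    else:
        raise ValueError(f"zp_form must be 'flux', 'mag' or 'magnitude', got {zp_form!r}")

    mag_err = None
    if flux_err is not None:
        # First-order approximation, valid when flux_err is much smaller than flux.
        mag_err = (2.5 / math.log(10)) * (flux_err / flux)
    return mag, mag_err


mag, mag_err = flux_to_mag(100.0, 25.0, flux_err=1.0, zp_form="mag")
mag_from_flux_zp, _ = flux_to_mag(100.0, 10000.0, zp_form="flux")
```

For a flux of 100 with a magnitude zero point of 25, the result is a magnitude of 20 with an error of about 0.0109. With a flux zero point of 10000, the same flux gives a magnitude of 5.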

coalesce(input_cols, output_col, drop_inputs=False)[source]#

Combines multiple input columns into a single output column, with values equal to the first non-nan value encountered in the input cols.

Parameters:
  • input_cols (list) – The list of column names to coalesce into a single column.

  • output_col (str, optional) – The name of the coalesced output column.

  • drop_inputs (bool, optional) – Determines whether the input columns are dropped or preserved. If a mapped column is an input and dropped, the output column is automatically assigned to replace that column mapping internally.

Returns:

ensemble – An ensemble object.

Return type:

tape.ensemble.Ensemble
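The "first non-NaN value" rule can be sketched row by row in plain Python. The helper coalesce_rows and the column names below are hypothetical and only illustrate the semantics described above, not how the method is implemented on Dask frames.

```python
import math


def coalesce_rows(columns, input_cols):
    """Return, for each row, the first non-NaN value across input_cols.

    `columns` maps column names to equal-length lists of values.
    """
    n_rows = len(columns[input_cols[0]])
    output = []
    for i in range(n_rows):
        value = math.nan  # stays NaN if every input column is NaN in this row
        for col in input_cols:
            candidate = columns[col][i]
            if not (isinstance(candidate, float) and math.isnan(candidate)):
                value = candidate
                break
        output.append(value)
    return output


data = {
    "flux_a": [1.0, math.nan, math.nan],
    "flux_b": [9.0, 2.0, math.nan],
}
coalesced = coalesce_rows(data, ["flux_a", "flux_b"])
```

The order of input_cols matters: in the first row flux_a wins even though flux_b also has a value.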

class EnsembleSeries(expr, label=None, ensemble=None)[source]#

Bases: _Frame, dask.dataframe.Series

A barebones extension of a Dask Series for Ensemble data.

_partition_type#
class ObjectFrame(expr, ensemble=None)[source]#

Bases: EnsembleFrame

A subclass of EnsembleFrame for Object data.

_partition_type#
classmethod from_parquet(path, index=None, columns=None, ensemble=None)[source]#

Returns an ObjectFrame constructed from loading a parquet file.

classmethod from_dask_dataframe(df, ensemble=None)[source]#

Returns an ObjectFrame constructed from a Dask dataframe.

Parameters:
  • df (dask.dataframe.DataFrame or list) – a Dask dataframe to convert to an ObjectFrame

  • ensemble (tape.ensemble.Ensemble, optional) – A link to the Ensemble object that owns this frame.

Returns:

result – The constructed ObjectFrame object.

Return type:

tape.ObjectFrame

class SourceFrame(expr, ensemble=None)[source]#

Bases: EnsembleFrame

A subclass of EnsembleFrame for Source data.

_partition_type#
__getitem__(key)[source]#
classmethod from_parquet(path, index=None, columns=None, ensemble=None)[source]#

Returns a SourceFrame constructed from loading a parquet file.

classmethod from_dask_dataframe(df, ensemble=None)[source]#

Returns a SourceFrame constructed from a Dask dataframe.

Parameters:
  • df (dask.dataframe.DataFrame or list) – a Dask dataframe to convert to a SourceFrame

  • ensemble (tape.ensemble.Ensemble, optional) – A link to the Ensemble object that owns this frame.

Returns:

result – The constructed SourceFrame object.

Return type:

tape.SourceFrame

class TapeFrame(data=None, index: pandas._typing.Axes | None = None, columns: pandas._typing.Axes | None = None, dtype: pandas._typing.Dtype | None = None, copy: bool | None = None)[source]#

Bases: pandas.DataFrame

A barebones extension of a Pandas frame to be used for underlying Ensemble data.

See https://pandas.pydata.org/docs/development/extending.html#subclassing-pandas-data-structures

property _constructor#

Used when a manipulation result has the same dimensions as the original.
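The role of _constructor follows the pandas subclassing pattern linked above. The sketch below uses a hypothetical subclass named MyFrame, not TapeFrame itself, to show that overriding _constructor makes same-dimension operations return the subclass instead of a plain DataFrame.

```python
import pandas as pd


class MyFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass used only for illustration."""

    @property
    def _constructor(self):
        # pandas calls this when an operation yields a result with the
        # same dimensions as the original (another 2-D frame).
        return MyFrame


frame = MyFrame({"a": [1, 2, 3]})
subset = frame[frame["a"] > 1]  # boolean filtering keeps the subclass type
```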

property _constructor_expanddim#
class TapeObjectFrame(data=None, index: pandas._typing.Axes | None = None, columns: pandas._typing.Axes | None = None, dtype: pandas._typing.Dtype | None = None, copy: bool | None = None)[source]#

Bases: TapeFrame

A barebones extension of a Pandas frame to be used for underlying Ensemble object data.

See https://pandas.pydata.org/docs/development/extending.html#subclassing-pandas-data-structures

property _constructor#

Used when a manipulation result has the same dimensions as the original.

property _constructor_expanddim#
class TapeSourceFrame(data=None, index: pandas._typing.Axes | None = None, columns: pandas._typing.Axes | None = None, dtype: pandas._typing.Dtype | None = None, copy: bool | None = None)[source]#

Bases: TapeFrame

A barebones extension of a Pandas frame to be used for underlying Ensemble source data

See https://pandas.pydata.org/docs/development/extending.html#subclassing-pandas-data-structures

property _constructor#

Used when a manipulation result has the same dimensions as the original.

property _constructor_expanddim#
class TapeSeries(data=None, index=None, dtype: pandas._typing.Dtype | None = None, name=None, copy: bool | None = None, fastpath: bool | pandas._libs.lib.NoDefault = lib.no_default)[source]#

Bases: pandas.Series

A barebones extension of a Pandas series to be used for underlying Ensemble data.

See https://pandas.pydata.org/docs/development/extending.html#subclassing-pandas-data-structures

property _constructor#

Used when a manipulation result has the same dimensions as the original.

property _constructor_sliced#
calc_stetson_J[source]#
calc_sf2[source]#
class TimeSeries(data=None)[source]#

Represent and analyze Rubin TimeSeries data

property time#

Time values stored as a Pandas Series

property flux#

Flux values stored as a Pandas Series

property flux_err#

Flux error values stored as a Pandas Series

property band#

Band labels stored as a Pandas Index

from_dict(data_dict, time_label='time', flux_label='flux', err_label='flux_err', band_label='band')[source]#

Build dataframe from a python dictionary

Parameters:
  • data_dict (dict) – Dictionary containing the data.

  • time_label (str) – Name for column containing time information.

  • flux_label (str) – Name for column containing signal (flux, magnitude, etc) information.

  • err_label (str) – Name for column containing error information.

  • band_label (str) – Name for column containing filter information.

dropna(**kwargs)[source]#

Handle NaN values, wrapper for pandas.DataFrame.dropna

from_dataframe(data, object_id, time_label='time', flux_label='flux', err_label='flux_err', band_label='band')[source]#

Loader function for inputting data from a dataframe.

Parameters:
  • data (pandas.DataFrame) – The data for the time series.

  • object_id (str) – The ID of the current object.

  • time_label (str) – Name for column containing time information.

  • flux_label (str) – Name for column containing signal (flux, magnitude, etc) information.

  • err_label (str) – Name for column containing error information.

  • band_label (str) – Name for column containing filter information.

_build_index(band)[source]#

Build pandas multiindex from band array

stetson_J(band=None)[source]#

Compute the stetsonJ statistic on data from one or several bands

Parameters:

band (str or list of str) – Single band descriptor, or list of such descriptors.

Returns:

stetsonJ – StetsonJ statistic for each of the input bands.

Return type:

dict

Note

If no value for band is passed, the function is executed on all available bands.

sf2(sf_method='basic', argument_container=None)[source]#

Compute the structure function squared statistic on data

Parameters:
  • bins (numpy.array or list) – Manually provided bins, if not provided then bins are computed using the method kwarg

  • band_to_calc (str or list of str) – Single band descriptor, or list of such descriptors.

  • method ('str') – The binning method to apply, choices of ‘size’; which seeks an even distribution of samples per bin using quantiles, ‘length’; which creates bins of equal length in time and ‘loglength’; which creates bins of equal length in log time.

  • sthresh ('int') – Target number of samples per bin.

Returns:

stetsonJ – Structure function squared statistic for each of the input bands.

Return type:

dict

Note

If no value for band_to_calc is passed, the function is executed on all available bands.
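The three binning methods named above ('size', 'length', 'loglength') can be sketched as different ways of placing bin edges over a set of time lags. This is a simplified standard-library illustration, not the library's binning code: the function bin_edges is hypothetical, and it takes a number of bins (n_bins) rather than the documented sthresh target of samples per bin.

```python
import math


def bin_edges(time_lags, n_bins, method="size"):
    """Return n_bins + 1 bin edges over positive time lags."""
    lags = sorted(time_lags)
    low, high = lags[0], lags[-1]

    if method == "length":
        # Bins of equal width in time.
        step = (high - low) / n_bins
        return [low + i * step for i in range(n_bins + 1)]

    if method == "loglength":
        # Bins of equal width in log10(time); requires positive lags.
        log_low, log_high = math.log10(low), math.log10(high)
        step = (log_high - log_low) / n_bins
        return [10 ** (log_low + i * step) for i in range(n_bins + 1)]

    if method == "size":
        # Edges at evenly spaced ranks, so each bin holds a similar
        # number of samples (a simple stand-in for quantiles).
        last = len(lags) - 1
        return [lags[round(i * last / n_bins)] for i in range(n_bins + 1)]

    raise ValueError(f"unknown method: {method!r}")


lags = [1, 10, 100, 1000]
length_edges = bin_edges(lags, 3, method="length")
log_edges = bin_edges(lags, 3, method="loglength")
size_edges = bin_edges(lags, 3, method="size")
```

For lags spanning several orders of magnitude, equal-length bins place almost all samples in the first bin, while the log-length and size-based edges follow the data more closely.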