tape.ensemble_readers#

The following package-level methods can be used to create a new Ensemble object by reading in the given data source.

Functions#

read_ensemble(dirpath[, additional_frames, ...])

Load an ensemble from an on-disk ensemble.

read_pandas_dataframe(source_frame[, object_frame, ...])

Read in Pandas dataframe(s) and return an ensemble object

read_dask_dataframe(source_frame[, object_frame, ...])

Read in Dask dataframe(s) and return an ensemble object

read_parquet(source_file[, object_file, ...])

Read in parquet file(s) into an ensemble object

read_lsdb(source_catalog[, object_catalog, ...])

Read in from LSDB catalog objects.

read_hipscat(source_path[, object_path, ...])

Use LSDB to read from a hipscat directory.

read_source_dict(source_dict[, column_mapper, ...])

Load the sources into an ensemble from a dictionary.

read_dataset(dataset[, dask_client])

Load the ensemble from a TAPE dataset.

Module Contents#

read_ensemble(dirpath, additional_frames=True, column_mapper=None, dask_client=True, **kwargs)[source]#

Load an ensemble from an on-disk ensemble.

Parameters:
  • dirpath ('str' or path-like, optional) – A path to the top-level ensemble directory to load from.

  • additional_frames (bool, or list, optional) – Controls whether EnsembleFrames beyond the Object and Source Frames are loaded from disk. If True or False, this specifies whether all or none of the additional frames are loaded. Alternatively, a list of EnsembleFrame names may be provided to specify which frames should be loaded. Object and Source will always be added and do not need to be specified in the list. By default, all frames will be loaded.

  • column_mapper (Tape.ColumnMapper object, or None, optional) – Supplies a ColumnMapper to the Ensemble, if None (default) searches for a column_mapper.npy file in the directory, which should be created when the ensemble is saved.

  • dask_client (dask.distributed.client or bool, optional) – Accepts an existing dask.distributed.Client, or creates one if client=True, passing any additional kwargs to a dask.distributed.Client constructor call. If client=False, the Ensemble is created without a distributed client.

Returns:

ensemble – An ensemble object.

Return type:

tape.ensemble.Ensemble

read_pandas_dataframe(source_frame, object_frame=None, dask_client=True, column_mapper=None, sync_tables=True, npartitions=None, partition_size=None, **kwargs)[source]#

Read in Pandas dataframe(s) and return an ensemble object

Parameters:
  • source_frame ('pandas.Dataframe') – A Dask dataframe that contains source information to be read into the ensemble

  • object_frame ('pandas.Dataframe', optional) – If not specified, the object frame is generated from the source frame

  • dask_client (dask.distributed.client or bool, optional) – Accepts an existing dask.distributed.Client, or creates one if client=True, passing any additional kwargs to a dask.distributed.Client constructor call. If client=False, the Ensemble is created without a distributed client.

  • column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.

  • sync_tables ('bool', optional) – In the case where an `object_frame`is provided, determines whether an initial sync is performed between the object and source tables. If not performed, dynamic information like the number of observations may be out of date until a sync is performed internally.

  • npartitions (int, optional) – If specified, attempts to repartition the ensemble to the specified number of partitions

  • partition_size (int, optional) – If specified, attempts to repartition the ensemble to partitions of size partition_size.

Returns:

ensemble – The ensemble object with the Dask dataframe data loaded.

Return type:

tape.ensemble.Ensemble

read_dask_dataframe(source_frame, object_frame=None, dask_client=True, column_mapper=None, sync_tables=True, npartitions=None, partition_size=None, **kwargs)[source]#

Read in Dask dataframe(s) and return an ensemble object

Parameters:
  • source_frame ('dask.Dataframe') – A Dask dataframe that contains source information to be read into the ensemble

  • object_frame ('dask.Dataframe', optional) – If not specified, the object frame is generated from the source frame

  • dask_client (dask.distributed.client or bool, optional) – Accepts an existing dask.distributed.Client, or creates one if client=True, passing any additional kwargs to a dask.distributed.Client constructor call. If client=False, the Ensemble is created without a distributed client.

  • column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.

  • sync_tables ('bool', optional) – In the case where an `object_frame`is provided, determines whether an initial sync is performed between the object and source tables. If not performed, dynamic information like the number of observations may be out of date until a sync is performed internally.

  • npartitions (int, optional) – If specified, attempts to repartition the ensemble to the specified number of partitions

  • partition_size (int, optional) – If specified, attempts to repartition the ensemble to partitions of size partition_size.

Returns:

ensemble – The ensemble object with the Dask dataframe data loaded.

Return type:

tape.ensemble.Ensemble

read_parquet(source_file, object_file=None, column_mapper=None, dask_client=True, sync_tables=True, additional_cols=True, npartitions=None, partition_size=None, **kwargs)[source]#

Read in parquet file(s) into an ensemble object

Parameters:
  • source_file ('str') – Path to a parquet file, or multiple parquet files that contain source information to be read into the ensemble

  • object_file ('str') – Path to a parquet file, or multiple parquet files that contain object information. If not specified, it is generated from the source table

  • column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.

  • dask_client (dask.distributed.client or bool, optional) – Accepts an existing dask.distributed.Client, or creates one if client=True, passing any additional kwargs to a dask.distributed.Client constructor call. If client=False, the Ensemble is created without a distributed client.

  • sync_tables ('bool', optional) – In the case where object files are loaded in, determines whether an initial sync is performed between the object and source tables. If not performed, dynamic information like the number of observations may be out of date until a sync is performed internally.

  • additional_cols ('bool', optional) – Boolean to indicate whether to carry in columns beyond the critical columns, true will, while false will only load the columns containing the critical quantities (id,time,flux,err,band)

  • npartitions (int, optional) – If specified, attempts to repartition the ensemble to the specified number of partitions

  • partition_size (int, optional) – If specified, attempts to repartition the ensemble to partitions of size partition_size.

Returns:

ensemble – The ensemble object with parquet data loaded

Return type:

tape.ensemble.Ensemble

read_lsdb(source_catalog, object_catalog=None, column_mapper=None, sync_tables=False, sorted=True, sort=False, dask_client=True, **kwargs)[source]#

Read in from LSDB catalog objects.

Parameters:
  • source_catalog ('dask.Dataframe') – An LSDB catalog that contains source information to be read into the ensemble.

  • object_catalog ('dask.Dataframe', optional) – An LSDB catalog containing object information. If not specified, a minimal ObjectFrame is generated from the source catalog.

  • column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.

  • sync_tables ('bool', optional) – In the case where an `object_catalog`is provided, determines whether an initial sync is performed between the object and source tables.

  • sorted (bool, optional) – If the index column is already sorted in increasing order. Defaults to True.

  • sort (bool, optional) – If True, sorts the DataFrame by the id column. Otherwise set the index on the individual existing partitions. Defaults to False.

  • dask_client (dask.distributed.client or bool, optional) – Accepts an existing dask.distributed.Client, or creates one if client=True, passing any additional kwargs to a dask.distributed.Client constructor call. If client=False, the Ensemble is created without a distributed client.

Returns:

ensemble – The ensemble object with the LSDB catalog data loaded.

Return type:

tape.ensemble.Ensemble

read_hipscat(source_path, object_path=None, column_mapper=None, source_index=None, object_index=None, sorted=True, sort=False, dask_client=True, **kwargs)[source]#

Use LSDB to read from a hipscat directory.

This function utilizes LSDB for reading a hipscat directory into TAPE. In cases where a user would like to do operations on the LSDB catalog objects, it’s best to use LSDB itself first, and then load the result into TAPE using tape.Ensemble.from_lsdb. A join is performed between the two tables to modify the source table to use the object index, using object_index and source_index.

Parameters:
  • source_path (str or Path) – A hipscat directory that contains source information to be read into the ensemble.

  • object_path (str or Path, optional) – A hipscat directory containing object information. If not specified, a minimal ObjectFrame is generated from the sources.

  • column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.

  • object_index ('str', optional) – The join index of the object table, should be the label for the object ID contained in the object table.

  • source_index ('str', optional) – The join index of the source table, should be the label for the object ID contained in the source table.

  • sorted (bool, optional) – If the index column is already sorted in increasing order. Defaults to True.

  • sort (bool, optional) – If True, sorts the DataFrame by the id column. Otherwise set the index on the individual existing partitions. Defaults to False.

  • dask_client (dask.distributed.client or bool, optional) – Accepts an existing dask.distributed.Client, or creates one if client=True, passing any additional kwargs to a dask.distributed.Client constructor call. If client=False, the Ensemble is created without a distributed client.

Returns:

ensemble – The ensemble object with the hipscat data loaded.

Return type:

tape.ensemble.Ensemble

read_source_dict(source_dict, column_mapper=None, npartitions=1, dask_client=True, **kwargs)[source]#

Load the sources into an ensemble from a dictionary.

Parameters:
  • source_dict ('dict') – The dictionary containing the source information.

  • column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.

  • npartitions (int, optional) – If specified, attempts to repartition the ensemble to the specified number of partitions

  • dask_client (dask.distributed.client or bool, optional) – Accepts an existing dask.distributed.Client, or creates one if client=True, passing any additional kwargs to a dask.distributed.Client constructor call. If client=False, the Ensemble is created without a distributed client.

Returns:

ensemble – The ensemble object with dictionary data loaded

Return type:

tape.ensemble.Ensemble

read_dataset(dataset, dask_client=True, **kwargs)[source]#

Load the ensemble from a TAPE dataset.

Parameters:
  • dataset ('str') – The name of the dataset to import

  • dask_client (dask.distributed.client or bool, optional) – Accepts an existing dask.distributed.Client, or creates one if client=True, passing any additional kwargs to a dask.distributed.Client constructor call. If client=False, the Ensemble is created without a distributed client.

Returns:

ensemble – The ensemble object with the dataset loaded

Return type:

tape.ensemble.Ensemble