tape.ensemble_readers
The following package-level methods can be used to create a new Ensemble object by reading in the given data source.
Functions
read_ensemble – Load an ensemble from an on-disk ensemble.
read_pandas_dataframe – Read in Pandas dataframe(s) and return an ensemble object.
read_dask_dataframe – Read in Dask dataframe(s) and return an ensemble object.
read_parquet – Read in parquet file(s) into an ensemble object.
read_lsdb – Read in from LSDB catalog objects.
read_hipscat – Use LSDB to read from a hipscat directory.
read_source_dict – Load the sources into an ensemble from a dictionary.
read_dataset – Load the ensemble from a TAPE dataset.
Module Contents
- read_ensemble(dirpath, additional_frames=True, column_mapper=None, dask_client=True, **kwargs)
Load an ensemble from an on-disk ensemble.
- Parameters:
dirpath ('str' or path-like) – A path to the top-level ensemble directory to load from.
additional_frames (bool, or list, optional) – Controls whether EnsembleFrames beyond the Object and Source Frames are loaded from disk. If True or False, this specifies whether all or none of the additional frames are loaded. Alternatively, a list of EnsembleFrame names may be provided to specify which frames should be loaded. Object and Source will always be added and do not need to be specified in the list. By default, all frames will be loaded.
column_mapper (tape.ColumnMapper object, or None, optional) – Supplies a ColumnMapper to the Ensemble. If None (default), the reader searches for a column_mapper.npy file in the directory, which should have been created when the ensemble was saved.
dask_client (dask.distributed.Client or bool, optional) – Accepts an existing dask.distributed.Client, or creates one if dask_client=True, passing any additional kwargs to the dask.distributed.Client constructor call. If dask_client=False, the Ensemble is created without a distributed client.
- Returns:
ensemble – An ensemble object.
- Return type:
tape.ensemble.Ensemble
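The three accepted forms of `additional_frames` can be illustrated with a small stand-alone sketch. The helper `frames_to_load` and the labels `"object"`, `"source"`, `"result_1"` and `"result_2"` are hypothetical and not part of TAPE; the code only mirrors the selection rule described in the parameter list and does not reproduce the library's implementation.

```python
def frames_to_load(available, additional_frames=True):
    """Return the frame labels selected by the rule described above.

    Hypothetical illustration only: `available` is the list of frame labels
    found on disk, and "object" / "source" are placeholder labels for the
    two frames that are always loaded.
    """
    core = ["object", "source"]
    extras = [name for name in available if name not in core]

    if additional_frames is True:
        # Load every additional frame found on disk.
        return core + extras
    if additional_frames is False:
        # Load only the Object and Source frames.
        return core
    # Otherwise treat the argument as a list of frame names to load
    # in addition to the two core frames.
    requested = set(additional_frames)
    return core + [name for name in extras if name in requested]


on_disk = ["object", "source", "result_1", "result_2"]
all_frames = frames_to_load(on_disk, True)
core_only = frames_to_load(on_disk, False)
selected = frames_to_load(on_disk, ["result_2"])
```

Note that the list form never needs to name the Object and Source frames, since they are always included.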
- read_pandas_dataframe(source_frame, object_frame=None, dask_client=True, column_mapper=None, sync_tables=True, npartitions=None, partition_size=None, **kwargs)
Read in Pandas dataframe(s) and return an ensemble object
- Parameters:
source_frame ('pandas.DataFrame') – A Pandas dataframe that contains source information to be read into the ensemble.
object_frame ('pandas.DataFrame', optional) – If not specified, the object frame is generated from the source frame.
dask_client (dask.distributed.Client or bool, optional) – Accepts an existing dask.distributed.Client, or creates one if dask_client=True, passing any additional kwargs to the dask.distributed.Client constructor call. If dask_client=False, the Ensemble is created without a distributed client.
column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.
sync_tables ('bool', optional) – In the case where an `object_frame` is provided, determines whether an initial sync is performed between the object and source tables. If not performed, dynamic information like the number of observations may be out of date until a sync is performed internally.
npartitions (int, optional) – If specified, attempts to repartition the ensemble to the specified number of partitions
partition_size (int, optional) – If specified, attempts to repartition the ensemble to partitions of size partition_size.
- Returns:
ensemble – The ensemble object with the Pandas dataframe data loaded.
- Return type:
tape.ensemble.Ensemble
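A minimal sketch of a call to `read_pandas_dataframe` follows. The column names (`object_id`, `mjd`, `flux`, `flux_err`, `band`), the sample values, and the helper `load_from_pandas` are hypothetical. The sketch also assumes that `ColumnMapper` can be imported from `tape.utils` and accepts the keyword arguments `id_col`, `time_col`, `flux_col`, `err_col` and `band_col`; check these against the installed TAPE version.

```python
# Hypothetical photometry rows: two objects with two observations each.
rows = [
    {"object_id": 1, "mjd": 59000.0, "flux": 10.2, "flux_err": 0.3, "band": "g"},
    {"object_id": 1, "mjd": 59001.0, "flux": 10.8, "flux_err": 0.3, "band": "r"},
    {"object_id": 2, "mjd": 59000.5, "flux": 4.1, "flux_err": 0.2, "band": "g"},
    {"object_id": 2, "mjd": 59002.0, "flux": 4.4, "flux_err": 0.2, "band": "r"},
]


def load_from_pandas(rows):
    """Build a pandas DataFrame from `rows` and load it into an Ensemble."""
    # Imported lazily so the sample data above can be inspected
    # without pandas or TAPE installed.
    import pandas as pd
    from tape.ensemble_readers import read_pandas_dataframe
    from tape.utils import ColumnMapper  # assumed import path

    source_frame = pd.DataFrame(rows)

    # Tell TAPE which columns hold the critical quantities (assumed keywords).
    colmap = ColumnMapper(
        id_col="object_id",
        time_col="mjd",
        flux_col="flux",
        err_col="flux_err",
        band_col="band",
    )

    # No object_frame is passed, so the object frame is generated from the
    # source frame. dask_client=False skips creating a distributed client.
    return read_pandas_dataframe(
        source_frame,
        column_mapper=colmap,
        dask_client=False,
        npartitions=1,
    )
```

Calling `load_from_pandas(rows)` should return a `tape.ensemble.Ensemble`. Passing `dask_client=True` instead would create a distributed client as described above.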
- read_dask_dataframe(source_frame, object_frame=None, dask_client=True, column_mapper=None, sync_tables=True, npartitions=None, partition_size=None, **kwargs)
Read in Dask dataframe(s) and return an ensemble object
- Parameters:
source_frame ('dask.dataframe.DataFrame') – A Dask dataframe that contains source information to be read into the ensemble.
object_frame ('dask.dataframe.DataFrame', optional) – If not specified, the object frame is generated from the source frame.
dask_client (dask.distributed.Client or bool, optional) – Accepts an existing dask.distributed.Client, or creates one if dask_client=True, passing any additional kwargs to the dask.distributed.Client constructor call. If dask_client=False, the Ensemble is created without a distributed client.
column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.
sync_tables ('bool', optional) – In the case where an `object_frame` is provided, determines whether an initial sync is performed between the object and source tables. If not performed, dynamic information like the number of observations may be out of date until a sync is performed internally.
npartitions (int, optional) – If specified, attempts to repartition the ensemble to the specified number of partitions
partition_size (int, optional) – If specified, attempts to repartition the ensemble to partitions of size partition_size.
- Returns:
ensemble – The ensemble object with the Dask dataframe data loaded.
- Return type:
tape.ensemble.Ensemble
- read_parquet(source_file, object_file=None, column_mapper=None, dask_client=True, sync_tables=True, additional_cols=True, npartitions=None, partition_size=None, **kwargs)
Read in parquet file(s) into an ensemble object
- Parameters:
source_file ('str') – Path to a parquet file, or multiple parquet files that contain source information to be read into the ensemble
object_file ('str', optional) – Path to a parquet file, or multiple parquet files that contain object information. If not specified, the object table is generated from the source table.
column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.
dask_client (dask.distributed.Client or bool, optional) – Accepts an existing dask.distributed.Client, or creates one if dask_client=True, passing any additional kwargs to the dask.distributed.Client constructor call. If dask_client=False, the Ensemble is created without a distributed client.
sync_tables ('bool', optional) – In the case where object files are loaded in, determines whether an initial sync is performed between the object and source tables. If not performed, dynamic information like the number of observations may be out of date until a sync is performed internally.
additional_cols ('bool', optional) – Whether to carry in columns beyond the critical columns. If True, all columns are loaded. If False, only the columns containing the critical quantities (id, time, flux, err, band) are loaded.
npartitions (int, optional) – If specified, attempts to repartition the ensemble to the specified number of partitions
partition_size (int, optional) – If specified, attempts to repartition the ensemble to partitions of size partition_size.
- Returns:
ensemble – The ensemble object with parquet data loaded
- Return type:
tape.ensemble.Ensemble
- read_lsdb(source_catalog, object_catalog=None, column_mapper=None, sync_tables=False, sorted=True, sort=False, dask_client=True, **kwargs)
Read in from LSDB catalog objects.
- Parameters:
source_catalog ('dask.Dataframe') – An LSDB catalog that contains source information to be read into the ensemble.
object_catalog ('dask.Dataframe', optional) – An LSDB catalog containing object information. If not specified, a minimal ObjectFrame is generated from the source catalog.
column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.
sync_tables ('bool', optional) – In the case where an `object_catalog` is provided, determines whether an initial sync is performed between the object and source tables.
sorted (bool, optional) – Whether the index column is already sorted in increasing order. Defaults to True.
sort (bool, optional) – If True, sorts the DataFrame by the id column. Otherwise, sets the index on the individual existing partitions. Defaults to False.
dask_client (dask.distributed.Client or bool, optional) – Accepts an existing dask.distributed.Client, or creates one if dask_client=True, passing any additional kwargs to the dask.distributed.Client constructor call. If dask_client=False, the Ensemble is created without a distributed client.
- Returns:
ensemble – The ensemble object with the LSDB catalog data loaded.
- Return type:
tape.ensemble.Ensemble
- read_hipscat(source_path, object_path=None, column_mapper=None, source_index=None, object_index=None, sorted=True, sort=False, dask_client=True, **kwargs)
Use LSDB to read from a hipscat directory.
This function utilizes LSDB for reading a hipscat directory into TAPE. In cases where a user would like to do operations on the LSDB catalog objects, it’s best to use LSDB itself first, and then load the result into TAPE using tape.Ensemble.from_lsdb. A join is performed between the two tables to modify the source table to use the object index, using object_index and source_index.
- Parameters:
source_path (str or Path) – A hipscat directory that contains source information to be read into the ensemble.
object_path (str or Path, optional) – A hipscat directory containing object information. If not specified, a minimal ObjectFrame is generated from the sources.
column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.
object_index ('str', optional) – The join index of the object table. This should be the label for the object ID contained in the object table.
source_index ('str', optional) – The join index of the source table. This should be the label for the object ID contained in the source table.
sorted (bool, optional) – Whether the index column is already sorted in increasing order. Defaults to True.
sort (bool, optional) – If True, sorts the DataFrame by the id column. Otherwise, sets the index on the individual existing partitions. Defaults to False.
dask_client (dask.distributed.Client or bool, optional) – Accepts an existing dask.distributed.Client, or creates one if dask_client=True, passing any additional kwargs to the dask.distributed.Client constructor call. If dask_client=False, the Ensemble is created without a distributed client.
- Returns:
ensemble – The ensemble object with the hipscat data loaded.
- Return type:
tape.ensemble.Ensemble
- read_source_dict(source_dict, column_mapper=None, npartitions=1, dask_client=True, **kwargs)
Load the sources into an ensemble from a dictionary.
- Parameters:
source_dict ('dict') – The dictionary containing the source information.
column_mapper ('ColumnMapper' object) – If provided, the ColumnMapper is used to populate relevant column information mapped from the input dataset.
npartitions (int, optional) – If specified, attempts to repartition the ensemble to the specified number of partitions
dask_client (dask.distributed.Client or bool, optional) – Accepts an existing dask.distributed.Client, or creates one if dask_client=True, passing any additional kwargs to the dask.distributed.Client constructor call. If dask_client=False, the Ensemble is created without a distributed client.
- Returns:
ensemble – The ensemble object with dictionary data loaded
- Return type:
tape.ensemble.Ensemble
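As a rough sketch of `read_source_dict`, the example below assumes the dictionary is column-oriented, with one equal-length list per column. The column names, the sample values, and the helper `load_from_dict` are hypothetical, and the `ColumnMapper` import path and keyword arguments are assumptions to verify against the installed TAPE version.

```python
# Hypothetical column-oriented source dictionary: one list per column.
source_dict = {
    "object_id": [1, 1, 2, 2],
    "mjd": [59000.0, 59001.0, 59000.5, 59002.0],
    "flux": [10.2, 10.8, 4.1, 4.4],
    "flux_err": [0.3, 0.3, 0.2, 0.2],
    "band": ["g", "r", "g", "r"],
}

# Every column should describe the same number of observations.
column_lengths = {len(values) for values in source_dict.values()}


def load_from_dict(source_dict, dask_client=False):
    """Load `source_dict` into an Ensemble.

    `dask_client` is forwarded unchanged, so it may be False, True,
    or an existing dask.distributed.Client.
    """
    # Imported lazily so the sample data can be inspected without TAPE installed.
    from tape.ensemble_readers import read_source_dict
    from tape.utils import ColumnMapper  # assumed import path

    # Map the hypothetical column names onto the critical quantities
    # (assumed keyword names).
    colmap = ColumnMapper(
        id_col="object_id",
        time_col="mjd",
        flux_col="flux",
        err_col="flux_err",
        band_col="band",
    )

    return read_source_dict(
        source_dict,
        column_mapper=colmap,
        npartitions=1,
        dask_client=dask_client,
    )
```

To reuse a cluster that is already running, an existing `dask.distributed.Client` instance can be passed as `dask_client` instead of a boolean.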
- read_dataset(dataset, dask_client=True, **kwargs)
Load the ensemble from a TAPE dataset.
- Parameters:
dataset ('str') – The name of the dataset to import
dask_client (dask.distributed.Client or bool, optional) – Accepts an existing dask.distributed.Client, or creates one if dask_client=True, passing any additional kwargs to the dask.distributed.Client constructor call. If dask_client=False, the Ensemble is created without a distributed client.
- Returns:
ensemble – The ensemble object with the dataset loaded
- Return type:
tape.ensemble.Ensemble