tape.ensemble_readers
=====================

.. py:module:: tape.ensemble_readers

.. autoapi-nested-parse::

   The following package-level methods can be used to create a new Ensemble object
   by reading in the given data source.



Functions
---------

.. autoapisummary::

   tape.ensemble_readers.read_ensemble
   tape.ensemble_readers.read_pandas_dataframe
   tape.ensemble_readers.read_dask_dataframe
   tape.ensemble_readers.read_parquet
   tape.ensemble_readers.read_lsdb
   tape.ensemble_readers.read_hipscat
   tape.ensemble_readers.read_source_dict
   tape.ensemble_readers.read_dataset


Module Contents
---------------

.. py:function:: read_ensemble(dirpath, additional_frames=True, column_mapper=None, dask_client=True, **kwargs)

   Load an ensemble from an ensemble directory saved on disk.

   :param dirpath: A path to the top-level ensemble directory to load from.
   :type dirpath: 'str' or path-like
   :param additional_frames: Controls whether EnsembleFrames beyond the Object and Source Frames
                             are loaded from disk. If True or False, this specifies whether all
                             or none of the additional frames are loaded. Alternatively, a list
                             of EnsembleFrame names may be provided to specify which frames
                             should be loaded. Object and Source will always be added and do not
                             need to be specified in the list. By default, all frames will be
                             loaded.
   :type additional_frames: bool, or list, optional
   :param column_mapper: A ColumnMapper to supply to the Ensemble. If None (default), the
                         reader searches the directory for a column_mapper.npy file, which
                         should have been created when the ensemble was saved.
   :type column_mapper: tape.ColumnMapper object, or None, optional
   :param dask_client: Accepts an existing `dask.distributed.Client`, or creates one if
                       `dask_client=True`, passing any additional kwargs to the
                       `dask.distributed.Client` constructor. If `dask_client=False`, the
                       Ensemble is created without a distributed client.
   :type dask_client: `dask.distributed.Client` or `bool`, optional

   :returns: **ensemble** -- An ensemble object.
   :rtype: `tape.ensemble.Ensemble`
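
The handling of `additional_frames` described for `read_ensemble` can be pictured with a small standalone sketch. It is not TAPE's implementation: the helper `frames_to_load` and the frame labels (`"object"`, `"source"`, `"result"`, `"features"`) are hypothetical and only illustrate the True, False, and list cases.

```python
def frames_to_load(available, additional_frames=True):
    """Return the frame labels that would be loaded (illustrative only)."""
    required = ["object", "source"]  # always loaded, whatever the argument says
    extras = [name for name in available if name not in required]

    if additional_frames is True:
        chosen = extras  # load every additional frame
    elif additional_frames is False:
        chosen = []  # load none of the additional frames
    else:
        # A list of names: keep only the additional frames that were requested.
        requested = set(additional_frames)
        chosen = [name for name in extras if name in requested]

    return required + chosen


on_disk = ["object", "source", "result", "features"]  # hypothetical labels
print(frames_to_load(on_disk))
print(frames_to_load(on_disk, False))
print(frames_to_load(on_disk, ["features"]))
```

Object and Source appear in every result, which is why they never need to be listed explicitly.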


.. py:function:: read_pandas_dataframe(source_frame, object_frame=None, dask_client=True, column_mapper=None, sync_tables=True, npartitions=None, partition_size=None, **kwargs)

   Read in Pandas dataframe(s) and return an ensemble object

   :param source_frame: A Pandas dataframe that contains source information to be read into the ensemble.
   :type source_frame: 'pandas.DataFrame'
   :param object_frame: If not specified, the object frame is generated from the source frame.
   :type object_frame: 'pandas.DataFrame', optional
   :param dask_client: Accepts an existing `dask.distributed.Client`, or creates one if
                       `dask_client=True`, passing any additional kwargs to the
                       `dask.distributed.Client` constructor. If `dask_client=False`, the
                       Ensemble is created without a distributed client.
   :type dask_client: `dask.distributed.Client` or `bool`, optional
   :param column_mapper: If provided, the ColumnMapper is used to populate relevant column
                         information mapped from the input dataset.
   :type column_mapper: 'ColumnMapper' object
   :param sync_tables: In the case where an `object_frame` is provided, determines whether an
                       initial sync is performed between the object and source tables. If
                       not performed, dynamic information like the number of observations
                       may be out of date until a sync is performed internally.
   :type sync_tables: 'bool', optional
   :param npartitions: If specified, attempts to repartition the ensemble to the specified
                       number of partitions
   :type npartitions: `int`, optional
   :param partition_size: If specified, attempts to repartition the ensemble to partitions
                          of size `partition_size`.
   :type partition_size: `int`, optional

   :returns: **ensemble** -- The ensemble object with the Pandas dataframe data loaded.
   :rtype: `tape.ensemble.Ensemble`
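
A minimal usage sketch for `read_pandas_dataframe` follows. The column names and values are made up for illustration, and the `ColumnMapper` keyword arguments (`id_col`, `time_col`, `flux_col`, `err_col`, `band_col`) are assumed rather than taken from this page, so check them against the installed TAPE version. The imports sit inside the function so that the snippet can be defined without TAPE installed.

```python
# Hypothetical column names and toy values; adapt them to your own data.
COLUMN_NAMES = {
    "id_col": "object_id",
    "time_col": "mjd",
    "flux_col": "flux",
    "err_col": "flux_err",
    "band_col": "band",
}
ROWS = {
    "object_id": [1, 1, 2, 2],
    "mjd": [59000.1, 59001.1, 59000.2, 59002.2],
    "flux": [10.2, 10.5, 3.3, 3.1],
    "flux_err": [0.1, 0.1, 0.2, 0.2],
    "band": ["g", "r", "g", "r"],
}


def load_example_ensemble():
    """Build an Ensemble from an in-memory Pandas dataframe (requires TAPE)."""
    import pandas as pd
    from tape import ColumnMapper  # assumed import path
    from tape.ensemble_readers import read_pandas_dataframe

    source_frame = pd.DataFrame(ROWS)
    column_mapper = ColumnMapper(**COLUMN_NAMES)  # assumed keyword names
    # No object_frame is passed, so the object table is generated from the
    # source frame. dask_client=False skips creating a distributed client.
    return read_pandas_dataframe(
        source_frame,
        column_mapper=column_mapper,
        dask_client=False,
        npartitions=1,
    )
```

Calling `load_example_ensemble()` should return a `tape.ensemble.Ensemble` when TAPE and pandas are available.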


.. py:function:: read_dask_dataframe(source_frame, object_frame=None, dask_client=True, column_mapper=None, sync_tables=True, npartitions=None, partition_size=None, **kwargs)

   Read in Dask dataframe(s) and return an ensemble object

   :param source_frame: A Dask dataframe that contains source information to be read into the ensemble
   :type source_frame: 'dask.dataframe.DataFrame'
   :param object_frame: If not specified, the object frame is generated from the source frame
   :type object_frame: 'dask.dataframe.DataFrame', optional
   :param dask_client: Accepts an existing `dask.distributed.Client`, or creates one if
                       `dask_client=True`, passing any additional kwargs to the
                       `dask.distributed.Client` constructor. If `dask_client=False`, the
                       Ensemble is created without a distributed client.
   :type dask_client: `dask.distributed.Client` or `bool`, optional
   :param column_mapper: If provided, the ColumnMapper is used to populate relevant column
                         information mapped from the input dataset.
   :type column_mapper: 'ColumnMapper' object
   :param sync_tables: In the case where an `object_frame` is provided, determines whether an
                       initial sync is performed between the object and source tables. If
                       not performed, dynamic information like the number of observations
                       may be out of date until a sync is performed internally.
   :type sync_tables: 'bool', optional
   :param npartitions: If specified, attempts to repartition the ensemble to the specified
                       number of partitions
   :type npartitions: `int`, optional
   :param partition_size: If specified, attempts to repartition the ensemble to partitions
                          of size `partition_size`.
   :type partition_size: `int`, optional

   :returns: **ensemble** -- The ensemble object with the Dask dataframe data loaded.
   :rtype: `tape.ensemble.Ensemble`
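
The `sync_tables` description mentions that per-object information such as the number of observations can go out of date. The standalone sketch below illustrates that idea with plain Python containers. It is a conceptual picture only, not how TAPE builds or syncs its tables; `build_object_table` and the `nobs` field are hypothetical names.

```python
from collections import Counter


def build_object_table(source_rows, id_key="object_id"):
    """Derive one object entry per unique id, with an observation count."""
    counts = Counter(row[id_key] for row in source_rows)
    return {obj_id: {"nobs": n} for obj_id, n in sorted(counts.items())}


source_rows = [
    {"object_id": 1, "mjd": 59000.1, "flux": 10.2},
    {"object_id": 1, "mjd": 59001.1, "flux": 10.5},
    {"object_id": 2, "mjd": 59000.2, "flux": 3.3},
]
object_table = build_object_table(source_rows)

# Dropping a source row leaves the stored count stale until the object
# table is rebuilt ("synced") from the current sources.
source_rows = source_rows[1:]
stale = object_table[1]["nobs"]
fresh = build_object_table(source_rows)[1]["nobs"]
print(stale, fresh)
```

Here the stale count still reflects the original two observations of object 1, while the rebuilt table reflects the single remaining one.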


.. py:function:: read_parquet(source_file, object_file=None, column_mapper=None, dask_client=True, sync_tables=True, additional_cols=True, npartitions=None, partition_size=None, **kwargs)

   Read in parquet file(s) into an ensemble object

   :param source_file: Path to a parquet file, or multiple parquet files that contain
                       source information to be read into the ensemble
   :type source_file: 'str'
   :param object_file: Path to a parquet file, or multiple parquet files that contain
                       object information. If not specified, it is generated from the
                       source table.
   :type object_file: 'str', optional
   :param column_mapper: If provided, the ColumnMapper is used to populate relevant column
                         information mapped from the input dataset.
   :type column_mapper: 'ColumnMapper' object
   :param dask_client: Accepts an existing `dask.distributed.Client`, or creates one if
                       `dask_client=True`, passing any additional kwargs to the
                       `dask.distributed.Client` constructor. If `dask_client=False`, the
                       Ensemble is created without a distributed client.
   :type dask_client: `dask.distributed.Client` or `bool`, optional
   :param sync_tables: In the case where object files are loaded in, determines whether an
                       initial sync is performed between the object and source tables. If
                       not performed, dynamic information like the number of observations
                       may be out of date until a sync is performed internally.
   :type sync_tables: 'bool', optional
   :param additional_cols: Whether to carry in columns beyond the critical columns. If True,
                           all columns are loaded; if False, only the columns containing the
                           critical quantities (id, time, flux, err, band) are loaded.
   :type additional_cols: 'bool', optional
   :param npartitions: If specified, attempts to repartition the ensemble to the specified
                       number of partitions
   :type npartitions: `int`, optional
   :param partition_size: If specified, attempts to repartition the ensemble to partitions
                          of size `partition_size`.
   :type partition_size: `int`, optional

   :returns: **ensemble** -- The ensemble object with parquet data loaded
   :rtype: `tape.ensemble.Ensemble`
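
The effect of `additional_cols` can be sketched as a column-selection step. The helper `columns_to_load` and the column names below are hypothetical, and TAPE's internal logic may differ; the sketch only shows the two outcomes described above.

```python
CRITICAL_KEYS = ("id", "time", "flux", "err", "band")


def columns_to_load(column_map, additional_cols=True):
    """Return None to mean "every column", or only the critical columns."""
    if additional_cols:
        return None
    return [column_map[key] for key in CRITICAL_KEYS]


# Hypothetical mapping from critical quantities to column names in the file.
column_map = {
    "id": "object_id",
    "time": "mjd",
    "flux": "psf_flux",
    "err": "psf_flux_err",
    "band": "filter",
}
print(columns_to_load(column_map, additional_cols=True))
print(columns_to_load(column_map, additional_cols=False))
```

A list produced this way could be passed as the `columns` argument of a parquet reader such as `pandas.read_parquet`, where `None` loads all columns.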


.. py:function:: read_lsdb(source_catalog, object_catalog=None, column_mapper=None, sync_tables=False, sorted=True, sort=False, dask_client=True, **kwargs)

   Read in from LSDB catalog objects.

   :param source_catalog: An LSDB catalog that contains source information to be read into
                          the ensemble.
   :type source_catalog: 'lsdb.Catalog'
   :param object_catalog: An LSDB catalog containing object information. If not specified,
                          a minimal ObjectFrame is generated from the source catalog.
   :type object_catalog: 'lsdb.Catalog', optional
   :param column_mapper: If provided, the ColumnMapper is used to populate relevant column
                         information mapped from the input dataset.
   :type column_mapper: 'ColumnMapper' object
   :param sync_tables: In the case where an `object_catalog` is provided, determines
                       whether an initial sync is performed between the object and source
                       tables.
   :type sync_tables: 'bool', optional
   :param sorted: Whether the index column is already sorted in increasing order.
                  Defaults to True.
   :type sorted: `bool`, optional
   :param sort: If True, sorts the DataFrame by the id column. Otherwise, sets the
                index on the individual existing partitions. Defaults to False.
   :type sort: `bool`, optional
   :param dask_client: Accepts an existing `dask.distributed.Client`, or creates one if
                       `dask_client=True`, passing any additional kwargs to the
                       `dask.distributed.Client` constructor. If `dask_client=False`, the
                       Ensemble is created without a distributed client.
   :type dask_client: `dask.distributed.Client` or `bool`, optional

   :returns: **ensemble** -- The ensemble object with the LSDB catalog data loaded.
   :rtype: `tape.ensemble.Ensemble`


.. py:function:: read_hipscat(source_path, object_path=None, column_mapper=None, source_index=None, object_index=None, sorted=True, sort=False, dask_client=True, **kwargs)

   Use LSDB to read from a hipscat directory.

   This function utilizes LSDB for reading a hipscat directory into TAPE.
   In cases where a user would like to do operations on the LSDB catalog
   objects, it's best to use LSDB itself first, and then load the result
   into TAPE using `tape.Ensemble.from_lsdb`. A join is performed between
   the two tables to modify the source table to use the object index,
   using `object_index` and `source_index`.

   :param source_path: A hipscat directory that contains source information to be read
                       into the ensemble.
   :type source_path: str or Path
   :param object_path: A hipscat directory containing object information. If not
                       specified, a minimal ObjectFrame is generated from the sources.
   :type object_path: str or Path, optional
   :param column_mapper: If provided, the ColumnMapper is used to populate relevant column
                         information mapped from the input dataset.
   :type column_mapper: 'ColumnMapper' object
   :param object_index: The join index of the object table, should be the label for the
                        object ID contained in the object table.
   :type object_index: 'str', optional
   :param source_index: The join index of the source table, should be the label for the
                        object ID contained in the source table.
   :type source_index: 'str', optional
   :param sorted: Whether the index column is already sorted in increasing order.
                  Defaults to True.
   :type sorted: `bool`, optional
   :param sort: If True, sorts the DataFrame by the id column. Otherwise, sets the
                index on the individual existing partitions. Defaults to False.
   :type sort: `bool`, optional
   :param dask_client: Accepts an existing `dask.distributed.Client`, or creates one if
                       `dask_client=True`, passing any additional kwargs to the
                       `dask.distributed.Client` constructor. If `dask_client=False`, the
                       Ensemble is created without a distributed client.
   :type dask_client: `dask.distributed.Client` or `bool`, optional

   :returns: **ensemble** -- The ensemble object with the hipscat data loaded.
   :rtype: `tape.ensemble.Ensemble`
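
The join that `read_hipscat` performs between the object and source tables can be pictured with plain dictionaries. This is a conceptual sketch only: the helper `reindex_sources`, the column labels, and the index values are hypothetical, and the choice to drop unmatched sources (an inner join) is an assumption rather than documented TAPE behavior.

```python
def reindex_sources(object_rows, source_rows, object_index, source_index):
    """Pair each source row with the index of its matching object row."""
    # Map the object ID stored in the object table to that row's index value.
    id_to_index = {row[object_index]: idx for idx, row in object_rows.items()}

    reindexed = []
    for row in source_rows:
        obj_id = row[source_index]
        if obj_id in id_to_index:  # assumption: unmatched sources are dropped
            reindexed.append((id_to_index[obj_id], row))
    return reindexed


# Object table keyed by its own index; "objectId" is the join label.
object_rows = {1000: {"objectId": 7}, 2000: {"objectId": 9}}
# Source table where "objId" holds the object ID of each observation.
source_rows = [
    {"objId": 9, "flux": 1.0},
    {"objId": 7, "flux": 2.0},
    {"objId": 7, "flux": 2.5},
]

joined = reindex_sources(object_rows, source_rows, "objectId", "objId")
print([idx for idx, _ in joined])
```

After the join, every source row carries the index of its parent object, which is the role `object_index` and `source_index` play in the signature above.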


.. py:function:: read_source_dict(source_dict, column_mapper=None, npartitions=1, dask_client=True, **kwargs)

   Load the sources into an ensemble from a dictionary.

   :param source_dict: The dictionary containing the source information.
   :type source_dict: 'dict'
   :param column_mapper: If provided, the ColumnMapper is used to populate relevant column
                         information mapped from the input dataset.
   :type column_mapper: 'ColumnMapper' object
   :param npartitions: If specified, attempts to repartition the ensemble to the specified
                       number of partitions
   :type npartitions: `int`, optional
   :param dask_client: Accepts an existing `dask.distributed.Client`, or creates one if
                       `dask_client=True`, passing any additional kwargs to the
                       `dask.distributed.Client` constructor. If `dask_client=False`, the
                       Ensemble is created without a distributed client.
   :type dask_client: `dask.distributed.Client` or `bool`, optional

   :returns: **ensemble** -- The ensemble object with dictionary data loaded
   :rtype: `tape.ensemble.Ensemble`
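
A hedged sketch of preparing input for `read_source_dict` follows. It assumes the dictionary is column-oriented (one equal-length list per column), which this page does not state explicitly. The helper names, the column names, and the `ColumnMapper` keyword arguments are likewise assumptions made for illustration.

```python
def records_to_source_dict(records):
    """Turn per-observation records into a dict of equal-length columns.

    Assumes every record has the same keys.
    """
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def load_from_records(records):
    """Build an Ensemble from a list of records (requires TAPE)."""
    from tape import ColumnMapper  # assumed import path
    from tape.ensemble_readers import read_source_dict

    column_mapper = ColumnMapper(  # assumed keyword names
        id_col="id", time_col="time", flux_col="flux",
        err_col="error", band_col="band",
    )
    return read_source_dict(
        records_to_source_dict(records),
        column_mapper=column_mapper,
        npartitions=1,
        dask_client=False,
    )


records = [
    {"id": 1, "time": 59000.1, "flux": 10.2, "error": 0.1, "band": "g"},
    {"id": 1, "time": 59001.1, "flux": 10.5, "error": 0.1, "band": "r"},
    {"id": 2, "time": 59000.2, "flux": 3.3, "error": 0.2, "band": "g"},
]
source_dict = records_to_source_dict(records)
print(source_dict["id"], source_dict["band"])
```

Only `records_to_source_dict` runs here; `load_from_records(records)` can be called once TAPE is installed.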


.. py:function:: read_dataset(dataset, dask_client=True, **kwargs)

   Load the ensemble from a TAPE dataset.

   :param dataset: The name of the dataset to import
   :type dataset: 'str'
   :param dask_client: Accepts an existing `dask.distributed.Client`, or creates one if
                       `dask_client=True`, passing any additional kwargs to the
                       `dask.distributed.Client` constructor. If `dask_client=False`, the
                       Ensemble is created without a distributed client.
   :type dask_client: `dask.distributed.Client` or `bool`, optional

   :returns: **ensemble** -- The ensemble object with the dataset loaded
   :rtype: `tape.ensemble.Ensemble`
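
Every reader on this page shares the same `dask_client` argument. Its three accepted forms can be sketched as a small dispatch function. `resolve_client` is a hypothetical helper written for illustration, not part of TAPE, and the library's own handling may differ in detail.

```python
def resolve_client(dask_client=True, **kwargs):
    """Return a client for True, None for False, or the object passed in."""
    if dask_client is True:
        # Imported lazily so the helper can be defined without dask installed.
        from dask.distributed import Client

        return Client(**kwargs)  # extra kwargs go to the Client constructor
    if dask_client is False:
        return None  # run without a distributed client
    return dask_client  # an existing client is used as-is


existing = object()  # stand-in for an already-created dask.distributed.Client
print(resolve_client(False))
print(resolve_client(existing) is existing)
```

With dask installed, a call such as `resolve_client(True, n_workers=2)` would create a new client, with `n_workers` forwarded to the `dask.distributed.Client` constructor.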


