[1]:

rel_path = "../../tests/tape_tests/data"

Data Access with TAPE

This tutorial demonstrates the various ways in which data can be ingested into TAPE.

Loading from Parquet Files

The Apache Parquet format is an efficient column-oriented data file format well-suited for bulk datasets. TAPE can create an Ensemble object from input parquet files via the Ensemble.from_parquet() function. At minimum, supply a parquet file containing the time series information needed to populate the Ensemble source table, as shown below. In this case, the Ensemble object table is created dynamically from the source table.

[2]:

from tape.ensemble import Ensemble
from tape.utils import ColumnMapper

ens = Ensemble()  # initialize an ensemble object

# A ColumnMapper is created to map columns of the parquet file to timeseries quantities, such as flux, time, etc.
col_map = ColumnMapper(
    id_col="ps1_objid", time_col="midPointTai", flux_col="psFlux", err_col="psFluxErr", band_col="filterName"
)

# Read in data from a parquet file that contains source (timeseries) data
ens.from_parquet(source_file=f"{rel_path}/source/test_source.parquet", column_mapper=col_map, sorted=True)

ens.source.head(5)  # View the first 5 entries of the source table

[2]:

	midPointTai	psFlux	psFluxErr	filterName
ps1_objid
88472468910699998	58972.382812	21.035807	0.199991	r
88472935274829959	58246.460938	18.149910	0.026191	r
88472935274829959	58249.441406	18.269829	0.028149	r
88472935274829959	58256.421875	18.243782	0.027706	r
88472935274829959	58259.445312	18.198299	0.026956	r

Alternatively, if object-level information is available in a second parquet file, it may also be provided to populate the Ensemble object table, as follows:

[3]:

ens = Ensemble()  # initialize an ensemble object

col_map = ColumnMapper(
    id_col="ps1_objid",
    time_col="midPointTai",
    flux_col="psFlux",
    err_col="psFluxErr",
    band_col="filterName",
)

# Read in source (timeseries) data and object-level data from two parquet files
ens.from_parquet(
    source_file=f"{rel_path}/source/test_source.parquet",
    object_file=f"{rel_path}/object/test_object.parquet",
    column_mapper=col_map,
    sorted=True,
)

ens.object.head(5)  # View the first 5 entries of the object table

[3]:

	nobs_g	nobs_r	nobs_total
ps1_objid
88472468910699998	0.0	1.0	1.0
88472935274829959	98.0	401.0	499.0
88480000290704349	82.0	106.0	188.0
88480000310609896	0.0	1.0	1.0
88480000340043358	162.0	181.0	343.0

In the above examples, we use the ColumnMapper helper class to map parquet file columns to a set of internally recognized quantities, such as flux, time, ids, and errors. TAPE uses these quantities to infer the correct columns when applying certain filtering operations or when using tape.analysis functions. If you aren’t sure which columns are present in a given parquet file before ingesting it into TAPE, we recommend using the pyarrow package to preview its metadata, as shown below.

[4]:

from pyarrow import parquet

parquet.read_schema(f"{rel_path}/source/test_source.parquet", memory_map=True)

[4]:

midPointTai: float
psFlux: float
psFluxErr: float
filterName: string
ps1_objid: int64
-- schema metadata --
pandas: '{"index_columns": ["ps1_objid"], "column_indexes": [{"name": nul' + 795
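
Once the column names are known, it can help to confirm that every column named in the mapping is actually present before ingesting. The following is a minimal sketch: missing_columns is a hypothetical helper and not part of TAPE, the column list is copied by hand from the schema output above, and the mapping is written as a plain dictionary that mirrors the keyword arguments passed to ColumnMapper earlier.

```python
# Hypothetical helper, not part of TAPE: report mapped columns that are absent.
def missing_columns(available, mapping):
    """Return {role: column} for every mapped column not found in `available`."""
    present = set(available)
    return {role: col for role, col in mapping.items() if col not in present}


# Column names copied from the schema printed above. With pyarrow, the same
# list should be obtainable from the `names` attribute of the returned schema.
schema_names = ["midPointTai", "psFlux", "psFluxErr", "filterName", "ps1_objid"]

# The same keyword/value pairs that were passed to ColumnMapper earlier.
mapping = {
    "id_col": "ps1_objid",
    "time_col": "midPointTai",
    "flux_col": "psFlux",
    "err_col": "psFluxErr",
    "band_col": "filterName",
}

missing = missing_columns(schema_names, mapping)
print(missing or "all mapped columns are present")
```

An empty result means the mapping is consistent with the file; any entries returned point to a mapped name that does not appear in the schema, such as a misspelling.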

Apache parquet files have many advantages for the type of scalable workflows that TAPE seeks to enable. A key advantage is that the parquet format supports in-format partitioning of large datasets. Because TAPE uses Dask, it inherits these partitions on load, so there is no need to manually set a partitioning scheme for your data.

TAPE Datasets

TAPE hosts a number of datasets that are retrievable by the user. These datasets have been added to demonstrate and test the various scientific workflows that TAPE has been developed to support. The Ensemble.available_datasets() function may be used to see which datasets are available to retrieve.

[5]:

ens = Ensemble()

ens.available_datasets()

[5]:

{'s82_rrlyrae': 'This dataset contains 483 RR Lyrae from Sesar 2010. This dataset was sourced from astroml, and then reformatted into the parquet format for TAPE.',
 's82_qso': "This dataset contains recalibrated light curves for all spectroscopically confirmed QSOs in the SDSS DR7 stripe 82 (22h 24m < R.A. < 04h 08m and | Dec | < 1.27 deg, about 290 deg2). The total number of QSOs is 9,258, and the observations are spaced out over ~10 years in yearly 'seasons' about 2-3 months long. This dataset was sourced from http://faculty.washington.edu/ivezic/cmacleod/qso_dr7/Southern.html, and then reformatted into the parquet format for ingest into TAPE."}

The function returns a dictionary that maps each dataset name to a brief description of its contents. To retrieve a dataset, pass its dictionary key to the Ensemble.from_dataset() function. Column mapping information is automatically generated for these known datasets.

[6]:

ens.from_dataset("s82_rrlyrae", sorted=True)  # Let's grab the Stripe 82 RR Lyrae

ens.object.head(5)

[6]:

	ra	dec	rExt	d	rGC	uF	gF	rF	iF	zF	...	rE	rT	iA	i0	iE	iT	zA	z0	zE	zT
#id
4099	0.935679	1.115859	0.089	17.75	20.03	18.134	16.989	16.777	16.703	16.685	...	51075.295112	103	0.317851	16.548633	51075.295084	102	0.302557	16.539893	51075.288235	100
13350	0.283437	1.178522	0.080	24.77	26.55	18.839	17.679	17.544	17.497	17.501	...	54025.326474	112	0.642111	17.147570	54025.326185	116	0.583437	17.190782	54025.327901	114
15927	3.254658	-0.584066	0.090	29.12	30.96	19.288	18.058	17.859	17.792	17.780	...	53680.226214	108	0.368674	17.610787	53680.243421	104	0.345422	17.615747	53680.247101	100
20406	3.244369	0.218891	0.088	9.13	12.76	16.715	15.543	15.336	15.286	15.276	...	54000.276631	108	0.342734	15.118909	54000.293780	102	0.303788	15.132053	54000.296412	100
21992	4.315354	1.054582	0.077	7.35	11.54	16.186	15.040	14.909	14.864	14.853	...	53698.243534	114	0.661144	14.523218	53698.249941	111	0.619123	14.524697	53698.245861	114

5 rows × 37 columns
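
Because Ensemble.available_datasets() returns a plain dictionary, it can be inspected with ordinary Python before choosing a key. The sketch below uses a stand-in dictionary with the two keys shown earlier and shortened descriptions, so that it runs without TAPE installed; summarize is a hypothetical helper, not a TAPE function.

```python
import textwrap

# Stand-in for the dictionary returned by Ensemble.available_datasets().
# The keys match the output shown earlier; the descriptions are shortened.
datasets = {
    "s82_rrlyrae": "This dataset contains 483 RR Lyrae from Sesar 2010.",
    "s82_qso": (
        "This dataset contains recalibrated light curves for all "
        "spectroscopically confirmed QSOs in the SDSS DR7 stripe 82."
    ),
}


def summarize(datasets, width=70):
    """Return one 'key: short description' line per dataset, sorted by key."""
    lines = []
    for name in sorted(datasets):
        short = textwrap.shorten(datasets[name], width=width, placeholder=" ...")
        lines.append(f"{name}: {short}")
    return lines


for line in summarize(datasets):
    print(line)

# Guard against typos before requesting a dataset by key.
key = "s82_rrlyrae"
if key not in datasets:
    raise KeyError(f"unknown dataset {key!r}; choose from {sorted(datasets)}")
```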

Loading from Array Data

If your data is stored in arrays, Ensemble.from_source_dict() offers an interface to load these into an Ensemble object using a dictionary.

Let’s start by creating some example data arrays in a dictionary:

[7]:

import numpy as np

np.random.seed(1)

# initialize a dictionary of empty arrays
source_dict = {
    "id": np.array([]),
    "time": np.array([]),
    "flux": np.array([]),
    "error": np.array([]),
    "band": np.array([]),
}

# Create 10 lightcurves with 100 measurements each
lc_len = 100
for i in range(10):
    source_dict["id"] = np.append(source_dict["id"], np.array([i] * lc_len)).astype(int)
    source_dict["time"] = np.append(source_dict["time"], np.linspace(1, lc_len, lc_len))
    source_dict["flux"] = np.append(source_dict["flux"], 100 + 50 * np.random.rand(lc_len))
    source_dict["error"] = np.append(source_dict["error"], 10 + 5 * np.random.rand(lc_len))
    source_dict["band"] = np.append(source_dict["band"], ["g"] * 50 + ["r"] * 50)
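
The loop above grows each array with repeated np.append calls, which copies the array on every iteration. As an alternative sketch, the same layout can be built in one step with np.repeat and np.tile. The name source_dict_vec is chosen here only to avoid overwriting source_dict, and the random values will differ from those above because this version uses a seeded np.random.default_rng generator instead of np.random.seed.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded generator; values differ from np.random.seed(1)

n_lc, lc_len = 10, 100  # 10 lightcurves with 100 measurements each
n_total = n_lc * lc_len

source_dict_vec = {
    # 0 repeated lc_len times, then 1 repeated lc_len times, and so on
    "id": np.repeat(np.arange(n_lc), lc_len),
    # the same time grid 1..lc_len, repeated once per lightcurve
    "time": np.tile(np.linspace(1, lc_len, lc_len), n_lc),
    # uniform random fluxes in [100, 150) and errors in [10, 15)
    "flux": 100 + 50 * rng.random(n_total),
    "error": 10 + 5 * rng.random(n_total),
    # first half of each lightcurve in "g", second half in "r"
    "band": np.tile(np.repeat(["g", "r"], lc_len // 2), n_lc),
}
```

The resulting dictionary has the same keys and array lengths as source_dict, so it could be used in its place in the next step.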

From here, we just need to pass the dictionary along to Ensemble.from_source_dict() and set the ColumnMapper appropriately.

[8]:

colmap = ColumnMapper(id_col="id", time_col="time", flux_col="flux", err_col="error", band_col="band")
ens = Ensemble()
ens.from_source_dict(source_dict, column_mapper=colmap, sorted=True)

ens.info()

Object Table
<class 'tape.ensemble_frame.ObjectFrame'>
Index: 0 entries
Empty ObjectFrame
Source Table
<class 'tape.ensemble_frame.SourceFrame'>
Index: 1000 entries, 0 to 9
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   time    1000 non-null      float64
 1   flux    1000 non-null      float64
 2   error   1000 non-null      float64
 3   band    1000 non-null      string
dtypes: float64(3), string(1)
memory usage: 40.0 KB
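
When assembling a dictionary by hand, it is easy to end up with a missing key or with arrays of different lengths, which cannot form a table. A small check before calling Ensemble.from_source_dict() can catch this early. This is a minimal sketch: check_source_dict is a hypothetical helper, not part of TAPE, and the required key names are assumed to be those used in the example above.

```python
# Hypothetical helper, not part of TAPE: validate a dictionary of arrays.
def check_source_dict(source_dict, required=("id", "time", "flux", "error", "band")):
    """Return the common array length, or raise if keys or lengths are inconsistent."""
    missing = [key for key in required if key not in source_dict]
    if missing:
        raise KeyError(f"missing keys: {missing}")
    lengths = {key: len(source_dict[key]) for key in required}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"arrays differ in length: {lengths}")
    return lengths[required[0]]


# Small stand-in data so the sketch runs on its own; plain lists work as well
# as NumPy arrays because only len() is used.
example = {
    "id": [0, 0, 1, 1],
    "time": [1.0, 2.0, 1.0, 2.0],
    "flux": [101.2, 130.5, 117.8, 149.0],
    "error": [10.1, 12.3, 11.7, 14.2],
    "band": ["g", "r", "g", "r"],
}

n_rows = check_source_dict(example)
print(f"{n_rows} rows, all arrays consistent")
```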
