[1]:
rel_path = "../../tests/tape_tests/data"

Data Access with TAPE#

This tutorial demonstrates the various ways in which data can be ingest into TAPE.

Loading from Parquet Files#

The Apache Parquet format is an efficient column-oriented data file format well-suited for bulk datasets. TAPE provides functionality to create an Ensemble object from input parquet files via the Ensemble.from_parquet() function. At minimum, a parquet file containing time series information needed to populate the Ensemble source table should be supplied, as shown below. The Ensemble object table is created dynamically from the source table in this instance.

[2]:
from tape.ensemble import Ensemble
from tape.utils import ColumnMapper

ens = Ensemble()  # initialize an ensemble object

# A ColumnMapper is created to map columns of the parquet file to timeseries quantities, such as flux, time, etc.
col_map = ColumnMapper(
    id_col="ps1_objid", time_col="midPointTai", flux_col="psFlux", err_col="psFluxErr", band_col="filterName"
)

# Read in data from a parquet file that contains source (timeseries) data
ens.from_parquet(source_file=f"{rel_path}/source/test_source.parquet", column_mapper=col_map, sorted=True)

ens.source.head(5)  # View the first 5 entries of the source table
[2]:
midPointTai psFlux psFluxErr filterName
ps1_objid
88472468910699998 58972.382812 21.035807 0.199991 r
88472935274829959 58246.460938 18.149910 0.026191 r
88472935274829959 58249.441406 18.269829 0.028149 r
88472935274829959 58256.421875 18.243782 0.027706 r
88472935274829959 58259.445312 18.198299 0.026956 r

Alternatively, if object level information is available in a second parquet file, that may also be provided to populate the Ensemble object table, as follows:

[3]:
ens = Ensemble()  # initialize an ensemble object

col_map = ColumnMapper(
    id_col="ps1_objid",
    time_col="midPointTai",
    flux_col="psFlux",
    err_col="psFluxErr",
    band_col="filterName",
)

# Read in data from a parquet file that contains source (timeseries) data
ens.from_parquet(
    source_file=f"{rel_path}/source/test_source.parquet",
    object_file=f"{rel_path}/object/test_object.parquet",
    column_mapper=col_map,
    sorted=True,
)

ens.object.head(5)  # View the first 5 entries of the object table
[3]:
nobs_g nobs_r nobs_total
ps1_objid
88472468910699998 0.0 1.0 1.0
88472935274829959 98.0 401.0 499.0
88480000290704349 82.0 106.0 188.0
88480000310609896 0.0 1.0 1.0
88480000340043358 162.0 181.0 343.0

In the above examples, we use the ColumnMapper helper class to facilitate mapping of parquet file columns to a set of internally recognized quantities, such as flux, time, ids, errors, etc. These quantities are used to infer the correct columns to use when applying certain filtering operations, or when using TAPE.analysis functions. It may be the case that you aren’t sure what columns are actually present in a given parquet file before attempting to ingest into TAPE. In these instances, we recommend using the pyarrow package to preview metadata, as shown below.

[4]:
from pyarrow import parquet

parquet.read_schema(f"{rel_path}/source/test_source.parquet", memory_map=True)
[4]:
midPointTai: float
psFlux: float
psFluxErr: float
filterName: string
ps1_objid: int64
-- schema metadata --
pandas: '{"index_columns": ["ps1_objid"], "column_indexes": [{"name": nul' + 795

Apache parquet files have many advantages for the type of scalable workflows that TAPE seeks to enable. A key advantage being that the parquet file supports in-format partitioning of large datasets. TAPE, by virtue of using Dask, inherits these partitions on load, avoiding any need to manually set a partitioning scheme for your data.

TAPE Datasets#

TAPE hosts a number of datasets that are retrievable by the user. These datasets have been added to demonstrate and test the various scientific workflows that TAPE has been developed to support. The Ensemble.available_datasets() may be used to see which datasets are available to retrieve.

[5]:
ens = Ensemble()

ens.available_datasets()
[5]:
{'s82_rrlyrae': 'This dataset contains 483 RR Lyrae from Sesar 2010. This dataset was sourced from astroml, and then reformatted into the parquet format for TAPE.',
 's82_qso': "This dataset contains recalibrated light curves for all spectroscopically confirmed QSOs in the SDSS DR7 stripe 82 (22h 24m < R.A. < 04h 08m and | Dec | < 1.27 deg, about 290 deg2). The total number of QSOs is 9,258, and the observations are spaced out over ~10 years in yearly 'seasons' about 2-3 months long. This dataset was sourced from http://faculty.washington.edu/ivezic/cmacleod/qso_dr7/Southern.html, and then reformatted into the parquet format for ingest into TAPE."}

The function returns a dictionary of datasets and a brief description of their contents. To retrieve them, use the Ensemble.from_dataset() function with the dictionary key value corresponding to a specific dataset. Column mapping information is automatically generated for these known datasets.

[6]:
ens.from_dataset("s82_rrlyrae", sorted=True)  # Let's grab the Stripe 82 RR Lyrae

ens.object.head(5)
[6]:
ra dec rExt d rGC uF gF rF iF zF ... rE rT iA i0 iE iT zA z0 zE zT
#id
4099 0.935679 1.115859 0.089 17.75 20.03 18.134 16.989 16.777 16.703 16.685 ... 51075.295112 103 0.317851 16.548633 51075.295084 102 0.302557 16.539893 51075.288235 100
13350 0.283437 1.178522 0.080 24.77 26.55 18.839 17.679 17.544 17.497 17.501 ... 54025.326474 112 0.642111 17.147570 54025.326185 116 0.583437 17.190782 54025.327901 114
15927 3.254658 -0.584066 0.090 29.12 30.96 19.288 18.058 17.859 17.792 17.780 ... 53680.226214 108 0.368674 17.610787 53680.243421 104 0.345422 17.615747 53680.247101 100
20406 3.244369 0.218891 0.088 9.13 12.76 16.715 15.543 15.336 15.286 15.276 ... 54000.276631 108 0.342734 15.118909 54000.293780 102 0.303788 15.132053 54000.296412 100
21992 4.315354 1.054582 0.077 7.35 11.54 16.186 15.040 14.909 14.864 14.853 ... 53698.243534 114 0.661144 14.523218 53698.249941 111 0.619123 14.524697 53698.245861 114

5 rows × 37 columns

Loading from Array Data#

If your data is stored in arrays, Ensemble.from_source_dict() offers an interface to load these into an Ensemble object using a dictionary.

Let’s start by creating some example data arrays in a dictionary:

[7]:
import numpy as np

np.random.seed(1)

# initialize a dictionary of empty arrays
source_dict = {
    "id": np.array([]),
    "time": np.array([]),
    "flux": np.array([]),
    "error": np.array([]),
    "band": np.array([]),
}

# Create 10 lightcurves with 100 measurements each
lc_len = 100
for i in range(10):
    source_dict["id"] = np.append(source_dict["id"], np.array([i] * lc_len)).astype(int)
    source_dict["time"] = np.append(source_dict["time"], np.linspace(1, lc_len, lc_len))
    source_dict["flux"] = np.append(source_dict["flux"], 100 + 50 * np.random.rand(lc_len))
    source_dict["error"] = np.append(source_dict["error"], 10 + 5 * np.random.rand(lc_len))
    source_dict["band"] = np.append(source_dict["band"], ["g"] * 50 + ["r"] * 50)

From here, we just need to pass the dictionary along to Ensemble.from_source_dict() and set the ColumnMapper appropriately.

[8]:
colmap = ColumnMapper(id_col="id", time_col="time", flux_col="flux", err_col="error", band_col="band")
ens = Ensemble()
ens.from_source_dict(source_dict, column_mapper=colmap, sorted=True)

ens.info()
Object Table
<class 'tape.ensemble_frame.ObjectFrame'>
Index: 0 entries
Empty ObjectFrameSource Table
<class 'tape.ensemble_frame.SourceFrame'>
Index: 1000 entries, 0 to 9
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   time    1000 non-null      float64
 1   flux    1000 non-null      float64
 2   error   1000 non-null      float64
 3   band    1000 non-null      string
dtypes: float64(3), string(1)
memory usage: 40.0 KB
[ ]: