[1]:
rel_path = "../../tests/tape_tests/data"
Data Access with TAPE#
This tutorial demonstrates the various ways in which data can be ingest into TAPE.
Loading from Parquet Files#
The Apache Parquet format is an efficient column-oriented data file format well-suited for bulk datasets. TAPE provides functionality to create an Ensemble object from input parquet files via the Ensemble.from_parquet() function. At minimum, a parquet file containing time series information needed to populate the Ensemble source table should be supplied, as shown below. The Ensemble object table is created dynamically from the
source table in this instance.
[2]:
from tape.ensemble import Ensemble
from tape.utils import ColumnMapper
ens = Ensemble() # initialize an ensemble object
# A ColumnMapper is created to map columns of the parquet file to timeseries quantities, such as flux, time, etc.
col_map = ColumnMapper(
id_col="ps1_objid", time_col="midPointTai", flux_col="psFlux", err_col="psFluxErr", band_col="filterName"
)
# Read in data from a parquet file that contains source (timeseries) data
ens.from_parquet(source_file=f"{rel_path}/source/test_source.parquet", column_mapper=col_map, sorted=True)
ens.source.head(5) # View the first 5 entries of the source table
[2]:
| midPointTai | psFlux | psFluxErr | filterName | |
|---|---|---|---|---|
| ps1_objid | ||||
| 88472468910699998 | 58972.382812 | 21.035807 | 0.199991 | r |
| 88472935274829959 | 58246.460938 | 18.149910 | 0.026191 | r |
| 88472935274829959 | 58249.441406 | 18.269829 | 0.028149 | r |
| 88472935274829959 | 58256.421875 | 18.243782 | 0.027706 | r |
| 88472935274829959 | 58259.445312 | 18.198299 | 0.026956 | r |
Alternatively, if object level information is available in a second parquet file, that may also be provided to populate the Ensemble object table, as follows:
[3]:
ens = Ensemble() # initialize an ensemble object
col_map = ColumnMapper(
id_col="ps1_objid",
time_col="midPointTai",
flux_col="psFlux",
err_col="psFluxErr",
band_col="filterName",
)
# Read in data from a parquet file that contains source (timeseries) data
ens.from_parquet(
source_file=f"{rel_path}/source/test_source.parquet",
object_file=f"{rel_path}/object/test_object.parquet",
column_mapper=col_map,
sorted=True,
)
ens.object.head(5) # View the first 5 entries of the object table
[3]:
| nobs_g | nobs_r | nobs_total | |
|---|---|---|---|
| ps1_objid | |||
| 88472468910699998 | 0.0 | 1.0 | 1.0 |
| 88472935274829959 | 98.0 | 401.0 | 499.0 |
| 88480000290704349 | 82.0 | 106.0 | 188.0 |
| 88480000310609896 | 0.0 | 1.0 | 1.0 |
| 88480000340043358 | 162.0 | 181.0 | 343.0 |
In the above examples, we use the ColumnMapper helper class to facilitate mapping of parquet file columns to a set of internally recognized quantities, such as flux, time, ids, errors, etc. These quantities are used to infer the correct columns to use when applying certain filtering operations, or when using TAPE.analysis functions. It may be the case that you aren’t sure what columns are actually present in a given parquet file before attempting to ingest into TAPE. In these instances,
we recommend using the pyarrow package to preview metadata, as shown below.
[4]:
from pyarrow import parquet
parquet.read_schema(f"{rel_path}/source/test_source.parquet", memory_map=True)
[4]:
midPointTai: float
psFlux: float
psFluxErr: float
filterName: string
ps1_objid: int64
-- schema metadata --
pandas: '{"index_columns": ["ps1_objid"], "column_indexes": [{"name": nul' + 795
Apache parquet files have many advantages for the type of scalable workflows that TAPE seeks to enable. A key advantage being that the parquet file supports in-format partitioning of large datasets. TAPE, by virtue of using Dask, inherits these partitions on load, avoiding any need to manually set a partitioning scheme for your data.
TAPE Datasets#
TAPE hosts a number of datasets that are retrievable by the user. These datasets have been added to demonstrate and test the various scientific workflows that TAPE has been developed to support. The Ensemble.available_datasets() may be used to see which datasets are available to retrieve.
[5]:
ens = Ensemble()
ens.available_datasets()
[5]:
{'s82_rrlyrae': 'This dataset contains 483 RR Lyrae from Sesar 2010. This dataset was sourced from astroml, and then reformatted into the parquet format for TAPE.',
's82_qso': "This dataset contains recalibrated light curves for all spectroscopically confirmed QSOs in the SDSS DR7 stripe 82 (22h 24m < R.A. < 04h 08m and | Dec | < 1.27 deg, about 290 deg2). The total number of QSOs is 9,258, and the observations are spaced out over ~10 years in yearly 'seasons' about 2-3 months long. This dataset was sourced from http://faculty.washington.edu/ivezic/cmacleod/qso_dr7/Southern.html, and then reformatted into the parquet format for ingest into TAPE."}
The function returns a dictionary of datasets and a brief description of their contents. To retrieve them, use the Ensemble.from_dataset() function with the dictionary key value corresponding to a specific dataset. Column mapping information is automatically generated for these known datasets.
[6]:
ens.from_dataset("s82_rrlyrae", sorted=True) # Let's grab the Stripe 82 RR Lyrae
ens.object.head(5)
[6]:
| ra | dec | rExt | d | rGC | uF | gF | rF | iF | zF | ... | rE | rT | iA | i0 | iE | iT | zA | z0 | zE | zT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #id | |||||||||||||||||||||
| 4099 | 0.935679 | 1.115859 | 0.089 | 17.75 | 20.03 | 18.134 | 16.989 | 16.777 | 16.703 | 16.685 | ... | 51075.295112 | 103 | 0.317851 | 16.548633 | 51075.295084 | 102 | 0.302557 | 16.539893 | 51075.288235 | 100 |
| 13350 | 0.283437 | 1.178522 | 0.080 | 24.77 | 26.55 | 18.839 | 17.679 | 17.544 | 17.497 | 17.501 | ... | 54025.326474 | 112 | 0.642111 | 17.147570 | 54025.326185 | 116 | 0.583437 | 17.190782 | 54025.327901 | 114 |
| 15927 | 3.254658 | -0.584066 | 0.090 | 29.12 | 30.96 | 19.288 | 18.058 | 17.859 | 17.792 | 17.780 | ... | 53680.226214 | 108 | 0.368674 | 17.610787 | 53680.243421 | 104 | 0.345422 | 17.615747 | 53680.247101 | 100 |
| 20406 | 3.244369 | 0.218891 | 0.088 | 9.13 | 12.76 | 16.715 | 15.543 | 15.336 | 15.286 | 15.276 | ... | 54000.276631 | 108 | 0.342734 | 15.118909 | 54000.293780 | 102 | 0.303788 | 15.132053 | 54000.296412 | 100 |
| 21992 | 4.315354 | 1.054582 | 0.077 | 7.35 | 11.54 | 16.186 | 15.040 | 14.909 | 14.864 | 14.853 | ... | 53698.243534 | 114 | 0.661144 | 14.523218 | 53698.249941 | 111 | 0.619123 | 14.524697 | 53698.245861 | 114 |
5 rows × 37 columns
Loading from Array Data#
If your data is stored in arrays, Ensemble.from_source_dict() offers an interface to load these into an Ensemble object using a dictionary.
Let’s start by creating some example data arrays in a dictionary:
[7]:
import numpy as np
np.random.seed(1)
# initialize a dictionary of empty arrays
source_dict = {
"id": np.array([]),
"time": np.array([]),
"flux": np.array([]),
"error": np.array([]),
"band": np.array([]),
}
# Create 10 lightcurves with 100 measurements each
lc_len = 100
for i in range(10):
source_dict["id"] = np.append(source_dict["id"], np.array([i] * lc_len)).astype(int)
source_dict["time"] = np.append(source_dict["time"], np.linspace(1, lc_len, lc_len))
source_dict["flux"] = np.append(source_dict["flux"], 100 + 50 * np.random.rand(lc_len))
source_dict["error"] = np.append(source_dict["error"], 10 + 5 * np.random.rand(lc_len))
source_dict["band"] = np.append(source_dict["band"], ["g"] * 50 + ["r"] * 50)
From here, we just need to pass the dictionary along to Ensemble.from_source_dict() and set the ColumnMapper appropriately.
[8]:
colmap = ColumnMapper(id_col="id", time_col="time", flux_col="flux", err_col="error", band_col="band")
ens = Ensemble()
ens.from_source_dict(source_dict, column_mapper=colmap, sorted=True)
ens.info()
Object Table
<class 'tape.ensemble_frame.ObjectFrame'>
Index: 0 entries
Empty ObjectFrameSource Table
<class 'tape.ensemble_frame.SourceFrame'>
Index: 1000 entries, 0 to 9
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 time 1000 non-null float64
1 flux 1000 non-null float64
2 error 1000 non-null float64
3 band 1000 non-null string
dtypes: float64(3), string(1)
memory usage: 40.0 KB
[ ]: