[1]:
rel_path = "../../tests/tape_tests/data/small_sky_hipscat"
Using TAPE with LSDB and HiPSCat Data
The Hierarchical Progressive Survey Catalog (HiPSCat) format partitions objects on a sphere. It is designed for storing data from large astronomy surveys. Its main feature is that partitions are sized adaptively, using HEALPix, according to the number of objects in a given region of the sky.
The Large Survey Database (LSDB) is a framework for spatial analysis of extremely large astronomical databases (e.g. querying and crossmatching O(1B) sources). It uses Dask to parallelize operations across multiple HiPSCat-partitioned surveys.
Both HiPSCat and LSDB are strong tools in the arsenal of a TAPE user. HiPSCat provides a scalable data format for working at the scale of LSST, while LSDB provides tooling to prepare more complex datasets for TAPE analysis, including operations such as crossmatching multiple surveys and cone searches that select data from specific regions of the sky. In this notebook, we’ll walk through how both can be used with TAPE.
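To make the idea of adaptive partition sizing concrete, here is a minimal sketch in plain Python. It is not the HiPSCat implementation: the function name `adaptive_partitions`, the `threshold` parameter, and the toy pixel indices are all assumptions made for illustration. The only HEALPix fact it relies on is that, in the nested numbering scheme, the parent of pixel `p` is `p >> 2`. Crowded pixels are split into their four children until each partition holds at most `threshold` objects.

```python
def adaptive_partitions(pixels, max_order, threshold):
    """Group max-order nested HEALPix pixel indices into adaptive partitions.

    Returns a dict mapping (order, pixel) to the number of objects it holds.
    Hypothetical helper for illustration only.
    """
    partitions = {}

    def visit(order, pixel, members):
        # Keep the pixel if it is small enough or cannot be split further.
        if len(members) <= threshold or order == max_order:
            partitions[(order, pixel)] = len(members)
            return
        # In the nested scheme the parent of pixel p (one order coarser) is
        # p >> 2, so shifting a max-order index gives its ancestor at order + 1.
        shift = 2 * (max_order - order - 1)
        children = {}
        for p in members:
            children.setdefault(p >> shift, []).append(p)
        for child, child_members in sorted(children.items()):
            visit(order + 1, child, child_members)

    # Seed the recursion with the order-0 base pixels that contain objects.
    base = {}
    for p in pixels:
        base.setdefault(p >> (2 * max_order), []).append(p)
    for pixel, members in sorted(base.items()):
        visit(0, pixel, members)
    return partitions


# Made-up order-2 pixel indices: five objects crowd base pixel 0.
toy_pixels = [0, 1, 2, 3, 5, 16, 17, 100]
result = adaptive_partitions(toy_pixels, max_order=2, threshold=2)
for (order, pixel), count in sorted(result.items()):
    print(f"Norder={order} Npix={pixel}: {count} objects")
```

In this toy run the crowded base pixel is subdivided down to order 2, while the sparse ones stay at order 0. The resulting (order, pixel) pairs play the same role as the Norder and Npix columns that appear in the catalog schemas.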
Loading from HiPSCat data
TAPE offers a built-in HiPSCat loader function, which can be used to quickly load in a dataset that is in the HiPSCat format. We’ll use a small dummy dataset for this example. Before loading, let’s just peek at the data we’ll be working with.
[2]:
import pyarrow.parquet as pq
import os
object_path = os.path.join(rel_path, "small_sky_object_catalog")
source_path = os.path.join(rel_path, "small_sky_source_catalog")
# Object Schema
pq.read_metadata(os.path.join(object_path, "_common_metadata")).schema
[2]:
<pyarrow._parquet.ParquetSchema object at 0x7f35437809c0>
required group field_id=-1 schema {
  optional int64 field_id=-1 id;
  optional double field_id=-1 ra;
  optional double field_id=-1 dec;
  optional int64 field_id=-1 ra_error;
  optional int64 field_id=-1 dec_error;
  optional int32 field_id=-1 Norder (Int(bitWidth=8, isSigned=false));
  optional int64 field_id=-1 Dir (Int(bitWidth=64, isSigned=false));
  optional int64 field_id=-1 Npix (Int(bitWidth=64, isSigned=false));
  optional int64 field_id=-1 _hipscat_index (Int(bitWidth=64, isSigned=false));
}
[3]:
# Source Schema
pq.read_metadata(os.path.join(source_path, "_common_metadata")).schema
[3]:
<pyarrow._parquet.ParquetSchema object at 0x7f356c13ea00>
required group field_id=-1 schema {
  optional int64 field_id=-1 source_id;
  optional double field_id=-1 source_ra;
  optional double field_id=-1 source_dec;
  optional double field_id=-1 mjd;
  optional double field_id=-1 mag;
  optional binary field_id=-1 band (String);
  optional int64 field_id=-1 object_id;
  optional double field_id=-1 object_ra;
  optional double field_id=-1 object_dec;
  optional int32 field_id=-1 Norder;
  optional int32 field_id=-1 Dir;
  optional int32 field_id=-1 Npix;
  optional int64 field_id=-1 _hipscat_index (Int(bitWidth=64, isSigned=false));
}
The schema indicates which fields are available in each catalog. Notice the _hipscat_index in both: this is a specially constructed index on which the data is sorted, and it enables efficient use of the HiPSCat format. We recommend using it as the ID column in TAPE when loading from HiPSCat object and source catalogs. With this established, let’s load the data into TAPE.
[4]:
from tape import Ensemble, ColumnMapper
ens = Ensemble(client=False)
# Setup a ColumnMapper
colmap = ColumnMapper(
    id_col="_hipscat_index",  # using _hipscat_index is recommended
    time_col="mjd",  # pulling these from the source schema list above
    flux_col="mag",
    err_col="Norder",  # we don't have an error column, using a random column for this toy example
    band_col="band",
)
ens.from_hipscat(source_path, object_path, column_mapper=colmap, object_index="id", source_index="object_id")
ens.object.head(5)
/home/docs/checkouts/readthedocs.org/user_builds/tape/envs/stable/lib/python3.10/site-packages/lsdb/dask/join_catalog_data.py:195: RuntimeWarning: Right catalog does not have a margin cache. Results may be inaccurate
warnings.warn("Right catalog does not have a margin cache. Results may be inaccurate", RuntimeWarning)
[4]:
| _hipscat_index | id | ra | dec | ra_error | dec_error | Norder | Dir | Npix |
|---|---|---|---|---|---|---|---|---|
| 12749688880727326720 | 707 | 308.5 | -69.5 | 0 | 0 | 0 | 0 | 11 |
| 12751184493818150912 | 792 | 320.5 | -69.5 | 0 | 0 | 0 | 0 | 11 |
| 12753202806647685120 | 723 | 315.5 | -68.5 | 0 | 0 | 0 | 0 | 11 |
| 12753202806647685121 | 811 | 315.5 | -68.5 | 0 | 0 | 0 | 0 | 11 |
| 12770681119980912640 | 826 | 335.5 | -69.5 | 0 | 0 | 0 | 0 | 11 |
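The rows above hint at why sorting on _hipscat_index keeps spatially close objects together. The sketch below assumes that the 64-bit index stores the HEALPix pixel in its most significant bits, so that the top 4 bits hold the order-0 pixel. That encoding is an assumption that happens to be consistent with these rows (every row has Norder 0 and Npix 11), not something this notebook documents, and `order0_pixel` is a hypothetical helper.

```python
# (_hipscat_index, Npix) pairs copied from the object table above.
rows = [
    (12749688880727326720, 11),
    (12751184493818150912, 11),
    (12753202806647685120, 11),
    (12753202806647685121, 11),
    (12770681119980912640, 11),
]


def order0_pixel(hipscat_index: int) -> int:
    """Top 4 bits of a 64-bit index (assumed to hold the order-0 HEALPix pixel)."""
    return hipscat_index >> 60


for index, npix in rows:
    # Under the stated assumption, the two pixel values should agree.
    print(index, order0_pixel(index), npix)
```

Note also that the third and fourth rows share the same ra and dec and their index values differ by exactly one, which is what you would expect if a small counter in the lowest bits is used to keep co-located objects unique.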
In the from_hipscat call, we also needed to specify object_index and source_index. These name a column in each table that maps to the same object-level identifier. The pair is used to join object and source, converting the source _hipscat_index (unique per source) to the object _hipscat_index (unique per object). From here, the _hipscat_index serves as the object ID that ties sources together for TAPE operations.
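The re-indexing step can be pictured with a small, dependency-free sketch. This is not what TAPE or LSDB runs internally: `reindex_sources` is a hypothetical helper, and the source-level index values, times, and magnitudes are made up. Only the two object rows reuse values from the table above.

```python
# Object table keyed by the object's _hipscat_index (values from the table above).
objects = {
    12749688880727326720: {"id": 707},
    12751184493818150912: {"id": 792},
}

# Toy sources: per-source index values and measurements are invented.
sources = [
    {"_hipscat_index": 1001, "object_id": 707, "mjd": 59000.0, "mag": 15.2},
    {"_hipscat_index": 1002, "object_id": 707, "mjd": 59001.0, "mag": 15.4},
    {"_hipscat_index": 1003, "object_id": 792, "mjd": 59000.5, "mag": 17.1},
    {"_hipscat_index": 1004, "object_id": 999, "mjd": 59002.0, "mag": 18.0},
]


def reindex_sources(objects, sources, object_index="id", source_index="object_id"):
    """Inner-join sources to objects and swap in the object's _hipscat_index."""
    id_to_index = {row[object_index]: idx for idx, row in objects.items()}
    joined = []
    for source in sources:
        key = source[source_index]
        if key in id_to_index:  # sources without a matching object are dropped
            joined.append({**source, "_hipscat_index": id_to_index[key]})
    return joined


for row in reindex_sources(objects, sources):
    print(row["_hipscat_index"], row["mjd"], row["mag"])
```

After the join, every source belonging to object 707 carries the same index value, so grouping by _hipscat_index yields one lightcurve per object.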
[5]:
# We're now free to work with our TAPE Ensemble as normal
import matplotlib.pyplot as plt
ts = ens.to_timeseries(12751184493818150912) # select a lightcurve using the _hipscat_index
# Let's plot this, though it's toy data so it won't look like anything...
plt.plot(ts.data["mjd"], ts.data["mag"], ".")
plt.title(ts.meta["id"])
[5]:
Text(0.5, 1.0, '12751184493818150912')
Loading from LSDB Catalogs
Ensemble.from_hipscat is used to directly ingest HiPSCat data into TAPE. In many cases, you may prefer to do a few operations on your HiPSCat data first using LSDB. Let’s walk through how this would look.
[6]:
# Loading into LSDB
import lsdb
# Load the dataset into LSDB catalog objects
object_cat = lsdb.read_hipscat(object_path)
source_cat = lsdb.read_hipscat(source_path)
We’ve now loaded our catalogs into LSDB catalog objects. From here, we can do LSDB operations on the catalogs. For example, let’s perform a cone search to narrow down our list of objects.
[7]:
object_cat_cone = object_cat.cone_search(
    ra=315.0,
    dec=-69.5,
    radius_arcsec=100000.0,
)
print(f"Original Number of Objects: {len(object_cat._ddf)}")
print(f"New Number of Objects: {len(object_cat_cone._ddf)}")
Original Number of Objects: 131
New Number of Objects: 74
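To give a feel for what this selection means geometrically, the sketch below checks individual points against the same cone using the haversine formula. It is an illustration only, not LSDB's implementation. The helper names are hypothetical, the first two test points use ra/dec values of objects in this catalog, and the third point is made up to show a rejection. A radius of 100000 arcseconds is roughly 27.8 degrees.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions (haversine)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    # Clamp to guard against rounding slightly above 1 for antipodal points.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def in_cone(ra, dec, center_ra, center_dec, radius_arcsec):
    """True if (ra, dec) lies within radius_arcsec of the cone center."""
    return angular_separation_deg(ra, dec, center_ra, center_dec) <= radius_arcsec / 3600.0


center = dict(center_ra=315.0, center_dec=-69.5, radius_arcsec=100000.0)
points = [
    (308.5, -69.5),  # ra/dec of an object in this catalog
    (307.5, -45.5),  # ra/dec of an object in this catalog
    (315.0, -30.0),  # made-up position, 39.5 degrees from the center
]
for ra, dec in points:
    print(ra, dec, in_cone(ra, dec, **center))
```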
With our cone search performed, we can now move into TAPE. We’ll first need to create a new source catalog, joined_source_cat, which incorporates the result of the cone search and also reindexes onto the object _hipscat_index.
[8]:
# We do this to get the source catalog indexed by the object's _hipscat_index
joined_source_cat = object_cat_cone.join(
    source_cat, left_on="id", right_on="object_id", suffixes=("_object", "")
)
colmap = ColumnMapper(
    id_col="_hipscat_index",
    time_col="mjd",
    flux_col="mag",
    err_col="Norder",  # no error column...
    band_col="band",
)
ens = Ensemble(client=False)
# We just pass in the catalog objects
ens.from_lsdb(joined_source_cat, object_cat_cone, column_mapper=colmap)
ens.object.compute()
/home/docs/checkouts/readthedocs.org/user_builds/tape/envs/stable/lib/python3.10/site-packages/lsdb/dask/join_catalog_data.py:195: RuntimeWarning: Right catalog does not have a margin cache. Results may be inaccurate
warnings.warn("Right catalog does not have a margin cache. Results may be inaccurate", RuntimeWarning)
[8]:
| _hipscat_index | id | ra | dec | ra_error | dec_error | Norder | Dir | Npix |
|---|---|---|---|---|---|---|---|---|
| 12749688880727326720 | 707 | 308.5 | -69.5 | 0 | 0 | 0 | 0 | 11 |
| 12751184493818150912 | 792 | 320.5 | -69.5 | 0 | 0 | 0 | 0 | 11 |
| 12753202806647685120 | 723 | 315.5 | -68.5 | 0 | 0 | 0 | 0 | 11 |
| 12753202806647685121 | 811 | 315.5 | -68.5 | 0 | 0 | 0 | 0 | 11 |
| 12770681119980912640 | 826 | 335.5 | -69.5 | 0 | 0 | 0 | 0 | 11 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13351146793404989440 | 753 | 307.5 | -45.5 | 0 | 0 | 0 | 0 | 11 |
| 13358998609274601472 | 769 | 307.5 | -42.5 | 0 | 0 | 0 | 0 | 11 |
| 13368388511275679744 | 764 | 297.5 | -45.5 | 0 | 0 | 0 | 0 | 11 |
| 13369482380335644672 | 785 | 296.5 | -44.5 | 0 | 0 | 0 | 0 | 11 |
| 13369514156621824000 | 709 | 294.5 | -45.5 | 0 | 0 | 0 | 0 | 11 |
74 rows × 8 columns
And from here, we’re once again able to work with our TAPE Ensemble as normal.