[1]:

rel_path = "../../tests/tape_tests/data/small_sky_hipscat"

Using TAPE with LSDB and HiPSCat Data

The Hierarchical Progressive Survey Catalog (HiPSCat) format partitions objects on a sphere. It is designed for storing data from large astronomy surveys. Its main feature is that partitions are sized adaptively, using HEALPix, according to the number of objects in a given region of the sky.
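To make the idea of adaptive partition sizing concrete, here is a minimal toy sketch. It is not the HiPSCat implementation, and real import pipelines may compute partitions differently. It assumes objects are already labelled with nested-scheme HEALPix pixel numbers at some maximum order; the function name `adaptive_partition` and the `threshold` parameter are hypothetical, chosen only for illustration.

```python
from collections import Counter


def adaptive_partition(pixels, max_order, threshold):
    """Toy adaptive partitioning over nested HEALPix pixels.

    `pixels` holds one nested-scheme pixel number (at `max_order`) per object.
    Returns a dict mapping (order, pixel) -> number of objects in that partition.
    """
    counts_at_max = Counter(pixels)
    partitions = {}
    # Start from the 12 base pixels at order 0.
    stack = [(0, p) for p in range(12)]
    while stack:
        order, pix = stack.pop()
        # A pixel at `max_order` belongs to the order-`order` pixel obtained
        # by dropping two bits per level of difference.
        shift = 2 * (max_order - order)
        count = sum(n for p, n in counts_at_max.items() if p >> shift == pix)
        if count == 0:
            continue  # empty regions of the sky get no partition
        if count <= threshold or order == max_order:
            partitions[(order, pix)] = count
        else:
            # In the nested scheme, pixel p has children 4p .. 4p+3 one order up.
            stack.extend((order + 1, 4 * pix + child) for child in range(4))
    return partitions


# Hypothetical data: a dense patch inside base pixel 0 and one lone object in base pixel 11.
toy_pixels = [0] * 3 + [1] * 3 + [176]
toy_partitions = adaptive_partition(toy_pixels, max_order=2, threshold=4)
```

In this toy run the dense region is split down to order 2, while the sparse region stays as a single order-0 partition.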

The Large Survey Database (LSDB) is a framework for spatial analysis of extremely large astronomical databases (e.g. querying and crossmatching O(1B) sources). It uses Dask to parallelize operations across multiple HiPSCat-partitioned surveys.

Both HiPSCat and LSDB are strong tools in the arsenal of a TAPE user. HiPSCat provides a scalable data format for working at the scale of LSST, while LSDB provides tooling to prepare more complex datasets for TAPE analysis, including operations such as cross-matching multiple surveys and cone searches that select data from specific regions of the sky. In this notebook, we’ll walk through how both can be used with TAPE.

Loading from HiPSCat data

TAPE offers a built-in HiPSCat loader function, which can be used to quickly load a dataset stored in the HiPSCat format. We’ll use a small dummy dataset for this example. Before loading, let’s peek at the data we’ll be working with.

[2]:

import pyarrow.parquet as pq
import os

object_path = os.path.join(rel_path, "small_sky_object_catalog")
source_path = os.path.join(rel_path, "small_sky_source_catalog")

# Object Schema
pq.read_metadata(os.path.join(object_path, "_common_metadata")).schema

[2]:

<pyarrow._parquet.ParquetSchema object at 0x7f35437809c0>
required group field_id=-1 schema {
  optional int64 field_id=-1 id;
  optional double field_id=-1 ra;
  optional double field_id=-1 dec;
  optional int64 field_id=-1 ra_error;
  optional int64 field_id=-1 dec_error;
  optional int32 field_id=-1 Norder (Int(bitWidth=8, isSigned=false));
  optional int64 field_id=-1 Dir (Int(bitWidth=64, isSigned=false));
  optional int64 field_id=-1 Npix (Int(bitWidth=64, isSigned=false));
  optional int64 field_id=-1 _hipscat_index (Int(bitWidth=64, isSigned=false));
}

[3]:

# Source Schema
pq.read_metadata(os.path.join(source_path, "_common_metadata")).schema

[3]:

<pyarrow._parquet.ParquetSchema object at 0x7f356c13ea00>
required group field_id=-1 schema {
  optional int64 field_id=-1 source_id;
  optional double field_id=-1 source_ra;
  optional double field_id=-1 source_dec;
  optional double field_id=-1 mjd;
  optional double field_id=-1 mag;
  optional binary field_id=-1 band (String);
  optional int64 field_id=-1 object_id;
  optional double field_id=-1 object_ra;
  optional double field_id=-1 object_dec;
  optional int32 field_id=-1 Norder;
  optional int32 field_id=-1 Dir;
  optional int32 field_id=-1 Npix;
  optional int64 field_id=-1 _hipscat_index (Int(bitWidth=64, isSigned=false));
}

The schema indicates which fields are available in each catalog. Notice the _hipscat_index column in both: this is a specially constructed index that the data is sorted on, and it enables efficient use of the HiPSCat format. It’s recommended to use this as the ID column in TAPE when loading from hipscatted object and source catalogs. With this established, let’s load this data into TAPE.

[4]:

from tape import Ensemble, ColumnMapper

ens = Ensemble(client=False)

# Set up a ColumnMapper
colmap = ColumnMapper(
    id_col="_hipscat_index",  # using _hipscat_index is recommended
    time_col="mjd",  # pulling these from the source schema list above
    flux_col="mag",
    err_col="Norder",  # we don't have an error column, using a random column for this toy example
    band_col="band",
)

ens.from_hipscat(source_path, object_path, column_mapper=colmap, object_index="id", source_index="object_id")

ens.object.head(5)

/home/docs/checkouts/readthedocs.org/user_builds/tape/envs/stable/lib/python3.10/site-packages/lsdb/dask/join_catalog_data.py:195: RuntimeWarning: Right catalog does not have a margin cache. Results may be inaccurate
  warnings.warn("Right catalog does not have a margin cache. Results may be inaccurate", RuntimeWarning)

[4]:

	id	ra	dec	ra_error	dec_error	Norder	Dir	Npix
_hipscat_index
12749688880727326720	707	308.5	-69.5	0	0	0	0	11
12751184493818150912	792	320.5	-69.5	0	0	0	0	11
12753202806647685120	723	315.5	-68.5	0	0	0	0	11
12753202806647685121	811	315.5	-68.5	0	0	0	0	11
12770681119980912640	826	335.5	-69.5	0	0	0	0	11
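The _hipscat_index values in the table above can be unpacked to see why sorting on them keeps spatially close objects together. The sketch below assumes the bit layout used by the hipscat package at the time of writing, as far as we understand it: the upper bits hold a nested HEALPix pixel at order 19 and the lowest 22 bits hold a counter that separates objects sharing a pixel. Treat the constants and the helper `pixel_at_order` as illustrative assumptions rather than a documented API.

```python
# Assumed layout: 64-bit index = (order-19 nested HEALPix pixel << 22) + counter
HEALPIX_ORDER = 19
COUNTER_BITS = 64 - (4 + 2 * HEALPIX_ORDER)  # 22 bits left over for the counter


def pixel_at_order(hipscat_index, order):
    """Recover the nested HEALPix pixel at `order` from a _hipscat_index."""
    pixel_19 = hipscat_index >> COUNTER_BITS
    # Each step down in order drops two bits of the nested pixel number.
    return pixel_19 >> (2 * (HEALPIX_ORDER - order))


# Index values copied from the object table above.
indexes = [
    12749688880727326720,
    12751184493818150912,
    12753202806647685120,
    12753202806647685121,
    12770681119980912640,
]

# Order-0 pixel for each object; compare with the Npix column (Norder is 0).
order0_pixels = [pixel_at_order(i, 0) for i in indexes]

# The two objects at (315.5, -68.5) share an order-19 pixel and differ only in the counter.
same_pixel = pixel_at_order(indexes[2], 19) == pixel_at_order(indexes[3], 19)
```

Under this assumption every index in the table maps to order-0 pixel 11, matching the Npix column.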

In the from_hipscat call, we additionally needed to specify object_index and source_index. These name a column in each table that maps to the same object-level identifier. That identifier is used to join object and source, and to convert the source _hipscat_index (which is unique per source) to the object _hipscat_index (unique per object). From here, the _hipscat_index serves as an object ID that ties sources together for TAPE operations.
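The reindexing described above can be pictured with plain Python data structures. This is only a conceptual sketch, not what TAPE or LSDB do internally: the source rows are hypothetical, the helper `reindex_sources` is invented for illustration, and dropping sources with no matching object is an assumption of this sketch.

```python
from collections import defaultdict

# Object table: object "id" -> object _hipscat_index (values from the table above).
object_index_by_id = {
    707: 12749688880727326720,
    792: 12751184493818150912,
}

# Hypothetical source rows; "object_id" plays the role of source_index.
sources = [
    {"source_id": 1, "object_id": 707, "mjd": 59000.0, "mag": 20.1},
    {"source_id": 2, "object_id": 792, "mjd": 59000.5, "mag": 19.4},
    {"source_id": 3, "object_id": 792, "mjd": 59001.5, "mag": 19.6},
    {"source_id": 4, "object_id": 999, "mjd": 59002.0, "mag": 21.0},  # no matching object
]


def reindex_sources(sources, object_index_by_id):
    """Group source rows under the _hipscat_index of the object they belong to."""
    lightcurves = defaultdict(list)
    for row in sources:
        object_index = object_index_by_id.get(row["object_id"])
        if object_index is None:
            continue  # assumption: unmatched sources are dropped
        lightcurves[object_index].append(row)
    return dict(lightcurves)


lightcurves = reindex_sources(sources, object_index_by_id)
```

After this step, every source belonging to object 792 is reachable through the single key 12751184493818150912, which is the role the object _hipscat_index plays in the Ensemble.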

[5]:

# We're now free to work with our TAPE Ensemble as normal
import matplotlib.pyplot as plt

ts = ens.to_timeseries(12751184493818150912)  # select a lightcurve using the _hipscat_index

# Let's plot this, though it's toy data so it won't look like anything...
plt.plot(ts.data["mjd"], ts.data["mag"], ".")
plt.title(ts.meta["id"])

[5]:

Text(0.5, 1.0, '12751184493818150912')

[Figure: scatter plot of mag versus mjd for the selected lightcurve, titled 12751184493818150912]

Loading from LSDB Catalogs

Ensemble.from_hipscat directly ingests HiPSCat data into TAPE. In many cases, you may prefer to perform a few operations on your HiPSCat data first using LSDB. Let’s walk through how this would look.

[6]:

# Loading into LSDB
import lsdb

# Load the dataset into LSDB catalog objects
object_cat = lsdb.read_hipscat(object_path)
source_cat = lsdb.read_hipscat(source_path)

We’ve now loaded our catalogs into LSDB catalog objects. From here, we can do LSDB operations on the catalogs. For example, let’s perform a cone search to narrow down our list of objects.

[7]:

object_cat_cone = object_cat.cone_search(
    ra=315.0,
    dec=-69.5,
    radius_arcsec=100000.0,
)

print(f"Original Number of Objects: {len(object_cat._ddf)}")
print(f"New Number of Objects: {len(object_cat_cone._ddf)}")

Original Number of Objects: 131
New Number of Objects: 74
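Conceptually, a cone search keeps the objects whose angular separation from the cone center is no larger than the radius. The following is a minimal sketch of that test using the haversine formula. It is not LSDB's implementation, which operates on partitioned catalogs; the helper names are hypothetical, and the last row of `objects` is an invented far-away position added for contrast.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(math.sqrt(a)))


def in_cone(ra, dec, center_ra, center_dec, radius_arcsec):
    """True if (ra, dec) lies within `radius_arcsec` of the cone center."""
    separation_arcsec = angular_separation_deg(ra, dec, center_ra, center_dec) * 3600.0
    return separation_arcsec <= radius_arcsec


# (id, ra, dec): the first two rows come from the object table shown earlier,
# the third is a hypothetical object far from the cone center.
objects = [(707, 308.5, -69.5), (723, 315.5, -68.5), (999, 315.0, 0.0)]

# Same cone parameters as the cone_search call above.
selected = [
    obj_id
    for obj_id, ra, dec in objects
    if in_cone(ra, dec, center_ra=315.0, center_dec=-69.5, radius_arcsec=100000.0)
]
```

Note that a radius of 100000 arcseconds is roughly 27.8 degrees, which is why more than half of the objects in this small catalog survive the cut.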

With our cone search performed, we can now move into TAPE. We’ll first need to create a new source catalog, joined_source_cat, which incorporates the result of the cone search and also reindexes onto the object _hipscat_index.

[8]:

# We do this to get the source catalog indexed by the object's _hipscat_index
joined_source_cat = object_cat_cone.join(
    source_cat, left_on="id", right_on="object_id", suffixes=("_object", "")
)

colmap = ColumnMapper(
    id_col="_hipscat_index",
    time_col="mjd",
    flux_col="mag",
    err_col="Norder",  # no error column...
    band_col="band",
)

ens = Ensemble(client=False)

# We just pass in the catalog objects
ens.from_lsdb(joined_source_cat, object_cat_cone, column_mapper=colmap)

ens.object.compute()

/home/docs/checkouts/readthedocs.org/user_builds/tape/envs/stable/lib/python3.10/site-packages/lsdb/dask/join_catalog_data.py:195: RuntimeWarning: Right catalog does not have a margin cache. Results may be inaccurate
  warnings.warn("Right catalog does not have a margin cache. Results may be inaccurate", RuntimeWarning)

[8]:

	id	ra	dec	ra_error	dec_error	Norder	Dir	Npix
_hipscat_index
12749688880727326720	707	308.5	-69.5	0	0	0	0	11
12751184493818150912	792	320.5	-69.5	0	0	0	0	11
12753202806647685120	723	315.5	-68.5	0	0	0	0	11
12753202806647685121	811	315.5	-68.5	0	0	0	0	11
12770681119980912640	826	335.5	-69.5	0	0	0	0	11
...	...	...	...	...	...	...	...	...
13351146793404989440	753	307.5	-45.5	0	0	0	0	11
13358998609274601472	769	307.5	-42.5	0	0	0	0	11
13368388511275679744	764	297.5	-45.5	0	0	0	0	11
13369482380335644672	785	296.5	-44.5	0	0	0	0	11
13369514156621824000	709	294.5	-45.5	0	0	0	0	11

74 rows × 8 columns

And from here, we’re once again able to work with our TAPE Ensemble as normal.
