Common Data Operations with TAPE#
In this notebook, we’ll highlight a handful of common dataframe operations that can be performed within TAPE.
Note:
TAPE extends the Pandas/Dask API, so users familiar with those APIs can expect many operations to be near-identical when working with TAPE.
Let’s consider a small example dataset of Stripe 82 RR Lyrae stars:
[1]:
from tape import Ensemble
ens = Ensemble()
ens.from_dataset("s82_rrlyrae", sorted=True)
[1]:
<tape.ensemble.Ensemble at 0x7f2050387040>
Inspection#
These functions provide views into the contents of your Ensemble dataframes. This is especially important when dealing with large data volumes that cannot be brought into memory all at once.
Lazy View of an EnsembleFrame#
The most basic inspection method is to simply reference the EnsembleFrame (dataframe) objects themselves. This returns a lazy view of the EnsembleFrame, meaning no data is loaded.
[2]:
ens.object
[2]:
| ra | dec | rExt | d | rGC | uF | gF | rF | iF | zF | VF | ugmin | ugminErr | grmin | grminErr | type | P | uA | u0 | uE | uT | gA | g0 | gE | gT | rA | r0 | rE | rT | iA | i0 | iE | iT | zA | z0 | zE | zT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=5 | |||||||||||||||||||||||||||||||||||||
| 4099 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | string | float64 | float64 | float64 | float64 | int64 | float64 | float64 | float64 | int64 | float64 | float64 | float64 | int64 | float64 | float64 | float64 | int64 | float64 | float64 | float64 | int64 |
| 848438 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3138275 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5011634 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
[3]:
ens.source
[3]:
| ra | dec | mjd | flux | error | band | |
|---|---|---|---|---|---|---|
| npartitions=5 | ||||||
| 4099 | float64 | float64 | float64 | float64 | float64 | string |
| 848438 | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... |
| 3138275 | ... | ... | ... | ... | ... | ... |
| 5011634 | ... | ... | ... | ... | ... | ... |
Using compute() to view the data#
When an EnsembleFrame’s contents are small enough to fit into memory, you can use compute() to view the actual data.
Note:
compute() also performs all of the loading, filtering, and analysis needed to produce the in-memory result, so it can take a long time!
[4]:
ens.object.compute()
[4]:
| ra | dec | rExt | d | rGC | uF | gF | rF | iF | zF | ... | rE | rT | iA | i0 | iE | iT | zA | z0 | zE | zT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #id | |||||||||||||||||||||
| 4099 | 0.935679 | 1.115859 | 0.089 | 17.75 | 20.03 | 18.134 | 16.989 | 16.777 | 16.703 | 16.685 | ... | 51075.295112 | 103 | 0.317851 | 16.548633 | 51075.295084 | 102 | 0.302557 | 16.539893 | 51075.288235 | 100 |
| 13350 | 0.283437 | 1.178522 | 0.080 | 24.77 | 26.55 | 18.839 | 17.679 | 17.544 | 17.497 | 17.501 | ... | 54025.326474 | 112 | 0.642111 | 17.147570 | 54025.326185 | 116 | 0.583437 | 17.190782 | 54025.327901 | 114 |
| 15927 | 3.254658 | -0.584066 | 0.090 | 29.12 | 30.96 | 19.288 | 18.058 | 17.859 | 17.792 | 17.780 | ... | 53680.226214 | 108 | 0.368674 | 17.610787 | 53680.243421 | 104 | 0.345422 | 17.615747 | 53680.247101 | 100 |
| 20406 | 3.244369 | 0.218891 | 0.088 | 9.13 | 12.76 | 16.715 | 15.543 | 15.336 | 15.286 | 15.276 | ... | 54000.276631 | 108 | 0.342734 | 15.118909 | 54000.293780 | 102 | 0.303788 | 15.132053 | 54000.296412 | 100 |
| 21992 | 4.315354 | 1.054582 | 0.077 | 7.35 | 11.54 | 16.186 | 15.040 | 14.909 | 14.864 | 14.853 | ... | 53698.243534 | 114 | 0.661144 | 14.523218 | 53698.249941 | 111 | 0.619123 | 14.524697 | 53698.245861 | 114 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4956681 | 58.700931 | 1.228830 | 1.051 | 36.88 | 43.44 | 19.476 | 18.513 | 18.454 | 18.480 | 18.505 | ... | 53272.495204 | 105 | 0.730201 | 18.060217 | 53272.485932 | 113 | 0.583450 | 18.176327 | 53272.488449 | 117 |
| 4983075 | 57.156605 | 0.134676 | 0.527 | 29.15 | 35.64 | 19.114 | 18.054 | 17.869 | 17.818 | 17.842 | ... | 54064.415166 | 101 | 0.278196 | 17.676564 | 54064.408405 | 101 | 0.247065 | 17.722055 | 54064.404969 | 100 |
| 4984662 | 57.128875 | -0.389138 | 0.584 | 39.05 | 45.43 | 19.745 | 18.701 | 18.489 | 18.454 | 18.452 | ... | 53994.472022 | 104 | 0.359037 | 18.272159 | 53994.465668 | 108 | 0.309688 | 18.303267 | 53994.475330 | 100 |
| 4992418 | 57.151443 | 0.892965 | 0.479 | 31.46 | 37.97 | 19.279 | 18.214 | 18.042 | 18.000 | 17.997 | ... | 53681.416038 | 109 | 0.560006 | 17.708949 | 53681.425071 | 107 | 0.512215 | 17.718540 | 53681.406509 | 115 |
| 5011634 | 56.510537 | -0.561729 | 0.451 | 30.58 | 36.98 | 19.232 | 18.106 | 18.041 | 18.058 | 18.116 | ... | 52934.364056 | 0 | 0.281035 | 17.916158 | 52934.356016 | 1 | 0.255751 | 17.987333 | 52934.369944 | 0 |
483 rows × 37 columns
Grab small in-memory views with head()#
Often, you’ll want to peek at your data even though the full dataset is too large for memory.
Note: some partitions may be empty, and head will have to traverse these empty partitions to find enough rows for your result. For a table with many partitions (O(100k)), this can be costly even when the result is ultimately empty.
[5]:
ens.source.head(5) # grabs the first 5 rows
# can also use tail to grab the last 5 rows
[5]:
| ra | dec | mjd | flux | error | band | |
|---|---|---|---|---|---|---|
| #id | ||||||
| 4099 | 0.935679 | 1.115859 | 53288.253649 | 18.316 | 0.016 | u |
| 4099 | 0.935679 | 1.115859 | 53294.301322 | 18.477 | 0.020 | u |
| 4099 | 0.935679 | 1.115859 | 53302.249316 | -99.990 | 0.019 | u |
| 4099 | 0.935679 | 1.115859 | 53312.204844 | 18.478 | 0.040 | u |
| 4099 | 0.935679 | 1.115859 | 54009.258962 | -99.990 | 0.017 | u |
Getting Individual Lightcurves#
Several methods exist to access individual lightcurves within the Ensemble.
Access using a known ID#
If you’d like to access a particular lightcurve given an ID, you can use the to_timeseries() function. This allows you to supply a given object ID, and returns a TimeSeries object (see working_with_the_timeseries).
Note that this loads data from all available bands.
[6]:
ts = ens.to_timeseries(13350)
ts.data
[6]:
| band | index | ra | dec | mjd | flux | error | band |
|---|---|---|---|---|---|---|---|
| i | 0 | 0.283437 | 1.178522 | 52253.185568 | 17.287 | 0.006 | i |
| i | 1 | 0.283437 | 1.178522 | 52557.307400 | 17.300 | 0.006 | i |
| i | 2 | 0.283437 | 1.178522 | 52578.203484 | 17.422 | 0.006 | i |
| i | 3 | 0.283437 | 1.178522 | 52908.281967 | 17.708 | 0.007 | i |
| i | 4 | 0.283437 | 1.178522 | 52911.295754 | 17.273 | 0.006 | i |
| ... | ... | ... | ... | ... | ... | ... | ... |
| z | 60 | 0.283437 | 1.178522 | 54405.293571 | 17.543 | 0.022 | z |
| z | 61 | 0.283437 | 1.178522 | 54406.231836 | 17.177 | 0.020 | z |
| z | 62 | 0.283437 | 1.178522 | 54358.275040 | 17.664 | 0.025 | z |
| z | 63 | 0.283437 | 1.178522 | 54359.269103 | 17.608 | 0.034 | z |
| z | 64 | 0.283437 | 1.178522 | 54365.308108 | 17.591 | 0.027 | z |
325 rows × 6 columns
[7]:
import matplotlib.pyplot as plt

# Plot each band's lightcurve with error bars
for band in ts.data.band.unique():
    plt.errorbar(
        ts.data.loc[band]["mjd"],
        ts.data.loc[band]["flux"],
        yerr=ts.data.loc[band]["error"],
        fmt=".",
        label=band,
    )
plt.ylim(16, 20)
plt.legend()
plt.title(ts.meta["id"])
[7]:
Text(0.5, 1.0, '13350')
Access a random lightcurve#
Alternatively, if you aren’t interested in a particular lightcurve, you can draw a random one from the Ensemble using Ensemble.select_random_timeseries().
[8]:
ens.select_random_timeseries(seed=1).data
Selected Object 4455741 from Partition 4
[8]:
| band | index | ra | dec | mjd | flux | error | band |
|---|---|---|---|---|---|---|---|
| z | 0 | -50.103767 | -0.884631 | 53637.132779 | 16.749 | 0.011 | z |
| z | 1 | -50.103767 | -0.884631 | 53644.125222 | 16.847 | 0.018 | z |
| z | 2 | -50.103767 | -0.884631 | 53649.119849 | -99.990 | 0.025 | z |
| z | 3 | -50.103767 | -0.884631 | 53654.159851 | 16.667 | 0.011 | z |
| z | 4 | -50.103767 | -0.884631 | 53656.107440 | 16.864 | 0.013 | z |
| ... | ... | ... | ... | ... | ... | ... | ... |
| z | 53 | -50.103767 | -0.884631 | 54381.180258 | 16.696 | 0.020 | z |
| z | 54 | -50.103767 | -0.884631 | 54415.132771 | 16.672 | 0.014 | z |
| z | 55 | -50.103767 | -0.884631 | 53668.110351 | 16.791 | 0.014 | z |
| i | 54 | -50.103767 | -0.884631 | 53272.184238 | 16.852 | 0.005 | i |
| i | 55 | -50.103767 | -0.884631 | 54358.133780 | 16.736 | 0.007 | i |
280 rows × 6 columns
Filtering#
Queries#
Queries mirror the Pandas implementation. Specifically, the function takes a string that provides an expression indicating which rows to keep.
[9]:
# define a query to remove the top 5% of flux values
highest_flux = ens.source[ens._flux_col].quantile(0.95).compute()
ens.source.query(f"{ens._flux_col} < {highest_flux}").compute()
[9]:
| ra | dec | mjd | flux | error | band | |
|---|---|---|---|---|---|---|
| #id | ||||||
| 4099 | 0.935679 | 1.115859 | 53616.310409 | -99.990 | 0.051 | u |
| 4099 | 0.935679 | 1.115859 | 53623.300622 | -99.990 | 0.031 | u |
| 4099 | 0.935679 | 1.115859 | 52557.310040 | 18.336 | 0.017 | u |
| 4099 | 0.935679 | 1.115859 | 52578.206125 | 18.176 | 0.014 | u |
| 4099 | 0.935679 | 1.115859 | 52908.284607 | 18.249 | 0.017 | u |
| ... | ... | ... | ... | ... | ... | ... |
| 5011634 | 56.510537 | -0.561729 | 53693.385785 | 18.514 | 0.016 | g |
| 5011634 | 56.510537 | -0.561729 | 53699.406619 | 18.506 | 0.009 | g |
| 5011634 | 56.510537 | -0.561729 | 53704.369106 | 18.634 | 0.009 | g |
| 5011634 | 56.510537 | -0.561729 | 53989.465164 | 18.983 | 0.018 | g |
| 5011634 | 56.510537 | -0.561729 | 54008.451534 | 19.000 | 0.017 | g |
138941 rows × 6 columns
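Because the query is an ordinary Pandas expression string, the same pattern can be tried on a plain Pandas dataframe. The following is a minimal sketch with a made-up toy table (the values and the `flux_col` variable are illustrative stand-ins, not TAPE data or TAPE internals); it shows that the f-string is formatted before `query()` sees it and that the result matches a boolean mask.

```python
import pandas as pd

# Toy stand-in for the source table; values are made up for illustration.
source = pd.DataFrame(
    {
        "flux": [18.3, 18.5, 17.9, 19.4, 18.1, 25.0],
        "band": ["u", "u", "g", "g", "r", "r"],
    },
    index=pd.Index([1, 1, 1, 2, 2, 2], name="id"),
)

flux_col = "flux"  # hypothetical stand-in for ens._flux_col
threshold = source[flux_col].quantile(0.95)

# The f-string is formatted first, so query() receives "flux < <number>".
kept = source.query(f"{flux_col} < {threshold}")

# Equivalent boolean-mask form of the same filter.
kept_mask = source[source[flux_col] < threshold]

print(len(source), "->", len(kept))
```

The query form tends to read more naturally when several conditions are combined, while the mask form avoids formatting values into a string.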
Note: When filtering, or doing any operation that modifies a dataframe, the result is a new dataframe that does not automatically update the Ensemble. If you’d like to update the Ensemble with the result of any of the following operations, be sure to add .update_ensemble() to the end of the call.
Filtering by Number of Observations#
Filters based on the number of observations are more directly supported within the TAPE API. First, use the dedicated function Ensemble.calc_nobs() to calculate the number of observations per lightcurve:
[10]:
ens.calc_nobs(by_band=True, temporary=False)
ens.object.head(5)[["nobs_u", "nobs_g", "nobs_r", "nobs_i", "nobs_z", "nobs_total"]]
[10]:
| nobs_u | nobs_g | nobs_r | nobs_i | nobs_z | nobs_total | |
|---|---|---|---|---|---|---|
| #id | ||||||
| 4099 | 64 | 64 | 64 | 64 | 64 | 320 |
| 13350 | 65 | 65 | 65 | 65 | 65 | 325 |
| 15927 | 64 | 64 | 64 | 64 | 64 | 320 |
| 20406 | 65 | 65 | 65 | 65 | 65 | 325 |
| 21992 | 79 | 79 | 79 | 79 | 79 | 395 |
You can then query on these columns as normal.
[11]:
ens.object.query("nobs_total > 322")[["nobs_u", "nobs_g", "nobs_r", "nobs_i", "nobs_z", "nobs_total"]].head(5)
[11]:
| nobs_u | nobs_g | nobs_r | nobs_i | nobs_z | nobs_total | |
|---|---|---|---|---|---|---|
| #id | ||||||
| 13350 | 65 | 65 | 65 | 65 | 65 | 325 |
| 20406 | 65 | 65 | 65 | 65 | 65 | 325 |
| 21992 | 79 | 79 | 79 | 79 | 79 | 395 |
| 46988 | 65 | 65 | 65 | 65 | 65 | 325 |
| 91658 | 65 | 65 | 65 | 65 | 65 | 325 |
Alternatively, if you’d like to just quickly filter by the number of total observations, you can use Ensemble.prune().
[12]:
ens.prune(322) # equivalent to the above
ens.object[["nobs_total"]].head(5)
[12]:
| nobs_total | |
|---|---|
| #id | |
| 13350 | 325 |
| 20406 | 325 |
| 21992 | 395 |
| 46988 | 325 |
| 91658 | 325 |
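To build intuition for what the per-band counts represent, here is a rough plain-Pandas sketch of the same bookkeeping on a made-up toy table. This is not how TAPE implements calc_nobs() or prune(); it only illustrates counting source rows per object and band, then filtering on the total.

```python
import pandas as pd

# Toy source table with made-up values: one row per observation.
source = pd.DataFrame(
    {
        "id": [1, 1, 1, 1, 2, 2, 3, 3, 3],
        "band": ["g", "g", "r", "r", "g", "r", "g", "g", "r"],
        "flux": [18.1, 18.2, 17.9, 18.0, 16.5, 16.4, 19.0, 19.1, 18.8],
    }
)

# One row per object, one column per band, counting source rows.
nobs = source.groupby(["id", "band"]).size().unstack(fill_value=0)
nobs = nobs.add_prefix("nobs_")
nobs["nobs_total"] = nobs.sum(axis=1)

# Keep only objects with more than 2 observations in total.
well_sampled = nobs.query("nobs_total > 2")
print(well_sampled)
```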
Removing NaNs#
Removing Rows with NaN values follows the Pandas API, using dropna():
[13]:
# Remove any rows with a NaN value in any of the specified columns
ens.source.dropna(subset=["flux", "mjd", "error", "band"]).update_ensemble()
ens.source
[13]:
| ra | dec | mjd | flux | error | band | |
|---|---|---|---|---|---|---|
| npartitions=5 | ||||||
| 4099 | float64 | float64 | float64 | float64 | float64 | string |
| 848438 | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... |
| 3138275 | ... | ... | ... | ... | ... | ... |
| 5011634 | ... | ... | ... | ... | ... | ... |
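One thing to keep in mind is that dropna() only removes true NaN values. The tables above contain flux values of -99.990; if those are placeholders for missing measurements (an assumption here, as this notebook does not say so), they would need to be converted to NaN first. A minimal plain-Pandas sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy source rows with made-up values.
source = pd.DataFrame(
    {
        "mjd": [53288.25, 53294.30, 53302.25, 53312.20],
        "flux": [18.316, 18.477, -99.99, np.nan],
        "error": [0.016, 0.020, 0.019, 0.040],
        "band": ["u", "u", "u", "u"],
    }
)

# dropna() only removes genuine NaN values; the -99.99 row survives.
only_nan_removed = source.dropna(subset=["flux", "mjd", "error", "band"])

# ASSUMPTION: -99.99 is a placeholder for a missing measurement.
SENTINEL = -99.99
flux_with_nan = source["flux"].mask(source["flux"] == SENTINEL)
cleaned = source.assign(flux=flux_with_nan).dropna(subset=["flux"])

print(len(only_nan_removed), len(cleaned))
```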
Analysis#
Applying Functions with Ensemble.batch()#
The Ensemble provides a powerful batching interface, Ensemble.batch(), with in-built parallelization (provided the input data is in multiple partitions).
[14]:
import numpy as np


# Defining a simple function
def my_flux_average(flux_array, band_array, method="mean", band=None):
    """Read in an array of fluxes, and return the average of the fluxes by band"""
    if band is not None:
        mask = band_array == band  # Boolean mask selecting the requested band
        band_flux = flux_array[mask]  # Mask the flux array
    else:
        band_flux = flux_array  # No band requested: average over all bands
    if method == "mean":
        res = np.mean(band_flux)
    elif method == "median":
        res = np.median(band_flux)
    else:
        res = None  # Unrecognized method
    return res
With the function defined, we next supply it to Ensemble.batch(). The column labels of the Ensemble columns we want to use as arguments must be provided, as well as any keyword arguments. In this case, we pass along "flux" and "band", so that the Ensemble will map those columns to flux_array and band_array respectively. We also pass method='median' and band='g', which will pass those kwargs along to my_flux_average.
[15]:
# Applying the function to the ensemble
res = ens.batch(my_flux_average, "flux", "band", meta=None, method="median", band="g")
res.compute()
Temporary columns dropped from Object Table: ['nobs_total']
Using generated label, result_1, for a batch result.
[15]:
| result | |
|---|---|
| #id | |
| 13350 | 17.8260 |
| 20406 | 15.6750 |
| 21992 | 15.1520 |
| 46988 | 15.5000 |
| 91658 | 16.1550 |
| ... | ... |
| 4920018 | 14.8620 |
| 4947744 | 18.5270 |
| 4983075 | 18.8380 |
| 4984662 | 19.5020 |
| 4992418 | 19.0345 |
103 rows × 1 columns
Ensemble.batch() supports many different variations of custom user functions, and additionally has a small suite of tailored analysis functions designed for it. For more details on batch, see the batch showcase.
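Conceptually, the column-to-argument mapping described above resembles a per-object groupby in plain Pandas. The sketch below is only an analogy on made-up data, not TAPE's implementation of batch(), and `band_median` is a hypothetical helper written for this illustration.

```python
import numpy as np
import pandas as pd


def band_median(flux_array, band_array, band=None):
    """Median flux, optionally restricted to one band."""
    if band is not None:
        flux_array = flux_array[band_array == band]
    return np.median(flux_array)


# Toy source table with made-up values, indexed by object id.
source = pd.DataFrame(
    {
        "flux": [18.0, 18.2, 17.5, 16.0, 16.4, 15.9],
        "band": ["g", "g", "r", "g", "g", "r"],
    },
    index=pd.Index([1, 1, 1, 2, 2, 2], name="id"),
)

# Group rows by object id and call the function once per lightcurve,
# handing it the requested columns as NumPy arrays plus the kwargs.
result = source.groupby(level="id").apply(
    lambda lc: band_median(lc["flux"].to_numpy(), lc["band"].to_numpy(), band="g")
)
print(result)
```

The result has one value per object id, which matches the shape of the batch output shown above.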
Column Assignment#
The Ensemble’s dataframes support column assignment through the Pandas assign function. We can pass in either a callable or a series as the value of the new column. The name of the new column is taken from the keyword argument name.
For example, if we want to compute the lower bound of an error range as the estimated flux minus twice the estimated error, we would use:
[16]:
lower_bnd = ens.source.assign(lower_bnd=lambda x: x["flux"] - 2.0 * x["error"])
lower_bnd.head(5)
[16]:
| ra | dec | mjd | flux | error | band | lower_bnd | |
|---|---|---|---|---|---|---|---|
| #id | |||||||
| 13350 | 0.283437 | 1.178522 | 52253.185568 | 17.287 | 0.006 | i | 17.275 |
| 13350 | 0.283437 | 1.178522 | 52557.307400 | 17.300 | 0.006 | i | 17.288 |
| 13350 | 0.283437 | 1.178522 | 52578.203484 | 17.422 | 0.006 | i | 17.410 |
| 13350 | 0.283437 | 1.178522 | 52908.281967 | 17.708 | 0.007 | i | 17.694 |
| 13350 | 0.283437 | 1.178522 | 52911.295754 | 17.273 | 0.006 | i | 17.261 |
We can also assign our computed batch result as a new object column using the same methodology.
[17]:
ens.object.assign(g_average=res["result"])[["ra", "dec", "g_average"]].head(5)
[17]:
| ra | dec | g_average | |
|---|---|---|---|
| #id | |||
| 13350 | 0.283437 | 1.178522 | 17.826 |
| 20406 | 3.244369 | 0.218891 | 15.675 |
| 21992 | 4.315354 | 1.054582 | 15.152 |
| 46988 | 2.426843 | -0.562932 | 15.500 |
| 91658 | 0.846748 | -0.994204 | 16.155 |
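When a series is assigned, Pandas matches values to rows by index label rather than by position, which is why a per-object result lines up with the object table. A small plain-Pandas sketch with made-up values (the column name `ra_rad` is purely illustrative):

```python
import numpy as np
import pandas as pd

# Toy object table with made-up values.
obj = pd.DataFrame(
    {"ra": [0.28, 3.24, 4.32]},
    index=pd.Index([1, 2, 3], name="id"),
)

# A per-object result that is missing id 3 and is ordered differently.
res = pd.Series([15.7, 17.8], index=pd.Index([2, 1], name="id"))

out = obj.assign(
    ra_rad=lambda df: np.deg2rad(df["ra"]),  # callable: evaluated on the frame
    g_average=res,  # series: aligned on the index labels
)
print(out)
```

Objects without a matching entry in the series (id 3 here) receive NaN in the new column.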
Dask Tips#
Using persist() to Save Computation Time#
When calling compute(), all work needed to produce the in-memory result is performed. This work is reperformed each time compute() is called, leading to the potential to duplicate a lot of computational work, especially in exploratory notebooks where you’re testing different workflows. In such cases, it can be advantageous to call persist().
persist() returns a lazy view of a result but begins computing that result behind the scenes, so subsequent calls simply grab the persisted result rather than recomputing it. Because the persisted result is held in memory, persist() should only be used when your data can fit into memory.
[18]:
ens.source.persist() # persist performs all queued data loading tasks
ens.source.compute() # which allows compute to just pull the result immediately.
[18]:
| ra | dec | mjd | flux | error | band | |
|---|---|---|---|---|---|---|
| #id | ||||||
| 13350 | 0.283437 | 1.178522 | 53312.204767 | 18.133 | 0.016 | g |
| 13350 | 0.283437 | 1.178522 | 53314.205904 | 17.826 | 0.005 | g |
| 13350 | 0.283437 | 1.178522 | 53616.310269 | 18.134 | 0.013 | g |
| 13350 | 0.283437 | 1.178522 | 52934.213035 | 17.967 | 0.007 | r |
| 13350 | 0.283437 | 1.178522 | 52936.208961 | 17.816 | 0.006 | r |
| ... | ... | ... | ... | ... | ... | ... |
| 4992418 | 57.151443 | 0.892965 | 52558.478255 | 20.058 | 0.044 | u |
| 4992418 | 57.151443 | 0.892965 | 53657.478608 | 20.492 | 0.095 | u |
| 4992418 | 57.151443 | 0.892965 | 53622.479323 | 20.437 | 0.062 | u |
| 4992418 | 57.151443 | 0.892965 | 52935.376893 | 20.385 | 0.060 | u |
| 4992418 | 57.151443 | 0.892965 | 52578.349866 | 20.348 | 0.060 | u |
38510 rows × 6 columns
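The difference between repeated compute() calls and a persisted result can be illustrated with a toy model in pure Python. This is not Dask or TAPE code: the `Lazy` class is a hypothetical stand-in, and it runs eagerly rather than in the background. It also mirrors the fact that in plain Dask, persist() returns a new collection.

```python
class Lazy:
    """Toy stand-in for a lazy collection: holds a recipe, not a result."""

    def __init__(self, recipe):
        self._recipe = recipe

    def compute(self):
        # Re-runs the recipe from scratch on every call.
        return self._recipe()

    def persist(self):
        # Runs the recipe once and returns a new Lazy wrapping the stored value.
        value = self._recipe()
        return Lazy(lambda: value)


calls = {"n": 0}


def expensive_load():
    calls["n"] += 1  # count how often the "expensive" work actually runs
    return [1, 2, 3]


lazy = Lazy(expensive_load)
lazy.compute()
lazy.compute()  # the load runs a second time

persisted = lazy.persist()  # the load runs once more; the result is kept
persisted.compute()
persisted.compute()  # no further loads

print(calls["n"])  # prints 3
```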
Repartitioning#
With Dask and TAPE, data is stored in separate sub-containers called “partitions”. Dask has recommendations (https://docs.dask.org/en/stable/best-practices.html#dask-best-practices) for the optimal amount of data stored in a given partition. Even if the initial data follows these recommendations, filtering steps can leave partitions with very little data. In this case, it may be best to call repartition().
[19]:
ens.source.repartition(partition_size="100MB") # 100MBs is generally recommended
# In this case, we have a small set of data that easily fits into one partition
[19]:
| ra | dec | mjd | flux | error | band | |
|---|---|---|---|---|---|---|
| npartitions=1 | ||||||
| 4099 | float64 | float64 | float64 | float64 | float64 | string |
| 5011634 | ... | ... | ... | ... | ... | ... |
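A back-of-the-envelope estimate shows why a single partition is enough here. The sketch below assumes every value takes 8 bytes, which is a simplification (the band column holds strings), and uses the row and column counts reported for the source table earlier in this notebook.

```python
import math

n_rows = 38510  # source rows reported by the compute() output earlier
n_cols = 6
bytes_per_value = 8  # ASSUMPTION: treat every column as a 64-bit value
target_bytes = 100 * 1000**2  # "100MB" taken as 10**8 bytes

approx_bytes = n_rows * n_cols * bytes_per_value
n_partitions = max(1, math.ceil(approx_bytes / target_bytes))

print(f"~{approx_bytes / 1e6:.2f} MB -> {n_partitions} partition(s)")
```

At roughly 1.85 MB, the table is far below the 100MB target, so everything fits into one partition.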
Sampling#
In addition to filtering by specific constraints, it’s possible to select a subset of your data to work with. Ensemble.sample() will randomly select a fraction of objects from the full object list. This will return a new ensemble object to work with.
[20]:
subset_ens = ens.sample(frac=0.5) # select ~half of the objects
print("Number of pre-sampled objects: ", len(ens.object))
print("Number of post-sampled objects: ", len(subset_ens.object))
Number of pre-sampled objects: 103
Number of post-sampled objects: 52
For reproducible results, you can also specify a random seed via the random_state parameter. By re-using the same seed in your random_state, you can ensure that a given Ensemble will always be sampled the same way.
[21]:
subset_ens = ens.sample(
frac=0.2, # select a ~fifth of the objects
random_state=53783594, # set a random seed for reproducibility
)
print("Number of pre-sampled objects: ", len(ens.object))
print("Number of post-sampled objects: ", len(subset_ens.object))
Number of pre-sampled objects: 103
Number of post-sampled objects: 21
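The role of the seed can be checked with plain Pandas, whose sample() takes the same frac and random_state arguments. This sketch uses a made-up toy table and is not TAPE's implementation; the final step, which keeps only the source rows of the sampled objects, is an illustration of the idea that sampling objects should carry over to their sources.

```python
import pandas as pd

# Toy object table with 103 objects, and 3 made-up source rows per object.
obj = pd.DataFrame({"ra": range(103)}, index=pd.Index(range(103), name="id"))
source = pd.DataFrame(
    {"id": [i for i in range(103) for _ in range(3)], "flux": 18.0}
).set_index("id")

# Same fraction and same seed give the same selection of objects.
a = obj.sample(frac=0.2, random_state=53783594)
b = obj.sample(frac=0.2, random_state=53783594)
print(a.index.equals(b.index))

# Keep only the source rows that belong to the sampled objects.
sub_source = source[source.index.isin(a.index)]
print(len(a), len(sub_source))
```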
Note: Using Ensemble.sample to filter large datasets is not recommended, as it does not handle repartitioning. Instead, use partition slicing, shown below.
[22]:
# partition slicing
# specify a subset of partitions, propagates to the object table automatically
ens.source.partitions[0:1].update_ensemble()
[22]:
<tape.ensemble.Ensemble at 0x7f2050387040>
Saving Intermediate Results#
In some situations, you may find yourself running a given workflow many times. Due to the nature of lazy-computation, this will involve repeated execution of data I/O, pre-processing steps, initial analysis, etc. In these situations, it may be effective to instead save the ensemble state to disk after completion of these initial processing steps. To accomplish this, we can use the Ensemble.save_ensemble() function.
[23]:
ens.object.head(5)
[23]:
| ra | dec | rExt | d | rGC | uF | gF | rF | iF | zF | ... | iT | zA | z0 | zE | zT | nobs_g | nobs_i | nobs_r | nobs_u | nobs_z | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #id | |||||||||||||||||||||
| 13350 | 0.283437 | 1.178522 | 0.080 | 24.77 | 26.55 | 18.839 | 17.679 | 17.544 | 17.497 | 17.501 | ... | 116 | 0.583437 | 17.190782 | 54025.327901 | 114 | 65 | 65 | 65 | 65 | 65 |
| 20406 | 3.244369 | 0.218891 | 0.088 | 9.13 | 12.76 | 16.715 | 15.543 | 15.336 | 15.286 | 15.276 | ... | 102 | 0.303788 | 15.132053 | 54000.296412 | 100 | 65 | 65 | 65 | 65 | 65 |
| 21992 | 4.315354 | 1.054582 | 0.077 | 7.35 | 11.54 | 16.186 | 15.040 | 14.909 | 14.864 | 14.853 | ... | 111 | 0.619123 | 14.524697 | 53698.245861 | 114 | 79 | 79 | 79 | 79 | 79 |
| 46988 | 2.426843 | -0.562932 | 0.140 | 7.85 | 11.69 | 16.424 | 15.194 | 15.038 | 14.993 | 15.004 | ... | 112 | 0.540125 | 14.715373 | 52253.188970 | 108 | 65 | 65 | 65 | 65 | 65 |
| 91658 | 0.846748 | -0.994204 | 0.098 | 11.07 | 14.04 | 17.087 | 15.933 | 15.791 | 15.733 | 15.731 | ... | 119 | 0.510250 | 15.470862 | 54348.319546 | 110 | 65 | 65 | 65 | 65 | 65 |
5 rows × 42 columns
[24]:
ens.save_ensemble(".", "ensemble", additional_frames=False) # Saves to disk
Saved to ./ensemble
The above command creates an “ensemble” directory in the current working directory. This directory contains a subdirectory of parquet files for each EnsembleFrame object that was included in the additional_frames kwarg. Setting additional_frames to True or False saves all or none of the additional EnsembleFrame objects, respectively. The object frame (unless it has no columns) and the source frame are always saved.
From here, we can just load the ensemble from disk.
[25]:
new_ens = Ensemble()
new_ens.from_ensemble("./ensemble")
[25]:
<tape.ensemble.Ensemble at 0x7f1ffe163a90>