hats.catalog#
Catalog data wrappers
Submodules#
Classes#
AssociationCatalog | A HATS Catalog for enabling fast joins between two HATS catalogs
Catalog | A HATS Catalog with data stored in a HEALPix Hive partitioned structure
CatalogCollection | A collection of HATS catalogs with data stored in a HEALPix Hive partitioned structure
CatalogType | Enum for possible types of catalog
CollectionProperties | Container class for catalog metadata
Dataset | A base HATS dataset that contains a properties file and the data contained in parquet files
TableProperties | Container class for catalog metadata
IndexCatalog | An index into HATS Catalog for enabling fast lookups on non-spatial values.
MapCatalog | A HATS table to represent non-point-source data in a continuous map.
MarginCatalog | A HATS Catalog used to contain the 'margin' of another HATS catalog.
PartitionInfo | Container class for per-partition info.
Package Contents#
- class AssociationCatalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], catalog_path: str | pathlib.Path | upath.UPath | None = None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None, original_schema: pyarrow.Schema | None = None)[source]#
Bases: hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset
A HATS Catalog for enabling fast joins between two HATS catalogs
- class Catalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], catalog_path: str | pathlib.Path | upath.UPath | None = None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None, original_schema: pyarrow.Schema | None = None)[source]#
Bases: hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset
A HATS Catalog with data stored in a HEALPix Hive partitioned structure
Catalogs of this type are partitioned spatially, contain partition_info metadata specifying the pixels in the catalog, and conform on disk to the parquet partitioning structure Norder=/Dir=/Npix=.parquet
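The on-disk layout can be illustrated with a small path builder. This is a minimal sketch, not the library's implementation: the helper name `partition_path` is hypothetical, and the rule that `Dir` groups pixels in blocks of 10,000 is an assumption that this page does not state. The `npix_suffix` argument mirrors the `TableProperties.npix_suffix` field.

```python
def partition_path(base_dir: str, order: int, pixel: int, npix_suffix: str = ".parquet") -> str:
    """Build an Norder=/Dir=/Npix= path for one partition (hypothetical helper)."""
    # Assumption: Dir is the pixel number rounded down to a multiple of 10,000.
    directory = (pixel // 10_000) * 10_000
    return f"{base_dir.rstrip('/')}/Norder={order}/Dir={directory}/Npix={pixel}{npix_suffix}"


# A single-file partition, and a partition stored as a directory of files.
print(partition_path("my_catalog", 1, 44))
print(partition_path("my_catalog", 6, 31_313, npix_suffix="/"))
```

With `npix_suffix="/"` the result names a directory rather than a file, which matches the alternative layout described for `npix_suffix`.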
- generate_negative_tree_pixels() list[hats.pixel_math.HealpixPixel][source]#
Get the leaf nodes at each healpix order that have zero catalog data.
For example, if a catalog only had data points in pixel 0 at order 0, this method would return pixels 1 through 11 at order 0. This is used to get full sky coverage for margin caches.
- Returns:
- list[HealpixPixel]
List of HealpixPixels representing the ‘negative tree’ for the catalog.
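The idea of a "negative tree" can be sketched without the library. The code below is an illustration under stated assumptions, not the method's actual implementation: plain `(order, pixel)` tuples stand in for `HealpixPixel`, the function name `negative_tree` is hypothetical, and pixels use HEALPix NESTED numbering, where pixel `p` splits into children `4p` through `4p + 3` at the next order.

```python
from __future__ import annotations


def negative_tree(pixels: set[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return (order, pixel) leaves covering the sky area that `pixels` does not cover."""

    def has_descendant(order: int, pix: int) -> bool:
        # A deeper pixel (o, p) lies under (order, pix) when dropping
        # two bits per order of difference lands on pix.
        return any(o > order and (p >> (2 * (o - order))) == pix for o, p in pixels)

    result = []
    # Start from the 12 base pixels at order 0, visited in ascending order.
    stack = [(0, p) for p in reversed(range(12))]
    while stack:
        order, pix = stack.pop()
        if (order, pix) in pixels:
            continue  # covered by catalog data
        if has_descendant(order, pix):
            # Partly covered: split into the four children and look closer.
            stack.extend((order + 1, 4 * pix + k) for k in reversed(range(4)))
        else:
            result.append((order, pix))  # an empty leaf
    return result


print(negative_tree({(0, 0)}))
```

For a catalog with data only in pixel 0 at order 0, this yields pixels 1 through 11 at order 0, as in the example above.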
- class CatalogCollection(collection_path: upath.UPath, collection_properties: hats.catalog.dataset.collection_properties.CollectionProperties, main_catalog: hats.catalog.Catalog)[source]#
A collection of HATS catalogs with data stored in a HEALPix Hive partitioned structure
A collection is described by a collection.properties file, which specifies the paths of the underlying main catalog, margin catalogs, and index catalogs. These catalogs are stored at the root of the collection, each in its own directory:
catalog_collection/
├── main_catalog/
├── margin_catalog/
├── index_catalog/
├── collection.properties
Margin and index catalogs are optional, and a collection may contain more than one of each. The catalogs used by default are specified in the collection.properties file by the default_margin and default_index keywords.
- collection_path#
- collection_properties#
- main_catalog#
- property main_catalog_dir: upath.UPath#
Path to the main catalog directory
- property all_margins: list[str] | None#
The list of margin catalog names in the collection
- property default_margin: str | None#
The name of the default margin
- property default_margin_catalog_dir: upath.UPath | None#
Path to the default margin catalog directory
- property all_indexes: dict[str, str] | None#
The mapping of indexes in the collection
- property default_index_field: str | None#
The name of the default index field
- property default_index_catalog_dir: upath.UPath | None#
Path to the default index catalog directory
- get_index_dir_for_field(field_name: str | None = None) upath.UPath | None[source]#
Path to the field’s index catalog directory
- get_healpix_pixels() list[hats.pixel_math.HealpixPixel][source]#
The list of HEALPix pixels of the main catalog
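The directory lookups described for `CatalogCollection` can be sketched with plain paths. This is an illustration only: the property values are made up, the helper names are hypothetical, and it assumes that each catalog directory is the collection path joined with the name recorded in the properties (including `hats_primary_table_url` for the main catalog).

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Hypothetical values, as they might look after collection.properties is parsed.
properties = {
    "name": "my_collection",
    "hats_primary_table_url": "main_catalog",
    "all_margins": ["margin_5arcs", "margin_10arcs"],
    "default_margin": "margin_5arcs",
    "all_indexes": {"object_id": "index_object_id"},
    "default_index": "object_id",
}


def catalog_dir(collection_path: PurePosixPath, name: str | None) -> PurePosixPath | None:
    """Resolve a catalog directory at the root of the collection, or None if unset."""
    return collection_path / name if name else None


def index_dir_for_field(collection_path: PurePosixPath, props: dict, field_name: str | None = None):
    """Find the index directory for a field, falling back to the default index field."""
    field_name = field_name or props.get("default_index")
    indexes = props.get("all_indexes") or {}
    return catalog_dir(collection_path, indexes.get(field_name))


root = PurePosixPath("catalog_collection")
main_dir = catalog_dir(root, properties["hats_primary_table_url"])
margin_dir = catalog_dir(root, properties["default_margin"])
print(main_dir, margin_dir, index_dir_for_field(root, properties))
```

Returning `None` for a missing margin or index mirrors the `| None` return types of the properties listed above.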
- class CatalogType[source]#
Bases: str, enum.Enum
Enum for possible types of catalog
- OBJECT = 'object'#
- SOURCE = 'source'#
- ASSOCIATION = 'association'#
- INDEX = 'index'#
- MARGIN = 'margin'#
- MAP = 'map'#
- class CollectionProperties(/, **data: Any)[source]#
Bases: pydantic.BaseModel
Container class for catalog metadata
- name: str = None#
- hats_primary_table_url: str = None#
Reference to object catalog. Relevant for nested, margin, association, and index.
- all_margins: Annotated[list[str] | None, Field(default=None)]#
- default_margin: str | None = None#
- all_indexes: Annotated[dict[str, str] | None, Field(default=None)]#
- default_index: str | None = None#
- model_config#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- classmethod space_delimited_list(str_value: str) list[str][source]#
Convert a space-delimited list string into a python list of strings.
- Parameters:
- str_value: str
a space-delimited list string
- Returns:
- list[str]
a python list of strings
- classmethod index_tuples(str_value: str) dict[str, str][source]#
Convert a space-delimited list string into a python dict of strings.
- Parameters:
- str_value: str
a space-delimited list string
- Returns:
- dict[str, str]
a python dict of strings
- serialize_list_as_space_delimited_list(str_list: Iterable[str]) str[source]#
Convert a python list of strings into a space-delimited string.
- Parameters:
- str_list: Iterable[str]
a python list of strings
- Returns:
- str
a space-delimited string
- serialize_dict_as_space_delimited_list(str_dict: dict[str, str]) str[source]#
Convert a python dict of strings into a space-delimited string.
- Parameters:
- str_dict: dict[str, str]
a python dict of strings
- Returns:
- str
a space-delimited string
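The four conversion helpers above amount to a round trip between python containers and space-delimited strings. The sketch below shows one way this could work. The function names are hypothetical, and the dict encoding (alternating key and value tokens) is an assumption, since this page does not spell out the format.

```python
from __future__ import annotations

from typing import Iterable


def parse_list(str_value: str) -> list[str]:
    """Split a space-delimited string into a list of strings."""
    return str_value.split()


def parse_pairs(str_value: str) -> dict[str, str]:
    """Read alternating key/value tokens into a dict (assumed encoding)."""
    tokens = str_value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of tokens (key/value pairs)")
    return dict(zip(tokens[0::2], tokens[1::2]))


def serialize_list(str_list: Iterable[str]) -> str:
    """Join strings with single spaces."""
    return " ".join(str_list)


def serialize_pairs(str_dict: dict[str, str]) -> str:
    """Write a dict as alternating key/value tokens."""
    return " ".join(f"{key} {value}" for key, value in str_dict.items())


margins = parse_list("margin_5arcs margin_10arcs")
indexes = parse_pairs("object_id index_object_id ra index_ra")
print(margins, indexes)
```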
- check_allowed_and_required() typing_extensions.Self[source]#
Check that type-specific fields are appropriate, and required fields are set.
- check_default_margin_exists() typing_extensions.Self[source]#
Check that the default margin is in the list of all margins.
- check_default_index_exists() typing_extensions.Self[source]#
Check that the default index is in the list of all indexes.
- explicit_dict()[source]#
Create a dict, based on fields that have been explicitly set, and are not “extra” keys.
- classmethod read_from_dir(catalog_dir: str | pathlib.Path | upath.UPath) typing_extensions.Self[source]#
Read field values from a java-style properties file.
- Parameters:
- catalog_dir: str | Path | UPath
base directory of catalog.
- Returns:
- CollectionProperties
new object from the contents of a collection.properties file in the directory.
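For readers unfamiliar with java-style properties files, the following is a deliberately simplified reader. It is a sketch rather than the parser the library uses: it handles only `key=value` lines and `#` or `!` comments, and ignores escapes and line continuations. The file contents shown are hypothetical.

```python
from __future__ import annotations


def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and comments."""
    result = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line[0] in "#!":
            continue
        key, separator, value = line.partition("=")
        if not separator:
            continue  # not a key=value line; ignored in this sketch
        result[key.strip()] = value.strip()
    return result


# Hypothetical collection.properties contents.
example = """\
# collection metadata
name=my_collection
hats_primary_table_url=main_catalog
all_margins=margin_5arcs margin_10arcs
default_margin=margin_5arcs
"""

parsed = parse_properties(example)
print(parsed["all_margins"].split())
```

Values arrive as strings, which is why list-valued fields such as `all_margins` need a separate space-delimited conversion step.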
- class Dataset(catalog_info: hats.catalog.dataset.table_properties.TableProperties, catalog_path: str | pathlib.Path | upath.UPath | None = None, schema: pyarrow.Schema | None = None, original_schema: pyarrow.Schema | None = None)[source]#
A base HATS dataset that contains a properties file and the data contained in parquet files
- catalog_info#
- catalog_name#
- catalog_path = None#
- catalog_base_dir = None#
- schema = None#
- original_schema = None#
- property on_disk: bool#
Is the catalog stored on disk?
- aggregate_column_statistics(exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None)[source]#
Read footer statistics in parquet metadata, and report on global min/max values.
- Parameters:
- exclude_hats_columns: bool
exclude HATS spatial and partitioning fields from the statistics. Defaults to True.
- exclude_columns: list[str]
additional columns to exclude from the statistics.
- include_columns: list[str]
if specified, only return statistics for the column names provided. Defaults to None, and returns all non-hats columns.
- Returns:
- Dataframe
aggregated statistics.
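The aggregation step can be pictured as combining per-partition footer statistics into one row per column. The sketch below assumes the statistics have already been read from the parquet footers. The input records and the function name are hypothetical, and the real method returns a dataframe rather than a dict.

```python
from __future__ import annotations

# Hypothetical footer statistics, one entry per (partition, column).
partition_stats = [
    {"column": "ra", "min_value": 10.0, "max_value": 20.0, "null_count": 0, "row_count": 100},
    {"column": "ra", "min_value": 5.5, "max_value": 18.0, "null_count": 2, "row_count": 50},
    {"column": "mag", "min_value": 14.1, "max_value": 21.3, "null_count": 7, "row_count": 100},
]


def aggregate_statistics(stats: list[dict]) -> dict[str, dict]:
    """Combine per-partition statistics into global values per column."""
    totals: dict[str, dict] = {}
    for entry in stats:
        current = totals.get(entry["column"])
        if current is None:
            # First partition seen for this column: copy its statistics.
            totals[entry["column"]] = {k: v for k, v in entry.items() if k != "column"}
            continue
        # Global min is the min of mins, global max the max of maxes; counts add up.
        current["min_value"] = min(current["min_value"], entry["min_value"])
        current["max_value"] = max(current["max_value"], entry["max_value"])
        current["null_count"] += entry["null_count"]
        current["row_count"] += entry["row_count"]
    return totals


aggregated = aggregate_statistics(partition_stats)
print(aggregated["ra"])
```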
- per_pixel_statistics(exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None, include_stats: list[str] = None, multi_index=False)[source]#
Read footer statistics in parquet metadata, and report on statistics about each pixel partition.
- Parameters:
- exclude_hats_columns: bool
exclude HATS spatial and partitioning fields from the statistics. Defaults to True.
- exclude_columns: list[str]
additional columns to exclude from the statistics.
- include_columns: list[str]
if specified, only return statistics for the column names provided. Defaults to None, and returns all non-hats columns.
- include_stats: list[str]
if specified, only return the kinds of values from list (min_value, max_value, null_count, row_count). Defaults to None, and returns all values.
- multi_index: bool
should the returned frame be created with a multi-index, first on pixel, then on column name? Defaults to False, which indexes on pixel, with a separate column per data-column and stat value combination.
- Returns:
- Dataframe
all statistics.
- class TableProperties(/, **data: Any)[source]#
Bases: pydantic.BaseModel
Container class for catalog metadata
- catalog_name: str = None#
- catalog_type: hats.catalog.catalog_type.CatalogType = None#
- total_rows: int | None = None#
- ra_column: str | None = None#
- dec_column: str | None = None#
- default_columns: list[str] | None = None#
Which columns should be read from parquet files, when user doesn’t otherwise specify.
- healpix_column: str | None = None#
Column name that provides a spatial index of healpix values at some fixed, high order. A typical value would be _healpix_29, but can vary.
- healpix_order: int | None = None#
The fixed, high order of the spatial index of healpix values in hats_col_healpix. A typical value would be 29, but can vary.
- primary_catalog: str | None = None#
Reference to object catalog. Relevant for nested, margin, association, and index.
- margin_threshold: float | None = None#
Threshold of the pixel boundary, expressed in arcseconds.
- primary_column: str | None = None#
Column name in the primary (left) side of join.
- primary_column_association: str | None = None#
Column name in the association table that matches the primary (left) side of join.
- join_catalog: str | None = None#
Catalog name for the joining (right) side of association.
- join_column: str | None = None#
Column name in the joining (right) side of join.
- join_column_association: str | None = None#
Column name in the association table that matches the joining (right) side of join.
- assn_max_separation: float | None = None#
The maximum separation between two points in an association catalog, expressed in arcseconds.
- contains_leaf_files: bool | None = None#
Whether or not the association catalog contains leaf parquet files.
- indexing_column: str | None = None#
Column that we provide an index over.
- extra_columns: list[str] | None = None#
Any additional payload columns included in index.
- npix_suffix: str = None#
Suffix of the Npix partitions. In the standard HATS directory structure, this is '.parquet' because there is a single file in each Npix partition and it is named like 'Npix=313.parquet'. Other valid directory structures include those with the same single file per partition but which use a different suffix (e.g., 'npix_suffix' = '.parq' or '.snappy.parquet'), and also those in which the Npix partitions are actually directories containing 1+ files underneath (and then 'npix_suffix' = '/').
- skymap_order: int | None = None#
Nested Order of the healpix skymap stored in the default skymap.fits.
- skymap_alt_orders: list[int] | None = None#
Nested Order (K) of the healpix skymaps stored in alternative skymap.K.fits files.
- model_config#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- classmethod space_delimited_list(str_value: str) list[str][source]#
Convert a space-delimited list string into a python list of strings.
- Parameters:
- str_value: str
a space-delimited list string
- Returns:
- list[str]
python list of strings
- classmethod space_delimited_int_list(str_value: str | list[int]) list[int][source]#
Convert a space-delimited list string into a python list of integers.
- Parameters:
- str_value: str | list[int]
string representation of a list of integers, delimited by space, comma, or semicolon, or a list of integers.
- Returns:
- list[int]
a python list of integers
- Raises:
- ValueError
if any non-digit characters are encountered
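The behaviour described for `space_delimited_int_list` can be approximated with a regular-expression split. This is a sketch of the documented contract (space, comma, or semicolon delimiters, lists passed through, `ValueError` on non-digit characters), not the library's code. The name `parse_int_list` is hypothetical.

```python
from __future__ import annotations

import re


def parse_int_list(str_value: str | list[int]) -> list[int]:
    """Parse integers delimited by spaces, commas, or semicolons."""
    if isinstance(str_value, list):
        return list(str_value)  # already a list; return a copy
    tokens = [token for token in re.split(r"[\s,;]+", str_value) if token]
    for token in tokens:
        if not token.isdigit():
            raise ValueError(f"non-digit characters in {token!r}")
    return [int(token) for token in tokens]


print(parse_int_list("2 4,6;8"))
```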
- serialize_as_space_delimited_list(str_list: Iterable) str[source]#
Convert a python list of strings into a space-delimited string.
- Parameters:
- str_list: Iterable
a python list of strings
- Returns:
- str
a space-delimited string.
- check_required() typing_extensions.Self[source]#
Check that type-specific fields are appropriate, and required fields are set.
- copy_and_update(**kwargs)[source]#
Create a validated copy of these table properties, updating the fields provided in kwargs.
- Parameters:
- **kwargs
values to update
- Returns:
- TableProperties
new instance of properties object
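The "validated copy" pattern can be shown with a standard-library dataclass. This is an analogy only: `TableProperties` is a pydantic model, whereas the class below is a hypothetical stand-in with made-up validation rules. The point it illustrates is that the copy goes through validation again and the original object is left untouched.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SketchProperties:
    """Hypothetical stand-in for a validated properties object."""

    catalog_name: str
    catalog_type: str
    total_rows: Optional[int] = None

    def __post_init__(self):
        # Made-up rules, only to show that validation runs on every construction.
        if not self.catalog_name:
            raise ValueError("catalog_name is required")
        if self.total_rows is not None and self.total_rows < 0:
            raise ValueError("total_rows must be non-negative")

    def copy_and_update(self, **kwargs) -> "SketchProperties":
        # replace() builds a new instance via __init__, so __post_init__ re-validates.
        return replace(self, **kwargs)


original = SketchProperties(catalog_name="my_catalog", catalog_type="object", total_rows=10)
updated = original.copy_and_update(total_rows=25)
print(original.total_rows, updated.total_rows)
```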
- explicit_dict(by_alias=False, exclude_none=True)[source]#
Create a dict, based on fields that have been explicitly set, and are not “extra” keys.
- Parameters:
- by_alias: bool
(Default value = False)
- exclude_none: bool
(Default value = True)
- Returns:
- dict
all keys that are attributes of this class and not “extra”.
- extra_dict(by_alias=False, exclude_none=True)[source]#
Create a dict, based on fields that are “extra” keys.
- Parameters:
- by_alias: bool
(Default value = False)
- exclude_none: bool
(Default value = True)
- Returns:
- dict
all keys that are not attributes of this class, e.g. “extra”.
- classmethod read_from_dir(catalog_dir: str | pathlib.Path | upath.UPath) typing_extensions.Self[source]#
Read field values from a java-style properties file.
- Parameters:
- catalog_dir: str | Path | UPath
path to a catalog directory.
- Returns:
- TableProperties
object created from the contents of a hats.properties file in the given directory
- to_properties_file(catalog_dir: str | pathlib.Path | upath.UPath)[source]#
Write fields to a java-style properties file.
- Parameters:
- catalog_dir: str | Path | UPath
directory to write the file
- static new_provenance_dict(path: str | pathlib.Path | upath.UPath | None = None, builder: str | None = None, **kwargs) dict[source]#
Constructs the provenance properties for a HATS catalog.
- Parameters:
- path: str | Path | UPath | None
The path to the catalog directory.
- builder: str | None
The name and version of the tool that created the catalog.
- **kwargs
Additional properties to include/override in the dictionary.
- Returns:
- dict
A dictionary with properties for the HATS catalog.
- class IndexCatalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, catalog_path: str | pathlib.Path | upath.UPath | None = None, schema: pyarrow.Schema | None = None, original_schema: pyarrow.Schema | None = None)[source]#
Bases: hats.catalog.dataset.Dataset
An index into HATS Catalog for enabling fast lookups on non-spatial values.
Note that this is not a true “HATS Catalog”, as it is not partitioned spatially.
- loc_partitions(ids) list[hats.pixel_math.HealpixPixel][source]#
Find the set of partitions in the primary catalog for the ids provided.
- Parameters:
- ids
id values to look up in the primary catalog
- Returns:
- list[HealpixPixel]
partitions of leaf parquet files in the primary catalog that may contain rows for the id values
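Conceptually, an index lookup maps id values to the partitions that may hold them. The sketch below uses an in-memory dict in place of the index catalog's parquet data. The ids, the `(Norder, Npix)` tuples, and the function name are all hypothetical.

```python
from __future__ import annotations

# Hypothetical index content: id value -> (Norder, Npix) of the partition holding it.
index_rows = {
    "obj_001": (1, 44),
    "obj_002": (1, 44),
    "obj_003": (2, 180),
    "obj_004": (0, 11),
}


def loc_partitions_sketch(index: dict[str, tuple[int, int]], ids: list[str]) -> list[tuple[int, int]]:
    """Return the distinct partitions that may contain rows for the given ids."""
    found = {index[id_value] for id_value in ids if id_value in index}
    return sorted(found)


print(loc_partitions_sketch(index_rows, ["obj_001", "obj_002", "obj_003", "missing"]))
```

Ids that share a partition collapse to a single entry, so only the matching leaf files need to be read afterwards.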
- class MapCatalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], catalog_path: str | pathlib.Path | upath.UPath | None = None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None, original_schema: pyarrow.Schema | None = None)[source]#
Bases: hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset
A HATS table to represent non-point-source data in a continuous map.
- class MarginCatalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], catalog_path: str | pathlib.Path | upath.UPath | None = None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None, original_schema: pyarrow.Schema | None = None)[source]#
Bases: hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset
A HATS Catalog used to contain the ‘margin’ of another HATS catalog.
Catalogs of this type are used alongside a primary catalog and contain the margin points for each HEALPix pixel: any points that are within a certain distance of the HEALPix pixel boundary. This ensures that spatial operations such as crossmatching can be performed efficiently while maintaining accuracy.
- filter_by_moc(moc: mocpy.MOC) typing_extensions.Self[source]#
Filter the pixels in the margin catalog to only include the margin pixels that overlap with the moc
For the case of margin pixels, this includes any pixels whose margin areas may overlap with the moc. This is not always done with a high accuracy, but always includes any pixels that will overlap, and may include extra partitions that do not.
- Parameters:
- moc: mocpy.MOC
the moc to filter by
- Returns:
- MarginCatalog
A new margin catalog with only the pixels that overlap or that have margin area that overlap with the moc. Note that we reset the total_rows to None, as updating would require a scan over the new pixel sizes.
- class PartitionInfo(pixel_list: list[hats.pixel_math.healpix_pixel.HealpixPixel], catalog_base_dir: str = None)[source]#
Container class for per-partition info.
- METADATA_ORDER_COLUMN_NAME = 'Norder'#
- METADATA_PIXEL_COLUMN_NAME = 'Npix'#
- pixel_list#
- catalog_base_dir = None#
- get_healpix_pixels() list[hats.pixel_math.healpix_pixel.HealpixPixel][source]#
Get healpix pixel objects for all pixels represented as partitions.
- Returns:
- list[HealpixPixel]
List of HealpixPixel
- get_highest_order() int[source]#
Get the highest healpix order for the dataset.
- Returns:
- int
int representing highest order.
- write_to_file(partition_info_file: str | pathlib.Path | upath.UPath | None = None, catalog_path: str | pathlib.Path | upath.UPath | None = None)[source]#
Write all partition data to CSV file.
If no paths are provided, the catalog base directory from the read_from_dir call is used.
- Parameters:
- partition_info_file: str | Path | UPath | None
path to where the partition_info.csv file will be written.
- catalog_path: str | Path | UPath | None
base directory for a catalog where the partition_info.csv file will be written.
- Raises:
- ValueError
if no path is provided, and could not be inferred.
- classmethod read_from_dir(catalog_base_dir: str | pathlib.Path | upath.UPath | None, compute_from_catalog: bool = False) PartitionInfo[source]#
Read partition info from a file within a hats directory.
This will look for a partition_info.csv file and, if it is not found, for a _metadata file. The second approach is typically slower for large catalogs, so a warning is issued to the user. In internal testing with large catalogs, the first approach takes less than a second, while the second can take 10-20 seconds.
If neither file is found, and compute_from_catalog is set to True, the partition info will be computed from the individual catalog files. This is the slowest approach, and a warning is issued to the user. In internal testing with large catalogs, this approach can take (??) time.
- Parameters:
- catalog_base_dir: str | Path | UPath | None
Path to the root directory of the catalog
- compute_from_catalog: bool
Whether to compute partition info from catalog files if no metadata or partition info file is found.
- Returns:
- PartitionInfo
A PartitionInfo object with the data from the file
- Raises:
- FileNotFoundError
if neither desired file is found in the catalog_base_dir
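The lookup order described for `read_from_dir` can be sketched as a chain of existence checks. This is not the library's implementation: the function name is hypothetical, it only reports which source would be used, and the file locations (`partition_info.csv` at the catalog root, `_metadata` under a `dataset/` subdirectory) are assumptions.

```python
import tempfile
import warnings
from pathlib import Path


def choose_partition_source(catalog_base_dir, compute_from_catalog: bool = False) -> str:
    """Report which source partition info would be read from, in order of preference."""
    base = Path(catalog_base_dir)
    if (base / "partition_info.csv").exists():
        return "partition_info.csv"  # fastest option
    if (base / "dataset" / "_metadata").exists():  # assumed location
        warnings.warn("Reading partitions from _metadata, which is slower for large catalogs.")
        return "_metadata"
    if compute_from_catalog:
        warnings.warn("Computing partitions from catalog files, which is the slowest option.")
        return "catalog files"
    raise FileNotFoundError(f"No partition info found in {base}")


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "partition_info.csv").touch()
    print(choose_partition_source(tmp))
```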
- classmethod read_from_file(metadata_file: str | pathlib.Path | upath.UPath) PartitionInfo[source]#
Read partition info from a _metadata file to create an object
- Parameters:
- metadata_file: str | Path | UPath
path to the _metadata file
- Returns:
- PartitionInfo
A PartitionInfo object with the data from the file
- classmethod read_from_csv(partition_info_file: str | pathlib.Path | upath.UPath) PartitionInfo[source]#
Read partition info from a partition_info.csv file to create an object
- Parameters:
- partition_info_file: str | Path | UPath
path to the partition_info.csv file
- Returns:
- PartitionInfo
A PartitionInfo object with the data from the file
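A partition_info.csv round trip can be illustrated with the standard `csv` module. This sketch assumes the file has one row per partition with `Norder` and `Npix` columns (the two column-name constants listed for this class). The real file may carry additional columns, and the helper names are hypothetical.

```python
from __future__ import annotations

import csv
import io

ORDER_COLUMN = "Norder"
PIXEL_COLUMN = "Npix"


def write_partition_csv(pixels: list[tuple[int, int]]) -> str:
    """Serialize (order, pixel) tuples as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([ORDER_COLUMN, PIXEL_COLUMN])
    writer.writerows(pixels)
    return buffer.getvalue()


def read_partition_csv(text: str) -> list[tuple[int, int]]:
    """Parse CSV text back into (order, pixel) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(int(row[ORDER_COLUMN]), int(row[PIXEL_COLUMN])) for row in reader]


pixels = [(0, 11), (1, 44), (2, 180)]
csv_text = write_partition_csv(pixels)
restored = read_partition_csv(csv_text)
highest_order = max(order for order, _ in restored)
print(csv_text, highest_order)
```

The last line of the example corresponds to what `get_highest_order` reports: the largest order among the partitions.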
- as_dataframe()[source]#
Construct a pandas dataframe for the partition info pixels.
- Returns:
- pd.DataFrame
Pandas Dataframe with order, directory, and pixel info.
- classmethod from_healpix(healpix_pixels: list[hats.pixel_math.healpix_pixel.HealpixPixel]) PartitionInfo[source]#
Create a partition info object from a list of constituent healpix pixels.
- Parameters:
- healpix_pixels: list[HealpixPixel]
a list of constituent healpix pixels
- Returns:
- PartitionInfo
A PartitionInfo object with the same healpix pixels