hats.catalog#

Catalog data wrappers

Classes#

AssociationCatalog

A HATS Catalog for enabling fast joins between two HATS catalogs

Catalog

A HATS Catalog with data stored in a HEALPix Hive partitioned structure

CatalogCollection

A collection of HATS Catalogs with data stored in a HEALPix Hive partitioned structure

CatalogType

Enum for possible types of catalog

CollectionProperties

Container class for catalog metadata

Dataset

A base HATS dataset that contains a properties file and the data contained in parquet files

TableProperties

Container class for catalog metadata

IndexCatalog

An index into a HATS Catalog, enabling fast lookups on non-spatial values.

MapCatalog

A HATS table to represent non-point-source data in a continuous map.

MarginCatalog

A HATS Catalog used to contain the 'margin' of another HATS catalog.

PartitionInfo

Container class for per-partition info.

Package Contents#

class AssociationCatalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], catalog_path: str | pathlib.Path | upath.UPath | None = None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None, original_schema: pyarrow.Schema | None = None)[source]#

Bases: hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset

A HATS Catalog for enabling fast joins between two HATS catalogs

class Catalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], catalog_path: str | pathlib.Path | upath.UPath | None = None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None, original_schema: pyarrow.Schema | None = None)[source]#

Bases: hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset

A HATS Catalog with data stored in a HEALPix Hive partitioned structure

Catalogs of this type are partitioned spatially, contain partition_info metadata specifying the pixels in Catalog, and on disk conform to the parquet partitioning structure Norder=/Dir=/Npix=.parquet
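The Norder=/Dir=/Npix= layout above can be illustrated with a small path-building sketch. This is a hypothetical helper, not library code, and the rule of grouping 10,000 pixels per Dir directory is an assumption of this sketch:

```python
def pixel_path(base: str, order: int, pixel: int, npix_suffix: str = ".parquet") -> str:
    """Build a partition path in the Norder=/Dir=/Npix= convention.

    Assumes Dir groups pixels in blocks of 10,000 (an assumption here).
    """
    dir_num = (pixel // 10_000) * 10_000
    return f"{base}/Norder={order}/Dir={dir_num}/Npix={pixel}{npix_suffix}"

pixel_path("my_catalog", 5, 313)  # 'my_catalog/Norder=5/Dir=0/Npix=313.parquet'
```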

generate_negative_tree_pixels() list[hats.pixel_math.HealpixPixel][source]#

Get the leaf nodes at each healpix order that have zero catalog data.

For example, if a catalog only has data points in pixel 0 at order 0, this method returns order 0’s pixels 1 through 11. Used to obtain full coverage for margin caches.

Returns:
list[HealpixPixel]

List of HealpixPixels representing the ‘negative tree’ for the catalog.
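The single-order case from the example above can be sketched in pure Python (a hypothetical helper, not the library's implementation, which recurses through all orders of the pixel tree). At order 0 the sphere is divided into 12 HEALPix pixels:

```python
def negative_tree_order0(pixels_with_data):
    """Return the order-0 HEALPix pixels (0..11) with no catalog data.

    Single-order sketch of the 'negative tree' idea.
    """
    occupied = set(pixels_with_data)
    return [p for p in range(12) if p not in occupied]

# A catalog with data only in pixel 0 at order 0:
negative_tree_order0([0])  # pixels 1 through 11
```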

class CatalogCollection(collection_path: upath.UPath, collection_properties: hats.catalog.dataset.collection_properties.CollectionProperties, main_catalog: hats.catalog.Catalog)[source]#

A collection of HATS Catalogs with data stored in a HEALPix Hive partitioned structure

Catalogs of this type are described by a collection.properties file, which specifies the paths of the underlying main catalog, margin catalogs, and index catalogs. These catalogs are stored at the root of the collection, each in its own directory:

catalog_collection/
├── main_catalog/
├── margin_catalog/
├── index_catalog/
├── collection.properties

Margin and index catalogs are optional, and a collection may contain more than one of each. The catalogs used by default are specified by the default_margin and default_index keywords in the collection.properties file.
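As a sketch, a java-style collection.properties file and a minimal parser for it might look like the following. The key names mirror those documented here; the values are illustrative only:

```python
SAMPLE = """\
# collection.properties (illustrative values)
name=my_collection
all_margins=margin_10arcs margin_5arcs
default_margin=margin_10arcs
default_index=objid
"""

def parse_properties(text: str) -> dict[str, str]:
    """Parse java-style key=value lines, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

parse_properties(SAMPLE)["default_margin"]  # 'margin_10arcs'
```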

collection_path#
collection_properties#
main_catalog#
property main_catalog_dir: upath.UPath#

Path to the main catalog directory

property all_margins: list[str] | None#

The list of margin catalog names in the collection

property default_margin: str | None#

The name of the default margin

property default_margin_catalog_dir: upath.UPath | None#

Path to the default margin catalog directory

property all_indexes: dict[str, str] | None#

The mapping of indexes in the collection

property default_index_field: str | None#

The name of the default index field

property default_index_catalog_dir: upath.UPath | None#

Path to the default index catalog directory

get_index_dir_for_field(field_name: str | None = None) upath.UPath | None[source]#

Path to the field’s index catalog directory

get_healpix_pixels() list[hats.pixel_math.HealpixPixel][source]#

The list of HEALPix pixels of the main catalog

class CatalogType[source]#

Bases: str, enum.Enum

Enum for possible types of catalog

OBJECT = 'object'#
SOURCE = 'source'#
ASSOCIATION = 'association'#
INDEX = 'index'#
MARGIN = 'margin'#
MAP = 'map'#
classmethod all_types()[source]#

Fetch a list of all catalog types
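The enum members above can be reproduced in a short sketch; the body of all_types is one plausible implementation, not necessarily the library's own:

```python
from enum import Enum

class CatalogType(str, Enum):
    """Sketch of the documented str-valued enum of catalog types."""
    OBJECT = "object"
    SOURCE = "source"
    ASSOCIATION = "association"
    INDEX = "index"
    MARGIN = "margin"
    MAP = "map"

    @classmethod
    def all_types(cls):
        # One plausible reading of 'fetch a list of all catalog types'
        return [t.value for t in cls]

CatalogType.all_types()  # ['object', 'source', 'association', 'index', 'margin', 'map']
```

Because the enum subclasses str, members compare equal to their string values, which is convenient when reading a catalog_type from a properties file.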

class CollectionProperties(/, **data: Any)[source]#

Bases: pydantic.BaseModel

Container class for catalog metadata

name: str = None#
hats_primary_table_url: str = None#

Reference to object catalog. Relevant for nested, margin, association, and index.

all_margins: Annotated[list[str] | None, Field(default=None)]#
default_margin: str | None = None#
all_indexes: Annotated[dict[str, str] | None, Field(default=None)]#
default_index: str | None = None#
model_config#

Configuration for the model; should be a dictionary conforming to pydantic’s ConfigDict.

classmethod space_delimited_list(str_value: str) list[str][source]#

Convert a space-delimited list string into a python list of strings.

Parameters:
str_value: str

a space-delimited list string

Returns:
list[str]

a python list of strings

classmethod index_tuples(str_value: str) dict[str, str][source]#

Convert a space-delimited string into a python dict of strings.

Parameters:
str_value: str

a space-delimited list string

Returns:
dict[str, str]

a python dict of strings
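The two validators above can be sketched as plain functions. Note the pairing format assumed by index_tuples — alternating field/catalog-name tokens — is an assumption of this sketch, not a documented on-disk format:

```python
def space_delimited_list(str_value: str) -> list[str]:
    """Split a space-delimited string into a list of strings."""
    return str_value.split()

def index_tuples(str_value: str) -> dict[str, str]:
    """Pair alternating tokens of a space-delimited string into a dict.

    Assumes alternating 'field catalog_name' tokens (an assumption here).
    """
    tokens = str_value.split()
    return dict(zip(tokens[::2], tokens[1::2]))

space_delimited_list("a b c")      # ['a', 'b', 'c']
index_tuples("objid index_objid")  # {'objid': 'index_objid'}
```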

serialize_list_as_space_delimited_list(str_list: Iterable[str]) str[source]#

Convert a python list of strings into a space-delimited string.

Parameters:
str_list: Iterable[str]

a python list of strings

Returns:
str

a space-delimited string

serialize_dict_as_space_delimited_list(str_dict: dict[str, str]) str[source]#

Convert a python dict of strings into a space-delimited string.

Parameters:
str_dict: dict[str, str]

a python dict of strings

Returns:
str

a space-delimited string

check_allowed_and_required() typing_extensions.Self[source]#

Check that type-specific fields are appropriate, and required fields are set.

check_default_margin_exists() typing_extensions.Self[source]#

Check that the default margin is in the list of all margins.

check_default_index_exists() typing_extensions.Self[source]#

Check that the default index is in the list of all indexes.

explicit_dict()[source]#

Create a dict, based on fields that have been explicitly set, and are not “extra” keys.

__str__()[source]#

Friendly string representation based on named fields.

classmethod read_from_dir(catalog_dir: str | pathlib.Path | upath.UPath) typing_extensions.Self[source]#

Read field values from a java-style properties file.

Parameters:
catalog_dir: str | Path | UPath

base directory of catalog.

Returns:
CollectionProperties

new object from the contents of a collection.properties file in the directory.

to_properties_file(catalog_dir: str | pathlib.Path | upath.UPath)[source]#

Write fields to a java-style properties file.

Parameters:
catalog_dir: str | Path | UPath

base directory of catalog.

class Dataset(catalog_info: hats.catalog.dataset.table_properties.TableProperties, catalog_path: str | pathlib.Path | upath.UPath | None = None, schema: pyarrow.Schema | None = None, original_schema: pyarrow.Schema | None = None)[source]#

A base HATS dataset that contains a properties file and the data contained in parquet files

catalog_info#
catalog_name#
catalog_path = None#
catalog_base_dir = None#
schema = None#
original_schema = None#
property on_disk: bool#

Is the catalog stored on disk?

aggregate_column_statistics(exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None)[source]#

Read footer statistics in parquet metadata, and report on global min/max values.

Parameters:
exclude_hats_columns: bool

exclude HATS spatial and partitioning fields from the statistics. Defaults to True.

exclude_columns: list[str]

additional columns to exclude from the statistics.

include_columns: list[str]

if specified, only return statistics for the column names provided. Defaults to None, and returns all non-hats columns.

Returns:
Dataframe

aggregated statistics.

per_pixel_statistics(exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None, include_stats: list[str] = None, multi_index=False)[source]#

Read footer statistics in parquet metadata, and report on statistics about each pixel partition.

Parameters:
exclude_hats_columns: bool

exclude HATS spatial and partitioning fields from the statistics. Defaults to True.

exclude_columns: list[str]

additional columns to exclude from the statistics.

include_columns: list[str]

if specified, only return statistics for the column names provided. Defaults to None, and returns all non-hats columns.

include_stats: list[str]

if specified, only return the kinds of values from list (min_value, max_value, null_count, row_count). Defaults to None, and returns all values.

multi_index: bool

should the returned frame use a multi-index, first on pixel, then on column name? Defaults to False, which indexes on pixel only, with a separate column for each data-column and stat-value combination.

Returns:
Dataframe

all statistics.

class TableProperties(/, **data: Any)[source]#

Bases: pydantic.BaseModel

Container class for catalog metadata

catalog_name: str = None#
catalog_type: hats.catalog.catalog_type.CatalogType = None#
total_rows: int | None = None#
ra_column: str | None = None#
dec_column: str | None = None#
default_columns: list[str] | None = None#

Which columns should be read from parquet files when the user doesn’t otherwise specify.

healpix_column: str | None = None#

Column name that provides a spatial index of healpix values at some fixed, high order. A typical value would be _healpix_29, but can vary.

healpix_order: int | None = None#

The fixed, high order of the spatial index of healpix values in hats_col_healpix. A typical value is 29, but it can vary.

primary_catalog: str | None = None#

Reference to object catalog. Relevant for nested, margin, association, and index.

margin_threshold: float | None = None#

Threshold of the pixel boundary, expressed in arcseconds.

primary_column: str | None = None#

Column name in the primary (left) side of join.

primary_column_association: str | None = None#

Column name in the association table that matches the primary (left) side of join.

join_catalog: str | None = None#

Catalog name for the joining (right) side of association.

join_column: str | None = None#

Column name in the joining (right) side of join.

join_column_association: str | None = None#

Column name in the association table that matches the joining (right) side of join.

assn_max_separation: float | None = None#

The maximum separation between two points in an association catalog, expressed in arcseconds.

contains_leaf_files: bool | None = None#

Whether or not the association catalog contains leaf parquet files.

indexing_column: str | None = None#

Column that we provide an index over.

extra_columns: list[str] | None = None#

Any additional payload columns included in index.

npix_suffix: str = None#

Suffix of the Npix partitions. In the standard HATS directory structure, this is '.parquet' because there is a single file in each Npix partition and it is named like 'Npix=313.parquet'. Other valid directory structures include those with the same single file per partition but which use a different suffix (e.g., 'npix_suffix' = '.parq' or '.snappy.parquet'), and also those in which the Npix partitions are actually directories containing 1+ files underneath (and then 'npix_suffix' = '/').

skymap_order: int | None = None#

Nested Order of the healpix skymap stored in the default skymap.fits.

skymap_alt_orders: list[int] | None = None#

Nested Order (K) of the healpix skymaps stored in alternative skymap.K.fits files.

model_config#

Configuration for the model; should be a dictionary conforming to pydantic’s ConfigDict.

classmethod space_delimited_list(str_value: str) list[str][source]#

Convert a space-delimited list string into a python list of strings.

Parameters:
str_value: str

a space-delimited list string

Returns:
list[str]

python list of strings

classmethod space_delimited_int_list(str_value: str | list[int]) list[int][source]#

Convert a space-delimited list string into a python list of integers.

Parameters:
str_value: str | list[int]

string representation of a list of integers, delimited by space, comma, or semicolon, or a list of integers.

Returns:
list[int]

a python list of integers

Raises:
ValueError

if any non-digit characters are encountered
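The documented behavior — space, comma, or semicolon delimiters, list pass-through, and a ValueError on non-digit tokens — can be sketched like so (a hypothetical re-implementation, not the library's source):

```python
import re

def space_delimited_int_list(value):
    """Parse a string of integers delimited by space, comma, or semicolon.

    Passes an already-parsed list of ints through unchanged; raises
    ValueError on non-digit tokens.
    """
    if isinstance(value, list):
        return value
    tokens = [t for t in re.split(r"[\s,;]+", value.strip()) if t]
    if any(not t.isdigit() for t in tokens):
        raise ValueError(f"invalid integer list: {value!r}")
    return [int(t) for t in tokens]

space_delimited_int_list("4, 7; 9")  # [4, 7, 9]
```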

serialize_as_space_delimited_list(str_list: Iterable) str[source]#

Convert a python list of strings into a space-delimited string.

Parameters:
str_list: Iterable

a python list of strings

Returns:
str

a space-delimited string.

check_required() typing_extensions.Self[source]#

Check that type-specific fields are appropriate, and required fields are set.

copy_and_update(**kwargs)[source]#

Create a validated copy of these table properties, updating the fields provided in kwargs.

Parameters:
**kwargs

values to update

Returns:
TableProperties

new instance of properties object

explicit_dict(by_alias=False, exclude_none=True)[source]#

Create a dict, based on fields that have been explicitly set, and are not “extra” keys.

Parameters:
by_alias: bool

(Default value = False)

exclude_none: bool

(Default value = True)

Returns:
dict

all keys that are attributes of this class and not “extra”.

extra_dict(by_alias=False, exclude_none=True)[source]#

Create a dict, based on fields that are “extra” keys.

Parameters:
by_alias: bool

(Default value = False)

exclude_none: bool

(Default value = True)

Returns:
dict

all keys that are not attributes of this class, i.e. the “extra” keys.

__repr__()[source]#
__str__()[source]#

Friendly string representation based on named fields.

classmethod read_from_dir(catalog_dir: str | pathlib.Path | upath.UPath) typing_extensions.Self[source]#

Read field values from a java-style properties file.

Parameters:
catalog_dir: str | Path | UPath

path to a catalog directory.

Returns:
TableProperties

object created from the contents of a hats.properties file in the given directory

to_properties_file(catalog_dir: str | pathlib.Path | upath.UPath)[source]#

Write fields to a java-style properties file.

Parameters:
catalog_dir: str | Path | UPath

directory to write the file

static new_provenance_dict(path: str | pathlib.Path | upath.UPath | None = None, builder: str | None = None, **kwargs) dict[source]#

Constructs the provenance properties for a HATS catalog.

Parameters:
path: str | Path | UPath | None

The path to the catalog directory.

builder: str | None

The name and version of the tool that created the catalog.

**kwargs

Additional properties to include/override in the dictionary.

Returns:
dict

A dictionary with properties for the HATS catalog.

class IndexCatalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, catalog_path: str | pathlib.Path | upath.UPath | None = None, schema: pyarrow.Schema | None = None, original_schema: pyarrow.Schema | None = None)[source]#

Bases: hats.catalog.dataset.Dataset

An index into a HATS Catalog, enabling fast lookups on non-spatial values.

Note that this is not a true “HATS Catalog”, as it is not partitioned spatially.

loc_partitions(ids) list[hats.pixel_math.HealpixPixel][source]#

Find the set of partitions in the primary catalog for the ids provided.

Parameters:
ids

values of the indexing column to look up in the primary catalog

Returns:
list[HealpixPixel]

partitions of leaf parquet files in the primary catalog that may contain rows for the id values
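Conceptually, the lookup maps indexing-column values to candidate partitions. A sketch with a hypothetical in-memory index (the INDEX data and helper below are illustrative, not the catalog's real storage):

```python
# Hypothetical stand-in for an index catalog: indexing-column values
# mapped to the (order, pixel) partitions that may contain them.
INDEX = {
    "gaia_123": [(5, 313)],
    "gaia_456": [(5, 313), (6, 1254)],
}

def loc_partitions(ids):
    """Return the de-duplicated partitions that may hold rows for `ids`."""
    found = {pix for i in ids for pix in INDEX.get(i, [])}
    return sorted(found)

loc_partitions(["gaia_123", "gaia_456"])  # [(5, 313), (6, 1254)]
```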

class MapCatalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], catalog_path: str | pathlib.Path | upath.UPath | None = None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None, original_schema: pyarrow.Schema | None = None)[source]#

Bases: hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset

A HATS table to represent non-point-source data in a continuous map.

class MarginCatalog(catalog_info: hats.catalog.dataset.table_properties.TableProperties, pixels: hats.catalog.partition_info.PartitionInfo | hats.pixel_tree.pixel_tree.PixelTree | list[hats.pixel_math.HealpixPixel], catalog_path: str | pathlib.Path | upath.UPath | None = None, moc: mocpy.MOC | None = None, schema: pyarrow.Schema | None = None, original_schema: pyarrow.Schema | None = None)[source]#

Bases: hats.catalog.healpix_dataset.healpix_dataset.HealpixDataset

A HATS Catalog used to contain the ‘margin’ of another HATS catalog.

Catalogs of this type are used alongside a primary catalog, and contain the margin points for each HEALPix pixel: any points within a certain distance of the HEALPix pixel boundary. This ensures spatial operations such as crossmatching can be performed efficiently while maintaining accuracy.

filter_by_moc(moc: mocpy.MOC) typing_extensions.Self[source]#

Filter the pixels in the margin catalog to only those whose margin overlaps with the moc.

For margin pixels, this includes any pixel whose margin area may overlap with the moc. The filter is not always exact: it always includes every pixel that overlaps, but may also include extra partitions that do not.

Parameters:
moc: mocpy.MOC

the moc to filter by

Returns:
MarginCatalog

A new margin catalog containing only the pixels that overlap, or whose margin area overlaps, with the moc. Note that total_rows is reset to None, as updating it would require a scan over the new pixel sizes.

class PartitionInfo(pixel_list: list[hats.pixel_math.healpix_pixel.HealpixPixel], catalog_base_dir: str = None)[source]#

Container class for per-partition info.

METADATA_ORDER_COLUMN_NAME = 'Norder'#
METADATA_PIXEL_COLUMN_NAME = 'Npix'#
pixel_list#
catalog_base_dir = None#
get_healpix_pixels() list[hats.pixel_math.healpix_pixel.HealpixPixel][source]#

Get healpix pixel objects for all pixels represented as partitions.

Returns:
list[HealpixPixel]

List of HealpixPixel

get_highest_order() int[source]#

Get the highest healpix order for the dataset.

Returns:
int

int representing highest order.

write_to_file(partition_info_file: str | pathlib.Path | upath.UPath | None = None, catalog_path: str | pathlib.Path | upath.UPath | None = None)[source]#

Write all partition data to CSV file.

If no paths are provided, the catalog base directory from the read_from_dir call is used.

Parameters:
partition_info_file: str | Path | UPath | None

path to where the partition_info.csv file will be written.

catalog_path: str | Path | UPath | None

base directory for a catalog where the partition_info.csv file will be written.

Raises:
ValueError

if no path is provided, and could not be inferred.

classmethod read_from_dir(catalog_base_dir: str | pathlib.Path | upath.UPath | None, compute_from_catalog: bool = False) PartitionInfo[source]#

Read partition info from a file within a hats directory.

This will look for a partition_info.csv file and, if not found, a _metadata file. The second approach is typically slower for large catalogs, so a warning is issued to the user. In internal testing with large catalogs, the first approach takes less than a second, while the second can take 10-20 seconds.

If neither file is found and compute_from_catalog is set to True, the partition info will be computed from the individual catalog files. This is the slowest approach, and a warning is issued to the user.

Parameters:
catalog_base_dir: str | Path | UPath | None

Path to the root directory of the catalog

compute_from_catalog: bool

Whether to compute partition info from catalog files if no metadata or partition info file is found.

Returns:
PartitionInfo

A PartitionInfo object with the data from the file

Raises:
FileNotFoundError

if neither desired file is found in the catalog_base_dir
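The fallback order described above can be sketched as a small helper (hypothetical, not the library's implementation — it only locates the preferred source file and raises when neither exists):

```python
from pathlib import Path

def find_partition_source(base: Path) -> Path:
    """Prefer partition_info.csv, fall back to _metadata, else raise."""
    for name in ("partition_info.csv", "_metadata"):
        candidate = base / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"no partition info found under {base}")
```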

classmethod read_from_file(metadata_file: str | pathlib.Path | upath.UPath) PartitionInfo[source]#

Read partition info from a _metadata file to create an object

Parameters:
metadata_file: str | Path | UPath

path to the _metadata file

Returns:
PartitionInfo

A PartitionInfo object with the data from the file

classmethod read_from_csv(partition_info_file: str | pathlib.Path | upath.UPath) PartitionInfo[source]#

Read partition info from a partition_info.csv file to create an object

Parameters:
partition_info_file: str | Path | UPath

path to the partition_info.csv file

Returns:
PartitionInfo

A PartitionInfo object with the data from the file

as_dataframe()[source]#

Construct a pandas dataframe for the partition info pixels.

Returns:
pd.DataFrame

Pandas Dataframe with order, directory, and pixel info.

classmethod from_healpix(healpix_pixels: list[hats.pixel_math.healpix_pixel.HealpixPixel]) PartitionInfo[source]#

Create a partition info object from a list of constituent healpix pixels.

Parameters:
healpix_pixels: list[HealpixPixel]

a list of constituent healpix pixels

Returns:
PartitionInfo

A PartitionInfo object with the same healpix pixels

calculate_fractional_coverage()[source]#

Calculate what fraction of the sky is covered by partition tiles.
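Since HEALPix divides the sphere into 12 · 4^k equal-area pixels at order k, each partition at order k covers 1 / (12 · 4^k) of the sky, and the coverage is just the sum over partitions. A minimal sketch (hypothetical helper, taking (order, pixel) tuples rather than HealpixPixel objects):

```python
def fractional_coverage(pixels):
    """Fraction of the sky covered by a list of (order, pixel) partitions.

    At order k, HEALPix divides the sphere into 12 * 4**k equal-area
    pixels, so each contributes 1 / (12 * 4**k) of the sky.
    """
    return sum(1.0 / (12 * 4**order) for order, _ in pixels)

# All twelve order-0 pixels cover the whole sky:
fractional_coverage([(0, p) for p in range(12)])  # 1.0 (up to float rounding)
```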