hats.io.parquet_metadata

Utility functions for handling parquet metadata files

Functions

write_parquet_metadata(catalog_path[, ...])

Write Parquet dataset-level metadata files (and optional thumbnail) for a catalog.

read_row_group_fragments(metadata_file)

Generator for metadata fragment row groups in a parquet metadata file.

aggregate_column_statistics(metadata_file[, ...])

Read footer statistics in parquet metadata, and report on global min/max values.

per_pixel_statistics(metadata_file[, ...])

Read footer statistics in parquet metadata, and report on statistics about each pixel partition.

Module Contents

write_parquet_metadata(catalog_path: str | pathlib.Path | upath.UPath, order_by_healpix=True, output_path: str | pathlib.Path | upath.UPath | None = None, create_thumbnail: bool = False, thumbnail_threshold: int = 1000000, create_metadata: bool = True)

Write Parquet dataset-level metadata files (and optional thumbnail) for a catalog.

Creates files:

catalog/
├── data_thumbnail.parquet    (only if create_thumbnail=True)
├── ...
└── dataset/
    ├── _common_metadata      (always written)
    ├── _metadata             (only if create_metadata=True)
    └──  ...

data_thumbnail.parquet gives the user a quick overview of the whole dataset. It is a compact file containing one row from each data partition, up to a maximum of thumbnail_threshold rows.

dataset/_common_metadata contains the full schema of the dataset. This file will know all of the columns and their types, as well as any file-level key-value metadata associated with the full Parquet dataset.

dataset/_metadata contains the combined row group footers from all Parquet files in the dataset, which allows readers to read the entire dataset without having to open each individual Parquet file. This file can be large for datasets with many files, so users may choose to omit it by setting create_metadata=False.

Parameters:
catalog_path : str | Path | UPath

Base path for the catalog root.

order_by_healpix : bool, default=True

If True, reorder the combined metadata by breadth-first Healpix pixel ordering. Set to False for datasets that should not be reordered (e.g., secondary indexes). Does not modify dataset files on disk.

output_path : str | Path | UPath | None, default=None

Base path to write metadata files. If None, uses catalog_path.

create_thumbnail : bool, default=False

If True, writes a compact data_thumbnail.parquet containing one row per sampled file.

thumbnail_threshold : int, default=1_000_000

Maximum number of rows in the thumbnail. The thumbnail holds one row per partition, so when thumbnail_threshold exceeds the number of files, the row count is capped at the number of files instead.

create_metadata : bool, default=True

If True, writes dataset/_metadata combining row group footers.

Returns:
int

Total number of rows across all parquet files in the dataset.

Notes

For more information on the general Parquet metadata files, and why we write them, see https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files

For more information on HATS-specific metadata files and conventions, see https://www.ivoa.net/documents/Notes/HATS/
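
As a rough illustration of the file layout described above, the following sketch lists the files a call is expected to produce for a given combination of flags. The helper expected_metadata_files is hypothetical and not part of hats; it only encodes the rules stated in this section, and it assumes that output_path applies to the thumbnail as well as to the dataset/ files.

```python
from pathlib import Path


def expected_metadata_files(
    catalog_path,
    output_path=None,
    create_thumbnail=False,
    create_metadata=True,
):
    """Hypothetical helper (not part of hats): list the files that
    write_parquet_metadata is documented to create."""
    # Files go under output_path, or under catalog_path when it is None.
    base = Path(output_path) if output_path is not None else Path(catalog_path)
    files = [base / "dataset" / "_common_metadata"]  # always written
    if create_metadata:
        files.append(base / "dataset" / "_metadata")
    if create_thumbnail:
        # Assumption: the thumbnail is written under the same base path.
        files.append(base / "data_thumbnail.parquet")
    return files


for path in expected_metadata_files("catalog", create_thumbnail=True):
    print(path.as_posix())
```

A check like this can be useful after a run with create_metadata=False, where only _common_metadata should be present under dataset/.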

read_row_group_fragments(metadata_file: str)

Generator for metadata fragment row groups in a parquet metadata file.

Parameters:
metadata_file : str

Path to the _metadata file.

Yields:
RowGroupFragment

Metadata for individual row groups.

aggregate_column_statistics(metadata_file: str | pathlib.Path | upath.UPath, exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None, only_numeric_columns: bool = False, include_pixels: list[hats.pixel_math.healpix_pixel.HealpixPixel] = None)

Read footer statistics in parquet metadata, and report on global min/max values.

Parameters:
metadata_file : str | Path | UPath

Path to the _metadata file.

exclude_hats_columns : bool, default=True

Exclude HATS spatial and partitioning fields from the statistics.

exclude_columns : list[str], default=None

Additional columns to exclude from the statistics.

include_columns : list[str], default=None

If specified, only return statistics for the named columns. If None, return statistics for all non-HATS columns.

only_numeric_columns : bool, default=False

Only include numeric (integer or floating point) columns in the statistics. If True, the entire frame should be numeric.

include_pixels : list[HealpixPixel], default=None

If specified, only return statistics for the given pixels. If None, return statistics for all pixels.

Returns:
pd.DataFrame

Pandas DataFrame with global summary statistics.
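
To make the idea of global min/max concrete: each row group footer records a minimum and a maximum per column, and a global summary takes the smallest minimum and the largest maximum over all row groups. The sketch below shows that reduction on plain dictionaries. The input shape and the name global_min_max are illustrative assumptions, not the hats implementation, which reads its values from the _metadata file and returns a pandas DataFrame.

```python
def global_min_max(row_group_stats):
    """Reduce per-row-group (min, max) pairs to one (min, max) per column.

    row_group_stats is a list of dicts mapping column name -> (min, max),
    one dict per row group (a hypothetical stand-in for parquet footers).
    """
    summary = {}
    for stats in row_group_stats:
        for column, (low, high) in stats.items():
            if column in summary:
                cur_low, cur_high = summary[column]
                summary[column] = (min(cur_low, low), max(cur_high, high))
            else:
                summary[column] = (low, high)
    return summary


# Made-up footer statistics for two row groups.
footers = [
    {"ra": (10.0, 20.0), "dec": (-5.0, 5.0)},
    {"ra": (15.0, 40.0), "dec": (-30.0, 0.0)},
]
print(global_min_max(footers))
```

Because only footer values are combined, no data pages need to be read, which is why these statistics can be computed from the _metadata file alone.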

per_pixel_statistics(metadata_file: str | pathlib.Path | upath.UPath, exclude_hats_columns: bool = True, exclude_columns: list[str] = None, include_columns: list[str] = None, only_numeric_columns: bool = False, include_stats: list[str] = None, multi_index: bool = False, include_pixels: list[hats.pixel_math.healpix_pixel.HealpixPixel] = None, per_row_group: bool = False)

Read footer statistics in parquet metadata, and report on statistics about each pixel partition.

The statistics gathered are a subset of the available attributes in the pyarrow.parquet.ColumnChunkMetaData:

  • min_value - minimum value seen in a single data partition

  • max_value - maximum value seen in a single data partition

  • null_count - number of null values

  • row_count - total number of values. Note that this will only vary by column if the dataset has nested columns

  • disk_bytes - compressed size of the data in the parquet file, in bytes

  • memory_bytes - uncompressed size, in bytes

Parameters:
metadata_file : str | Path | UPath

Path to the _metadata file.

exclude_hats_columns : bool, default=True

Exclude HATS spatial and partitioning fields from the statistics.

exclude_columns : list[str], default=None

Additional columns to exclude from the statistics.

include_columns : list[str], default=None

If specified, only return statistics for the named columns. If None, return statistics for all non-HATS columns.

only_numeric_columns : bool, default=False

Only include numeric (integer or floating point) columns in the statistics. If True, the entire frame should be numeric.

include_stats : list[str], default=None

If specified, only return the listed kinds of values (min_value, max_value, null_count, row_count, disk_bytes, memory_bytes). If None, return all of them.

multi_index : bool, default=False

If True, the returned frame uses a multi-index, first on pixel, then on column name. If False, the frame is indexed on pixel only, with a separate column for each combination of data column and stat value.

include_pixels : list[HealpixPixel], default=None

If specified, only return statistics for the given pixels. If None, return statistics for all pixels.

per_row_group : bool, default=False

If True, return finer-grained statistics for each row group within each pixel, rather than one set of statistics per pixel.

Returns:
pd.DataFrame

Pandas DataFrame with granular per-pixel statistics.
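
The two result layouts selected by multi_index can be pictured with plain dictionaries. This is only a sketch of the shapes involved: the pixel labels, the "column: stat" naming of the flat columns, and the helper names are assumptions made for illustration, and the real function returns a pandas DataFrame.

```python
def flat_layout(per_pixel):
    """One entry per pixel; one key per (column, stat) combination."""
    return {
        pixel: {
            f"{column}: {stat}": value
            for column, stats in columns.items()
            for stat, value in stats.items()
        }
        for pixel, columns in per_pixel.items()
    }


def multi_index_layout(per_pixel):
    """One entry per (pixel, column) pair; one key per stat."""
    return {
        (pixel, column): dict(stats)
        for pixel, columns in per_pixel.items()
        for column, stats in columns.items()
    }


# Hypothetical statistics for two pixels and one data column.
per_pixel = {
    "Order: 0, Pixel: 4": {"ra": {"min_value": 10.0, "max_value": 20.0}},
    "Order: 0, Pixel: 5": {"ra": {"min_value": 15.0, "max_value": 40.0}},
}

print(flat_layout(per_pixel)["Order: 0, Pixel: 4"])
print(multi_index_layout(per_pixel)[("Order: 0, Pixel: 5", "ra")])
```

The flat form grows wider with every data column and stat, while the multi-index form grows longer, which tends to be easier to filter by column name.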