hats.io.file_io#

Functions#

delete_file(file_handle)

Deletes file from filesystem.

load_csv_to_pandas(→ pandas.DataFrame)

Load a CSV file into a pandas DataFrame

load_csv_to_pandas_generator(...)

Load a CSV file in chunks, yielding pandas DataFrames

load_text_file(file_pointer[, encoding])

Load a text file content to a list of strings.

make_directory(file_pointer[, exist_ok])

Make a directory at a given file pointer

read_fits_image(→ numpy.ndarray)

Read the object spatial distribution information from a healpix FITS file.

read_parquet_dataset(→ tuple[str | list[str], ...)

Read parquet dataset from directory pointer or list of files.

read_parquet_file(→ pyarrow.parquet.ParquetFile)

Read single parquet file.

read_parquet_file_to_pandas(→ nested_pandas.NestedFrame)

Reads parquet file(s) to a pandas DataFrame

read_parquet_metadata(→ pyarrow.parquet.FileMetaData)

Read FileMetaData from footer of a single Parquet file.

remove_directory(file_pointer[, ignore_errors])

Remove a directory, and all contents, recursively.

write_dataframe_to_csv(dataframe, file_pointer, **kwargs)

Write a pandas DataFrame to a CSV file

write_dataframe_to_parquet(dataframe, file_pointer)

Write a pandas DataFrame to a parquet file

write_fits_image(histogram, map_file_pointer)

Write the object spatial distribution information to a healpix FITS file.

write_parquet_metadata(schema, file_pointer[, ...])

Write a metadata only parquet file from a schema

write_string_to_file(file_pointer, string[, encoding])

Write a string to a text file

append_paths_to_pointer(→ upath.UPath)

Append directories and/or a file name to a specified file pointer.

directory_has_contents(→ bool)

Checks if a directory already has some contents (any files or subdirectories)

does_file_or_directory_exist(→ bool)

Checks if a file or directory exists for a given file pointer

find_files_matching_path(→ list[upath.UPath])

Find files or directories matching the provided path parts.

get_upath(→ upath.UPath)

Returns a UPath file pointer from a path string or other path-like type.

get_upath_for_protocol(→ upath.UPath)

Create UPath with protocol-specific configurations.

is_regular_file(→ bool)

Checks if a regular file (NOT a directory) exists for a given file pointer.

Package Contents#

delete_file(file_handle: str | pathlib.Path | upath.UPath)[source]#

Deletes file from filesystem.

Parameters:
file_handle: str | Path | UPath

location of file pointer

load_csv_to_pandas(file_pointer: str | pathlib.Path | upath.UPath, **kwargs) pandas.DataFrame[source]#

Load a CSV file into a pandas DataFrame

Parameters:
file_pointer: str | Path | UPath

location of csv file to load

**kwargs

arguments to pass to pandas read_csv loading method

Returns:
pd.DataFrame

contents of the CSV file, as a dataframe.

load_csv_to_pandas_generator(file_pointer: str | pathlib.Path | upath.UPath, *, chunksize=10000, compression=None, **kwargs) collections.abc.Generator[pandas.DataFrame][source]#

Load a CSV file in chunks, yielding pandas DataFrames

Parameters:
file_pointer: str | Path | UPath

location of csv file to load

chunksize: int

(Default value = 10_000) number of rows to load per chunk

compression: str

(Default value = None) for compressed CSVs, the manner of compression. e.g. ‘gz’, ‘bzip’.

**kwargs

arguments to pass to pandas read_csv loading method

Yields:
pd.DataFrame

chunked contents of the CSV file, as a dataframe.
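
The chunked behavior documented above can be sketched with pandas' own chunked reader, which a generator like this one presumably wraps; this is a minimal pandas-only illustration, not the hats implementation:

```python
import io

import pandas as pd

# A small in-memory CSV standing in for a file on disk.
csv_text = "id,ra,dec\n1,10.0,-5.0\n2,11.0,-6.0\n3,12.0,-7.0\n"

# Passing chunksize to read_csv yields DataFrames of at most `chunksize`
# rows each, mirroring the documented behavior of the generator.
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)

print(total_rows)  # 3
```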

load_text_file(file_pointer: str | pathlib.Path | upath.UPath, encoding: str = 'utf-8')[source]#

Load a text file content to a list of strings.

Parameters:
file_pointer: str | Path | UPath

location of file to read

encoding: str

(Default value = “utf-8”) string encoding method used by the file

Returns:
list[str]

contents of the file as a list of strings, one per line.
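
The documented behavior (one string per line) can be sketched with the standard library alone; whether trailing newlines are preserved is not stated above, so this sketch uses splitlines(), which strips them:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    file_pointer = Path(tmp) / "notes.txt"
    file_pointer.write_text("first line\nsecond line\n", encoding="utf-8")

    # splitlines() drops line terminators, leaving one entry per line.
    lines = file_pointer.read_text(encoding="utf-8").splitlines()

print(lines)  # ['first line', 'second line']
```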

make_directory(file_pointer: str | pathlib.Path | upath.UPath, exist_ok: bool = False)[source]#

Make a directory at a given file pointer

Will raise an error if the directory already exists, unless exist_ok is True, in which case any existing directories will be left unmodified.

Parameters:
file_pointer: str | Path | UPath

location in file system to make directory

exist_ok: bool

(Default value = False) If False, an error is raised if the directory exists. If True, existing directories are ignored and not modified.
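
The exist_ok semantics match those of pathlib's own mkdir, which can illustrate them directly (the directory names here are illustrative only):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "catalog" / "subdir"

    # parents=True creates intermediate directories as needed.
    target.mkdir(parents=True)

    # With exist_ok=False (the documented default), a second call raises.
    try:
        target.mkdir(parents=True, exist_ok=False)
        raised = False
    except FileExistsError:
        raised = True

    # With exist_ok=True, the existing directory is left unmodified.
    target.mkdir(parents=True, exist_ok=True)

print(raised)  # True
```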

read_fits_image(map_file_pointer: str | pathlib.Path | upath.UPath) numpy.ndarray[source]#

Read the object spatial distribution information from a healpix FITS file.

Parameters:
map_file_pointer: str | Path | UPath

location of file to be read

Returns:
np.ndarray

one-dimensional numpy array of integers where the value at each index corresponds to the number of objects found at the healpix pixel.

read_parquet_dataset(source: str | pathlib.Path | upath.UPath | list[str | pathlib.Path | upath.UPath], **kwargs) tuple[str | list[str], pyarrow.dataset.Dataset][source]#

Read parquet dataset from directory pointer or list of files.

Note that pyarrow.dataset reads require that directory pointers don’t contain a leading slash, and the protocol prefix may additionally be removed. As such, we also return the directory path that is formatted for pyarrow ingestion for follow-up.

See more info on source specification and possible kwargs at https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html

Parameters:
source: str | Path | UPath | list[str | Path | UPath]

directory, path, or list of paths to read data from

**kwargs

additional arguments passed to pyarrow.dataset.dataset

Returns:
tuple[str | list[str], Dataset]

Tuple containing a path to the dataset (that is formatted for pyarrow ingestion) and the dataset read from disk.

read_parquet_file(file_pointer: str | pathlib.Path | upath.UPath, **kwargs) pyarrow.parquet.ParquetFile[source]#

Read single parquet file.

Parameters:
file_pointer: str | Path | UPath

location of parquet file

**kwargs

additional arguments to be passed to pyarrow.parquet.ParquetFile

Returns:
pq.ParquetFile

full contents of parquet file

read_parquet_file_to_pandas(file_pointer: str | pathlib.Path | upath.UPath, is_dir: bool | None = None, **kwargs) nested_pandas.NestedFrame[source]#

Reads parquet file(s) to a pandas DataFrame

Parameters:
file_pointer: str | Path | UPath

File Pointer to a parquet file or a directory containing parquet files

is_dir: bool | None

If True, the pointer represents a pixel directory, otherwise, the pointer represents a file. In both cases there is no need to check the pointer’s content type. If is_dir is None (default), this method will resort to upath.is_dir() to identify the type of pointer. Inferring the type for HTTP is particularly expensive because it requires downloading the contents of the pointer in its entirety.

**kwargs

Additional arguments to pass to pandas read_parquet method

Returns:
NestedFrame

Pandas DataFrame with the data from the parquet file(s)

read_parquet_metadata(file_pointer: str | pathlib.Path | upath.UPath, **kwargs) pyarrow.parquet.FileMetaData[source]#

Read FileMetaData from footer of a single Parquet file.

Parameters:
file_pointer: str | Path | UPath

location of file to read metadata from

**kwargs

additional arguments to be passed to pyarrow.parquet.read_metadata

Returns:
pq.FileMetaData

parquet file metadata (includes schema)

remove_directory(file_pointer: str | pathlib.Path | upath.UPath, ignore_errors=False)[source]#

Remove a directory, and all contents, recursively.

Parameters:
file_pointer: str | Path | UPath

directory in file system to remove

ignore_errors: bool

(Default value = False) if True, errors resulting from failed removals will be ignored
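
For local filesystems this matches the semantics of shutil.rmtree, which can stand in for a sketch:

```python
import shutil
import tempfile
from pathlib import Path

# Build a directory with nested contents.
tmp = Path(tempfile.mkdtemp())
(tmp / "sub").mkdir()
(tmp / "sub" / "data.txt").write_text("contents")

# Recursive removal of the directory and everything under it.
shutil.rmtree(tmp)

# ignore_errors=True suppresses errors, e.g. if the path is already gone.
shutil.rmtree(tmp, ignore_errors=True)

print(tmp.exists())  # False
```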

write_dataframe_to_csv(dataframe: pandas.DataFrame, file_pointer: str | pathlib.Path | upath.UPath, **kwargs)[source]#

Write a pandas DataFrame to a CSV file

Parameters:
dataframe: pd.DataFrame

DataFrame to write

file_pointer: str | Path | UPath

location of file to write to

**kwargs

args to pass to pandas to_csv method

write_dataframe_to_parquet(dataframe: pandas.DataFrame, file_pointer)[source]#

Write a pandas DataFrame to a parquet file

Parameters:
dataframe: pd.DataFrame

DataFrame to write

file_pointer: str | Path | UPath

location of file to write to

write_fits_image(histogram: numpy.ndarray, map_file_pointer: str | pathlib.Path | upath.UPath)[source]#

Write the object spatial distribution information to a healpix FITS file.

Parameters:
histogram: np.ndarray

one-dimensional numpy array of long integers where the value at each index corresponds to the number of objects found at the healpix pixel.

map_file_pointer: str | Path | UPath

location of file to be written

write_parquet_metadata(schema, file_pointer: str | pathlib.Path | upath.UPath, metadata_collector: list | None = None, **kwargs)[source]#

Write a metadata only parquet file from a schema

Parameters:
schema: pa.Schema

pyarrow schema to be written

file_pointer: str | Path | UPath

location of file to be written to

metadata_collector: list | None

(Default value = None) where to collect metadata information

**kwargs

additional arguments to be passed to pyarrow.parquet.write_metadata

write_string_to_file(file_pointer: str | pathlib.Path | upath.UPath, string: str, encoding: str = 'utf-8')[source]#

Write a string to a text file

Parameters:
file_pointer: str | Path | UPath

file location to write file to

string: str

string to write to file

encoding: str

(Default value = “utf-8”) encoding method to write to file with

append_paths_to_pointer(pointer: str | pathlib.Path | upath.UPath, *paths: str) upath.UPath[source]#

Append directories and/or a file name to a specified file pointer.

Parameters:
pointer: str | Path | UPath

FilePointer object to add path to

*paths: str

any number of directory names optionally followed by a file name to append to the pointer

Returns:
UPath

New file pointer to path given by joining given pointer and path names
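
The join semantics can be sketched with pathlib alone as a stand-in for UPath, which joins path segments the same way for local paths; the catalog-like segment names below are illustrative only:

```python
from pathlib import PurePosixPath

base = PurePosixPath("/data/catalogs")

# joinpath accepts any number of segments, mirroring the documented
# *paths behavior of append_paths_to_pointer.
pointer = base.joinpath("my_catalog", "Norder=0", "Npix=4.parquet")

print(pointer)  # /data/catalogs/my_catalog/Norder=0/Npix=4.parquet
```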

directory_has_contents(pointer: str | pathlib.Path | upath.UPath) bool[source]#

Checks if a directory already has some contents (any files or subdirectories)

Parameters:
pointer: str | Path | UPath

File Pointer to check for existing contents

Returns:
bool

True if there are any files or subdirectories below this directory.
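
A minimal stdlib sketch of this check: any() over a directory iterator returns True as soon as one entry is found, without listing the whole directory:

```python
import tempfile
from pathlib import Path


def has_contents(pointer: Path) -> bool:
    # True if the directory holds any file or subdirectory.
    return any(pointer.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    before = has_contents(tmp)           # empty directory
    (tmp / "file.txt").write_text("x")
    after = has_contents(tmp)            # now holds one file

print(before, after)  # False True
```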

does_file_or_directory_exist(pointer: str | pathlib.Path | upath.UPath) bool[source]#

Checks if a file or directory exists for a given file pointer

Parameters:
pointer: str | Path | UPath

File Pointer to check if file or directory exists at

Returns:
bool

True if file or directory at pointer exists, False if not

find_files_matching_path(pointer: str | pathlib.Path | upath.UPath, *paths: str) list[upath.UPath][source]#

Find files or directories matching the provided path parts.

Parameters:
pointer: str | Path | UPath

base File Pointer in which to find contents

*paths: str

any number of directory names optionally followed by a file name. Directory or file names may be replaced with * as a wildcard matcher.

Returns:
list[UPath]

New file pointers to files found matching the path
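
The wildcard matching can be sketched with pathlib's glob, which supports the same * matcher across path parts (the partition-style directory names are illustrative only):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    for name in ["Norder=0", "Norder=1"]:
        (tmp / name).mkdir()
        (tmp / name / "Npix=0.parquet").write_text("")

    # "*" in a path part matches any directory name at that level.
    matches = sorted(p.parent.name for p in tmp.glob("Norder=*/Npix=0.parquet"))

print(matches)  # ['Norder=0', 'Norder=1']
```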

get_upath(path: str | pathlib.Path | upath.UPath) upath.UPath[source]#

Returns a UPath file pointer from a path string or other path-like type.

Parameters:
path: str | Path | UPath

base file path to be normalized to UPath

Returns:
UPath

Instance of UPath.

get_upath_for_protocol(path: str | pathlib.Path) upath.UPath[source]#

Create UPath with protocol-specific configurations.

If we access pointers on S3 and credentials are not found we assume an anonymous access, i.e., that the bucket is public.

Parameters:
path: str | Path | UPath

base file path to be normalized to UPath

Returns:
UPath

Instance of UPath.

is_regular_file(pointer: str | pathlib.Path | upath.UPath) bool[source]#

Checks if a regular file (NOT a directory) exists for a given file pointer.

Parameters:
pointer: str | Path | UPath

File Pointer to check if a regular file

Returns:
bool

True if a regular file exists at the pointer, False if it does not exist or is a directory