hats.io.file_io#

Functions#

delete_file(file_handle)

Deletes file from filesystem.

load_csv_to_pandas(→ pandas.DataFrame)

Load a CSV file into a pandas DataFrame

load_csv_to_pandas_generator(...)

Load a CSV file in chunks, yielding pandas DataFrames

load_text_file(file_pointer[, encoding])

Load a text file content to a list of strings.

make_directory(file_pointer[, exist_ok])

Make a directory at a given file pointer

read_fits_image(→ numpy.ndarray)

Read the object spatial distribution information from a healpix FITS file.

read_parquet_dataset(→ tuple[str | list[str], ...)

Read parquet dataset from directory pointer or list of files.

read_parquet_file(→ pyarrow.parquet.ParquetFile)

Read single parquet file.

read_parquet_file_to_pandas(→ nested_pandas.NestedFrame)

Reads parquet file(s) to a pandas DataFrame

read_parquet_metadata(→ pyarrow.parquet.FileMetaData)

Read FileMetaData from footer of a single Parquet file.

remove_directory(file_pointer[, ignore_errors])

Remove a directory, and all contents, recursively.

write_dataframe_to_csv(dataframe, file_pointer, **kwargs)

Write a pandas DataFrame to a CSV file

write_dataframe_to_parquet(dataframe, file_pointer)

Write a pandas DataFrame to a parquet file

write_fits_image(histogram, map_file_pointer)

Write the object spatial distribution information to a healpix FITS file.

write_parquet_metadata(schema, file_pointer[, ...])

Write a metadata only parquet file from a schema

write_string_to_file(file_pointer, string[, encoding])

Write a string to a text file

append_paths_to_pointer(→ upath.UPath)

Append directories and/or a file name to a specified file pointer.

directory_has_contents(→ bool)

Checks if a directory already has some contents (any files or subdirectories)

does_file_or_directory_exist(→ bool)

Checks if a file or directory exists for a given file pointer

find_files_matching_path(→ list[upath.UPath])

Find files or directories matching the provided path parts.

get_upath(→ upath.UPath)

Returns a UPath file pointer from a path string or other path-like type.

get_upath_for_protocol(→ upath.UPath)

Create UPath with protocol-specific configurations.

is_regular_file(→ bool)

Checks if a regular file (NOT a directory) exists for a given file pointer.

Package Contents#

delete_file(file_handle: str | pathlib.Path | upath.UPath)[source]#

Deletes file from filesystem.

Parameters:
file_handle: str | Path | UPath

location of file pointer

load_csv_to_pandas(file_pointer: str | pathlib.Path | upath.UPath, **kwargs) pandas.DataFrame[source]#

Load a CSV file into a pandas DataFrame

Parameters:
file_pointer: str | Path | UPath

location of csv file to load

**kwargs

arguments to pass to pandas read_csv loading method

Returns:
pd.DataFrame

contents of the CSV file, as a dataframe.

load_csv_to_pandas_generator(file_pointer: str | pathlib.Path | upath.UPath, *, chunksize=10000, compression=None, **kwargs) collections.abc.Generator[pandas.DataFrame][source]#

Load a CSV file in chunks, yielding pandas DataFrames

Parameters:
file_pointer: str | Path | UPath

location of csv file to load

chunksize: int

(Default value = 10_000) number of rows to load per chunk

compression: str

(Default value = None) for compressed CSVs, the manner of compression. e.g. ‘gz’, ‘bzip’.

**kwargs

arguments to pass to pandas read_csv loading method

Yields:
pd.DataFrame

chunked contents of the CSV file, as a dataframe.
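
The chunked behavior documented above can be sketched with pandas' own chunked reader, which a generator like this one presumably wraps; this is a minimal pandas-only illustration, not the hats implementation:

```python
import io

import pandas as pd

# A small in-memory CSV standing in for a file on disk.
csv_text = "id,ra,dec\n1,10.0,-5.0\n2,11.0,-6.0\n3,12.0,-7.0\n"

# Passing chunksize to read_csv yields DataFrames of at most `chunksize`
# rows each, mirroring the documented behavior of the generator.
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)

print(total_rows)  # 3
```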

load_text_file(file_pointer: str | pathlib.Path | upath.UPath, encoding: str = 'utf-8')[source]#

Load a text file content to a list of strings.

Parameters:
file_pointer: str | Path | UPath

location of file to read

encoding: str

(Default value = “utf-8”) string encoding method used by the file

Returns:
list[str]

contents of the file as a list of strings, one per line.
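
The documented behavior (one string per line) can be sketched with the standard library alone; whether trailing newlines are preserved is not stated above, so this sketch uses splitlines(), which strips them:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    file_pointer = Path(tmp) / "notes.txt"
    file_pointer.write_text("first line\nsecond line\n", encoding="utf-8")

    # splitlines() drops line terminators, leaving one entry per line.
    lines = file_pointer.read_text(encoding="utf-8").splitlines()

print(lines)  # ['first line', 'second line']
```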

make_directory(file_pointer: str | pathlib.Path | upath.UPath, exist_ok: bool = False)[source]#

Make a directory at a given file pointer

Will raise an error if the directory already exists, unless exist_ok is True, in which case any existing directories will be left unmodified.

Parameters:
file_pointer: str | Path | UPath

location in file system to make directory

exist_ok: bool

(Default value = False) If False, an error is raised if the directory exists. If True, existing directories are ignored and not modified.
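
The exist_ok semantics match those of pathlib's own mkdir, which can illustrate them directly (the directory names here are illustrative only):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "catalog" / "subdir"

    # parents=True creates intermediate directories as needed.
    target.mkdir(parents=True)

    # With exist_ok=False (the documented default), a second call raises.
    try:
        target.mkdir(parents=True, exist_ok=False)
        raised = False
    except FileExistsError:
        raised = True

    # With exist_ok=True, the existing directory is left unmodified.
    target.mkdir(parents=True, exist_ok=True)

print(raised)  # True
```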

read_fits_image(map_file_pointer: str | pathlib.Path | upath.UPath) numpy.ndarray[source]#

Read the object spatial distribution information from a healpix FITS file.

Parameters:
map_file_pointer: str | Path | UPath

location of file to be read

Returns:
np.ndarray

one-dimensional numpy array of integers where the value at each index corresponds to the number of objects found at the healpix pixel.

read_parquet_dataset(source: str | pathlib.Path | upath.UPath | list[str | pathlib.Path | upath.UPath], **kwargs) tuple[str | list[str], pyarrow.dataset.Dataset][source]#

Read parquet dataset from directory pointer or list of files.

Note that pyarrow.dataset reads require that directory pointers don’t contain a leading slash, and the protocol prefix may additionally be removed. As such, we also return the directory path that is formatted for pyarrow ingestion for follow-up.

See more info on source specification and possible kwargs at https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html

Parameters:
source: str | Path | UPath | list[str | Path | UPath]

directory, path, or list of paths to read data from

**kwargs

additional arguments passed to pyarrow.dataset.dataset

Returns:
tuple[str | list[str], Dataset]

Tuple containing a path to the dataset (that is formatted for pyarrow ingestion) and the dataset read from disk.

read_parquet_file(file_pointer: str | pathlib.Path | upath.UPath, **kwargs) pyarrow.parquet.ParquetFile[source]#

Read single parquet file.

Parameters:
file_pointer: str | Path | UPath

location of parquet file

**kwargs

additional arguments to be passed to pyarrow.parquet.ParquetFile

Returns:
pq.ParquetFile

full contents of parquet file

read_parquet_file_to_pandas(file_pointer: str | pathlib.Path | upath.UPath, is_dir: bool | None = None, **kwargs) nested_pandas.NestedFrame[source]#

Reads parquet file(s) to a pandas DataFrame

Parameters:
file_pointer: str | Path | UPath

File Pointer to a parquet file or a directory containing parquet files

is_dir: bool | None

If True, the pointer represents a pixel directory, otherwise, the pointer represents a file. In both cases there is no need to check the pointer’s content type. If is_dir is None (default), this method will resort to upath.is_dir() to identify the type of pointer. Inferring the type for HTTP is particularly expensive because it requires downloading the contents of the pointer in its entirety.

**kwargs

Additional arguments to pass to pandas read_parquet method

Returns:
NestedFrame

Pandas DataFrame with the data from the parquet file(s)

read_parquet_metadata(file_pointer: str | pathlib.Path | upath.UPath, **kwargs) pyarrow.parquet.FileMetaData[source]#

Read FileMetaData from footer of a single Parquet file.

Parameters:
file_pointer: str | Path | UPath

location of file to read metadata from

**kwargs

additional arguments to be passed to pyarrow.parquet.read_metadata

Returns:
pq.FileMetaData

parquet file metadata (includes schema)

remove_directory(file_pointer: str | pathlib.Path | upath.UPath, ignore_errors=False)[source]#

Remove a directory, and all contents, recursively.

Parameters:
file_pointer: str | Path | UPath

directory in file system to remove

ignore_errors: bool

(Default value = False) if True, errors resulting from failed removals will be ignored
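
For local filesystems this matches the semantics of shutil.rmtree, which can stand in for a sketch:

```python
import shutil
import tempfile
from pathlib import Path

# Build a directory with nested contents.
tmp = Path(tempfile.mkdtemp())
(tmp / "sub").mkdir()
(tmp / "sub" / "data.txt").write_text("contents")

# Recursive removal of the directory and everything under it.
shutil.rmtree(tmp)

# ignore_errors=True suppresses errors, e.g. if the path is already gone.
shutil.rmtree(tmp, ignore_errors=True)

print(tmp.exists())  # False
```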

write_dataframe_to_csv(dataframe: pandas.DataFrame, file_pointer: str | pathlib.Path | upath.UPath, **kwargs)[source]#

Write a pandas DataFrame to a CSV file

Parameters:
dataframe: pd.DataFrame

DataFrame to write

file_pointer: str | Path | UPath

location of file to write to

**kwargs

args to pass to pandas to_csv method

write_dataframe_to_parquet(dataframe: pandas.DataFrame, file_pointer)[source]#

Write a pandas DataFrame to a parquet file

Parameters:
dataframe: pd.DataFrame

DataFrame to write

file_pointer: str | Path | UPath

location of file to write to

write_fits_image(histogram: numpy.ndarray, map_file_pointer: str | pathlib.Path | upath.UPath)[source]#

Write the object spatial distribution information to a healpix FITS file.

Parameters:
histogram: np.ndarray

one-dimensional numpy array of long integers where the value at each index corresponds to the number of objects found at the healpix pixel.

map_file_pointer: str | Path | UPath

location of file to be written

write_parquet_metadata(schema, file_pointer: str | pathlib.Path | upath.UPath, metadata_collector: list | None = None, **kwargs)[source]#

Write a metadata only parquet file from a schema

Parameters:
schema: pa.Schema

pyarrow schema to be written

file_pointer: str | Path | UPath

location of file to be written to

metadata_collector: list | None

(Default value = None) where to collect metadata information

**kwargs

additional arguments to be passed to pyarrow.parquet.write_metadata

write_string_to_file(file_pointer: str | pathlib.Path | upath.UPath, string: str, encoding: str = 'utf-8')[source]#

Write a string to a text file

Parameters:
file_pointer: str | Path | UPath

file location to write file to

string: str

string to write to file

encoding: str

(Default value = “utf-8”) encoding method to write to file with

append_paths_to_pointer(pointer: str | pathlib.Path | upath.UPath, *paths: str) upath.UPath[source]#

Append directories and/or a file name to a specified file pointer.

Parameters:
pointer: str | Path | UPath

FilePointer object to add path to

*paths: str

any number of directory names optionally followed by a file name to append to the pointer

Returns:
UPath

New file pointer to path given by joining given pointer and path names
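
The join semantics can be sketched with pathlib alone as a stand-in for UPath, which joins path segments the same way for local paths; the catalog-like segment names below are illustrative only:

```python
from pathlib import PurePosixPath

base = PurePosixPath("/data/catalogs")

# joinpath accepts any number of segments, mirroring the documented
# *paths behavior of append_paths_to_pointer.
pointer = base.joinpath("my_catalog", "Norder=0", "Npix=4.parquet")

print(pointer)  # /data/catalogs/my_catalog/Norder=0/Npix=4.parquet
```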

directory_has_contents(pointer: str | pathlib.Path | upath.UPath) bool[source]#

Checks if a directory already has some contents (any files or subdirectories)

Parameters:
pointer: str | Path | UPath

File Pointer to check for existing contents

Returns:
bool

True if there are any files or subdirectories below this directory.
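
A minimal stdlib sketch of this check: any() over a directory iterator returns True as soon as one entry is found, without listing the whole directory:

```python
import tempfile
from pathlib import Path


def has_contents(pointer: Path) -> bool:
    # True if the directory holds any file or subdirectory.
    return any(pointer.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    before = has_contents(tmp)           # empty directory
    (tmp / "file.txt").write_text("x")
    after = has_contents(tmp)            # now holds one file

print(before, after)  # False True
```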

does_file_or_directory_exist(pointer: str | pathlib.Path | upath.UPath) bool[source]#

Checks if a file or directory exists for a given file pointer

Parameters:
pointer: str | Path | UPath

File Pointer to check if file or directory exists at

Returns:
bool

True if file or directory at pointer exists, False if not

find_files_matching_path(pointer: str | pathlib.Path | upath.UPath, *paths: str) list[upath.UPath][source]#

Find files or directories matching the provided path parts.

Parameters:
pointer: str | Path | UPath

base File Pointer in which to find contents

*paths: str

any number of directory names optionally followed by a file name. Directory or file names may be replaced with * as a wildcard matcher.

Returns:
list[UPath]

New file pointers to files found matching the path
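
The wildcard matching can be sketched with pathlib's glob, which supports the same * matcher across path parts (the partition-style directory names are illustrative only):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    for name in ["Norder=0", "Norder=1"]:
        (tmp / name).mkdir()
        (tmp / name / "Npix=0.parquet").write_text("")

    # "*" in a path part matches any directory name at that level.
    matches = sorted(p.parent.name for p in tmp.glob("Norder=*/Npix=0.parquet"))

print(matches)  # ['Norder=0', 'Norder=1']
```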

get_upath(path: str | pathlib.Path | upath.UPath) upath.UPath[source]#

Returns a UPath file pointer from a path string or other path-like type.

Parameters:
path: str | Path | UPath

base file path to be normalized to UPath

Returns:
UPath

Instance of UPath.

get_upath_for_protocol(path: str | pathlib.Path) upath.UPath[source]#

Create UPath with protocol-specific configurations.

If we access pointers on S3 and credentials are not found we assume an anonymous access, i.e., that the bucket is public.

Parameters:
path: str | Path | UPath

base file path to be normalized to UPath

Returns:
UPath

Instance of UPath.

is_regular_file(pointer: str | pathlib.Path | upath.UPath) bool[source]#

Checks if a regular file (NOT a directory) exists for a given file pointer.

Parameters:
pointer: str | Path | UPath

File Pointer to check if a regular file

Returns:
bool

True if a regular file exists at the pointer, False if it does not exist or is a directory