Metadata-Version: 2.4
Name: earthdata-varinfo
Version: 4.0.0
Summary: A package for parsing Earth Observation science granule structure and extracting relations between science variables and their associated metadata, such as coordinates.
Home-page: https://github.com/nasa/earthdata-varinfo
Author: NASA EOSDIS SDPS Data Services Team
Author-email: owen.m.littlejohns@nasa.gov
License: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: netCDF4>=1.7.2
Requires-Dist: numpy<2.3,>=1.24.2
Requires-Dist: python-cmr~=0.12.0
Requires-Dist: requests~=2.31.0
Requires-Dist: urllib3~=2.6.1
Provides-Extra: dev
Requires-Dist: ipython~=8.18.1; extra == "dev"
Requires-Dist: jsonschema~=4.23.0; extra == "dev"
Requires-Dist: pre-commit~=4.2.0; extra == "dev"
Requires-Dist: pycodestyle~=2.12.1; extra == "dev"
Requires-Dist: pylint~=3.3.6; extra == "dev"
Requires-Dist: pytest~=8.3.5; extra == "dev"
Requires-Dist: pytest-cov~=6.0.0; extra == "dev"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# earthdata-varinfo

A Python package developed as part of the NASA Earth Observing System Data and
Information System (EOSDIS) for parsing Earth Observation science granule
structure and extracting relations between science variables and their
associated metadata, such as coordinates. This package also includes the
capability to generate variable (UMM-Var) metadata records that are compatible
with the NASA EOSDIS Common Metadata Repository
([CMR](https://www.earthdata.nasa.gov/eosdis/science-system-description/eosdis-components/cmr)).

For general usage of classes and functions in `earthdata-varinfo`, see:
<https://github.com/nasa/earthdata-varinfo/blob/main/docs/earthdata-varinfo.ipynb>.

## Features:

### CFConfig

A class that takes a JSON file and retrieves all related configuration based on
the supplied mission name and collection shortname. The JSON file is optional,
and if not supplied, a `CFConfig` class will be constructed with largely empty
attributes.

``` python
from varinfo import CFConfig

cf_config = CFConfig('ICESat2', 'ATL03', config_file='config/0.0.1/sample_config_0.0.1.json')
metadata_attributes = cf_config.get_metadata_attributes('/full/variable/path')
```

### VarInfo

A group of classes that contain metadata attributes for all groups and
variables in a single granule, and the relations between all variables within
that granule. Current classes include:

* VarInfoBase: An abstract base class that contains core logic and methods used
  by the child classes that parse different sources of granule information.
* VarInfoFromDmr: Child class that maps input from a `.dmr` file downloaded
  from Hyrax in the cloud. This inherits all the methods and logic of
  VarInfoBase.
* VarInfoFromNetCDF4: Child class that maps input directly from a NetCDF-4
  file. This inherits all the methods and logic of VarInfoBase.

``` python
from varinfo import VarInfoFromDmr

var_info = VarInfoFromDmr('/path/to/local/file.dmr',
                          config_file='config/0.0.1/sample_config_0.0.1.json')

# Retrieve a set of variables with coordinate metadata:
var_info.get_science_variables()

# Retrieve a set of variables without coordinate metadata:
var_info.get_metadata_variables()

# Augment a set of desired variables with all variables required to support
# the requested set, for example coordinate variables.
var_info.get_required_variables({'/path/to/science/variable'})

# Retrieve an ordered list of dimensions associated with all specified variables.
var_info.get_required_dimensions({'/path/to/science/variable'})

# Retrieve all spatial dimensions associated with the specified set of science
# variables.
var_info.get_spatial_dimensions({'/path/to/science/variable'})
```

The `VarInfoFromDmr` and `VarInfoFromNetCDF4` classes also have an optional
argument `short_name`, which can be used upon instantiation to specify the
short name of the collection to which the granule belongs. This option is the
preferred way to specify a collection short name, and particularly encouraged
for use when a granule does not contain the collection short name within its
metadata attributes (e.g., ABoVE collections from ORNL).

``` python
var_info = VarInfoFromDmr('/path/to/local/file.dmr', short_name='ATL03')
```

Note: as there are now two optional parameters, `short_name` and `config_file`,
it is best to ensure that both are specified as named arguments upon
instantiation.

### UMM-Var generation

`earthdata-varinfo` can generate variable metadata records compatible with the
CMR UMM-Var schema:

``` python
from varinfo import VarInfoFromNetCDF4
from varinfo.umm_var import export_all_umm_var_to_json, get_all_umm_var

# Instantiate a VarInfoFromNetCDF4 object for a local NetCDF-4 file.
var_info = VarInfoFromNetCDF4('/path/to/local/file.nc4', short_name='ATL03')

# Retrieve a dictionary of UMM-Var JSON records. Keys are the full variable
# paths, values are UMM-Var schema-compatible, JSON-serialisable dictionaries.
umm_var = get_all_umm_var(var_info)

# Write each UMM-Var dictionary to its own JSON file:
export_all_umm_var_to_json(list(umm_var.values()), output_dir='local_dir')
```

### End-to-end UMM-Var generation and publication:

``` python
from cmr import CMR_OPS
from varinfo.generate_umm_var import generate_collection_umm_var

# Defaults to UAT, and not to publish:
umm_var_json = generate_collection_umm_var(<UAT collection concept ID>,
                                           <authorization header>)

# To use a production collection:
umm_var_json = generate_collection_umm_var(<Production collection concept ID>,
                                           <authorization header>,
                                           cmr_env=CMR_OPS)

# To generate and publish records for a UAT collection (note the authorization
# header must contain a LaunchPad token):
umm_var_json = generate_collection_umm_var(<UAT collection concept ID>,
                                           <authorization header>,
                                           publish=True)

# Use a DMR file to generate UMM-Var, defaults to UAT, and not to publish:
umm_var_json = generate_collection_umm_var(<UAT collection concept ID>,
                                           <authorization header>,
                                           use_dmr=True)

# To generate and publish records from a DMR file for a UAT collection
# (note the authorization header must contain a LaunchPad token):
umm_var_json = generate_collection_umm_var(<UAT collection concept ID>,
                                           <authorization header>,
                                           publish=True, use_dmr=True)
```

Expected outputs:

* `publish=False`, or not specifying a value, will result in JSON output
  containing the UMM-Var JSON for each identified variable.
* `publish=True` will return a list of strings. Each string is either the
  concept ID of a new UMM-Var record, or a string including the full path of
  a variable that failed to publish and the error messages returned from CMR.
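
Because the `publish=True` output mixes concept IDs and failure messages in one
list, it can be useful to separate the two. The following is a minimal sketch,
not part of `earthdata-varinfo`: `split_publication_results` is a hypothetical
helper, the sample strings are invented for illustration, and the regular
expression assumes that variable concept IDs follow the usual CMR pattern of
`V`, digits, a hyphen and a provider ID.

```python
import re

# Assumed pattern for a CMR variable concept ID, e.g. "V1234567890-PROV".
VARIABLE_CONCEPT_ID = re.compile(r'^V\d+-\w+$')


def split_publication_results(results):
    """Split results into (concept_ids, failures). Hypothetical helper."""
    concept_ids = [item for item in results if VARIABLE_CONCEPT_ID.match(item)]
    failures = [item for item in results if not VARIABLE_CONCEPT_ID.match(item)]
    return concept_ids, failures


# Illustrative values only, not real CMR output:
example_results = [
    'V1234567890-PROV',
    '/gt1r/heights/h_ph: example error message',
]
published, failed = split_publication_results(example_results)
```

If the failure strings ever matched the assumed concept ID pattern, this split
would misclassify them, so treat it as a convenience rather than a guarantee.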

Native IDs for generated UMM-Var records will be of format:

```
<collection concept ID>-<variable Name>
```

For variables that are hierarchical, slashes will be converted to underscores,
to ensure the native ID is compatible with the CMR API.
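
As a rough illustration of the native ID format described above, the sketch
below joins a collection concept ID and a variable name, converting slashes to
underscores. `build_native_id` is a hypothetical helper and the concept ID and
variable name are invented; how the package itself treats a leading slash is
not covered here.

```python
def build_native_id(collection_concept_id, variable_name):
    """Hypothetical helper mirroring the native ID format described above."""
    return f"{collection_concept_id}-{variable_name.replace('/', '_')}"


# Illustrative concept ID and hierarchical variable name:
native_id = build_native_id('C1234567890-PROV', 'gt1r/heights/h_ph')
print(native_id)  # C1234567890-PROV-gt1r_heights_h_ph
```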

## Configuration file schema:

The configuration file schema is defined as a JSON schema file in the `config`
directory. Each new iteration to the schema should be placed in its own
semantically versioned subdirectory, and a sample configuration file should be
provided. Additionally, notes on the schema changes should be provided in
`config/CHANGELOG.md`.

## Installing

### Using pip

Install the latest version of the package from PyPI using pip:

```bash
$ pip install earthdata-varinfo
```

### Other methods:

For local development, it is possible to clone the repository and then install
the version being developed in editable mode:

```bash
$ git clone https://github.com/nasa/earthdata-varinfo
$ cd earthdata-varinfo
$ pip install -e .
```

## Contributing

Contributions are welcome! For more information see `CONTRIBUTING.md`.

## Developing

Development within this repository should occur on a feature branch. Pull
Requests (PRs) are created with a target of the `main` branch before being
reviewed and merged.

Releases are created when a feature branch is merged to `main` and that branch
also contains an update to the `VERSION` file.

### Development Setup:

Prerequisites:

  - Python 3.9+, ideally installed in a virtual environment, such as `pyenv` or
    `conda`.
  - A local copy of this repository.

Set up a conda virtual environment:

```bash
conda create --name earthdata-varinfo python=3.12 --channel conda-forge \
    --override-channels -y
conda activate earthdata-varinfo
```

Install dependencies:

```bash
$ make develop
```

or

```bash
pip install -r requirements.txt -r dev-requirements.txt
```

Run a linter against package code (preferably do this prior to submitting code
for a PR review):

```bash
$ make lint
```

Run the unit test suite (executed via `pytest`, but written using `unittest`
classes):

```bash
$ make test
```

Note: test execution will fail if code coverage of unit tests falls below
95%. This threshold is also used during the GitHub workflow CI/CD.

### pre-commit hooks:

This repository uses [pre-commit](https://pre-commit.com/) to check commits
against some coding standard best practices. These include:

* Removing trailing whitespaces.
* Removing blank lines at the end of a file.
* Checking that JSON files have valid formats.
* [ruff](https://github.com/astral-sh/ruff) Python linting checks.
* [black](https://black.readthedocs.io/en/stable/index.html) Python code
  formatting checks.

To enable these checks:

```bash
# Install pre-commit Python package as part of test requirements:
pip install -r dev-requirements.txt

# Install the git hook scripts:
pre-commit install

# (Optional) Run against all files:
pre-commit run --all-files
```

When you try to make a new commit locally, `pre-commit` will automatically run.
If any of the hooks detect non-compliance (e.g., trailing whitespace), that
hook will state it failed, and also try to fix the issue. You will need to
review and `git add` the changes before you can make a commit.

It is planned to implement additional hooks, possibly including tools such as
`mypy`.

[pre-commit.ci](https://pre-commit.ci) is configured such that these same
hooks will be automatically run for every pull request.

## Releasing:

All CI/CD for this repository is defined in the `.github/workflows` directory:

* run_tests.yml - A reusable workflow that runs the unit test suite under a
  matrix of Python versions.
* run_tests_on_pull_requests.yml - Triggered for all PRs against main. It runs
  the workflow in run_tests.yml to ensure all tests pass on the new code.
* publish_to_pypi.yml - Triggered either manually or for commits to the main
  branch that contain changes to the `VERSION` file.

The `publish_to_pypi.yml` workflow will:

* Run the full unit test suite, to prevent publication of broken code.
* Extract the semantic version number from `VERSION`.
* Extract the release notes for the most recent version from `CHANGELOG.md`.
* Build the package to be published to PyPI.
* Publish the package to PyPI.
* Publish a GitHub release under the semantic version number, with associated
  git tag.

Before triggering a release, ensure the `VERSION` and `CHANGELOG.md`
files are updated accordingly.
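
To preview what a release might pick up from those two files, something like
the sketch below could be run locally. It is not the workflow's actual
implementation: `read_version` and `latest_release_notes` are hypothetical
helpers, and the sketch assumes that `VERSION` holds only the version string
and that each release in `CHANGELOG.md` starts with a `## ` heading, newest
first.

```python
def read_version(version_text):
    """Return the version string, assuming the file holds only that string."""
    return version_text.strip()


def latest_release_notes(changelog_text):
    """Return the text under the first '## ' heading, up to the next one."""
    notes = []
    in_latest = False
    for line in changelog_text.splitlines():
        if line.startswith('## '):
            if in_latest:
                break  # Reached the heading of the previous release.
            in_latest = True
            continue
        if in_latest:
            notes.append(line)
    return '\n'.join(notes).strip()


# Illustrative file contents, not taken from the repository:
example_version = '1.2.3\n'
example_changelog = (
    '# Changelog\n\n'
    '## v1.2.3\n\n* Newest change.\n\n'
    '## v1.2.2\n\n* Older change.\n'
)

version = read_version(example_version)
notes = latest_release_notes(example_changelog)
```

In practice the two example strings would be replaced by the contents of the
real files, for instance via `pathlib.Path('VERSION').read_text()`.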

## Get in touch:

You can reach out to the maintainers of this repository via email:

* david.p.auty@nasa.gov
* owen.m.littlejohns@nasa.gov
