Metadata-Version: 2.4
Name: envdataprep
Version: 0.1.1
Summary: Extensible Environmental Data Preprocessing Framework
Author-email: Gongda Lu <gongda.lu@outlook.com>
License-Expression: MIT
Keywords: environmental,data,preprocessing,satellites,models,reanalysis,forecasts
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: netCDF4>=1.6.0
Requires-Dist: xarray>=2023.1.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: tqdm>=4.64.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Dynamic: license-file

# EnvDataPrep: High-performance Environmental Data Pre-processing
[![Python](https://img.shields.io/badge/python-3.12%2B-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Development Status](https://img.shields.io/badge/status-alpha-orange.svg)]()

## Why EnvDataPrep?
EnvDataPrep saves **money**, **time**, and **disk storage** for anyone who works with environmental datasets, by keeping only the variables you need and compressing the output.

## Current Capacity
The first feature is **unified, configuration-driven extraction of NetCDF files**. The example below extracts four variables from a TROPOMI satellite product file, which reduces the file size by ~90%:
```python
"""Example of extracting a subset of variables from a netCDF file."""

from pathlib import Path
import envdataprep as edp

# Set up input and output directories
ROOT = Path("E:/Samples/Satellites")
input_dir = ROOT / "Input"
output_dir = ROOT / "Output"

# Select an example file
file_name = "S5P_RPRO_L2__NO2____20190101T233659_20190102T011828_06322_03_020400_20221106T093236.nc"
file_path = input_dir / file_name

# List all variables in the file
variables = edp.list_netcdf_variables(file_path)
print(*variables, sep='\n')

# List the variables to extract
variable_paths = [
    "PRODUCT/latitude",
    "PRODUCT/longitude",
    "PRODUCT/nitrogendioxide_tropospheric_column",
    "PRODUCT/SUPPORT_DATA/INPUT_DATA/cloud_fraction_crb",
]

# Extract and save out
# By default, the output file preserves the original group structure
edp.subset_netcdf(
    file_path,
    output_dir,
    variable_paths,
    output_name="example_extracted_data.nc",
    compression_method='zlib',
    compression_level=9,
)
```
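To check the size reduction on your own files, you can compare the input and output sizes on disk. The following is a minimal sketch that uses only the standard library; the helper name `size_reduction_percent` is hypothetical and is not part of envdataprep.

```python
from pathlib import Path


def size_reduction_percent(original: Path, subset: Path) -> float:
    """Return how much smaller `subset` is than `original`, in percent.

    Hypothetical helper for illustration; not part of envdataprep.
    """
    original_size = original.stat().st_size
    subset_size = subset.stat().st_size
    if original_size == 0:
        raise ValueError(f"{original} is empty; cannot compute a reduction")
    return 100.0 * (1.0 - subset_size / original_size)
```

Assuming the extracted file is written to `output_dir / "example_extracted_data.nc"`, calling `size_reduction_percent(file_path, output_dir / "example_extracted_data.nc")` would report the reduction for the example above. The actual figure depends on which variables you keep and on the compression settings.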

## Installation

### Prerequisites
- **[Mamba](https://mamba.readthedocs.io/) (recommended) or [Conda](https://docs.conda.io/en/latest/)**
  - Preferred for installing scientific Python dependencies (netCDF4, xarray, numpy)
  - Handles complex dependency resolution more reliably than pip alone
- **Alternative**: a pip-only installation should also work, but is more involved

### Quick Setup
```bash
# 1. Get the code
git clone https://github.com/envmini/envdataprep.git
cd envdataprep

# 2. Create environment (choose one)
# Option A: Using Mamba (faster, recommended)
mamba env create -f environment.yml
mamba activate envdataprep

# Option B: Using Conda
# conda env create -f environment.yml
# conda activate envdataprep

# 3. Install package in development mode
pip install -e . --no-deps

# 4. Verify installation
python -c "import envdataprep; print('Installation successful!')"
```
**Why development mode for now?**
- The package is not yet published to PyPI or conda-forge
- It lets you get the latest features and contribute feedback
- It is easy to update with `git pull`

**Note**: We use `pip install -e .` even in conda/mamba environments. This `pip` command is the one from your active conda environment, not the system pip. You can verify this with:
```bash
which pip  # Should show: .../miniforge3/envs/envdataprep/bin/pip
# On Windows, use `where pip` instead
```

## License
This project is licensed under the [MIT License](LICENSE).

[⬆ Back to top](#envdataprep-high-performance-environmental-data-pre-processing)
