Metadata-Version: 2.4
Name: zarr-copy
Version: 0.4.0
Summary: Copy a zarr dataset while changing chunking - variable by variable
Project-URL: Homepage, https://gitlab.dkrz.de/nils/zarr-copy
Requires-Python: >=3.9
Requires-Dist: fsspec>=2025.10.0
Requires-Dist: numcodecs>=0.12
Requires-Dist: s3fs>=2025.10.0
Requires-Dist: zarr>=2.18.2
Description-Content-Type: text/markdown

# zarr-copy

<p align="center">
  <img src="icon.svg" alt="zarr-copy logo" width="160"/>
</p>

Copy a zarr dataset variable by variable, optionally rechunking along the way.

## Installation

```bash
pip install -e .
```

Or with [uv](https://github.com/astral-sh/uv):

```bash
uv sync
```

## Usage

```
zarr-copy [-h] [-v VAR [VAR ...]] [--rechunk DIM=SIZE [DIM=SIZE ...]]
          [--double-remap DIM] [--tmp-path DIR] [--transform FILE]
          PATH [PATH ...] DST
```

### Positional arguments

| Argument        | Description                                                                                  |
|-----------------|----------------------------------------------------------------------------------------------|
| `PATH [PATH …]` | One or more source paths or S3 URIs. Multiple sources require `--transform`.                 |
| `DST`           | Path or S3 URI of the destination zarr group (always the last positional argument).          |

### Optional arguments

| Flag                   | Description                                                                              |
|------------------------|------------------------------------------------------------------------------------------|
| `-v VAR …`             | Variable names to copy. Copies all variables when omitted.                               |
| `--rechunk DIM=SIZE …` | Rechunk one or more dimensions, e.g. `--rechunk time=1 lat=256`.                         |
| `--double-remap DIM`   | Use the two-pass copy strategy for the given dimension to reduce peak memory usage.      |
| `--tmp-path DIR`       | Directory for temporary files used by `--double-remap`. Defaults to the system temp dir. |
| `--transform FILE`     | Python config file defining output variable names and how to compute them (see below).   |

### Examples

Copy all variables without rechunking:

```bash
zarr-copy /data/input.zarr /data/output.zarr
```

Copy selected variables and rechunk the `time` dimension:

```bash
zarr-copy /data/input.zarr /data/output.zarr -v temperature salinity --rechunk time=1
```

Copy from S3, rechunk multiple dimensions, and use the double-remap strategy:

```bash
zarr-copy s3://my-bucket/input.zarr /data/output.zarr \
  --rechunk time=1 lat=256 lon=256 \
  --double-remap time \
  --tmp-path /scratch
```

Write to S3 from a local source:

```bash
zarr-copy /data/input.zarr s3://my-bucket/output.zarr \
  --rechunk lat=256 lon=256
```

Use larger spatial chunks to speed up reads that span wide lat/lon regions:

```bash
zarr-copy /data/input.zarr /data/output.zarr --rechunk lat=256 lon=256
```

Optimise for reading individual time steps by setting the time chunk size to 1 (one chunk per time step):

```bash
zarr-copy /data/input.zarr /data/output.zarr --rechunk time=1
```

Rechunk a large dataset whose chunking along the time axis changes
substantially, using `/scratch` for intermediate storage to cap memory use:

```bash
zarr-copy /data/input.zarr /data/output.zarr \
  -v u v w temperature \
  --rechunk time=1 level=1 lat=180 lon=360 \
  --double-remap time \
  --tmp-path /scratch
```

## Notes

- Rechunking requires `_ARRAY_DIMENSIONS` metadata on each array (standard for
  xarray/CF-convention zarr stores).
- S3 access is handled transparently via [s3fs](https://s3fs.readthedocs.io/).
- The `--double-remap` strategy writes data to a temporary intermediate zarr
  (lz4-compressed) to avoid holding large in-memory buffers during rechunking.
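
To illustrate why rechunking depends on `_ARRAY_DIMENSIONS`, here is a minimal sketch of how `DIM=SIZE` pairs could be matched against an array's dimension names to produce a new chunk shape. The helper names `parse_rechunk` and `target_chunks` are hypothetical, and keeping the current chunk size for dimensions that are not mentioned is an assumption, not confirmed behaviour of `zarr-copy`.

```python
def parse_rechunk(pairs):
    """Turn ["time=1", "lat=256"] into {"time": 1, "lat": 256}."""
    result = {}
    for pair in pairs:
        dim, sep, size = pair.partition("=")
        if not sep or not dim:
            raise ValueError(f"expected DIM=SIZE, got {pair!r}")
        result[dim] = int(size)
    return result


def target_chunks(dims, current_chunks, rechunk):
    """Return the new chunk shape for one array.

    dims           -- the array's _ARRAY_DIMENSIONS list, one name per axis
    current_chunks -- the array's existing chunk shape
    rechunk        -- mapping of dimension name to requested chunk size
    """
    if len(dims) != len(current_chunks):
        raise ValueError("dimension names and chunk shape differ in length")
    # Assumption: axes not named in `rechunk` keep their current chunk size.
    return tuple(rechunk.get(name, size) for name, size in zip(dims, current_chunks))


requested = parse_rechunk(["time=1", "lat=256"])
print(target_chunks(["time", "lat", "lon"], (24, 90, 180), requested))
# prints (1, 256, 180)
```

Without the dimension names there is no way to tell which axis `time=1` refers to, which is why arrays lacking this attribute cannot be rechunked by name.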

## Transform: rename and compute output variables

The `--transform` flag accepts a Python file that defines a module-level
`transform` dict. Each key is the **output variable name**; each value is
either:

- a **string**: the name of the source variable to copy under the output
  name (a plain rename), or
- a **callable**: a function that computes the output from one or more source
  variables, whose names are taken from the callable's **parameter names**.

When `--transform` is provided, the `-v` flag is ignored; the spec alone
determines the output variables.
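
The parameter-name convention can be sketched with the standard library's `inspect.signature`. This is only an illustration of the idea, not necessarily how `zarr-copy` resolves names internally; the helper `source_names` is hypothetical.

```python
import inspect


def source_names(spec):
    """Map each output variable to the source variables it needs."""
    needed = {}
    for out_name, rule in spec.items():
        if isinstance(rule, str):
            # A string names the single source variable to copy.
            needed[out_name] = [rule]
        elif callable(rule):
            # A callable's parameter names are the source variable names.
            needed[out_name] = list(inspect.signature(rule).parameters)
        else:
            raise TypeError(
                f"{out_name}: expected str or callable, got {type(rule).__name__}"
            )
    return needed


transform = {
    "psl": "msl",
    "rlut": lambda ttr: ttr / -3600,
    "rsut": lambda tisr, tsr: (tisr - tsr) / 3600,
}
print(source_names(transform))
# prints {'psl': ['msl'], 'rlut': ['ttr'], 'rsut': ['tisr', 'tsr']}
```

One consequence of this convention: a source variable whose name is not a valid Python identifier, such as `10u`, can be referenced through the string form but not as a callable parameter.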

### Transform config file example

```python
# transform.py
transform = {
    "time": "time",                                   # copy, same name
    "lon":  "lon",
    "lat":  "lat",
    "rlut": lambda ttr: ttr / -3600,                  # single source
    "rsut": lambda tisr, tsr: (tisr - tsr) / 3600,   # two sources
    "pr":   lambda tp: tp * (1000 / 3600),
    "psl":  "msl",                                    # rename msl -> psl
    "ts":   "skt",
    "uas":  "10u",
}
```

### Usage

```bash
zarr-copy /data/era5.zarr /data/cmip.zarr --transform transform.py
```

Combine with `--rechunk` to rename, compute, and rechunk in one pass:

```bash
zarr-copy /data/era5.zarr /data/cmip.zarr \
  --transform transform.py \
  --rechunk time=1 lat=256 lon=256
```

Merge variables from two source datasets into one output using a transform.
The last positional argument is always the destination; sources are searched in
order and the first match wins:

```bash
zarr-copy /data/era5_atm.zarr /data/era5_sfc.zarr /data/cmip.zarr \
  --transform transform.py \
  --rechunk time=1 lat=256 lon=256
```
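
The "first match wins" rule can be pictured as a simple ordered lookup. In the sketch below, plain dicts stand in for opened zarr groups, and `find_source` is a hypothetical helper rather than part of the package.

```python
def find_source(name, sources):
    """Return (index, array) for the first source that contains `name`."""
    for index, group in enumerate(sources):
        if name in group:
            return index, group[name]
    raise KeyError(f"{name!r} not found in any source")


# Plain dicts stand in for the two opened source groups, in command-line order.
atm = {"t": "t from atm", "msl": "msl from atm"}
sfc = {"msl": "msl from sfc", "skt": "skt from sfc"}

print(find_source("msl", [atm, sfc]))  # present in both: the first source is used
print(find_source("skt", [atm, sfc]))  # only in the second source
```

If the same variable exists in several sources, the order of the `PATH` arguments therefore decides which copy ends up in the output.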

## SLURM: parallel rechunking on HPC

For large datasets, `zarr-copy-slurm` submits one SLURM job per variable so
that variables can be rechunked concurrently. It accepts the same rechunking
flags as `zarr-copy` and forwards them to each job.

```
zarr-copy-slurm [-h] [-v VAR [VAR ...]] [--rechunk DIM=SIZE [DIM=SIZE ...]]
                [--double-remap DIM] [--tmp-path DIR] [--transform FILE]
                [--cpus N] [--mem MEM] [--walltime HH:MM:SS] [--log-dir DIR]
                [--dry-run]
                PATH [PATH ...] DST
```
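
As a rough picture of the one-job-per-variable idea, the sketch below builds a batch script for a single variable. The helper `job_script`, its default resource values, and the job and log file names are all hypothetical; the scripts that `zarr-copy-slurm` actually generates may look different (use `--dry-run` to inspect them). Only the `#SBATCH` directives themselves are standard SLURM options.

```python
import shlex


def job_script(variable, src, dst, rechunk,
               cpus=4, mem="32G", walltime="04:00:00", log_dir="logs"):
    """Build the text of a batch script that copies one variable."""
    # Each job runs the plain zarr-copy CLI restricted to one variable.
    command = ["zarr-copy", src, dst, "-v", variable, "--rechunk", *rechunk]
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name=zarr-copy-{variable}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem}",
        f"#SBATCH --time={walltime}",
        f"#SBATCH --output={log_dir}/zarr-copy-{variable}.log",
        "",
        shlex.join(command),  # quote arguments safely for the shell
    ]
    return "\n".join(lines) + "\n"


for var in ["temperature", "salinity"]:
    print(job_script(var, "/data/input.zarr", "/data/output.zarr",
                     ["time=1", "lat=256"], cpus=8, mem="64G"))
```

Each generated script would then be handed to `sbatch`, giving the scheduler one independent job per variable.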

### Examples

Preview the generated job scripts without submitting (useful for checking
before committing to a long run):

```bash
zarr-copy-slurm /data/input.zarr /data/output.zarr \
  --rechunk time=1 lat=256 lon=256 \
  --dry-run
```

Submit one job per variable, rechunking the time dimension into single steps:

```bash
zarr-copy-slurm /data/input.zarr /data/output.zarr --rechunk time=1
```

Submit jobs for selected variables only, with custom resource requests:

```bash
zarr-copy-slurm /data/input.zarr /data/output.zarr \
  -v temperature salinity u v \
  --rechunk time=1 lat=256 lon=256 \
  --cpus 8 --mem 64G --walltime 08:00:00
```

Large dataset with the double-remap strategy, temporaries on scratch:

```bash
zarr-copy-slurm /data/input.zarr /data/output.zarr \
  --rechunk time=1 level=1 lat=180 lon=360 \
  --double-remap time \
  --tmp-path /scratch/$USER \
  --log-dir /data/logs
```

Submit one job per output variable defined by a transform spec:

```bash
zarr-copy-slurm /data/era5.zarr /data/cmip.zarr \
  --transform transform.py \
  --rechunk time=1 lat=256 lon=256
```

To run without installing the package, use `uvx`:

```bash
uvx --from zarr-copy zarr-copy-slurm /data/input.zarr /data/output.zarr \
  --rechunk time=1 --dry-run
```
