Metadata-Version: 2.4
Name: annslicer
Version: 0.1.5
Summary: Out-of-core sharding of large .h5ad AnnData files with minimal memory usage.
Author: sfleming
License: MIT
Project-URL: Homepage, https://github.com/sfleming/annslicer
Project-URL: Bug Tracker, https://github.com/sfleming/annslicer/issues
Keywords: anndata,h5ad,bioinformatics,single-cell,genomics
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE.md
Requires-Dist: anndata>=0.9
Requires-Dist: numpy>=1.24
Requires-Dist: scipy>=1.10
Requires-Dist: h5py>=3.8
Provides-Extra: dev
Requires-Dist: annslicer[zarr]; extra == "dev"
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-benchmark>=4.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: mypy==1.15.0; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Provides-Extra: zarr
Requires-Dist: zarr>=2.10; extra == "zarr"
Dynamic: license-file

# annslicer

**Out-of-core sharding and merging of large AnnData files with minimal memory usage.**

![Diagram](diagram.png)

Large single-cell datasets stored as `.h5ad` or `.zarr` files can easily exceed available RAM. `annslicer` slices them into manageable shards — and merges them back — without loading full matrices into memory. It uses best practices from `anndata` with a few small speed improvements for random shuffling.

It consolidates these practices into a simple command-line tool:

```bash
annslicer slice input.h5ad output_prefix
```

```bash
annslicer merge output.h5ad shard_0.h5ad shard_1.h5ad
```

## Features

- Shards and merges `X`, all `layers`, `obs`, `var`, `obsm`, and `uns`
- Handles both dense and sparse (CSR) matrices
- Constant, low memory footprint regardless of file size
- Input supports both `.h5ad` and `.zarr` formats for slicing
- Merge output supports both `.h5ad` and `.zarr` formats
- Optional **cell shuffling** (`--shuffle`) for representative shards without loading the full matrix
- Simple CLI and Python API

## Installation

```bash
pip install annslicer
```

For Zarr input/output support (optional):

```bash
pip install annslicer[zarr]
```

## CLI Usage

`annslicer` provides two subcommands: `slice` and `merge`.

### Sharding a large file

```bash
annslicer slice input.h5ad output_prefix --size 10000
```

Both `.h5ad` and `.zarr` inputs are supported.

| Argument | Description |
|---|---|
| `input.h5ad` or `input.zarr` | Path to the source file |
| `output_prefix` | Prefix for output files (e.g. `atlas` → `atlas_shard_0.h5ad`, …) |
| `--size N` | Number of cells per shard (default: `10000`) |
| `--shuffle` | Randomly assign cells to shards (each shard is a representative draw) |
| `--seed N` | Random seed for reproducible shuffling (requires `--shuffle`) |
| `--compression FILTER` | HDF5 compression filter for shard files (e.g. `gzip`, `lzf`); default: no compression |

**Example — basic sharding:**

```bash
annslicer slice /data/large_atlas.h5ad /outputs/atlas --size 20000
```

**Example — shuffled sharding from a large h5ad:**

```bash
annslicer slice /data/large_atlas.h5ad /outputs/atlas --size 10000 --shuffle --seed 0
```

**Example — gzip-compressed shards:**

```bash
annslicer slice /data/large_atlas.h5ad /outputs/atlas --size 10000 --compression gzip
```

Produces: `atlas_shard_0.h5ad`, `atlas_shard_1.h5ad`, …
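
As a rough illustration of how `--size` translates into shards, the sketch below computes the row range each shard would cover for an unshuffled split. The helper `shard_ranges` is hypothetical and not part of `annslicer`; the file names follow the `atlas_shard_0.h5ad` pattern shown above, and the sketch assumes the final shard simply holds the remaining cells.

```python
import math


def shard_ranges(n_cells: int, shard_size: int) -> list[tuple[int, int]]:
    """Return half-open (start, stop) row ranges, one per shard."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    n_shards = math.ceil(n_cells / shard_size)
    return [
        (i * shard_size, min((i + 1) * shard_size, n_cells))
        for i in range(n_shards)
    ]


# Hypothetical dataset: 45,000 cells split with --size 20000.
ranges = shard_ranges(45_000, 20_000)
names = [f"atlas_shard_{i}.h5ad" for i in range(len(ranges))]
for name, (start, stop) in zip(names, ranges):
    print(f"{name}: rows {start}..{stop - 1} ({stop - start} cells)")
```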

### Merging shards back into one file

```bash
annslicer merge output.h5ad shard_0.h5ad shard_1.h5ad shard_2.h5ad
```

Output format is inferred from the extension — use `.zarr` for Zarr output (requires `annslicer[zarr]`):

```bash
annslicer merge output.zarr shard_0.h5ad shard_1.h5ad shard_2.h5ad
```

Input files can also be specified as glob patterns (expanded lexicographically):

```bash
annslicer merge output.h5ad "shards/atlas_shard_*.h5ad"
```
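
Because patterns are expanded lexicographically, unpadded shard numbers may not come out in numeric order once there are ten or more shards. The sketch below uses plain Python string sorting as a stand-in for that expansion (an assumption about the exact behaviour), and a hypothetical `natural_key` helper shows one way to build a numerically ordered list instead.

```python
import re

# Hypothetical shard names, as a glob might return them in arbitrary order.
paths = [f"shards/atlas_shard_{i}.h5ad" for i in (2, 10, 0, 1, 11)]

# Lexicographic order compares character by character, so "10" sorts before "2".
lexicographic = sorted(paths)


def natural_key(path: str) -> list:
    """Split digits from text so numeric parts compare as integers."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", path)]


# Natural order: shard numbers compare as integers.
numeric = sorted(paths, key=natural_key)

print(lexicographic)
print(numeric)
```

If row order in the merged file matters, listing the shards explicitly in the desired order sidesteps the question.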

| Argument | Description |
|---|---|
| `output_file` | Path for the merged output file (`.h5ad` or `.zarr`) |
| `input_files` | One or more shard paths or glob patterns, in order |
| `--join {inner,outer}` | How to join var (gene) axes when files differ (default: `outer`) |

When shards have **different gene sets**, `--join outer` (default) takes the union of all genes and fills missing entries with zeros; `--join inner` keeps only genes present in every shard. Layers absent from any shard are always dropped.
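
The join rules can be illustrated with plain Python on gene names alone. This is only a sketch of the semantics described above, not `annslicer`'s implementation; the shard contents and the first-seen ordering of the union are assumptions made for the example.

```python
# Two hypothetical shards, each with its own gene list and set of layers.
shard_vars = {
    "shard_a": ["GeneA", "GeneB", "GeneC"],
    "shard_b": ["GeneB", "GeneC", "GeneD"],
}
shard_layers = {
    "shard_a": {"counts", "normalized"},
    "shard_b": {"counts"},
}

# Outer join: union of genes, kept in first-seen order (ordering is an assumption).
outer = list(dict.fromkeys(g for genes in shard_vars.values() for g in genes))

# Inner join: only genes present in every shard.
common = set.intersection(*(set(genes) for genes in shard_vars.values()))
inner = [g for g in outer if g in common]

# A layer survives only if every shard has it.
kept_layers = set.intersection(*shard_layers.values())

print(outer)
print(inner)
print(kept_layers)
```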

### Global options

| Flag | Description |
|---|---|
| `--debug` | Enable verbose debug-level logging |

## Python API

```python
from annslicer import shard_h5ad, merge_out_of_core

# Basic sharding (h5ad or zarr input)
shard_h5ad("large_atlas.h5ad", "atlas", shard_size=20000)
shard_h5ad("large_atlas.zarr", "atlas", shard_size=20000)  # requires annslicer[zarr]

# Shuffled sharding — cells are randomly distributed across shards
shard_h5ad("large_atlas.h5ad", "atlas", shard_size=20000, shuffle=True, seed=0)

# Gzip-compressed shards — smaller files at the cost of write speed
shard_h5ad("large_atlas.h5ad", "atlas", shard_size=20000, compression="gzip")

# Custom output filenames — provide explicit paths instead of auto-generated names
shard_h5ad(
    "large_atlas.h5ad",
    "atlas",  # ignored when output_filenames is provided
    shard_size=20000,
    output_filenames=["batch_0.h5ad", "batch_1.h5ad", "batch_2.h5ad"],
)

# Merge shards back into one file (identical-var fast path used automatically)
merge_out_of_core(["atlas_shard_0.h5ad", "atlas_shard_1.h5ad"], "merged.h5ad")

# Merge shards with different gene sets — outer join (union, fills absent genes with 0)
merge_out_of_core(["shard_a.h5ad", "shard_b.h5ad"], "merged.h5ad", join="outer")

# Merge shards with different gene sets — inner join (intersection only)
merge_out_of_core(["shard_a.h5ad", "shard_b.h5ad"], "merged.h5ad", join="inner")
```

## How it works

### Slicing
1. Opens the input file ("backed" AnnData for `.h5ad`; `anndata.io.sparse_dataset` for `.zarr`).
2. If `shuffle=True`, generates a global cell permutation upfront using `numpy.random.default_rng`.
3. For each shard, reads only the relevant rows from `X` and each layer via sorted fancy indexing — no full matrix is ever loaded into memory.
4. When shuffling, rows are read in sorted index order (maximising sequential I/O) and then reordered in-memory to the desired shuffled order.
5. Reassembles a valid `AnnData` object per shard and writes it to disk.
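
Steps 2 to 4 can be sketched with NumPy alone. The in-memory array `X` below is a stand-in for an on-disk matrix, and all variable names are illustrative rather than taken from `annslicer`'s source; the point is the sort, read, then un-sort pattern.

```python
import numpy as np

# Stand-in for an on-disk matrix; row i is filled with the value i so rows are easy to track.
n_cells, n_genes, shard_size = 10, 3, 4
X = np.repeat(np.arange(n_cells)[:, None], n_genes, axis=1)

# Step 2: one global permutation of cell indices, generated upfront.
rng = np.random.default_rng(0)
perm = rng.permutation(n_cells)

shards = []
for start in range(0, n_cells, shard_size):
    wanted = perm[start:start + shard_size]   # rows for this shard, in shuffled order
    order = np.argsort(wanted)                # positions that sort the row indices
    block = X[wanted[order]]                  # step 4: read rows in ascending order
    inverse = np.empty_like(order)
    inverse[order] = np.arange(order.size)    # inverse permutation that undoes the sort
    shards.append(block[inverse])             # rows now follow the shuffled order

# Every shard holds exactly the rows named by its slice of the permutation.
for i, shard in enumerate(shards):
    expected = perm[i * shard_size:(i + 1) * shard_size]
    assert np.array_equal(shard[:, 0], expected)
```

Only `shard_size` rows are held in memory at a time, which is what keeps the footprint independent of the total number of cells.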

### Merging
1. Reads `obs`, `var`, and `uns` from **all** shards to build a skeleton output file.
2. Computes the merged `var` index: union (outer join) or intersection (inner join) of gene sets across all shards. If every shard shares the identical `var`, remapping is skipped entirely (fast path).
3. Scans shards to calculate total non-zero sizes for pre-allocation (for an inner join, entries for excluded genes are filtered during the scan).
4. Streams `X`, layers, and `obsm` data shard-by-shard directly into the pre-allocated output arrays, remapping column indices on the fly where needed.
5. Layers absent from any shard are dropped so every cell has consistent layer coverage.
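
Steps 2 to 4 can be sketched for CSR data with NumPy and SciPy. This is a minimal in-memory illustration under stated assumptions, not the library's code: the arrays `data`, `indices`, and `indptr` stand in for pre-allocated on-disk datasets, the shards are hypothetical, and the first-seen ordering of the merged gene list is an assumption.

```python
import numpy as np
from scipy import sparse

# Two hypothetical CSR shards with overlapping but different gene sets.
shards = [
    (["GeneA", "GeneB"], sparse.csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]]))),
    (["GeneB", "GeneC"], sparse.csr_matrix(np.array([[3.0, 4.0]]))),
]

# Step 2: merged var index (outer join; first-seen order is an assumption).
merged_var = list(dict.fromkeys(g for genes, _ in shards for g in genes))
col_of = {g: j for j, g in enumerate(merged_var)}

# Step 3: scan shards to size the output arrays before writing anything.
total_nnz = sum(m.nnz for _, m in shards)
total_rows = sum(m.shape[0] for _, m in shards)
data = np.empty(total_nnz, dtype=np.float64)
indices = np.empty(total_nnz, dtype=np.int64)
indptr = np.zeros(total_rows + 1, dtype=np.int64)

# Step 4: stream each shard into its slot, remapping column indices on the fly.
nnz_off, row_off = 0, 0
for genes, m in shards:
    remap = np.array([col_of[g] for g in genes])
    data[nnz_off:nnz_off + m.nnz] = m.data
    indices[nnz_off:nnz_off + m.nnz] = remap[m.indices]
    indptr[row_off + 1:row_off + m.shape[0] + 1] = m.indptr[1:] + nnz_off
    nnz_off += m.nnz
    row_off += m.shape[0]

merged = sparse.csr_matrix((data, indices, indptr), shape=(total_rows, len(merged_var)))
print(merged.toarray())
```

Genes missing from a shard never receive stored entries, so they read back as zeros in the merged matrix without any explicit filling.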

> **Note:** CSC (column-compressed) sparse matrices are not supported for out-of-core row-slicing. Convert to CSR before sharding.
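
One way to check and convert the storage format with SciPy is sketched below. It converts in memory, so it only suits matrices that fit in RAM; the small matrix is hypothetical, and in practice it would come from an `AnnData` object's `X` or one of its layers.

```python
import numpy as np
from scipy import sparse

# Hypothetical small matrix stored column-compressed (CSC).
X = sparse.csc_matrix(np.array([[0, 1, 0], [2, 0, 3]]))

# Convert only when needed; tocsr() returns a new row-compressed matrix.
if X.format == "csc":
    X = X.tocsr()

print(X.format)
```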

## Benchmarks

Benchmarks were run on a synthetic sparse AnnData object with 200k cells and 10k genes.

### For h5ad format

| Slicing method | Mean runtime (s) | Peak memory (MB) |
|---|---|---|
| `annslicer slice` | 0.584 | 211.4 |
| `anndata` backed | 0.601 | 203.7 |
| `annslicer slice --shuffle` | 1.731 | 221.8 |
| `anndata` backed with shuffle | 3.830 | 209.1 |

### For zarr format

| Slicing method | Mean runtime (s) | Peak memory (MB) |
|---|---|---|
| `annslicer slice` | 1.050 | 62.1 |
| `anndata` backed | 0.799 | 54.4 |
| `annslicer slice --shuffle` | 5.544 | 142.9 |
| `anndata` backed with shuffle | 6.591 | 151.4 |

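Based on these benchmarks, we recommend `annslicer slice --shuffle` on an h5ad file for making randomly shuffled shards: it was the fastest shuffled configuration measured (1.731 s, versus 3.830 s for backed `anndata` on h5ad and 5.544 s for `annslicer slice --shuffle` on zarr).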

## License

MIT (see `LICENSE.md`)
