Metadata-Version: 2.4
Name: fiberhmm
Version: 2.4.0
Summary: Hidden Markov Model for calling chromatin footprints from fiber-seq and DAF-seq data
Author: FiberHMM Authors
License: MIT
Project-URL: Homepage, https://github.com/fiberseq/FiberHMM
Project-URL: Documentation, https://github.com/fiberseq/FiberHMM#readme
Project-URL: Repository, https://github.com/fiberseq/FiberHMM
Project-URL: Issues, https://github.com/fiberseq/FiberHMM/issues
Keywords: bioinformatics,chromatin,fiber-seq,DAF-seq,hidden-markov-model,nucleosome,footprinting,epigenetics,single-molecule
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.20
Requires-Dist: scipy>=1.7
Requires-Dist: pandas>=1.3
Requires-Dist: pysam>=0.19
Requires-Dist: tqdm>=4.60
Provides-Extra: numba
Requires-Dist: numba>=0.55; extra == "numba"
Provides-Extra: plots
Requires-Dist: matplotlib>=3.5; extra == "plots"
Provides-Extra: posteriors
Requires-Dist: h5py>=3.0; extra == "posteriors"
Provides-Extra: all
Requires-Dist: numba>=0.55; extra == "all"
Requires-Dist: matplotlib>=3.5; extra == "all"
Requires-Dist: h5py>=3.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: ruff>=0.1; extra == "dev"
Dynamic: license-file

# FiberHMM

Hidden Markov Model toolkit for calling chromatin footprints from Fiber-seq, DAF-seq, and other single-molecule footprinting data.

FiberHMM identifies protected regions (footprints) and accessible regions (methylase-sensitive patches, MSPs) from single-molecule DNA modification data, including m6A methylation (fiber-seq) and deamination marks (DAF-seq).

## Key Features

- **No genome context files** -- hexamer context computed directly from read sequences
- **Fibertools-compatible output** -- tagged BAM with `ns`/`nl`/`as`/`al` tags, ready for downstream tools
- **Native HMM implementation** -- no hmmlearn dependency; Numba JIT optional for ~10x speedup
- **Region-parallel processing** -- scales linearly with cores for large genomes
- **Multi-platform** -- supports PacBio fiber-seq, Nanopore fiber-seq, and DAF-seq
- **Legacy model support** -- loads old hmmlearn-trained pickle/NPZ models seamlessly

## Installation

### Using pip

```bash
pip install fiberhmm
```

### From source

```bash
git clone https://github.com/fiberseq/FiberHMM.git
cd FiberHMM
pip install -e .
```

### Optional dependencies

```bash
pip install numba        # ~10x faster HMM computation
pip install matplotlib   # --stats visualization
pip install h5py         # HDF5 posteriors export
```

For bigBed output, install [`bedToBigBed`](https://hgdownload.soe.ucsc.edu/admin/exe/) from UCSC tools.

## Quick Start

### 1. Generate emission probabilities

Requires accessible (naked DNA) and inaccessible (native chromatin) control BAMs:

```bash
python generate_probs.py \
    -a accessible_control.bam \
    -u inaccessible_control.bam \
    -o probs/ \
    --stats
```

### 2. Train HMM

```bash
python train_model.py \
    -i sample.bam \
    -p probs/tables/accessible_A_k3.tsv probs/tables/inaccessible_A_k3.tsv \
    -o models/ \
    --stats
```

### 3. Call footprints

```bash
python apply_model.py \
    -i experiment.bam \
    -m models/best-model.json \
    -o output/ \
    -c 8 \
    --scores
```

### 4. Extract to BED12/bigBed

```bash
python extract_tags.py -i output/experiment_footprints.bam
```

## Pre-trained Models

FiberHMM ships with pre-trained models in `models/` ready for immediate use:

| Model | File | Enzyme | Platform | Mode |
|-------|------|--------|----------|------|
| **Hia5 PacBio** | `hia5_pacbio.json` | Hia5 (m6A) | PacBio | `pacbio-fiber` |
| **Hia5 Nanopore** | `hia5_nanopore.json` | Hia5 (m6A) | Nanopore | `nanopore-fiber` |
| **DddA PacBio** | `ddda_pacbio.json` | DddA (deamination) | PacBio | `daf` |
| **DddB Nanopore** | `dddb_nanopore.json` | DddB (deamination) | Nanopore | `daf` |

```bash
# Example: call footprints with a pre-trained model
python apply_model.py -i experiment.bam -m models/hia5_pacbio.json -o output/ -c 8
```

## Analysis Modes

| Mode | Flag | Description | Target bases |
|------|------|-------------|--------------|
| **PacBio fiber-seq** | `--mode pacbio-fiber` | Default. m6A at A and T (both strands) | A, T (with RC) |
| **Nanopore fiber-seq** | `--mode nanopore-fiber` | m6A at A only (single strand) | A only |
| **DAF-seq** | `--mode daf` | Deamination at C/G (strand-specific) | C or G |

All scripts accept `--mode`; context size `-k` (3-10) determines the hexamer table size.

**Note on mode selection:** The `pacbio-fiber` vs `nanopore-fiber` distinction only matters for Hia5 (m6A), where PacBio detects modifications on both strands while Nanopore detects only one. For deaminase-based methods (DddA, DddB), `--mode daf` is always used regardless of sequencing platform -- the chemistry is inherently strand-specific. Accuracy may differ between platforms, but the mode is the same.

## Output

### BAM Tags

`apply_model.py` adds footprint tags compatible with the fibertools ecosystem:

| Tag | Type | Description |
|-----|------|-------------|
| `ns` | B,I | Nucleosome/footprint starts (0-based query coords) |
| `nl` | B,I | Nucleosome/footprint lengths |
| `as` | B,I | Accessible/MSP starts |
| `al` | B,I | Accessible/MSP lengths |
| `nq` | B,C | Footprint quality scores (0-255, with `--scores`) |
| `aq` | B,C | MSP quality scores (0-255, with `--scores`) |

### BED12/bigBed Extraction

Use `extract_tags.py` to extract features from tagged BAMs for browser visualization:

```bash
# Extract all feature types to bigBed
python extract_tags.py -i output/sample_footprints.bam

# Extract only footprints
python extract_tags.py -i output/sample_footprints.bam --footprint

# Keep BED files alongside bigBed
python extract_tags.py -i output/sample_footprints.bam --keep-bed
```

## Scripts Reference

### generate_probs.py / `fiberhmm-probs`

Generate emission probability tables from accessible and inaccessible control BAMs.

```bash
python generate_probs.py \
    -a accessible_control.bam \
    -u inaccessible_control.bam \
    -o probs/ \
    --mode pacbio-fiber \
    -k 3 4 5 6 \
    --stats
```

### train_model.py / `fiberhmm-train`

Train the HMM on BAM data using precomputed emission probabilities.

```bash
python train_model.py \
    -i sample.bam \
    -p probs/tables/accessible_A_k3.tsv probs/tables/inaccessible_A_k3.tsv \
    -o models/ \
    -k 3 \
    --stats
```

Output:
```
models/
├── best-model.json      # Primary format (recommended)
├── best-model.npz       # NumPy format
├── all_models.json      # All training iterations
├── training-reads.tsv   # Read IDs used for training
├── config.json          # Training parameters
└── plots/               # (with --stats)
```

### apply_model.py / `fiberhmm-apply`

Apply a trained model to call footprints. Supports region-parallel processing.

```bash
python apply_model.py \
    -i experiment.bam \
    -m models/best-model.json \
    -o output/ \
    -c 8 \
    --scores
```

Key options:

| Flag | Default | Description |
|------|---------|-------------|
| `-c/--cores` | 1 | CPU cores (0 = auto-detect) |
| `--region-size` | 10000000 | Region size for parallel chunks |
| `--skip-scaffolds` | false | Skip scaffold/contig chromosomes |
| `--chroms` | all | Process only these chromosomes |
| `--scores` | false | Compute per-footprint confidence scores |
| `-q/--min-mapq` | 20 | Min mapping quality |
| `--min-read-length` | 1000 | Min aligned read length (bp) |
| `-e/--edge-trim` | 10 | Edge masking (bp) |

#### Read Filtering

By default, `apply_model.py` skips reads that are unlikely to produce reliable footprint calls. Skipped reads are still written to the output BAM but without `ns`/`nl`/`as`/`al` tags. A summary of skip reasons is printed when the run completes.

Reads are skipped (written unchanged) when:

| Reason | Default | Override |
|--------|---------|----------|
| **MAPQ too low** | < 20 | `--min-mapq 0` |
| **Aligned length too short** | < 1000 bp | `--min-read-length 0` |
| **No MM/ML modification tags** | -- | Cannot override (no data to process) |
| **Unmapped** | -- | Cannot override |
| **No footprints detected** | HMM found nothing | Cannot override (tags require content) |

These defaults exist because low-MAPQ alignments have unreliable reference positions (making footprint coordinates meaningless), and very short reads provide insufficient signal for the HMM. You can lower or disable these thresholds, but results on marginal reads may be unreliable.

### extract_tags.py / `fiberhmm-extract`

Extract footprint/MSP/m6A/m5C features from tagged BAMs to BED12/bigBed.

```bash
python extract_tags.py -i output/sample_footprints.bam -o output/ -c 8
```

### export_posteriors.py / `fiberhmm-posteriors`

Export per-position HMM posterior probabilities for downstream analysis (e.g., CNN training, custom scoring). This runs the HMM forward-backward algorithm on each read to produce P(footprint) at every position -- a continuous probability rather than the binary footprint/MSP calls from `apply_model.py`.

**Input is the same BAM you would pass to `apply_model.py`** (any BAM with MM/ML modification tags). This is a parallel pipeline, not a downstream step -- you do not need to run `apply_model.py` first.

```bash
# Export to gzipped TSV (no extra deps)
python export_posteriors.py -i experiment.bam -m model.json -o posteriors.tsv.gz -c 4

# Export to HDF5 (requires: pip install h5py)
python export_posteriors.py -i experiment.bam -m model.json -o posteriors.h5 -c 4
```

### fiberhmm-utils

Consolidated utility for model and probability management. Four subcommands:

**convert** -- Convert legacy pickle/NPZ models to JSON:
```bash
fiberhmm-utils convert old_model.pickle new_model.json
```

**inspect** -- Print model metadata, parameters, and emission statistics:
```bash
fiberhmm-utils inspect model.json
fiberhmm-utils inspect model.json --full   # full emission table
```

**transfer** -- Transfer emission probabilities between modalities (e.g., fiber-seq to DAF-seq) using accessibility priors from a matched cell type:
```bash
fiberhmm-utils transfer \
    --target daf_sample.bam \
    --reference-bam fiberseq_footprints.bam \
    -o daf_probs/ \
    --mode daf \
    --stats
```

**adjust** -- Scale emission probabilities in a model (clamped to [0, 1]):
```bash
fiberhmm-utils adjust model.json --state accessible --scale 1.1 -o adjusted.json
```

## Fibertools Integration

FiberHMM produces BAM output with the same tag conventions used by [fibertools](https://github.com/fiberseq/fibertools-rs):

- `ns`/`nl` for nucleosome footprints
- `as`/`al` for methylase-sensitive patches (MSPs)
- `nq`/`aq` for quality scores

This means FiberHMM output can be used directly with any tool in the fibertools ecosystem, including `ft extract`, downstream analysis pipelines, and genome browsers that support fibertools-style BAM tags.

## FiberBrowser

FiberBrowser, a dedicated genome browser for single-molecule chromatin data, is coming soon.

## Performance Tips

1. **Use multiple cores**: `-c 8` (or more) for parallel processing
2. **Adjust region size for small genomes**:
   - Yeast (~12 MB): `--region-size 500000`
   - Drosophila (~140 MB): `--region-size 2000000`
   - Human/mammalian: default 10 MB is fine
3. **Skip scaffolds**: `--skip-scaffolds` to avoid thousands of small contigs
4. **Install numba**: `pip install numba` for ~10x faster HMM training

## Model Formats

| Format | Extension | Description |
|--------|-----------|-------------|
| **JSON** | `.json` | Primary format -- portable, human-readable |
| NPZ | `.npz` | NumPy archive -- supported for loading |
| Pickle | `.pickle` | Legacy format -- supported for loading |

New models are always saved in JSON. Convert legacy models with:

```bash
fiberhmm-utils convert old_model.pickle new_model.json
```

## License

MIT License. See [LICENSE](LICENSE) for details.
