Metadata-Version: 2.4
Name: sniffcell
Version: 0.6.0
Summary: SniffCell: Annotate SVs cell type based on CpG methylation
Home-page: https://github.com/Fu-Yilei/SniffCell
Author: Yilei Fu
Author-email: yilei.fu@bcm.edu
License: MIT
Project-URL: Bug Tracker, https://github.com/Fu-Yilei/SniffCell/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pysam>=0.21.0
Requires-Dist: edlib>=1.3.9
Requires-Dist: psutil>=5.9.4
Requires-Dist: numpy>=2.2.0
Requires-Dist: pandas>=2.3.0
Requires-Dist: scipy
Requires-Dist: tqdm
Requires-Dist: scikit-learn
Requires-Dist: matplotlib
Dynamic: author
Dynamic: author-email
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: summary

# SniffCell - annotate structural variants with methylation-derived cell-type signals
[![PyPI version](https://img.shields.io/pypi/v/sniffcell.svg)](https://pypi.org/project/sniffcell/)
[![Install](https://img.shields.io/badge/Install-PyPI-3776AB?logo=pypi&logoColor=white)](https://pypi.org/project/sniffcell/)
[![Docs](https://img.shields.io/badge/Docs-GitHub-181717?logo=github)](https://github.com/Fu-Yilei/SniffCell/wiki)
[![Issues](https://img.shields.io/badge/Issues-GitHub-181717?logo=github)](https://github.com/Fu-Yilei/SniffCell/issues)

SniffCell analyzes long-read methylation around SVs and provides cell-type-aware annotations.

## Version
Current package version in code: `v0.6.0`.

## Install
```bash
pip install sniffcell
```

For local development:
```bash
pip install -e .
```

## CLI commands
```text
sniffcell {find,deconv,anno,svanno,dmsv,viz}
```

Command status in the current code:
- `find`: implemented.
- `anno`: implemented.
- `svanno`: implemented.
- `dmsv`: implemented.
- `viz`: implemented.
- `deconv`: placeholder stub (currently prints args only).

## Input assumptions
- BAM: long-read BAM with modified base tags; `HP` haplotype tag is optional.
- Reference: FASTA indexed for region fetches.
- VCF: `INS` and `DEL` records are used.
- VCF INFO field `RNAMES` is used for supporting reads unless overridden by `--kanpig_read_names`.
- VCF INFO fields `STDEV_POS`, `STDEV_LEN`, `SVLEN` are used to derive `ref_start` and `ref_end` windows.
- BED for `anno`: one tab-delimited hierarchical DMR file from `sniffcell find` with at least `chr`, `start`, `end`, `best_group`, `best_dir`.
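
Before a long run, it can help to confirm that the VCF carries the INFO fields listed above. The following is a minimal sketch using only the standard library; the helper names (`parse_info`, `missing_info_keys`) and the sample record are hypothetical and not part of SniffCell. `RNAMES` can be dropped from the required set when read names come from `--kanpig_read_names`.

```python
# Hypothetical pre-flight check; not part of the sniffcell package.
REQUIRED_INFO_KEYS = ("RNAMES", "STDEV_POS", "STDEV_LEN", "SVLEN")


def parse_info(info_field: str) -> dict[str, str]:
    """Parse a VCF INFO column into a dict; flag-style keys map to ''."""
    parsed = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed


def missing_info_keys(vcf_line: str, required=REQUIRED_INFO_KEYS) -> list[str]:
    """Return the required INFO keys that are absent from one VCF data line."""
    columns = vcf_line.rstrip("\n").split("\t")
    info = parse_info(columns[7])  # INFO is the 8th VCF column
    return [key for key in required if key not in info]


# Made-up record for illustration only.
line = (
    "chr1\t1000\tsniffles.SV1\tN\t<DEL>\t60\tPASS\t"
    "SVTYPE=DEL;SVLEN=-250;STDEV_POS=1.5;STDEV_LEN=2.0;RNAMES=read1,read2"
)
print(missing_info_keys(line))
```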

## `find`: call hierarchical ctDMRs from atlas matrices
Finds cell-type-specific DMR regions from an explicit hierarchy schema in `atlas/index_to_major_celltypes.json`, then writes one annotation-ready BED/TSV.

Hierarchy schema:
- Add a top-level `__hierarchy__` object.
- Define each hierarchy key with `source_key` and optional `children`.
- Each child can point to another `source_key` and optional `groups`.

Example schema:
```json
{
  "__hierarchy__": {
    "pbmc-lymphocytes": {
      "source_key": "pbmc-lymphocytes",
      "children": {
        "lymphocytes": {
          "source_key": "pbmc",
          "groups": ["T-cell", "NK-cell", "B-cell"]
        }
      }
    }
  }
}
```
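
A quick structural check of the schema can catch typos before running `find`. This is only a sketch based on the rules listed above (a top-level `__hierarchy__` object, `source_key` on each hierarchy key, optional `children`, optional `groups` per child); `check_hierarchy` is a hypothetical helper, and SniffCell's own validation may differ.

```python
import json

# Same schema as the example above, wrapped as a complete JSON document.
SCHEMA_TEXT = """
{
  "__hierarchy__": {
    "pbmc-lymphocytes": {
      "source_key": "pbmc-lymphocytes",
      "children": {
        "lymphocytes": {
          "source_key": "pbmc",
          "groups": ["T-cell", "NK-cell", "B-cell"]
        }
      }
    }
  }
}
"""


def check_hierarchy(config: dict) -> list[str]:
    """Return human-readable problems found in the __hierarchy__ object."""
    hierarchy = config.get("__hierarchy__")
    if not isinstance(hierarchy, dict):
        return ["missing top-level __hierarchy__ object"]
    problems = []
    for key, node in hierarchy.items():
        if "source_key" not in node:
            problems.append(f"{key}: missing source_key")
        for child_name, child in node.get("children", {}).items():
            groups = child.get("groups", [])
            # groups is optional, but when present it should list group names.
            if not isinstance(groups, list) or not all(isinstance(g, str) for g in groups):
                problems.append(f"{key}/{child_name}: groups must be a list of strings")
    return problems


print(check_hierarchy(json.loads(SCHEMA_TEXT)))
```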

Example command:
```bash
sniffcell find \
  -n atlas/all_celltypes_blocks.npy \
  -i atlas/all_celltypes_blocks.index.gz \
  -cf atlas/index_to_major_celltypes.json \
  -m atlas/all_celltypes.txt \
  -ck pbmc-lymphocytes \
  -o pbmc_hierarchy.tsv \
  --diff_threshold 0.40 \
  --min_rows 2 \
  --min_cpgs 3 \
  --max_gap_bp 500
```

Outputs:
- `<output>`: annotation-ready hierarchical BED/TSV for `sniffcell anno`.
- `<output>.igv.bed`: companion IGV BED9 (headerless, IGV-ready).

Key columns in `<output>` include:
- `best_group`, `best_dir`
- `code_order` (global leaf schema)
- `best_group_leaves`, `other_group_leaves`
- `hierarchy_level`, `hierarchy_path`, `hierarchy_source_key`
- per-node means (`mean_<group>`)
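
To confirm that a `find` output has the columns `anno` needs (`chr`, `start`, `end`, `best_group`, `best_dir`), a small header check is enough. The sketch below assumes the file has a tab-delimited header row, optionally prefixed with `#`; the helper name and the sample row (including the `hypo` value) are made up for illustration.

```python
import csv
import io

REQUIRED_COLUMNS = ("chr", "start", "end", "best_group", "best_dir")


def missing_anno_columns(handle) -> list[str]:
    """Return required columns absent from the header row of a tab-delimited file."""
    header = next(csv.reader(handle, delimiter="\t"), [])
    # Assumption: tolerate a leading '#' on header fields.
    names = [name.lstrip("#") for name in header]
    return [col for col in REQUIRED_COLUMNS if col not in names]


# Made-up two-line file standing in for pbmc_hierarchy.tsv.
sample = io.StringIO(
    "chr\tstart\tend\tbest_group\tbest_dir\tcode_order\n"
    "chr1\t100\t200\tT-cell\thypo\tT-cell,B-cell\n"
)
print(missing_anno_columns(sample))
```

For a real file, pass an open handle instead, for example `open("pbmc_hierarchy.tsv", newline="")`.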

## `anno`: annotate SVs with one hierarchical BED file
`anno` processes DMR regions near SVs, classifies reads per region, then summarizes per-SV assignment.

Basic example:
```bash
sniffcell anno \
  -i sample.bam \
  -v sample.vcf.gz \
  -r ref.fa \
  -b pbmc_hierarchy.tsv \
  -o anno_out \
  -w 10000 \
  -t 8
```

`anno` outputs:
- `reads_classification.tsv`: per-read region-level assignments.
- `blocks_classification.tsv`: per-region methylation summaries.
- `sv_assignment.tsv`: SV-level assignment summary (produced by running `svanno` internally at the end of `anno`).
- `sv_assignment_readable.tsv`: readable SV summary focused on classified cell types per SV.
- `sv_assignment_readable_long.tsv`: long-format `SV x celltype` table with counts/fractions.
- `anno_run_manifest.json`: run log/manifest with input paths and outputs (used by `sniffcell viz --anno_output`).

SV assignment options (available in both `anno` and `svanno`):
- `--evidence_mode {all_rows,per_read}`: how ctDMR evidence is aggregated for each SV.
- `--min_overlap_pct`: minimum overlap fraction required to keep `assigned_code`.
- `--min_agreement_pct`: minimum majority agreement required to keep `assigned_code`.

Defaults are strict:
- `--evidence_mode all_rows` (uses every supporting-read x ctDMR row; no per-read vote collapse)
- `--min_agreement_pct 1.0` (any conflicting code makes `assigned_code` empty / unreliable)
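
The effect of `--min_agreement_pct` can be pictured with a simple majority vote over the observed codes. This is an illustrative sketch, not SniffCell's implementation: it assumes agreement is the fraction of evidence rows carrying the most common code, and the function names are hypothetical.

```python
from collections import Counter


def majority_agreement(codes: list[str]) -> tuple[str, float]:
    """Return the most common code and the fraction of rows that carry it."""
    code, count = Counter(codes).most_common(1)[0]
    return code, count / len(codes)


def keep_assigned_code(codes: list[str], min_agreement_pct: float = 1.0) -> str:
    """Return the majority code, or '' when agreement falls below the threshold."""
    if not codes:
        return ""
    code, fraction = majority_agreement(codes)
    return code if fraction >= min_agreement_pct else ""


# With the strict default, one dissenting row empties the assignment.
print(keep_assigned_code(["1110", "1110"]))
print(keep_assigned_code(["1110", "1110", "0001"]))
print(keep_assigned_code(["1110", "1110", "0001"], min_agreement_pct=0.6))
```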

Conflict rule:
- `assigned_code` is forced empty when evidence has a hard conflict (`has_hard_conflict=True`), i.e. code constraints intersect to an empty set (for example `1110` with `0001` in the same schema).
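
The hard-conflict rule can be sketched as a bitwise AND over codes that share one schema: if no bit survives, the constraints are incompatible. The helper names below are hypothetical, and the sketch assumes codes are equal-length strings of `0` and `1`, as in the `1110` / `0001` example.

```python
def intersect_codes(codes: list[str]) -> str:
    """Bitwise-AND equal-length binary code strings and return the result."""
    if not codes:
        raise ValueError("need at least one code")
    width = len(codes[0])
    if any(len(code) != width for code in codes):
        raise ValueError("all codes must share the same schema width")
    mask = (1 << width) - 1  # start with every leaf allowed
    for code in codes:
        mask &= int(code, 2)
    return format(mask, f"0{width}b")


def has_hard_conflict(codes: list[str]) -> bool:
    """True when the intersection of all codes leaves no leaf set."""
    return int(intersect_codes(codes), 2) == 0


print(intersect_codes(["1110", "0001"]), has_hard_conflict(["1110", "0001"]))
print(intersect_codes(["1110", "0110"]), has_hard_conflict(["1110", "0110"]))
```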

### How hierarchical codes are handled
1. One BED/TSV from `find` is loaded.
2. Regions are filtered by SV proximity with `--window`.
3. Every kept region is processed independently to generate per-read codes.
4. `code_order` defines the shared leaf-level bit schema.
5. `best_group_leaves` defines which bits are set for the target cluster in each DMR.
6. During SV assignment, reads are linked to SVs by chromosome-aware interval matching (`--window`), then evidence is aggregated by `--evidence_mode` (`all_rows` by default; `per_read` is optional).
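
Steps 4 and 5 can be illustrated by turning `code_order` and `best_group_leaves` into a bit string. The sketch assumes both columns hold comma-separated leaf names, which is a guess about the file format; `leaves_to_code` is a hypothetical helper, and any role `best_dir` plays in the real encoding is not modeled here.

```python
def leaves_to_code(code_order: str, best_group_leaves: str, sep: str = ",") -> str:
    """Set one bit per leaf in code_order that appears in best_group_leaves."""
    order = [leaf.strip() for leaf in code_order.split(sep) if leaf.strip()]
    target = {leaf.strip() for leaf in best_group_leaves.split(sep) if leaf.strip()}
    unknown = target - set(order)
    if unknown:
        raise ValueError(f"leaves not in code_order: {sorted(unknown)}")
    return "".join("1" if leaf in target else "0" for leaf in order)


# Made-up leaf schema: three lymphocyte leaves plus one other leaf.
print(leaves_to_code("T-cell,NK-cell,B-cell,Monocyte", "T-cell,NK-cell,B-cell"))
print(leaves_to_code("T-cell,NK-cell,B-cell,Monocyte", "Monocyte"))
```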

## `svanno`: recompute SV-level assignment from precomputed read classifications
Use when you already have `reads_classification.tsv` and want to regenerate SV summaries.

Example:
```bash
sniffcell svanno \
  -v sample.vcf.gz \
  -i anno_out/reads_classification.tsv \
  -w 10000 \
  --evidence_mode all_rows \
  --min_agreement_pct 1.0 \
  -o anno_out
```

Output:
- `sv_assignment.tsv`
- `sv_assignment_readable.tsv`
- `sv_assignment_readable_long.tsv`

Readable summary columns include:
- `id`, `sv_chr`, `sv_pos`, `sv_len`, `vaf`
- `n_supporting`, `n_overlapped`, `overlap_pct`, `majority_pct`
- `classified_celltypes`, `classified_celltype_count`
- `classified_celltype_counts`, `classified_celltype_fractions`, `classification_summary`
- `is_multi_celltype_link`

`sv_assignment.tsv` also includes:
- `has_hard_conflict`: whether constraints are mutually incompatible.
- `intersection_code`: bitwise intersection of observed constraints in the dominant schema.

Long-format columns include:
- `id`, `sv_chr`, `sv_pos`, `sv_len`
- `celltype`, `rank`, `supporting_read_count`, `supporting_read_fraction`
- `n_supporting`, `n_overlapped`, `overlap_pct`
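
The long-format table is convenient for downstream scripting. As a sketch, the snippet below picks the cell type with the most supporting reads for each SV, using only the `id`, `celltype`, and `supporting_read_count` columns named above; the rows are made up, and `top_celltype_per_sv` is a hypothetical helper.

```python
import csv
import io

# Made-up rows standing in for sv_assignment_readable_long.tsv.
LONG_TSV = (
    "id\tcelltype\tsupporting_read_count\n"
    "sniffles.SV1\tT-cell\t6\n"
    "sniffles.SV1\tB-cell\t2\n"
    "sniffles.SV2\tB-cell\t4\n"
)


def top_celltype_per_sv(handle) -> dict[str, str]:
    """Map each SV id to the cell type with the highest supporting read count."""
    best: dict[str, tuple[str, int]] = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        count = int(row["supporting_read_count"])
        # Keep the first cell type seen on ties.
        if row["id"] not in best or count > best[row["id"]][1]:
            best[row["id"]] = (row["celltype"], count)
    return {sv_id: celltype for sv_id, (celltype, _) in best.items()}


print(top_celltype_per_sv(io.StringIO(LONG_TSV)))
```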

## `viz`: visualize one SV with reads and ctDMR overlap
Generate a figure (PNG/PDF) centered on one SV ID, showing:
- all reads in `SV +/- window` (supporting reads highlighted),
- SV interval,
- overlapping ctDMRs from a `find` BED/TSV,
- all cell-type methylation values on those ctDMRs from `mean_*` columns (heatmap panel).

Simple example (from an `anno` output folder):
```bash
sniffcell viz \
  --anno_output anno_out \
  -s sniffles.SV123 \
  -o anno_out/sniffles.SV123
```

Outputs:
- Default output: `anno_out/sniffles.SV123.png` (or `.pdf`)
- Add `--export_tables` if you also want TSV outputs (`.summary.tsv`, `.supporting_reads_assignment.tsv`, `.supporting_reads_ctdmr_methylation.tsv`)

## `dmsv`: test differential methylation around SVs
Computes per-CpG statistics between supporting and non-supporting reads near each SV.

Example:
```bash
sniffcell dmsv \
  -i sample.bam \
  -v sample.vcf.gz \
  -r ref.fa \
  -o dmsv_out \
  -m 3 \
  -f 1000 \
  -c 5 \
  -t 8
```

Outputs:
- `dmsv_out/significant_SVs.tsv`: per-SV summary including significance counts and effect summaries.
- `dmsv_out/sv_details/<sv_id>.tsv.gz`: per-CpG stats table for each SV.

Current implementation note:
- `dmsv` parses `--test_type`, but the current backend path uses consistency-aware Mann-Whitney U (MWU) screening in `statistical_test_around_sv.py`.
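
For intuition, the plain Mann-Whitney U statistic at a single CpG compares methylation values from supporting reads against non-supporting reads. The sketch below computes only the textbook U statistic in pure Python; it does not reproduce the consistency-aware screening in `statistical_test_around_sv.py`, and the input values are made up.

```python
def mann_whitney_u(x: list[float], y: list[float]) -> float:
    """U statistic for x versus y: pairs with x > y count 1, ties count 0.5."""
    if not x or not y:
        raise ValueError("both groups need at least one value")
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def common_language_effect(x: list[float], y: list[float]) -> float:
    """U scaled to [0, 1]: the chance a value from x exceeds one from y."""
    return mann_whitney_u(x, y) / (len(x) * len(y))


# Made-up per-read methylation probabilities at one CpG.
supporting = [0.9, 0.8, 0.95]
non_supporting = [0.1, 0.2, 0.15]
print(mann_whitney_u(supporting, non_supporting))
print(common_language_effect(supporting, non_supporting))
```

For p-values, `scipy.stats.mannwhitneyu` (SciPy is already a dependency) is the usual choice.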

## `deconv`
The `deconv` CLI arguments exist, but the implementation is currently a placeholder (`deconv_main` only prints its arguments).

## Practical example
```bash
sniffcell anno \
  -i data/sample.bam \
  -v data/sample.vcf.gz \
  -b dmrs/pbmc_hierarchy.tsv \
  -o results/anno.w10000 \
  -r refs/GRCh38.fa \
  -w 10000 \
  -t 8
```
