Metadata-Version: 2.4
Name: py-gbcms
Version: 2.0.0
Summary: Python implementation of GetBaseCountsMultiSample (gbcms) for calculating base counts in BAM files
Project-URL: Homepage, https://github.com/msk-access/getbasecounts
Project-URL: Repository, https://github.com/msk-access/getbasecounts
Project-URL: Documentation, https://github.com/msk-access/getbasecounts#readme
Project-URL: Bug Tracker, https://github.com/msk-access/getbasecounts/issues
Author-email: MSK-ACCESS <shahr2@mskcc.org>
License: AGPL-3.0
License-File: LICENSE
Keywords: bam,base-counts,bioinformatics,gbcms,genomics,maf,vcf
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU Affero General Public License v3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.11
Requires-Dist: joblib>=1.3.0
Requires-Dist: numba>=0.58.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pysam>=0.22.0
Requires-Dist: rich>=13.0.0
Requires-Dist: scipy>=1.11.0
Requires-Dist: typer>=0.9.0
Provides-Extra: all
Requires-Dist: cyvcf2>=0.30.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: mypy>=1.5.0; extra == 'dev'
Requires-Dist: pre-commit>=3.3.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest-mock>=3.11.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Requires-Dist: types-pyyaml>=6.0.0; extra == 'dev'
Provides-Extra: fast
Requires-Dist: cyvcf2>=0.30.0; extra == 'fast'
Description-Content-Type: text/markdown

# py-gbcms - Python Implementation of gbcms

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/License-AGPL%203.0-blue.svg)](https://opensource.org/licenses/AGPL-3.0)

A high-performance Python reimplementation of [GetBaseCountsMultiSample](https://github.com/msk-access/GetBaseCountsMultiSample) for calculating base counts in multiple BAM files at variant positions specified in VCF or MAF files.

**Package**: `py-gbcms` | **CLI**: `gbcms`

## ✨ Features

- 🚀 **High Performance**: Multi-threaded processing with efficient BAM file handling
  - **Smart hybrid counting** strategy: numba for simple SNPs (50-100x faster), counter.py for complex variants
  - **Numba JIT compilation** for optimized counting operations
  - **joblib** for efficient local parallelization
  - **Ray** support for distributed computing across clusters
- 🔬 **Strand Bias Analysis**: Statistical strand bias detection using Fisher's exact test
- 🎨 **Beautiful CLI**: Rich terminal output with progress bars and colored logging
- 🔒 **Type Safety**: Pydantic models for runtime validation and type checking
- 🐳 **Docker Support**: Containerized deployment for reproducibility
- 📊 **Multiple Formats**: Support for both VCF and MAF input/output formats with strand bias
- 🧪 **Well Tested**: Comprehensive unit tests with high coverage
- 🔧 **Modern Python**: Built with type hints, Pydantic models, and modern Python practices

## 🚀 Quick Start

### Installation

# Install with all features
uv pip install "py-gbcms[all]"

# Or with pip
pip install "py-gbcms[all]"

# For development (includes scipy for strand bias and type checking)
uv pip install -e ".[dev]"

**Core Dependencies:**
- `pysam>=0.22.0` - BAM file processing
- `numpy>=1.24.0` - Numerical computations
- `scipy>=1.11.0` - Statistical analysis (Fisher's exact test for strand bias)
- `pandas>=2.0.0` - Data manipulation
- `pydantic>=2.0.0` - Runtime validation
- `numba>=0.58.0` - JIT compilation for performance
- `joblib>=1.3.0` - Parallel processing

**Optional Dependencies:**
- `cyvcf2>=0.30.0` - Fast VCF parsing (`py-gbcms[fast]`)
- `ray>=2.7.0` - Distributed computing (`py-gbcms[ray]`)

**Development Dependencies:**
- `scipy-stubs>=1.11.0` - Type stubs for scipy (for mypy type checking)

**Requirements:** Python 3.11 or later

### Basic Usage

```bash
# Run
gbcms count run \
    --fasta reference.fa \
    --bam sample1:sample1.bam \
    --vcf variants.vcf \
    --output counts.txt \
    --thread 8

# Run with MAF file
gbcms count run \
    --fasta reference.fa \
    --bam-fof bam_files.txt \
    --maf variants.maf \
    --output counts.txt

### Docker Usage

```bash
docker pull ghcr.io/msk-access/getbasecounts:latest

# Run the container
docker run --rm \
    -v $(pwd)/data:/data \
    ghcr.io/msk-access/getbasecounts:latest \
    gbcms count run \
    --omaf \
    --fasta /data/reference.fa \
    --bam sample1:/data/sample1.bam \
    --vcf /data/variants.vcf \
    --output /data/counts.maf
```

### BAM File of Files Format

Create a tab-separated file (`bam_files.txt`):
```
sample1	/path/to/sample1.bam
sample2	/path/to/sample2.bam
sample3	/path/to/sample3.bam
```

Then use:

```bash
gbcms count run \
    --fasta reference.fa \
    --bam-fof bam_files.txt \
    --vcf variants.vcf \

## Command Line Options

### Commands

gbcms uses subcommands for different operations:

- `gbcms count run`: Run base counting on variants (main command)
- `gbcms validate files`: Validate input files before processing
- `gbcms version`: Show version information
- `gbcms info`: Show tool capabilities and information

### Count Run Options

#### Required Arguments

- `--fasta`, `-f`: Reference genome FASTA file (must be indexed with .fai)
- `--output`, `-o`: Output file path

#### BAM Input (at least one required)

- `--bam`, `-b`: BAM file in format `SAMPLE_NAME:BAM_FILE` (can be specified multiple times)
- `--bam-fof`: File containing sample names and BAM paths (tab-separated)

#### Variant Input (one required)

- `--maf`: Input variant file in MAF format (can be specified multiple times)
- `--vcf`: Input variant file in VCF format (can be specified multiple times)

#### Output Options

- `--omaf`: Output in MAF format (only with MAF input)
- `--positive-count/--no-positive-count`: Output positive strand counts (default: True)
- `--negative-count/--no-negative-count`: Output negative strand counts (default: False)
- `--fragment-count/--no-fragment-count`: Output fragment counts (default: False)
- `--fragment-fractional-weight`: Use fractional weights for fragments (default: False)

#### Quality Filters
- `--maq`: Mapping quality threshold (default: 20)
- `--baq`: Base quality threshold (default: 0)
- `--filter-duplicate`: Filter duplicate reads (default: True)
- `--filter-improper-pair`: Filter improperly paired reads (default: False)
- `--filter-qc-failed`: Filter QC failed reads (default: False)
- `--filter-indel/--no-filter-indel`: Filter reads with indels (default: False)
- `--filter-non-primary/--no-filter-non-primary`: Filter non-primary alignments (default: False)

#### Performance Options
- `--thread`, `-t`: Number of threads (default: 1)
- `--max-block-size`: Maximum variants per block (default: 10000)
- `--max-block-dist`: Maximum block distance in bp (default: 100000)

#### Advanced Options
- `--generic-counting`: Use generic counting algorithm for complex variants
- `--suppress-warning`: Maximum warnings per type (default: 3)
- `--verbose`, `-v`: Enable verbose logging

### Validate Files Options

- `--fasta`, `-f`: Reference FASTA file to validate
- `--bam`, `-b`: BAM files to validate (SAMPLE:PATH format, multiple allowed)
- `--vcf`: VCF files to validate (multiple allowed)
- `--maf`: MAF files to validate (multiple allowed)

## Output Format

### VCF Format (Proper VCF with INFO fields)

**Extension**: `.vcf`

**Structure**: Standard VCF format with count and strand bias information in FORMAT and INFO fields

**Example**:
```vcf
##fileformat=VCFv4.2
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth across all samples">
##INFO=<ID=SB,Number=3,Type=Float,Description="Strand bias p-value, odds ratio, direction">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Total depth for this sample">
##FORMAT=<ID=SB,Number=3,Type=Float,Description="Strand bias for this sample">
#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO                    FORMAT      SAMPLE1             SAMPLE2
chr1    100  .   A    T    .     .       DP=95;SB=0.001234,2.5,reverse  DP:RD:AD:SB  50:30:20:0.001234:2.5:reverse  45:25:20:0.95:1.2:none
```

**INFO Fields**:
- `DP`: Total depth across all samples
- `SB`: Strand bias (p-value,odds_ratio,direction) - most significant across samples
- `FSB`: Fragment strand bias (when fragment counting enabled)

**FORMAT Fields**:
- `DP`: Total depth for this sample
- `RD`: Reference allele depth for this sample
- `AD`: Alternate allele depth for this sample
- `DPP`: Positive strand depth (if enabled)
- `RDP`: Positive strand reference depth (if enabled)
- `ADP`: Positive strand alternate depth (if enabled)
- `DPF`: Fragment depth (if enabled)
- `RDF`: Fragment reference depth (if enabled)
- `ADF`: Fragment alternate depth (if enabled)
- `SB`: Strand bias (p-value,odds_ratio,direction) for this sample
- `FSB`: Fragment strand bias for this sample (if enabled)

### MAF Format

When using `--omaf`, the output maintains the MAF format with updated count columns for tumor and normal samples, including strand bias information.

## Development

### Setup Development Environment

```bash
# Clone the repository
git clone https://github.com/msk-access/py-gbcms.git
cd py-gbcms

# Install with development dependencies (includes scipy-stubs for type checking)
uv pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install
```

### Running Tests

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=gbcms --cov-report=html

# Run specific test file
pytest tests/test_counter.py -v
```

### Code Quality

```bash
# Format code
black src/ tests/

# Lint code
ruff check src/ tests/

# Type checking (requires scipy-stubs)
mypy src/
```

### Building Docker Image

```bash
# Build the image
docker build -t gbcms:latest .

# Run tests in container
docker run --rm gbcms:latest pytest
```

## Performance Comparison

Compared to the original C++ implementation:

| Feature | C++ Version | Python (Smart Hybrid) | Python (Pure counter.py) |
|---------|-------------|----------------------|-------------------------|
| **Simple SNPs** | ~1x | **~50-100x faster** | ~0.8-1.2x |
| **Complex Variants** | ~1x | **~1x** | ~0.8-1.2x |
| **Overall** | Baseline | **~10-50x faster** | ~0.8-1.2x |
| Memory | Baseline | ~1.2x | ~1.2x |
| Multi-threading | OpenMP | joblib/Ray | joblib/Ray |
| Dependencies | bamtools, zlib | pysam, numpy, numba, joblib, ray |
| Scalability | Single machine | Single machine | Multi-node clusters |

**Smart Hybrid Strategy:**
- **Simple SNPs**: Uses `numba_counter` (50-100x faster than C++)
- **Complex variants**: Uses `counter.py` (C++ equivalent accuracy)
- **Automatic selection**: Optimal algorithm chosen per variant type
*Performance varies based on workload and Python version. Python 3.11+ shows significant improvements.

**With Numba JIT compilation and smart algorithm selection. See [Fast VCF Parsing](docs/CYVCF2_SUPPORT.md) for benchmarks.

## Architecture

The package is organized into distinct modules with clear responsibilities:

```
py-gbcms/
├── src/gbcms/
│   ├── __init__.py
│   ├── cli.py              # 🎨 Typer CLI interface with Rich
│   ├── config.py           # ⚙️  Configuration dataclasses (legacy)
│   ├── models.py           # 🔒 Pydantic models for type safety ⭐
│   ├── variant.py          # 📄 Variant loading (VCF/MAF)
│   ├── counter.py          # 🐢 Pure Python counting (baseline)
│   ├── numba_counter.py    # ⚡ Numba-optimized counting (50-100x faster) ⭐
│   ├── parallel.py         # 🔄 joblib/Ray parallelization ⭐
│   ├── reference.py        # 🧬 Reference sequence handling
│   ├── output.py           # 📤 Output formatting with strand bias ⭐
│   └── processor.py        # 🎯 Main processing pipeline
├── tests/                  # 🧪 Comprehensive test suite
├── scripts/                # 🛠️  Setup and verification scripts
├── Dockerfile              # 🐳 Production container
├── pyproject.toml          # 📦 Package configuration
└── docs/
    ├── README.md           # Main documentation
    ├── ARCHITECTURE.md     # Module relationships ⭐
    ├── INSTALLATION.md     # Setup guide
    ├── QUICKSTART.md       # 5-minute start
    ├── ADVANCED_FEATURES.md # Pydantic, Numba, Ray ⭐
    └── CLI_FEATURES.md     # CLI documentation
```

### Module Relationships

```
CLI (cli.py)
    ↓
Configuration (models.py)
    ↓
Processor (processor.py)
    ├─→ Variant Loader (variant.py)
    ├─→ Reference (reference.py)
    ├─→ Smart Counting Engine ⭐
    │   ├─→ counter.py (Pure Python - accurate, slower)
    │   └─→ numba_counter.py (JIT compiled - 50-100x faster for SNPs) ⭐
    ├─→ Strand Bias Analysis (counter.py + output.py) ⭐
    ├─→ Parallelization (parallel.py)
    └─→ Output (output.py)
```

**Key Algorithm Selection:**
- **`Smart Hybrid Strategy`**: Automatically chooses optimal algorithm per variant
- **`counter.py`**: Pure Python, easy to debug, baseline performance
- **`numba_counter.py`**: JIT-compiled, 50-100x faster for simple SNPs, for production

**New Features:**
- **Strand Bias Analysis**: Statistical detection using Fisher's exact test ⭐
- **Enhanced Output**: VCF/MAF formats with strand bias columns ⭐

See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for detailed module relationships.

## Documentation

📚 **[Complete Documentation](docs/README.md)** | [Quick Start](docs/QUICKSTART.md) | [Contributing Guide](CONTRIBUTING.md) | [Package Structure](docs/PACKAGE_STRUCTURE.md) | [Testing Guide](docs/TESTING_GUIDE.md)

- **User Guide**
  - [Input & Output Formats](docs/INPUT_OUTPUT.md)

- **Advanced**
  - [Advanced Features](docs/ADVANCED_FEATURES.md)
  - [Parallelization Guide](docs/PARALLELIZATION_GUIDE.md)
  - [Fast VCF Parsing (cyvcf2)](docs/CYVCF2_SUPPORT.md)

- **Reference**
  - [Architecture](docs/ARCHITECTURE.md)
  - [C++ Comparison](docs/CPP_FEATURE_COMPARISON.md)
  - [FAQ](docs/FAQ.md)

- **Docker & Deployment**
  - [Docker Guide](docs/DOCKER_GUIDE.md)

## Advanced Features

GetBaseCounts includes several advanced features for performance and scalability:

### 🔒 Type Safety with Pydantic

```python
from gbcms.models import GetBaseCountsConfig

# Runtime validation of all inputs
config = GetBaseCountsConfig(
    fasta_file=Path("reference.fa"),
    bam_files=[...],
    variant_files=[...],
)
```

### 🔬 Strand Bias Analysis

```python
# Strand bias is automatically calculated for all variants
# Uses Fisher's exact test for statistical rigor
from gbcms.counter import BaseCounter

counter = BaseCounter(config)
# Strand bias calculated during counting and included in output
```

**Features:**
- **Fisher's exact test** for statistically sound strand bias detection
- **Automatic direction detection** (forward, reverse, or none)
- **Minimum depth filtering** (10 reads) for reliable calculations
- **VCF and MAF output** with strand bias columns

### ⚡ Performance with Smart Hybrid Strategy

```python
# Automatic algorithm selection based on variant complexity
from gbcms.counter import BaseCounter

counter = BaseCounter(config)
# Automatically uses:
# - numba_counter for simple SNPs (50-100x faster)
# - counter.py for complex variants (maximum accuracy)
```

### 🔄 Parallelization with joblib

```bash
# Use joblib for efficient local parallelization
gbcms count run --thread 16 --backend joblib ...
```

### 🌐 Distributed Computing with Ray

```bash
# Install with Ray support
uv pip install "gbcms[ray]"

# Use Ray for distributed processing
gbcms count run --thread 32 --backend ray --use-ray ...
```

See [ADVANCED_FEATURES.md](ADVANCED_FEATURES.md) for detailed documentation and benchmarks.

## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Make your changes with tests
4. Run the test suite and linters
5. Submit a pull request

## Citation

If you use this tool in your research, please cite:

```
GetBaseCountsMultiSample: A tool for calculating base counts in multiple BAM files
MSK-ACCESS Team
https://github.com/msk-access/GetBaseCountsMultiSample
```

## License

AGPL-3.0 License - See [LICENSE](LICENSE) for details.

## Support

- 🐛 Report bugs: [GitHub Issues](https://github.com/msk-access/py-gbcms/issues)
- 💬 Ask questions: [GitHub Discussions](https://github.com/msk-access/py-gbcms/discussions)
- 📧 Email: shahr2@mskcc.org

## Acknowledgments

This is a Python reimplementation of the original C++ tool developed by the MSK-ACCESS team. Special thanks to the original authors and contributors.
