Metadata-Version: 2.4
Name: pathway-subtyping
Version: 0.2.0
Summary: A disease-agnostic framework for identifying molecular subtypes through pathway-based analysis of rare genetic variants
Author-email: Rohit Chauhan <info@topmist.com>
Maintainer-email: Rohit Chauhan <info@topmist.com>
License: MIT
Project-URL: Homepage, https://github.com/topmist-admin/pathway-subtyping-framework
Project-URL: Documentation, https://github.com/topmist-admin/pathway-subtyping-framework#readme
Project-URL: Repository, https://github.com/topmist-admin/pathway-subtyping-framework.git
Project-URL: Issues, https://github.com/topmist-admin/pathway-subtyping-framework/issues
Project-URL: Changelog, https://github.com/topmist-admin/pathway-subtyping-framework/releases
Keywords: genomics,pathway-analysis,molecular-subtypes,clustering,rare-variants,bioinformatics,autism,neurogenetics
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Healthcare Industry
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy<2.0.0,>=1.24.0
Requires-Dist: pandas<3.0.0,>=2.0.0
Requires-Dist: scikit-learn<2.0.0,>=1.3.0
Requires-Dist: scipy<2.0.0,>=1.11.0
Requires-Dist: pysam<1.0.0,>=0.21.0
Requires-Dist: pyyaml<7.0,>=6.0
Requires-Dist: click<9.0.0,>=8.0.0
Requires-Dist: matplotlib<4.0.0,>=3.7.0
Requires-Dist: seaborn<1.0.0,>=0.12.0
Requires-Dist: jinja2<4.0.0,>=3.1.0
Requires-Dist: tqdm<5.0.0,>=4.65.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: isort>=5.12.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: pre-commit>=3.0.0; extra == "dev"
Provides-Extra: notebooks
Requires-Dist: jupyter>=1.0.0; extra == "notebooks"
Requires-Dist: ipykernel>=6.0.0; extra == "notebooks"
Requires-Dist: nbformat>=5.0.0; extra == "notebooks"
Provides-Extra: docs
Requires-Dist: sphinx>=6.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.2.0; extra == "docs"
Requires-Dist: myst-parser>=1.0.0; extra == "docs"
Provides-Extra: all
Requires-Dist: pathway-subtyping[dev,docs,notebooks]; extra == "all"
Dynamic: license-file

# Pathway Subtyping Framework

**A Disease-Agnostic Tool for Pathway-Based Molecular Subtype Discovery**

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18442427.svg)](https://doi.org/10.5281/zenodo.18442427)
[![PyPI version](https://badge.fury.io/py/pathway-subtyping.svg)](https://pypi.org/project/pathway-subtyping/)
[![CI](https://github.com/topmist-admin/pathway-subtyping-framework/actions/workflows/ci.yml/badge.svg)](https://github.com/topmist-admin/pathway-subtyping-framework/actions/workflows/ci.yml)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

---

## Overview

The **Pathway Subtyping Framework** is an open-source computational tool for identifying molecular subtypes in genetically heterogeneous diseases. Instead of analyzing individual genes, it aggregates rare variant burden at the **biological pathway level**, enabling:

- Better signal detection across genetically diverse cohorts
- Identification of biologically coherent patient subgroups
- Cross-cohort validation of discovered subtypes

Originally developed for [autism research](https://github.com/topmist-admin/autism-pathway-framework), this generalized version can be adapted for any disease with:
- Genetic heterogeneity (many implicated genes)
- Convergent pathway biology
- Available exome/genome sequencing data

## Supported Disease Areas

| Disease | Status | Pathway File |
|---------|--------|--------------|
| Autism Spectrum Disorder | Validated | `autism_pathways.gmt` |
| Schizophrenia | Template | `schizophrenia_pathways.gmt` |
| Epilepsy | Template | `epilepsy_pathways.gmt` |
| Intellectual Disability | Template | `intellectual_disability_pathways.gmt` |
| Parkinson's Disease | Template | `parkinsons_pathways.gmt` |
| Bipolar Disorder | Template | `bipolar_pathways.gmt` |
| *Your disease* | [Adapt it →](docs/guides/adapting-for-your-disease.md) | `your_pathways.gmt` |

## Key Features

| Feature | Description |
|---------|-------------|
| **Pathway Scoring** | Aggregate gene burdens across biological pathways |
| **Multiple Clustering** | GMM, K-means, Hierarchical, Spectral with cross-validation |
| **Ancestry Correction** | PCA-based population stratification correction with independence testing |
| **Batch Correction** | ComBat-style batch effect detection and correction |
| **Sensitivity Analysis** | Parameter robustness testing across algorithms, features, normalization |
| **Validation Gates** | Negative controls + bootstrap stability + ancestry independence testing |
| **Statistical Rigor** | FDR correction, effect sizes, confidence intervals |
| **Power Analysis** | Sample size recommendations, Type I error estimation |
| **Simulation** | Synthetic data generation with ground truth for validation |
| **Reproducibility** | Deterministic execution, pinned dependencies, Docker |
| **Config-Driven** | YAML configuration for all parameters |

## Quick Start

### Installation

```bash
# Clone the repository
git clone https://github.com/topmist-admin/pathway-subtyping-framework
cd pathway-subtyping-framework

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install the package
pip install -e .

# Verify installation
psf --version
```

### Run with Sample Data

```bash
# Run the pipeline with synthetic test data
psf --config configs/test_synthetic.yaml

# View results
cat outputs/synthetic_test/report.md
```

### Run with Your Data

```bash
# Copy and customize a config
cp configs/example_autism.yaml configs/my_analysis.yaml

# Edit paths in my_analysis.yaml, then run
psf --config configs/my_analysis.yaml
```

### Try in Browser (No Installation)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/topmist-admin/pathway-subtyping-framework/blob/main/examples/notebooks/01_getting_started.ipynb)

### Docker

```bash
# Run pipeline
docker-compose run pipeline

# Run tests
docker-compose run test

# Start Jupyter notebook
docker-compose up jupyter
# Open http://localhost:8888
```

## Adapting for Your Disease

1. **Create a pathway GMT file** with disease-relevant gene sets
2. **Copy an example config** and point to your data
3. **Run the pipeline** — validation gates will tell you if subtypes are meaningful

See the full guide: [Adapting for Your Disease](docs/guides/adapting-for-your-disease.md)
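
Step 1 can be sketched as follows. GMT is a tab-delimited format with one gene set per line: set name, description, then gene symbols. The helpers `write_gmt` and `read_gmt` are hypothetical and not part of the package API, and the pathway names and gene lists are placeholders for illustration only.

```python
import os
import tempfile


def write_gmt(pathways, path):
    """Write {name: (description, [genes])} as tab-delimited GMT lines."""
    with open(path, "w", encoding="utf-8") as handle:
        for name, (description, genes) in pathways.items():
            # Drop duplicate symbols while preserving their original order.
            unique_genes = list(dict.fromkeys(genes))
            handle.write("\t".join([name, description, *unique_genes]) + "\n")


def read_gmt(path):
    """Read a GMT file back into {name: [genes]}."""
    gene_sets = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            gene_sets[fields[0]] = fields[2:]
    return gene_sets


# Placeholder gene sets; replace with curated sets for your disease.
example_pathways = {
    "SYNAPTIC_SIGNALING": ("illustrative set", ["SHANK3", "NRXN1", "SYNGAP1", "SHANK3"]),
    "CHROMATIN_REMODELING": ("illustrative set", ["CHD8", "ARID1B", "ADNP"]),
}

gmt_path = os.path.join(tempfile.mkdtemp(), "your_pathways.gmt")
write_gmt(example_pathways, gmt_path)
round_trip = read_gmt(gmt_path)
print(round_trip)
```

Reading the file back before running the pipeline is a cheap way to catch stray spaces or missing tabs in a hand-edited GMT file.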

## How It Works

```
VCF Input → Variant Filter → Gene Burden → Pathway Aggregation → [Ancestry Correction] → [Batch Correction] → GMM Clustering → [Sensitivity Analysis] → Validation → Report
```

### 1. Pathway Scoring
Rare damaging variants are aggregated into pathway-level disruption scores:
- Loss-of-function variants weighted higher
- Missense variants weighted by CADD score
- Scores normalized across samples
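
The aggregation above might look roughly like the sketch below. The specific weights (1.0 for loss-of-function, CADD scaled by 40 and halved for missense) and the z-score normalization are assumptions chosen for illustration; the names `variant_weight`, `gene_burdens`, and `pathway_scores` are hypothetical and do not describe the package's actual implementation.

```python
from statistics import mean, pstdev

LOF_WEIGHT = 1.0        # assumed weight for loss-of-function variants
MISSENSE_MAX = 0.5      # assumed cap, so missense never outweighs LoF
CADD_SCALE = 40.0       # assumed CADD phred value that maps to the cap


def variant_weight(consequence, cadd=None):
    """Return an illustrative damage weight for a single variant."""
    if consequence == "lof":
        return LOF_WEIGHT
    if consequence == "missense" and cadd is not None:
        return min(cadd / CADD_SCALE, 1.0) * MISSENSE_MAX
    return 0.0


def gene_burdens(variants):
    """Sum variant weights per (sample, gene) from (sample, gene, consequence, cadd)."""
    burdens = {}
    for sample, gene, consequence, cadd in variants:
        per_sample = burdens.setdefault(sample, {})
        per_sample[gene] = per_sample.get(gene, 0.0) + variant_weight(consequence, cadd)
    return burdens


def pathway_scores(burdens, pathways, samples):
    """Sum gene burdens per pathway, then z-score each pathway across samples."""
    raw = {
        s: {p: sum(burdens.get(s, {}).get(g, 0.0) for g in genes)
            for p, genes in pathways.items()}
        for s in samples
    }
    normalized = {s: {} for s in samples}
    for p in pathways:
        values = [raw[s][p] for s in samples]
        mu, sd = mean(values), pstdev(values)
        for s in samples:
            # A pathway with no variation across samples carries no signal.
            normalized[s][p] = (raw[s][p] - mu) / sd if sd > 0 else 0.0
    return raw, normalized


# Toy input for illustration only.
variants = [
    ("S1", "SHANK3", "lof", None),
    ("S1", "CHD8", "missense", 20.0),
    ("S2", "CHD8", "lof", None),
    ("S3", "NRXN1", "missense", 30.0),
]
pathways = {"SYNAPTIC": ["SHANK3", "NRXN1"], "CHROMATIN": ["CHD8"]}
samples = ["S1", "S2", "S3"]

raw, normalized = pathway_scores(gene_burdens(variants), pathways, samples)
print(raw)
print(normalized)
```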

### 2. Subtype Discovery
Multiple clustering algorithms identify patient subgroups:
- **GMM** (default): Soft assignments, number of clusters selected automatically via BIC
- **K-means**: Fast, spherical clusters
- **Hierarchical**: Dendrogram-based, no K required
- **Spectral**: Nonlinear boundaries
- Cross-validation for stability assessment
- Algorithm comparison with pairwise ARI
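
To illustrate the pairwise comparison, here is a dependency-free sketch of the adjusted Rand index (ARI) applied to every pair of algorithm outputs. The function and the toy label vectors are illustrative assumptions, not the package's code; scikit-learn's `adjusted_rand_score` computes the same quantity.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two samples")
    # Pairs placed together by both partitions, by A alone, and by B alone.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Toy cluster assignments for six samples (made up for illustration).
assignments = {
    "gmm": [0, 0, 0, 1, 1, 1],
    "kmeans": [1, 1, 1, 0, 0, 0],        # same partition, labels swapped
    "hierarchical": [0, 0, 1, 1, 1, 1],  # one sample assigned differently
}

pairwise = {
    (a, b): adjusted_rand_index(assignments[a], assignments[b])
    for a, b in combinations(assignments, 2)
}
for pair, ari in pairwise.items():
    print(pair, round(ari, 3))
```

Because ARI is invariant to label permutation, two algorithms that find the same groups score 1.0 even when their cluster IDs differ.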

### 3. Validation Gates
Built-in tests guard against spurious or overfit clusters:
- **Label shuffle**: Randomized labels should NOT cluster (ARI < 0.15)
- **Random genes**: Fake pathways should NOT work (ARI < 0.15)
- **Bootstrap**: Clusters should be stable under resampling (ARI > 0.8)
- **Ancestry independence**: Clusters should not correlate with ancestry PCs (when provided)
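
The ARI thresholds listed above can be expressed as a small pass/fail check. This is a hedged sketch: `GateResult` and `evaluate_gates` are hypothetical names, the input ARI values are made up, and the ancestry-independence gate is left out because no numeric threshold is stated for it here.

```python
from dataclasses import dataclass

NEGATIVE_CONTROL_MAX_ARI = 0.15  # label-shuffle and random-gene gates
BOOTSTRAP_MIN_ARI = 0.8          # bootstrap stability gate


@dataclass
class GateResult:
    name: str
    ari: float
    passed: bool


def evaluate_gates(label_shuffle_ari, random_gene_ari, bootstrap_ari):
    """Apply the documented thresholds to precomputed ARI values."""
    return [
        # Negative controls pass when the ARI stays BELOW the ceiling.
        GateResult("label_shuffle", label_shuffle_ari,
                   label_shuffle_ari < NEGATIVE_CONTROL_MAX_ARI),
        GateResult("random_genes", random_gene_ari,
                   random_gene_ari < NEGATIVE_CONTROL_MAX_ARI),
        # Stability passes when the ARI stays ABOVE the floor.
        GateResult("bootstrap", bootstrap_ari,
                   bootstrap_ari > BOOTSTRAP_MIN_ARI),
    ]


# Made-up ARI values for illustration only.
results = evaluate_gates(label_shuffle_ari=0.02, random_gene_ari=0.05, bootstrap_ari=0.91)
all_passed = all(r.passed for r in results)
for r in results:
    print(f"{r.name}: ARI={r.ari:.2f} {'PASS' if r.passed else 'FAIL'}")
```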

### 4. Statistical Rigor
The pipeline reports the following statistics:
- **FDR correction**: Benjamini-Hochberg for multiple testing
- **Effect sizes**: Cohen's d with 95% bootstrap confidence intervals
- **Power analysis**: Sample size recommendations for target effect sizes
- **Type I error**: Estimation via null simulations
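
Two of these quantities are simple enough to sketch without dependencies: Benjamini-Hochberg adjusted p-values and Cohen's d with a pooled standard deviation. The function names and toy inputs are illustrative assumptions rather than the package's API, and the bootstrap confidence interval is omitted for brevity.

```python
from math import sqrt
from statistics import mean, variance


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Toy inputs for illustration only.
p_values = [0.01, 0.04, 0.03, 0.005]
q_values = benjamini_hochberg(p_values)
effect = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
print(q_values, effect)
```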

See [docs/METHODS.md](docs/METHODS.md) for full statistical methodology.

## Data Requirements

| Input | Format | Notes |
|-------|--------|-------|
| Variants | VCF | Annotated with gene symbols, consequences |
| Phenotypes | CSV | Sample IDs + clinical features |
| Pathways | GMT | Gene sets for your disease |

**Your data stays on your infrastructure.** The framework runs locally or in your cloud environment.

## Project Structure

```
pathway-subtyping-framework/
├── src/pathway_subtyping/     # Core Python package
│   ├── pipeline.py            # Main pipeline
│   ├── clustering.py          # Multiple clustering algorithms
│   ├── statistical_rigor.py   # FDR, effect sizes, burden weights
│   ├── simulation.py          # Synthetic data & power analysis
│   ├── validation.py          # Validation gates
│   ├── ancestry.py            # Population stratification correction
│   ├── batch_correction.py    # Batch effect detection & correction
│   ├── sensitivity.py         # Parameter sensitivity analysis
│   └── data_quality.py        # VCF quality checks
├── configs/                   # Example YAML configurations
├── data/
│   ├── pathways/              # Pathway GMT files (6 diseases)
│   └── sample/                # Synthetic test data
├── docs/
│   ├── METHODS.md             # Statistical methods documentation
│   └── guides/                # User guides
├── examples/notebooks/        # Jupyter tutorials
├── tests/                     # Test suite (347 tests)
├── Dockerfile                 # Container support
└── docker-compose.yml         # Easy orchestration
```

## Development

```bash
# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run linting
black src/ tests/
isort src/ tests/
flake8 src/ tests/

# Set up pre-commit hooks
pre-commit install
```

## Related Projects

- **[Autism Pathway Framework](https://github.com/topmist-admin/autism-pathway-framework)** — The original autism-focused implementation with SFARI cohort validation

## Contributing

Contributions welcome! Areas where help is needed:
- Additional disease pathway definitions
- Performance optimization for large cohorts
- Documentation and tutorials

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## Citation

If you use this framework, please cite:

```
Chauhan R. Pathway Subtyping Framework. GitHub. 2026.
https://github.com/topmist-admin/pathway-subtyping-framework
```

For autism-specific work, also cite:
```
Chauhan R. Autism Pathway Framework. Zenodo. 2026.
DOI: 10.5281/zenodo.18403844
```

## License

MIT License — see [LICENSE](LICENSE) for details.

## Contact

**Rohit Chauhan**
- Email: info@topmist.com
- GitHub: [@topmist-admin](https://github.com/topmist-admin)

---

> **RESEARCH USE ONLY** — This framework is for hypothesis generation. Not for clinical diagnosis or treatment decisions.
