Metadata-Version: 2.4
Name: protqc
Version: 0.1.0
Summary: Physics-based verification of AI-designed protein structures
Author: Ömür Koray Güzel
License: MIT
Keywords: protein,design,verification,physics,molecular-dynamics,AI
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyyaml>=6.0
Requires-Dist: numpy>=1.26.0
Requires-Dist: rich>=13.7.0
Requires-Dist: jinja2>=3.1.0
Provides-Extra: chat
Requires-Dist: litellm>=1.40.0; extra == "chat"
Provides-Extra: all
Requires-Dist: litellm>=1.40.0; extra == "all"
Requires-Dist: fair-esm>=2.0.0; extra == "all"
Requires-Dist: openmm>=8.1.0; extra == "all"
Requires-Dist: pdbfixer>=1.9; extra == "all"
Requires-Dist: mdtraj>=1.10.0; extra == "all"
Requires-Dist: MDAnalysis>=2.7.0; extra == "all"
Requires-Dist: freesasa>=2.2.0; extra == "all"
Requires-Dist: biopython>=1.84; extra == "all"
Requires-Dist: pandas>=2.2.0; extra == "all"
Requires-Dist: matplotlib>=3.9.0; extra == "all"
Requires-Dist: seaborn>=0.13.0; extra == "all"
Requires-Dist: scipy>=1.13.0; extra == "all"
Requires-Dist: scikit-learn>=1.5.0; extra == "all"
Requires-Dist: tqdm>=4.66.0; extra == "all"
Requires-Dist: requests>=2.32.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0.0; extra == "dev"
Dynamic: license-file

# ProtQC

**Physics-based verification of AI-designed protein structures**

[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)

*Catches structural hallucinations before wet-lab*

---

## Why ProtQC?

AI protein design tools (AlphaFold, RFdiffusion, ProteinMPNN, BoltzGen) routinely produce structures with high confidence scores (pLDDT > 90) that still fail experimentally. A protein can look perfect by pLDDT yet harbor internal voids, unstable hydrogen bond networks, or thermodynamic instabilities that only surface in solution.

ProtQC combines six metrics (pLDDT plus five physics-based checks) into a composite risk score, catching high-pLDDT hallucinations that no single metric detects on its own.

## Quick Start

```bash
protqc analyze protein.pdb
```

## The 6 Metrics

| # | Metric | Source | What It Catches |
|---|--------|--------|-----------------|
| 1 | **pLDDT** | Structure prediction | Low-confidence regions |
| 2 | **MD RMSD** | OpenMM | Backbone instability under simulation |
| 3 | **Cavity Volume** | fpocket | Internal voids and packing defects |
| 4 | **H-bond Persistence** | MDTraj | Weak hydrogen bond networks |
| 5 | **SS Preservation** | MDTraj DSSP | Secondary structure loss during MD |
| 6 | **SASA Polar Ratio** | FreeSASA | Abnormal surface accessibility |

Each metric produces a normalized 0–1 sub-score. The composite risk score is a weighted sum, mapped to a verdict:

- **PASS** (risk < 0.30) — Design is physically plausible
- **WARNING** (0.30 ≤ risk < 0.50) — Proceed with caution; review flagged metrics
- **FAIL** (risk ≥ 0.50) — Design has significant structural issues

### Risk Scoring Weights

```yaml
risk_weights:
  plddt: 0.12
  md_rmsd: 0.29
  cavity: 0.12
  hbond_persistence: 0.24
  ss_preservation: 0.18
  sasa_ratio: 0.05
```
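
The weighted sum and verdict thresholds above can be sketched in a few lines of Python. This is a minimal illustration, not ProtQC's internal code: the function names `composite_risk` and `verdict` are hypothetical, and it assumes each sub-score is already normalized so that higher values mean higher risk.

```python
# Weights copied from the risk_weights block above.
RISK_WEIGHTS = {
    "plddt": 0.12,
    "md_rmsd": 0.29,
    "cavity": 0.12,
    "hbond_persistence": 0.24,
    "ss_preservation": 0.18,
    "sasa_ratio": 0.05,
}


def composite_risk(sub_scores: dict[str, float]) -> float:
    """Weighted sum of sub-scores (assumed normalized to 0-1, higher = riskier)."""
    missing = RISK_WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise KeyError(f"missing sub-scores: {sorted(missing)}")
    return sum(weight * sub_scores[name] for name, weight in RISK_WEIGHTS.items())


def verdict(risk: float) -> str:
    """Map a composite risk score to the documented verdict bands."""
    if risk < 0.30:
        return "PASS"
    if risk < 0.50:
        return "WARNING"
    return "FAIL"


# Hypothetical sub-scores, for illustration only.
example = {name: 0.2 for name in RISK_WEIGHTS}
print(verdict(composite_risk(example)))
```

Because the weights sum to 1.0, the composite score stays in the same 0–1 range as the sub-scores.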

## Validated Results

| Protein | Verdict | Risk Score |
|---------|---------|------------|
| Ubiquitin (1UBQ) | PASS | 0.257 |
| GFP (1EMA) | PASS | 0.281 |
| Alpha-synuclein (1XQ8) | FAIL | 0.555 |

### Performance

| Protein | MD Duration | Wall Time | GPU |
|---------|-------------|-----------|-----|
| Ubiquitin (76 aa) | 10 ns | ~23 min | RTX 4070 |
| GFP (238 aa) | 10 ns | ~49 min | RTX 4070 |

## Usage

ProtQC provides three usage modes:

### CLI — Single Protein Analysis

```bash
# Analyze a PDB file
protqc analyze protein.pdb

# Enter a PDB ID — auto-downloads from RCSB
protqc analyze 1UBQ

# Skip MD simulation for quick structural checks
protqc analyze protein.pdb --skip-md

# Set MD simulation length
protqc analyze protein.pdb --md-duration 10

# Use pre-computed MD trajectory
protqc analyze protein.pdb --trajectory md_output.csv

# Generate FastQC-style HTML report
protqc analyze protein.pdb --html report.html

# JSON output
protqc analyze protein.pdb --format json
```
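
For screening several structures in a row, the commands above can be assembled from Python. The sketch below only builds argument lists from the flags documented in this section. The helper `build_analyze_command` is hypothetical, and it assumes the flags can be combined in a single call, which this README does not state.

```python
import shlex


def build_analyze_command(target, skip_md=False, md_duration=None,
                          html=None, output_format=None):
    """Build an argv list for `protqc analyze` from the documented flags."""
    if skip_md and md_duration is not None:
        raise ValueError("--skip-md and --md-duration contradict each other")
    cmd = ["protqc", "analyze", target]
    if skip_md:
        cmd.append("--skip-md")
    if md_duration is not None:
        cmd += ["--md-duration", str(md_duration)]
    if html:
        cmd += ["--html", html]
    if output_format:
        cmd += ["--format", output_format]
    return cmd


for pdb_id in ["1UBQ", "1EMA"]:
    cmd = build_analyze_command(pdb_id, skip_md=True, html=f"{pdb_id}.html")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` rather than a shell string avoids quoting problems with unusual file names.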

### Interactive Mode

```bash
# Launch interactive prompt — guides you through analysis
protqc
```

### AI Chat Assistant

```bash
# Start AI-powered chat for interpreting results
protqc chat
```

Chat supports 8 providers via LiteLLM: **OpenAI**, **Anthropic**, **Google**, **DeepSeek**, **OpenRouter**, **Moonshot**, **MiniMax**, **Zhipu**.

## Installation

### Docker (recommended — all platforms)

Docker is the easiest way to run ProtQC with all dependencies (OpenMM, CUDA, fpocket, FreeSASA, MDTraj):

```bash
# Build the image
docker build -t protqc .

# Analyze a protein (GPU-accelerated)
docker run --gpus all -v "$(pwd)/data:/app/data" protqc analyze data/benchmark/ubiquitin.pdb

# Run with MD simulation
docker run --gpus all -v "$(pwd)/data:/app/data" protqc analyze data/benchmark/ubiquitin.pdb --md-duration 10

# CPU-only (MD is slow without a GPU, so skip it)
docker run -v "$(pwd)/data:/app/data" -e CUDA_VISIBLE_DEVICES="" protqc analyze data/benchmark/ubiquitin.pdb --skip-md
```

**Docker Compose:**

```bash
# GPU-accelerated
docker compose run protqc analyze data/benchmark/ubiquitin.pdb

# CPU-only variant
docker compose run protqc-cpu analyze data/benchmark/ubiquitin.pdb --skip-md
```

> **Note:** GPU support requires the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). Without a GPU, MD simulations still work but are significantly slower (~10–50x). Use `--skip-md` for quick checks without MD.

### Source install (Linux only)

```bash
conda create -n protqc python=3.11
conda activate protqc

# OpenMM from conda-forge (includes CUDA support)
conda install -c conda-forge openmm

# ProtQC + all dependencies
pip install -e '.[all]'
```

> **Platform Support:** Source installation requires Linux. OpenMM and fpocket have limited support on macOS/Windows. Use Docker on non-Linux platforms.

## Configuration

All thresholds, weights, and verdict boundaries are defined in [`configs/thresholds.yaml`](configs/thresholds.yaml). Key tunables:

- **Intrinsically disordered proteins:** Increase `physics_verifier.md_rmsd_max_angstrom` (e.g., 8.0–10.0) since higher RMSD is expected
- **Membrane proteins:** Adjust `surface.sasa_polar_ratio_min/max` for transmembrane segments
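
The dotted names above suggest nested keys in the YAML file. As a rough sketch under that assumption, the helper below applies one override to a config dictionary without mutating the original. `set_dotted` is a hypothetical name, and the starting value of 3.0 is a placeholder, not ProtQC's actual default.

```python
import copy


def set_dotted(config: dict, dotted_key: str, value) -> dict:
    """Return a copy of config with the nested key `a.b.c` set to value."""
    updated = copy.deepcopy(config)
    node = updated
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        # Create intermediate sections if they are missing.
        node = node.setdefault(part, {})
    node[leaf] = value
    return updated


# Placeholder config; the real file is configs/thresholds.yaml.
base = {"physics_verifier": {"md_rmsd_max_angstrom": 3.0}}

# Loosen the RMSD limit for an intrinsically disordered protein.
idp = set_dotted(base, "physics_verifier.md_rmsd_max_angstrom", 9.0)
print(idp)
```

With PyYAML (already a ProtQC dependency), the dictionary could be read with `yaml.safe_load` and written back with `yaml.safe_dump`.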

## Limitations

ProtQC is a rapid pre-screening tool, not a substitute for comprehensive computational or experimental validation:

- **MD simulation length.** The default 10 ns simulation is a rapid pre-screen that catches catastrophic failures (large RMSD drift, complete unfolding). Subtle instabilities — slow conformational changes, partial unfolding events, aggregation-prone intermediates — may require 100–500 ns simulations for reliable detection (Lindorff-Larsen et al. 2011; Ferruz et al. 2022). Treat a ProtQC PASS as "no obvious red flags," not "experimentally validated."

- **Cavity detection.** fpocket was designed for identifying druggable surface binding pockets, not for internal void quality control (Le Guilloux et al. 2009). The suspicious cavity flagging (volume > 800 Å³, druggability < 0.4) is a literature-informed heuristic (Schmidtke et al. 2010), not a validated structural defect detector. Combine with packing density metrics or Voronoi-based tools for higher confidence.

- **Risk score weights.** The current weights are expert estimates based on published benchmarks (Dauparas et al. 2022; Ferruz et al. 2022) and will be refined through calibration on larger, more diverse protein sets. Different protein families (membrane proteins, IDPs, repeat proteins) may need substantially different weight profiles.
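
To make the cavity heuristic from the list above concrete, here is a minimal sketch of the flagging rule. It assumes both conditions must hold at once (large volume and low druggability), which is one reading of the thresholds quoted above. The `Pocket` class and `is_suspicious` function are hypothetical and do not parse real fpocket output.

```python
from dataclasses import dataclass

MAX_VOLUME_A3 = 800.0    # volume threshold quoted above, in cubic angstroms
MIN_DRUGGABILITY = 0.4   # druggability threshold quoted above


@dataclass
class Pocket:
    volume_a3: float      # pocket volume in cubic angstroms
    druggability: float   # druggability score, assumed to lie in 0-1


def is_suspicious(pocket: Pocket) -> bool:
    """Flag pockets that are both large and poorly druggable (assumed AND)."""
    return (pocket.volume_a3 > MAX_VOLUME_A3
            and pocket.druggability < MIN_DRUGGABILITY)


# Hypothetical pockets, for illustration only.
pockets = [Pocket(950.0, 0.2), Pocket(950.0, 0.7), Pocket(300.0, 0.1)]
print([is_suspicious(p) for p in pockets])
```

Only the first pocket is flagged: the second is large but druggable, and the third is poorly druggable but small.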

## Related Tools

| Tool | Focus |
|------|-------|
| [CHAPERONg](https://github.com/paulshamrat/CHAPERONg) | Automated GROMACS MD analysis |
| [MolProbity](https://github.com/rlabduke/MolProbity) | Stereochemistry validation |
| [QMEAN](https://swissmodel.expasy.org/qmean/) | Statistical potential scoring |
| [VoroMQA](https://bioinformatics.lt/wtsam/voromqa) | Voronoi tessellation quality |
| [ProSA](https://prosa.services.came.sbg.ac.at/prosa.php) | Statistical analysis of protein structures |
| [ProteinDJ](https://github.com/PapenfussLab/proteindj) | AI protein design evaluation |
| [BinderFlow](https://github.com/cryoEM-CNIO/BinderFlow) | Binder design pipeline |
| [OVO](https://github.com/MSDLLCpapers/ovo) | De novo protein design ecosystem |

## Roadmap

**v0.2.0** — Benchmark dataset (25 proteins, Garcia/Hermosilla/Chevalier), Colab MCP integration, weight calibration, replica runs

**v0.3.0** — Thermal stability prediction, MultiQC-style batch reports, Nextflow/Snakemake templates, REST API

## License

MIT

## Citation

```
Güzel, Ö.K. (2026). ProtQC: Physics-based verification of AI-designed protein structures.
github.com/korayguzel/protqc
```
