Metadata-Version: 2.4
Name: k-sites
Version: 1.2.1
Summary: K-Sites: AI-Powered CRISPR Guide RNA Design Platform with Multi-Database Integration (GO.org, UniProt, KEGG), Exponential Decay Pleiotropy Scoring, Evidence-Based Filtering (IDA/IMP/IGI vs IEA), and RAG-Based Phenotype Prediction
Home-page: https://github.com/KanakaKK/K-sites
Author: Kanaka KK, Sandip Garai, Jeevan C, Tanzil Fatima
Author-email: kanakakk@example.com
Project-URL: Bug Reports, https://github.com/KanakaKK/K-sites/issues
Project-URL: Source, https://github.com/KanakaKK/K-sites
Project-URL: Documentation, https://github.com/KanakaKK/K-sites/blob/main/README.md
Keywords: crispr,bioinformatics,genomics,gene-editing,rna-guides,biology,research,pathway-analysis,go-terms
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.25.0
Requires-Dist: biopython>=1.78
Requires-Dist: neo4j>=4.4.0
Requires-Dist: pyyaml>=5.4.0
Requires-Dist: tqdm>=4.60.0
Requires-Dist: dataclasses-json>=0.5.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: openpyxl>=3.0.0
Requires-Dist: numpy>=1.21.0
Provides-Extra: rag
Requires-Dist: sentence-transformers>=2.2.0; extra == "rag"
Requires-Dist: faiss-cpu>=1.7.4; extra == "rag"
Provides-Extra: webapp
Requires-Dist: flask>=3.0.0; extra == "webapp"
Requires-Dist: flask-sqlalchemy>=3.0.0; extra == "webapp"
Requires-Dist: flask-cors>=4.0.0; extra == "webapp"
Requires-Dist: flask-mail>=0.9.1; extra == "webapp"
Requires-Dist: email-validator>=2.1.0; extra == "webapp"
Requires-Dist: jinja2>=3.1.0; extra == "webapp"
Provides-Extra: all
Requires-Dist: sentence-transformers>=2.2.0; extra == "all"
Requires-Dist: faiss-cpu>=1.7.4; extra == "all"
Requires-Dist: flask>=3.0.0; extra == "all"
Requires-Dist: flask-sqlalchemy>=3.0.0; extra == "all"
Requires-Dist: flask-cors>=4.0.0; extra == "all"
Requires-Dist: flask-mail>=0.9.1; extra == "all"
Requires-Dist: email-validator>=2.1.0; extra == "all"
Requires-Dist: jinja2>=3.1.0; extra == "all"
Requires-Dist: python-dotenv>=1.0.0; extra == "all"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license-file
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# K-Sites v1.2.0: Advanced CRISPR Guide RNA Design Platform

[![PyPI version](https://badge.fury.io/py/k-sites.svg)](https://pypi.org/project/k-sites/)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**K-Sites** is a comprehensive CRISPR guide RNA design platform that integrates **GO term analysis** with **KEGG pathway graph analytics** to identify non-pleiotropic gene targets and design optimal gRNAs with pathway-aware off-target filtering.

---

## 🌟 Key Features

### 1. Multi-Database Integration
Queries **GO.org, UniProt, and KEGG in parallel** to retrieve gene data:
- **GO.org (QuickGO)**: Gene Ontology annotations with evidence codes
- **UniProt**: Protein knowledgebase (function, domains, pathways)
- **KEGG**: Pathway annotations and pathway counts

```bash
# Use all databases (default)
k-sites --go-term GO:0006281 --organism 9606 --output report.html

# Select specific databases
k-sites --go-term GO:0006281 --organism 9606 --output report.html --databases quickgo uniprot kegg
```

### 2. Pleiotropy Scoring Algorithm
**Exponential decay scoring** based on the number of Biological Process (BP) GO terms associated with a gene:

```
Score = 10 × (1 - exp(-λ × (n-1)))
```
Where:
- `n` = number of BP terms annotated to the gene, so `n - 1` counts the OTHER BP terms (excluding the target GO term)
- `λ` = 0.3 (decay rate)
- Score range: **0-10** (0 = highly specific; values approaching 10 = highly pleiotropic)

**Specificity Score**: Complement of the pleiotropy score, rescaled to 0-1 (1 = most specific)
```
Specificity = 1.0 - (Pleiotropy / 10.0)
```

```bash
k-sites --go-term GO:0006281 --organism 9606 --output report.html --max-pleiotropy 3
```
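
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the package's implementation: the function names `pleiotropy_score` and `specificity_score` are hypothetical, and the function takes the exponent count (the `n - 1` of the formula) directly so that it does not depend on how `n` is counted.

```python
import math

DECAY_RATE = 0.3  # λ from the formula above


def pleiotropy_score(other_bp_terms: int, decay_rate: float = DECAY_RATE) -> float:
    """Return 10 * (1 - exp(-λ * k)) for k other BP terms (hypothetical helper)."""
    k = max(other_bp_terms, 0)  # assumption: negative counts are clamped to 0
    return 10.0 * (1.0 - math.exp(-decay_rate * k))


def specificity_score(pleiotropy: float) -> float:
    """Map a 0-10 pleiotropy score onto the 0-1 specificity scale."""
    return 1.0 - pleiotropy / 10.0


for k in (0, 1, 2, 5, 10):
    p = pleiotropy_score(k)
    print(f"{k:>2} other BP terms -> pleiotropy {p:.2f}, specificity {specificity_score(p):.2f}")
```

With `λ = 0.3` the score rises quickly: about 2.6 for one other BP term and about 4.5 for two, so a threshold such as `--max-pleiotropy 3` is fairly strict.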

### 3. Evidence-Based Filtering
Distinguishes **experimental evidence** from **computational predictions**:

| Evidence Type | Codes | Weight |
|--------------|-------|--------|
| **Experimental** | IDA, IMP, IGI, IPI, IEP, HTP, HDA, HMP, HGI, HEP | 1.0 |
| **Computational** | ISS, ISO, ISA, ISM, IGC, IBA, IBD, IKR, IRD, RCA | 0.6 |
| **Prediction (IEA)** | IEA | 0.3 |

```bash
# Experimental only (default)
k-sites --go-term GO:0006281 --organism 9606 --output report.html --evidence-filter experimental

# Computational only
k-sites --go-term GO:0006281 --organism 9606 --output report.html --evidence-filter computational

# All evidence types
k-sites --go-term GO:0006281 --organism 9606 --output report.html --evidence-filter all
```
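
The weight table above amounts to a lookup from evidence code to weight. The sketch below shows one way to express it; `evidence_weight` is a hypothetical helper, and returning `0.0` for codes that are not in the table is an assumption, not documented K-Sites behavior.

```python
# Evidence codes and weights copied from the table above.
EXPERIMENTAL = {"IDA", "IMP", "IGI", "IPI", "IEP", "HTP", "HDA", "HMP", "HGI", "HEP"}
COMPUTATIONAL = {"ISS", "ISO", "ISA", "ISM", "IGC", "IBA", "IBD", "IKR", "IRD", "RCA"}


def evidence_weight(code: str) -> float:
    """Return the weight for a GO evidence code (hypothetical helper)."""
    code = code.strip().upper()
    if code in EXPERIMENTAL:
        return 1.0
    if code in COMPUTATIONAL:
        return 0.6
    if code == "IEA":
        return 0.3
    return 0.0  # assumption: codes outside the table carry no weight


for code in ("IDA", "IBA", "IEA", "ND"):
    print(code, evidence_weight(code))
```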

### 4. RAG-Based Phenotype Prediction
Literature mining with semantic analysis:
- **PubMed integration**: Real-time NCBI Entrez API queries
- **Semantic embeddings**: SentenceTransformer (all-MiniLM-L6-v2)
- **Vector search**: FAISS L2 distance indexing
- **Severity classification**: LETHAL, SEVERE, MODERATE, MILD, UNKNOWN
- **Risk assessment**: CRITICAL, HIGH, MEDIUM, LOW, UNKNOWN

```bash
k-sites --go-term GO:0006281 --organism 9606 --output report.html --predict-phenotypes --rag-report
```
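
To make the vector-search step concrete, here is a standard-library stand-in for what a flat L2 index does: rank stored vectors by squared Euclidean distance to a query. It does not call FAISS or SentenceTransformer; the toy 3-dimensional vectors replace the 384-dimensional `all-MiniLM-L6-v2` embeddings, and the names `squared_l2` and `search` are hypothetical.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(index: List[Sequence[float]], query: Sequence[float], k: int) -> List[Tuple[float, int]]:
    """Brute-force search: return the k closest (distance, position) pairs."""
    scored = sorted((squared_l2(vec, query), i) for i, vec in enumerate(index))
    return scored[:k]


# Toy stand-ins for abstract embeddings (illustrative values only).
abstract_vectors = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.2, 0.1],
]
query_vector = [1.0, 0.0, 0.0]

for distance, position in search(abstract_vectors, query_vector, k=2):
    print(f"abstract #{position}: squared L2 distance {distance:.2f}")
```

Smaller distances mean closer semantic matches; a real index performs the same ranking over many more vectors and dimensions.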

### 5. CRISPR gRNA Design
- **Multi-Cas support**: SpCas9, SaCas9, Cas12a, Cas9-NG, xCas9
- **Doench 2016**: On-target efficiency scoring (20 position-dependent weights)
- **CFD Algorithm**: Off-target prediction with position-weighted mismatches
- **Pathway-aware filtering**: Prevents disruption of critical pathways
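
As a rough illustration of the first step in gRNA design, the sketch below scans the forward strand of a sequence for SpCas9 candidate sites: a 20-nt protospacer immediately followed by an `NGG` PAM. It is a simplified example under stated assumptions, not the K-Sites designer. The name `find_spcas9_sites` is hypothetical, the reverse strand and other Cas variants are ignored, and no Doench or CFD scoring is applied.

```python
import re
from typing import List, Tuple

# A lookahead lets overlapping candidate sites be reported.
SPCAS9_SITE = re.compile(r"(?=([ACGT]{20})([ACGT]GG))")


def find_spcas9_sites(sequence: str) -> List[Tuple[int, str, str]]:
    """Return (start, protospacer, PAM) for each forward-strand NGG site."""
    seq = sequence.upper()
    return [(m.start(), m.group(1), m.group(2)) for m in SPCAS9_SITE.finditer(seq)]


example = "ACGTACGTACGTACGTACGT" + "TGG" + "AAAA"  # made-up sequence for illustration
for start, protospacer, pam in find_spcas9_sites(example):
    print(start, protospacer, pam)
```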

### 6. Cross-Species Validation
Validates gene specificity across model organisms:
- **9606** - Homo sapiens (Human)
- **10090** - Mus musculus (Mouse)
- **7227** - Drosophila melanogaster (Fly)
- **6239** - Caenorhabditis elegans (Worm)

```bash
k-sites --go-term GO:0006281 --organism 9606 --output report.html \
        --species-validation 9606 10090 7227
```
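
Organisms can be given as a TaxID or a scientific name. The sketch below shows a minimal lookup restricted to the four model organisms listed above. `resolve_taxid` is a hypothetical helper written for illustration; the packaged `organism_resolver.py` may work differently and cover more species.

```python
# TaxIDs and names taken from the list above.
MODEL_ORGANISMS = {
    "9606": "Homo sapiens",
    "10090": "Mus musculus",
    "7227": "Drosophila melanogaster",
    "6239": "Caenorhabditis elegans",
}
_NAME_TO_TAXID = {name.lower(): taxid for taxid, name in MODEL_ORGANISMS.items()}


def resolve_taxid(organism: str) -> str:
    """Accept a TaxID or scientific name and return the TaxID as a string."""
    key = organism.strip()
    if key in MODEL_ORGANISMS:
        return key
    try:
        return _NAME_TO_TAXID[key.lower()]
    except KeyError:
        raise ValueError(f"Unknown organism: {organism!r}") from None


print(resolve_taxid("Homo sapiens"))
print(resolve_taxid("10090"))
```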

---

## 🛠️ Installation

### Prerequisites
- Python 3.8+
- Git (only needed to install from source)

### Install from PyPI

```bash
# Basic installation
pip install k-sites

# With RAG phenotype prediction support
pip install 'k-sites[rag]'

# With web application support
pip install 'k-sites[webapp]'

# With all features
pip install 'k-sites[all]'
```

### Install from Source

```bash
git clone https://github.com/kkokay07/K-Sites.git
cd K-Sites
pip install -e .
```

### Environment Variables

```bash
export NCBI_EMAIL="your.email@example.com"  # Required for NCBI API calls
export NCBI_API_KEY="your_ncbi_api_key"     # Optional for higher rate limits
```
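
If you drive NCBI queries from your own scripts, it can help to check these variables up front. The sketch below is one way to do that; `load_ncbi_credentials` is a hypothetical helper, not part of the K-Sites API, and it only mirrors the required/optional split described above.

```python
import os
from typing import Mapping, Optional, Tuple


def load_ncbi_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, Optional[str]]:
    """Return (email, api_key); fail early if the required email is missing."""
    email = env.get("NCBI_EMAIL", "").strip()
    if not email:
        raise RuntimeError("NCBI_EMAIL is not set; export it before running NCBI queries")
    api_key = env.get("NCBI_API_KEY", "").strip() or None  # optional
    return email, api_key


# Demonstration with an explicit mapping instead of the real environment.
email, api_key = load_ncbi_credentials({"NCBI_EMAIL": "your.email@example.com"})
print(email, "API key set" if api_key else "no API key")
```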

---

## 🚀 Usage

### Command Line Interface

#### Basic Usage
```bash
k-sites --go-term GO:0006281 --organism "Homo sapiens" --output report.html
```

#### Advanced Usage with All Features
```bash
k-sites --go-term GO:0006281 --organism 9606 --output results/report.html \
        --use-graph \
        --max-pleiotropy 5 \
        --evidence-filter experimental \
        --species-validation 9606 10090 10116 \
        --predict-phenotypes \
        --rag-report \
        --databases all
```

#### Search GO Terms
```bash
k-sites --go-term-search "DNA repair" --organism "Homo sapiens" --output report.html
```

### Programmatic API

```python
from k_sites.workflow.pipeline import run_k_sites_pipeline

results = run_k_sites_pipeline(
    go_term="GO:0006281",  # DNA repair
    organism="9606",       # Human
    max_pleiotropy=3,
    use_graph=True,
    evidence_filter="experimental",
    species_validation=["9606", "10090"],
    predict_phenotypes=True,
    databases=["quickgo", "uniprot", "kegg"]
)

# Generate report
from k_sites.reporting.report_generator import generate_html_report
generate_html_report(results, "output_report.html")
```

---

## 📊 Output Files

| File | Description |
|------|-------------|
| `report.html` | Interactive HTML report with visualizations |
| `report_comprehensive.csv` | Full results with all metrics |
| `report_gene_summary.csv` | Gene-level summary statistics |
| `report_grna_sequences.fasta` | gRNA sequences in FASTA format |
| `report_sequences.gb` | GenBank format sequences |
| `rag_reports/*.html` | Per-gene RAG literature analysis (with `--rag-report`) |

---

## ⚙️ CLI Options

| Option | Description | Default |
|--------|-------------|---------|
| `--go-term` | GO term to analyze (e.g., GO:0006281) | Required unless `--go-term-search` is used |
| `--organism` | Organism as TaxID or scientific name | Required |
| `--output` | Output HTML report path | Required |
| `--use-graph` | Enable Neo4j pathway analysis | True |
| `--no-graph` | Disable Neo4j, use GO-only mode | - |
| `--max-pleiotropy` | Max allowed pleiotropy (0-10) | 10 |
| `--evidence-filter` | Evidence type: experimental/computational/all | experimental |
| `--species-validation` | Species for cross-validation | 9606 10090 7227 6239 |
| `--predict-phenotypes` | Enable RAG phenotype prediction | False |
| `--rag-report` | Generate detailed RAG reports | False |
| `--databases` | Select databases: quickgo/uniprot/ncbi/pubmed/kegg/all | all |
| `--go-term-search` | Search GO terms by keyword | - |

---

## 🏗️ Architecture

```
k_sites/
├── data_retrieval/       # Multi-database integration (GO, UniProt, KEGG)
│   ├── multi_database_client.py
│   ├── go_gene_mapper.py
│   └── organism_resolver.py
├── gene_analysis/        # Pleiotropy scoring algorithms
│   └── pleiotropy_scorer.py
├── crispr_design/        # gRNA design and scoring
│   └── guide_designer.py
├── neo4j/               # Graph database integration
│   ├── graph_client.py
│   └── ingest_kegg.py
├── rag_system/          # Literature-based phenotype prediction
│   └── literature_context.py
├── reporting/           # Report generation
│   ├── report_generator.py
│   ├── csv_export.py
│   └── rag_report_generator.py
└── workflow/            # Pipeline orchestration
    └── pipeline.py
```

---

## 🧪 Testing

```bash
# Run all tests
python -m pytest tests/

# Run specific test modules
python -m pytest tests/test_non_pleiotropic_features.py
python -m pytest tests/test_crispr_design.py
python -m pytest tests/test_rag_phenotype.py
```

---

## 📚 Documentation

- **Feature Documentation**: [FEATURES.md](FEATURES.md)
- **PyPI Package**: https://pypi.org/project/k-sites/
- **GitHub Repository**: https://github.com/kkokay07/K-Sites

---

## 👥 Developers

- **Kanaka KK** - Lead Architect
- **Sandip Garai** - Neo4j Graph Integration Specialist
- **Jeevan C** - CRISPR Algorithm Developer
- **Tanzil Fatima** - Bioinformatics Analyst

---

## 📄 License

MIT License - see [LICENSE](LICENSE) file for details.

---

## 🙏 Acknowledgments

- Thanks to the GO Consortium for Gene Ontology resources
- Thanks to UniProt for the protein knowledgebase
- Thanks to KEGG for pathway data
- Thanks to NCBI for its biological databases (PubMed, Entrez)
- Thanks to the Neo4j community for graph database technology

---

## 📈 Version History

| Version | Date | Key Features |
|---------|------|--------------|
| 1.0.0 | 2026-02-05 | Initial release with core functionality |
| 1.1.0 | 2026-02-13 | Web application, RAG system, enhanced reporting |
| 1.2.0 | 2026-02-13 | Multi-database selection, RAG reports, improved CLI |
