Metadata-Version: 2.4
Name: ragbio
Version: 2.0.1
Summary: A study-aware biomedical RAG framework for PubMed retrieval, citation-grounded summaries, and downstream omics integration
Author: Manish Kumar
License: MIT
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: faiss-cpu<2.0,>=1.7.4
Requires-Dist: ollama<1.0,>=0.1.0
Requires-Dist: langchain<0.3,>=0.1.20
Requires-Dist: langchain-community<0.3,>=0.0.30
Requires-Dist: biopython<2.0,>=1.82
Requires-Dist: requests<3.0,>=2.31
Requires-Dist: beautifulsoup4<5.0,>=4.12
Requires-Dist: pandas<3.0,>=2.0
Requires-Dist: python-dotenv<2.0,>=1.0.0
Requires-Dist: neo4j<6.0,>=5.18
Requires-Dist: tqdm<5.0,>=4.66
Dynamic: license-file

# RAG-Powered Gene Discovery Assistant (`ragbio`)

`ragbio` is a **study-aware, retrieval-augmented generation (RAG) toolkit** for biomedical knowledge discovery built on **PubMed literature**, **FAISS vector search**, and **Ollama-based large language models** (DeepSeek, LLaMA-family).

It is designed as a **reusable Python package** that can operate standalone **or** as a backend service inside larger platforms such as **OmniBioAI**, powering chat, literature summarization, and downstream bioinformatics workflows.

---

## Key Capabilities

`ragbio` enables:

* **Semantic search over PubMed abstracts** using FAISS
* **Study-scoped ingestion and indexing** for reproducibility
* **Multi-study search** (single study or global across all studies)
* **Instant PMID retrieval** (FAISS-only, no LLM calls)
* **LLM-based biomedical summarization** grounded in retrieved literature
* **Structured JSON outputs** for literature summarizers and reporting
* **Optional drug–target–disease extraction** and KG population
* **Progress reporting hooks** for real-time UI dashboards
* **Cache-aware retrieval** for low-latency interactive use

---

## Example Questions

* *Which genes are associated with oxidative stress in Alzheimer’s disease?*
* *What therapies target amyloid pathways according to recent literature?*
* *Summarize evidence linking TP53 variants to cancer therapies.*
* *Return PMIDs related to BRCA1 drug resistance (no summarization).*

---

## High-Level Architecture

```
User Query
│
├─► FAISS Retrieval (study-specific or multi-study)
│
├─► Top-K PubMed Abstracts
│
├─► (Optional) LLM Summarization (RAG)
│
├─► (Optional) Structured Extraction (JSON)
│
└─► Outputs:
     • PMIDs (instant)
     • Grounded summaries
     • Structured JSON (literature summarizer / KG)
```

**Important design choice (v1.1):**
FAISS retrieval is executed **exactly once per request**, and all downstream steps reuse the same retrieved documents. There is no duplicate search.
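
The retrieve-once pattern can be sketched as follows. This is a minimal illustration, not the package's internals: `Request`, `fake_retrieve`, `retrieve_once`, `pmids_only`, and `summarize` are hypothetical names, and the retrieval function returns placeholder records instead of querying a real FAISS index.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    """Hypothetical per-request state: documents are fetched once, then reused."""
    query: str
    top_k: int = 10
    docs: Optional[list] = None   # filled by the single retrieval call
    retrieval_calls: int = 0      # counts how often the index was searched


def fake_retrieve(query: str, top_k: int) -> list:
    # Stand-in for the FAISS search; returns placeholder records, not PubMed data.
    return [{"pmid": f"placeholder-{i}", "text": f"abstract {i} for {query}"}
            for i in range(top_k)]


def retrieve_once(req: Request) -> list:
    # Search the index only if this request has no documents yet.
    if req.docs is None:
        req.docs = fake_retrieve(req.query, req.top_k)
        req.retrieval_calls += 1
    return req.docs


def pmids_only(req: Request) -> list:
    # Instant output: identifiers of the retrieved documents.
    return [doc["pmid"] for doc in retrieve_once(req)]


def summarize(req: Request) -> str:
    # Stand-in for the LLM step; it reuses the documents already retrieved.
    docs = retrieve_once(req)
    return f"{len(docs)} abstracts considered for: {req.query}"
```

Calling `pmids_only(req)` and then `summarize(req)` on the same `Request` leaves `req.retrieval_calls` at 1, which is the property the design choice describes.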

---

## Installation

### Install from PyPI (recommended)

```bash
pip install ragbio
```

### Development install (from source)

```bash
git clone https://github.com/man4ish/omnibioai-rag.git
cd omnibioai-rag
pip install -e .
```

---

## Data Organization (Study-Aware)

By default, all PubMed data is organized under:

```
data/PubMed/
├── Abstracts/<study>/
├── Metadata/<study>/
├── PDFs/<study>/
└── Index/<study>/
```

This enables:

* Clean separation of case studies
* Reproducible indexing
* Safe multi-study search
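
A short sketch of how the layout above maps to paths. The helper names `study_paths` and `list_indexed_studies` are hypothetical and not part of the `ragbio` API. The assumption that a study counts as indexed when a directory exists under `Index/` is also illustrative.

```python
from pathlib import Path

# Sub-directories taken from the layout shown above.
SUBDIRS = ("Abstracts", "Metadata", "PDFs", "Index")


def study_paths(study: str, root: Path = Path("data/PubMed")) -> dict[str, Path]:
    """Return the four per-study directories for one study."""
    return {name: root / name / study for name in SUBDIRS}


def list_indexed_studies(root: Path = Path("data/PubMed")) -> list[str]:
    """List study names that have a directory under Index/ (assumed convention)."""
    index_root = root / "Index"
    if not index_root.is_dir():
        return []
    return sorted(p.name for p in index_root.iterdir() if p.is_dir())


paths = study_paths("Alzheimer_CaseStudy")
# paths["Index"] is data/PubMed/Index/Alzheimer_CaseStudy
```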

---

## Usage Guide

### 1️⃣ Ingest PubMed Literature (Study-Aware)

```bash
python -m ragbio.utils.rag_data_loader \
  --study Alzheimer_CaseStudy \
  --search "Alzheimer Disease AND therapy" \
  --retmax 500 \
  --retstart 0
```

This step:

* Fetches PubMed abstracts and metadata
* Stores abstracts under `Abstracts/<study>/` and metadata under `Metadata/<study>/`
* Optionally downloads open-access PDFs
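
To show what `--search`, `--retmax`, and `--retstart` correspond to on the NCBI side, the sketch below builds an E-utilities ESearch URL and computes paging offsets. How `rag_data_loader` actually issues its requests is not shown in this document (it may well go through Biopython's `Bio.Entrez`), so `esearch_url` and `page_offsets` are hypothetical helpers for illustration only. No network call is made.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term: str, retmax: int = 500, retstart: int = 0) -> str:
    """Build an ESearch URL for PubMed; retstart is the offset of the first hit."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": retmax,      # page size
        "retstart": retstart,  # zero-based offset into the result set
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def page_offsets(total: int, retmax: int) -> list[int]:
    """Offsets needed to page through `total` hits, `retmax` at a time."""
    return list(range(0, total, retmax))


url = esearch_url("Alzheimer Disease AND therapy", retmax=500, retstart=0)
```

Fetching more than one page means re-running the ingestion command with `--retstart` set to each offset in turn, for example 0, 500, and 1000 for 1,200 hits at `--retmax 500`.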

---

### 2️⃣ Generate Embeddings & Build FAISS Index

```bash
python -m ragbio.embeddings.embedding_engine \
  --study Alzheimer_CaseStudy
```

This step:

* Reads abstracts from `Abstracts/<study>/`
* Generates embeddings via Ollama
* Writes FAISS index to `Index/<study>/`
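
Conceptually, searching the index means comparing a query embedding against every stored abstract embedding. The pure-Python sketch below performs an exhaustive inner-product search over normalized vectors (cosine similarity), which is the computation a flat inner-product FAISS index carries out. It is an illustration under assumptions: the three-dimensional toy vectors stand in for Ollama embeddings, the document IDs are placeholders, and the index type `ragbio` actually builds is not stated in this document.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def search(index: list, query_vec: list[float], top_k: int = 2) -> list:
    """Exhaustive search: score every stored vector, return the top_k best."""
    q = normalize(query_vec)
    scored = [(sum(a * b for a, b in zip(q, vec)), doc_id) for doc_id, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_k]]


# Toy "index": (placeholder id, normalized embedding) pairs.
toy_index = [
    ("doc_a", normalize([1.0, 0.0, 0.0])),
    ("doc_b", normalize([0.0, 1.0, 0.0])),
    ("doc_c", normalize([1.0, 1.0, 0.0])),
]

hits = search(toy_index, [0.9, 0.1, 0.0], top_k=2)
# hits ranks doc_a first and doc_c second
```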

---

### 3️⃣ Run RAG Queries (CLI)

```bash
python -m ragbio.pipeline.rag_pipeline \
  --query "Which therapies target amyloid pathways in Alzheimer’s disease?" \
  --top_k 10 \
  --structured \
  --study Alzheimer_CaseStudy
```

Outputs include:

* Grounded summary
* Supporting PMIDs
* Optional structured JSON
* Optional Neo4j KG updates

---

### 4️⃣ Instant PMID Retrieval (No LLM)

For low-latency applications (chat, TES, pipelines):

```bash
python -m ragbio.pipeline.rag_pipeline \
  --query "TP53 apoptosis cancer therapy" \
  --pmids-only \
  --top_k 20
```

* ✔ FAISS-only
* ✔ Cache-aware
* ✔ Suitable for real-time UI

---

## Python API (Recommended for Integration)

### Public API (v1.1)

```python
from ragbio.pipeline import (
    RAGAssistant,
    get_pmids,
    run_rag_json,
)
```

---

### Instant PMID Retrieval

```python
pmids = get_pmids(
    query="BRCA1 drug resistance",
    top_k=20,
    study=None,   # search across ALL studies
)
```

---

### Structured RAG Output (Literature Summarizer)

```python
result = run_rag_json(
    query="TP53 variants and chemotherapy response",
    top_k=10,
    study="Cancer_Study",
)
```

Returns structured JSON suitable for:

* Literature summarization
* ReportingService
* Downstream AI agents

---

### Advanced Usage (Long-Lived Assistant)

```python
assistant = RAGAssistant(study="Alzheimer_CaseStudy")

pmids = assistant.get_pmids("amyloid beta clearance")

data = assistant.run_rag_json(
    "amyloid beta clearance therapies",
    structured=True,
)
```

---

## Multi-Study Search (v1.1)

* `study="default"` (or any other single study name) → search that study only
* `study=None` or `"*"` → search **all indexed studies**
* Results are merged, ranked, and deduplicated

This allows:

* Cross-project reuse of indexed literature
* Global chat-style queries
* Meta-analysis across studies
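
The merge, rank, and deduplicate step can be sketched as below. `merge_hits` is a hypothetical helper, not the package's implementation. It assumes that higher scores mean closer matches and that scores are comparable across studies because every index was built with the same embedding model. The identifiers in the usage example are placeholders.

```python
def merge_hits(per_study_hits: dict, top_k: int = 10) -> list:
    """Merge per-study (pmid, score) hits, keeping the best score per PMID."""
    best = {}  # pmid -> (score, study)
    for study, hits in per_study_hits.items():
        for pmid, score in hits:
            # Deduplicate: keep only the highest-scoring occurrence of a PMID.
            if pmid not in best or score > best[pmid][0]:
                best[pmid] = (score, study)
    # Rank globally by score, best first.
    ranked = sorted(best.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(pmid, score, study) for pmid, (score, study) in ranked[:top_k]]


merged = merge_hits({
    "study_a": [("id1", 0.9), ("id2", 0.5)],
    "study_b": [("id2", 0.7), ("id3", 0.6)],
})
# "id2" appears in both studies but only once in merged, with its best score
```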

---

## Caching & Performance

* PMID retrieval results are cached per:

  * query
  * study
  * index version
  * embedding model
* Cache invalidates automatically if index changes
* Designed for **sub-second responses** in chat workflows
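
One way to derive a cache key from the four fields listed above is to hash them together, so that any change to the index version or embedding model yields a new key and stale entries are never read. This is a sketch under assumptions: `cache_key` is a hypothetical function, the query normalization (trim and lowercase) is illustrative, and the model name in the example is a placeholder.

```python
import hashlib
import json
from typing import Optional


def cache_key(query: str, study: Optional[str],
              index_version: str, embedding_model: str) -> str:
    """Hash the fields that determine a retrieval result into one stable key."""
    payload = json.dumps(
        {
            "query": query.strip().lower(),  # assumed normalization
            "study": study or "*",           # None and "*" both mean all studies
            "index_version": index_version,
            "embedding_model": embedding_model,
        },
        sort_keys=True,  # stable field order gives a stable hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_v1 = cache_key("BRCA1 drug resistance", None, "v1", "example-embed-model")
key_v2 = cache_key("BRCA1 drug resistance", None, "v2", "example-embed-model")
# key_v1 != key_v2: rebuilding the index changes the key
```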

---

## Progress Reporting (UI-Ready)

All major steps emit progress events that can be wired to:

* OmniBioAI progress bars
* WebSocket updates
* TES run monitors

Example stages:

* `retrieval_start`
* `retrieval_complete`
* `llm_summarization`
* `structured_extraction_complete`
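
A callback-based hook is one common way to surface such events. The sketch below is hypothetical: `run_with_progress` and the `on_progress` parameter are illustrative names and not the documented `ragbio` signature. Only the stage names are taken from the list above, and the pipeline steps are replaced by placeholders.

```python
from typing import Callable, Optional

# A progress callback receives a stage name and a dict of extra details.
ProgressCallback = Callable[[str, dict], None]


def run_with_progress(query: str,
                      on_progress: Optional[ProgressCallback] = None) -> list:
    """Placeholder pipeline that emits one event per stage."""
    def emit(stage: str, **info) -> None:
        if on_progress is not None:
            on_progress(stage, info)

    emit("retrieval_start", query=query)
    docs = ["doc-1", "doc-2"]  # placeholder for retrieved abstracts
    emit("retrieval_complete", n_docs=len(docs))
    emit("llm_summarization")
    emit("structured_extraction_complete")
    return docs


events = []
run_with_progress("amyloid beta clearance",
                  on_progress=lambda stage, info: events.append(stage))
# events now holds the four stage names in the order they were emitted
```

A UI layer would swap the lambda for a function that pushes each event to a progress bar or WebSocket.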

---

## Technologies Used

| Category         | Tools                               |
| ---------------- | ----------------------------------- |
| Language         | Python 3.11+                        |
| Retrieval        | FAISS                               |
| Embeddings       | Ollama embedding models             |
| LLMs             | DeepSeek, LLaMA-family (via Ollama) |
| Data Source      | PubMed (NCBI Entrez)                |
| Graph (optional) | Neo4j                               |
| UI (optional)    | Streamlit, Cytoscape                |

---

## Design Principles

* **Study-first organization**
* **Explicit retrieval control**
* **No hidden FAISS calls**
* **Composable APIs**
* **Safe defaults, override when needed**
* **Platform-friendly (OmniBioAI, TES, agents)**

---

## Roadmap

### v1.1 (current)

* Multi-study search
* Instant PMID retrieval
* Structured JSON output
* Cache-aware retrieval
* Public Python API

### v1.2+

* Streaming RAG responses
* Retrieval metrics & dashboards
* Neo4j-first knowledge graphs
* FastAPI / Django service mode
* Citation confidence scoring
* Multi-study comparative dashboards

---

## License

MIT License

