Metadata-Version: 2.4
Name: ragbio
Version: 0.2.0
Summary: A retrieval-augmented biomedical literature framework for evidence discovery, citation mapping, and downstream omics analysis
Author: Manish Kumar
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: faiss-cpu<2.0,>=1.7.4
Requires-Dist: sentence-transformers<3.0,>=2.2.2
Requires-Dist: pymed<1.0,>=0.4.0
Requires-Dist: requests<3.0,>=2.31
Requires-Dist: pandas<3.0,>=2.0
Requires-Dist: ollama<1.0,>=0.1.0
Requires-Dist: langchain<0.3,>=0.1.20
Requires-Dist: langchain-community<0.3,>=0.0.30
Requires-Dist: biopython<2.0,>=1.82
Requires-Dist: beautifulsoup4<5.0,>=4.12
Requires-Dist: sqlite-utils<4.0,>=3.38
Requires-Dist: neo4j<6.0,>=5.18
Requires-Dist: py2neo>=2021.2.4
Requires-Dist: python-dotenv<2.0,>=1.0.0
Dynamic: license-file

# RAG-Powered Biomedical Evidence Framework (ragbio)

A reusable **retrieval-augmented generation (RAG)** toolkit for biomedical knowledge discovery built on **PubMed literature**, **vector search**, and **Ollama-based LLMs** (DeepSeek / LLaMA3).

`ragbio` enables **study-aware ingestion, embedding, and querying** of biomedical literature to support gene–disease–therapy exploration, summarization, and network visualization.

It is published as a **pip-installable Python package** and is designed to slot into research pipelines and bioinformatics workflows.

---

## Overview

The RAG-powered assistant provides:

* Semantic search over PubMed abstracts
* Study-scoped literature ingestion for reproducibility
* Summarization of complex biomedical evidence using LLMs
* Citation-aware responses grounded in PubMed IDs
* Modular ingestion → embedding → retrieval pipeline
* Optional gene–disease–drug network visualization

**Example questions**

* *Which genes are linked to oxidative stress in Alzheimer’s disease?*
* *What therapies target amyloid pathways according to recent literature?*
* *Summarize evidence connecting TP53 variants to cancer therapies.*

---

## Architecture

```
User Question
│
▼
FAISS Vector Retrieval (PubMed Abstracts)
│
▼
Top-K Relevant Abstracts
│
▼
Ollama LLM (DeepSeek / LLaMA3)
│
▼
Grounded Biomedical Summary + PMIDs
│
▼
(Optional) Gene–Disease–Drug Network Visualization
```
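
The retrieval and prompt-assembly stages of the diagram can be sketched in plain Python. This is a minimal illustration, not `ragbio`'s actual API: brute-force cosine similarity stands in for FAISS, the vectors, PMIDs, and abstract texts are made-up placeholders, and the function names `top_k` and `build_prompt` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, corpus, k=2):
    """Return the k corpus entries most similar to the query.

    corpus: list of (pmid, vector, abstract) tuples (assumed layout).
    """
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return ranked[:k]


def build_prompt(question, hits):
    """Assemble an LLM prompt that exposes PMIDs so the answer can cite them."""
    context = "\n".join(f"[PMID:{pmid}] {text}" for pmid, _, text in hits)
    return (
        "Answer using only the abstracts below and cite PMIDs.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


# Placeholder corpus: fake PMIDs, toy 3-dimensional embeddings.
corpus = [
    ("111", [1.0, 0.0, 0.0], "Placeholder abstract about amyloid-targeting therapy."),
    ("222", [0.0, 1.0, 0.0], "Placeholder abstract about oxidative stress genes."),
    ("333", [0.9, 0.1, 0.0], "Placeholder abstract about amyloid clearance."),
]

hits = top_k([1.0, 0.0, 0.0], corpus, k=2)
prompt = build_prompt("What therapies target amyloid pathways?", hits)
print(prompt)
```

In the real pipeline the prompt would be passed to an Ollama model; that call is omitted here so the sketch runs without a local model server.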

---

## Installation

### Install from PyPI (recommended)

```bash
pip install ragbio
```

### Development install (from source)

```bash
git clone https://github.com/<your-username>/rag-gene-discovery-assistant.git
cd rag-gene-discovery-assistant
pip install -e .
```

---

## Usage

### 1. Ingest PubMed Literature (study-aware)

```bash
python -m ragbio.utils.rag_data_loader \
  --study Alzheimer_CaseStudy \
  --search "Alzheimer Disease AND therapy" \
  --retmax 500 \
  --retstart 0
```

This creates the following per-study structure under the data root (default: `data/PubMed/`):

```
PubMed/
├── Abstracts/Alzheimer_CaseStudy/
├── Metadata/Alzheimer_CaseStudy/
├── PDFs/Alzheimer_CaseStudy/
└── Index/Alzheimer_CaseStudy/
```
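
The study-scoped layout above can be reproduced with a few lines of `pathlib`. This is a sketch under the assumption that each top-level folder holds one subfolder per study, as the tree shows; the helper names `study_paths` and `ensure_study_dirs` are hypothetical and not part of `ragbio`.

```python
from pathlib import Path

# Top-level folders taken from the directory tree above.
SUBDIRS = ("Abstracts", "Metadata", "PDFs", "Index")


def study_paths(study, root="data/PubMed"):
    """Map each top-level folder name to its study-specific subfolder."""
    base = Path(root)
    return {name: base / name / study for name in SUBDIRS}


def ensure_study_dirs(study, root="data/PubMed"):
    """Create the study folders if they are missing and return their paths."""
    paths = study_paths(study, root)
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths


# Inspect the paths without touching the filesystem.
for name, path in study_paths("Alzheimer_CaseStudy").items():
    print(f"{name}: {path}")
```

Keeping the study name as the innermost folder means two studies never share an abstract set or an index, which is what makes a run reproducible from its study name alone.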

---

### 2. Generate Embeddings & Build FAISS Index

```bash
python -m ragbio.embeddings.embedding_engine \
  --study Alzheimer_CaseStudy
```

* Reads from `Abstracts/<study>/`
* Writes FAISS index to `Index/<study>/`
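
The read-then-index flow can be sketched as follows. Several things here are assumptions rather than documented behavior: the one-`<PMID>.txt`-file-per-abstract layout, the helper names, and the choice of an exact inner-product index (`faiss.IndexFlatIP`). The embedding step itself is left to the caller, since the model is configurable.

```python
from pathlib import Path


def load_abstracts(abstract_dir):
    """Return {pmid: text}, assuming one '<PMID>.txt' file per abstract."""
    return {
        path.stem: path.read_text(encoding="utf-8").strip()
        for path in sorted(Path(abstract_dir).glob("*.txt"))
    }


def build_faiss_index(vectors, index_path):
    """Build an exact inner-product index over L2-normalized vectors and save it."""
    # Imported lazily so load_abstracts works without faiss or numpy installed.
    import faiss
    import numpy as np

    # Copy into a C-contiguous float32 matrix, the layout FAISS expects.
    matrix = np.array(vectors, dtype="float32", order="C")
    # Normalizes rows in place; inner product then equals cosine similarity.
    faiss.normalize_L2(matrix)
    index = faiss.IndexFlatIP(matrix.shape[1])
    index.add(matrix)
    faiss.write_index(index, str(index_path))
    return index
```

A caller would embed the values of `load_abstracts(...)` in a fixed order, pass the resulting vectors to `build_faiss_index`, and keep the matching PMID list alongside the index so that search results can be mapped back to citations.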

---

### 3. Run RAG Queries

```bash
python -m ragbio.pipeline.rag_pipeline \
  --query "Which therapies target amyloid pathways in Alzheimer’s disease?" \
  --top_k 10 \
  --structured \
  --study Alzheimer_CaseStudy
```

Outputs are generated per study for clean provenance and reproducibility.
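
Because answers are meant to cite PubMed IDs, a caller can check that every cited PMID was actually among the retrieved abstracts. The sketch below is a user-side check, not a `ragbio` feature; it assumes citations appear in the text as `PMID: <digits>`, and the function names and sample answer are hypothetical.

```python
import re

# Assumed citation format: "PMID:" followed by optional spaces and digits.
PMID_PATTERN = re.compile(r"PMID:\s*(\d+)")


def cited_pmids(answer):
    """Extract the set of PMIDs cited in an answer string."""
    return set(PMID_PATTERN.findall(answer))


def check_grounding(answer, retrieved_pmids):
    """Split cited PMIDs into those backed by retrieval and those that are not."""
    cited = cited_pmids(answer)
    retrieved = set(retrieved_pmids)
    return {
        "supported": sorted(cited & retrieved),
        "unsupported": sorted(cited - retrieved),
    }


# Placeholder answer with fake PMIDs.
answer = "Therapy A is reported to reduce plaque load (PMID: 111). Therapy B is in trials (PMID:999)."
report = check_grounding(answer, ["111", "333"])
print(report)
```

Any entry under `unsupported` points to a citation the model produced without a retrieved abstract behind it and is worth reviewing by hand.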

---

### 4. Visualize Gene–Disease–Drug Networks (optional)

Launch the Streamlit app:

```bash
streamlit run ragbio/pipeline/rag_cytoscape_streamlit.py -- --study Alzheimer_CaseStudy
```

This reads structured outputs and visualizes gene–disease–drug relationships as an interactive network.

![RAG Network Graph](ragbio/images/network_graph.png)

*Example: Gene–disease–drug co-occurrence network derived from PubMed abstracts.*
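
A co-occurrence network of this kind can be derived by counting how often two entities appear in the same abstract. The sketch below assumes entity extraction has already produced one list of names per abstract; the record format, the function name `cooccurrence_edges`, and the placeholder entity labels are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(records):
    """Count entity pairs that appear together in the same abstract.

    records: iterable of entity-name lists, one list per abstract.
    Returns a Counter keyed by alphabetically ordered (a, b) pairs.
    """
    edges = Counter()
    for entities in records:
        # Deduplicate within an abstract so each pair counts once per abstract.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


# Placeholder entities, not real findings.
records = [
    ["GENE_A", "DISEASE_X", "DRUG_Y"],
    ["GENE_A", "DISEASE_X"],
    ["GENE_B", "DISEASE_Z"],
]

edges = cooccurrence_edges(records)
for (a, b), weight in edges.most_common():
    print(f"{a} -- {b}: {weight}")
```

The pair counts serve as edge weights; a viewer such as Cytoscape can then size or color edges by how many abstracts support each link.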

---

### 5. Optional: Notebook Exploration

Explore example workflows in:

```
notebooks/RAG_GeneDiscovery_Assistant.ipynb
```

---

## Technologies Used

| Category      | Tools                                  |
| ------------- | -------------------------------------- |
| Embeddings    | Ollama embedding models (configurable) |
| LLMs          | DeepSeek, LLaMA3 (via Ollama)          |
| Retrieval     | FAISS                                  |
| Data Sources  | PubMed (NCBI Entrez)                   |
| Visualization | Streamlit, Cytoscape                   |
| Language      | Python 3.9+                            |

---

## Design Principles

* **Study-first organization** for reproducibility
* **Separation of concerns** (ingestion ≠ embedding ≠ retrieval)
* **Grounded answers** with PubMed citations
* **Composable modules** usable outside the CLI
* **Safe defaults** with override via CLI or environment variables
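
The "safe defaults with override" principle usually means a fixed precedence: CLI flag over environment variable over built-in default. The sketch below shows one way to get that with `argparse`. The environment variable names `RAGBIO_DATA_DIR` and `RAGBIO_TOP_K` and the `--data-dir` flag are assumptions made for illustration; only `--study` and `--top_k` appear in the commands above.

```python
import argparse
import os


def build_parser(env=None):
    """Build a parser whose defaults come from the environment when set."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Default / env / CLI precedence sketch")
    parser.add_argument("--study", required=True, help="Study name used to scope all paths")
    # Hypothetical flag and variable: built-in default, overridden by env, then by CLI.
    parser.add_argument("--data-dir", default=env.get("RAGBIO_DATA_DIR", "data/PubMed"))
    parser.add_argument("--top_k", type=int, default=int(env.get("RAGBIO_TOP_K", "10")))
    return parser


# With an empty environment and no extra flags, the built-in defaults apply.
args = build_parser({}).parse_args(["--study", "Alzheimer_CaseStudy"])
print(args.data_dir, args.top_k)
```

Passing the environment in as a plain mapping keeps the precedence logic testable without mutating `os.environ`.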

---

## Future Enhancements

* Neo4j-backed gene–disease–drug knowledge graphs
* Comparative evaluation of DeepSeek vs BioGPT outputs
* Variant-level evidence integration
* Web API layer built on FastAPI or Django
* Automated citation grounding and confidence scoring
* Multi-study dashboards and comparisons
