Metadata-Version: 2.4
Name: ragbio
Version: 0.1.14
Summary: RAG pipeline for gene-disease literature summarization
Author: Manish Kumar
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: faiss-cpu
Requires-Dist: sentence-transformers
Requires-Dist: pymed
Requires-Dist: requests
Requires-Dist: pandas
Requires-Dist: ollama
Requires-Dist: langchain
Requires-Dist: biopython
Requires-Dist: beautifulsoup4
Requires-Dist: sqlite-utils
Requires-Dist: py2neo
Dynamic: license-file

# RAG-Powered Gene Discovery Assistant (`ragbio`)

A generative AI tool for biomedical knowledge discovery built on **Hugging Face embeddings** and **Ollama LLMs (DeepSeek / LLaMA3)**.
It combines **retrieval-augmented generation (RAG)** with PubMed literature and gene annotation data to summarize gene–disease relationships.
It is distributed as an **importable Python package**, so the same pipeline can be reused across bioinformatics projects.

---

## Overview

The RAG-powered assistant enables:

* Semantic search over PubMed abstracts and gene annotations.
* Summarization of complex biomedical information.
* Citation tracking with PubMed IDs.
* Modular and reusable pipeline for gene–disease exploration.
* Case study–specific queries using `query_name` to organize outputs and visualizations.

**Example queries:**

* "Which genes are linked to oxidative stress in Alzheimer’s disease?"
* "Summarize recent findings about TP53 variants in cancer."

---

## Architecture

```
User Query
│
▼
Vector Retrieval (BioBERT / BioSentVec Embeddings)
│
▼
Top Abstracts + Gene Annotations
│
▼
Ollama LLM (DeepSeek / LLaMA3)
│
▼
Summarized Biomedical Answer + Citations
│
▼
Optional: Gene–Disease–Drug Network (Cytoscape / Streamlit)
```
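
The retrieval and prompt-assembly steps in the diagram can be illustrated with a minimal, dependency-free sketch. It is not the package's implementation: `ragbio` uses FAISS and sentence-transformer embeddings, whereas this example uses hand-written toy vectors and a brute-force cosine search. The function names (`retrieve_top_k`, `build_prompt`), the placeholder PMIDs, and the prompt wording are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, doc_vecs, top_k=2):
    """Rank documents by similarity to the query and keep the best top_k."""
    scored = [(pmid, cosine(query_vec, vec)) for pmid, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def build_prompt(question, abstracts, hits):
    """Concatenate retrieved abstracts, tagged by PMID, ahead of the question."""
    context = "\n".join(f"[PMID {pmid}] {abstracts[pmid]}" for pmid, _ in hits)
    return f"Context:\n{context}\n\nQuestion: {question}\nCite PMIDs in the answer."


# Toy embeddings and abstracts; the PMIDs are placeholders, not real records.
doc_vecs = {
    "0000001": [1.0, 0.0, 0.0],
    "0000002": [0.0, 1.0, 0.0],
    "0000003": [0.9, 0.1, 0.0],
}
abstracts = {
    "0000001": "Placeholder abstract A.",
    "0000002": "Placeholder abstract B.",
    "0000003": "Placeholder abstract C.",
}

hits = retrieve_top_k([1.0, 0.0, 0.0], doc_vecs, top_k=2)
prompt = build_prompt("Which genes are discussed?", abstracts, hits)
print(prompt)
```

In the real pipeline the prompt would be sent to the Ollama model, and the PMIDs of the retrieved abstracts would be returned alongside the summary as citations.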

---

## Installation

Clone and install as a package:

```bash
git clone https://github.com/<your-username>/rag-gene-discovery-assistant.git
cd rag-gene-discovery-assistant
pip install -e .
```

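**Example `requirements.txt`** (includes `streamlit`, which the visualization step needs but the package metadata does not declare as a dependency)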

```
langchain
faiss-cpu
sentence-transformers
biopython
pymed
requests
sqlite-utils
ollama
pandas
beautifulsoup4
streamlit
```

---

## Usage

### 1. Fetch PubMed Data

```python
from ragbio.utils.rag_data_loader import main as fetch_pubmed_data

# Download abstracts and metadata
fetch_pubmed_data()
```

### 2. Run RAG Query (with `query_name`)

```python
from ragbio.pipeline.rag_pipeline import RAGAssistant

# Use query_name to organize output
assistant = RAGAssistant(output_dir="output/Alzheimer_CaseStudy", query_name="Alzheimer_CaseStudy")
summary, pmids, structured = assistant.run_pipeline(
    "genes linked to Alzheimer’s disease",
    top_k=10,
    structured=True
)
```

*Outputs are stored under `output/<query_name>/` for easier tracking.*
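
To inspect what a run left behind, the per-query folder can be read back with the standard library. The sketch below assumes only the `output/<query_name>/` layout described above. The helper name `list_json_outputs`, the file name `example.json`, and its contents are hypothetical; the actual file names and JSON schema written by `ragbio` are not documented here.

```python
import json
import tempfile
from pathlib import Path


def list_json_outputs(query_name, base_dir="output"):
    """Return parsed JSON files under <base_dir>/<query_name>/, keyed by file name."""
    folder = Path(base_dir) / query_name
    results = {}
    if not folder.is_dir():
        return results  # nothing has been written for this query_name yet
    for path in sorted(folder.glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            results[path.name] = json.load(handle)
    return results


# Demonstration against a temporary directory, so no real outputs are touched.
with tempfile.TemporaryDirectory() as tmp:
    case_dir = Path(tmp) / "Alzheimer_CaseStudy"
    case_dir.mkdir()
    # "example.json" and its contents are placeholders, not real pipeline output.
    (case_dir / "example.json").write_text(json.dumps({"pmids": []}), encoding="utf-8")
    found = list_json_outputs("Alzheimer_CaseStudy", base_dir=tmp)
    missing = list_json_outputs("Unknown_CaseStudy", base_dir=tmp)

print(sorted(found))
```

Keeping one folder per `query_name` means several case studies can be loaded and compared independently with the same helper.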

---

### 3. Visualize Gene–Disease–Drug Networks

You can visualize the structured outputs in Cytoscape via Streamlit:

```bash
streamlit run ragbio/pipeline/rag_cytoscape_streamlit.py -- --query_name Alzheimer_CaseStudy
```

This reads all JSON files from `output/<query_name>/` and plots the gene–target–drug–disease network.

#### Example Output Network

![RAG Network Graph](ragbio/images/network_graph.png)

*Figure 1: Gene–disease–drug co-occurrence network generated from top PubMed abstracts.*
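
A co-occurrence network like the one in Figure 1 can be derived by counting how often entities appear together in the same record. The sketch below is one plausible way to do that, not the package's own code. The record keys (`genes`, `diseases`, `drugs`) and the entity names are hypothetical placeholders, since the structured output schema is not specified here.

```python
from collections import Counter
from itertools import product


def cooccurrence_edges(records):
    """Count gene-disease and gene-drug pairs that appear in the same record."""
    edges = Counter()
    for record in records:
        genes = record.get("genes", [])
        # Each gene is linked to every disease and drug mentioned alongside it.
        for pair in product(genes, record.get("diseases", [])):
            edges[pair] += 1
        for pair in product(genes, record.get("drugs", [])):
            edges[pair] += 1
    return edges


# Placeholder records standing in for per-abstract structured outputs.
records = [
    {"genes": ["GENE_A"], "diseases": ["DISEASE_X"], "drugs": []},
    {"genes": ["GENE_A", "GENE_B"], "diseases": ["DISEASE_X"], "drugs": ["DRUG_Y"]},
]

edges = cooccurrence_edges(records)
for (source, target), weight in sorted(edges.items()):
    print(source, target, weight)
```

The resulting edge weights could then be passed to a graph viewer, with heavier edges marking pairs supported by more abstracts.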

---

### 4. Optional: Explore in Notebook

Open `notebooks/RAG_GeneDiscovery_Assistant.ipynb` to see example queries, visualizations, and outputs.

---

## Technologies Used

| Category     | Tool                                        |
| ------------ | ------------------------------------------- |
| Embeddings   | BioBERT, BioSentVec (Hugging Face)          |
| LLM Backend  | DeepSeek / LLaMA3 (Ollama)                  |
| Retrieval    | FAISS                                       |
| Data Sources | PubMed, UniProt, NCBI Gene                  |
| Language     | Python 3.9+                                 |
| Frameworks   | LangChain, Sentence Transformers, Streamlit |

---

## Future Enhancements

* Compare **DeepSeek/LLaMA3** with **BioGPT** outputs.
* Integrate **Neo4j** for gene–disease–drug knowledge graph visualization.
* Fine-tune LLMs on curated variant interpretation reports for improved clinical relevance.
* Extend package API for **direct integration in Django, FastAPI, and Streamlit apps**.
* Add support for **multiple `query_name` outputs** to track different case studies.
