Metadata-Version: 2.4
Name: neuromind
Version: 0.1.0
Summary: Neuromind RAG research library: download PMC papers, index into ChromaDB, and query with multi-level RAG
License: MIT License
        
        Copyright (c) 2025 Ziyuan Huang
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Repository, https://github.com/melhzy/alzheimers
Project-URL: Bug Tracker, https://github.com/melhzy/alzheimers/issues
Keywords: alzheimer,RAG,retrieval-augmented-generation,biomedical,NLP,LLM,ChromaDB,pubmed,gut-microbiome
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.28.0
Requires-Dist: chromadb>=0.4.0
Requires-Dist: pubmed-stream>=0.1.0
Provides-Extra: llm
Requires-Dist: ollama>=0.1.0; extra == "llm"
Provides-Extra: embeddings
Requires-Dist: sentence-transformers>=2.2.0; extra == "embeddings"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Provides-Extra: all
Requires-Dist: ollama>=0.1.0; extra == "all"
Requires-Dist: sentence-transformers>=2.2.0; extra == "all"
Dynamic: license-file

# neuromind

An umbrella Python library for **biomedical RAG research**.

## Sub-packages

| Package | Description |
|---------|-------------|
| `neuromind.alzheimers` | Alzheimer's disease RAG pipeline |

---

## neuromind.alzheimers

`neuromind.alzheimers` combines PubMed Central paper downloading, ChromaDB
vector indexing, and a multi-level Retrieval-Augmented Generation (RAG)
pipeline.

Built on:
- **[pubmed-stream](https://pypi.org/project/pubmed-stream/)** — concurrent PMC article downloads via NCBI E-utilities
- **[ChromaDB](https://www.trychroma.com/)** — local persistent vector store
- **[Ollama](https://ollama.com/)** — local LLM inference (128K context)

Inspired by the `RAG_Comparison_Demo.ipynb` notebook, which demonstrates six
evidence-depth conditions from *No RAG* through *Comprehensive RAG* on
Alzheimer's disease / gut-microbiome research questions.

---

## Installation

```bash
# Core library
pip install .

# With Ollama Python client
pip install ".[llm]"

# With biomedical sentence-transformers embeddings
pip install ".[embeddings]"

# Everything
pip install ".[all]"
```

### Requirements

- Python ≥ 3.8
- **Runtime**: `requests`, `chromadb`, `pubmed-stream`
- **Optional**: `ollama` (Python package), `sentence-transformers`
- **LLM server**: [Ollama](https://ollama.com/) running locally

---

## Quick Start

### Python API

```python
from neuromind.alzheimers import AlzheimersRAG, RAGMode
# or, via top-level re-exports:
from neuromind import AlzheimersRAG, RAGMode

rag = AlzheimersRAG(
    db_path="./neuromind_db",
    model="llama3.2",          # any model available in Ollama
)

# 1. Download papers from PubMed Central
rag.download("Alzheimer's disease gut microbiome", max_results=50)

# 2. Index into ChromaDB
stats = rag.index()
print(stats)  # IndexStats(indexed=48, chunks=1_024, ...)

# 3. Query with RAG (default: DETAILED — 15 papers, ~70K context tokens)
result = rag.query(
    "What is the evidence for gut microbiome dysbiosis in Alzheimer's disease?",
    mode="detailed",
)
print(result.text)
print(result.summary())   # [detailed] 612 words | 15 sources | 68,402 ctx tokens | 42.3s
```

### CLI

```bash
# Download papers
alzheimers download "Alzheimer's disease gut microbiome" --max-results 100

# Download the full built-in AD corpus (15 topics)
alzheimers download --corpus --max-results 100

# Index into ChromaDB
alzheimers index ./publications --db-path ./neuromind_db

# Query
alzheimers query "What is the role of tau in Alzheimer's?" --mode detailed

# Compare all RAG modes on the same question
alzheimers compare "What therapeutic strategies target the gut-brain axis in AD?"

# python -m neuromind.alzheimers also works
python -m neuromind.alzheimers query "..." --model llama3.2
```

---

## RAG Modes

| Mode | Papers | ~Context tokens | Use case |
|------|-------:|----------------:|----------|
| `no_rag` | 0 | 0 | Baseline / hallucination check |
| `retrieval_only` | 5 | 0 | Inspect raw evidence |
| `basic` | 5 | ~20 K | Fast, grounded facts |
| `standard` | 10 | ~40 K | Daily clinical queries |
| `detailed` ⭐ | 15 | ~70 K | **Recommended default** |
| `comprehensive` | 25 | ~100 K | Research-grade depth |

---

## Architecture

```
User Query
    │
    ├─► [No RAG]          LLM only (training knowledge)
    │
    └─► [RAG modes]       pubmed-stream → ChromaDB → LLM
                               │               │
                         NCBI E-utilities   MedEmbed /
                         (ESearch + EFetch) MiniLM-L6-v2
```

**Data flow for RAG modes:**

```
query  →  Retriever.retrieve(top_k)
              │
              ▼
        ChromaDB (cosine similarity)
              │
              ▼
        build_context (token budget)
              │
              ▼
        OllamaLLM.generate(prompt)
              │
              ▼
        RAGResult (.text, .sources, .summary())
```
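
The `build_context (token budget)` step can be pictured as greedy packing of ranked chunks. What follows is a minimal sketch under stated assumptions, not the library's implementation: the function signature, the 4-characters-per-token estimate, and the `[source] text` layout are all hypothetical.

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    """Rough estimate, assuming about 4 characters per token."""
    return max(1, len(text) // 4)


def build_context(
    chunks: List[Tuple[str, str]], token_budget: int
) -> Tuple[str, List[str]]:
    """Pack (source_id, text) chunks, already ranked by similarity,
    into one context string without exceeding `token_budget`."""
    parts: List[str] = []
    sources: List[str] = []
    used = 0
    for source_id, text in chunks:
        cost = estimate_tokens(text)
        if used + cost > token_budget:
            break  # stop at the first chunk that does not fit, keeping rank order
        parts.append(f"[{source_id}] {text}")
        if source_id not in sources:
            sources.append(source_id)
        used += cost
    return "\n\n".join(parts), sources
```

Stopping at the first chunk that does not fit keeps the highest-ranked evidence contiguous; a variant could skip oversized chunks and keep scanning instead.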

---

## Python API Reference

### `AlzheimersRAG`

```python
rag = AlzheimersRAG(
    db_path="./neuromind_db",   # ChromaDB location
    collection_name="publications",
    model="llama3.2",            # Ollama model tag
    ollama_host="http://localhost:11434",
    num_ctx=131_072,             # LLM context window
    embedding_fn=None,           # custom ChromaDB embedding function
)
```

| Method | Description |
|--------|-------------|
| `rag.download(keyword, max_results, ...)` | Download PMC articles |
| `rag.download_corpus(topics, max_results_per_topic, ...)` | Download full AD corpus |
| `rag.index(papers_dir, chunk_size, chunk_overlap)` | Index into ChromaDB |
| `rag.query(question, mode, max_response_tokens, temperature)` | RAG query |
| `rag.compare_modes(question, modes, ...)` | Run all modes, return dict |
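
The `chunk_size` and `chunk_overlap` arguments of `rag.index` suggest sliding-window chunking. The sketch below shows one common way to do this, assuming character-based sizes; `chunk_text` and its default values are hypothetical, and the library's own chunker may count tokens or split on sentence boundaries instead.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split `text` into windows of `chunk_size` characters, where each
    window shares `chunk_overlap` characters with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

Overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of indexing some text twice.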

### Low-level building blocks

```python
from neuromind.alzheimers import download_papers, download_ad_corpus
from neuromind.alzheimers import index_directory
from neuromind.alzheimers.retriever import Retriever
from neuromind.alzheimers.llm import OllamaLLM
from neuromind.alzheimers.rag import RAGPipeline
from neuromind.alzheimers.types import RAGMode, RAGResult, IndexStats
```

---

## Custom Embeddings (e.g. MedEmbed)

To use the biomedical MedEmbed embeddings from the original notebook:

```python
from medembed_embedder import LangchainMedEmbedEmbeddings
from neuromind.alzheimers import AlzheimersRAG


class MedEmbedFn:
    """Wrap MedEmbed as a ChromaDB embedding function."""

    def __init__(self):
        self._model = LangchainMedEmbedEmbeddings(device="cpu")

    def __call__(self, input):          # ChromaDB EmbeddingFunction protocol
        # Return one embedding vector per input document.
        return self._model.embed_documents(input)


rag = AlzheimersRAG(db_path="./neuromind_db", embedding_fn=MedEmbedFn())
```

---

## Vector Database Path

Point `AlzheimersRAG` at your ChromaDB directory via the `NEUROMIND_DB_PATH`
environment variable instead of passing `db_path` every time:

```bash
export NEUROMIND_DB_PATH="/path/to/your/knowledge_base"
```

```python
from neuromind.alzheimers import AlzheimersRAG
rag = AlzheimersRAG()          # picks up NEUROMIND_DB_PATH automatically
rag = AlzheimersRAG(db_path="/explicit/override")  # or pass it directly
```

If the variable is not set, the default is `./neuromind_db`.
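
The lookup order described here (explicit argument, then environment variable, then default) can be sketched as follows. `resolve_db_path` and `DEFAULT_DB_PATH` are hypothetical names used for illustration; the library may structure this differently internally.

```python
import os
from typing import Optional

DEFAULT_DB_PATH = "./neuromind_db"


def resolve_db_path(db_path: Optional[str] = None) -> str:
    """Pick the ChromaDB directory: explicit argument first, then
    NEUROMIND_DB_PATH, then the documented default."""
    if db_path:
        return db_path
    # An empty variable is treated like an unset one (an assumption).
    return os.environ.get("NEUROMIND_DB_PATH") or DEFAULT_DB_PATH
```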

---

## NCBI API Key

An NCBI API key raises the rate limit from 3 to 10 requests/second,
so request-bound downloads can run up to ~3× faster.

```bash
export NCBI_API_KEY="your_key_here"
export NCBI_EMAIL="you@example.com"
```

Get a free key at <https://www.ncbi.nlm.nih.gov/account/>.
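
To make the rate-limit arithmetic concrete, the sketch below spaces requests by the minimum interval each limit allows (1/3 s without a key, 1/10 s with one). It is illustrative only: `min_request_interval` and `Throttle` are hypothetical helpers, and pubmed-stream presumably does its own throttling, so this code is not needed to use the library.

```python
import time
from typing import Callable, Optional


def min_request_interval(api_key: Optional[str] = None) -> float:
    """Seconds between requests: 10 req/s with a key, 3 req/s without."""
    return 1 / 10 if api_key else 1 / 3


class Throttle:
    """Block so that successive calls to wait() are at least `interval` apart."""

    def __init__(
        self,
        interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._interval = interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0  # earliest time the next request may start

    def wait(self) -> None:
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._interval
```

A caller would create `Throttle(min_request_interval(os.environ.get("NCBI_API_KEY")))` (after `import os`) and call `wait()` before each request.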

---

## Project Structure

```
neuromind-project/
├── neuromind/
│   ├── __init__.py           # umbrella re-exports
│   └── alzheimers/
│       ├── __init__.py       # AlzheimersRAG façade + public API
│       ├── __main__.py       # python -m neuromind.alzheimers
│       ├── cli.py            # CLI (download / index / query / compare)
│       ├── downloader.py     # pubmed-stream wrapper + AD_SEARCH_TERMS
│       ├── indexer.py        # text chunking + ChromaDB upsert
│       ├── retriever.py      # ChromaDB semantic retrieval
│       ├── llm.py            # Ollama Python client / HTTP fallback
│       ├── rag.py            # RAGPipeline (all 6 modes)
│       └── types.py          # RAGMode, RAGResult, IndexStats, …
├── examples/
│   └── basic_usage.py
├── RAG_Comparison_Demo.ipynb
├── pyproject.toml
└── README.md
```

---

## License

MIT — see [LICENSE](LICENSE).
