Metadata-Version: 2.3
Name: ragtoolbox
Version: 0.1.0
Summary: Modular toolkit for building Retrieval-Augmented Generation (RAG) pipelines.
License: MIT
Author: Nick-Nunley
Author-email: nicknunley17@ucla.edu
Requires-Python: >=3.10,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Provides-Extra: chromadb
Provides-Extra: ncbi
Provides-Extra: openai
Provides-Extra: transformers
Requires-Dist: beautifulsoup4 (>=4.12.0,<5.0.0)
Requires-Dist: biopython (>=1.8.0) ; extra == "ncbi"
Requires-Dist: chromadb (>=1.0.15) ; extra == "chromadb"
Requires-Dist: fastembed (>=0.7.1,<0.8.0)
Requires-Dist: html2text (>=2025.4.15,<2026.0.0)
Requires-Dist: huggingface_hub (>=0.33.5,<0.34.0)
Requires-Dist: nltk (>=3.9.1,<4.0.0)
Requires-Dist: numpy (>=1.26,<2.3) ; python_version < "3.11"
Requires-Dist: numpy (>=2.3,<3.0) ; python_version >= "3.11"
Requires-Dist: openai (>=1.95.1) ; extra == "openai"
Requires-Dist: pdfplumber (>=0.11.7,<0.12.0)
Requires-Dist: pyyaml (>=6.0.2,<7.0.0)
Requires-Dist: readability-lxml (>=0.8.4.1,<0.9.0.0)
Requires-Dist: requests (>=2.32.4,<3.0.0)
Requires-Dist: torch (>=2.7.1,<3.0.0) ; extra == "transformers"
Requires-Dist: transformers (>=4.30.0,<5.0.0) ; extra == "transformers"
Description-Content-Type: text/markdown

# RAGToolBox

[![CI](https://github.com/Nick-Nunley/RAGToolBox/actions/workflows/CI.yml/badge.svg)](https://github.com/Nick-Nunley/RAGToolBox/actions/workflows/CI.yml)
[![codecov](https://codecov.io/github/Nick-Nunley/RAGToolBox/graph/badge.svg?token=MUXZWZMZV0)](https://codecov.io/github/Nick-Nunley/RAGToolBox)

**RAGToolBox** is a modular, extensible Python package for building Retrieval-Augmented Generation (RAG) pipelines. It provides end-to-end components for:

* **Loading** content (local files, web pages, PubMed/PMC articles)
* **Chunking** text (paragraphs, sentences, sliding windows, section‑aware, hierarchical)
* **Storing** embeddings (SQLite and optional ChromaVectorStore backends)
* **Indexing** content with parallel embedding and metadata parsing
* **Retrieval** *via* FastEmbed or (optional) openai embeddings and similarity search
* **Augmentation** using customizable prompts and LLMs (local or Hugging Face)

---

## Table of Contents

1. [Installation](#installation)
2. [Quickstart](#quickstart)
    * [Loading Documents](#loading-documents)
    * [Indexing Pipeline](#indexing-pipeline)
    * [Retrieval & Augmentation](#retrieval--augmentation)
3. [CLI Usage](#cli-usage)
4. [Configuration](#configuration)
5. [Testing](#testing)
6. [Contributing](#contributing)
7. [License](#license)

---

## Installation

Install the latest release from PyPI:

```bash
pip install ragtoolbox

# with optional extras
pip install "ragtoolbox[transformers,chromadb,openai,ncbi]"
```

Or install from source:

```bash
git clone https://github.com/Nick-Nunley/RAGToolBox.git
cd RAGToolBox
pip install .
```

### Prerequisites

* Python 3.10+
* Core dependencies (see also `pyproject.toml`):
    * numpy
    * requests
    * pyyaml
    * pdfplumber
    * html2text
    * pytest
    * beautifulsoup4
    * readability-lxml
    * nltk
    * fastembed
    * huggingface_hub
* Optional dependencies (for extended capabilities):
    * openai
    * chromadb
    * torch
    * transformers
    * biopython

---

## Quickstart

### Loading Documents

Use the built‑in factory to detect and process various formats:

```python
from RAGToolBox.loader import BaseLoader

# Raw bytes fetched externally (e.g., via requests or open file)
raw_bytes = open("example.pdf", "rb").read()
LoaderClass = BaseLoader.detect_loader("example.pdf", raw_bytes)
loader = LoaderClass("example.pdf", "assets/kb")
loader.raw_content = raw_bytes
loader.process()
```

### Indexing Pipeline

Chunk, embed, and store your KB in one pipeline:

```python
from pathlib import Path
from RAGToolBox.chunk import SectionAwareChunker, SlidingWindowChunker, HierarchicalChunker
from RAGToolBox.index import Indexer, IndexerConfig

chunker = HierarchicalChunker([
    SectionAwareChunker(max_chunk_size=1000, overlap=200),
    SlidingWindowChunker(window_size=1000, overlap=200)
])
indexer = Indexer(
    chunker=chunker,
    embedding_model="fastembed",
    config = IndexerConfig(
        vector_store_backend="sqlite",
        vector_store_config={"db_path": Path("assets/kb/embeddings/embeddings.db")}
    )
)
indexer.index(chunked_results)
```

### Retrieval & Augmentation

Retrieve relevant chunks and generate an LLM response:

```python
from pathlib import Path
from RAGToolBox.retriever import Retriever, RetrievalConfig
from RAGToolBox.augmenter import Augmenter, GenerationConfig

# Example query
user_query = "What is RAG?"

# Initialize retriever
retriever = Retriever(
    embedding_model="fastembed",
    db_path=Path("assets/kb/embeddings/embeddings.db")
)

# Perform retrieval
contexts = retriever.retrieve(user_query, RetrievalConfig(top_k=5))

# Initialize augmenter
augmenter = Augmenter(
    model_name="google/gemma-2-2b-it",
    prompt_type="default",
    api_key="${HUGGINGFACE_API_KEY}",
    use_local=False
)

# Generate a response with sources
result = augmenter.generate_response_with_sources(
    query=user_query,
    retrieved_chunks=contexts,
    gen_config=GenerationConfig(temperature=0.25, max_new_tokens=200)
)
print(result["response"])
```

---

## CLI Usage

Each module exposes a CLI entrypoint. Use `-h` for details:

```bash
# Load documents from URLs or files
python -m RAGToolBox.loader https://example.com/article.pdf --output-dir assets/kb

# Build index (chunk + embed)
python -m RAGToolBox.index --kb-dir assets/kb --embedding-model fastembed --vector-store sqlite

# Retrieve top-10 chunks\ npython -m RAGToolBox.retriever --query "Explain RAG" --embedding-model openai

# Augment with LLM
python -m RAGToolBox.augmenter "What is RAG?" --sources

# For a concise one-liner answer
python -m RAGToolBox.augmenter "What is RAG?" --prompt-type concise

# Interactively chat with your knowledgebase
python -m RAGToolBox.augmenter --chat
```

---

## Testing

Run the full test suite:

```bash
bash tests/Run_tests.sh
```

Continuous integration is configured *via* GitHub Actions (see `.github/workflows/ci.yml`).

---

## Note

**RAGToolBox** is in active development. The current release (v0.1.0) is an unstable preview. APIs and behavior may change, and some features may not yet be fully tested.

---

## Contributing

1. Fork the repo
2. Create a feature branch: `git checkout -b feature/your-feature`
3. Commit changes: `git commit -m "Add new feature"`
4. Push to your fork and open a PR

Please follow the [Contributing Guidelines](CONTRIBUTING.md).

---

## License

This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.

