Metadata-Version: 2.4
Name: ragbandit-core
Version: 0.1.1
Summary: Core utilities for document processing, RAG configuration, querying, and evaluation.
Author-email: Martim Chaves <martim@ragbandit.com>
License: MIT
Project-URL: Homepage, https://github.com/MartimChaves/ragbandit-core
Project-URL: Documentation, https://github.com/MartimChaves/ragbandit-core#readme
Project-URL: Source, https://github.com/MartimChaves/ragbandit-core
Project-URL: Issues, https://github.com/MartimChaves/ragbandit-core/issues
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE.md
Requires-Dist: pydantic>=2.11.7
Requires-Dist: llama-index>=0.12.52
Requires-Dist: mistralai>=1.7.0
Requires-Dist: ragas>=0.3.0
Requires-Dist: cryptography>=44.0.2
Dynamic: license-file

# ragbandit-core

Core utilities for:

* Document ingestion & processing (OCR, chunking, embedding)
* Building and running Retrieval-Augmented Generation (RAG) pipelines
* Evaluating answers with automated metrics

## Quick start

```bash
pip install ragbandit-core
```

```python
from ragbandit.documents import (
    DocumentPipeline,
    ReferencesProcessor,
    FootnoteProcessor,
    MistralOCRDocument,
    MistralEmbedder,
    SemanticChunker
)
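import os
import logging

# python-dotenv is not a declared dependency of ragbandit-core; install it separately
from dotenv import load_dotenv

# Load MISTRAL_API_KEY (and any other settings) from a local .env file
load_dotenv()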

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
)

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")

# Replace [document_name] with the name of your PDF
file_path = "./data/raw/[document_name].pdf"

doc_pipeline = DocumentPipeline(
    chunker=SemanticChunker(min_chunk_size=500, api_key=MISTRAL_API_KEY),
    embedder=MistralEmbedder(model="mistral-embed", api_key=MISTRAL_API_KEY),  # noqa
    ocr_processor=MistralOCRDocument(api_key=MISTRAL_API_KEY),
    processors=[
        ReferencesProcessor(api_key=MISTRAL_API_KEY),
        FootnoteProcessor(api_key=MISTRAL_API_KEY),
    ],
)

# Run OCR, processors, chunking, and embedding in a single call
extended_response = doc_pipeline.process(file_path)
```
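
`os.getenv` returns `None` when a variable is unset, so a missing key in the quick start would only surface later, when the first API call fails. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper written for illustration and is not part of ragbandit-core.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset/empty.

    Hypothetical helper for illustration; it is not part of ragbandit-core.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it or add it to your .env file"
        )
    return value
```

Calling `MISTRAL_API_KEY = require_env("MISTRAL_API_KEY")` right after `load_dotenv()` would then stop the script before any pipeline component is built with an empty key.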

### Running Steps Manually

For more control, you can run each pipeline step independently:

```python
from ragbandit.documents import (
    DocumentPipeline,
    ReferencesProcessor,
    MistralOCRDocument,
    MistralEmbedder,
    SemanticChunker
)
import os
from dotenv import load_dotenv
load_dotenv()

MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")
file_path = "./data/raw/[document_name].pdf"

# Create pipeline with only the components you need
pipeline = DocumentPipeline(
    ocr_processor=MistralOCRDocument(api_key=MISTRAL_API_KEY),
    processors=[ReferencesProcessor(api_key=MISTRAL_API_KEY)],
    chunker=SemanticChunker(min_chunk_size=500, api_key=MISTRAL_API_KEY),
    embedder=MistralEmbedder(model="mistral-embed", api_key=MISTRAL_API_KEY),
)

# Step 1: Run OCR
ocr_result = pipeline.run_ocr(file_path)

# Step 2: Run processors (optional)
processing_results = pipeline.run_processors(ocr_result)
final_doc = processing_results[-1]  # Get the last processor's output

# Step 3: Chunk the document
chunk_result = pipeline.run_chunker(final_doc)

# Step 4: Embed chunks
embedding_result = pipeline.run_embedder(chunk_result)
```
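
The four steps above can be wrapped in a single function that keeps every intermediate result. This is only a sketch under stated assumptions: `run_steps` is a hypothetical name that is not part of ragbandit-core, it calls only the four `run_*` methods shown above, and it assumes that `run_processors` returns a list and that `run_chunker` accepts the OCR result directly when no processor produced output.

```python
from typing import Any


def run_steps(pipeline: Any, file_path: str) -> dict[str, Any]:
    """Run OCR, processors, chunking, and embedding, keeping each result.

    Hypothetical helper; `pipeline` is expected to expose the run_ocr,
    run_processors, run_chunker, and run_embedder methods used above.
    """
    ocr_result = pipeline.run_ocr(file_path)

    # Processors are optional: fall back to the raw OCR result if none ran.
    # (Assumes run_processors returns a list, as in the example above.)
    processing_results = pipeline.run_processors(ocr_result)
    final_doc = processing_results[-1] if processing_results else ocr_result

    chunk_result = pipeline.run_chunker(final_doc)
    embedding_result = pipeline.run_embedder(chunk_result)

    return {
        "ocr": ocr_result,
        "processed": final_doc,
        "chunks": chunk_result,
        "embeddings": embedding_result,
    }
```

Returning the intermediate results in a dictionary makes it easy to inspect or store the output of one step without rerunning the earlier ones.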

You can also create separate pipelines for different steps:

```python
# Reuses the imports, MISTRAL_API_KEY, and file_path from the previous example

# OCR-only pipeline
ocr_pipeline = DocumentPipeline(
    ocr_processor=MistralOCRDocument(api_key=MISTRAL_API_KEY)
)
ocr_result = ocr_pipeline.run_ocr(file_path)

# Later, chunk with a different pipeline
chunk_pipeline = DocumentPipeline(
    chunker=SemanticChunker(min_chunk_size=500, api_key=MISTRAL_API_KEY)
)
chunks = chunk_pipeline.run_chunker(ocr_result)
```

## Package layout

```
ragbandit-core/
├── src/ragbandit/
│   ├── documents/   # document ingestion, OCR, chunking, embedding
└── tests/
```

## License

MIT
