Metadata-Version: 2.4
Name: krira-augment
Version: 2.1.0
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Rust
Classifier: Topic :: Text Processing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Dist: openpyxl>=3.0 ; extra == 'xlsx'
Requires-Dist: pdfplumber>=0.10 ; extra == 'pdf'
Requires-Dist: python-docx>=0.8 ; extra == 'docx'
Requires-Dist: polars>=0.20 ; extra == 'csv'
Requires-Dist: openpyxl>=3.0 ; extra == 'all'
Requires-Dist: pdfplumber>=0.10 ; extra == 'all'
Requires-Dist: python-docx>=0.8 ; extra == 'all'
Requires-Dist: polars>=0.20 ; extra == 'all'
Requires-Dist: pytest>=7.0 ; extra == 'dev'
Requires-Dist: pytest-cov>=4.0 ; extra == 'dev'
Requires-Dist: black>=23.0 ; extra == 'dev'
Requires-Dist: mypy>=1.0 ; extra == 'dev'
Requires-Dist: ruff>=0.1 ; extra == 'dev'
Provides-Extra: xlsx
Provides-Extra: pdf
Provides-Extra: docx
Provides-Extra: csv
Provides-Extra: all
Provides-Extra: dev
License-File: LICENSE
Summary: Production-grade document chunking library for RAG systems - Rust-powered Python library
Keywords: rag,chunking,nlp,document-processing,ai,rust,pyo3
Author-email: Krira Labs <contact@kriralabs.com>
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/Krira-Labs/krira-chunker
Project-URL: Repository, https://github.com/Krira-Labs/krira-chunker
Project-URL: Documentation, https://github.com/Krira-Labs/krira-chunker#readme
Project-URL: Issues, https://github.com/Krira-Labs/krira-chunker/issues

# Krira Augment

**High-Performance Rust Chunking Engine for RAG Pipelines**

[![PyPI version](https://badge.fury.io/py/krira-augment.svg)](https://badge.fury.io/py/krira-augment)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Rust](https://img.shields.io/badge/Built_with-Rust-orange)](https://www.rust-lang.org/)

Process a gigabyte of text in about 12 seconds: roughly **40x faster than LangChain** with **O(1) memory usage**.

---

## Performance Benchmarks

| Dataset Size | LangChain/Pandas | Krira (Rust) | Speedup |
|--------------|------------------|--------------|---------|
| 100 MB | ~45 sec | 0.8 sec | 56x |
| 1 GB | ~8.0 min | 12.0 sec | 40x |
| 10 GB | Timeout / OOM | 2.1 min | n/a (baseline did not finish) |

**Memory stays constant (O(1)) regardless of file size.**
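
The constant-memory claim follows from processing the input in fixed-size pieces instead of loading the whole file. The sketch below illustrates that idea in pure Python; it is not Krira's Rust implementation. The function name `stream_chunks`, the `read_size` parameter, and the character-based reading of `chunk_size` and `chunk_overlap` are assumptions made for illustration (the library may count tokens or split on other boundaries).

```python
import io
from typing import Iterator, TextIO


def stream_chunks(
    reader: TextIO,
    chunk_size: int = 512,
    chunk_overlap: int = 50,
    read_size: int = 64 * 1024,
) -> Iterator[str]:
    """Yield overlapping character chunks while holding only a small buffer."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # How far the window advances per chunk
    buffer = ""
    emitted = False

    while True:
        block = reader.read(read_size)
        if not block:
            break
        buffer += block
        # Emit every full chunk the buffer holds, then drop the consumed text
        while len(buffer) >= chunk_size:
            yield buffer[:chunk_size]
            emitted = True
            buffer = buffer[step:]

    # Emit the tail unless it is only overlap already covered by the last chunk
    if buffer and (not emitted or len(buffer) > chunk_overlap):
        yield buffer


sample = io.StringIO("abcdefghijk")
print(list(stream_chunks(sample, chunk_size=4, chunk_overlap=1, read_size=3)))
# ['abcd', 'defg', 'ghij', 'jk']
```

The buffer never holds more than `chunk_size + read_size` characters, so peak memory in this sketch does not depend on the input size.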

---

## Installation

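```bash
pip install krira-augment

# Optional extras for specific formats: pdf, docx, xlsx, csv, or all
pip install "krira-augment[all]"
```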

---

## Complete Example: OpenAI + Pinecone

```python
from krira_augment import Pipeline, PipelineConfig
import json
from openai import OpenAI
from pinecone import Pinecone

# API Keys
OPENAI_API_KEY = "sk-..."        # https://platform.openai.com/api-keys
PINECONE_API_KEY = "pcone-..."   # https://app.pinecone.io/
PINECONE_INDEX_NAME = "my-rag"

# Step 1: Chunk the file
config = PipelineConfig(chunk_size=512, chunk_overlap=50)
pipeline = Pipeline(config=config)
result = pipeline.process("sample.csv", output_path="chunks.jsonl")

print(f"Chunks Created: {result.chunks_created}")
print(f"Execution Time: {result.execution_time:.2f}s")
print(f"Throughput: {result.mb_per_second:.2f} MB/s")
print(f"Preview: {result.preview_chunks[:3]}")

# Step 2: Embed and store (uses the openai>=1.0 and Pinecone client v3+ APIs)
openai_client = OpenAI(api_key=OPENAI_API_KEY)
pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index(PINECONE_INDEX_NAME)

with open("chunks.jsonl", "r") as f:
    for line_num, line in enumerate(f, 1):
        chunk = json.loads(line)
        
        response = openai_client.embeddings.create(
            input=chunk["text"],
            model="text-embedding-3-small"
        )
        embedding = response.data[0].embedding
        
        index.upsert(vectors=[(f"chunk_{line_num}", embedding, chunk.get("metadata", {}))])
        
        if line_num % 100 == 0:
            print(f"Processed {line_num} chunks...")

print("Done! All chunks embedded and stored in Pinecone.")
```
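
The example above sends one embedding request and one upsert per chunk. Both the OpenAI embeddings endpoint and Pinecone's `upsert` accept lists, so grouping chunks reduces the number of round trips. The sketch below shows only the grouping step; the helper name `batched` is hypothetical (Python 3.12 ships `itertools.batched`, but this package supports 3.10+), and a suitable batch size depends on your provider's limits.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

Because `batched` consumes its input lazily, it works on an open `chunks.jsonl` file without reading the whole file into memory.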

---

## Other Integrations

After running **Step 1** (chunking), replace **Step 2** with any of these integrations:

### OpenAI + Qdrant

```python
import json
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

openai_client = OpenAI(api_key="sk-...")  # openai>=1.0 client
qdrant = QdrantClient(url="https://xyz.qdrant.io", api_key="qdrant-...")

with open("chunks.jsonl", "r") as f:
    for line_num, line in enumerate(f, 1):
        chunk = json.loads(line)
        response = openai_client.embeddings.create(input=chunk["text"], model="text-embedding-3-small")
        embedding = response.data[0].embedding
        qdrant.upsert(collection_name="my-chunks", points=[PointStruct(id=line_num, vector=embedding, payload=chunk.get("metadata", {}))])
        
        if line_num % 100 == 0:
            print(f"Processed {line_num} chunks...")
```

### OpenAI + Weaviate

```python
import json
from openai import OpenAI
import weaviate

openai_client = OpenAI(api_key="sk-...")  # openai>=1.0 client
# weaviate.Client is the weaviate-client v3 API
client = weaviate.Client(
    url="https://xyz.weaviate.network",
    auth_client_secret=weaviate.AuthApiKey(api_key="weaviate-...")
)

with open("chunks.jsonl", "r") as f:
    for line_num, line in enumerate(f, 1):
        chunk = json.loads(line)
        response = openai_client.embeddings.create(input=chunk["text"], model="text-embedding-3-small")
        embedding = response.data[0].embedding
        client.data_object.create(
            data_object={"text": chunk["text"], "metadata": chunk.get("metadata", {})},
            class_name="Chunk",
            vector=embedding
        )
        
        if line_num % 100 == 0:
            print(f"Processed {line_num} chunks...")
```

### Cohere + Pinecone

```python
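import json
import cohere
from pinecone import Pinecone

co = cohere.Client("co-...")
pc = Pinecone(api_key="pcone-...")  # Pinecone client v3+
index = pc.Index("my-rag")

with open("chunks.jsonl", "r") as f:
    for line_num, line in enumerate(f, 1):
        chunk = json.loads(line)
        # Embed v3 models require input_type; "search_document" marks text stored for retrieval
        response = co.embed(texts=[chunk["text"]], model="embed-english-v3.0", input_type="search_document")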
        embedding = response.embeddings[0]
        index.upsert(vectors=[(f"chunk_{line_num}", embedding, chunk.get("metadata", {}))])
        
        if line_num % 100 == 0:
            print(f"Processed {line_num} chunks...")
```

### Cohere + Qdrant

```python
import json
import cohere
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

co = cohere.Client("co-...")
qdrant = QdrantClient(url="https://xyz.qdrant.io", api_key="qdrant-...")

with open("chunks.jsonl", "r") as f:
    for line_num, line in enumerate(f, 1):
        chunk = json.loads(line)
        # Embed v3 models require input_type; "search_document" marks text stored for retrieval
        response = co.embed(texts=[chunk["text"]], model="embed-english-v3.0", input_type="search_document")
        embedding = response.embeddings[0]
        qdrant.upsert(
            collection_name="my-chunks",
            points=[PointStruct(id=line_num, vector=embedding, payload=chunk.get("metadata", {}))]
        )
        
        if line_num % 100 == 0:
            print(f"Processed {line_num} chunks...")
```

### Local (Sentence Transformers) + ChromaDB (FREE)

```python
from sentence_transformers import SentenceTransformer
import chromadb
import json

model = SentenceTransformer('all-MiniLM-L6-v2')
client = chromadb.Client()
collection = client.create_collection("my_chunks")

with open("chunks.jsonl", "r") as f:
    for line_num, line in enumerate(f, 1):
        chunk = json.loads(line)
        embedding = model.encode(chunk["text"])
        collection.add(
            ids=[f"chunk_{line_num}"],
            embeddings=[embedding.tolist()],
            metadatas=[chunk.get("metadata", {})],
            documents=[chunk["text"]]
        )
        
        if line_num % 100 == 0:
            print(f"Processed {line_num} chunks...")
```

### Hugging Face + FAISS (FREE)

```python
from transformers import AutoTokenizer, AutoModel
import torch
import faiss
import numpy as np
import json

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
index = faiss.IndexFlatL2(384)

embeddings_list = []
with open("chunks.jsonl", "r") as f:
    for line_num, line in enumerate(f, 1):
        chunk = json.loads(line)
        inputs = tokenizer(chunk["text"], return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
            embedding = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
        embeddings_list.append(embedding)
        
        if line_num % 100 == 0:
            print(f"Processed {line_num} chunks...")

embeddings_array = np.array(embeddings_list).astype('float32')
index.add(embeddings_array)
faiss.write_index(index, "my_vectors.index")
print("Done! Vectors saved to my_vectors.index")
```
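
Every file-based snippet above repeats the same loop over `chunks.jsonl`. As a minimal sketch, that loop can be factored into a generator. The name `iter_chunks` is hypothetical, and the code assumes each line is a JSON object with a `text` key and an optional `metadata` key, as the snippets above do.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Tuple


def iter_chunks(lines: Iterable[str]) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (line_number, chunk) pairs from JSONL lines, skipping blank lines."""
    for line_num, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue
        chunk = json.loads(line)
        chunk.setdefault("metadata", {})  # Same default as chunk.get("metadata", {})
        yield line_num, chunk


# Demo with in-memory lines; in practice pass an open file:
#   with open("chunks.jsonl", "r", encoding="utf-8") as f:
#       for line_num, chunk in iter_chunks(f): ...
sample = ['{"text": "first", "metadata": {"row": 1}}\n', "\n", '{"text": "second"}\n']
for line_num, chunk in iter_chunks(sample):
    print(line_num, chunk["text"], chunk["metadata"])
```

The generator reads one line at a time, so memory use stays flat regardless of how many chunks the file contains.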

---

## Streaming Mode (No Files)

Process chunks as they are produced, without writing an intermediate file to disk. This is the most efficient mode for real-time pipelines.

### Complete Example: OpenAI + Pinecone (Streaming)

```python
from krira_augment import Pipeline, PipelineConfig
from openai import OpenAI
from pinecone import Pinecone

# API Keys
OPENAI_API_KEY = "sk-..."        # https://platform.openai.com/api-keys
PINECONE_API_KEY = "pcone-..."   # https://app.pinecone.io/
PINECONE_INDEX_NAME = "my-rag"

# Initialize (uses the openai>=1.0 and Pinecone client v3+ APIs)
openai_client = OpenAI(api_key=OPENAI_API_KEY)
pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index(PINECONE_INDEX_NAME)

# Configure pipeline
config = PipelineConfig(chunk_size=512, chunk_overlap=50)
pipeline = Pipeline(config=config)

# Stream and embed (no file created)
chunk_count = 0
print("Starting streaming pipeline...")

for chunk in pipeline.process_stream("data.csv"):
    chunk_count += 1
    
    # Embed
    response = openai_client.embeddings.create(
        input=chunk["text"],
        model="text-embedding-3-small"
    )
    embedding = response.data[0].embedding
    
    # Store immediately
    index.upsert(vectors=[(
        f"chunk_{chunk_count}",
        embedding,
        chunk["metadata"]
    )])
    
    # Progress
    if chunk_count % 100 == 0:
        print(f"Processed {chunk_count} chunks...")

print(f"Done! Embedded {chunk_count} chunks. No intermediate file created.")
```
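
Counter-based IDs such as `chunk_{chunk_count}` depend on processing order, so re-running the pipeline with a different `chunk_size` can overwrite unrelated vectors. One alternative, sketched below, derives the ID from the chunk's source and text so that repeated runs are idempotent. The helper names `chunk_id` and `chunk_uuid` are hypothetical; the UUID variant is included because Qdrant point IDs must be unsigned integers or UUIDs.

```python
import hashlib
import uuid


def chunk_id(source: str, text: str) -> str:
    """Return a deterministic string ID derived from the source name and chunk text."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return f"chunk_{digest[:32]}"


def chunk_uuid(source: str, text: str) -> str:
    """Return a deterministic UUID (version 5) for stores that require UUID IDs."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}\n{text}"))


print(chunk_id("data.csv", "hello world"))
print(chunk_uuid("data.csv", "hello world"))
```

Identical text appearing twice in the same source maps to the same ID under this scheme, which deduplicates it; include a position or row number in the hashed string if duplicates must be kept.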

---

## Other Streaming Integrations

Replace the embedding/storage logic with any of these:

### OpenAI + Qdrant (Streaming)

```python
from krira_augment import Pipeline, PipelineConfig
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

# Initialize
openai_client = OpenAI(api_key="sk-...")  # openai>=1.0 client
qdrant = QdrantClient(url="https://xyz.qdrant.io", api_key="qdrant-...")

# Configure and stream
config = PipelineConfig(chunk_size=512, chunk_overlap=50)
pipeline = Pipeline(config=config)

chunk_count = 0
for chunk in pipeline.process_stream("data.csv"):
    chunk_count += 1
    
    # Embed
    response = openai_client.embeddings.create(input=chunk["text"], model="text-embedding-3-small")
    embedding = response.data[0].embedding
    
    # Store
    qdrant.upsert(
        collection_name="my-chunks",
        points=[PointStruct(id=chunk_count, vector=embedding, payload=chunk["metadata"])]
    )
    
    if chunk_count % 100 == 0:
        print(f"Processed {chunk_count} chunks...")

print(f"Done! {chunk_count} chunks embedded.")
```

### OpenAI + Weaviate (Streaming)

```python
from krira_augment import Pipeline, PipelineConfig
from openai import OpenAI
import weaviate

# Initialize
openai_client = OpenAI(api_key="sk-...")  # openai>=1.0 client
# weaviate.Client is the weaviate-client v3 API
client = weaviate.Client(
    url="https://xyz.weaviate.network",
    auth_client_secret=weaviate.AuthApiKey(api_key="weaviate-...")
)

# Configure and stream
config = PipelineConfig(chunk_size=512, chunk_overlap=50)
pipeline = Pipeline(config=config)

chunk_count = 0
for chunk in pipeline.process_stream("data.csv"):
    chunk_count += 1
    
    # Embed
    response = openai_client.embeddings.create(input=chunk["text"], model="text-embedding-3-small")
    embedding = response.data[0].embedding
    
    # Store
    client.data_object.create(
        data_object={"text": chunk["text"], "metadata": chunk["metadata"]},
        class_name="Chunk",
        vector=embedding
    )
    
    if chunk_count % 100 == 0:
        print(f"Processed {chunk_count} chunks...")

print(f"Done! {chunk_count} chunks embedded.")
```

### Cohere + Pinecone (Streaming)

```python
from krira_augment import Pipeline, PipelineConfig
import cohere
from pinecone import Pinecone

# Initialize
co = cohere.Client("co-...")
pc = Pinecone(api_key="pcone-...")  # Pinecone client v3+
index = pc.Index("my-rag")

# Configure and stream
config = PipelineConfig(chunk_size=512, chunk_overlap=50)
pipeline = Pipeline(config=config)

chunk_count = 0
for chunk in pipeline.process_stream("data.csv"):
    chunk_count += 1
    
    # Embed (v3 models require input_type)
    response = co.embed(texts=[chunk["text"]], model="embed-english-v3.0", input_type="search_document")
    embedding = response.embeddings[0]
    
    # Store
    index.upsert(vectors=[(f"chunk_{chunk_count}", embedding, chunk["metadata"])])
    
    if chunk_count % 100 == 0:
        print(f"Processed {chunk_count} chunks...")

print(f"Done! {chunk_count} chunks embedded.")
```

### Cohere + Qdrant (Streaming)

```python
from krira_augment import Pipeline, PipelineConfig
import cohere
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

# Initialize
co = cohere.Client("co-...")
qdrant = QdrantClient(url="https://xyz.qdrant.io", api_key="qdrant-...")

# Configure and stream
config = PipelineConfig(chunk_size=512, chunk_overlap=50)
pipeline = Pipeline(config=config)

chunk_count = 0
for chunk in pipeline.process_stream("data.csv"):
    chunk_count += 1
    
    # Embed
    response = co.embed(texts=[chunk["text"]], model="embed-english-v3.0", input_type="search_document")  # v3 models require input_type
    embedding = response.embeddings[0]
    
    # Store
    qdrant.upsert(
        collection_name="my-chunks",
        points=[PointStruct(id=chunk_count, vector=embedding, payload=chunk["metadata"])]
    )
    
    if chunk_count % 100 == 0:
        print(f"Processed {chunk_count} chunks...")

print(f"Done! {chunk_count} chunks embedded.")
```

### Local (Sentence Transformers) + ChromaDB (Streaming, FREE)

```python
from krira_augment import Pipeline, PipelineConfig
from sentence_transformers import SentenceTransformer
import chromadb

# Initialize (no API keys needed)
model = SentenceTransformer('all-MiniLM-L6-v2')
client = chromadb.Client()
collection = client.create_collection("my_chunks")

# Configure and stream
config = PipelineConfig(chunk_size=512, chunk_overlap=50)
pipeline = Pipeline(config=config)

chunk_count = 0
for chunk in pipeline.process_stream("data.csv"):
    chunk_count += 1
    
    # Embed locally (free, runs on your machine)
    embedding = model.encode(chunk["text"])
    
    # Store locally
    collection.add(
        ids=[f"chunk_{chunk_count}"],
        embeddings=[embedding.tolist()],
        metadatas=[chunk["metadata"]],
        documents=[chunk["text"]]
    )
    
    if chunk_count % 100 == 0:
        print(f"Processed {chunk_count} chunks...")

print(f"Done! {chunk_count} chunks embedded. All local, no API costs.")
```

### Hugging Face + FAISS (Streaming, FREE)

```python
from krira_augment import Pipeline, PipelineConfig
from transformers import AutoTokenizer, AutoModel
import torch
import faiss
import numpy as np

# Initialize (no API keys needed)
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
index = faiss.IndexFlatL2(384)

# Configure and stream
config = PipelineConfig(chunk_size=512, chunk_overlap=50)
pipeline = Pipeline(config=config)

chunk_count = 0
embeddings_batch = []
BATCH_SIZE = 100  # Process in batches for efficiency

for chunk in pipeline.process_stream("data.csv"):
    chunk_count += 1
    
    # Embed locally
    inputs = tokenizer(chunk["text"], return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
        embedding = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
    
    embeddings_batch.append(embedding)
    
    # Add to FAISS in batches
    if len(embeddings_batch) >= BATCH_SIZE:
        embeddings_array = np.array(embeddings_batch).astype('float32')
        index.add(embeddings_array)
        embeddings_batch = []
        print(f"Processed {chunk_count} chunks...")

# Add remaining embeddings
if embeddings_batch:
    embeddings_array = np.array(embeddings_batch).astype('float32')
    index.add(embeddings_array)

# Save index
faiss.write_index(index, "my_vectors.index")
print(f"Done! {chunk_count} chunks embedded and saved to my_vectors.index")
```

---

## Streaming Mode Advantages

| Feature | File-Based | Streaming |
|---------|------------|-----------|
| **Disk I/O** | Creates chunks.jsonl | None |
| **Memory Usage** | O(1) constant | O(1) constant |
| **Speed** | Chunking, then embedding | Chunking overlapped with embedding (faster) |
| **Use Case** | Large files, batch processing | Real-time, no storage |
| **Flexibility** | Can re-process chunks | Single pass only |

---

## When to Use Streaming vs File-Based

**Use Streaming When:**
- You want maximum speed (no disk writes)
- You don't need to save chunks for later
- You're building real-time pipelines
- You have limited disk space

**Use File-Based When:**
- You want to inspect/debug chunks
- You need to re-process with different embeddings
- You want to share chunks with your team
- You're experimenting with different models
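
The two modes can also be combined: stream chunks to the embedder while writing a copy to disk for later inspection. The sketch below is one way to do that, assuming the streamed chunks are JSON-serializable dicts with `text` and `metadata` keys, as in the streaming examples. The helper name `tee_to_jsonl` is hypothetical, and the demo uses a plain list in place of `pipeline.process_stream(...)`.

```python
import io
import json
from typing import Any, Dict, Iterable, Iterator, TextIO


def tee_to_jsonl(chunks: Iterable[Dict[str, Any]], sink: TextIO) -> Iterator[Dict[str, Any]]:
    """Write each chunk to sink as one JSON line, then pass it through unchanged."""
    for chunk in chunks:
        sink.write(json.dumps(chunk, ensure_ascii=False) + "\n")
        yield chunk


# Stand-in for pipeline.process_stream("data.csv")
fake_stream = [
    {"text": "first chunk", "metadata": {"row": 1}},
    {"text": "second chunk", "metadata": {"row": 2}},
]

sink = io.StringIO()  # In practice: open("chunks.jsonl", "w", encoding="utf-8")
for chunk in tee_to_jsonl(fake_stream, sink):
    pass  # Embed and store the chunk here

print(sink.getvalue())
```

This keeps the single-pass behavior of streaming while leaving a file that can be re-embedded later with a different model.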

---

## Error Handling (Production Ready)

```python
from krira_augment import Pipeline, PipelineConfig
import openai
from openai import OpenAI
from pinecone import Pinecone
import time

# Uses the openai>=1.0 and Pinecone client v3+ APIs
openai_client = OpenAI(api_key="sk-...")
pc = Pinecone(api_key="pcone-...")
index = pc.Index("my-rag")

config = PipelineConfig(chunk_size=512, chunk_overlap=50)
pipeline = Pipeline(config=config)

MAX_ATTEMPTS = 3  # Attempts per chunk before giving up

chunk_count = 0
error_count = 0

for chunk in pipeline.process_stream("data.csv"):
    chunk_count += 1

    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            # Embed
            response = openai_client.embeddings.create(input=chunk["text"], model="text-embedding-3-small")
            embedding = response.data[0].embedding

            # Store
            index.upsert(vectors=[(f"chunk_{chunk_count}", embedding, chunk["metadata"])])
            break  # Success: move on to the next chunk

        except openai.RateLimitError:
            # Rate limited: wait, then retry the same chunk
            print(f"Rate limited on chunk {chunk_count} (attempt {attempt}), waiting 60 seconds...")
            time.sleep(60)

        except Exception as e:
            # Any other error: record it and skip this chunk
            error_count += 1
            print(f"Error on chunk {chunk_count}: {e}")
            break
    else:
        # Every attempt was rate limited
        error_count += 1
        print(f"Giving up on chunk {chunk_count} after {MAX_ATTEMPTS} attempts")

    if chunk_count % 100 == 0:
        print(f"Processed {chunk_count} chunks, {error_count} errors")

print(f"Done! {chunk_count} chunks processed, {error_count} errors")
```
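
A fixed 60-second wait, as in the loop above, is simple but slow to recover from short hiccups. A common alternative is exponential backoff with jitter. The sketch below is provider-agnostic: the helper name `with_retries` and its parameters are hypothetical, and which exception types are worth retrying depends on the client library you use.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add up to 10% jitter so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, delay / 10))
    raise AssertionError("unreachable")


# Demo: a call that fails twice with a transient error, then succeeds
attempts = {"count": 0}


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(with_retries(flaky_call, base_delay=0.01, retry_on=(ConnectionError,)))  # ok
```

In the streaming loop, the embed and upsert calls for one chunk could be wrapped in a small function and passed to `with_retries`, so a failed chunk is retried before the loop moves on.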

---

## Supported Formats

| Format | Extension | Method |
|--------|-----------|--------|
| **CSV** | `.csv` | Direct processing |
| **Text** | `.txt` | Direct processing |
| **JSONL** | `.jsonl` | Direct processing |
| **JSON** | `.json` | Auto-flattening |
| **PDF** | `.pdf` | pdfplumber extraction |
| **Word** | `.docx` | python-docx extraction |
| **Excel** | `.xlsx` | openpyxl extraction |
| **XML** | `.xml` | ElementTree parsing |
| **URLs** | `http://` | BeautifulSoup scraping |
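
The package metadata declares optional extras for some of these formats (`pdf`, `docx`, `xlsx`, `csv`, `all`). As a rough sketch, a pre-flight check can tell users which extra to install before processing a file. The helper `missing_extra` and the `OPTIONAL_DEPS` table are hypothetical, and whether Krira actually needs the extra for a given format at runtime is an assumption here, not documented behavior.

```python
import importlib.util
from pathlib import Path
from typing import Callable, Optional

# File extension -> (pip extra name, importable module name)
OPTIONAL_DEPS = {
    ".xlsx": ("xlsx", "openpyxl"),
    ".pdf": ("pdf", "pdfplumber"),
    ".docx": ("docx", "docx"),  # python-docx installs the `docx` module
    ".csv": ("csv", "polars"),
}


def _is_installed(module: str) -> bool:
    """Return True if the module can be found without importing it."""
    return importlib.util.find_spec(module) is not None


def missing_extra(path: str, is_installed: Callable[[str], bool] = _is_installed) -> Optional[str]:
    """Return the extra to install for this file, or None if nothing is missing."""
    entry = OPTIONAL_DEPS.get(Path(path).suffix.lower())
    if entry is None:
        return None  # Format has no optional dependency in this table
    extra, module = entry
    return None if is_installed(module) else extra


extra = missing_extra("report.pdf")
if extra:
    print(f'Install support with: pip install "krira-augment[{extra}]"')
```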

---

## Provider Comparison

| Embedding | Vector Store | Cost | API Keys | Streaming Support |
|-----------|--------------|------|----------|-------------------|
| OpenAI | Pinecone | Paid | 2 | ✅ Yes |
| OpenAI | Qdrant | Paid | 2 | ✅ Yes |
| OpenAI | Weaviate | Paid | 2 | ✅ Yes |
| Cohere | Pinecone | Paid | 2 | ✅ Yes |
| Cohere | Qdrant | Paid | 2 | ✅ Yes |
| SentenceTransformers | ChromaDB | **FREE** | 0 | ✅ Yes |
| Hugging Face | FAISS | **FREE** | 0 | ✅ Yes |

---

## API Keys Setup

Get your keys from:
- **OpenAI:** https://platform.openai.com/api-keys
- **Cohere:** https://dashboard.cohere.com/api-keys
- **Pinecone:** https://app.pinecone.io/
- **Qdrant:** https://cloud.qdrant.io/
- **Weaviate:** https://console.weaviate.cloud/
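
The examples in this README hard-code placeholder keys for brevity. In real projects it is safer to read them from environment variables so they never land in version control. A minimal sketch follows; the helper name `require_env` is hypothetical, and the variable names in the usage comment are conventions you can change.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running the pipeline")
    return value


# Usage (assumes the variables were exported in your shell):
#   OPENAI_API_KEY = require_env("OPENAI_API_KEY")
#   PINECONE_API_KEY = require_env("PINECONE_API_KEY")
```

Failing early with the variable name in the message is easier to debug than an authentication error raised midway through a long run.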

---

## Development

1. **Clone the repo** (`https://github.com/Krira-Labs/krira-chunker`)
2. **Install Maturin and build**
   ```bash
   pip install maturin build
   ```
3. **Build and Install locally**
   ```bash
   python -m build
   pip install dist/*.whl --force-reinstall
   ```

---

## License

MIT License. (c) 2025 Krira Labs.

