Metadata-Version: 2.4
Name: llm-ctx-mgr
Version: 0.1.0
Summary: A middleware layer for managing, budgeting, and optimizing LLM context windows.
Project-URL: Homepage, https://github.com/adiptamartulandi/llm-ctx-mgr
Project-URL: Repository, https://github.com/adiptamartulandi/llm-ctx-mgr
Author: Adipta Martulandi
License-Expression: MIT
Keywords: ai,budget,context,llm,token
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Provides-Extra: all
Requires-Dist: anthropic>=0.40; extra == 'all'
Requires-Dist: fastembed>=0.4; extra == 'all'
Requires-Dist: google-genai>=1.0; extra == 'all'
Requires-Dist: llmlingua>=0.2.2; extra == 'all'
Requires-Dist: numpy>=1.24; extra == 'all'
Requires-Dist: tiktoken>=0.7; extra == 'all'
Requires-Dist: tokenizers>=0.19; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: anthropic>=0.40; extra == 'dev'
Requires-Dist: fastembed>=0.4; extra == 'dev'
Requires-Dist: google-genai>=1.0; extra == 'dev'
Requires-Dist: llmlingua>=0.2.2; extra == 'dev'
Requires-Dist: numpy>=1.24; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Requires-Dist: tiktoken>=0.7; extra == 'dev'
Requires-Dist: tokenizers>=0.19; extra == 'dev'
Provides-Extra: distill
Requires-Dist: llmlingua>=0.2.2; extra == 'distill'
Provides-Extra: google
Requires-Dist: google-genai>=1.0; extra == 'google'
Provides-Extra: huggingface
Requires-Dist: tokenizers>=0.19; extra == 'huggingface'
Provides-Extra: openai
Requires-Dist: tiktoken>=0.7; extra == 'openai'
Provides-Extra: prune
Requires-Dist: fastembed>=0.4; extra == 'prune'
Requires-Dist: numpy>=1.24; extra == 'prune'
Description-Content-Type: text/markdown

# Project: `llm-ctx-mgr` (LLM Context Manager)

### **1. Background & Problem Statement**

Work with Large Language Models (LLMs) is shifting from simple "Prompt Engineering" (crafting a single query) to "Context Engineering" (managing a large ecosystem of retrieved documents, tools, and history).

The current problem is **Context Pollution**:

1. **Overloading:** Retrieval-Augmented Generation (RAG) pipelines often dump too much data into the prompt, exceeding token limits.
2. **Noise:** Duplicate or irrelevant information can confuse the model and make hallucinations more likely.
3. **Formatting Chaos:** Different models (Claude vs. Llama vs. GPT) work best with different formatting (XML vs. Markdown vs. plain text), which leads to messy, hard-to-maintain string concatenation code.
4. **Black Box:** Developers rarely see exactly what "context" was sent to the LLM until after a failure occurs.

**The Solution:** `llm-ctx-mgr` acts as a **middleware layer** for the LLM pipeline. It builds a structured, optimized, and budget-aware "context payload" before the prompt reaches the model.

---

### **2. Architecture: Where It Fits**

The package sits strictly between the **Retrieval/Agent Layer** (e.g., LangChain, LlamaIndex) and the **Execution Layer** (the LLM API).

#### **Diagram: The "Before" (Standard Pipeline)**

*Without `llm-ctx-mgr`, retrieved documents are dumped into the prompt as-is and are often truncated arbitrarily.*

```mermaid
graph LR
    A[User Query] --> B[LangChain Retriever]
    B --> C{Result: 15 Docs}
    C -->|Raw Dump| D[LLM Context Window]
    D -->|Token Limit Exceeded!| E[Truncated/Error]

```

#### **Diagram: The "After" (With `llm-ctx-mgr`)**

*With `llm-ctx-mgr`, the context is curated, prioritized, and formatted.*

```mermaid
graph LR
    A[User Query] --> B[LangChain Retriever]
    B --> C[Raw Data: 15 Docs + History]
    C --> D[**llm-ctx-mgr**]
    
    subgraph "Your Middleware"
    D --> E["1. Semantic Pruning"]
    E --> F["2. Token Budgeting"]
    F --> G["3. Formatting (XML/JSON)"]
    end
    
    G --> H[Optimized Prompt]
    H --> I[LLM API]

```

---

### **3. Key Features & Tools**

Here is a breakdown of the five core modules (A to E), the features they provide, and the libraries powering them.

#### **Module A: The Budget Controller (`budget`)**

* **Goal:** Ensure the context never exceeds the model's limit (e.g., 8192 tokens) while keeping the most important information.
* **Feature:** `PriorityQueue`. Users assign a priority (Critical, High, Medium, Low) to every piece of context. If the budget is full, "Low" items are dropped first.
* **Supported Providers & Tools:**
    * **OpenAI** (`gpt-4`, `o1`, `o3`, etc.): **`tiktoken`** — fast, local token counting.
    * **HuggingFace** (`meta-llama/...`, `mistralai/...`): **`tokenizers`** — for open-source models.
    * **Google** (`gemini-2.0-flash`, `gemma-...`): **`google-genai`** — API-based `count_tokens`.
    * **Anthropic** (`claude-sonnet-4-20250514`, etc.): **`anthropic`** — API-based `count_tokens`.
* **Installation (pick what you need):**
    ```bash
    # Quote the requirement so shells such as zsh do not expand the brackets
    pip install "llm-ctx-mgr[openai]"       # tiktoken
    pip install "llm-ctx-mgr[huggingface]"  # tokenizers
    pip install "llm-ctx-mgr[google]"       # google-genai
    pip install "llm-ctx-mgr[anthropic]"    # anthropic
    pip install "llm-ctx-mgr[all]"          # everything
    ```
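
The priority-based dropping described above can be sketched in plain Python. This is an illustration only, not the package's implementation: `Block`, `fit_to_budget`, and the whitespace-based `count_tokens` are hypothetical names, and a real setup would plug in one of the provider tokenizers listed above. The sketch also assumes that critical blocks are kept even if they alone exceed the limit.

```python
from dataclasses import dataclass
from typing import Callable, List

# Lower rank = more important; mirrors Critical / High / Medium / Low.
PRIORITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class Block:
    """Hypothetical stand-in for a piece of context."""
    content: str
    priority: str


def count_tokens(text: str) -> int:
    """Stand-in counter: whitespace-separated words, not real model tokens."""
    return len(text.split())


def fit_to_budget(
    blocks: List[Block],
    token_limit: int,
    counter: Callable[[str], int] = count_tokens,
) -> List[Block]:
    """Admit blocks from most to least important until the budget is spent.

    Assumption: critical blocks are always kept, even over budget.
    Surviving blocks are returned in their original insertion order.
    """
    by_importance = sorted(
        enumerate(blocks), key=lambda pair: (PRIORITY_RANK[pair[1].priority], pair[0])
    )
    kept_indices = set()
    used = 0
    for index, block in by_importance:
        cost = counter(block.content)
        if block.priority == "critical" or used + cost <= token_limit:
            kept_indices.add(index)
            used += cost
    return [block for index, block in enumerate(blocks) if index in kept_indices]


if __name__ == "__main__":
    demo = [
        Block("You are a helpful assistant.", "critical"),
        Block("User asked about quantum computing earlier.", "high"),
        Block("Cake recipes need flour and sugar.", "low"),
        Block("Qubits can be in superposition.", "medium"),
    ]
    for block in fit_to_budget(demo, token_limit=16):
        print(block.priority, "|", block.content)
```

With a 16-word budget the low-priority block is the one that gets dropped, because it is considered last and no longer fits.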

#### **Module B: The Semantic Pruner (`prune`)**

* **Goal:** Remove redundancy. If three retrieved documents say "Python is great," keep only the best one.
* **Features:**
    * **`Deduplicator` (block-level):** Calculates cosine similarity between context blocks and removes duplicate blocks. Among duplicates, the highest-priority block is kept.
    * **`Deduplicator.deduplicate_chunks()` (chunk-level):** Splits a single block's content by separator (e.g. `\n\n`), deduplicates the chunks internally, and reassembles the cleaned content. Ideal for RAG results where multiple retrieved chunks within one block are semantically redundant.
* **Tools:**
    * **`FastEmbed`**: Lightweight embedding generation (CPU-friendly, no heavy PyTorch needed).
    * **`NumPy`**: For efficient vector math (dot products and norms for cosine similarity).
* **Installation:**
    ```bash
    pip install "llm-ctx-mgr[prune]"
    ```
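
As a rough sketch of the block-level idea (not the package's `Deduplicator`), the following standard-library code compares embedding vectors by cosine similarity and keeps the most important block in each group of near-duplicates. The function names and the `(text, priority_rank, vector)` tuple layout are assumptions made for illustration, and the vectors are hand-written toy values standing in for real FastEmbed output. A NumPy version would vectorize the same arithmetic.

```python
import math
from typing import List, Sequence, Tuple

Item = Tuple[str, int, Sequence[float]]  # (text, priority_rank, embedding)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(items: List[Item], threshold: float = 0.85) -> List[str]:
    """Drop items whose vector is too similar to an already kept item.

    Items are visited from most to least important (lower rank first), so the
    highest-priority member of each duplicate group survives. The result
    preserves the original order of the surviving items.
    """
    order = sorted(range(len(items)), key=lambda i: (items[i][1], i))
    kept: List[int] = []
    for i in order:
        vector = items[i][2]
        if all(cosine_similarity(vector, items[j][2]) < threshold for j in kept):
            kept.append(i)
    return [items[i][0] for i in sorted(kept)]


if __name__ == "__main__":
    toy = [
        ("Python is great (low priority copy)", 3, [1.0, 0.0]),
        ("Python is great (high priority copy)", 1, [0.99, 0.05]),
        ("Rust focuses on memory safety", 2, [0.0, 1.0]),
    ]
    print(deduplicate(toy))
```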
    
#### **Module C: Context Distillation (`distill`)**

* **Goal:** Compress individual blocks by removing non-essential tokens (e.g., reducing a 5000-token document to 2500 tokens) using a small ML model.
* **Feature:** `Compressor`. Uses **LLMLingua-2** (a small BERT-based token classifier) to keep only the most important words.
* **Tools:**
    * **`llmlingua`**: Microsoft's library for prompt compression.
    * **`onnxruntime`** / **`transformers`**: For running the small BERT model.
* **Installation:**
    ```bash
    pip install "llm-ctx-mgr[distill]"
    ```

#### **Module D: The Formatter (`format`)**

* **Goal:** Adapt the text structure to the specific LLM being used without changing the data.
* **Feature:** `ModelAdapter`.
    * *Claude Mode:* Wraps data in XML tags (`<doc id="1">...</doc>`).
    * *Llama Mode:* Uses specific Markdown headers or `[INST]` tags.
* **Tools:**
    * **`Jinja2`**: For powerful, logic-based string templates.
    * **`Pydantic`**: To enforce strict schema validation on the input data.
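
To make the two modes concrete, here is a minimal sketch of per-model formatting using only the standard library. It is not the package's `ModelAdapter`: the function names, the `FORMATTERS` table, and the exact Markdown header text are assumptions chosen for illustration, and a Jinja2 template could replace each function.

```python
from typing import Callable, Dict, List
from xml.sax.saxutils import escape


def format_claude(docs: List[str]) -> str:
    """Wrap each document in an XML tag with a 1-based id, escaping &, < and >."""
    return "\n".join(
        f'<doc id="{i}">{escape(doc)}</doc>' for i, doc in enumerate(docs, start=1)
    )


def format_markdown(docs: List[str]) -> str:
    """Put each document under its own Markdown header (header text is illustrative)."""
    return "\n\n".join(
        f"### Document {i}\n{doc}" for i, doc in enumerate(docs, start=1)
    )


# Hypothetical mode names mapped to formatter functions.
FORMATTERS: Dict[str, Callable[[List[str]], str]] = {
    "claude": format_claude,
    "llama": format_markdown,
}


def format_context(docs: List[str], mode: str) -> str:
    """Format the same data for a given mode without changing its content."""
    if mode not in FORMATTERS:
        raise ValueError(f"unknown mode: {mode!r}")
    return FORMATTERS[mode](docs)


if __name__ == "__main__":
    documents = ["a < b", "Quantum computing uses qubits."]
    print(format_context(documents, "claude"))
    print(format_context(documents, "llama"))
```

Keeping the data as a plain list and choosing the wrapper at the last step is what lets the same context be reused across models.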



#### **Module E: Observability (`inspect`)**

* **Goal:** Let the developer see exactly what is happening.
* **Feature:** `ContextVisualizer` and `Snapshot`. Prints a colored bar chart of token usage to the terminal and saves the final prompt to a JSON file for debugging.
* **Tools:**
    * **`Rich`**: For beautiful terminal output and progress bars.



---

### **4. Installation & Usage Guide**

#### **Installation**

```bash
pip install "llm-ctx-mgr[all]"
```

#### **Feature A: Budgeting & Priority Pruning**
*Ensure your context fits the token limit by prioritizing critical information.*

```python
from context_manager import ContextEngine, ContextBlock
from context_manager.strategies import PriorityPruning

# 1. Initialize Engine with a token limit
engine = ContextEngine(
    model="gpt-4",
    token_limit=4000,
    pruning_strategy=PriorityPruning()
)

# 2. Add Critical Context (System Prompts) - NEVER dropped
engine.add(ContextBlock(
    content="You are a helpful AI assistant.",
    role="system",
    priority="critical"
))

# 3. Add High Priority Context (User History) - Dropped only if critical fills budget
engine.add(ContextBlock(
    content="User: Explain quantum computing.",
    role="history",
    priority="high"
))

# 4. Add Medium/Low Priority (RAG Docs) - Dropped first
docs = ["Quantum computing uses qubits...", "Quantum mechanics is...", "Cake recipes..."]
for doc in docs:
    engine.add(ContextBlock(
        content=doc,
        role="rag_context",
        priority="medium"
    ))

# 5. Compile - Triggers budgeting and pruning
final_prompt = engine.compile()
print(f"Final token count: {engine.compiled_tokens}")
```

#### **Feature B: Semantic Pruning (Deduplication)**
*Remove duplicate or highly similar content to save space and reduce noise.*

```python
from context_manager import ContextEngine, ContextBlock
from context_manager.prune import Deduplicator

# 1. Initialize Deduplicator (uses FastEmbed by default)
dedup = Deduplicator(threshold=0.85)

# 2. Initialize Engine with Deduplicator
engine = ContextEngine(
    model="gpt-4",
    token_limit=4000,
    deduplicator=dedup
)

# 3. Add duplicate content (simulating RAG retrieval)
# The second block will be detected as a near-duplicate and removed
engine.add(ContextBlock(
    content="Python was created by Guido van Rossum.",
    role="rag_context",
    priority="medium"
))
engine.add(ContextBlock(
    content="Guido van Rossum created the Python language.",
    role="rag_context",
    priority="low"  # Lower priority duplicate is dropped
))

# 4. Compile - Deduplication happens before budgeting
final_prompt = engine.compile()
```
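
Chunk-level deduplication, as described for `Deduplicator.deduplicate_chunks()`, follows a split, compare, reassemble pattern. The sketch below illustrates that pattern only and is not the package's code: `dedupe_chunks_sketch` is a hypothetical name, and a toy bag-of-words vector stands in for real embeddings, which is why the threshold here is set lower (0.7) than the 0.85 used above.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def dedupe_chunks_sketch(
    content: str, separator: str = "\n\n", threshold: float = 0.7
) -> str:
    """Split content into chunks, drop near-duplicates, and rejoin the rest.

    The first occurrence of each group of similar chunks is kept.
    """
    kept, kept_vectors = [], []
    for chunk in content.split(separator):
        chunk = chunk.strip()
        if not chunk:
            continue
        vector = embed(chunk)
        if all(cosine(vector, other) < threshold for other in kept_vectors):
            kept.append(chunk)
            kept_vectors.append(vector)
    return separator.join(kept)


if __name__ == "__main__":
    block = (
        "Python was created by Guido van Rossum.\n\n"
        "Guido van Rossum created the Python language.\n\n"
        "Rust focuses on memory safety."
    )
    print(dedupe_chunks_sketch(block))
```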

#### **Feature C: Context Distillation (Compression)**
*Compress long documents using LLMLingua to keep essential information within budget.*

```python
from context_manager import ContextEngine, ContextBlock, Priority
from context_manager.distill import LLMLinguaCompressor

# 1. Initialize Compressor (loads small local model)
compressor = LLMLinguaCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    device_map="cpu"
)

# 2. Initialize Engine
engine = ContextEngine(
    model="gpt-4",
    token_limit=2000,
    compressor=compressor
)

# 3. Add a long document marked for compression
long_text = "..." * 1000  # Very long text
engine.add(ContextBlock(
    content=long_text,
    role="rag_context",
    priority=Priority.HIGH,
    can_compress=True  # <--- Triggers compression for this block
))

# 4. Compile - Compression happens first, then deduplication, then budgeting
final_prompt = engine.compile()
```
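
The stage order named in the last comment (compression, then deduplication, then budgeting) can be pictured as a simple chain of functions. The sketch below uses deliberately crude toy stages to show why the order matters. None of these functions are part of the package, and word counts stand in for tokens.

```python
from typing import Callable, List, Sequence

Stage = Callable[[List[str]], List[str]]


def compress(blocks: List[str], max_words: int = 4) -> List[str]:
    """Toy compression: keep only the first max_words words of each block."""
    return [" ".join(block.split()[:max_words]) for block in blocks]


def dedupe(blocks: List[str]) -> List[str]:
    """Toy deduplication: drop exact repeats, keeping the first occurrence."""
    return list(dict.fromkeys(blocks))


def budget(blocks: List[str], limit: int = 8) -> List[str]:
    """Toy budgeting: keep blocks in order while the word budget allows."""
    kept, used = [], 0
    for block in blocks:
        cost = len(block.split())
        if used + cost <= limit:
            kept.append(block)
            used += cost
    return kept


def compile_context(
    blocks: List[str], stages: Sequence[Stage] = (compress, dedupe, budget)
) -> List[str]:
    """Run each stage in order, feeding its output to the next stage."""
    for stage in stages:
        blocks = stage(blocks)
    return blocks


if __name__ == "__main__":
    raw = [
        "alpha beta gamma delta epsilon",
        "alpha beta gamma delta zeta",
        "one two three",
    ]
    print(compile_context(raw))                   # full pipeline
    print(compile_context(raw, stages=(budget,)))  # budgeting alone
```

Running compression and deduplication first shrinks and merges blocks before the budget is applied, so fewer blocks are dropped outright than when budgeting runs alone.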

### **5. Roadmap for Development**

1. **v0.1 (MVP):** `tiktoken` counting and `PriorityPruning`. (Done)
2. **v0.2 (Structure):** `Jinja2` templates for formatting. (Done)
3. **v0.3 (Smarts):** `FastEmbed` for semantic deduplication. (Done)
4. **v0.4 (Vis):** `Rich` terminal visualization. (Done)
5. **v0.5 (Distill):** `LLMLingua` integration for context compression. (Done)
6. **v0.6 (Next):** Streaming support and advanced caching strategies.

