Metadata-Version: 2.4
Name: sutra-llm
Version: 0.2.1
Summary: Chain small language models to outperform large ones — runs locally on 8GB RAM
Author-email: Shubham Kumar <uselessmailforsurf@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/metalheadshubham/SUTRA
Project-URL: Repository, https://github.com/metalheadshubham/SUTRA
Keywords: llm,ollama,ai,local-ai,council,inference
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: ollama>=0.4.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Dynamic: license-file

# S.U.T.R.A

```
 ███████╗ ██╗   ██╗ ████████╗ ██████╗   █████╗
 ██╔════╝ ██║   ██║ ╚══██╔══╝ ██╔══██╗ ██╔══██╗
 ███████╗ ██║   ██║    ██║    ██████╔╝ ███████║
 ╚════██║ ██║   ██║    ██║    ██╔══██╗ ██╔══██║
 ███████║ ╚██████╔╝    ██║    ██║  ██║ ██║  ██║
 ╚══════╝  ╚═════╝     ╚═╝    ╚═╝  ╚═╝ ╚═╝  ╚═╝
```

*Structured Universal Transfer via Retrieval Adaptation*
*by Shubham Kumar*

---

ok so idk how to explain this properly but basically i got tired of paying for API calls and also tired of my laptop dying trying to run a 70B model so i built this thing.

the idea came from a weird place — i was reading about how LLMs do in-context learning and thought, what if instead of trying to make one model smarter, you just... make multiple small models argue with each other and then have a slightly bigger one read the argument and write the answer. turns out that actually works. not always. but enough.

---

## the numbers first

because that's what you're actually here for.

ran the full HumanEval benchmark. 164 python coding problems. standard eval everyone uses.

### SUTRA vs cloud models (HumanEval pass@1)

<img width="1622" height="608" alt="Benchmark" src="https://github.com/user-attachments/assets/d41e7fe0-8451-4a56-9614-acb54cd3a2f2" />




> SUTRA uses `llama-3.3-70b` as the backbone + two `llama-3.1-8b` council passes. cloud model scores from published benchmarks. this is not a parameter-count comparison — SUTRA costs 4 inference calls vs 1, but runs on consumer hardware with no API subscription.

SUTRA sits above GPT-4 original. on a machine with 8GB RAM. running entirely offline.

### full benchmark breakdown

| method | pass@1 | solved | delta |
|--------|--------|--------|-------|
| raw `llama-3.3-70b` (baseline) | 0.805 | 132/164 | — |
| **SUTRA council** | **0.854** | **140/164** | **+8 problems** |

council only ran on the 32 problems the baseline failed. rescued 8 of them. zero regressions — if baseline passed it, council never ran on it so it literally cannot make things worse.

rescued: `HumanEval/8` `/32` `/64` `/65` `/83` `/86` `/93` `/121`

### hard problems specifically (20 problems, HumanEval 32–163 range)

this is where it actually matters. easy problems don't need council — the large model handles them fine alone.

| method | pass@1 | solved |
|--------|--------|--------|
| raw `llama-3.3-70b` | 0.600 | 12/20 |
| **SUTRA council** | **0.800** | **16/20** |
| delta | **+0.200** | **+4 problems** |

**+20% on hard problems.** 25% rescue rate on failures. that's the real signal.

### easy problems (same 20-problem test)

| method | pass@1 | solved |
|--------|--------|--------|
| raw `llama-3.3-70b` | 0.900 | 18/20 |
| SUTRA council | 0.800 | 16/20 |
| delta | -0.100 | -2 problems |

council makes easy problems slightly worse. the 8b critique adds noise when the 70b already knows the answer. that's why `/quick` exists — skip council when you don't need it.

raw data in `benchmark/sutra_full_results.json`.

---

## what even is this

it's a pipeline. four stages. runs entirely locally on ollama. never puts two models in RAM at the same time because 8GB doesn't go very far.

```
your query
  ↓
small model (temp 0.3)  →  answer A
small model (temp 0.8)  →  answer B   ← same model, different temp = genuinely different answer
small model (temp 0.3)  →  critiques both
large model (temp 0.2)  →  reads everything, writes final answer
```

the temperature thing matters. running the same model twice at different temps gives you one conservative answer and one exploratory one. the critique finds where they disagree or where both went wrong. the large model synthesizes something that (hopefully) avoids both failure modes.

is this just fancy prompting? kind of. but it's fancy prompting with a real benchmark behind it so.

---

## is it fast

lmaooo no. on CPU with 8GB RAM you're looking at 5-10 minutes per query in council mode. on a machine with a GPU it drops to like 1-2 minutes. on Groq's free API it runs in ~30 seconds.

the architecture is sound — the slowness is a hardware constraint not a design flaw. `/quick` mode skips council and just asks the large model directly when you need speed.

---

## install

you need [Ollama](https://ollama.com) running. that's it.

```bash
git clone https://github.com/metalheadshubham/SUTRA
cd SUTRA
pip install -r requirements.txt
ollama pull qwen2.5-coder:3b
ollama pull qwen2.5-coder:7b
python -m agent_handoff
```

Python 3.10+. 8GB RAM minimum. works on Windows, Mac, Linux (Mac is faster bc Apple Silicon).

---

## models

default setup i benchmarked:
- **small model** (stages 1, 2, 3): `qwen2.5-coder:3b`
- **large model** (synthesis): `qwen2.5-coder:7b`

if you have more RAM or a GPU:
- `deepseek-r1:8b` as large — actual chain-of-thought reasoning, significantly better on hard problems but very slow on CPU
- `llama3.1:8b` as large — good general purpose alternative

the CLI auto-detects your Ollama models and recommends roles based on parameter count.

---

## commands

```
/quick <query>      skip council, ask large model directly (fast path)
/council <query>    force council mode explicitly
/save               save last output — asks you where
/load <file>        attach a file to context for next query
/unload             remove attached files
/project            see files saved + loaded this session
/logs               last 5 runs with timing
/plugins            loaded plugins + hooks
/install <url>      install a plugin from a .py URL
/help               all commands
```

multiline input — type `"""` to open a block, paste whatever, `"""` to send. good for pasting code you want fixed.

---

## workspace

load files into context so the council reads them without you copy-pasting anything.

```bash
❯ implement a FastAPI router for user auth

❯ /save
  Save to (path or filename): src/auth.py
  ✓ Saved to C:\project\src\auth.py · registered in workspace

❯ /load auth.py
  ✓ Loaded auth.py (~340 tokens) · attached to context

❯ add rate limiting to this
  [council now sees auth.py automatically]
```

if the file is large (>800 tokens) it warns you to switch to a bigger model before continuing.

---

## plugins

drop a `.py` file into `~/.sutra/plugins/`. loaded on startup. hooks:

```python
def pre_query(query): ...
def post_answer_a(text): ...
def post_answer_b(text): ...
def post_critique(text): ...
def post_synthesis(text): ...
```

built-in logger saves every run to `~/.sutra/logs/YYYY-MM-DD.jsonl` automatically.

---

## roadmap

- difficulty router — auto-detect when to use council vs quick
- MCP server — expose council as an MCP tool so Claude/Cursor can call it
- math + reasoning benchmarks — HumanEval is coding only, want GSM8K
- ablation study — prove the critique step is doing something (vs just calling twice)
- same benchmark on fully local models (no Groq)

---

## project structure

```
agent_handoff/
├── cli.py          # terminal UI — where you spend your time
├── handoff.py      # AgentHandoff (old) + CouncilHandoff (new)
├── templates.py    # prompts for each council stage
├── protocol.py     # HandoffPacket, CouncilResult dataclasses
├── parser.py       # extracts structured output from model responses
├── cache.py        # SHA-256 keyed cache with TTL
└── utils.py        # helpers
benchmark/
└── sutra_full_results.json   # raw data behind the numbers above
tests/              # 56 tests, all pass
```

---

## license

MIT. do whatever.just give me the credits

---

*shit works.*
