Metadata-Version: 2.4
Name: ecate-llm
Version: 0.1.1
Summary: Intelligent multi-model LLM router with auto-calibration and multilingual support
Author: nullcline-labs
License: AGPL-3.0
Keywords: llm,router,proxy,openai,anthropic,multilingual
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: GNU Affero General Public License v3
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: aiohttp>=3.9
Requires-Dist: aiosqlite>=0.19
Requires-Dist: pyyaml>=6.0
Requires-Dist: pydantic>=2.0
Requires-Dist: langdetect>=1.0.9
Requires-Dist: sentence-transformers>=2.2
Requires-Dist: scikit-learn>=1.3
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21; extra == "dev"
Requires-Dist: pytest-aiohttp>=1.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"

<p align="center">
  <img src="docs/images/icon.png" alt="Ecate.LLM" width="128" />
</p>

# Ecate.LLM

> *Ecate (Ἑκάτη) — Greek goddess of crossroads, guiding travelers to the right path.*

An OpenAI-compatible proxy that **learns to route requests across your models** — you bring the models, Ecate learns which one handles what best.

[![Tests](https://github.com/nullcline-labs/ecate.llm/actions/workflows/test.yml/badge.svg)](https://github.com/nullcline-labs/ecate.llm/actions/workflows/test.yml) [![PyPI](https://img.shields.io/pypi/v/ecate-llm)](https://pypi.org/project/ecate-llm/)

### Features

- **Calibrate your own models** — no pre-built rankings, learns routing from your actual model lineup
- **Multilingual routing** — automatic language detection (55+ languages) with per-model language filters
- **Provider-agnostic** — OpenAI, Anthropic, Ollama, vLLM, or any OpenAI-compatible endpoint
- **Embedding-based routing** — fast local inference (~10ms), no external API calls for routing
- **Cost-aware strategies** — balance quality vs cost with configurable `auto`, `best`, or `cheap` modes
- **Automatic fallback** — circuit breaker detects failures and routes around unavailable models
- **Real-time dashboard** — monitor costs, latency, and routing decisions per model
- **OpenAI-compatible API** — drop-in replacement, works with any OpenAI SDK

## Why

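If every request goes to your most capable (and most expensive) model, you are probably overpaying: many queries could be handled by a far cheaper model. At the prices in the Quick Start config below, gpt-4o-mini costs 20-25x less per token than Claude Sonnet 4. Ecate makes that choice per request, automatically.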

| Without Ecate | With Ecate |
|---|---|
| Every request → your most expensive model | Simple requests → cheapest capable model |
| Manual model selection per endpoint | One endpoint, automatic routing |
| No visibility into what you're spending | Real-time dashboard with cost breakdown |
| One model fails → your app fails | Automatic fallback to next-best model |

## Bring Your Own Models

Ecate doesn't ship with hardcoded model rankings. **You configure the models you have access to** — whether that's OpenAI, Anthropic, local Ollama, or any OpenAI-compatible endpoint — and Ecate learns how to route between them.

```yaml
# config.yaml (simplified; see Quick Start for the full schema): your models, your providers, your costs
providers:
  - id: anthropic
    models: [claude-sonnet-4, claude-haiku]
  - id: openai
    models: [gpt-4o, gpt-4o-mini]
  - id: local
    base_url: http://localhost:11434/v1
    models: [llama3, mistral]
```

When you run calibration, Ecate evaluates **your specific models** on benchmark tasks and learns a routing vector for each one. This means:

- **No generic rankings** — routing is tuned to how *your* models actually perform
- **Mix any providers** — compare Claude vs GPT vs local models fairly
- **Your costs, your tradeoffs** — routing trades off each model's learned quality against the prices you configure

```
┌─────────────────────────────────────────────────────────────┐
│  Calibration: "How does each of MY models handle this?"     │
│                                                             │
│  Task: "Write a SQL query..."                               │
│    → claude-sonnet: ✓ pass                                  │
│    → gpt-4o-mini:   ✓ pass                                  │
│    → llama3:        ✗ fail                                  │
│                                                             │
│  Result: learned vectors that reflect YOUR model lineup     │
└─────────────────────────────────────────────────────────────┘
```

After calibration, Ecate routes each request to the best model *from your configured set*:

```
Your app  →  Ecate  →  Best model (from YOUR models)
                ↓
         "What's 2+2?"  →  gpt-4o-mini ($0.0002)
         "Design a distributed system"  →  claude-sonnet ($0.015)
         "Analizza questa architettura"  →  gpt-4o ($0.008)   (Italian: "Analyze this architecture")
```

## How Routing Works

Ecate uses **embedding-based routing** with vectors learned from calibration:

1. **Embed the prompt** locally (~10ms) using a multilingual model (512d)
2. **Score each model** via dot product with that model's learned vector
3. **Apply strategy** to balance quality vs cost
4. **Route to optimal model** from your configured set

```
Prompt → embed(512d) → dot(model_vectors) → score - λ*log(cost) → select
```

The key insight: models that excel at certain prompt types will have vectors that align with those prompts in embedding space. Calibration *learns* these vectors for your specific models.
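Steps 2 and 4 reduce to a dot product per model followed by an argmax. The sketch below illustrates that with hypothetical 4-dimensional vectors standing in for the 512d embeddings; the `model_vectors` dict is a stand-in, not the actual format of `data/model_vectors.json`, and the prompt vector is made up rather than produced by the real embedder.

```python
# Hypothetical toy data: 4-d vectors stand in for 512d embeddings and for the
# per-model vectors that calibration would learn.


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def score_models(prompt_embedding: list[float],
                 model_vectors: dict[str, list[float]]) -> dict[str, float]:
    """Score every model by the dot product of its vector with the prompt."""
    return {name: dot(prompt_embedding, vec) for name, vec in model_vectors.items()}


model_vectors = {
    "openai/gpt-4o-mini": [0.9, 0.1, 0.0, 0.0],
    "anthropic/claude-sonnet-4-20250514": [0.2, 0.9, 0.4, 0.1],
}
prompt_embedding = [0.1, 1.0, 0.5, 0.0]  # pretend output of the embedder

scores = score_models(prompt_embedding, model_vectors)
best = max(scores, key=scores.get)  # highest score, before any cost adjustment
print(best, scores)
```

Here the prompt points mostly along the second dimension, so the model whose vector also points that way scores highest (about 1.12 vs 0.19). Cost is applied afterwards by the routing strategy.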

## Dashboard

Access at `http://localhost:8000/dashboard`:

- **Overview** — Cost breakdown, request count, latency
- **Models** — Add/edit/remove models, view costs
- **Routing** — Recent routing decisions, model distribution
- **Calibrate** — Per-model calibration controls, run history

![Overview Dashboard](docs/images/overview.png)

## Installation

```bash
pip install ecate-llm
```

## Quick Start

### 1. Create a config file

```bash
curl -o config.yaml https://raw.githubusercontent.com/nullcline-labs/ecate.llm/main/config.example.yaml
```

Edit `config.yaml` with your API keys and models:

```yaml
providers:
  - id: anthropic
    type: anthropic
    base_url: https://api.anthropic.com
    api_key: ${ANTHROPIC_API_KEY}  # Or hardcode your key
    models:
      - id: claude-sonnet-4-20250514
        input_cost_per_mtok: 3.0
        output_cost_per_mtok: 15.0

  - id: openai
    type: openai
    base_url: https://api.openai.com/v1
    api_key: ${OPENAI_API_KEY}
    models:
      - id: gpt-4o
        input_cost_per_mtok: 2.5
        output_cost_per_mtok: 10.0
      - id: gpt-4o-mini
        input_cost_per_mtok: 0.15
        output_cost_per_mtok: 0.6
```
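The `input_cost_per_mtok` and `output_cost_per_mtok` fields are dollar prices per million tokens. As a worked example of what they imply per request, here is a minimal sketch using the prices from the config above; the token counts are made up, and whether Ecate computes its logged costs exactly this way is an assumption.

```python
# Prices copied from the config above, in USD per million tokens: (input, output).
PRICES = {
    "anthropic/claude-sonnet-4-20250514": (3.0, 15.0),
    "openai/gpt-4o": (2.5, 10.0),
    "openai/gpt-4o-mini": (0.15, 0.6),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request from per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical request: 1,000 prompt tokens, 200 completion tokens.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 1_000, 200):.5f}")
```

For that request size this works out to $0.006 for Claude Sonnet 4, $0.0045 for gpt-4o and $0.00027 for gpt-4o-mini.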

### 2. Start the server

```bash
ecate --config config.yaml
```

Server starts at `http://localhost:8000`. On first run, an API key is generated and saved to `data/default_api_key.txt`.

### 3. Calibrate your models

Open the dashboard at `http://localhost:8000/dashboard` → **Calibrate** tab → **Start Calibration**.

Or via API:
```bash
curl -X POST http://localhost:8000/api/calibration/start \
  -H "Authorization: Bearer $(cat data/default_api_key.txt)"
```

Calibration runs ~200 tasks per model and takes a few minutes. Cost: ~$2-5 depending on models.

### 4. Use it

Drop-in replacement for OpenAI. Just change the base URL:

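```python
from pathlib import Path

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key=Path("data/default_api_key.txt").read_text().strip(),
)

# Ecate routes automatically to the best model
response = client.chat.completions.create(
    model="auto",  # Let Ecate decide
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)
print(response.choices[0].message.content)

# Check which model was used: the routing decision is returned in a response
# header, which the SDK exposes through with_raw_response
raw = client.chat.completions.with_raw_response.create(
    model="auto",
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)
print(raw.headers.get("X-Ecate-Routed-To"))  # e.g. "openai/gpt-4o-mini"
print(raw.parse().choices[0].message.content)
```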

### Example: Cost-optimized routing

```python
# Simple question → routes to cheapest capable model
client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What's 2+2?"}]
)
# → routed to gpt-4o-mini ($0.0002)

# Complex task → routes to most capable model
client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Design a distributed cache system with consistency guarantees"}]
)
# → routed to claude-sonnet ($0.015)
```

### Example: Force a specific strategy

```python
messages = [{"role": "user", "content": "Explain the CAP theorem."}]

# Always use the highest-scoring model (ignore cost)
client.chat.completions.create(model="auto:best", messages=messages)

# Always use the cheapest model above the score threshold (score_threshold)
client.chat.completions.create(model="auto:cheap", messages=messages)

# Bypass routing entirely: direct passthrough
client.chat.completions.create(model="openai/gpt-4o", messages=messages)
```

## Routing Strategies

Use the `model` field to control routing:

| Model Value | Behavior |
|---|---|
| `"auto"` | Best tradeoff: `score - λ*log(cost)` |
| `"auto:best"` | Highest embedding score regardless of cost |
| `"auto:cheap"` | Cheapest model above score threshold |
| `"openai/gpt-4o"` | Direct passthrough — bypass routing |

The `cost_weight` parameter (λ, default 0.3) controls the cost/quality tradeoff in auto mode.
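The table can be read as a small dispatch on the `model` string. The sketch below is one way to implement it, assuming each candidate already has a calibrated score and a single `cost` figure; here that figure is a hypothetical blend (mean of input and output $/Mtok from the Quick Start config), since the README does not say which cost enters `log(cost)`. The candidate scores and the error raised when no model clears the threshold are also assumptions.

```python
import math

# Hypothetical candidates: calibrated score plus blended $/Mtok cost
# ((input + output) / 2, using the Quick Start prices).
CANDIDATES = {
    "openai/gpt-4o-mini": {"score": 0.40, "cost": 0.375},
    "openai/gpt-4o": {"score": 0.70, "cost": 6.25},
    "anthropic/claude-sonnet-4-20250514": {"score": 0.85, "cost": 9.0},
}


def route(model_field: str, candidates: dict[str, dict[str, float]],
          cost_weight: float = 0.3, score_threshold: float = 0.0) -> str:
    """Pick a target model from the request's `model` value."""
    if model_field != "auto" and not model_field.startswith("auto:"):
        return model_field  # e.g. "openai/gpt-4o": direct passthrough
    strategy = model_field.partition(":")[2] or "auto"
    if strategy == "best":  # highest score, cost ignored
        return max(candidates, key=lambda m: candidates[m]["score"])
    if strategy == "cheap":  # cheapest model that clears the threshold
        eligible = [m for m, c in candidates.items() if c["score"] >= score_threshold]
        if not eligible:
            raise LookupError("no model meets score_threshold")
        return min(eligible, key=lambda m: candidates[m]["cost"])
    if strategy == "auto":  # score - λ*log(cost)
        return max(candidates, key=lambda m: candidates[m]["score"]
                   - cost_weight * math.log(candidates[m]["cost"]))
    raise ValueError(f"unknown strategy: {strategy!r}")


print(route("auto", CANDIDATES))                             # openai/gpt-4o-mini
print(route("auto:best", CANDIDATES))                        # anthropic/claude-sonnet-4-20250514
print(route("auto:cheap", CANDIDATES, score_threshold=0.5))  # openai/gpt-4o
print(route("openai/gpt-4o", CANDIDATES))                    # openai/gpt-4o
```

Because cheap models have `log(cost) < 0`, the λ term raises their adjusted score, while models costing more than 1 unit are penalized. With `cost_weight=0.0`, `"auto"` degenerates to `"auto:best"`.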

## Calibration

Calibration teaches Ecate how each model performs:

1. **Run tasks** — Each model answers ~200 benchmark tasks
2. **Judge responses** — An LLM judge scores each response (pass/fail)
3. **Learn vectors** — Logistic regression learns a 512d vector per model
4. **Save vectors** — Stored in `data/model_vectors.json`
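Step 3 fits one logistic regression per model: the inputs are task-prompt embeddings and the label is the judge's pass (1) or fail (0). Because scikit-learn is a declared dependency, the real code may use `sklearn.linear_model.LogisticRegression`. The minimal sketch below instead writes out batch gradient descent in plain Python so it runs without extra packages. It fits no intercept, an assumption chosen so the learned vector can be used with a bare dot product at routing time. The 2-d toy data is made up.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_model_vector(embeddings: list[list[float]], passed: list[int],
                     lr: float = 0.5, epochs: int = 500,
                     l2: float = 0.01) -> list[float]:
    """Fit w so that sigmoid(w · x) approximates P(model passes task x).

    Batch gradient descent on the mean log-loss plus an L2 penalty, no intercept.
    """
    dim, n = len(embeddings[0]), len(embeddings)
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [l2 * wj for wj in w]  # gradient of the L2 term
        for x, y in zip(embeddings, passed):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x))) - y
            for j in range(dim):
                grad[j] += err * x[j] / n  # gradient of the mean log-loss
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Toy calibration results for one model: it passes tasks that lean towards
# dimension 0 and fails tasks that lean towards dimension 1.
task_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
pass_fail = [1, 1, 0, 0]
vector = fit_model_vector(task_embeddings, pass_fail)
print(vector)  # positive weight on dimension 0, negative on dimension 1
```

The resulting vector points towards the kind of prompt the model handles well, which is what makes the routing-time dot product meaningful.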

### Per-Model Calibration

In the dashboard, each model card shows:
- **Calibration toggle** — Include in batch calibration
- **Samples setting** — Tasks per model (default 200)
- **Calibrate button** — Run single-model calibration
- **Status badge** — "Ready" (green) or "Not calibrated" (orange)
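The calibration toggle and samples setting correspond to the `calibration_enabled` and `calibration_samples` fields in the Configuration Reference. A minimal sketch of how a batch run could turn them into a plan, assuming a model without `calibration_enabled` is included and a model without `calibration_samples` falls back to `calibration.samples_per_model` (both defaults are assumptions, not documented behavior):

```python
# Hypothetical in-memory view of config.yaml; field names follow the
# Configuration Reference, the defaults applied below are assumptions.
config = {
    "calibration": {"samples_per_model": 200},
    "providers": [
        {"id": "anthropic", "models": [
            {"id": "claude-sonnet-4-20250514", "calibration_samples": 100},
        ]},
        {"id": "openai", "models": [
            {"id": "gpt-4o", "calibration_enabled": False},
            {"id": "gpt-4o-mini"},
        ]},
    ],
}


def calibration_plan(cfg: dict) -> dict[str, int]:
    """Map "provider/model" to the number of tasks for a batch calibration run."""
    default_samples = cfg["calibration"]["samples_per_model"]
    plan: dict[str, int] = {}
    for provider in cfg["providers"]:
        for model in provider["models"]:
            if not model.get("calibration_enabled", True):  # toggle off: skip
                continue
            key = f"{provider['id']}/{model['id']}"
            plan[key] = model.get("calibration_samples", default_samples)
    return plan


print(calibration_plan(config))
# {'anthropic/claude-sonnet-4-20250514': 100, 'openai/gpt-4o-mini': 200}
```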

### Adding a New Model

To add a new model without re-calibrating everything:

1. Add the model to `config.yaml`
2. Go to Dashboard → Calibrate
3. Click "Calibrate" on the new model card
4. Wait ~5 minutes for ~200 evaluations
5. Model is now routable

### Custom Tasks

Add domain-specific tasks as YAML files in a directory, then point `calibration.custom_tasks_dir` at that directory:

```yaml
# my_tasks/customer_support.yaml
- id: support_001
  category: instruction_following
  difficulty: moderate
  language: it
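  # Italian prompt: "Reply to the customer who is asking for a refund for a
  # defective product bought 3 months ago."
  prompt: |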
    Rispondi al cliente che chiede un rimborso per un prodotto
    difettoso acquistato 3 mesi fa.
  expected_behavior: |
    Should acknowledge the issue, express empathy, and offer a solution.
```

```yaml
# config.yaml
calibration:
  custom_tasks_dir: ./my_tasks
```

## Multilingual Support

Ecate supports routing across languages:

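- **Embedder** — Uses `distiluse-base-multilingual-cased-v1`, a 512-dimensional multilingual sentence-transformers model (the sentence-transformers documentation lists 15 languages for v1; the v2 variant covers 50+)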
- **Language detection** — Automatically detects the language of incoming prompts
- **Per-model language filter** — Restrict which languages route to which models

### Language Filtering

By default, all models are considered for all languages. To restrict a model to specific languages:

```yaml
models:
  - id: gpt-4o
    # No supported_languages → routes for any language
    
  - id: local-italian-model
    supported_languages: [it]  # Only Italian prompts route here
    
  - id: multilingual-model
    supported_languages: [en, fr, de, es]  # Explicit allowlist
```

When a prompt comes in, Ecate:
1. Detects the language
2. Filters to models that support that language
3. Routes among the remaining candidates

This lets you mix specialized single-language models with general multilingual ones.

## Architecture

```
Client (any OpenAI SDK)
       │
       ▼
┌─────────────────────────────┐
│     Ecate Proxy :8000       │
│                             │
│  Auth → Language Detect     │
│           ↓                 │
│     Local Embedder (10ms)   │  ← distiluse 512d
│           ↓                 │
│     Vector Router           │  ← dot(prompt, model_vectors)
│           ↓                 │
│     Format Bridge           │  ← OpenAI ↔ Anthropic
│           ↓                 │
│     Usage Logger            │  ← Async SQLite writes
│                             │
│  Dashboard + API            │
│  data/model_vectors.json    │
│  data/ecate.db              │
└─────────────────────────────┘
       │
  ┌────┼────┐
  ▼    ▼    ▼
 Any  Any  Any     ← Anthropic, OpenAI, Ollama, vLLM, ...
```

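Single container. No Redis, no Postgres, no external services besides the model providers themselves: state lives in `data/ecate.db` (SQLite) and `data/model_vectors.json`.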

## Configuration Reference

```yaml
server:
  host: 0.0.0.0
  port: 8000

routing:
  default_strategy: auto          # auto | best | cheap
  vectors_path: data/model_vectors.json
  embedder_model: sentence-transformers/distiluse-base-multilingual-cased-v1
  cost_weight: 0.3                # λ for auto strategy
  score_threshold: 0.0            # Min score for cheap strategy
  enable_fallback: true
  max_retries: 1
  circuit_breaker_threshold: 3

calibration:
  judge_model: anthropic/claude-sonnet-4-20250514
  concurrency_per_provider: 3
  samples_per_model: 200
  custom_tasks_dir: null
  languages: [en, de, fr]          # Filter calibration tasks by language (null = all)

providers:
  - id: anthropic
    type: anthropic               # anthropic | openai | openai-compatible
    base_url: https://api.anthropic.com
    api_key: ${ANTHROPIC_API_KEY}
    models:
      - id: claude-sonnet-4-20250514
        input_cost_per_mtok: 3.0
        output_cost_per_mtok: 15.0
        max_context: 200000
        supports_tools: true
        supports_vision: true
        supported_languages: [en, de]  # Optional: only route these languages to this model
        calibration_enabled: true      # Include in calibration
        calibration_samples: 200       # Tasks for this model
```

## Resilience

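- **Automatic fallback** — If the selected model fails, the request is retried on the next-best model (`routing.enable_fallback`)
- **Circuit breaker** — After 3 failures within 5 minutes (`routing.circuit_breaker_threshold`), a model is temporarily removed from routing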
- **Per-key budgets** — Daily/monthly cost limits per API key
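The fallback and circuit-breaker behavior can be sketched in a few dozen lines. The threshold (3) and window (5 minutes) come from the list above. The cooldown length, the sliding-window bookkeeping, and the reading of `max_retries` as "extra models to try after the first" are assumptions, as are the class and function names. The clock is injectable so the logic can be tested without waiting.

```python
import time
from collections import deque
from collections.abc import Callable, Sequence


class CircuitBreaker:
    """Mark a model unavailable after `threshold` failures within `window` seconds."""

    def __init__(self, threshold: int = 3, window: float = 300.0,
                 cooldown: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold  # circuit_breaker_threshold
        self.window = window        # 5 minutes, per the list above
        self.cooldown = cooldown    # hypothetical: how long a model stays out
        self.clock = clock
        self.failures: dict[str, deque[float]] = {}
        self.open_until: dict[str, float] = {}

    def record_failure(self, model: str) -> None:
        now = self.clock()
        times = self.failures.setdefault(model, deque())
        times.append(now)
        while now - times[0] > self.window:  # forget failures outside the window
            times.popleft()
        if len(times) >= self.threshold:
            self.open_until[model] = now + self.cooldown  # open the circuit
            times.clear()

    def available(self, model: str) -> bool:
        until = self.open_until.get(model)
        return until is None or self.clock() >= until


def call_with_fallback(ranked: Sequence[str], send: Callable[[str], str],
                       breaker: CircuitBreaker,
                       max_retries: int = 1) -> tuple[str, str]:
    """Try the best available model, then up to `max_retries` next-best ones."""
    attempts = [m for m in ranked if breaker.available(m)][: 1 + max_retries]
    last_error: Exception | None = None
    for model in attempts:
        try:
            return model, send(model)
        except Exception as exc:  # a real proxy would catch provider errors only
            breaker.record_failure(model)
            last_error = exc
    raise RuntimeError("no candidate model succeeded") from last_error


def flaky_send(model: str) -> str:
    """Stand-in for a provider call in which Sonnet is down."""
    if model == "anthropic/claude-sonnet-4-20250514":
        raise ConnectionError("provider unavailable")
    return f"answer from {model}"


breaker = CircuitBreaker()
print(call_with_fallback(
    ["anthropic/claude-sonnet-4-20250514", "openai/gpt-4o"], flaky_send, breaker))
# ('openai/gpt-4o', 'answer from openai/gpt-4o')
```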

## Development

```bash
git clone https://github.com/nullcline-labs/ecate.llm
cd ecate.llm

# Install in editable mode with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run locally
python -m ecate.app --config config.yaml
```

## API Reference

```
# Proxy
POST /v1/chat/completions    # OpenAI-compatible
GET  /v1/models              # List available models

# Dashboard
GET  /api/overview
GET  /api/cost/by-model
GET  /api/cost/over-time
GET  /api/routing-table
GET  /api/requests

# Models
GET    /api/models
POST   /api/models
PUT    /api/models/{provider}/{model}
DELETE /api/models/{provider}/{model}
PUT    /api/models/{provider}/{model}/calibration

# Calibration
POST /api/calibration/start
GET  /api/calibration/progress
GET  /api/calibration/runs
GET  /api/calibration/runs/{id}/results

# Routing
GET  /api/routing/vectors

# Health
GET  /health
```

## License

AGPL-3.0
