Metadata-Version: 2.4
Name: evalkit-llm
Version: 0.1.0
Summary: Simple LLM/RAG evaluation framework for teams
Author-email: Ivan Diaz <feleir@gmail.com>
License: MIT
License-File: LICENSE.md
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.11
Requires-Dist: click>=8.0
Requires-Dist: pydantic>=2.0
Requires-Dist: python-dotenv>=1.0
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.25; extra == 'anthropic'
Provides-Extra: bleu
Requires-Dist: nltk>=3.8; extra == 'bleu'
Provides-Extra: dev
Requires-Dist: anthropic; extra == 'dev'
Requires-Dist: black>=25.1; extra == 'dev'
Requires-Dist: faiss-cpu; extra == 'dev'
Requires-Dist: httpx>=0.27; extra == 'dev'
Requires-Dist: langchain; extra == 'dev'
Requires-Dist: langchain-community; extra == 'dev'
Requires-Dist: langchain-core>=0.3; extra == 'dev'
Requires-Dist: langchain-openai; extra == 'dev'
Requires-Dist: nltk>=3.8; extra == 'dev'
Requires-Dist: ollama; extra == 'dev'
Requires-Dist: openai; extra == 'dev'
Requires-Dist: openai-agents>=0.1; extra == 'dev'
Requires-Dist: pre-commit>=3.6; extra == 'dev'
Requires-Dist: pydantic-ai>=0.2; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.4; extra == 'dev'
Requires-Dist: rouge-score>=0.1; extra == 'dev'
Requires-Dist: ruff>=0.11; extra == 'dev'
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.3; extra == 'langchain'
Provides-Extra: ollama
Requires-Dist: ollama>=0.4; extra == 'ollama'
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == 'openai'
Provides-Extra: openai-agents
Requires-Dist: openai-agents>=0.1; extra == 'openai-agents'
Provides-Extra: pydanticai
Requires-Dist: pydantic-ai>=0.2; extra == 'pydanticai'
Provides-Extra: rouge
Requires-Dist: rouge-score>=0.1; extra == 'rouge'
Provides-Extra: server
Requires-Dist: alembic>=1.13; extra == 'server'
Requires-Dist: fastapi>=0.111; extra == 'server'
Requires-Dist: jinja2>=3.1; extra == 'server'
Requires-Dist: python-multipart>=0.0.9; extra == 'server'
Requires-Dist: sqlalchemy>=2.0; extra == 'server'
Requires-Dist: uvicorn[standard]>=0.29; extra == 'server'
Description-Content-Type: text/markdown

# evalkit

A simple, provider-agnostic LLM/RAG evaluation framework for teams.

[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE.md)

## Why evalkit?

RAG systems are in production everywhere, but most teams evaluate them manually or not at all. Existing tools (RAGAS, DeepEval) are powerful but require understanding metrics theory and a pytest workflow. Enterprise platforms are expensive and complex.

evalkit is the middle ground: a Python library + web dashboard that evaluates RAG pipelines out of the box. Define test cases, pick a suite, get a plain-English report with fix suggestions. No pytest. No metrics theory. Works with OpenAI, Anthropic, or any local model via Ollama.

---

## Table of Contents

- [Install](#install)
- [Quick Start](#quick-start)
- [Local Setup & First Evaluation](#local-setup--first-evaluation)
- [How It Works](#how-it-works)
- [Evaluation Suites](#evaluation-suites)
- [Agent Tool-Call Evaluation](#agent-tool-call-evaluation)
- [Framework Trackers](#framework-trackers)
- [Judges](#judges)
- [Async Evaluation](#async-evaluation)
- [Prompt Engineering](#prompt-engineering)
- [Online Monitoring](#online-monitoring)
- [Running the Web Dashboard](#running-the-web-dashboard)
- [CLI Reference](#cli-reference)
- [Development Setup](#development-setup)
- [Running Tests](#running-tests)
- [Project Structure](#project-structure)
- [Roadmap](#roadmap)

---

## Install

```bash
# Core only (CLI, no LLM judge)
pip install evalkit-llm

# With OpenAI as judge
pip install "evalkit-llm[openai]"

# With Anthropic as judge
pip install "evalkit-llm[anthropic]"

# Deterministic text metrics (no judge needed)
pip install "evalkit-llm[rouge]"           # ROUGE score
pip install "evalkit-llm[bleu]"            # BLEU score

# Framework trackers (capture tool calls + RAG context automatically)
pip install "evalkit-llm[openai-agents]"   # OpenAI Agents SDK tracker
pip install "evalkit-llm[langchain]"       # LangChain tracker
pip install "evalkit-llm[pydanticai]"      # PydanticAI tracker

# With the web dashboard
pip install "evalkit-llm[server]"

# Everything (web UI + all providers + all trackers)
pip install "evalkit-llm[server,openai,anthropic,ollama,openai-agents,langchain,pydanticai]"

# Full dev toolchain (tests, examples, all providers)
pip install "evalkit-llm[server,openai,anthropic,dev]"
```

---

## Quick Start

```python
from evalkit import TestCase, EvaluationEngine, RAGQASuite
from evalkit.judges.openai import OpenAIJudge

# 1. Define test cases — question + context + your pipeline's answer
cases = [
    TestCase(
        question="What is the capital of France?",
        context=["France is a country in Western Europe. Its capital is Paris."],
        answer="The capital of France is Paris.",
        expected_answer="Paris",  # optional, needed for CompletenessMetric
    ),
    TestCase(
        question="Who invented the telephone?",
        context=["Alexander Graham Bell is credited with inventing the telephone in 1876."],
        answer="The telephone was invented by Alexander Graham Bell.",
    ),
]

# 2. Pick a judge and a suite
engine = EvaluationEngine(judge=OpenAIJudge(model="gpt-4o-mini"))
report = engine.run(test_cases=cases, suite=RAGQASuite())

# 3. Read the report
print(f"Average score: {report.summary.average_score:.0%}")
print(f"Passed: {report.summary.passed}/{report.summary.total_cases}")

for metric, score in report.summary.score_by_metric.items():
    print(f"  {metric:<35} {score:.0%}")

for failure in report.failures:
    print(f"\n[{failure.failure_type}] {failure.count} failures")
    print(f"  Fix: {failure.suggestion}")
```

**Example output** (illustrative; actual scores depend on the judge model):
```
Average score: 91%
Passed: 2/2
  faithfulness                        95%
  answer_relevancy                    88%
  contextual_relevancy                92%
  hallucination                       97%
  completeness                        84%
```

---

## Local Setup & First Evaluation

Step-by-step guide to go from zero to seeing evaluation results.

### 1. Install

```bash
# Clone the repo
git clone https://github.com/your-org/evalkit.git
cd evalkit

# Option A — editable install (recommended, run once)
pip install -e ".[openai]"

# Option B — no install, just set PYTHONPATH (dependencies still needed)
pip install pydantic click python-dotenv openai
export PYTHONPATH=src   # needed because code lives under src/
```

With Option A, run scripts directly: `python examples/openai_agent.py`.
With Option B, `PYTHONPATH` must be set, either exported once as shown above or prefixed per command: `PYTHONPATH=src python examples/openai_agent.py`.

For the full stack (server + all providers + tests):
```bash
pip install -e ".[server,openai,anthropic,dev]"
```

### 2. Set your API key

Create a `.env` file in the project root (see `.env.example`):

```bash
cp .env.example .env
# Edit .env and fill in your API key
```

Or set the variable directly:
```bash
# OpenAI
export OPENAI_API_KEY=sk-...

# Or Anthropic (if using AnthropicJudge)
export ANTHROPIC_API_KEY=sk-ant-...

# Or neither — use Ollama for fully local evaluation (no API key needed)
pip install "evalkit-llm[ollama]"
```

The CLI and all examples load `.env` automatically via `python-dotenv`.

### 3. Option A — Run evaluation from Python (no server)

```python
from evalkit import TestCase, EvaluationEngine, RAGQASuite
from evalkit.judges.openai import OpenAIJudge

cases = [
    TestCase(
        question="What is Python?",
        context=["Python is a high-level programming language created by Guido van Rossum."],
        answer="Python is a programming language.",
        expected_answer="Python is a high-level programming language.",
    ),
]

engine = EvaluationEngine(judge=OpenAIJudge(model="gpt-4o-mini"))
report = engine.run(test_cases=cases, suite=RAGQASuite())

print(f"Score: {report.summary.average_score:.0%}")
for metric, score in report.summary.score_by_metric.items():
    print(f"  {metric}: {score:.0%}")
```

### 3. Option B — Run evaluation from CLI

```bash
# Create a test_cases.json file
cat > test_cases.json << 'EOF'
[
  {
    "question": "What is Python?",
    "context": ["Python is a high-level programming language created by Guido van Rossum."],
    "answer": "Python is a programming language.",
    "expected_answer": "Python is a high-level programming language."
  }
]
EOF

# Run evaluation
evalkit run --input test_cases.json --suite rag_qa --judge openai --output report.json

# View the report
evalkit report report.json
```

### 3. Option C — Run with server + dashboard

```bash
# Terminal 1 — start the server
evalkit serve
# → Server starts at http://localhost:8000, DB auto-created at ./evalkit.db

# Terminal 2 — run the example
python examples/server_rag.py
# → Creates a project, submits RAG test cases, polls for results

# Open http://localhost:8000 in your browser to see the dashboard
```

### 4. Run with Ollama (fully local, no API key)

```bash
# Install and start Ollama (https://ollama.com)
ollama pull gpt-oss

# Run evaluation with Ollama as judge
evalkit run --input test_cases.json --suite rag_qa --judge ollama --model gpt-oss
```

### 5. Verify your setup

```bash
# Run unit tests (no API keys needed)
pytest tests/unit/ -v

# Run a bundled example (requires OPENAI_API_KEY)
python examples/openai_agent.py
```

---

## How It Works

```
TestCase(s) ──► EvaluationEngine ──► Suite / Metrics ──► Judge (LLM) ──► EvaluationReport
                                                              │
                                              OpenAI structured outputs (chat.completions.parse)
                                              Anthropic structured outputs (messages.parse)
                                              Azure OpenAI structured outputs
                                              Ollama (ollama package)
```

1. **Test cases** — each `TestCase` holds a question, retrieved context chunks, the pipeline's answer, and (optionally) an expected answer.
2. **Engine** — `EvaluationEngine` runs metrics concurrently (thread pool for sync, `asyncio` for async).
3. **Metrics** — each metric builds a prompt and sends it to the judge. The judge returns a `JudgeResponse` with `score` (0–1) and `reason`.
4. **Structured outputs** — `OpenAIJudge` uses `chat.completions.parse(response_format=JudgeResponse)`; `AnthropicJudge` uses `messages.parse(output_format=JudgeResponse)`. Both guarantee type-safe responses without text parsing.
5. **Report** — `ReportGenerator` aggregates scores, groups failures by metric, and attaches plain-English fix suggestions.
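
The five steps can be mimicked end to end with a toy, offline sketch. Everything in it is a stand-in: `Case`, `stub_judge`, the two prompt builders, and the 0.7 threshold are hypothetical, and the loop is sequential where evalkit's engine runs metrics concurrently. It only illustrates how data flows from a test case to an aggregated report.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JudgeResponse:
    score: float  # 0-1
    reason: str


@dataclass
class Case:
    question: str
    context: list
    answer: str


def faithfulness_prompt(case: Case) -> str:
    return f"Context: {case.context}\nAnswer: {case.answer}\nIs every claim supported?"


def relevancy_prompt(case: Case) -> str:
    return f"Question: {case.question}\nAnswer: {case.answer}\nDoes the answer address the question?"


# metric name -> (prompt builder, pass threshold); the threshold is an assumed value
METRICS = {
    "faithfulness": (faithfulness_prompt, 0.7),
    "answer_relevancy": (relevancy_prompt, 0.7),
}


def stub_judge(prompt: str) -> JudgeResponse:
    """Stand-in for an LLM call so the sketch runs offline."""
    score = 0.5 if "supported" in prompt else 0.9
    return JudgeResponse(score=score, reason="stubbed response")


def run(cases, judge):
    scores = {name: [] for name in METRICS}
    failures = {}  # metric name -> number of cases below threshold
    for case in cases:
        for name, (build_prompt, threshold) in METRICS.items():
            response = judge(build_prompt(case))
            scores[name].append(response.score)
            if response.score < threshold:
                failures[name] = failures.get(name, 0) + 1
    score_by_metric = {name: mean(values) for name, values in scores.items()}
    return score_by_metric, failures


cases = [Case("What is the capital of France?", ["Its capital is Paris."], "Paris.")]
score_by_metric, failures = run(cases, stub_judge)
```

Here the stub deliberately scores the faithfulness prompt below the threshold, so `failures` ends up grouping one failure under `faithfulness`.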

---

## Evaluation Suites

Suites are named collections of metrics. Use a built-in or compose your own.

### Suite enum

Built-in suites are identified by the `Suite` StrEnum, which restricts suite selection to valid options at the type level. `Suite` is backward-compatible with plain strings (`Suite.RAG_QA == "rag_qa"` is `True`).

```python
from evalkit import Suite

Suite.RAG_QA             # "rag_qa"
Suite.DOCUMENT_SEARCH    # "document_search"
Suite.CONVERSATIONAL     # "conversational"
Suite.AGENT_TOOL_CALL    # "agent_tool_call"
Suite.LIVE_QA            # "live_qa"
Suite.ANSWER_SIMILARITY  # "answer_similarity"
```

Use `Suite` values with the CLI, `EvalKitClient`, and the REST API:

```python
client.evaluate(project.id, test_cases, suite=Suite.RAG_QA)
```

### Built-in suites

| Suite | Metrics | Use when |
|---|---|---|
| `RAGQASuite()` | Faithfulness + Answer Relevancy + Contextual Relevancy + Hallucination + Completeness | General RAG Q&A evaluation |
| `DocumentSearchSuite()` | Faithfulness + Contextual Relevancy | Retrieval quality focus |
| `ConversationalSuite()` | Answer Relevancy + Faithfulness + Hallucination | Chat/conversational RAG |
| `AgentToolCallSuite()` | Tool Selection + Tool Parameter + Tool Parameter Similarity | Agent tool-call evaluation |
| `LiveQASuite()` | Answer Relevancy + Contextual Relevancy | Live evaluation without ground truth |
| `AnswerSimilaritySuite()` | Exact Match + String Containment + String Similarity | Fast deterministic checks (no judge) |

### Built-in metrics

**LLM-judged** (require a judge provider):

| Metric | What it measures | `expected_answer` needed? |
|---|---|---|
| `FaithfulnessMetric` | Are all answer claims supported by context? | No |
| `AnswerRelevancyMetric` | Does the answer address the question? | No |
| `ContextualRelevancyMetric` | Is the retrieved context relevant to the question? | No |
| `HallucinationMetric` | Does the answer contain facts not in context? | No |
| `CompletenessMetric` | Does the answer cover the expected answer? | Yes (skipped if absent) |

**Deterministic** (no judge needed — fast, free, reproducible):

| Metric | What it measures | Dependencies |
|---|---|---|
| `ExactMatchMetric` | Does the answer match the expected answer exactly? (normalized) | None |
| `StringContainmentMetric` | Does the answer contain the expected answer? | None |
| `StringSimilarityMetric` | Levenshtein edit-distance similarity (0–1) | None |
| `RougeMetric` | ROUGE recall against the expected answer (configurable variant: rouge1, rouge2, rougeL) | `evalkit-llm[rouge]` |
| `BleuMetric` | BLEU n-gram precision with brevity penalty | `evalkit-llm[bleu]` |

All deterministic metrics require `expected_answer`. The three zero-dependency metrics (exact match, containment, similarity) are bundled in `AnswerSimilaritySuite`.
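
To make the three zero-dependency checks concrete, here is a minimal standalone sketch. The function names and the normalization rule (lowercase, collapse whitespace) are assumptions for illustration; evalkit's own implementations may normalize and score differently.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization rule)."""
    return " ".join(text.lower().split())


def exact_match(answer: str, expected: str) -> float:
    return 1.0 if normalize(answer) == normalize(expected) else 0.0


def containment(answer: str, expected: str) -> float:
    return 1.0 if normalize(expected) in normalize(answer) else 0.0


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, kept to one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(answer: str, expected: str) -> float:
    """Edit distance rescaled to 0-1, where 1.0 means identical strings."""
    a, b = normalize(answer), normalize(expected)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

With the Quick Start case, `containment("The capital of France is Paris.", "Paris")` passes while `exact_match` on the same pair fails, which is why the checks are useful together.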

### Custom metrics

```python
from evalkit.metrics.base import BaseMetric
from evalkit.models import TestCase

class ToneMetric(BaseMetric):
    name = "tone"
    threshold = 0.7

    def _build_prompt(self, test_case: TestCase) -> str:
        return (
            f"Question: {test_case.question}\n"
            f"Answer: {test_case.answer}\n\n"
            "Is the tone of this answer professional and helpful? "
            "Score 1.0 for excellent tone, 0.0 for unprofessional. "
            'Respond as JSON: {"score": <0-1>, "reason": "<explanation>"}'
        )

# Use with engine
from evalkit import EvaluationEngine
engine = EvaluationEngine(judge=my_judge)
report = engine.run(test_cases=cases, metrics=[ToneMetric(threshold=0.8)])
```

### Custom suites

```python
from evalkit.suites.base import EvaluationSuite
from evalkit.metrics.faithfulness import FaithfulnessMetric
from evalkit.metrics.hallucination import HallucinationMetric

my_suite = EvaluationSuite(
    name="strict_qa",
    description="High-trust Q&A — faithfulness + hallucination only",
    metrics=[FaithfulnessMetric(threshold=0.8), HallucinationMetric(threshold=0.9)],
)

report = engine.run(test_cases=cases, suite=my_suite)
```

---

## Agent Tool-Call Evaluation

evalkit can evaluate whether agents call the right tools with the correct parameters in the correct order. This is useful for testing function-calling agents, tool-using pipelines, and any system where the LLM must choose and invoke tools.

### How it works

Each `TestCase` carries optional `expected_tool_calls` and `actual_tool_calls` fields. Each tool call is a `ToolCall(name, parameters)` object. Three metrics cover different aspects of correctness:

| Metric | What it measures | LLM needed? |
|---|---|---|
| `tool_selection` | Did the agent call the right tools in the right order? | No (deterministic) |
| `tool_parameter` | Did the agent pass the correct parameter keys and values? | No (deterministic) |
| `tool_parameter_similarity` | Are parameter values semantically equivalent (e.g. "Paris" vs "paris, france")? | Yes (LLM-judged) |
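
The two deterministic checks can be sketched in a few lines. The scoring formulas below (fraction of matching positions, fraction of matching key/value pairs) and the function names are assumptions for illustration, and `ToolCall` here is a local stand-in rather than evalkit's class.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Local stand-in for evalkit's ToolCall(name, parameters)."""
    name: str
    parameters: dict = field(default_factory=dict)


def tool_selection_score(expected: list, actual: list) -> float:
    """Fraction of positions where the tool name matches (order-sensitive)."""
    if not expected and not actual:
        return 1.0
    matches = sum(e.name == a.name for e, a in zip(expected, actual))
    return matches / max(len(expected), len(actual))


def tool_parameter_score(expected: list, actual: list) -> float:
    """Fraction of expected key/value pairs the agent reproduced exactly."""
    total = sum(len(e.parameters) for e in expected)
    if total == 0:
        return 1.0
    hits = 0
    for e, a in zip(expected, actual):
        if e.name != a.name:
            continue  # parameters of a wrong tool never count
        hits += sum(
            key in a.parameters and a.parameters[key] == value
            for key, value in e.parameters.items()
        )
    return hits / total
```

A strict comparison like this scores `{"city": "paris, france"}` as a miss against `{"city": "Paris"}`; that is the gap the LLM-judged `tool_parameter_similarity` metric is described as covering.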

### Example test case

```json
{
  "question": "What's the weather in Paris?",
  "expected_tool_calls": [
    {"name": "get_weather", "parameters": {"city": "Paris"}}
  ],
  "actual_tool_calls": [
    {"name": "get_weather", "parameters": {"city": "Paris"}}
  ]
}
```

### Quick usage

```python
from evalkit import TestCase, ToolCall, EvaluationEngine, AgentToolCallSuite
from evalkit.judges.openai import OpenAIJudge

test_cases = [TestCase(
    question="What's the weather?",
    expected_tool_calls=[ToolCall(name="get_weather", parameters={"city": "Paris"})],
    actual_tool_calls=[ToolCall(name="get_weather", parameters={"city": "Paris"})],
)]
engine = EvaluationEngine(judge=OpenAIJudge())
report = engine.run(test_cases, suite=AgentToolCallSuite())
```

See the agent example files for end-to-end workflows:
- [`examples/openai_agent.py`](examples/openai_agent.py) — OpenAI Agents SDK
- [`examples/langchain_agent.py`](examples/langchain_agent.py) — LangChain
- [`examples/pydantic_ai_agent.py`](examples/pydantic_ai_agent.py) — PydanticAI

---

## Framework Trackers

Framework trackers automatically capture tool calls, RAG context, and answers from your agent framework — no manual extraction or decorator stacking needed. Each tracker uses its framework's native hook/callback system.

### OpenAI Agents SDK

```python
from agents import Agent, Runner, function_tool
from evalkit.contrib.openai import OpenAIAgentTracker
from evalkit.judges.openai import OpenAIJudge

@function_tool
def get_weather(city: str) -> str:
    """Get the current weather."""
    return f"Sunny in {city}"

agent = Agent(name="Assistant", tools=[get_weather])

# Run with tracker — captures everything automatically
tracker = OpenAIAgentTracker()
Runner.run_sync(agent, "What's the weather in Paris?", hooks=tracker)

# Access captured data
tracker.question     # "What's the weather in Paris?"
tracker.tool_calls   # [ToolCall(name="get_weather", parameters={"city": "Paris"})]
tracker.context      # ["Sunny in Paris"]
tracker.answer       # "The weather in Paris is sunny."

# Evaluate locally (auto-selects suite based on available data)
report = tracker.evaluate(judge=OpenAIJudge())
report.print_report()

# Or upload to server (no judge needed — uses server's default)
evaluation = tracker.upload("My Project", server_url="http://localhost:8000")
```

Requires: `pip install "evalkit-llm[openai-agents]"`

### LangChain

```python
from evalkit.contrib.langchain import LangChainTracker

tracker = LangChainTracker()
# `chain` is your existing LangChain runnable and `question` is its input
chain.invoke(question, config={"callbacks": [tracker]})

tracker.tool_calls   # captured via on_tool_start/end callbacks
tracker.context      # captured via on_retriever_end + on_tool_end
tracker.answer       # captured via on_chain_end
```

Requires: `pip install "evalkit-llm[langchain]"`

### PydanticAI

```python
from pydantic_ai import Agent
from evalkit.contrib.pydanticai import PydanticAITracker

agent = Agent("openai:gpt-4o-mini")  # placeholder: use your own agent with its tools registered
result = agent.run_sync("What's the weather in Paris?")
tracker = PydanticAITracker(result)  # post-hoc extraction from RunResult

tracker.tool_calls   # extracted from ToolCallPart messages
tracker.context      # extracted from ToolReturnPart messages
tracker.answer       # from result.output
```

Requires: `pip install "evalkit-llm[pydanticai]"`

### Fallback: ToolCallTracker

For any framework, or when you need decorator-based tracking:

```python
from evalkit.contrib import ToolCallTracker

tracker = ToolCallTracker()

@tracker.wrap
def my_tool(query: str) -> str:
    return f"Result for {query}"

with tracker.capture("What is Python?") as run:
    my_tool(query="python")
    run.answer = "Python is a language."
    run.context = ["Python is a programming language."]

run.evaluate(judge=my_judge)
```

### Four tiers

All trackers share the same `BaseTracker` protocol, which offers four levels of convenience:

```python
# 1. Raw data — no judge, no server
tc = tracker.to_test_case(expected_tool_calls=[...])

# 2. Local evaluation — needs a judge
report = tracker.evaluate(judge=OpenAIJudge())

# 3. Batch evaluation — server's judge, one evaluation run
evaluation = tracker.upload("My Project")

# 4. Online monitoring — push as a trace for time-series dashboards
client.log_trace(project.id, tc, suite=Suite.AGENT_TOOL_CALL, metadata={"model": "gpt-4o"})
```

All framework examples support `--server` (batch evaluation) and `--monitor` (trace mode):

```bash
python examples/openai_agent.py --server    # batch evaluation
python examples/openai_agent.py --monitor   # push traces to monitoring dashboard
```

### LiveQASuite

When evaluating live runs without ground truth, `LiveQASuite` checks only answer relevancy and contextual relevancy:

```python
from evalkit import LiveQASuite

# Auto-selected when no expected_tool_calls or expected_answer are provided
report = tracker.evaluate(judge=my_judge)  # uses LiveQASuite automatically

# Or explicitly
report = tracker.evaluate(judge=my_judge, suite=LiveQASuite())
```
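
A rough sketch of what such auto-selection could look like. Only the fallback to `live_qa` is stated above; the other two branches, their order, and the function name `pick_suite` are assumptions, not evalkit's actual logic.

```python
from typing import Optional


def pick_suite(expected_tool_calls: Optional[list], expected_answer: Optional[str]) -> str:
    """Return a suite name based on which ground truth is available."""
    if expected_tool_calls:
        return "agent_tool_call"  # assumed: tool expectations take priority
    if expected_answer:
        return "rag_qa"           # assumed: a reference answer enables completeness
    return "live_qa"              # stated above: no ground truth at all
```

Passing `suite=` explicitly, as in the second call above, avoids depending on whichever selection rule the tracker applies.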

---

## Judges

### OpenAI (structured outputs)

Uses `chat.completions.parse(response_format=JudgeResponse)` — the model is constrained to always return a typed `score` + `reason` object.

```python
from evalkit.judges import OpenAIJudge

judge = OpenAIJudge(
    model="gpt-4o-mini",       # any model supporting structured outputs
    api_key="sk-...",           # or set OPENAI_API_KEY env var
    temperature=0.0,
    max_tokens=1024,
)
```

Requires: `pip install "evalkit-llm[openai]"`

### Azure OpenAI

Same `openai` package and the same structured outputs, but with the `AzureOpenAI` client.

```python
from evalkit.judges import AzureOpenAIJudge

judge = AzureOpenAIJudge(
    model="gpt-4o-mini",                                    # Azure deployment name
    azure_endpoint="https://my-resource.openai.azure.com",   # or set AZURE_OPENAI_ENDPOINT
    api_key="...",                                           # or set AZURE_OPENAI_API_KEY
    api_version="2024-10-21",
)
```

Requires: `pip install "evalkit-llm[openai]"`

### Anthropic (structured outputs)

Uses `messages.parse(output_format=JudgeResponse)` — the model always returns a typed `score` + `reason` object.

```python
from evalkit.judges import AnthropicJudge

judge = AnthropicJudge(
    model="claude-haiku-4-5-20251001",
    api_key="sk-ant-...",       # or set ANTHROPIC_API_KEY env var
    max_tokens=1024,
)
```

Requires: `pip install "evalkit-llm[anthropic]"`

### Ollama (local models)

No API key required. Uses the `ollama` package with structured output parsing.

```python
from evalkit.judges import OllamaJudge

judge = OllamaJudge(
    model="gpt-oss",
    base_url="http://localhost:11434",  # default
)
```

Requires: `pip install "evalkit-llm[ollama]"`

### Custom judge

Subclass `BaseJudge` and implement `evaluate()`. Async support is provided for free.

```python
from evalkit.judges.base import BaseJudge
from evalkit.models import JudgeResponse

class MyJudge(BaseJudge):
    def evaluate(self, prompt: str) -> JudgeResponse:
        # call your LLM, parse the result...
        return JudgeResponse(score=0.9, reason="Well supported by context")
```

---

## Async Evaluation

```python
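import asyncio

from evalkit import EvaluationEngine, RAGQASuite
from evalkit.judges.openai import OpenAIJudge

# `cases` is a list of TestCase objects, as in Quick Start
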
async def main():
    engine = EvaluationEngine(judge=OpenAIJudge(), concurrency=8)
    report = await engine.run_async(test_cases=cases, suite=RAGQASuite())
    print(report.summary.average_score)

asyncio.run(main())
```
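
For intuition, a `concurrency` limit of this kind is commonly enforced with a semaphore around each judge call, with blocking calls pushed to worker threads. The sketch below is a generic pattern under that assumption, not evalkit's internals; `slow_judge` and `score_all` are hypothetical names.

```python
import asyncio


def slow_judge(prompt: str) -> float:
    """Stand-in for a blocking judge call; returns a dummy score."""
    return min(1.0, len(prompt) / 100)


async def score_all(prompts: list, concurrency: int = 8) -> list:
    limit = asyncio.Semaphore(concurrency)

    async def score_one(prompt: str) -> float:
        async with limit:  # at most `concurrency` judge calls in flight
            return await asyncio.to_thread(slow_judge, prompt)

    # gather preserves input order, so scores line up with prompts
    return await asyncio.gather(*(score_one(p) for p in prompts))


scores = asyncio.run(score_all(["short prompt", "a somewhat longer prompt"], concurrency=2))
```

Wrapping a synchronous `evaluate()` this way is one plausible reading of how a custom judge could get async support without extra code.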

---

## Prompt Engineering

All prompts live in `src/evalkit/prompts/` as standalone Python files for easy review and customization. The design follows best practices from the [Anthropic](https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-prompting-best-practices) and [OpenAI](https://developers.openai.com/api/docs/guides/prompt-guidance) prompt engineering guides.

### System prompt

All judges share a single system prompt (`prompts/system.py`) that establishes the evaluator role, scoring calibration, and evidence-grounding requirements. This means every metric benefits from consistent behavior regardless of which judge provider you use.

### Metric prompt templates

Each LLM-judged metric has its own prompt template file:

| File | Metric | Key evaluation focus |
|---|---|---|
| `prompts/faithfulness.py` | Faithfulness | Are answer claims supported by context? |
| `prompts/answer_relevancy.py` | Answer Relevancy | Does the answer address the question? |
| `prompts/contextual_relevancy.py` | Contextual Relevancy | Is the retrieved context relevant? |
| `prompts/hallucination.py` | Hallucination | Does the answer fabricate information? |
| `prompts/completeness.py` | Completeness | Does the answer cover the expected answer? |
| `prompts/tool_parameter_similarity.py` | Tool Parameter Similarity | Are tool parameters semantically equivalent? |

### Prompt structure

Every metric prompt uses XML sections for unambiguous parsing:

- **`<task>`** — what the judge should evaluate
- **`<data>`** — the test case inputs (question, context, answer) in labeled tags
- **`<rubric>`** — concrete scoring anchors (what 1.0, 0.5, 0.0 mean) with a reasoning scaffold ("first identify, then check")

This structure prevents the model from confusing instructions with input data and produces more consistent, calibrated scores.

### Faithfulness vs Hallucination

These two metrics are related but distinct:

- **Faithfulness** checks grounding: is each claim in the answer *supported* by the context?
- **Hallucination** checks fabrication: does the answer *invent* facts, entities, or numbers not in the context?

A faithful answer has all its claims backed by context. A non-hallucinating answer avoids making things up. An answer can be unfaithful (making unsupported general claims) without hallucinating (inventing specific false facts).

### Customizing prompts

To modify a prompt, edit the `TEMPLATE` string in the corresponding file under `src/evalkit/prompts/`. The templates use Python `str.format()` with named placeholders (`{question}`, `{context}`, `{answer}`, etc.). With an editable install (`pip install -e`), changes take effect the next time the module is imported; no rebuild is needed.
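
When editing a template, keep in mind that `str.format()` treats every single brace as a placeholder, so literal JSON braces must be doubled. The template below is a hypothetical tone-checking prompt written in the `<task>` / `<data>` / `<rubric>` layout described above; it is not one of the bundled files.

```python
# Hypothetical template; the bundled prompts may be worded differently.
TEMPLATE = """<task>
Rate how professional the tone of the answer is.
</task>

<data>
<question>{question}</question>
<answer>{answer}</answer>
</data>

<rubric>
1.0 = consistently professional, 0.5 = mixed, 0.0 = unprofessional.
Respond as JSON: {{"score": <0-1>, "reason": "<explanation>"}}
</rubric>"""

# Named placeholders are filled in; doubled braces come out as single braces.
prompt = TEMPLATE.format(
    question="What is Python?",
    answer="Python is a programming language.",
)
print(prompt)
```

Forgetting to double the braces around the JSON example would raise a `KeyError` or `ValueError` at format time rather than silently producing a bad prompt.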

---

## Online Monitoring

evalkit supports two modes: **batch evaluation** (run N test cases, get a report) and **online monitoring** (push individual traces from production, view quality over time). Both modes share the same `TestCase` model and evaluation suites.

### Concept: Traces

A **trace** is a single production LLM interaction stored with a timestamp and metadata. Unlike batch evaluations (which group test cases into a single run), traces are pushed individually over time and displayed as a time series.

Each trace contains:
- A `TestCase` (question, answer, context — the same model used everywhere in evalkit)
- A timestamp (when the interaction happened in your system)
- Metadata (key-value pairs for filtering: model version, environment, user segment, etc.)

The server evaluates each trace using the configured suite and displays scores in the monitoring dashboard.

### Pushing traces via the SDK

```python
from evalkit import EvalKitClient, Suite, TestCase

client = EvalKitClient("http://localhost:8000")
project = client.create_project("Production RAG")

# Push a single trace — server evaluates it with the specified suite
trace_id = client.log_trace(
    project.id,
    TestCase(
        question="What is Python?",
        context=["Python is a high-level programming language."],
        answer="Python is a programming language created by Guido van Rossum.",
    ),
    suite=Suite.RAG_QA,
    metadata={"model": "gpt-4o", "environment": "prod"},
)

# Push a batch of traces
traces = [
    {
        "test_case": {"question": "Q1", "context": ["..."], "answer": "A1"},
        "metadata": {"model": "gpt-4o"},
    },
    {
        "test_case": {"question": "Q2", "context": ["..."], "answer": "A2"},
        "metadata": {"model": "gpt-4o-mini"},
    },
]
trace_ids = client.log_traces(project.id, traces, suite=Suite.RAG_QA)

# Query traces
traces, total = client.get_traces(project.id, start="2026-03-24T00:00:00Z", limit=20)
```

### Monitoring dashboard

The monitoring view is at `/projects/{id}/monitor`. It shows:

- **Time-series chart** — average score over time (grouped by hour or day)
- **Anomaly detection** — buckets are flagged when the score falls more than 2 standard deviations below the mean or drops 15% or more from the previous period
- **Summary cards** — total traces, average score, anomaly count
- **Trace table** — paginated list with question, answer, score, and metadata badges
- **Drill-down** — click any trace to see full question, answer, context, and per-metric scores

Filter by date range (24h / 7d / 30d) and metric name.
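
The two anomaly rules can be sketched as follows. This is an illustration under assumptions: the mean and (population) standard deviation are taken over the whole window, and the function name `flag_anomalies` is hypothetical; the server's actual computation may differ.

```python
from statistics import mean, pstdev


def flag_anomalies(scores: list, z: float = 2.0, drop: float = 0.15) -> list:
    """Return indexes of buckets that trip either anomaly rule."""
    if not scores:
        return []
    mu, sigma = mean(scores), pstdev(scores)
    flagged = []
    for i, score in enumerate(scores):
        # Rule 1: more than `z` standard deviations below the window mean
        below_band = sigma > 0 and score < mu - z * sigma
        # Rule 2: relative drop of `drop` or more versus the previous bucket
        previous = scores[i - 1] if i > 0 else None
        sharp_drop = (
            previous is not None
            and previous > 0
            and (previous - score) / previous >= drop
        )
        if below_band or sharp_drop:
            flagged.append(i)
    return flagged


hourly = [0.91, 0.90, 0.92, 0.89, 0.91, 0.60, 0.90]
print(flag_anomalies(hourly))  # → [5]
```

In this sample only the 0.60 bucket is flagged; it trips both rules, while the recovery to 0.90 afterwards is not treated as an anomaly.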

### REST API for traces

```bash
# Push traces (server evaluates with the specified suite)
curl -X POST http://localhost:8000/api/v1/traces \
  -H "Content-Type: application/json" \
  -d '{
    "project_id": 1,
    "suite": "rag_qa",
    "traces": [
      {
        "test_case": {
          "question": "What is Python?",
          "context": ["Python is a high-level programming language."],
          "answer": "Python is a programming language."
        },
        "metadata": {"model": "gpt-4o", "environment": "prod"}
      }
    ]
  }'
# → {"count": 1, "trace_ids": [42]}

# List traces
curl "http://localhost:8000/api/v1/traces?project_id=1&limit=10"

# Get single trace with scores
curl http://localhost:8000/api/v1/traces/42

# Get monitoring time series with anomaly detection
# Params: interval (5m|15m|1h|6h|1d), metric (filter by name), start/end (ISO timestamps)
curl "http://localhost:8000/api/v1/projects/1/monitor?interval=1h&metric=faithfulness"
```

### Using with framework trackers

All framework examples support `--monitor` to push traces instead of batch evaluations:

```bash
# Push each agent interaction as a trace
python examples/openai_agent.py --monitor
python examples/langchain_agent.py --monitor http://my-server:9000
python examples/pydantic_ai_agent.py --monitor
```

### Batch evaluation vs online monitoring

| | Batch Evaluation | Online Monitoring |
|---|---|---|
| **Use case** | CI/CD, regression testing | Production observability |
| **Data model** | `Evaluation` with N test cases | Individual `Trace` per interaction |
| **Dashboard** | Score timeline + per-run drill-down | Time-series chart + anomaly detection |
| **Endpoint** | `POST /api/v1/evaluate` | `POST /api/v1/traces` |
| **CLI flag** | `--server` | `--monitor` |
| **Server evaluates?** | Yes (background task) | Yes (background task) |

---

## Running the Web Dashboard

```bash
# Install server dependencies
pip install "evalkit-llm[server,openai]"

# Set your API key in .env (or export it)
echo "OPENAI_API_KEY=sk-..." >> .env

# Start the server (defaults to http://localhost:8000)
evalkit serve
```

### Server configuration

The server is configured via **CLI flags** or **environment variables**. All settings have sensible defaults — the only thing you need is an API key for your chosen judge.

| Setting | CLI flag | Env var | Default |
|---|---|---|---|
| Judge provider | `--judge` / `-j` | `EVALKIT_JUDGE_PROVIDER` | `openai` |
| Judge model | `--model` / `-m` | `EVALKIT_JUDGE_MODEL` | per-provider default |
| Evaluation suite | `--suite` / `-s` | `EVALKIT_SUITE` | `rag_qa` |
| Database URL | `--db` | `DATABASE_URL` | `sqlite:///./evalkit.db` |
| Host | `--host` | — | `127.0.0.1` |
| Port | `--port` / `-p` | — | `8000` |

```bash
# Use Anthropic as judge with the conversational suite
export ANTHROPIC_API_KEY=sk-ant-...
evalkit serve --judge anthropic --suite conversational

# Or use environment variables
EVALKIT_JUDGE_PROVIDER=anthropic EVALKIT_SUITE=conversational evalkit serve

# Use Ollama for fully local evaluation (no API key needed)
evalkit serve --judge ollama --model gpt-oss

# Custom host/port/database
evalkit serve --host 0.0.0.0 --port 9000 --db postgresql://user:pass@localhost/evalkit
```
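
When a setting is given both as a flag and as an environment variable, a common convention is flag first, then environment, then default. The sketch below assumes that precedence (it is not confirmed above) and uses the env var names and defaults from the table; `resolve` and `SETTINGS` are hypothetical names.

```python
import os
from typing import Mapping, Optional

# setting -> (env var name, default), copied from the configuration table
SETTINGS = {
    "judge": ("EVALKIT_JUDGE_PROVIDER", "openai"),
    "suite": ("EVALKIT_SUITE", "rag_qa"),
    "db": ("DATABASE_URL", "sqlite:///./evalkit.db"),
}


def resolve(cli_args: Mapping[str, Optional[str]],
            env: Optional[Mapping[str, str]] = None) -> dict:
    """Assumed precedence: explicit CLI flag, then env var, then default."""
    env = os.environ if env is None else env
    return {
        key: cli_args.get(key) or env.get(env_name) or default
        for key, (env_name, default) in SETTINGS.items()
    }


config = resolve({"judge": "anthropic"}, env={"EVALKIT_SUITE": "conversational"})
```

Here `config` takes the judge from the flag, the suite from the environment, and the database URL from the default.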

These defaults apply to all evaluations submitted via the REST API. You can **override any setting per-request** by including it in the API call (e.g., pass `"judge_provider": "anthropic"` to use a different judge for one evaluation).

The API key is read from the environment (`OPENAI_API_KEY` / `ANTHROPIC_API_KEY`) — you can also pass it per-request via the `api_key` field.

The database is auto-created on first startup (SQLite by default at `./evalkit.db`).

### Check server config

```bash
curl http://localhost:8000/api/v1/config
# → {"judge_provider": "openai", "judge_model": null, "suite": "rag_qa",
#    "available_suites": ["rag_qa", "document_search", "conversational"],
#    "available_providers": ["openai", "azure", "anthropic", "ollama"]}
```

The dashboard has two views per project:

**Evaluations** (batch mode):
- Browse evaluation runs (paginated, 10 per page) with score timeline chart
- Compare evaluations side-by-side — select two or more runs and see metric deltas
- Drill into individual test cases and their per-metric scores
- Auto-collected git tags (commit, branch) attached to each evaluation

**Monitor** (online mode):
- Time-series score chart with anomaly markers
- Filter by date range (24h / 7d / 30d) and metric name
- Paginated trace table with metadata badges
- Drill into individual traces — full question, answer, context, and scores

### REST API — full workflow

```bash
# 1. Create a project
curl -X POST http://localhost:8000/api/v1/projects \
  -H "Content-Type: application/json" \
  -d '{"name": "My RAG App", "description": "Production Q&A pipeline"}'
# → {"id": 1, "name": "My RAG App", "description": "Production Q&A pipeline"}

# 2. List projects
curl http://localhost:8000/api/v1/projects
# → [{"id": 1, "name": "My RAG App", "description": "..."}]

# 3. Submit an evaluation (returns immediately, runs in background)
curl -X POST http://localhost:8000/api/v1/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "project_id": 1,
    "suite": "rag_qa",
    "judge_provider": "openai",
    "test_cases": [
      {
        "question": "What is Python?",
        "context": ["Python is a high-level programming language."],
        "answer": "Python is a programming language."
      }
    ]
  }'
# → {"evaluation_id": 1, "status": "pending"}

# 4. Poll for results
curl http://localhost:8000/api/v1/evaluations/1
# → {"id": 1, "status": "complete", "average_score": 0.91, ...}

# 5. Download full report
curl http://localhost:8000/evaluations/1/report

# 6. Compare evaluations
curl "http://localhost:8000/api/v1/projects/1/compare?ids=1,2&baseline=1"

# 7. Manage tags
curl -X PUT http://localhost:8000/api/v1/evaluations/1/tags \
  -H "Content-Type: application/json" \
  -d '{"tags": {"environment": "staging", "remove_me": null}}'
```
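The dashboard describes comparison as metric deltas between runs, and step 6 names one run as the baseline. As a rough illustration of that arithmetic only (the compare endpoint's response format is not documented here), the sketch below subtracts baseline scores from candidate scores. The function name `metric_deltas` and the plain-dictionary input shape are assumptions made for this example.

```python
def metric_deltas(
    baseline: dict[str, float], candidate: dict[str, float]
) -> dict[str, float]:
    """Return candidate-minus-baseline for every metric present in both runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    # Round to keep floating-point noise out of the displayed deltas.
    return {name: round(candidate[name] - baseline[name], 4) for name in shared}


# Hypothetical per-metric averages for two evaluation runs.
baseline_scores = {"faithfulness": 0.78, "hallucination": 0.45}
candidate_scores = {"faithfulness": 0.84, "hallucination": 0.60, "completeness": 0.70}

print(metric_deltas(baseline_scores, candidate_scores))
```

Metrics that appear in only one run (here `completeness`) are skipped, since there is nothing to compare them against.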

**Evaluate request fields:**

| Field | Required | Default | Description |
|---|---|---|---|
| `project_id` | Yes | — | Project to attach the evaluation to |
| `suite` | No | `"rag_qa"` | Suite enum value: `rag_qa`, `document_search`, `conversational`, `agent_tool_call`, `live_qa`, or `answer_similarity` |
| `judge_provider` | No | `"openai"` | Provider: `openai`, `azure`, `anthropic`, or `ollama` |
| `judge_model` | No | per-provider | Model name override |
| `api_key` | No | env var | API key (falls back to `OPENAI_API_KEY` / `ANTHROPIC_API_KEY`) |
| `test_cases` | Yes | — | List of test case objects (see format below) |
| `tags` | No | auto-detected | Custom key-value tags (git commit/branch added automatically) |

### Python SDK — `EvalKitClient`

Instead of raw HTTP calls, use the typed SDK client:

```python
from evalkit import EvalKitClient, Suite, TestCase

client = EvalKitClient("http://localhost:8000")

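# Test cases to evaluate
test_cases = [
    TestCase(
        question="What is Python?",
        context=["Python is a high-level programming language."],
        answer="Python is a programming language.",
    )
]

# Create a project
project = client.create_project("My RAG App")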

# Option 1: server-side evaluation (server runs the judge)
evaluation = client.evaluate(project.id, test_cases, suite=Suite.RAG_QA)
result = client.wait_for_evaluation(evaluation.id)
print(f"Score: {result.average_score:.0%}")

# Option 2: upload local results (you ran the evaluation locally)
from evalkit import EvaluationEngine, RAGQASuite
from evalkit.judges.openai import OpenAIJudge

engine = EvaluationEngine(judge=OpenAIJudge())
report = engine.run(test_cases, suite=RAGQASuite())
report.print_report()  # formatted summary to stdout

# upload_report() creates or reuses a project by name, then uploads
evaluation = client.upload_report("My RAG App", report)
print(f"Dashboard: http://localhost:8000/evaluations/{evaluation.id}")
```

```python
# Option 3: online monitoring (push traces for production observability)
# Reuses `client` and `project` from the example above
from evalkit import Suite, TestCase

trace_id = client.log_trace(
    project.id,
    TestCase(question="What is X?", context=["X is Y."], answer="X is Y."),
    suite=Suite.RAG_QA,
    metadata={"model": "gpt-4o", "environment": "prod"},
)
# View at: http://localhost:8000/projects/{id}/monitor
```

See [`examples/server_rag.py`](examples/server_rag.py), [`examples/server_agent.py`](examples/server_agent.py), and [`examples/server_combined.py`](examples/server_combined.py) for complete server workflows. Framework examples support `--server` (batch) and `--monitor` (traces).

---

## CLI Reference

### `evalkit run` — run evaluation from a file

```bash
# From a JSON file of test cases
evalkit run --input test_cases.json --suite rag_qa --judge openai --output report.json

# From a CSV file (columns: question,context,answer,expected_answer)
evalkit run --input test_cases.csv --suite rag_qa --judge anthropic --model claude-haiku-4-5-20251001

# Custom model
evalkit run --input cases.json --judge openai --model gpt-4o --output report.json

# Azure OpenAI (requires AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY env vars)
evalkit run --input test_cases.json --suite rag_qa --judge azure --model gpt-4o-mini
```

**Test case JSON format:**
```json
[
  {
    "question": "What is X?",
    "context": ["chunk 1", "chunk 2"],
    "answer": "X is ...",
    "expected_answer": "X"
  }
]
```

**Test case CSV format:**
```
question,context,answer,expected_answer
"What is X?","chunk1|chunk2","X is...","X"
```
In CSV files, separate multiple context chunks within the `context` column with a pipe (`|`).
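To make the relationship between the two formats concrete, here is a small sketch that converts CSV rows into the JSON test case structure by splitting the `context` column on the pipe character. It uses only the standard library; `load_csv_test_cases` is a hypothetical helper written for this example and is not evalkit's own loader, which may handle edge cases differently.

```python
import csv
import io
import json


def load_csv_test_cases(text: str) -> list[dict]:
    """Parse CSV text into a list of test case dicts (JSON format above)."""
    cases = []
    for row in csv.DictReader(io.StringIO(text)):
        case = {
            "question": row["question"],
            # The pipe character separates multiple context chunks.
            "context": [chunk for chunk in row["context"].split("|") if chunk],
            "answer": row["answer"],
        }
        # Include expected_answer only when the column has a value.
        if row.get("expected_answer"):
            case["expected_answer"] = row["expected_answer"]
        cases.append(case)
    return cases


sample_csv = (
    "question,context,answer,expected_answer\n"
    '"What is X?","chunk1|chunk2","X is...","X"\n'
)
print(json.dumps(load_csv_test_cases(sample_csv), indent=2))
```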

### `evalkit serve` — start the web dashboard

```bash
evalkit serve [--judge PROVIDER] [--model MODEL] [--suite SUITE] [--db URL] [--host HOST] [--port PORT]
```

### `evalkit report` — display a saved report

```bash
# Plain text summary
evalkit report report.json

# JSON output
evalkit report report.json --format json
```

### Report Suggestions

evalkit reports include contextual improvement suggestions that adapt to your results:

**Severity Levels:**

| Level | When | Meaning |
|-------|------|---------|
| CRITICAL | Score < 50% of threshold | Fundamental issue: check configuration immediately |
| WARNING | Score from 50% up to (but not including) 85% of threshold | Moderate issue: review and improve |
| MILD | Score at least 85% of threshold but still below it | Near miss: fine-tune for better results |

**Example output:**

```
═══ evalkit Report: rag_qa ═══
Overall Score: 0.62 | 3/5 passed

Score by metric:
  faithfulness                     78.00%  (4/5 passed)
  hallucination                    45.00%  (2/5 passed)  [!]
  answer_relevancy                 71.00%  (3/5 passed)

Suggestions (1 groups):

  [hallucination] CRITICAL (3 failures, avg: 0.15)
    High hallucination rate: the LLM is fabricating facts ...
    Patterns detected:
      - 67% of failures involve contexts with >5 chunks
    LLM Analysis:
      Failures share a pattern: when context has numerical data,
      the model invents statistics.
      Priority fix: add "only cite numbers in the context" to your prompt.
```

**Enabling LLM analysis:**

When you use a judge (OpenAI, Anthropic, Ollama), the same judge is used to analyze failure patterns and suggest targeted fixes. This adds one LLM call per failure group.

```python
from evalkit import EvaluationEngine, RAGQASuite
from evalkit.judges.openai import OpenAIJudge

# test_cases: the list of TestCase objects you want to evaluate
judge = OpenAIJudge()
engine = EvaluationEngine(judge=judge)
report = engine.run(test_cases, suite=RAGQASuite())
# report.failures will include LLM-powered analysis
```

To skip LLM analysis, use deterministic suites or don't provide a judge.

---

## Development Setup

```bash
# Clone and install in editable mode with all extras
git clone https://github.com/your-org/evalkit.git
cd evalkit
pip install -e ".[server,openai,anthropic,dev]"
```

### Environment variables

All variables can be set in a `.env` file in the project root — the CLI and examples load it automatically. See `.env.example` for a template.

| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | — | Required for `OpenAIJudge` |
| `AZURE_OPENAI_API_KEY` | — | Required for `AzureOpenAIJudge` |
| `AZURE_OPENAI_ENDPOINT` | — | Azure endpoint URL for `AzureOpenAIJudge` |
| `ANTHROPIC_API_KEY` | — | Required for `AnthropicJudge` |
| `DATABASE_URL` | `sqlite:///./evalkit.db` | Database for the web server |
| `EVALKIT_JUDGE_PROVIDER` | `openai` | Default judge for the server (`openai`, `azure`, `anthropic`, `ollama`) |
| `EVALKIT_JUDGE_MODEL` | per-provider | Default model for the server |
| `EVALKIT_SUITE` | `rag_qa` | Default evaluation suite for the server (`rag_qa`, `document_search`, `conversational`, `agent_tool_call`, `live_qa`, `answer_similarity`) |

---

## Running Tests

```bash
# All unit tests (no API keys required — uses MockJudge)
pytest tests/unit/ -v

# Integration tests (no API keys — uses answer_similarity suite + in-memory SQLite)
pytest tests/integration/ -v

# Everything
pytest tests/ -v

# With coverage
pytest tests/ -v --cov=src/evalkit --cov-report=term-missing
```

Unit tests use a `MockJudge` that returns deterministic responses. Integration tests spin up the FastAPI server with an in-memory SQLite database and use the `answer_similarity` suite (deterministic, no judge) to run full end-to-end evaluation and trace lifecycles. No API keys are needed for any of the tests.

---

## Project Structure

```
evalkit/
├── .env.example             # Template for environment variables
├── src/evalkit/
│   ├── __init__.py          # Public API re-exports
│   ├── models.py            # Pydantic: TestCase, MetricScore, EvaluationReport, …
│   ├── engine.py            # EvaluationEngine (sync + async, concurrent)
│   ├── client.py            # EvalKitClient SDK — typed server client (stdlib only)
│   ├── judges/
│   │   ├── base.py          # BaseJudge ABC
│   │   ├── openai.py        # OpenAIJudge + AzureOpenAIJudge (structured outputs)
│   │   ├── anthropic.py     # AnthropicJudge (structured outputs via messages.parse)
│   │   └── ollama.py        # OllamaJudge — local models, ollama package
│   ├── metrics/
│   │   ├── base.py          # BaseMetric ABC
│   │   ├── faithfulness.py
│   │   ├── answer_relevancy.py
│   │   ├── contextual_relevancy.py
│   │   ├── hallucination.py
│   │   ├── completeness.py
│   │   ├── tool_selection.py
│   │   ├── tool_parameter.py
│   │   ├── tool_parameter_similarity.py
│   │   ├── exact_match.py       # Deterministic — no judge needed
│   │   ├── string_containment.py
│   │   ├── string_similarity.py # Levenshtein distance
│   │   ├── rouge.py             # Requires evalkit-llm[rouge]
│   │   └── bleu.py              # Requires evalkit-llm[bleu]
│   ├── prompts/             # Prompt templates (one file per metric + shared system prompt)
│   │   ├── system.py        # Shared judge system prompt
│   │   ├── faithfulness.py
│   │   ├── answer_relevancy.py
│   │   ├── contextual_relevancy.py
│   │   ├── hallucination.py
│   │   ├── completeness.py
│   │   └── tool_parameter_similarity.py
│   ├── suites/              # EvaluationSuite + RAGQASuite, LiveQASuite, AgentToolCallSuite, …
│   ├── contrib/             # Framework trackers
│   │   ├── __init__.py      # BaseTracker, ToolCallTracker (framework-agnostic fallback)
│   │   ├── tracker.py       # BaseTracker ABC, TrackerRun, ToolCallTracker
│   │   ├── openai.py        # OpenAIAgentTracker (requires evalkit-llm[openai-agents])
│   │   ├── langchain.py     # LangChainTracker (requires evalkit-llm[langchain])
│   │   └── pydanticai.py    # PydanticAITracker (requires evalkit-llm[pydanticai])
│   ├── reports/             # ReportGenerator — aggregation, failure grouping, suggestions
│   ├── cli.py               # Click CLI: evalkit run / serve / report
│   └── server/              # FastAPI + HTMX web dashboard (evalkit-llm[server])
│       ├── app.py
│       ├── db/              # SQLAlchemy ORM + session (projects, evaluations, traces)
│       ├── routes/          # dashboard, projects, evaluations, traces, REST API
│       │   ├── traces.py    # Trace ingest, query, monitor API + HTML routes
│       │   └── ...
│       └── templates/       # Jinja2 + Tailwind CSS + HTMX
│           ├── monitor.html # Monitoring dashboard (time-series, anomalies)
│           ├── trace.html   # Individual trace detail
│           └── ...
├── tests/
│   ├── conftest.py          # MockJudge, shared fixtures
│   ├── unit/                # Unit tests, no API keys needed
│   └── integration/         # Server integration tests (in-memory SQLite, no API keys)
├── examples/
│   ├── openai_agent.py        # OpenAI Agents SDK (--server / --monitor)
│   ├── langchain_agent.py     # LangChain (--server / --monitor)
│   ├── pydantic_ai_agent.py   # PydanticAI (--server / --monitor)
│   ├── deterministic.py       # Deterministic metrics, no judge needed
│   ├── server_rag.py          # Server-side RAG evaluation via SDK
│   ├── server_agent.py        # Local agent eval + server upload
│   └── server_combined.py     # Combined RAG + agent server workflow
└── pyproject.toml
```

---

## Roadmap

Planned features and improvements — contributions welcome.

### Authentication & Security
- [ ] Dashboard authentication (OAuth2, JWT, or API key-based) for production deployments
- [ ] Per-project access control and team roles
- [ ] API key management for server-to-server workflows

### Evaluation Engine
- [ ] Retry with backoff on transient LLM/judge failures
- [ ] Configurable rate limiting on judge API calls (token bucket)
- [ ] Streaming progress for large batch evaluations (SSE)
- [ ] Multi-model comparison — run the same test cases against multiple judges in one call

### Dashboard & API
- [ ] Evaluation deletion and archiving (API + UI)
- [ ] Export reports in CSV and HTML formats
- [ ] Webhook notifications on evaluation completion (Slack, email, custom URL)
- [ ] Test case set management — save, version, and reuse test case collections

### Integrations
- [ ] GitHub Actions action — run evalkit in CI and post results as PR comments
- [ ] GitLab CI template
- [ ] Pytest plugin — run evalkit as part of your test suite

### Metrics
- [ ] Latency and cost tracking per evaluation
- [ ] Custom metric templates (tone, safety, format compliance)
- [ ] Multi-turn conversation evaluation

---

## License

[MIT](LICENSE.md)
