v0.1.0  ·  Open Source  ·  MIT License

pytest for AI.

Evaluate LLM outputs and AI agent behavior with a zero-dependency Python framework. Neutral, extensible, and not owned by any AI company.

$ pip install rubric-eval
Zero required dependencies
Native pytest integration
First-class agent trace eval
Works with any LLM
Local HTML reports
Why Rubric

Built for developers.
Not for any AI company.

After Promptfoo joined OpenAI, the community needed a truly independent evaluation framework. Rubric is that answer — open source forever, no cloud required, no lock-in.

01

First-Class Agent Evaluation

Most eval frameworks only check the final output. Rubric evaluates the entire agent run — which tools were called, in what order, what the reasoning trace looks like, how long it took, and what it cost. You can require specific tools, forbid others, and penalize loops or redundant calls.

02

Works With Any LLM

Rubric is model-agnostic. Pass any Python callable as your judge function: OpenAI, Anthropic, Ollama, a local model, or a mock. No API keys are required unless you use LLMJudge or GEval; for those, pass your own judge function or let Rubric auto-detect a provider from your API key environment variables.

03

Zero Required Dependencies

The entire core (string matching, agent metrics, results, CLI, and HTML reports) ships with no mandatory dependencies. Install extras only when you need them: pip install rubric-eval[semantic] for embeddings, or rubric-eval[openai] / rubric-eval[anthropic] for LLM judging.

04

Native pytest Integration

Drop the rubric_eval fixture into any test file. Your LLM evals run inside the same pytest session as your unit tests — same CI pipeline, same runner, same output. No separate eval server or dashboard login needed.

05

Interactive HTML Reports

Every eval run can produce a self-contained HTML report — no server, no build step, just a file. Filter results by pass/fail, drill into per-metric score breakdowns with explanations, and see exact inputs and outputs side-by-side. Share it as a single file.

06

Truly Independent

MIT licensed. Not a product of OpenAI, Anthropic, Google, or any model provider. Rubric will never have a financial incentive to favor one model over another. Evaluate any model with the same unbiased framework — that's the whole point.

What Makes Rubric Different

The metrics no one else ships.

These are the metrics that matter when you're deploying a real agent — not just a chatbot. Rubric ships them out of the box.

Tool Usage
Tool Call Accuracy & Efficiency
Verify not just what your agent did, but how well it did it. Two separate metrics cover correctness and quality of tool use.
ToolCallAccuracy — assert that every required tool was called, no forbidden tools were used, and optionally that they were called in the correct order. Score degrades proportionally to missing or unexpected tools.
ToolCallEfficiency — detect redundant calls (same tool, same args, called twice), failed tool invocations, and individual tools that exceeded a latency budget. Produces a composite efficiency score.
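To make the two scores concrete, here is a minimal plain-Python sketch of how such checks could be computed. The function names and the scoring rule (share of checks passed) are assumptions for illustration only; this page does not publish Rubric's actual formula, and the sketch does not use the Rubric API.

```python
import json
from collections import Counter


def tool_call_accuracy(called, expected, forbidden=(), check_order=False):
    """Hypothetical rule: share of checks passed (not Rubric's actual formula)."""
    called_set = set(called)
    missing = [tool for tool in expected if tool not in called_set]
    violations = [tool for tool in forbidden if tool in called_set]
    checks = len(expected) + len(forbidden)
    failures = len(missing) + len(violations)
    if check_order and not missing:
        # One extra check: expected tools must appear in order as a
        # subsequence of the actual calls (other calls may sit in between).
        remaining = iter(called)
        in_order = all(tool in remaining for tool in expected)
        checks += 1
        failures += 0 if in_order else 1
    return 1.0 if checks == 0 else (checks - failures) / checks


def redundant_calls(tool_calls):
    """Return names of tools called more than once with identical arguments."""
    seen = Counter(
        (name, json.dumps(args, sort_keys=True)) for name, args in tool_calls
    )
    return sorted({name for (name, _), count in seen.items() if count > 1})


score = tool_call_accuracy(
    called=["search_flights", "book_flight"],
    expected=["search_flights", "check_availability", "book_flight"],
    forbidden=["send_email"],
)
print(score)  # 0.75: one of four checks failed (check_availability missing)
```

Serializing arguments with sorted keys means two calls count as identical even when their argument dictionaries were built in a different order.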
Safety
Safety Compliance
Scan every agent output and tool call for real-world safety violations before they reach users.
PII detection — flags Social Security numbers, credit card patterns, email addresses, and phone numbers leaking through responses.
Dangerous SQL — catches DROP, DELETE, TRUNCATE, and other destructive patterns in tool arguments before they execute.
Forbidden tool enforcement — fails the test if any tool on your deny-list was invoked, regardless of what the agent was instructed.
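The checks above can be approximated with a handful of regular expressions. The sketch below is an assumption-laden illustration, not Rubric's implementation: the patterns are deliberately simple (US-style SSNs and phone numbers, 13 to 16 digit card numbers), and the names `PATTERNS`, `scan`, and `denied_tools_used` are hypothetical.

```python
import re

# Simplified, illustrative patterns; a production scanner would need more care.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "dangerous_sql": re.compile(
        r"\b(?:DROP|TRUNCATE)\s+TABLE\b|\bDELETE\s+FROM\b", re.IGNORECASE
    ),
}


def scan(text):
    """Return the sorted names of every pattern found in the text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))


def denied_tools_used(called, deny_list):
    """Return the deny-listed tools that were actually invoked."""
    return sorted(set(called) & set(deny_list))


print(scan("Contact jane@example.com, SSN 123-45-6789"))  # ['email', 'ssn']
print(scan("DROP TABLE users;"))                          # ['dangerous_sql']
print(denied_tools_used(["search_flights", "send_email"], ["send_email"]))
```

To cover tool arguments as well as outputs, the same `scan` function could be run over each argument value serialized to a string.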
Reasoning
Reasoning & Trace Quality
Evaluate the quality of the agent's thinking, not just its final answer. Two metrics cover different angles of the same problem.
TraceQuality — analyzes the full reasoning trace for circular loops, repeated steps, and dead-end paths. Penalizes agents that cycle through the same actions without making progress.
ReasoningQuality — measures the ratio of reasoning steps to tool calls, and checks whether the agent updated its plan based on observations — a sign of genuine multi-step thinking.
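As a rough illustration of both ideas, the sketch below finds immediately repeated blocks of steps in a trace and computes a reasoning-to-tool-call ratio. It assumes a trace is a flat list of step labels; the real trace format and scoring are not described on this page, so `find_loops` and `reasoning_ratio` are hypothetical helpers.

```python
def find_loops(steps, max_period=3):
    """Return (start, period) pairs where a block of steps repeats back to back."""
    loops = []
    for period in range(1, max_period + 1):
        for start in range(len(steps) - 2 * period + 1):
            first = steps[start:start + period]
            second = steps[start + period:start + 2 * period]
            if first == second:
                loops.append((start, period))
    return loops


def reasoning_ratio(kinds):
    """Ratio of 'thought' steps to 'tool' steps, or None if no tools were called."""
    thoughts = kinds.count("thought")
    tools = kinds.count("tool")
    return thoughts / tools if tools else None


trace = ["search", "read", "search", "read", "answer"]
print(find_loops(trace))  # [(0, 2)]: the search/read pair repeats once
print(reasoning_ratio(["thought", "tool", "thought", "tool", "tool"]))
```

A metric built on this could subtract a fixed penalty per detected loop, but any such weighting would be a design choice rather than something stated here.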
RAG
Context Utilization
For RAG pipelines: verify the agent actually used what it retrieved, rather than ignoring context and hallucinating.
ContextUtilization — checks that the final answer is grounded in the retrieved context. Catches the common failure mode where an agent fetches documents but generates a response that doesn't reference them at all.
Works on any RAG architecture — pass the retrieved context as trace steps in your AgentTestCase and Rubric handles the rest.
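One simple way to approximate "did the answer use the context" is word overlap. The sketch below assumes a lexical heuristic (share of the answer's content words that also appear in the retrieved passages); Rubric may well use a different method, and `content_words`, `context_overlap`, and the stopword list are hypothetical.

```python
import re

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to"}


def content_words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {word for word in words if word not in STOPWORDS}


def context_overlap(answer, contexts):
    """Share of the answer's content words that occur in any retrieved passage."""
    answer_words = content_words(answer)
    if not answer_words:
        return 0.0
    context_words = set()
    for passage in contexts:
        context_words |= content_words(passage)
    return len(answer_words & context_words) / len(answer_words)


retrieved = ["The Eiffel Tower stands 330 metres high in Paris."]
print(context_overlap("The Eiffel Tower is 330 metres tall.", retrieved))  # 0.8
print(context_overlap("Bananas are yellow.", retrieved))                   # 0.0
```

A low overlap does not prove hallucination, and a high one does not prove correctness; it only flags answers that ignore the retrieved text entirely.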
Quick Start

Up and running
in 5 minutes.

Install, define test cases, apply metrics, read the report.

basic_eval.py
import rubriceval as rubric

# Any function that returns a string
def my_llm(prompt):
    return "The capital of France is Paris."

results = rubric.evaluate(
    test_cases=[
        rubric.TestCase(
            input="What is the capital of France?",
            actual_output=my_llm("What is the capital of France?"),
            expected_output="Paris",
        )
    ],
    metrics=[
        rubric.Contains("Paris"),
        rubric.NotContains("I don't know"),
        rubric.SemanticSimilarity(threshold=0.8),  # needs the extra: pip install rubric-eval[semantic]
    ],
    output_html="report.html",
)
test_llm.py — pytest
import rubriceval as rubric

# `agent` below is your own agent object, set up elsewhere in your test module.

def test_agent_books_flight(rubric_eval):
    rubric_eval.add_case(
        rubric.AgentTestCase(
            input="Book a flight to Tokyo",
            actual_output=agent.run("Book a flight to Tokyo"),
            expected_tools=["search_flights", "book_flight"],
            tool_calls=agent.tool_calls,
        ),
        metrics=[
            rubric.ToolCallAccuracy(),
            rubric.LatencyMetric(max_ms=5000),
        ],
    )
    # auto-asserts on teardown
agent_eval.py — full agent eval
import rubriceval as rubric

# `agent` and `agent_output` come from your own agent run (not shown here).

results = rubric.evaluate(
    test_cases=[
        rubric.AgentTestCase(
            name="flight booking",
            input="Book a flight to Tokyo",
            actual_output=agent_output,
            expected_tools=[
                "search_flights",
                "check_availability",
                "book_flight",
            ],
            forbidden_tools=["send_email"],
            tool_calls=agent.tool_calls,
            trace=agent.trace,
            latency_ms=1240,
            cost_usd=0.004,
        )
    ],
    metrics=[
        rubric.ToolCallAccuracy(check_order=True),
        rubric.ToolCallEfficiency(),
        rubric.TraceQuality(penalize_loops=True),
        rubric.SafetyCompliance(),
        rubric.ReasoningQuality(),
        rubric.ContextUtilization(),
        rubric.LatencyMetric(max_ms=5000),
        rubric.CostMetric(max_cost_usd=0.01),
    ],
    output_html="agent_report.html",
)

results.print_summary()
Output

Every run produces a report
your team can actually use.

A single self-contained HTML file. No server. No login. Filter by pass/fail, drill into agent traces, inspect tool calls.

agent_report.html
Metrics Library

17 metrics across 4 categories.

Mix and match. Extend with your own by subclassing BaseMetric.

String Matching · No deps
ExactMatch
Binary exact string comparison with optional case-insensitive mode. Returns 1.0 or 0.0. Best for structured outputs with a known correct answer.
Contains
Checks that the output contains a substring, or all or any of the items in a list. Set require_all=True to require every item to be present.
NotContains
Passes only when the output does NOT contain the given string. Essential for safety guardrails — catch refusal phrases, hallucinated names, or forbidden content.
RegexMatch
Validate that the output matches a regex pattern. Ideal for format checks: dates, phone numbers, JSON structure, email addresses, or custom codes.
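For readers who want to see what these checks amount to, here is a plain-Python sketch of the "all or any items" and regex format checks. It does not call Rubric; `contains` and `matches_format` are hypothetical helpers, and using a full-string match for the regex check is an assumption.

```python
import re


def contains(output, items, require_all=True):
    """True if all (or, with require_all=False, any) items occur in the output."""
    hits = [item in output for item in items]
    return all(hits) if require_all else any(hits)


def matches_format(output, pattern):
    """True if the whole output matches the pattern (full match, by assumption)."""
    return re.fullmatch(pattern, output) is not None


answer = "The capital of France is Paris."
print(contains(answer, ["Paris", "France"]))                     # True
print(contains(answer, ["Paris", "Berlin"]))                     # False
print(contains(answer, ["Paris", "Berlin"], require_all=False))  # True
print(matches_format("2024-05-17", r"\d{4}-\d{2}-\d{2}"))        # True
```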
Semantic · requires [semantic]
SemanticSimilarity
Embeds both the output and expected answer using sentence-transformers and computes cosine similarity. Catches correct answers phrased differently. Configurable threshold (default 0.8).
RougeScore
Measures n-gram overlap between output and a reference text. The standard metric for summarization quality. Supports ROUGE-1, ROUGE-2, and ROUGE-L.
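The two underlying calculations are short enough to show directly. This sketch uses toy vectors in place of real sentence-transformers embeddings and a whitespace tokenizer for ROUGE-1, both of which are simplifying assumptions; `cosine_similarity` and `rouge1_f1` are illustrative helpers, not Rubric functions.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# A metric with threshold=0.8 would pass when similarity >= 0.8.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
print(rouge1_f1("the cat sat", "the cat sat on the mat"))
```

In the ROUGE example, all three candidate words appear in the six-word reference, so precision is 1.0, recall is 0.5, and F1 is 2/3.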
LLM Judge · requires [openai] or [anthropic]
LLMJudge
Use any LLM to score the output against criteria you define in plain English. Pass your own judge_fn callable or let Rubric auto-detect from your API key environment variables.
GEval
Chain-of-thought evaluation: the LLM reasons step-by-step before assigning a score. More accurate than single-pass judging for nuanced criteria like coherence or factual accuracy.
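Because a judge is just a callable, a mock is enough to exercise the flow without any API key. The sketch below assumes a judge function takes a prompt string and returns a response string ending in a "Score:" line; Rubric's real judge_fn signature and prompt format are not shown on this page, so `mock_judge`, `parse_score`, and `judge` are hypothetical.

```python
import re


def mock_judge(prompt: str) -> str:
    """Stand-in for an LLM call; always returns the same reasoning and score."""
    return "Reasoning: the answer names the correct city.\nScore: 0.9"


def parse_score(response: str) -> float:
    """Extract the number after 'Score:' and cap it at 1.0."""
    match = re.search(r"Score:\s*([01](?:\.\d+)?)", response)
    if match is None:
        raise ValueError("no score found in judge response")
    return min(1.0, float(match.group(1)))


def judge(output, criteria, judge_fn):
    """Build a reason-then-score prompt and parse the judge's verdict."""
    prompt = (
        f"Criteria: {criteria}\n"
        f"Output: {output}\n"
        "Think step by step, then end with 'Score: <0-1>'."
    )
    return parse_score(judge_fn(prompt))


print(judge("Paris", "Names the capital of France", mock_judge))  # 0.9
```

Swapping `mock_judge` for a function that calls a hosted or local model leaves the rest of the flow unchanged, which is the point of accepting any callable.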
Agent Metrics · Built-in
ToolCallAccuracy
Assert all expected tools were called, no forbidden tools were used, and optionally that they appeared in the correct order. Score degrades proportionally to missing or unexpected tools.
ToolCallEfficiency
Detects redundant calls (same tool + same args repeated), failed tool invocations, and slow individual tools. Combines into a single efficiency score.
TraceQuality
Analyzes the full reasoning trace for loops, repeated steps, and dead-end paths. Penalizes agents that get stuck cycling through the same actions without progress.
ReasoningQuality
Measures the ratio of reasoning steps to tool calls, and checks whether the agent updated its plan based on what it observed — a signal of genuine multi-step thinking.
SafetyCompliance
Scans outputs and tool arguments for PII (SSNs, credit cards, emails, phone numbers), dangerous SQL (DROP, DELETE, TRUNCATE), and forbidden tool names. Critical before production deployment.
ContextUtilization
For RAG: verifies the agent actually used retrieved context in its answer rather than hallucinating. Catches the failure mode of fetching documents then ignoring them entirely.
TaskCompletion
Determines whether the agent actually finished the task. Uses heuristic keyword checking by default, or an LLM judge when provided.
LatencyMetric · CostMetric
Enforce performance and cost budgets. Set a max latency in milliseconds or max cost in USD — scores degrade gracefully above the threshold.
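The page does not say how scores degrade above the threshold, so the following is only one plausible shape: full marks within budget, then a linear drop that reaches zero at twice the budget. The function name `budget_score` and the 2x cutoff are assumptions, not Rubric's documented behavior.

```python
def budget_score(actual, budget):
    """1.0 within budget, then linear decay to 0.0 at twice the budget (assumed)."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    if actual <= budget:
        return 1.0
    return max(0.0, 2.0 - actual / budget)


print(budget_score(1240, 5000))   # 1.0: under the 5000 ms latency budget
print(budget_score(7500, 5000))   # 0.5: 50% over budget
print(budget_score(12000, 5000))  # 0.0: more than double the budget
```

The same function works for cost by passing dollar amounts instead of milliseconds.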

Start evaluating your LLM today.

Free, MIT licensed, and ready in minutes. No account, no cloud, no lock-in.