Evaluate LLM outputs and AI agent behavior with a zero-dependency Python framework. Neutral, extensible, and not owned by any AI company.
After Promptfoo joined OpenAI, the community needed a truly independent evaluation framework. Rubric is that answer — open source forever, no cloud required, no lock-in.
Most eval frameworks only check the final output. Rubric evaluates the entire agent run — which tools were called, in what order, what the reasoning trace looks like, how long it took, and what it cost. You can require specific tools, forbid others, and penalize loops or redundant calls.
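To make the checks above concrete, here is a minimal sketch in plain Python of how required tools, forbidden tools, call order, and repeated calls could be verified against a recorded run. It is not Rubric's implementation: `check_tool_calls` and its parameters are hypothetical names chosen for illustration, and treating any repeated tool name as redundant is a simplifying assumption.

```python
from collections import Counter


def check_tool_calls(calls, required=(), forbidden=(), check_order=False):
    """Hypothetical helper: summarize which tool-call checks a run passes.

    `calls` is the ordered list of tool names the agent invoked.
    """
    missing = [t for t in required if t not in calls]
    used_forbidden = [t for t in forbidden if t in calls]

    # Order check: the required tools must appear as a subsequence of `calls`.
    in_order = True
    if check_order:
        remaining = iter(calls)
        in_order = all(tool in remaining for tool in required)

    # Redundancy (assumption): any tool invoked more than once is a repeat.
    repeats = {t: n for t, n in Counter(calls).items() if n > 1}

    return {
        "missing": missing,
        "forbidden_used": used_forbidden,
        "in_order": in_order,
        "repeats": repeats,
        "passed": not missing and not used_forbidden and in_order,
    }


report = check_tool_calls(
    calls=["search_flights", "search_flights", "book_flight"],
    required=["search_flights", "book_flight"],
    forbidden=["send_email"],
    check_order=True,
)
```

In this run every required tool is present and in order, so `report["passed"]` is true, while `report["repeats"]` flags the duplicated `search_flights` call as a candidate for an efficiency penalty.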
Rubric is model-agnostic. Pass any Python callable as your judge function: OpenAI, Anthropic, Ollama, a local model, or a mock. No API keys required unless you use LLMJudge or GEval — Rubric auto-detects from your environment variables when you do.
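As a rough illustration of what "any Python callable" can mean, the sketch below uses a deterministic mock in place of a hosted model. The judge signature (prompt string in, verdict string out) is an assumption, not a documented Rubric contract, and `mock_judge` and `grade` are hypothetical names.

```python
from typing import Callable

# Assumed shape of a judge: takes a prompt string, returns the verdict text.
Judge = Callable[[str], str]


def mock_judge(prompt: str) -> str:
    """Deterministic stand-in for an LLM judge; needs no API key or network."""
    return "PASS" if "Paris" in prompt else "FAIL"


def grade(question: str, answer: str, judge: Judge) -> bool:
    """Ask any judge callable for a verdict and reduce it to a boolean."""
    prompt = (
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with PASS or FAIL."
    )
    return judge(prompt).strip().upper().startswith("PASS")


ok = grade("What is the capital of France?", "Paris", mock_judge)
```

Because `grade` depends only on the callable's signature, the mock could be swapped for a function that wraps a hosted or local model without touching the rest of the code.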
The entire core (string matching, agent metrics, results, CLI, and HTML reports) ships with no mandatory dependencies. Install extras only when you need them: pip install rubric-eval[semantic] for embeddings, rubric-eval[openai] or rubric-eval[anthropic] for LLM judging.
Drop the rubric_eval fixture into any test file. Your LLM evals run inside the same pytest session as your unit tests — same CI pipeline, same runner, same output. No separate eval server or dashboard login needed.
Every eval run can produce a self-contained HTML report — no server, no build step, just a file. Filter results by pass/fail, drill into per-metric score breakdowns with explanations, and see exact inputs and outputs side-by-side. Share it as a single file.
MIT licensed. Not a product of OpenAI, Anthropic, Google, or any model provider. Rubric will never have a financial incentive to favor one model over another. Evaluate any model with the same unbiased framework — that's the whole point.
These are the metrics that matter when you're deploying a real agent — not just a chatbot. Rubric ships them out of the box.
Record the run in an AgentTestCase and Rubric handles the rest.
Install, define test cases, apply metrics, read the report.
import rubriceval as rubric

# Any function that returns a string
def my_llm(prompt):
    return "The capital of France is Paris."

results = rubric.evaluate(
    test_cases=[
        rubric.TestCase(
            input="What is the capital of France?",
            actual_output=my_llm("What is the capital of France?"),
            expected_output="Paris",
        )
    ],
    metrics=[
        rubric.Contains("Paris"),
        rubric.NotContains("I don't know"),
        rubric.SemanticSimilarity(threshold=0.8),
    ],
    output_html="report.html",
)
# pytest integration; `agent` is your own agent object, defined elsewhere
import rubriceval as rubric

def test_agent_books_flight(rubric_eval):
    rubric_eval.add_case(
        rubric.AgentTestCase(
            input="Book a flight to Tokyo",
            actual_output=agent.run("Book a flight to Tokyo"),
            expected_tools=["search_flights", "book_flight"],
            tool_calls=agent.tool_calls,
        ),
        metrics=[
            rubric.ToolCallAccuracy(),
            rubric.LatencyMetric(max_ms=5000),
        ],
    )
    # auto-asserts on teardown
# Full agent evaluation; `agent` and `agent_output` come from your own agent run
import rubriceval as rubric

results = rubric.evaluate(
    test_cases=[
        rubric.AgentTestCase(
            name="flight booking",
            input="Book a flight to Tokyo",
            actual_output=agent_output,
            expected_tools=[
                "search_flights",
                "check_availability",
                "book_flight",
            ],
            forbidden_tools=["send_email"],
            tool_calls=agent.tool_calls,
            trace=agent.trace,
            latency_ms=1240,
            cost_usd=0.004,
        )
    ],
    metrics=[
        rubric.ToolCallAccuracy(check_order=True),
        rubric.ToolCallEfficiency(),
        rubric.TraceQuality(penalize_loops=True),
        rubric.SafetyCompliance(),
        rubric.ReasoningQuality(),
        rubric.ContextUtilization(),
        rubric.LatencyMetric(max_ms=5000),
        rubric.CostMetric(max_cost_usd=0.01),
    ],
    output_html="agent_report.html",
)
results.print_summary()
A single self-contained HTML file. No server. No login. Filter by pass/fail, drill into agent traces, inspect tool calls.
Mix and match. Extend with your own by subclassing BaseMetric.
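The general shape of a custom metric can be sketched as follows. Rubric's real BaseMetric interface is not shown here, so the `BaseMetric` class below is a local stand-in, and `MetricResult`, `measure`, and `MaxWords` are hypothetical names used only to illustrate the subclassing pattern.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Hypothetical result type, not Rubric's."""
    score: float
    passed: bool
    reason: str


class BaseMetric(ABC):
    """Local stand-in for illustration; the real base class may differ."""

    @abstractmethod
    def measure(self, actual_output: str) -> MetricResult:
        ...


class MaxWords(BaseMetric):
    """Pass when the output stays within a word budget."""

    def __init__(self, limit: int):
        self.limit = limit

    def measure(self, actual_output: str) -> MetricResult:
        count = len(actual_output.split())
        passed = count <= self.limit
        return MetricResult(
            # Scale the score down in proportion to the overshoot.
            score=1.0 if passed else self.limit / max(count, 1),
            passed=passed,
            reason=f"{count} words (limit {self.limit})",
        )


result = MaxWords(limit=5).measure("The capital of France is Paris.")
```

Returning a score together with a reason string mirrors the per-metric breakdowns with explanations described for the HTML report, though the exact fields Rubric expects are an assumption here.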
Set require_all=True to enforce that every item is present. Pass a judge_fn callable, or let Rubric auto-detect a provider from your API key environment variables.
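The effect of a flag like require_all can be sketched in a few lines. The `contains` function is a hypothetical illustration, and it assumes that without the flag a single matching item is enough; which Rubric metrics accept the option, and their default behavior, is not specified here.

```python
def contains(output: str, items: list[str], require_all: bool = False) -> bool:
    """Check whether `output` includes the given substrings.

    Assumption: with require_all=False, one match is sufficient.
    """
    hits = [item in output for item in items]
    return all(hits) if require_all else any(hits)


text = "Paris is the capital of France."
any_match = contains(text, ["Paris", "Lyon"])
all_match = contains(text, ["Paris", "Lyon"], require_all=True)
```

Here `any_match` is true because "Paris" appears, while `all_match` is false because "Lyon" does not.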