{% extends "main.html" %}
{% block tabs %}
{{ super() }}
Catches the regressions that aggregate metrics hide.
Success rate up, cost down. Every dashboard would green-light this deploy. Per-task breakdown reveals them. New easy tasks masked the damage in the average. The quality gate caught it. Exit code 1. Deploy blocked. Groups traces by task ID and compares success rates. Finds the 5 tasks that broke even when the aggregate improved. The thing dashboards don't show you. Bootstrap confidence intervals and z-tests on every metric. Separates real changes from random variance. Exit 1 on violation. Markdown PR comments. GitHub Action included. Flat or nested. Same data, same results. No LLM-as-judge. 2 deps. No API keys. Installs in seconds.You changed the prompt.
Did it actually get better?The aggregate looks great
Five tasks completely broke
Duration doubled
Per-task regression detection
Statistical significance, not just deltas
Quality gates
Any JSONL
--suggest scans field names and gives you a config.Deterministic
Lightweight