Metadata-Version: 2.4
Name: aevals
Version: 0.1.1
Summary: Agent evals framework that gives Claude Code the tools to instrument your codebase, capture traces, write evals, and catch LLM regressions.
Project-URL: Homepage, https://aevals.sh
Project-URL: Repository, https://github.com/satyaborg/aevals
Project-URL: Issues, https://github.com/satyaborg/aevals/issues
Project-URL: Changelog, https://github.com/satyaborg/aevals/blob/main/CHANGELOG.md
Author: satyaborg
License-Expression: MIT
License-File: LICENSE
Keywords: agents,ai,evals,evaluation,llm,opentelemetry,testing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Classifier: Typing :: Typed
Requires-Python: >=3.12
Requires-Dist: click>=8.1
Requires-Dist: litellm>=1.0
Requires-Dist: mcp>=1.0
Requires-Dist: opentelemetry-api>=1.20
Requires-Dist: opentelemetry-sdk>=1.20
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Provides-Extra: anthropic
Requires-Dist: opentelemetry-instrumentation-anthropic; extra == 'anthropic'
Provides-Extra: bedrock
Requires-Dist: opentelemetry-instrumentation-bedrock; extra == 'bedrock'
Provides-Extra: cohere
Requires-Dist: opentelemetry-instrumentation-cohere; extra == 'cohere'
Provides-Extra: dev
Requires-Dist: pre-commit>=3.0; extra == 'dev'
Requires-Dist: pyright>=1.1; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: google
Requires-Dist: opentelemetry-instrumentation-google-generativeai; extra == 'google'
Provides-Extra: mistral
Requires-Dist: opentelemetry-instrumentation-mistralai; extra == 'mistral'
Provides-Extra: openai
Requires-Dist: opentelemetry-instrumentation-openai; extra == 'openai'
Description-Content-Type: text/markdown

# aevals

[![CI](https://github.com/satyaborg/aevals/actions/workflows/ci.yml/badge.svg)](https://github.com/satyaborg/aevals/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/aevals)](https://pypi.org/project/aevals/)
[![Python 3.12+](https://img.shields.io/badge/python-3.12%2B-blue)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE)

Agent evals framework that gives Claude Code the tools to instrument your codebase, capture traces, write evals, and catch LLM regressions.

```bash
pip install aevals
aevals init        # detects your agent, generates config
aevals run         # runs scenarios, reports pass/fail
```

## Why

Most teams building agents know they should eval. [They don't.](https://www.langchain.com/state-of-agent-engineering#evaluation-and-testing-for-agents) The problem isn't motivation — it's that nobody knows where to start.

aevals closes that gap. Point it at your codebase and it figures out the rest — which SDKs you use, where your entrypoint is, what tools your agent has. You go from nothing to a working eval suite without writing boilerplate.

## Install

```bash
pip install aevals

# Add instrumentation for your provider
pip install aevals[openai]       # OpenAI
pip install aevals[anthropic]    # Anthropic
pip install aevals[google]       # Google GenAI
pip install aevals[bedrock]      # AWS Bedrock
pip install aevals[mistral]      # Mistral
pip install aevals[cohere]       # Cohere
```

## Quick start

**1. Initialize** — scans your project, detects SDKs and entrypoints, generates `aevals.yaml`:

```bash
aevals init
```

**2. Define scenarios:**

```yaml
# aevals.yaml
config_version: 1
entry: src.agent:main            # module:callable

judge:
  model: openai/gpt-5.4           # any litellm model

scenarios:
  - name: simple-booking
    input: "Book a flight from SFO to JFK for next Tuesday"
    rubric:
      - "Agent calls search_flights before book_flight"
      - "Agent confirms with user before booking"
      - "Final output includes a confirmation number"
    constraints:
      max_steps: 5
      max_duration_ms: 10000
```

**3. Run:**

```bash
aevals run
```

```
── simple-booking ──────────────────────────────────────
  3 spans | 4.2s | 1,840 tokens

  Constraints:
    ✓ steps: 3 <= 5
    ✓ duration: 4200ms <= 10000ms

  Rubric: (judge: openai/gpt-5.4)
    ✓ Agent calls search_flights before book_flight
    ✓ Agent confirms with user before booking
    ✓ Final output includes a confirmation number

── Summary ─────────────────────────────────────────────
  1 scenario, 1 passed, 0 failed
```
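
The `entry` value in step 2 uses the common `module:callable` form. As a rough illustration of how such a string is usually resolved, here is a minimal sketch; `resolve_entry` is a hypothetical helper, not part of aevals, and the real loader may behave differently:

```python
import importlib
from typing import Any, Callable


def resolve_entry(entry: str) -> Callable[..., Any]:
    """Resolve a 'module:callable' string to the callable it names.

    Hypothetical helper for illustration; not the aevals implementation.
    """
    module_path, sep, attr_path = entry.partition(":")
    if not sep or not module_path or not attr_path:
        raise ValueError(f"expected 'module:callable', got {entry!r}")
    obj: Any = importlib.import_module(module_path)
    # Walk dotted attribute paths such as 'Agent.run'.
    for name in attr_path.split("."):
        obj = getattr(obj, name)
    if not callable(obj):
        raise TypeError(f"{entry!r} does not resolve to a callable")
    return obj


# Stdlib stand-in for an entry such as 'src.agent:main'.
dumps = resolve_entry("json:dumps")
print(dumps({"ok": True}))  # {"ok": true}
```

If `aevals run` cannot find your agent, checking that the module path imports cleanly from the project root is a reasonable first step.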

## How it works

Each scenario spawns your agent in an isolated subprocess. [OpenLLMetry](https://github.com/traceloop/openllmetry) auto-instruments your SDK and captures every LLM call as OpenTelemetry spans. The spans are parsed into a trajectory, then scored on two tracks:

**Constraints** — deterministic, zero LLM cost:

| Constraint | Checks |
|---|---|
| `max_duration_ms` | Wall-clock time at or under limit |
| `max_steps` | Number of LLM calls at or under limit |
| `tool_sequence` | Required tools called in order (subsequence match) |
| `no_repeat_calls` | No tool called N+ times with identical arguments |
| `output_contains` | Final output includes a substring |
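
To make the two less obvious checks concrete, here is a minimal sketch of a subsequence match and a repeat-call check. The function names, the `(name, args)` call representation, and the threshold parameter `n` are assumptions for illustration, not the aevals internals:

```python
import json
from collections import Counter
from typing import Any


def tool_sequence_ok(called: list[str], required: list[str]) -> bool:
    """True if `required` appears in `called` in order, gaps allowed."""
    remaining = iter(called)
    # `tool in remaining` consumes the iterator up to the first match,
    # so each required tool must be found after the previous one.
    return all(tool in remaining for tool in required)


def repeated_calls(calls: list[tuple[str, dict[str, Any]]], n: int) -> list[str]:
    """Names of tools called `n` or more times with identical arguments."""
    # Serialize arguments with sorted keys so key order does not matter.
    # Assumes arguments are JSON-serializable.
    counts = Counter(
        (name, json.dumps(args, sort_keys=True)) for name, args in calls
    )
    return sorted({name for (name, _), count in counts.items() if count >= n})


calls = [
    ("search_flights", {"from": "SFO", "to": "JFK"}),
    ("search_flights", {"to": "JFK", "from": "SFO"}),
    ("book_flight", {"flight_id": "UA123"}),
]
names = [name for name, _ in calls]
print(tool_sequence_ok(names, ["search_flights", "book_flight"]))  # True
print(tool_sequence_ok(names, ["book_flight", "search_flights"]))  # False
print(repeated_calls(calls, n=2))  # ['search_flights']
```

A subsequence match tolerates extra tool calls between the required ones, which is why the duplicate `search_flights` call does not break the first check.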

**Rubric** — natural-language assertions scored pass/fail by a judge model against the full trajectory (every LLM call, tool invocation, intermediate step). Uses [litellm](https://github.com/BerriAI/litellm), so any model it supports works as a judge. No judge configured? Rubrics stay pending and don't fail the run.

A scenario passes when **all constraints pass AND all rubric items pass.**
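
The pass rule can be sketched as a small function. The `Verdict` enum and the three-state result are assumptions made for illustration: the text above only says that pending rubrics do not fail the run, so how aevals reports a scenario with pending items may differ.

```python
from enum import Enum


class Verdict(str, Enum):
    PASS = "pass"
    FAIL = "fail"
    PENDING = "pending"  # rubric item not scored, e.g. no judge configured


def scenario_verdict(constraints: list[bool], rubric: list[Verdict]) -> Verdict:
    """Combine constraint results and rubric verdicts (hypothetical helper)."""
    # Any failed constraint or failed rubric item fails the scenario.
    if not all(constraints) or Verdict.FAIL in rubric:
        return Verdict.FAIL
    # Assumption: unscored rubric items leave the scenario pending, not failed.
    if Verdict.PENDING in rubric:
        return Verdict.PENDING
    return Verdict.PASS


print(scenario_verdict([True, True], [Verdict.PASS, Verdict.PASS]).value)  # pass
print(scenario_verdict([True, False], [Verdict.PASS]).value)               # fail
print(scenario_verdict([True], [Verdict.PENDING]).value)                   # pending
```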

## CI

```yaml
# .github/workflows/eval.yml
- name: Run evals
  run: aevals run --json
  # Exit codes: 0 = all pass, 1 = any fail, 2 = no traces
```

Constraints need no API keys. Add judge keys as secrets for rubric evaluation; if omitted, rubrics stay pending and don't block the pipeline.
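
If you drive the eval step from a script rather than directly from the workflow, the exit codes listed above can be mapped to messages. This is a minimal sketch: `describe_exit` and `run_evals` are hypothetical helpers, and nothing is assumed about the shape of the `--json` output, which is returned as raw text.

```python
import subprocess

# Exit codes as documented in the workflow comment above.
EXIT_MEANINGS = {
    0: "all scenarios passed",
    1: "at least one scenario failed",
    2: "no traces were captured",
}


def describe_exit(code: int) -> str:
    """Translate an exit code into a human-readable message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_evals(cmd: tuple[str, ...] = ("aevals", "run", "--json")) -> tuple[int, str]:
    """Run the eval command and return (exit code, raw stdout)."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    return proc.returncode, proc.stdout


print(describe_exit(2))  # no traces were captured
```

Treating exit code 2 separately from 1 lets a pipeline distinguish a broken instrumentation setup from a genuine regression.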

## Claude Code

aevals ships as an [MCP](https://modelcontextprotocol.io/) server. `aevals init` writes the config to `.claude/mcp.json` automatically.

```bash
aevals mcp-serve
```

## OTel compatibility

Traces are standard OpenTelemetry. Pipe them to [Langfuse](https://langfuse.com), [Phoenix](https://phoenix.arize.com), [Jaeger](https://www.jaegertracing.io/), or any OTel backend.

## Development

```bash
pip install -e ".[dev]"
pytest
ruff check src/ tests/
```

## License

MIT
