Metadata-Version: 2.4
Name: agentest
Version: 0.2.0
Summary: Universal agent testing and evaluation toolkit - record, replay, mock, and benchmark AI agents
Project-URL: Homepage, https://github.com/ColinHarker/agentest
Project-URL: Repository, https://github.com/ColinHarker/agentest
Project-URL: Issues, https://github.com/ColinHarker/agentest/issues
Project-URL: Documentation, https://ColinHarker.github.io/agentest/
Author: Agentest Contributors
License-Expression: MIT
License-File: LICENSE
Keywords: agents,ai,benchmark,evaluation,llm,mcp,testing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.10
Requires-Dist: click>=8.0
Requires-Dist: httpx>=0.25.0
Requires-Dist: jinja2>=3.1
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Provides-Extra: all
Requires-Dist: anthropic>=0.40.0; extra == 'all'
Requires-Dist: autogen-agentchat>=0.4.0; extra == 'all'
Requires-Dist: crewai>=0.80.0; extra == 'all'
Requires-Dist: fastapi>=0.115.0; extra == 'all'
Requires-Dist: langchain-core>=0.3.0; extra == 'all'
Requires-Dist: llama-index-core>=0.11.0; extra == 'all'
Requires-Dist: openai>=1.50.0; extra == 'all'
Requires-Dist: uvicorn[standard]>=0.32.0; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40.0; extra == 'anthropic'
Provides-Extra: autogen
Requires-Dist: autogen-agentchat>=0.4.0; extra == 'autogen'
Provides-Extra: crewai
Requires-Dist: crewai>=0.80.0; extra == 'crewai'
Provides-Extra: dev
Requires-Dist: fastapi>=0.115.0; extra == 'dev'
Requires-Dist: mypy>=1.13; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.24; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.8; extra == 'dev'
Requires-Dist: types-pyyaml>=6.0; extra == 'dev'
Requires-Dist: uvicorn[standard]>=0.32.0; extra == 'dev'
Provides-Extra: docs
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.3.0; extra == 'langchain'
Provides-Extra: llamaindex
Requires-Dist: llama-index-core>=0.11.0; extra == 'llamaindex'
Provides-Extra: openai
Requires-Dist: openai>=1.50.0; extra == 'openai'
Provides-Extra: web
Requires-Dist: fastapi>=0.115.0; extra == 'web'
Requires-Dist: uvicorn[standard]>=0.32.0; extra == 'web'
Description-Content-Type: text/markdown

<p align="center">
  <img src="assets/logo.svg" alt="Agentest" width="480"/>
</p>

<p align="center">
  <strong>Universal testing and evaluation toolkit for AI agents.</strong>
</p>

<p align="center">
  <a href="https://pypi.org/project/agentest/"><img src="https://img.shields.io/pypi/v/agentest?color=blue" alt="PyPI version"></a>
  <a href="https://pypi.org/project/agentest/"><img src="https://img.shields.io/pypi/pyversions/agentest" alt="Python versions"></a>
  <a href="https://github.com/ColinHarker/agentest/actions/workflows/ci.yml"><img src="https://github.com/ColinHarker/agentest/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
  <a href="https://github.com/ColinHarker/agentest/blob/main/LICENSE"><img src="https://img.shields.io/github/license/ColinHarker/agentest" alt="License"></a>
  <a href="https://pypi.org/project/agentest/"><img src="https://img.shields.io/pypi/dm/agentest?color=green" alt="Downloads"></a>
</p>

<p align="center">
  <a href="#quick-start">Quick Start</a> •
  <a href="#features">Features</a> •
  <a href="#built-in-evaluators">Evaluators</a> •
  <a href="#cli">CLI</a> •
  <a href="#architecture">Architecture</a>
</p>

---

Only 52% of teams shipping AI agents run evaluations. Agentest makes it dead simple to test, evaluate, and benchmark agents — regardless of which framework or LLM provider you use.

```
pip install agentest
```

## Why Agentest?

The AI agent ecosystem has exploded, but **testing and evaluation hasn't kept up**. Most teams ship agents to production without systematic testing.

Agentest brings software engineering best practices to agent development:

| Capability | What You Get |
|---|---|
| **Record & Replay** | Capture real agent sessions, replay them deterministically — no expensive LLM calls needed for testing |
| **Tool Mocking** | Mock any tool call with a fluent, pytest-style API: `.when(...).returns(...)` |
| **7 Built-in Evaluators** | Grade agents on task completion, safety, cost, latency, tool usage, and more |
| **Model Comparison** | Run the same tasks across Claude, GPT, Gemini — compare pass rates, cost, and latency |
| **MCP Server Testing** | Test MCP servers for protocol compliance and tool schema validation |
| **pytest Plugin** | Drop-in integration with auto-registered fixtures and custom markers |
| **CLI & Web UI** | `agentest evaluate`, `agentest replay`, `agentest summary`, and a FastAPI dashboard |

### What Makes It Different

- **Framework-agnostic** — Works with any agent framework (LangChain, CrewAI, AutoGen, LlamaIndex, custom agents). No vendor lock-in.
- **Auto-instrumentation** — `agentest.instrument()` patches anthropic/openai clients to auto-record traces with zero code changes.
- **Native adapters** — First-class integrations for LangChain, CrewAI, AutoGen, LlamaIndex, Claude Agent SDK, and OpenAI Agents SDK.
- **LLM-provider-agnostic** — Built-in cost tracking for Anthropic, OpenAI, and Google models.
- **Offline-first** — No network calls required for recording, replaying, or evaluating.
- **CI/CD ready** — GitHub Action for running evaluations in your pipeline.
- **Minimal dependencies** — 6 core runtime deps. Optional extras for web UI and frameworks.

## Quick Start

### 1. Record and Evaluate

```python
from agentest import Recorder, TaskCompletionEvaluator, SafetyEvaluator, CostEvaluator

# Record an agent interaction
recorder = Recorder(task="Summarize README.md")
recorder.record_message("user", "Please summarize README.md")
recorder.record_tool_call(
    name="read_file",
    arguments={"path": "README.md"},
    result="# My Project\nThis is a sample project.",
)
recorder.record_llm_response(
    model="claude-sonnet-4-6",
    content="This is a sample project.",
    input_tokens=100,
    output_tokens=20,
)
trace = recorder.finalize(success=True)

# Evaluate
for evaluator in [TaskCompletionEvaluator(), SafetyEvaluator(), CostEvaluator(max_cost=0.10)]:
    result = evaluator.evaluate(trace)
    print(f"{result.evaluator}: {'PASS' if result.passed else 'FAIL'} ({result.score:.2f})")

# Save for replay
recorder.save("traces/summarize.yaml")
```

### 2. Mock Tools for Deterministic Testing

```python
from agentest import ToolMock, MockToolkit

toolkit = MockToolkit()

# Simple returns
toolkit.mock("read_file").returns("file contents")

# Conditional returns
toolkit.mock("search") \
    .when(query="python").returns(["python result"]) \
    .when(query="rust").returns(["rust result"]) \
    .otherwise().returns([])

# Sequential returns (pagination, retries)
toolkit.mock("get_page").returns_sequence(["page 1", "page 2", "page 3"])

# Custom logic
toolkit.mock("calculator").responds_with(lambda args: args["a"] + args["b"])

# Error simulation
toolkit.mock("flaky_api").raises(TimeoutError("service unavailable"))

# Use them
result = toolkit.execute("read_file", path="test.txt")  # "file contents"
result = toolkit.execute("search", query="python")       # ["python result"]

# Assertions
toolkit.mock("read_file").assert_called()
toolkit.mock("read_file").assert_called_with(path="test.txt")
toolkit.assert_all_called()
```

### 3. Replay Recorded Sessions

```python
from agentest import Recorder, Replayer

# Load a previously recorded trace
trace = Recorder.load("traces/summarize.yaml")
replayer = Replayer(trace, strict=True)

# Replay — get the exact same responses
response = replayer.next_llm_response()
tool_result = replayer.next_tool_result("read_file")

# Generate mock functions from the trace
mocks = replayer.create_tool_mock()
result = mocks["read_file"]()  # Returns the recorded result
```

### 4. Benchmark Across Models

```python
from agentest import BenchmarkRunner, ModelComparison
from agentest.benchmark.runner import BenchmarkTask

comparison = ModelComparison()

for model in ["claude-sonnet-4-6", "gpt-4o", "gpt-4o-mini"]:
    runner = BenchmarkRunner(name=f"bench_{model}", evaluators=[...])
    runner.add_task(BenchmarkTask(
        name="summarize",
        description="Summarize a document",
        task_fn=lambda: run_your_agent(model, "Summarize README.md"),
    ))
    comparison.add_result(model, runner.run())

# Compare
table = comparison.comparison_table()
best = comparison.best_model("avg_score")
diff = comparison.diff("claude-sonnet-4-6", "gpt-4o")

# Export
comparison.to_markdown("results.md")
comparison.to_csv("results.csv")
```

### 5. pytest Integration

```python
# tests/test_my_agent.py — fixtures auto-registered via entry point

def test_agent_completes_task(agent_recorder, agent_eval_suite):
    agent_recorder.trace.task = "Summarize a document"
    agent_recorder.record_tool_call(name="read_file", arguments={"path": "doc.txt"}, result="...")
    agent_recorder.record_llm_response(model="claude-sonnet-4-6", content="Summary here.")
    trace = agent_recorder.finalize(success=True)

    results = agent_eval_suite.evaluate_all(trace)
    assert all(r.passed for r in results)

def test_mocked_tools(agent_toolkit):
    agent_toolkit.mock("search").when(query="python").returns(["result"])
    assert agent_toolkit.execute("search", query="python") == ["result"]

def test_safety():
    from agentest import Recorder, SafetyEvaluator
    rec = Recorder(task="test")
    rec.record_tool_call(name="bash", arguments={"command": "rm -rf /"}, result="")
    assert not SafetyEvaluator().evaluate(rec.finalize()).passed
```

Run with:
```bash
pytest tests/ --agentest-max-cost=0.50 --agentest-max-tokens=100000
```

### 6. Test MCP Servers

```python
from agentest.mcp_testing import MCPServerTester, MCPAssertions

tester = MCPServerTester(command=["python", "-m", "my_mcp_server"])

# Run standard compliance tests
results = tester.run_standard_tests()
MCPAssertions(results).all_passed().has_tool("read_file").max_latency(5000)

# Test specific tools
result = tester.test_tool_call("read_file", arguments={"path": "/tmp/test.txt"})
assert result.passed

# Validate tool schemas
schema_results = tester.test_tool_schema_validation()
MCPAssertions(schema_results).all_passed()
```

### 7. Auto-Instrumentation (Zero Code Changes)

```python
import agentest

# Monkey-patch anthropic and openai clients globally
agentest.instrument()

# All API calls are now automatically recorded
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=100,
)

# Get all recorded traces
traces = agentest.get_traces()
agentest.uninstrument()  # Remove patches when done
```

### 8. Framework Integrations

```python
# LangChain
from agentest.integrations.langchain import AgentestCallbackHandler
handler = AgentestCallbackHandler(task="My chain")
result = chain.invoke({"input": "Hello"}, config={"callbacks": [handler]})
trace = handler.get_trace()

# CrewAI
from agentest.integrations.crewai import record_crew
result, trace = record_crew(crew, inputs={"topic": "testing"})

# AutoGen
from agentest.integrations.autogen import record_autogen_chat
result, trace = record_autogen_chat(user_proxy, assistant, "Hello")

# Claude Agent SDK
from agentest.integrations.claude_agent_sdk import AgentestTracer
tracer = AgentestTracer(task="My agent")
result, trace = tracer.record(agent.run, "What is 2+2?")

# OpenAI Agents SDK
from agentest.integrations.openai_agents import AgentestTracer
tracer = AgentestTracer(task="My agent")
result, trace = tracer.record(runner.run_sync, agent, "Hello")
```

Install framework extras: `pip install agentest[langchain,crewai,autogen,llamaindex]`

### 9. GitHub Action

```yaml
# .github/workflows/agent-eval.yml
- uses: ColinHarker/agentest@v1
  with:
    traces-dir: traces/
    evaluators: task_completion,safety,tool_usage
    max-cost: "1.00"
    check-safety: "true"
    fail-on-error: "true"
```

## Features

### Record & Replay

Capture agent interactions as immutable trace snapshots (YAML or JSON). Replay them deterministically without making real LLM or tool calls — saving time and money during development.

```python
# Record with context manager — auto-finalizes on exit
with Recorder(task="My task") as rec:
    rec.record_message("user", "Do something")
    rec.record_llm_response("claude-sonnet-4-6", "Done.", 100, 20)
# trace = rec.trace
```

### Tool Mocking

Fluent builder API with conditional returns, sequences, regex matching, custom handlers, and full assertion support. Test agent logic without real integrations.

### Safety Evaluation

Built-in detection for:
- **Unsafe commands** — `rm -rf /`, `DROP TABLE`, `sudo chmod 777`, `eval(`, `curl | sh`, etc.
- **PII leakage** — SSNs, credit card numbers, emails, API keys, AWS credentials
- **Custom patterns** — supply your own regex rules
- **Blocked tools** — prevent specific tools from being called

### Cost & Latency Tracking

Automatic cost estimation with built-in pricing for Claude (Opus, Sonnet, Haiku), GPT-4o, GPT-4o-mini, O3, and O4-mini. Set budgets and latency limits as evaluator constraints.

### Model Comparison

Side-by-side benchmarking with export to CSV and Markdown. Compare pass rates, average scores, total cost, and latency across any number of models.

## CLI

```bash
# Initialize Agentest in your project
agentest init

# Evaluate a recorded trace
agentest evaluate traces/my_trace.yaml --max-cost 0.50 --check-safety

# Replay a trace
agentest replay traces/my_trace.yaml

# Summarize all traces in a directory
agentest summary traces/ --format table

# Compare two traces side-by-side
agentest diff traces/v1.yaml traces/v2.yaml

# Watch a directory and re-evaluate on changes
agentest watch traces/ --check-safety --interval 5

# Launch the web UI dashboard
agentest serve --traces-dir traces/ --port 8000
```

## Built-in Evaluators

| Evaluator | What it checks | Scoring |
|-----------|---------------|---------|
| `TaskCompletionEvaluator` | Success status, errors, message count, required tools, failed tool calls | -0.25 per issue (min 0.0) |
| `SafetyEvaluator` | Dangerous commands, PII leakage, blocked tools, custom regex patterns | -0.2 per violation (min 0.0) |
| `CostEvaluator` | Total cost, token count, LLM call count against budgets | Binary pass/fail |
| `LatencyEvaluator` | Total duration and per-call latency against limits | 1.0 pass, 0.5 fail |
| `ToolUsageEvaluator` | Required/forbidden tools, retry limits, error rates | -0.2 per issue (min 0.0) |
| `LLMJudgeEvaluator` | LLM-graded evaluation against custom criteria (Anthropic or OpenAI) | LLM-assigned score |
| `CompositeEvaluator` | Combines multiple evaluators with AND/OR logic | Average of all scores |

## Architecture

```
agentest/
├── core.py              # Data models: AgentTrace, ToolCall, LLMResponse, TraceSession
├── recorder/
│   ├── recorder.py      # Record agent sessions to YAML/JSON
│   └── replayer.py      # Replay sessions deterministically
├── mocking/
│   └── tool_mock.py     # ToolMock, MockToolkit — fluent builder + assertions
├── evaluators/
│   ├── base.py          # Evaluator ABC, EvalResult, CompositeEvaluator, LLMJudge
│   └── builtin.py       # TaskCompletion, Safety, Cost, Latency, ToolUsage
├── benchmark/
│   ├── runner.py        # BenchmarkRunner (sync + async), BenchmarkTask, BenchmarkResult
│   └── comparison.py    # ModelComparison, ModelScore — CSV/Markdown export
├── integrations/
│   ├── instrument.py    # Auto-instrumentation for anthropic/openai
│   ├── langchain.py     # LangChain callback handler adapter
│   ├── crewai.py        # CrewAI crew recorder
│   ├── autogen.py       # AutoGen conversation recorder
│   ├── llamaindex.py    # LlamaIndex callback handler
│   ├── claude_agent_sdk.py # Claude Agent SDK tracer
│   └── openai_agents.py # OpenAI Agents SDK tracer
├── mcp_testing/
│   ├── server_tester.py # MCPServerTester — subprocess-based JSON-RPC testing
│   └── assertions.py    # MCPAssertions — fluent assertion chains
├── reporters/
│   ├── console.py       # Rich console output
│   └── json_reporter.py # Machine-readable JSON reports
├── server/
│   └── app.py           # FastAPI web UI for trace exploration
├── pytest_plugin.py     # Auto-registered fixtures, markers, and trace collectors
└── cli.py               # Click CLI with 8 commands
```

### Design Principles

- **Pydantic models** for strict typing and automatic serialization
- **Builder pattern** for fluent APIs (ToolMock, Recorder)
- **Strategy pattern** for pluggable evaluators
- **Composite pattern** for evaluator aggregation
- **Zero framework coupling** — works with any agent that produces traces

## Comparison

| Feature | Agentest | LangSmith | LangFuse | Braintrust |
|---------|:---------:|:---------:|:--------:|:----------:|
| Record & Replay | ✅ | — | — | — |
| Tool Mocking | ✅ | Basic | — | — |
| Safety Evaluator | ✅ | — | — | — |
| MCP Server Testing | ✅ | — | — | — |
| pytest Integration | ✅ | — | — | — |
| Framework-Agnostic | ✅ | ❌ | ❌ | ❌ |
| Open Source | ✅ | ❌ | ❌ | ❌ |
| Cost Tracking | ✅ | ✅ | ✅ | ✅ |
| Web UI | Basic | Full | Full | Full |
| Centralized Backend | — | ✅ | ✅ | ✅ |
| Auto-Instrumentation | ✅ | ✅ | ✅ | ✅ |
| Framework Adapters | ✅ (7) | ❌ | Partial | — |
| GitHub Action | ✅ | — | — | — |

**Best for:** Local development, CI/CD pipelines, deterministic testing, safety compliance, multi-model benchmarking.

## Installation

```bash
pip install agentest              # Core (recording, evaluation, CLI)
pip install agentest[web]         # Web UI dashboard
pip install agentest[langchain]   # LangChain adapter
pip install agentest[crewai]      # CrewAI adapter
pip install agentest[autogen]     # AutoGen adapter
pip install agentest[llamaindex]  # LlamaIndex adapter
pip install agentest[all]         # Everything
```

## License

MIT
