Metadata-Version: 2.4
Name: exgentic
Version: 0.2.0
Summary: Exgentic - General agent evaluation
Author: Exgentic Team
License: MIT
License-File: LICENSE
Requires-Python: >=3.11
Requires-Dist: click<9,>=8.1.7
Requires-Dist: cloudpickle<4,>=3
Requires-Dist: diskcache<6,>=5
Requires-Dist: filelock<4,>=3
Requires-Dist: json-schema-to-pydantic<1,>=0.4
Requires-Dist: litellm!=1.82.7,!=1.82.8,<2,>=1.65.0
Requires-Dist: mcp<2,>=1.24
Requires-Dist: nicegui<4,>=3
Requires-Dist: pydantic-settings<3,>=2
Requires-Dist: pydantic<3,>=2.9.2
Requires-Dist: python-dotenv<2,>=1
Requires-Dist: rich-click<2,>=1
Requires-Dist: rich<14,>=13
Requires-Dist: typing-extensions<5,>=4
Provides-Extra: analysis
Requires-Dist: matplotlib<4,>=3; extra == 'analysis'
Requires-Dist: numpy<3,>=2; extra == 'analysis'
Requires-Dist: pandas<4,>=3; extra == 'analysis'
Requires-Dist: scipy<2,>=1; extra == 'analysis'
Requires-Dist: statsmodels<1,>=0.14; extra == 'analysis'
Provides-Extra: dev
Requires-Dist: codespell; extra == 'dev'
Requires-Dist: detect-secrets; extra == 'dev'
Requires-Dist: pre-commit; extra == 'dev'
Requires-Dist: pytest-asyncio; extra == 'dev'
Requires-Dist: pytest-mock; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: otel
Requires-Dist: opentelemetry-api<2,>=1; extra == 'otel'
Requires-Dist: opentelemetry-exporter-otlp-proto-grpc<2,>=1; extra == 'otel'
Requires-Dist: opentelemetry-exporter-otlp-proto-http<2,>=1; extra == 'otel'
Requires-Dist: opentelemetry-sdk<2,>=1; extra == 'otel'
Requires-Dist: opentelemetry-semantic-conventions-ai<1,>=0.4.0; extra == 'otel'
Description-Content-Type: text/markdown

<img src="misc/assets/exgentic_banner_black.png" alt="Exgentic Banner" width="100%"/>

<p align="center">
  <strong>Evaluate any agent on any benchmark in the simplest way possible</strong>
</p>

---

## What is Exgentic?

Exgentic is a universal evaluation framework that enables standardized testing of AI agents across diverse benchmarks and domains. It provides a consistent interface for evaluating any agent on any benchmark, making it easy to compare performance, reproduce results, and ensure your agent works reliably across different tasks and environments.

## Who is it for?

1. **General Audience** - Visit [www.exgentic.ai](https://www.exgentic.ai) to explore the first general agent leaderboard comparing leading agents and frontier models across varied tasks.
2. **Agent Builders** - Evaluate your agents comprehensively across multiple domains and benchmarks.
3. **Researchers & Component Developers** - Test agentic components (memory, context compression, planning) across different agents and domains.
4. **Benchmark Builders** - Evaluate your benchmark across multiple agents to ensure meaningful differentiation.

---

## Quick Start

### Installation

```bash
uv tool install exgentic
```

### API Credentials

```bash
export OPENAI_API_KEY=...
# or
export ANTHROPIC_API_KEY=...
```

### Run an Evaluation

```bash
# List available benchmarks and agents
exgentic list benchmarks
exgentic list agents

# Evaluate an agent on a benchmark
exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
  --model gpt-4o \
  --set benchmark.user_simulator_model="gpt-4o"
```

Benchmarks are set up automatically in an isolated virtual environment on first run, so no manual installation is needed. You can also set them up explicitly:

```bash
exgentic setup --benchmark tau2
exgentic setup --agent litellm_tool_calling
```

For full container isolation, use the Docker runner (`--set benchmark.runner=docker`). You only need Docker installed and running:

```bash
exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
  --model gpt-4o \
  --set benchmark.runner=docker \
  --set benchmark.user_simulator_model="gpt-4o"
```

### Python API

To use exgentic as a library, install it first:

```bash
uv add exgentic   # or: pip install exgentic
```

```python
from exgentic import evaluate

results = evaluate(
    benchmark="tau2",
    agent="tool_calling",
    subset="retail",
    num_tasks=2,
    model="gpt-4o",
    benchmark_kwargs={"user_simulator_model": "gpt-4o"},
)
```

For more examples, see the [`examples/`](./examples/) directory.

---

## Available Benchmarks

```bash
exgentic list benchmarks
```

| Benchmark | Description |
|-----------|-------------|
| **tau2** | Simulated customer support tasks across multiple domains (retail, airline, banking) |
| **appworld** | Multi-app API environment testing agents' ability to interact with application interfaces |
| **browsecompplus** | Web search and browsing benchmark for information retrieval and navigation |
| **swebench** | Software engineering benchmark for resolving real-world GitHub issues |
| **hotpotqa** | Multi-hop question answering over Wikipedia |
| **gsm8k** | Grade school math word problems with optional calculator tool |

## Available Agents

| Agent | Description |
|-------|-------------|
| **LiteLLM Tool Calling** | Generic tool-calling agent via LiteLLM |
| **SmolAgents** | HuggingFace SmolAgents framework |
| **OpenAI MCP** | OpenAI Responses API with MCP tools |
| **Claude Code** | Anthropic Claude Code agent |
| **Codex CLI** | OpenAI Codex CLI agent |
| **Gemini CLI** | Google Gemini CLI agent |

---

## Dashboard

<img src="misc/assets/gui.png" alt="Dashboard" width="100%"/>

```bash
exgentic dashboard
```

---

## Output Structure

Each run creates its own directory under `outputs/<run_id>/`:

```text
outputs/<run_id>/
├── results.json                    # Overall scores, costs, per-session statistics
├── benchmark_results.json          # Benchmark-specific aggregated results
├── run/
│   ├── config.json                # Snapshot of benchmark and agent configuration
│   ├── run.log                    # Main execution log
│   └── warnings.log               # Warnings during execution
└── sessions/<session_id>/
    ├── config.json                # Session configuration
    ├── results.json               # Session results
    ├── trajectory.jsonl           # One JSON line per step (action + observation)
    ├── agent/
    │   └── agent.log             # Agent execution log
    └── benchmark/
        ├── results.json          # Benchmark-specific results
        └── session.log           # Benchmark session log
```
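
Because `trajectory.jsonl` holds one JSON object per step, a run can be summarized with a few lines of Python. The sketch below is a minimal example that relies only on the directory layout shown above; the helper name `count_trajectory_steps` is hypothetical and not part of Exgentic, and it makes no assumption about the fields inside each step.

```python
import json
from pathlib import Path


def count_trajectory_steps(run_dir):
    """Return {session_id: number of JSON lines in that session's trajectory.jsonl}.

    Hypothetical helper for illustration; it only assumes the
    outputs/<run_id>/sessions/<session_id>/trajectory.jsonl layout.
    """
    counts = {}
    sessions_dir = Path(run_dir) / "sessions"
    if not sessions_dir.is_dir():
        return counts
    for session_dir in sorted(sessions_dir.iterdir()):
        trajectory = session_dir / "trajectory.jsonl"
        if not trajectory.is_file():
            continue  # session without a recorded trajectory
        with trajectory.open(encoding="utf-8") as handle:
            # Parse every non-empty line so malformed JSON fails loudly.
            steps = [json.loads(line) for line in handle if line.strip()]
        counts[session_dir.name] = len(steps)
    return counts
```

Calling `count_trajectory_steps("outputs/<run_id>")` with a real run ID returns a dictionary keyed by session ID, which is a quick way to spot sessions that ended unusually early or ran unusually long.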

---

## CLI Reference

<img src="misc/assets/cli.png" alt="CLI" width="100%"/>

```bash
# Discover
exgentic list benchmarks
exgentic list subsets --benchmark tau2
exgentic list tasks --benchmark tau2 --subset retail --limit 5
exgentic list agents
exgentic setup --benchmark tau2

# Run
exgentic evaluate --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10
exgentic batch run --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10

# Inspect
exgentic status --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10
exgentic preview --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10
exgentic results --benchmark tau2 --agent tool_calling --subset airline --num-tasks 10

# Analyze
exgentic compare --agents tool_calling openai --benchmark tau2

# Explore
exgentic dashboard
```

---

## Advanced

### Model Configuration

```bash
exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
  --set agent.model.temperature=0.2
```

Supported fields: `temperature`, `top_p`, `max_tokens`, `reasoning_effort`, `num_retries`, `retry_after`, `retry_strategy`
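
The `--set` flag addresses nested configuration fields with a dotted path, such as `agent.model.temperature`. As a rough mental model only, the sketch below shows how one `key.path=value` assignment could map onto a nested dictionary. It is not Exgentic's actual parser: the function `apply_override` and the value-parsing rule (try JSON, fall back to a plain string) are assumptions made for illustration.

```python
import json


def apply_override(config, assignment):
    """Apply one 'dotted.key=value' assignment to a nested dict in place.

    Illustrative only; Exgentic's real --set handling may differ.
    """
    path, _, raw = assignment.partition("=")
    keys = path.split(".")
    node = config
    for key in keys[:-1]:
        # Create intermediate levels on demand.
        node = node.setdefault(key, {})
    try:
        value = json.loads(raw)  # numbers, booleans, quoted strings
    except json.JSONDecodeError:
        value = raw  # bare words such as "docker" stay strings
    node[keys[-1]] = value
    return config


config = {}
apply_override(config, "agent.model.temperature=0.2")
apply_override(config, "benchmark.runner=docker")
print(config)
```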

### Run Limits

```bash
exgentic evaluate --benchmark tau2 --agent tool_calling --subset retail --num-tasks 2 \
  --max-steps 100 --max-actions 100
```

A session stops as soon as it reaches either limit and records a `limit_reached` status. Both limits default to 100.

### HuggingFace

Use HuggingFace models or run evaluations on HuggingFace Jobs. See [docs/huggingface.md](./docs/huggingface.md).

---

## How It Works

To learn more about Exgentic's architecture and design, see our [arXiv paper](https://arxiv.org/abs/2602.22953).

## Development

For local development, editing, and contributing, see [DEVELOPMENT.md](./DEVELOPMENT.md).

## Contributing

We welcome issues and pull requests! See [CONTRIBUTING.md](./CONTRIBUTING.md) for guidelines.

## Citing Exgentic

```bibtex
@misc{bandel2026generalagentevaluation,
      title={General Agent Evaluation},
      author={Elron Bandel and Asaf Yehudai and Lilach Eden and Yehoshua Sagron and Yotam Perlitz and Elad Venezian and Natalia Razinkov and Natan Ergas and Shlomit Shachor Ifergan and Segev Shlomov and Michal Jacovi and Leshem Choshen and Liat Ein-Dor and Yoav Katz and Michal Shmueli-Scheuer},
      year={2026},
      url={https://arxiv.org/abs/2602.22953},
}
```

## License

Apache License 2.0 — see [LICENSE](LICENSE).

## Support

For questions and support, [open an issue](https://github.com/Exgentic/exgentic/issues) on GitHub.
