Metadata-Version: 2.4
Name: claw-bench
Version: 0.1.0
Summary: Standardized evaluation benchmark for the Claw ecosystem
Project-URL: Homepage, https://github.com/claw-bench/claw-bench
Project-URL: Repository, https://github.com/claw-bench/claw-bench
Project-URL: Documentation, https://claw-bench.github.io/claw-bench/
Author: Claw Bench Contributors
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: agent,ai,benchmark,claw,evaluation
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.11
Requires-Dist: cryptography
Requires-Dist: docker>=7
Requires-Dist: httpx>=0.27
Requires-Dist: pydantic>=2
Requires-Dist: pytest>=8
Requires-Dist: pyyaml>=6
Requires-Dist: rich>=13
Requires-Dist: tomli>=2
Requires-Dist: typer[all]>=0.9
Provides-Extra: dev
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: server
Requires-Dist: fastapi>=0.115; extra == 'server'
Requires-Dist: uvicorn[standard]>=0.34; extra == 'server'
Requires-Dist: websockets>=14; extra == 'server'
Description-Content-Type: text/markdown

# Claw Bench

**A standardized evaluation benchmark for the Claw ecosystem.**

Claw Bench provides a reproducible, container-isolated harness for measuring how well AI agent frameworks perform across real-world desktop and application tasks.

[Documentation](https://claw-bench.github.io/claw-bench/) | [Leaderboard](https://claw-bench.github.io/claw-bench/leaderboard) | [Chinese / 中文](README.zh-CN.md)

---

## Quick Start

```bash
# 1. Install
pip install claw-bench

# 2. Run the benchmark
claw-bench run --adapter openclaw --tasks all

# 3. Submit results to the leaderboard
claw-bench submit results/<run-id>.json
```

## Features

- **Reproducible evaluation** -- every task runs in a Docker container with a deterministic initial state.
- **Multi-framework support** -- pluggable adapter system lets you benchmark any Claw-compatible agent framework.
- **Rich task library** -- curated tasks spanning productivity apps, coding, web browsing, system administration, and more.
- **Automated scoring** -- objective rubrics with both binary and partial-credit metrics.
- **CLI-first workflow** -- validate tasks, run suites, and submit results from the command line.
- **Encrypted ground truth** -- answer keys are age-encrypted so agents cannot peek at solutions.

## Supported Frameworks

| Framework | Adapter Name | Status | Language |
|-----------|-------------|--------|----------|
| OpenClaw  | `openclaw`  | Supported | TypeScript |
| IronClaw  | `ironclaw`  | Supported | Rust |
| ZeroClaw  | `zeroclaw`  | Supported | Rust |
| QClaw     | `qclaw`     | Supported | TypeScript |
| NullClaw  | `nullclaw`  | Supported | Zig |
| PicoClaw  | `picoclaw`  | Supported | Go |
| NanoBot   | `nanobot`   | Supported | Python |
| DryRun    | `dryrun`    | Built-in | Python (oracle) |

The `dryrun` adapter runs oracle solutions directly for infrastructure validation. Register additional frameworks by implementing the `ClawAdapter` interface and adding an entry point. See [CONTRIBUTING.md](CONTRIBUTING.md) for details.
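
As a rough illustration of the adapter pattern described above, the sketch below defines a stand-in abstract base class, one trivial adapter, and a name-based lookup. Everything in it is hypothetical: `AdapterSketch`, `run_task`, `TaskResult`, `EchoAdapter`, and the `ADAPTERS` dictionary are illustrative assumptions, not the actual `ClawAdapter` interface or the real entry-point mechanism, both of which are described in CONTRIBUTING.md.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical result record; the real schema may differ."""
    task_id: str
    output: str
    success: bool


class AdapterSketch(ABC):
    """Stand-in for the real ClawAdapter interface (assumed shape)."""

    name: str  # the value a user would pass to --adapter

    @abstractmethod
    def run_task(self, task_id: str, prompt: str) -> TaskResult:
        """Run one task and return its result."""


class EchoAdapter(AdapterSketch):
    """Toy adapter that returns the prompt unchanged."""

    name = "echo"

    def run_task(self, task_id: str, prompt: str) -> TaskResult:
        return TaskResult(task_id=task_id, output=prompt, success=True)


# Hypothetical registry standing in for entry-point discovery.
ADAPTERS = {cls.name: cls for cls in (EchoAdapter,)}


def get_adapter(name: str) -> AdapterSketch:
    """Look up an adapter class by name and instantiate it."""
    return ADAPTERS[name]()


if __name__ == "__main__":
    print(get_adapter("echo").run_task("demo-task", "hello"))
```

In a real package the dictionary would typically be replaced by entry-point discovery, so that third-party adapters can be installed without modifying Claw Bench itself.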

## Task Library

**210 tasks** across **14 domains** and **4 difficulty levels** (L1–L4):

| Domain | Tasks | L1 | L2 | L3 | L4 |
|--------|------:|----:|----:|----:|----:|
| Calendar | 15 | 5 | 5 | 3 | 2 |
| Code Assistance | 15 | 3 | 6 | 4 | 2 |
| Communication | 15 | 3 | 5 | 6 | 1 |
| Cross-Domain | 15 | 0 | 0 | 8 | 7 |
| Data Analysis | 15 | 3 | 4 | 6 | 2 |
| Document Editing | 15 | 4 | 6 | 4 | 1 |
| Email | 15 | 3 | 6 | 5 | 1 |
| File Operations | 15 | 6 | 5 | 3 | 1 |
| Memory | 15 | 1 | 6 | 7 | 1 |
| Multimodal | 15 | 1 | 6 | 7 | 1 |
| Security | 15 | 3 | 5 | 4 | 3 |
| System Admin | 15 | 3 | 6 | 5 | 1 |
| Web Browsing | 15 | 3 | 6 | 5 | 1 |
| Workflow Automation | 15 | 2 | 6 | 6 | 1 |
| **Total** | **210** | **40** | **72** | **73** | **25** |
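
The totals in the table can be cross-checked with a few lines of Python. The counts below are copied from the table; the variable names are only illustrative.

```python
# (L1, L2, L3, L4) task counts per domain, copied from the table above.
TASKS = {
    "Calendar": (5, 5, 3, 2),
    "Code Assistance": (3, 6, 4, 2),
    "Communication": (3, 5, 6, 1),
    "Cross-Domain": (0, 0, 8, 7),
    "Data Analysis": (3, 4, 6, 2),
    "Document Editing": (4, 6, 4, 1),
    "Email": (3, 6, 5, 1),
    "File Operations": (6, 5, 3, 1),
    "Memory": (1, 6, 7, 1),
    "Multimodal": (1, 6, 7, 1),
    "Security": (3, 5, 4, 3),
    "System Admin": (3, 6, 5, 1),
    "Web Browsing": (3, 6, 5, 1),
    "Workflow Automation": (2, 6, 6, 1),
}

# Sum each difficulty column across all domains.
level_totals = tuple(sum(column) for column in zip(*TASKS.values()))
# Sum each row to get the per-domain task count.
domain_totals = {domain: sum(counts) for domain, counts in TASKS.items()}
grand_total = sum(level_totals)

print(level_totals, grand_total)  # (40, 72, 73, 25) 210
```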

## Fair Evaluation Design

Claw Bench addresses a key challenge: comparing frameworks that differ in their Skills ecosystems and model preferences.

- **Skills 3-Condition Comparison** (SkillsBench methodology): Each task is tested in `vanilla` (no skills), `curated` (Claw Bench standard skills), and `native` (framework's own skills) modes to isolate framework capability from ecosystem size.
- **Model Standardization**: Canonical model tiers (flagship/standard/economy/opensource) ensure fair cross-framework comparison. Frameworks are also tested with their best model configuration.
- **Cost-Performance Pareto Frontier**: Visualize optimal framework choices at any budget constraint.
- **Multi-Dimensional Scoring**: The overall score weights task completion (40%), efficiency (20%), security (15%), skills efficacy (15%), and UX (10%); the weight profile is switchable.
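
The scoring weights and the Pareto frontier mentioned above can be sketched as follows. This is a minimal illustration, not Claw Bench's actual scorer: the dictionary keys, the function names, and the assumption that every dimension is scored in the range 0 to 1 are all hypothetical.

```python
# Weights taken from the list above; the key names are assumptions.
WEIGHTS = {
    "task_completion": 0.40,
    "efficiency": 0.20,
    "security": 0.15,
    "skills_efficacy": 0.15,
    "ux": 0.10,
}


def composite_score(scores, weights=WEIGHTS):
    """Weighted average of per-dimension scores (each assumed in [0, 1])."""
    total_weight = sum(weights.values())
    return sum(weights[key] * scores[key] for key in weights) / total_weight


def pareto_frontier(points):
    """Keep the (cost, score) points that no cheaper-or-equal point beats.

    Points are scanned in order of increasing cost; a point joins the
    frontier only if its score exceeds every score seen so far.
    """
    frontier = []
    best_score = float("-inf")
    for cost, score in sorted(points, key=lambda p: (p[0], -p[1])):
        if score > best_score:
            frontier.append((cost, score))
            best_score = score
    return frontier


if __name__ == "__main__":
    example = {
        "task_completion": 0.8,
        "efficiency": 0.5,
        "security": 1.0,
        "skills_efficacy": 0.6,
        "ux": 0.7,
    }
    print(round(composite_score(example), 2))  # 0.73
    print(pareto_frontier([(1.0, 0.5), (2.0, 0.4), (3.0, 0.9), (3.0, 0.7)]))
```

Dividing by the total weight keeps the result in the same 0 to 1 range even if a custom weight profile does not sum exactly to 1.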

## Project Structure

```
claw-bench/
  src/claw_bench/       # Core library and CLI
    adapters/           # Framework adapters (e.g. openclaw, ironclaw, zeroclaw)
    core/               # Runner, verifier, scorer, metrics
    cli/                # Command-line interface
  tasks/                # 210 task definitions across 14 domains
    _schema/            # JSON Schema for task validation
  skills/curated/       # Curated skills for fair cross-framework testing
  config/               # Model tiers and skills profile config
  tests/                # Test suite (781 tests, 98% coverage)
  leaderboard/          # Next.js leaderboard frontend
  docs/                 # Documentation
  docker/               # Container images
```

## Development

```bash
git clone https://github.com/claw-bench/claw-bench.git
cd claw-bench
pip install -e ".[dev]"
pytest
```

See [CONTRIBUTING.md](CONTRIBUTING.md) for the full contribution guide.

## License

Apache-2.0. See [LICENSE](LICENSE) for details.
