Metadata-Version: 2.4
Name: skillbench
Version: 0.1.0
Summary: The development workbench for Agent Skills — develop, test, and evaluate skills across agent frameworks.
Author: SkillBench Contributors
License-Expression: MIT
Keywords: agent-skills,testing,evaluation,llm,ai-agents
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: typer>=0.12
Requires-Dist: rich>=13.0
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: python-frontmatter>=1.1.0
Requires-Dist: anthropic>=0.40
Requires-Dist: openai>=1.50
Requires-Dist: pytest>=8.0
Requires-Dist: jinja2>=3.1
Requires-Dist: aiosqlite>=0.20
Requires-Dist: httpx>=0.27
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.24; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: ruff>=0.5; extra == "dev"
Dynamic: license-file

# SkillBench

**The pytest for [Agent Skills](https://agentskills.io)** — develop, test, and evaluate skills across agent frameworks.

```
skills.sh     → discover skills
GitHub repos  → store skills
SkillBench     → develop, test, and evaluate skills
```

## Quick Start

```bash
git clone https://github.com/YOUR_USERNAME/skillbench.git
cd skillbench
python3 -m venv .venv && source .venv/bin/activate
pip install .

# Run an eval in under 60 seconds
skillbench pull vercel-labs/skills/skills/find-skills
skillbench test ~/.skillbench/skills/vercel-labs/skills/find-skills \
  --adapter azure --model YOUR_DEPLOYMENT
```

## What It Does

- **Pull** skills from any GitHub repo into a local cache
- **Test** skills against real agent frameworks (Claude, OpenAI, Azure, Aider, Codex, Cursor, Claude Code, OpenCode)
- **Evaluate** across multiple frameworks in one command and compare results
- **Track** run history with persistent SQLite storage and JSON/terminal reports

## Example: Testing the #1 Skill on skills.sh

```bash
# Pull the top skill (503K installs)
skillbench pull vercel-labs/skills/skills/find-skills

# Set up Azure OpenAI
export AZURE_OPENAI_API_KEY="your-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com"
export SKILLBENCH_JUDGE_MODEL="your-deployment"

# Run evaluation
skillbench test ~/.skillbench/skills/vercel-labs/skills/find-skills \
  --adapter azure --model your-deployment

# Cross-framework comparison
skillbench eval ~/.skillbench/skills/vercel-labs/skills/find-skills \
  --frameworks azure,aider --model your-deployment
```

## Define Scenarios

Scenarios are declarative YAML test cases:

```yaml
skill: find-skills
scenarios:
  - name: search-for-react-skill
    prompt: "How do I optimize my React app? Can you find a skill for that?"
    assertions:
      - type: tool_called
        tool: Bash
      - type: output_contains
        value: "skills"
      - type: llm_judge
        criteria: "Agent searched for React skills and presented install commands"
    tags: [core]
```

Assertion types: `tool_called`, `file_created`, `output_contains`, `llm_judge` (auto-detects Anthropic/OpenAI/Azure).

Or generate them automatically:

```bash
skillbench generate-scenarios ./my-skill/
```

## CLI Reference

| Command | Description |
|---------|-------------|
| `skillbench init <name>` | Scaffold a new skill with SKILL.md template |
| `skillbench validate <path>` | Validate SKILL.md against the spec |
| `skillbench pull <source>` | Fetch skills from GitHub |
| `skillbench generate-scenarios <path>` | AI-generate test scenarios |
| `skillbench test <path>` | Run scenarios with a single adapter |
| `skillbench eval <path>` | Evaluate across multiple adapters |
| `skillbench history` | View past run results |
| `skillbench report <run-id>` | Display a previous run report |
| `skillbench benchmark pull` | Pull the top 100 skills from skills.sh |
| `skillbench contribute <path>` | *(Coming in v0.2)* PR improvements back |

## Adapters

| Adapter | `--adapter` | Type | Requirements |
|---------|-------------|------|-------------|
| Claude API | `claude` | API | `ANTHROPIC_API_KEY` |
| OpenAI API | `openai` | API | `OPENAI_API_KEY` |
| Azure OpenAI | `azure` | API | `AZURE_OPENAI_API_KEY` + `AZURE_OPENAI_ENDPOINT` |
| Custom endpoint | `openai` | API | `OPENAI_API_KEY` + `OPENAI_BASE_URL` |
| Aider | `aider` | CLI | `aider` CLI installed (`pip install aider-chat`) |
| Claude Code | `claude-code` | CLI | `claude` CLI installed |
| Cursor | `cursor` | CLI | `agent` CLI + `CURSOR_API_KEY` |
| OpenAI Codex | `codex` | CLI | `codex` CLI installed (`npm i -g @openai/codex`) |
| OpenCode | `opencode` | CLI | `opencode` CLI installed ([opencode.ai](https://opencode.ai)) |

**API adapters** simulate tool use in a sandbox — the model sees Bash/Read/Write tools and SkillBench executes them in an isolated workspace.

**CLI adapters** run the actual agent binary with real filesystem access in a temp workspace — true end-to-end testing.

Use `--model` to override the default model/deployment:

```bash
skillbench test ./my-skill --adapter azure --model my-gpt4o-deployment
skillbench test ./my-skill --adapter aider --model azure/my-deployment
```

## Architecture

```
src/skillbench/
├── cli.py              # Typer CLI
├── core/
│   ├── skill.py        # SKILL.md parser + validator
│   ├── scenario.py     # Scenario models + YAML loader
│   ├── runner.py       # Scenario execution engine
│   ├── evaluator.py    # Assertions (deterministic + LLM judge)
│   └── scenario_gen.py # AI-assisted scenario generation
├── adapters/
│   ├── base.py         # Framework adapter protocol
│   ├── claude_api.py   # Anthropic Claude API
│   ├── openai_api.py   # OpenAI / Azure / compatible APIs
│   ├── aider.py        # Aider CLI
│   ├── claude_code.py  # Claude Code CLI
│   ├── cursor.py       # Cursor agent CLI
│   ├── codex.py        # OpenAI Codex CLI
│   └── opencode.py     # OpenCode CLI
├── registry/
│   ├── puller.py       # Pull skills from GitHub
│   ├── cache.py        # Local skill cache (~/.skillbench/skills/)
│   └── contribute.py   # PR workflow (v0.2)
├── store/
│   └── history.py      # SQLite run history
└── reports/
    └── generator.py    # Rich terminal + JSON reports
```

## CI/CD

```yaml
# .github/workflows/skill-eval.yml
name: Skill Evaluation
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install .
      - run: skillbench eval ./my-skill --frameworks azure -o results.json
        env:
          AZURE_OPENAI_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}
          AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
          SKILLBENCH_JUDGE_MODEL: ${{ secrets.SKILLBENCH_JUDGE_MODEL }}
```

Exit codes: `0` = all pass, `1` = failures. JSON output (`-o`) for machine-readable results.

## Benchmark Suite

SkillBench includes a manifest of the top 100 skills from [skills.sh](https://skills.sh):

```bash
skillbench benchmark pull     # Pull all skills
```

## Requirements

- Python 3.11+
- API keys for the adapters you want to use

## Contributing

Contributions welcome! To get started:

```bash
git clone https://github.com/YOUR_USERNAME/skillbench.git
cd skillbench
python3 -m venv .venv && source .venv/bin/activate
pip install .[dev]
pytest tests/ -v
```

Please open an issue before submitting large changes.

## License

MIT
