Metadata-Version: 2.3
Name: sage-evaluator
Version: 1.1.0
Summary: CLI tool for validating, benchmarking, and optimizing Sage agent configurations
Author: Sage Choi
Author-email: Sage Choi <iamsagebynature@gmail.com>
License: MIT License
         
         Copyright (c) 2026 Sage Choi
         
         Permission is hereby granted, free of charge, to any person obtaining a copy
         of this software and associated documentation files (the "Software"), to deal
         in the Software without restriction, including without limitation the rights
         to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
         copies of the Software, and to permit persons to whom the Software is
         furnished to do so, subject to the following conditions:
         
         The above copyright notice and this permission notice shall be included in all
         copies or substantial portions of the Software.
         
         THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
         IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
         FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
         AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
         LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
         OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
         SOFTWARE.
Requires-Dist: sage-agent<2.0.0
Requires-Dist: click>=8.0
Requires-Dist: rich>=13.0
Requires-Dist: azure-ai-inference>=1.0.0b1
Requires-Dist: azure-identity>=1.15
Requires-Dist: azure-mgmt-cognitiveservices>=13.5
Requires-Dist: azure-mgmt-resource>=23.0
Requires-Dist: azure-mgmt-resource-subscriptions>=1.0.0b1
Requires-Dist: pyyaml>=6.0
Requires-Python: >=3.10
Description-Content-Type: text/markdown

# Sage Evaluator

CLI tool for validating, benchmarking, and optimizing Sage agent configurations.

## Why This Tool Exists

Building effective AI agents is iterative. You write a system prompt, choose a model, wire up tools, and hope for the best. But how do you know if your configuration is correct? How do you pick between GPT-4o and Claude when both "seem fine"? How do you catch the configuration mistake that makes your agent silently worse?

Sage Evaluator exists to bring rigor to that process. It provides four capabilities that address the core challenges of agent development:

1. **Validation** -- Catch configuration errors before they reach production. Invalid permissions, malformed extension paths, missing frontmatter fields, unreachable subagent paths, and suspicious defaults are surfaced immediately instead of manifesting as mysterious runtime failures.

2. **Benchmarking** -- Compare models objectively. Rather than gut-checking outputs by hand, run the same agent intent across multiple models and get back token usage, latency, cost estimates, and LLM-as-judge quality scores in a single report.

3. **Suggestion** -- Get actionable feedback on your agent configuration. The analyzer identifies prompt improvements, opportunities to extract logic into tools, guardrail candidates, and architectural changes -- then optionally generates the code.

4. **Comparison** -- A/B test configuration changes. Run two versions of the same agent against identical conditions and see exactly what changed in quality, cost, and speed.

## Requirements

- Python 3.10+
- [uv](https://docs.astral.sh/uv/) package manager
- Azure credentials (for `discover` and model access via Azure Cognitive Services)

## Installation

```bash
uv tool install sage-evaluator
```

### Development Setup

```bash
git clone <repository-url>
cd sage-evaluator
make install
```

This runs `uv sync --frozen --group dev` and installs pre-commit hooks for linting and commit message validation.

If you need to update dependencies:

```bash
make update
```

## Configuration

Create a `.env` file in the project root (or export these variables):

```env
# Required for Azure model access
AZURE_AI_API_BASE=https://<your-endpoint>.services.ai.azure.com

# Required for the discover command (account name, not full URL)
AZURE_AI_ACCOUNT_NAME=<your-account-name>

# Optional -- used by discover to skip auto-discovery
AZURE_SUBSCRIPTION_ID=<your-subscription-id>
AZURE_RESOURCE_GROUP=<your-resource-group>

# Optional -- defaults to azure_ai/claude-opus-4-6
EVALUATOR_MODEL=azure_ai/claude-opus-4-6
```
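
For scripts that need the same values, a minimal sketch of reading these variables in Python is shown below. It assumes only what the comments above state: `AZURE_AI_API_BASE` is required and `EVALUATOR_MODEL` falls back to `azure_ai/claude-opus-4-6`. The `Settings` class and `load_settings` function are hypothetical names for illustration, not part of the package's API.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_EVALUATOR_MODEL = "azure_ai/claude-opus-4-6"


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables documented above."""

    api_base: str
    account_name: Optional[str]
    subscription_id: Optional[str]
    resource_group: Optional[str]
    evaluator_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping (defaults to the process environment)."""
    api_base = env.get("AZURE_AI_API_BASE")
    if not api_base:
        raise ValueError("AZURE_AI_API_BASE is required for Azure model access")
    return Settings(
        api_base=api_base,
        account_name=env.get("AZURE_AI_ACCOUNT_NAME"),
        subscription_id=env.get("AZURE_SUBSCRIPTION_ID"),
        resource_group=env.get("AZURE_RESOURCE_GROUP"),
        # An unset or empty EVALUATOR_MODEL falls back to the documented default.
        evaluator_model=env.get("EVALUATOR_MODEL") or DEFAULT_EVALUATOR_MODEL,
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test.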

## Commands

### `evaluate validate`

Validates agent and skill markdown configuration files. Runs three levels of checks:

- **Structural** -- YAML frontmatter parsing and Pydantic model validation
- **Semantic** -- Model identifier format, extension module path verification, permission value validation, subagent path resolution
- **Best-practice** -- Heuristic warnings (default `max_turns`, short prompt bodies, missing permissions or extensions)

```bash
# Validate a single file
evaluate validate ./my-agent/AGENTS.md

# Validate a directory (looks for AGENTS.md inside)
evaluate validate ./my-agent

# Validate multiple paths
evaluate validate ./agent-a ./agent-b ./shared-skill.md

# Strict mode: treat warnings as errors
evaluate validate ./my-agent --strict

# JSON output (for CI pipelines)
evaluate validate ./my-agent --format json
```

Exit codes: `0` if all files pass, `1` if any file has errors.
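
Because the result is reported through the exit code, the command is easy to gate on from a Python script. The sketch below uses only the flags documented above; `build_validate_command` and `validation_passes` are hypothetical helper names, and the injectable `run` parameter exists only to make the helper testable without the CLI installed.

```python
import subprocess
from typing import Callable, Sequence


def build_validate_command(
    paths: Sequence[str], strict: bool = False, json_output: bool = False
) -> list[str]:
    """Assemble an `evaluate validate` invocation from the documented flags."""
    cmd = ["evaluate", "validate", *paths]
    if strict:
        cmd.append("--strict")
    if json_output:
        cmd += ["--format", "json"]
    return cmd


def validation_passes(
    paths: Sequence[str],
    strict: bool = False,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True when the CLI exits with 0, meaning all files passed."""
    result = run(
        build_validate_command(paths, strict=strict),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

A CI step could then call `validation_passes(["./my-agent"], strict=True)` and fail the build when it returns `False`.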

### `evaluate discover`

Lists models deployed in an Azure Cognitive Services account. Useful for seeing what's available before benchmarking. Subscription and resource group are auto-discovered from the account name unless explicitly provided.

```bash
# List deployed models (auto-discovers subscription and resource group)
evaluate discover --account-name my-aisvcs-account

# Include per-token pricing
evaluate discover --account-name $AZURE_AI_ACCOUNT_NAME --include-pricing

# Skip auto-discovery by providing subscription and resource group
evaluate discover --account-name $AZURE_AI_ACCOUNT_NAME \
  --subscription $AZURE_SUBSCRIPTION_ID \
  --resource-group $AZURE_RESOURCE_GROUP

# JSON output
evaluate discover --account-name $AZURE_AI_ACCOUNT_NAME --include-pricing --format json
```

| Option | Default | Description |
|--------|---------|-------------|
| `--account-name` | *(required)* | Azure Cognitive Services account name (or `AZURE_AI_ACCOUNT_NAME` env var) |
| `--subscription` | *(auto-discovered)* | Azure subscription ID (or `AZURE_SUBSCRIPTION_ID` env var) |
| `--resource-group` | *(auto-discovered)* | Azure resource group (or `AZURE_RESOURCE_GROUP` env var) |
| `--include-pricing` | `false` | Enrich each model with per-token pricing |
| `--format` | `text` | Output format: `text` or `json` |

Pricing is resolved through a 3-tier lookup: litellm data is checked first, then a hard-coded fallback table for common models. If neither has an entry, the price is reported as unknown.
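
The lookup order can be sketched as a simple fall-through. This is an illustration rather than the code in `pricing.py`: the `resolve_pricing` name, the `(input, output)` price tuple, and the two mappings are assumptions made for the example.

```python
from typing import Mapping, Optional, Tuple

# Assumed shape: (input price, output price) per token, in USD.
Price = Tuple[float, float]


def resolve_pricing(
    model: str,
    primary: Mapping[str, Price],
    fallback: Mapping[str, Price],
) -> Optional[Price]:
    """Return pricing from the first tier that knows the model, else None."""
    if model in primary:    # tier 1: e.g. data sourced from litellm
        return primary[model]
    if model in fallback:   # tier 2: hard-coded table for common models
        return fallback[model]
    return None             # tier 3: pricing is unknown
```

Returning `None` for the third tier lets callers distinguish "unknown" from a genuine price of zero.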

### `evaluate benchmark`

Benchmarks an agent configuration across one or more models. This is the core workflow:

1. Elaborates the user intent via LLM (clarifying expected outcomes and evaluation criteria)
2. Runs the agent with each specified model (multiple times if `--runs` is greater than 1)
3. Captures metrics: token usage, latency, tool calls
4. Scores outputs using an LLM-as-judge against a rubric
5. Estimates costs using pricing data
6. Ranks models by weighted composite score

```bash
# Basic benchmark
evaluate benchmark ./my-agent/AGENTS.md \
  --models gpt-4o azure_ai/claude-opus-4-6 \
  --intent "Answer user questions about Python programming"

# Multiple runs with code generation rubric
evaluate benchmark ./my-agent/AGENTS.md \
  --models gpt-4o azure_ai/claude-opus-4-6 \
  --intent "Generate a REST API for a todo app" \
  --rubric code_generation \
  --runs 3

# Skip quality scoring (metrics only)
evaluate benchmark ./my-agent/AGENTS.md \
  --models gpt-4o \
  --intent "Summarize a document" \
  --no-judge

# Save report to file
evaluate benchmark ./my-agent/AGENTS.md \
  --models gpt-4o \
  --intent "Debug a failing test" \
  --output report.json
```

| Option | Default | Description |
|--------|---------|-------------|
| `--models`, `-m` | *(required)* | Model identifiers to benchmark (repeatable) |
| `--intent` | *(required)* | User intent to benchmark against |
| `--rubric` | `default` | Built-in name or path to YAML rubric |
| `--runs` | `1` | Number of runs per model |
| `--no-judge` | `false` | Skip LLM-as-judge evaluation |
| `--account-name` | *(none)* | Azure Cognitive Services account name for model discovery |
| `--subscription` | *(none)* | Azure subscription ID (used with `--account-name`) |
| `--resource-group` | *(none)* | Azure resource group (used with `--account-name`) |
| `--output` | *(none)* | Save report to JSON file |
| `--format` | `text` | Output format: `text` or `json` |
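
To make steps 2, 3, and 5 of the workflow concrete, the sketch below averages metrics over several runs of one model and estimates cost from per-token prices. The `RunMetrics` fields, the function names, and the choice of a plain mean are assumptions for illustration; the actual aggregation in `collector.py` may differ.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class RunMetrics:
    """Hypothetical per-run record; field names are illustrative."""

    input_tokens: int
    output_tokens: int
    latency_s: float


def estimate_cost(run: RunMetrics, input_price: float, output_price: float) -> float:
    """Cost of one run in USD, given per-token prices."""
    return run.input_tokens * input_price + run.output_tokens * output_price


def summarize(
    runs: Sequence[RunMetrics], input_price: float, output_price: float
) -> dict:
    """Average the captured metrics across all runs of a single model."""
    if not runs:
        raise ValueError("at least one run is required")
    return {
        "runs": len(runs),
        "mean_latency_s": mean(r.latency_s for r in runs),
        "mean_input_tokens": mean(r.input_tokens for r in runs),
        "mean_output_tokens": mean(r.output_tokens for r in runs),
        "mean_cost_usd": mean(
            estimate_cost(r, input_price, output_price) for r in runs
        ),
    }
```

Averaging over `--runs` greater than 1 smooths out run-to-run variation in latency and token counts.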

### `evaluate suggest`

Analyzes an agent configuration and returns optimization suggestions across four categories:

- **Prompt improvement** -- Wording, structure, and clarity changes
- **Tool extraction** -- Logic that should be moved from the prompt into `@tool` functions
- **Guardrail** -- Input/output validation that should be enforced programmatically
- **Architecture** -- Structural changes (subagent decomposition, model selection, etc.)

```bash
# Analyze and get suggestions
evaluate suggest ./my-agent/AGENTS.md

# Generate @tool function code from suggestions
evaluate suggest ./my-agent --generate-tools

# Generate guardrail validation functions
evaluate suggest ./my-agent --generate-guardrails

# Both, with JSON output
evaluate suggest ./my-agent \
  --generate-tools --generate-guardrails \
  --format json --output suggestions.json
```

| Option | Default | Description |
|--------|---------|-------------|
| `--generate-tools` | `false` | Generate `@tool` function code |
| `--generate-guardrails` | `false` | Generate guardrail validation functions |
| `--output` | *(none)* | Save report to JSON file |
| `--format` | `text` | Output format: `text` or `json` |
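
As a rough idea of what a guardrail in the sense above could look like, here is a hand-written sketch of a programmatic output check. It is not output from `--generate-guardrails`; the function name, the length limit, and the credential pattern are all hypothetical.

```python
import re
from typing import List

# Illustrative pattern only: matches text that looks like a credential
# assignment, such as "api_key=..." or "password: ...".
_SECRET_PATTERN = re.compile(r"(?i)\b(api[_-]?key|secret|password)\s*[:=]\s*\S+")


def check_output(text: str, max_chars: int = 4000) -> List[str]:
    """Return a list of violations; an empty list means the output passes."""
    violations: List[str] = []
    if not text.strip():
        violations.append("output is empty")
    if len(text) > max_chars:
        violations.append(f"output exceeds {max_chars} characters")
    if _SECRET_PATTERN.search(text):
        violations.append("output appears to contain a credential")
    return violations
```

The check is plain Python with no model call, so the same input always produces the same verdict.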

### `evaluate compare`

Runs two agent configurations through the same benchmark and produces a side-by-side comparison. Useful for A/B testing configuration changes.

```bash
evaluate compare ./agent-v1/AGENTS.md ./agent-v2/AGENTS.md \
  --models gpt-4o azure_ai/claude-opus-4-6 \
  --intent "Answer user questions about Python" \
  --output comparison.json
```

Accepts the same options as `benchmark` (`--models`, `--intent`, `--rubric`, `--runs`, `--no-judge`, `--account-name`, `--subscription`, `--resource-group`, `--output`, `--format`).
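
A side-by-side comparison boils down to per-metric deltas between the two runs. The sketch below assumes each report can be reduced to a flat mapping of metric name to number; the metric names and the `diff_metrics` helper are hypothetical and do not reflect the schema of `comparison.json`.

```python
from typing import Dict, Mapping, Optional


def diff_metrics(
    baseline: Mapping[str, float], candidate: Mapping[str, float]
) -> Dict[str, Dict[str, Optional[float]]]:
    """Absolute and relative change for every metric present in both reports."""
    deltas: Dict[str, Dict[str, Optional[float]]] = {}
    for name in sorted(baseline.keys() & candidate.keys()):
        before, after = baseline[name], candidate[name]
        change = after - before
        deltas[name] = {
            "before": before,
            "after": after,
            "delta": change,
            # A percentage is undefined when the baseline is zero.
            "pct": (change / before * 100.0) if before else None,
        }
    return deltas
```

For example, `diff_metrics({"quality": 8.0}, {"quality": 9.0})` reports a delta of `1.0` and a relative change of `12.5` percent.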

## Evaluation Rubrics

The benchmark command scores agent outputs using an LLM-as-judge against a rubric. Three rubrics are built in:

**`default`** -- General-purpose evaluation:

| Dimension | Weight | Description |
|-----------|--------|-------------|
| relevance | 2.0 | How well the output addresses the user's intent |
| accuracy | 2.0 | Factual and technical correctness |
| completeness | 1.5 | Whether all aspects of the request are covered |
| clarity | 1.0 | Structure and readability |
| efficiency | 1.0 | Appropriate tool use, no unnecessary steps |

**`code_generation`** -- For code tasks:

| Dimension | Weight | Description |
|-----------|--------|-------------|
| correctness | 2.5 | Functional correctness and edge case handling |
| completeness | 2.0 | All requested functionality implemented |
| code_quality | 1.5 | Naming, structure, DRY principles |
| security | 1.5 | Avoids common vulnerabilities |
| documentation | 1.0 | Comments and docstrings |

**`qa`** -- For question-answering:

| Dimension | Weight | Description |
|-----------|--------|-------------|
| accuracy | 2.5 | Factual correctness |
| relevance | 2.0 | Directly addresses the question |
| depth | 1.5 | Thoroughness of explanation |
| source_usage | 1.0 | Use of tools and references |
| conciseness | 1.0 | Avoids unnecessary verbosity |
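
To illustrate how the weights could combine per-dimension scores into a single number, the sketch below computes a weight-normalized mean using the `default` rubric. This is an assumption: the exact formula and the judge's score scale are not specified here, and `composite_score` is a hypothetical name.

```python
from typing import Mapping

# Weights copied from the `default` rubric table above.
DEFAULT_WEIGHTS = {
    "relevance": 2.0,
    "accuracy": 2.0,
    "completeness": 1.5,
    "clarity": 1.0,
    "efficiency": 1.0,
}


def composite_score(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weight-normalized mean of per-dimension scores."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight
```

Normalizing by the total weight keeps the composite on the same scale as the individual scores, so rubrics with different weight totals remain comparable.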

### Custom Rubrics

Create a YAML file and pass it via `--rubric path/to/rubric.yaml`:

```yaml
name: my_rubric
description: Custom rubric for my use case
dimensions:
  - name: accuracy
    description: Factual correctness of the response
    weight: 2.0
  - name: tone
    description: Professional and helpful tone
    weight: 1.5
  - name: actionability
    description: Provides clear next steps
    weight: 1.0
```
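
Before passing a custom rubric to the CLI, it can help to sanity-check its shape. The sketch below works on the dictionary you would get from parsing the YAML (for instance with PyYAML's `yaml.safe_load`). The rules it enforces (non-empty name, unique dimension names, positive weights) are assumptions inferred from the example above, not the loader's documented behavior, and `rubric_problems` is a hypothetical helper.

```python
from collections.abc import Mapping
from typing import Any, List


def rubric_problems(rubric: Mapping[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems: List[str] = []
    name = rubric.get("name")
    if not isinstance(name, str) or not name:
        problems.append("'name' must be a non-empty string")

    dimensions = rubric.get("dimensions")
    if not isinstance(dimensions, list) or not dimensions:
        problems.append("'dimensions' must be a non-empty list")
        return problems

    seen = set()
    for index, dim in enumerate(dimensions):
        if not isinstance(dim, Mapping):
            problems.append(f"dimension {index} must be a mapping")
            continue
        dim_name = dim.get("name")
        if not isinstance(dim_name, str) or not dim_name:
            problems.append(f"dimension {index} needs a 'name'")
        elif dim_name in seen:
            problems.append(f"duplicate dimension name: {dim_name}")
        else:
            seen.add(dim_name)
        weight = dim.get("weight")
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(weight, bool) or not isinstance(weight, (int, float)) or weight <= 0:
            problems.append(f"dimension {index} needs a positive numeric 'weight'")
    return problems
```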

## Architecture

```
skills/                         # Sage agent skills
├── create-sage-agent/
│   └── SKILL.md                # Agent scaffolding skill
└── evaluate-sage-agent/
    └── SKILL.md                # Agent evaluation skill
sage_evaluator/
├── cli/
│   └── main.py              # Click CLI with 5 commands
├── validation/
│   └── validator.py          # 3-level config validation
├── benchmark/
│   ├── engine.py             # Orchestrates the benchmark pipeline
│   ├── runner.py             # InstrumentedProvider for metrics capture
│   └── collector.py          # Multi-run metric aggregation
├── evaluation/
│   ├── judge.py              # LLM-as-judge scoring
│   └── rubrics.py            # Built-in and YAML rubric loading
├── discovery/
│   ├── azure_models.py       # Azure AI Foundry model discovery
│   └── pricing.py            # 3-tier pricing lookup
├── suggestion/
│   ├── analyzer.py           # Prompt and config analysis
│   ├── tool_generator.py     # @tool function code generation
│   └── guardrail_generator.py  # Guardrail function generation
├── reporting/
│   ├── terminal.py           # Rich terminal output
│   └── json_export.py        # JSON report serialization
├── models.py                 # All Pydantic data models
└── exceptions.py             # Exception hierarchy
```

Key design decisions:

- **Async-first** -- Benchmark execution, model discovery, and suggestion analysis use `asyncio` for parallel operations.
- **Instrumentation via wrapping** -- `InstrumentedProvider` wraps the Sage `LiteLLMProvider` to capture metrics without modifying the agent runtime.
- **Deterministic guardrails** -- Generated guardrail functions are pure validation logic with no LLM calls at runtime.
- **Strongly typed** -- All data flows through Pydantic models for validation and serialization.
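
The wrapping and async-first ideas can be sketched together. The example below is not the real `InstrumentedProvider`: the `LiteLLMProvider` interface is not shown in this README, so the provider here is a hypothetical async callable that returns a dict with `text` and `usage` keys.

```python
import asyncio
import time
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable, Dict, List

# Hypothetical provider signature, assumed for this sketch only.
Complete = Callable[[str], Awaitable[Dict[str, Any]]]


@dataclass
class InstrumentedCompletion:
    """Wraps an async completion callable and records latency and token usage."""

    inner: Complete
    calls: List[Dict[str, Any]] = field(default_factory=list)

    async def __call__(self, prompt: str) -> Dict[str, Any]:
        start = time.perf_counter()
        response = await self.inner(prompt)
        self.calls.append(
            {
                "latency_s": time.perf_counter() - start,
                "total_tokens": response.get("usage", {}).get("total_tokens", 0),
            }
        )
        return response  # returned unchanged, so callers see no difference


async def fake_provider(prompt: str) -> Dict[str, Any]:
    """Stand-in for a real model call."""
    await asyncio.sleep(0)
    return {"text": prompt.upper(), "usage": {"total_tokens": len(prompt.split())}}


async def main() -> List[Dict[str, Any]]:
    wrapped = InstrumentedCompletion(fake_provider)
    # Issue several requests concurrently, as an async-first design allows.
    await asyncio.gather(wrapped("hello world"), wrapped("one two three"))
    return wrapped.calls
```

Running `asyncio.run(main())` returns one metrics record per call, while the wrapped callable itself is left untouched.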

## Skills

Sage Evaluator ships with two skills that can be loaded by any Sage agent to automate the evaluation workflow:

### `create-sage-agent`

Scaffolds a new agent from a natural language description. Walks the user through:

1. Describing what the agent should do
2. Analyzing the intent to infer permissions, model, `max_turns`, and subagent opportunities
3. Generating a complete `AGENTS.md` with a tailored system prompt
4. Validating the result
5. Optionally handing off to `evaluate-sage-agent` for optimization

### `evaluate-sage-agent`

Runs a full evaluation pipeline on an existing agent configuration:

1. Validates the config format
2. Determines the appropriate rubric (`code_generation`, `qa`, or `default`) from the agent's purpose
3. Generates optimization suggestions grouped by category
4. Offers to apply suggestions with versioned backups (`AGENTS.v1.md`, `AGENTS.v2.md`, etc.)
5. Benchmarks against the agent's model (with optional model comparison)
6. Compares before/after if changes were applied
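
The versioned-backup step can be pictured with a small sketch. It assumes backups are written next to the original as `AGENTS.v1.md`, `AGENTS.v2.md`, and so on; the helper names are hypothetical and the skill's actual behavior may differ.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def next_backup_path(config: Path) -> Path:
    """Return the first unused `<stem>.vN<suffix>` path next to `config`."""
    n = 1
    while True:
        candidate = config.with_name(f"{config.stem}.v{n}{config.suffix}")
        if not candidate.exists():
            return candidate
        n += 1


def backup(config: Path) -> Path:
    """Copy `config` to its next versioned backup and return the new path."""
    target = next_backup_path(config)
    shutil.copy2(config, target)
    return target


def demo() -> List[str]:
    """Create two backups of a throwaway AGENTS.md and return their names."""
    with tempfile.TemporaryDirectory() as tmp:
        config = Path(tmp) / "AGENTS.md"
        config.write_text("---\nname: my-agent\n---\n")
        return [backup(config).name, backup(config).name]
```

Scanning for the first unused number means existing backups are never overwritten.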

### Using Skills

To use these skills with a Sage agent, point the agent's `skills_dir` to the evaluator's skills directory:

```yaml
---
name: my-agent
model: azure_ai/gpt-4o
skills_dir: /path/to/sage-evaluator/skills
---
```

Or copy the skill files into your agent's own skills directory:

```bash
cp -r /path/to/sage-evaluator/skills/create-sage-agent ./my-agent/skills/
cp -r /path/to/sage-evaluator/skills/evaluate-sage-agent ./my-agent/skills/
```

The skills auto-install sage-evaluator via `uv pip install` if the `evaluate` CLI is not available.

## Development

```bash
# Run the full quality pipeline (lint + format + type-check + tests)
make test

# Run only tests
make test-only

# Individual checks
make lint          # ruff check with auto-fix
make format        # ruff format
make type-check    # mypy

# Clean build artifacts
make clean
```

### Running Tests

Tests use `pytest` with `pytest-asyncio` for async test support and `pytest-mock` for mocking:

```bash
# All tests
uv run pytest -v tests

# Specific test module
uv run pytest -v tests/test_validation/
uv run pytest -v tests/test_benchmark/

# Single test
uv run pytest -v tests/test_cli/test_validate.py -k "test_validate_strict"
```

### CI/CD

The project uses GitHub Actions for continuous integration and release management:

- **CI** (`ci.yml`) -- Runs `make test` (lint, format, type-check, pytest) on pull requests to `main`.
- **Release** (`release.yml`) -- On push to `main` or `dev`, runs [Python Semantic Release](https://python-semantic-release.readthedocs.io/) to version, tag, and publish to PyPI. Pushes to `dev` produce release candidates.
