Metadata-Version: 2.4
Name: agent-eval-mcp
Version: 0.2.0
Summary: Deterministic evaluation tools for AI coding agents, exposed as an MCP server.
Project-URL: Homepage, https://github.com/nicolaemorcov/agent-eval-mcp
Project-URL: Repository, https://github.com/nicolaemorcov/agent-eval-mcp
Project-URL: Changelog, https://github.com/nicolaemorcov/agent-eval-mcp/blob/main/CHANGELOG.md
Author-email: Nicolae Morcov <nicolaemorcov@gmail.com>
License: MIT
License-File: LICENSE
Keywords: ai-agents,code-review,evaluation,mcp
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.10
Requires-Dist: mcp[cli]>=1.0.0
Requires-Dist: requests>=2.31.0
Requires-Dist: unidiff>=0.6.0
Description-Content-Type: text/markdown

# 🛡️ agent-eval-mcp

**Deterministic Evaluation and Guardrails for AI Coding Agents.**

[![MCP Compatible](https://img.shields.io/badge/MCP-Compatible-green.svg)](#)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](#)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](#)

Building autonomous coding agents is easy. Evaluating whether their output is actually good is hard.

`agent-eval-mcp` is a stateless, deterministic Model Context Protocol (MCP) server that stops AI agents from writing lazy, unverified, or hallucinated code. It provides language-agnostic rulesets and hybrid scoring to grade AI-generated revisions *before* they get merged.

## ⚠️ The Problem

When you ask an LLM to evaluate its own code, it suffers from sycophancy. It will confidently tell you its fix is perfect, even when it has:
* Generated dummy patterns like `new HashMap<>()` or `pass`.
* Left `// TODO: implement this` in the production patch.
* Hallucinated the surrounding `SEARCH/REPLACE` context, breaking the Git patch.

## 💡 The Solution

This package exposes objective evaluation tools to your agentic workflows via the **Model Context Protocol (MCP)**. It evaluates AI-generated code edits using fuzzy matching and language-specific Abstract Syntax Tree (AST) rules (Java, Python, TypeScript, Go) to catch hallucinations deterministically.

It supports two edit formats:
- **Cursor SEARCH/REPLACE blocks** — for interactive agentic coding sessions.
- **Standard unified diffs** — for CI/CD pipelines and GitHub Action workflows, where diffs come from Pull Requests or `git diff` output.

It completely decouples the heavy lifting of code validation from your LLM orchestration layer.

## 🔧 Available Tools

| Tool | Input Format | Use Case |
|---|---|---|
| `verify_fix` | Cursor `<<<< SEARCH ==== >>>> REPLACE` blocks | Interactive agentic coding sessions in Cursor |
| `verify_unified_diff` | Standard unified diff (`git diff` / GitHub PR) | CI/CD pipelines and GitHub Action workflows |
| `evaluate_revision_quality` | Behavioral signals (reflexion count, fetch count) | Scoring AI revision quality objectively |

### `verify_fix`
Validates a Cursor-style `<<<< SEARCH ==== >>>> REPLACE` edit through four sequential checks: format, fuzzy grounding against the original file, dummy-pattern detection (AST + regex), and no-op detection.
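
The four checks can be pictured as a short-circuiting pipeline. The sketch below is illustrative only and is not the package's implementation: the exact marker layout, the `check_edit` and `best_window_ratio` names, the 0.9 similarity threshold, the regex-only dummy patterns, and every stage name except `dummy_patterns` are assumptions. The real tool also applies AST rules, which this sketch omits.

```python
import difflib
import re

# Assumed marker layout for this sketch; the real format may differ.
BLOCK_RE = re.compile(
    r"<<<< SEARCH\n(?P<search>.*?)\n====\n(?P<replace>.*?)\n>>>> REPLACE",
    re.DOTALL,
)
# Hypothetical regex stand-ins for the dummy-pattern rules.
DUMMY_PATTERNS = [r"\bTODO\b", r"^\s*pass\s*$", r"new HashMap<>\(\)"]


def best_window_ratio(search: str, file_content: str) -> float:
    """Best similarity between `search` and any equally tall window of the file."""
    needle = search.splitlines()
    lines = file_content.splitlines()
    if not needle or len(needle) > len(lines):
        return 0.0
    best = 0.0
    for start in range(len(lines) - len(needle) + 1):
        window = "\n".join(lines[start:start + len(needle)])
        best = max(best, difflib.SequenceMatcher(None, search, window).ratio())
    return best


def check_edit(edit: str, file_content: str, threshold: float = 0.9) -> dict:
    """Run format, grounding, dummy-pattern, and no-op checks in order."""
    match = BLOCK_RE.search(edit)
    if match is None:
        return {"accepted": False, "stage_failed": "format"}
    search, replace = match["search"], match["replace"]
    # Grounding: the SEARCH text must closely resemble some region of the file.
    if best_window_ratio(search, file_content) < threshold:
        return {"accepted": False, "stage_failed": "grounding"}
    # Dummy patterns: reject placeholder code in the replacement.
    for pattern in DUMMY_PATTERNS:
        if re.search(pattern, replace, re.MULTILINE):
            return {"accepted": False, "stage_failed": "dummy_patterns"}
    # No-op: the replacement must actually change something.
    if search.strip() == replace.strip():
        return {"accepted": False, "stage_failed": "no_op"}
    return {"accepted": True, "stage_failed": None}
```

Because each stage returns as soon as it fails, the caller always learns the earliest failing stage, which is usually the most useful hint to feed back to the agent.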

### `verify_unified_diff`
Validates a **standard unified diff** — the patch format produced by `git diff` or visible in a GitHub Pull Request's "Files changed" view.

- The diff must target **exactly one file**. To obtain a single-file diff from git, run:
  ```bash
  git diff HEAD -- path/to/file.py
  ```
- Runs the same four-stage pipeline as `verify_fix`: format → grounding → dummy-pattern detection → no-op check.
- Supports the same `language` (`java`, `python`, `typescript`, `go`) and `custom_patterns` arguments.
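
When a change touches several files, one way to meet the single-file requirement is to split the combined diff before calling the tool. A minimal sketch, assuming standard `diff --git a/... b/...` headers as emitted by `git diff`; `split_unified_diff` is a hypothetical helper and not part of this package:

```python
import re

# Matches the per-file header line that `git diff` emits.
HEADER_RE = re.compile(r"diff --git a/(.+) b/(.+)$")


def split_unified_diff(diff_text: str) -> dict[str, str]:
    """Split multi-file `git diff` output into one patch per file.

    Keys are the post-change (b/) paths. Paths that themselves contain
    " b/" are ambiguous and not handled by this sketch.
    """
    patches: dict[str, str] = {}
    current_path = None
    current_lines: list[str] = []
    for line in diff_text.splitlines(keepends=True):
        header = HEADER_RE.match(line.rstrip("\n"))
        if header:
            # A new file section starts: flush the previous one.
            if current_path is not None:
                patches[current_path] = "".join(current_lines)
            current_path = header.group(2)
            current_lines = []
        if current_path is not None:
            current_lines.append(line)
    if current_path is not None:
        patches[current_path] = "".join(current_lines)
    return patches
```

Each value can then be passed as the diff for one file, alongside that file's original content.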

### `evaluate_revision_quality`
Derives a deterministic quality score (1–10) from observable facts about what the agent *actually did* during diagnosis — facts that cannot be fabricated by the LLM. Use it as a ceiling on the LLM's own self-reported confidence: `final_score = min(llm_self_score, objective_score)`.

**Signal table:**

| Signal | Condition | Penalty |
|---|---|---|
| `reflexion_count` | == 1 | −2 |
| `reflexion_count` | ≥ 2 | −5 |
| `fetch_call_count` | 3–4 | −1 |
| `fetch_call_count` | ≥ 5 | −2 |
| `has_related_files` | `False` | −1 |
| `has_file_content` | `False` | −2 |
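
Read as arithmetic, the table amounts to subtracting penalties from a starting score. The sketch below assumes the score starts at 10 and is clamped to a minimum of 1. Neither is stated explicitly here, although both are consistent with the documented 1 to 10 range. `objective_score` and `final_score` are hypothetical names, not the tool's API.

```python
def objective_score(
    reflexion_count: int,
    fetch_call_count: int,
    has_related_files: bool,
    has_file_content: bool,
) -> int:
    """Apply the signal-table penalties to an assumed starting score of 10."""
    score = 10  # assumption: penalties are subtracted from the top of the range
    if reflexion_count == 1:
        score -= 2
    elif reflexion_count >= 2:
        score -= 5
    if 3 <= fetch_call_count <= 4:
        score -= 1
    elif fetch_call_count >= 5:
        score -= 2
    if not has_related_files:
        score -= 1
    if not has_file_content:
        score -= 2
    return max(1, score)  # assumption: clamp to the documented minimum of 1


def final_score(llm_self_score: int, objective: int) -> int:
    """Use the objective score as a ceiling on the LLM's self-reported score."""
    return min(llm_self_score, objective)
```

For example, an agent that needed one retry, made three fetches, and never read a related file would score 10 − 2 − 1 − 1 = 6 under these assumptions, so a self-reported 9 would be capped at 6.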

**How to populate each signal (framework-agnostic):**

- **`reflexion_count`** — your retry counter: the number of times `verify_fix` or `verify_unified_diff` returned `accepted=False` before the current passing call. Pass `0` on the first accepted attempt.
- **`fetch_call_count`** — count every file-read tool call your agent made during diagnosis (e.g. `read_file`, `get_file_contents`, `fetch_github_file`, or equivalent in your framework).
- **`has_related_files`** — `True` if your agent retrieved at least one file *other than* the primary failing file (an import, a test file, a schema, a dependency).
- **`has_file_content`** — `True` if your agent retrieved the full content of the primary file being fixed before generating the patch.

**Local mode vs. cloud mode:**

In **local mode** (default), `has_related_files` and `has_file_content` reflect what the agent actually fetched — penalties apply when it skipped context retrieval. In **cloud mode**, where file access is always pre-loaded or guaranteed, pass both as `True` unconditionally to avoid penalising the agent for something outside its control.
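
One possible way to collect these signals in an agent loop is a small tracker object. This is a sketch under stated assumptions: `DiagnosisSignals` and its methods are hypothetical, it assumes every recorded fetch returned the full file, and it assumes the tool's arguments are named exactly as in the signal table.

```python
from dataclasses import dataclass, field


@dataclass
class DiagnosisSignals:
    """Hypothetical tracker for the four behavioral signals."""

    primary_file: str
    cloud_mode: bool = False
    reflexion_count: int = 0
    fetched_paths: list[str] = field(default_factory=list)

    def record_fetch(self, path: str) -> None:
        # Call once per file-read tool call made during diagnosis.
        self.fetched_paths.append(path)

    def record_rejection(self) -> None:
        # Call each time a verify call comes back with accepted=False.
        self.reflexion_count += 1

    def to_arguments(self) -> dict:
        fetched = set(self.fetched_paths)
        return {
            "reflexion_count": self.reflexion_count,
            "fetch_call_count": len(self.fetched_paths),
            # In cloud mode both flags are passed as True unconditionally.
            "has_related_files": self.cloud_mode
            or bool(fetched - {self.primary_file}),
            "has_file_content": self.cloud_mode or self.primary_file in fetched,
        }
```

Keeping the cloud-mode override inside `to_arguments` means the agent loop records the same events in both modes and only the final payload differs.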

## 🚀 Quickstart

### 1. Install the Package
Install globally via pip so your MCP clients can execute it:

```bash
pip install agent-eval-mcp
```

### 2. Configure your MCP client

Add to `~/.cursor/mcp.json` or your Claude Desktop config:

```json
{
  "mcpServers": {
    "agent-eval": {
      "command": "agent-eval-mcp"
    }
  }
}
```

### 3. Use in your agentic workflow

**For Cursor sessions** — call `verify_fix` after every LLM-generated patch:
```
verify_fix(diagnosis="...", file_content="...", language="python")
```

**For CI/CD / GitHub Actions** — call `verify_unified_diff` with the PR diff:
```
verify_unified_diff(diff_text="...", file_content="...", language="java")
```

## 🖥️ CLI Usage (GitHub Actions / CI/CD)

The `agent-eval` command validates a unified diff directly from the terminal, making it a drop-in step for any CI/CD pipeline. It prints a JSON result to `stdout` and exits with `0` (pass) or `1` (fail) so GitHub Actions can block a Pull Request automatically.

### Install

```bash
pip install agent-eval-mcp
```

### Basic usage

```bash
# Produce a single-file diff and export the pre-change file, then validate
git diff HEAD -- src/mymodule.py > pr.patch
git show HEAD:src/mymodule.py > original_mymodule.py
agent-eval --diff-file pr.patch --source-file original_mymodule.py --language python
```

### With custom rejection patterns

Pass `--pattern` once per regex. Commas inside quantifiers (e.g. `{1,5}`) are safe because patterns are never comma-split.

```bash
agent-eval --diff-file pr.patch --source-file src/mymodule.py --language python \
  --pattern "print\(.*\)" \
  --pattern "TODO|FIXME"
```
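
To preview what a pattern would flag before wiring it into CI, the check can be approximated locally. This is a rough sketch, not the tool's code: `added_lines` and `find_violations` are hypothetical helpers, and matching with `re.search` against each added line is an assumption about how patterns are applied.

```python
import re


def added_lines(diff_text: str) -> list[str]:
    """Return added lines from a unified diff, without the leading '+'.

    The '+++ b/file' header is skipped. An added line whose own content
    starts with '++' would also be skipped; this sketch ignores that case.
    """
    return [
        line[1:]
        for line in diff_text.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]


def find_violations(diff_text: str, patterns: list[str]) -> list[tuple[str, str]]:
    """Return (pattern, line) pairs for every added line a pattern matches."""
    compiled = [re.compile(p) for p in patterns]
    return [
        (regex.pattern, line)
        for line in added_lines(diff_text)
        for regex in compiled
        if regex.search(line)
    ]
```

Because each pattern is a separate list element, a quantifier such as `{1,5}` is passed through intact, mirroring the one-regex-per-`--pattern` behaviour described above.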

### Example output

```json
{
  "accepted": false,
  "rejection_reason": "This diff introduces dummy or non-production patterns...",
  "stage_failed": "dummy_patterns"
}
```
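
A wrapper script can combine the exit code with the JSON fields shown above. The following is a minimal sketch: `run_agent_eval` and `summarize` are hypothetical helpers, and it assumes `stdout` contains only the JSON document and that `accepted`, `stage_failed`, and `rejection_reason` are the only keys of interest.

```python
import json
import subprocess


def run_agent_eval(diff_file: str, source_file: str, language: str = "python"):
    """Invoke the CLI and return (exit code, raw stdout)."""
    proc = subprocess.run(
        [
            "agent-eval",
            "--diff-file", diff_file,
            "--source-file", source_file,
            "--language", language,
        ],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout


def summarize(stdout_text: str, returncode: int) -> str:
    """Turn the CLI's JSON output and exit code into a one-line summary."""
    result = json.loads(stdout_text)
    if returncode == 0 and result.get("accepted"):
        return "accepted"
    stage = result.get("stage_failed", "unknown")
    reason = result.get("rejection_reason", "")
    return f"rejected at {stage}: {reason}"
```

Typical use would be `summarize(*reversed(run_agent_eval("pr.patch", "original.py")))`, or simply unpacking the tuple and passing the two values by name.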

### GitHub Actions example

```yaml
- name: Validate AI-generated diff
  run: |
    git diff HEAD -- "${{ env.CHANGED_FILE }}" > pr.patch
    git show "HEAD:${{ env.CHANGED_FILE }}" > original_source
    agent-eval \
      --diff-file pr.patch \
      --source-file original_source \
      --language java
```

### Arguments

| Argument | Required | Default | Description |
|---|---|---|---|
| `--diff-file` | Yes | — | Path to the unified diff file (output of `git diff`) |
| `--source-file` | Yes | — | Path to the original source file before the diff |
| `--language` | No | `python` | Language ruleset: `java`, `python`, `typescript`, `go` |
| `--pattern` | No | — | Regex to reject in added lines (repeatable) |

## GitHub Action (CI/CD)

`agent-eval-mcp` ships as a reusable **composite GitHub Action** that you can drop into any repository workflow to automatically block Pull Requests that introduce AI-generated dummy patterns, grounding failures, or no-op diffs.

### Usage

Reference the action in your workflow. Use `uses: ./` if the action lives in the same repository, or `uses: your-org/agent-eval-mcp@v1` when consuming it from a separate repo.

The `patterns` input is a newline-separated list — one regex per line. Commas inside quantifiers (e.g. `{1,5}`) are safe because patterns are never comma-split.

> **Production tip:** The example below validates a single hardcoded file. For real PRs that touch multiple files, combine this action with [`tj-actions/changed-files`](https://github.com/tj-actions/changed-files) and a matrix strategy to iterate over every modified file automatically.

```yaml
name: AI Diff Quality Check

on:
  pull_request:
    branches: [main]

jobs:
  validate-ai-diff:
    name: Validate AI-generated changes
    runs-on: ubuntu-latest

    steps:
      # fetch-depth: 0 is required for git show and git diff against origin/main.
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      # Extract the pre-PR version of the file from the base branch.
      - name: Export original source file from base branch
        run: git show origin/main:src/main.py > original_main.py

      # Produce a unified diff scoped to the single file under review.
      - name: Generate single-file unified diff
        run: git diff origin/main...HEAD -- src/main.py > patch.diff

      # Run agent-eval. Exit code 1 automatically fails the PR check.
      - name: Validate diff with agent-eval
        uses: ./   # or: uses: your-org/agent-eval-mcp@v1
        with:
          diff_file: patch.diff
          source_file: original_main.py
          language: python
          patterns: |
            print\(.*\)
            TODO|FIXME
```

### Action Inputs

| Input | Required | Default | Description |
|---|---|---|---|
| `diff_file` | Yes | — | Path to the unified diff file (single-file `git diff` output) |
| `source_file` | Yes | — | Path to the original source file before the diff |
| `language` | No | `python` | Language ruleset: `java`, `python`, `typescript`, `go` |
| `patterns` | No | — | Newline-separated regex patterns to reject in added lines |

## Telemetry & Dashboards

`agent-eval` can optionally stream every evaluation result to a Supabase PostgreSQL database, enabling a central dashboard for tracking AI code-quality trends across all repositories in your organization.

The feature is **completely opt-in** and **non-blocking**:
- If any of the required environment variables are absent, no data is sent and the CLI behaves identically.
- Telemetry is sent from a detached background process so it never adds latency to the CLI exit. Network failures and other errors are silently discarded — a broken telemetry path will never fail the pipeline or appear in CI output.

### Activation

Set the following environment variables in your CI/CD runner or GitHub Actions secret store:

| Variable | Required | Description |
|---|---|---|
| `SUPABASE_URL` | Yes | Base URL of your Supabase project (e.g. `https://xxxx.supabase.co`) |
| `SUPABASE_KEY` | Yes | Anon/public API key for your Supabase project |
| `SENTINEL_ORG_ID` | Yes | UUID of the organization that owns this pipeline |

The following variable is read automatically from the GitHub Actions environment and does not need to be configured manually:

| Variable | Default | Description |
|---|---|---|
| `GITHUB_REPOSITORY` | `local-dev` | Repository name (`owner/repo`), set automatically by GitHub Actions |

### Example — GitHub Actions

```yaml
- name: Validate diff with agent-eval
  env:
    SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
    SUPABASE_KEY: ${{ secrets.SUPABASE_KEY }}
    SENTINEL_ORG_ID: ${{ secrets.SENTINEL_ORG_ID }}
  run: |
    agent-eval --diff-file patch.diff --source-file original.py --language python
```

Each evaluation writes one row to the `evaluations` table with the fields: `org_id`, `repository`, `file_path`, `language`, `accepted`, `stage_failed`, `rejection_reason`, and `timestamp` (ISO-8601 UTC).
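
For dashboard or schema work, it can help to see the row shape and the opt-in check spelled out. This is a sketch only: `telemetry_enabled` and `build_row` are hypothetical names, treating an empty variable as absent is an assumption, and the actual transport to Supabase is deliberately left out.

```python
import os
from collections.abc import Mapping
from datetime import datetime, timezone

REQUIRED_ENV = ("SUPABASE_URL", "SUPABASE_KEY", "SENTINEL_ORG_ID")


def telemetry_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Telemetry is opt-in: all three variables must be set and non-empty."""
    return all(env.get(name) for name in REQUIRED_ENV)


def build_row(
    result: dict,
    file_path: str,
    language: str,
    env: Mapping[str, str] = os.environ,
) -> dict:
    """Assemble one `evaluations` row from an evaluation result."""
    return {
        "org_id": env["SENTINEL_ORG_ID"],
        # Falls back to the documented default outside GitHub Actions.
        "repository": env.get("GITHUB_REPOSITORY", "local-dev"),
        "file_path": file_path,
        "language": language,
        "accepted": result["accepted"],
        "stage_failed": result.get("stage_failed"),
        "rejection_reason": result.get("rejection_reason"),
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO-8601, UTC
    }
```

Passing `env` explicitly keeps both functions easy to exercise in tests without touching the real process environment.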