Metadata-Version: 2.4
Name: extract-bench
Version: 0.1.0
Summary: Evaluate structured extraction quality with per-field metrics
Author: Contextual AI
License-Expression: MIT
License-File: LICENSE
Requires-Python: >=3.11
Requires-Dist: jsonschema>=4.0
Requires-Dist: litellm>=1.0
Requires-Dist: loguru>=0.7
Requires-Dist: pydantic>=2.0
Provides-Extra: dev
Requires-Dist: google-cloud-aiplatform==1.135.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: python-dotenv>=1.0; extra == 'dev'
Description-Content-Type: text/markdown

# Extract Bench Evaluation Suite

A standalone Python package for evaluating structured extraction quality by comparing predicted JSON against gold JSON with per-field metrics.

## Installation

```bash
pip install -e .

# With dev dependencies
pip install -e ".[dev]"
```

## Quick Start

```python
import json
from pathlib import Path
from extract_bench import ReportBuilder, ReportConfig

# Load your data
schema = json.loads(Path("schema.json").read_text())
gold = json.loads(Path("gold.json").read_text())
extracted = json.loads(Path("model_output.json").read_text())

# Configure and build report
config = ReportConfig(
    output_dir=Path("./eval_results"),
    output_name="nvidia-10k-extract-gemini-flash",  # Identifies this experiment
)
builder = ReportBuilder(config)
report = builder.build(schema, gold, extracted)

# Save all outputs
output_path = builder.save(report)
print(f"Results saved to: {output_path}")
```

This creates `eval_results/nvidia-10k-extract-gemini-flash/` containing:

| File          | Purpose                                                  |
| ------------- | -------------------------------------------------------- |
| `report.json` | Machine-readable full report (for programmatic analysis) |
| `summary.txt` | Human-readable one-page summary (for quick inspection)   |
| `fields.csv`  | Per-field outcomes (for spreadsheet analysis)            |
| `fields.md`   | Markdown table (for documentation/sharing)               |

### Key Metrics in the Report

```python
# After building the report
print(f"Overall pass rate: {report.overall_pass_rate:.1%}")
print(f"Overall score: {report.overall_score:.3f}")
print(f"Fields evaluated: {report.outcomes.total_evaluated}")
print(f"Passed: {report.outcomes.total_passed}")
print(f"Failed: {report.outcomes.total_failed}")
```

## Batch Evaluation Example

To evaluate many model outputs against the same schema and gold file:

```python
import asyncio
import json
from pathlib import Path
from extract_bench import ReportBuilder, ReportConfig

async def evaluate_model_outputs(
    schema_path: Path,
    gold_path: Path,
    outputs_dir: Path,
    results_dir: Path,
):
    """Evaluate all model outputs in a directory."""
    schema = json.loads(schema_path.read_text())
    gold = json.loads(gold_path.read_text())

    results = []
    for output_file in outputs_dir.glob("*.json"):
        extracted = json.loads(output_file.read_text())

        config = ReportConfig(
            output_dir=results_dir,
            output_name=output_file.stem,  # Use filename as experiment ID
        )
        builder = ReportBuilder(config)
        report = await builder.build_async(schema, gold, extracted)
        builder.save(report)

        results.append({
            "model": output_file.stem,
            "pass_rate": report.overall_pass_rate,
            "score": report.overall_score,
        })

    return results

# Run batch evaluation
results = asyncio.run(evaluate_model_outputs(
    schema_path=Path("schema.json"),
    gold_path=Path("gold.json"),
    outputs_dir=Path("./model_outputs"),
    results_dir=Path("./eval_results"),
))

# Print comparison
for r in sorted(results, key=lambda x: -x["score"]):
    print(f"{r['model']}: {r['pass_rate']:.1%} pass, {r['score']:.3f} avg score")
```
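
The loop above awaits each evaluation before starting the next. If separate `ReportBuilder` instances can safely run concurrently (an assumption not confirmed here), the evaluations could be overlapped with a cap on how many are in flight. A minimal sketch of such a cap, using only `asyncio`; `map_bounded` is a hypothetical helper, not part of `extract_bench`:

```python
import asyncio
from collections.abc import Awaitable, Callable, Iterable
from typing import TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def map_bounded(
    worker: Callable[[T], Awaitable[R]],
    items: Iterable[T],
    limit: int = 4,
) -> list[R]:
    """Run worker(item) for every item with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: T) -> R:
        # Each call waits here until one of the `limit` slots is free
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(item) for item in items))
```

In the batch function, `worker` would be a coroutine that loads one output file, builds and saves its report, and returns the summary dict. The cap keeps the number of simultaneous LLM requests within provider rate limits.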

## ReportConfig Options

```python
config = ReportConfig(
    output_dir=Path("./outputs"),      # Where to save reports
    output_name="my-experiment",       # Subdirectory name (auto-generated if None)
    max_reasoning_length=200,          # Truncate LLM reasoning in outputs
    top_n_lowest_fields=5,             # Track N lowest-scoring fields
    save_json=True,                    # Generate report.json
    save_text=True,                    # Generate summary.txt
    save_csv=True,                     # Generate fields.csv
    save_markdown=True,                # Generate fields.md
)
```

## Understanding Report Output

### summary.txt Structure

```
================================================================================
                        EVALUATION REPORT: my-experiment
================================================================================

OVERALL RESULTS
---------------
Pass Rate: 85.2% (23/27 fields)
Average Score: 0.891

SCHEMA SHAPE
------------
Total nodes: 45
By type: object=12, string=18, number=8, array=5, boolean=2

COVERAGE
--------
Present in both: 25
Missing in extracted: 2
Spurious in extracted: 0

PASS/FAIL BY METRIC
-------------------
string_semantic: 15/18 passed (83.3%)
number_tolerance: 6/6 passed (100.0%)
integer_exact: 2/3 passed (66.7%)

LOWEST SCORING FIELDS
---------------------
1. borrower.address (0.45) - Partial match, missing suite number
2. terms.rate_type (0.60) - Semantic mismatch
...
```

### fields.csv Columns

| Column            | Description                                 |
| ----------------- | ------------------------------------------- |
| `path`            | Full JSONPath to the field                  |
| `normalized_path` | Human-readable path (e.g., `borrower.name`) |
| `metric_id`       | Metric used for evaluation                  |
| `score`           | Numeric score (0.0-1.0)                     |
| `passed`          | Boolean pass/fail                           |
| `gold_value`      | Expected value                              |
| `extracted_value` | Model's output value                        |
| `reasoning`       | LLM reasoning (for semantic metrics)        |
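
As a rough illustration of working with these columns, the sketch below lists failed fields from a `fields.csv` file, lowest score first. `failed_fields` is a hypothetical helper, and the way `passed` is serialized (assumed here to be text such as `True` or `true`) should be checked against a real file:

```python
import csv
from typing import TextIO


def failed_fields(csv_file: TextIO) -> list[dict[str, str]]:
    """Return rows whose `passed` column is not truthy, lowest score first."""
    rows = list(csv.DictReader(csv_file))
    # Assumption: booleans are written as text such as "True", "true", or "1"
    failed = [r for r in rows if r["passed"].strip().lower() not in {"true", "1"}]
    # Assumption: every row carries a numeric score
    return sorted(failed, key=lambda r: float(r["score"]))
```

To use it, open the file with `open(path, newline="")` and pass the handle to `failed_fields`; each returned row is a dict keyed by the column names above.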

## Low-Level API

For direct access to evaluation results without reporting:

```python
from extract_bench import StructuredEvaluator, StructuredEvaluatorConfig

# schema, gold, and extracted are dicts loaded as in Quick Start
evaluator = StructuredEvaluator(StructuredEvaluatorConfig(metrics=[]))
result = evaluator.evaluate(schema, gold, extracted)

# Raw results dict: path -> metric_id -> MetricResult
for path, metrics in result["results"].items():
    for metric_id, metric_result in metrics.items():
        print(f"{path} [{metric_id}]: passed={metric_result.passed}, score={metric_result.score}")
```

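When LLM-based metrics are involved, prefer `evaluate_async()`, the awaitable counterpart of `evaluate()`, so that LLM calls can overlap instead of running one at a time.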

## Configuration

### Environment Setup

LLM-based metrics use LiteLLM. Configure your provider:

```bash
# Vertex AI (Google Cloud)
gcloud auth application-default login

# OpenAI
export OPENAI_API_KEY=sk-...

# Or copy .env.example to .env
```

### LLM Model Configuration

The default model is `vertex_ai/gemini-2.5-flash`; set `DEFAULT_LLM_MODEL` in `.env` to change it.

To override the model for a single field, pass `model` in that metric's `params` in the schema:

```python
schema = {
    "type": "object",
    "properties": {
        "company": {
            "type": "string",
            "evaluation_config": {
                "metrics": [{"metric_id": "string_semantic", "params": {"model": "openai/gpt-4o-mini"}}]
            },
        }
    },
}
```

### Evaluation Presets

Set `evaluation_config` on a schema field to a preset name (a plain string) to choose how that field is scored:

| Preset                    | Description                                          |
| ------------------------- | ---------------------------------------------------- |
| `string_exact`            | Case-sensitive exact match                           |
| `string_fuzzy`            | Levenshtein similarity (case-insensitive by default) |
| `string_case_insensitive` | Case-insensitive match                               |
| `string_semantic`         | LLM-based semantic similarity (default for strings)  |
| `number_exact`            | Exact numeric equality                               |
| `number_tolerance`        | Match within tolerance (default for numbers)         |
| `integer_exact`           | Exact integer equality (default for integers)        |
| `boolean_exact`           | Exact boolean equality (default for booleans)        |
| `array_llm`               | LLM evaluation of entire array (default for arrays)  |
| `skip`                    | Skip evaluation for this node                        |

### Custom Metric Configuration

```python
schema = {
    "type": "object",
    "properties": {
        "price": {
            "type": "number",
            "evaluation_config": {
                "metrics": [{"metric_id": "number_tolerance", "params": {"tolerance": 0.01}}]
            }
        },
        "description": {
            "type": "string",
            "evaluation_config": "string_fuzzy"  # Use preset
        }
    }
}
```
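
Annotating a large schema by hand gets repetitive. The sketch below applies one preset to every node of a given JSON type that has no `evaluation_config` yet. `with_preset` is a hypothetical helper, not part of `extract_bench`, and whether the evaluator honors `evaluation_config` on array `items` is an assumption:

```python
import copy
from typing import Any


def with_preset(schema: dict[str, Any], json_type: str, preset: str) -> dict[str, Any]:
    """Return a copy of `schema` where nodes of `json_type` default to `preset`."""
    result = copy.deepcopy(schema)  # leave the caller's schema untouched

    def visit(node: Any) -> None:
        if not isinstance(node, dict):
            return
        if node.get("type") == json_type:
            # Existing per-field configuration takes priority
            node.setdefault("evaluation_config", preset)
        for child in node.get("properties", {}).values():
            visit(child)
        visit(node.get("items"))

    visit(result)
    return result
```

For example, `with_preset(schema, "string", "string_exact")` would switch every unconfigured string field from the default `string_semantic` to exact matching, which avoids LLM calls for those fields.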

## Available Metrics

| Category | Metric                    | Description                             |
| -------- | ------------------------- | --------------------------------------- |
| String   | `string_exact`            | Case-sensitive exact match              |
|          | `string_case_insensitive` | Case-insensitive match                  |
|          | `string_fuzzy`            | Levenshtein similarity                  |
|          | `string_semantic`         | LLM-based semantic comparison (default) |
| Number   | `number_exact`            | Exact numeric equality                  |
|          | `number_tolerance`        | Match within tolerance (default)        |
|          | `integer_exact`           | Exact integer equality                  |
| Boolean  | `boolean_exact`           | Exact boolean equality                  |
| Array    | `array_llm`               | LLM-based array comparison              |
| General  | `string_llm`              | LLM judge for any comparison            |
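
To give a feel for what a Levenshtein-based score measures, here is a minimal sketch of one common formulation: edit distance divided by the longer string's length, subtracted from 1. This is an illustration only; the normalization and the `case_sensitive` parameter name are assumptions, and the package's `string_fuzzy` metric may compute its score differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def fuzzy_similarity(gold: str, extracted: str, case_sensitive: bool = False) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    if not case_sensitive:
        gold, extracted = gold.lower(), extracted.lower()
    longest = max(len(gold), len(extracted))
    if longest == 0:
        return 1.0  # two empty strings match exactly
    return 1.0 - levenshtein(gold, extracted) / longest
```

Under this formulation a single substituted character in a three-character value costs a third of the score, so short fields are penalized more heavily per typo than long ones.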

## Custom Metrics

```python
from extract_bench import global_metric_registry
from extract_bench.evaluation.metrics import BaseMetric, MetricResult

class MyCustomMetric(BaseMetric):
    metric_id = "my_custom"

    async def evaluate(self, node, config=None):
        # Values to compare; this stub ignores them and always passes
        gold = node.get_gold_value()
        extracted = node.get_extracted_value()
        return MetricResult(
            metric_id=self.metric_id,
            score=1.0,
            passed=True,
            details={"custom": "data"}
        )

global_metric_registry.register_metric(MyCustomMetric)
```

## Architecture

```
extract_bench/
├── infra/                 # Schema AST (nodes, visitors)
├── evaluation/
│   ├── metrics/           # Metric implementations
│   └── reporting/         # Report generation (see reporting/README.md)
```

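Data flow: the schema is parsed into an AST, gold and extracted values are attached to its nodes, metrics are evaluated asynchronously in parallel, and the report is generated from the results.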
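
The flow can be pictured with a toy version that uses plain dicts in place of the AST. Everything in this sketch (`leaf_paths`, `lookup`, `exact_match`, `evaluate_sketch`) is hypothetical and far simpler than the package's node and visitor classes; it only shows the shape of the pipeline: walk the schema, pair up values, and await all metrics together.

```python
import asyncio
from typing import Any, Iterator


def leaf_paths(schema: dict[str, Any], prefix: str = "") -> Iterator[str]:
    """Yield a dotted path for every non-object node under `properties`."""
    for name, node in schema.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        if node.get("type") == "object":
            yield from leaf_paths(node, path)
        else:
            yield path


def lookup(data: Any, path: str) -> Any:
    """Follow a dotted path through nested dicts; None if a key is missing."""
    for key in path.split("."):
        if not isinstance(data, dict) or key not in data:
            return None
        data = data[key]
    return data


async def exact_match(path: str, gold: Any, extracted: Any) -> tuple[str, bool]:
    await asyncio.sleep(0)  # stand-in for an awaitable metric such as an LLM call
    return path, gold == extracted


async def evaluate_sketch(schema: dict, gold: dict, extracted: dict) -> dict[str, bool]:
    """Evaluate every leaf concurrently and collect path -> passed."""
    tasks = [
        exact_match(path, lookup(gold, path), lookup(extracted, path))
        for path in leaf_paths(schema)
    ]
    return dict(await asyncio.gather(*tasks))
```

Running all metric coroutines through a single `asyncio.gather` is what lets slow, I/O-bound metrics overlap; a field missing from the extracted data simply compares as `None` here.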

## Development

```bash
pip install -e ".[dev]"
pytest tests/ -v
```
