Metadata-Version: 2.4
Name: inspect-mlflow
Version: 0.7.0
Summary: MLflow integration for Inspect AI: experiment tracking, execution tracing, evaluation comparison, and Scout analysis
Project-URL: Homepage, https://github.com/debu-sinha/inspect-mlflow
Project-URL: Issues, https://github.com/debu-sinha/inspect-mlflow/issues
Project-URL: Repository, https://github.com/debu-sinha/inspect-mlflow
Author-email: Debu Sinha <debusinha2009@gmail.com>
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: inspect-ai>=0.3.180
Requires-Dist: mlflow<4.0,>=3.0
Requires-Dist: numpy>=1.24
Provides-Extra: config
Requires-Dist: pydantic-settings>=2.0; extra == 'config'
Requires-Dist: pydantic>=2.0; extra == 'config'
Provides-Extra: scout
Requires-Dist: inspect-scout>=0.1.0; extra == 'scout'
Description-Content-Type: text/markdown

# inspect-mlflow

![logo](https://raw.githubusercontent.com/debu-sinha/inspect-mlflow/main/docs/images/logo.png)

[![CI](https://github.com/debu-sinha/inspect-mlflow/actions/workflows/ci.yml/badge.svg)](https://github.com/debu-sinha/inspect-mlflow/actions/workflows/ci.yml)
[![CodeQL](https://github.com/debu-sinha/inspect-mlflow/actions/workflows/codeql.yml/badge.svg)](https://github.com/debu-sinha/inspect-mlflow/actions/workflows/codeql.yml)
[![PyPI version](https://img.shields.io/pypi/v/inspect-mlflow.svg)](https://pypi.org/project/inspect-mlflow/)
[![Downloads](https://img.shields.io/pypi/dm/inspect-mlflow.svg)](https://pypi.org/project/inspect-mlflow/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
[![Docs](https://readthedocs.org/projects/inspect-mlflow/badge/?version=latest)](https://inspect-mlflow.readthedocs.io/)
[![GitHub stars](https://img.shields.io/github/stars/debu-sinha/inspect-mlflow?style=social)](https://github.com/debu-sinha/inspect-mlflow)

MLflow integration for [Inspect AI](https://inspect.aisi.org.uk/). Provides experiment tracking, execution tracing, artifact logging, and evaluation comparison for Inspect AI evaluations.

## Install

```bash
pip install inspect-mlflow
```

## Quick Start

Hooks auto-register via entry points when the package is installed. No code changes needed.

```bash
# Start MLflow server
mlflow server --port 5000

# Set env vars
export MLFLOW_TRACKING_URI="http://localhost:5000"
export MLFLOW_INSPECT_TRACING="true"

# Run evals as usual. Both hooks activate automatically.
inspect eval my_task.py --model openai/gpt-4o
```

Then open http://localhost:5000 to see runs and traces.

## What it does

This package provides two hooks that run automatically during Inspect AI evaluations. Both hooks use the `MlflowClient` API for full isolation from user MLflow state (no global `mlflow.start_run` calls), and both are thread-safe for concurrent sample processing.

### Tracking Hook

Activated when `MLFLOW_TRACKING_URI` is set. Creates hierarchical MLflow runs with full evaluation telemetry.

**What gets logged:**

- Parent run per eval invocation with nested child runs per task
- Task configuration as parameters (model, dataset, solver, temperature, top_p, max_tokens)
- Per-sample scores as step metrics (accuracy, timing per sample)
- Aggregate metrics (total_samples, completed_samples, match/accuracy, match/stderr)
- Model token usage (input/output/total tokens per model)
- Real-time event counting (total_model_calls, total_tool_calls)
- Eval artifacts: per-sample results JSON + full eval log JSON
- Additional rich table artifacts for analysis (`inspect/*.json`)
- Trace assessments: eval scores logged as MLflow assessments via `mlflow.log_feedback()`, visible in the Traces UI assessment column
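
The parent/child layout above can be read back with the standard `MlflowClient` API. The following is a minimal sketch, not part of this package: it assumes the nesting is recorded with MLflow's usual `mlflow.parentRunId` tag and that the default `inspect_ai` experiment name is in use. The helper name `group_task_runs` is hypothetical, and pagination beyond the first page of `search_runs` results is not handled.

```python
from collections import defaultdict

# MLflow's standard tag for linking a nested run to its parent.
# Assumption: inspect-mlflow's child task runs carry this tag.
PARENT_TAG = "mlflow.parentRunId"


def group_task_runs(client, experiment_name="inspect_ai"):
    """Map each parent run ID to the metrics of its child task runs.

    `client` is expected to behave like `mlflow.tracking.MlflowClient`.
    """
    experiment = client.get_experiment_by_name(experiment_name)
    if experiment is None:
        return {}
    children = defaultdict(list)
    for run in client.search_runs(experiment_ids=[experiment.experiment_id]):
        parent_id = run.data.tags.get(PARENT_TAG)
        if parent_id is not None:  # runs without the tag are parents
            children[parent_id].append(
                {"run_id": run.info.run_id, "metrics": dict(run.data.metrics)}
            )
    return dict(children)
```

With a server running as in Quick Start, this would be called as `group_task_runs(MlflowClient(tracking_uri="http://localhost:5000"))` after `from mlflow.tracking import MlflowClient`.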

**Task run with 15 metrics, parameters, and parent run link:**

![Task run detail](https://raw.githubusercontent.com/debu-sinha/inspect-mlflow/main/docs/images/screenshot-01-task-run.png)

**Traces table with assessment column showing per-sample match scores:**

![Traces list](https://raw.githubusercontent.com/debu-sinha/inspect-mlflow/main/docs/images/screenshot-03-traces-list.png)

**Artifact tables (inspect/) with structured eval data:**

![Artifacts](https://raw.githubusercontent.com/debu-sinha/inspect-mlflow/main/docs/images/screenshot-02-artifacts-expanded.png)

### Tracing Hook

Activated when `MLFLOW_INSPECT_TRACING=true` is set in addition to `MLFLOW_TRACKING_URI`. Maps eval execution to MLflow trace spans, giving you a visual debugging view of every model call, tool invocation, and scoring step.

**Span hierarchy:**

```
eval_run:98h4b4KN (CHAIN)
  task:task (CHAIN)
    sample:keAdeL1U (CHAIN)
      solvers (from SpanBeginEvent)
        use_tools (solver span)
          model:openai/gpt-4o-mini (LLM) - 5,167 tokens
          tool:calculator (TOOL) - args: {"expression": "47 * 89"}, result: "4183"
          model:openai/gpt-4o-mini (LLM) - 5,263 tokens
        generate (solver span)
          model:openai/gpt-4o-mini (LLM) - 182 tokens
      scorers (from SpanBeginEvent)
        match (scorer span)
          score (EVALUATOR) - value: C
    sample:HWl2wp2B (CHAIN)
      ...
```

**Each span type captures different data:**

| Span Type | Data Captured |
|-----------|------|
| CHAIN | eval run, task, and sample lifecycle with scores and timing |
| LLM | model name, input/output token counts, temperature, cache status, response text |
| TOOL | function name, arguments, result, working time, errors |
| EVALUATOR | score value, explanation, target |
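
To make the hierarchy concrete, here is a small sketch that models the first sample from the tree above and totals the tokens of its LLM spans. The `Span` dataclass is a hypothetical stand-in, not an MLflow or inspect-mlflow class, and the `CHAIN` type given to the solver and scorer spans is an assumption, since the tree does not state their type.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical stand-in for a trace span; not an MLflow class."""

    name: str
    span_type: str  # "CHAIN", "LLM", "TOOL", or "EVALUATOR"
    tokens: int = 0
    children: list["Span"] = field(default_factory=list)


def total_llm_tokens(span: Span) -> int:
    """Sum token counts over every LLM span in the subtree."""
    own = span.tokens if span.span_type == "LLM" else 0
    return own + sum(total_llm_tokens(child) for child in span.children)


# Mirrors sample:keAdeL1U from the hierarchy above.
# Solver/scorer span types are assumed to be CHAIN for illustration.
sample = Span("sample:keAdeL1U", "CHAIN", children=[
    Span("use_tools", "CHAIN", children=[
        Span("model:openai/gpt-4o-mini", "LLM", tokens=5167),
        Span("tool:calculator", "TOOL"),
        Span("model:openai/gpt-4o-mini", "LLM", tokens=5263),
    ]),
    Span("generate", "CHAIN", children=[
        Span("model:openai/gpt-4o-mini", "LLM", tokens=182),
    ]),
    Span("match", "CHAIN", children=[Span("score", "EVALUATOR")]),
])

print(total_llm_tokens(sample))  # 10612
```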

**Full span tree with solver/scorer hierarchy and assessments panel:**

![Span tree](https://raw.githubusercontent.com/debu-sinha/inspect-mlflow/main/docs/images/screenshot-04-span-tree.png)

**LLM span detail with model name, token counts, and response:**

![LLM detail](https://raw.githubusercontent.com/debu-sinha/inspect-mlflow/main/docs/images/screenshot-05-llm-detail.png)

### Autolog

Autolog enables MLflow provider integrations at run start.
Supported providers are: `openai`, `anthropic`, `langchain`, `litellm`,
`mistral`, `groq`, `cohere`, `gemini`, `bedrock`.
Each provider is enabled only when both the MLflow flavor module and provider SDK are installed.
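
The "both modules installed" rule can be sketched with the standard library alone. This is an illustration of the idea rather than the package's actual check: the helper names and the provider-to-SDK mapping below are assumptions, and only three providers are shown.

```python
from importlib.util import find_spec

# Hypothetical provider -> SDK module mapping, for illustration only.
SDK_MODULES = {"openai": "openai", "anthropic": "anthropic", "litellm": "litellm"}


def module_available(name: str) -> bool:
    """Return True if `name` resolves to an importable module."""
    try:
        return find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec raises ModuleNotFoundError when a parent package is missing.
        return False


def enabled_providers(requested: list[str]) -> list[str]:
    """Keep providers whose MLflow flavor and SDK are both importable."""
    return [
        provider
        for provider in requested
        if provider in SDK_MODULES
        and module_available(f"mlflow.{provider}")
        and module_available(SDK_MODULES[provider])
    ]
```

A provider that fails either check is skipped, so requesting one whose SDK is absent leaves the rest unaffected.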

### Artifact Tables

When artifact logging is enabled (`INSPECT_MLFLOW_LOG_ARTIFACTS=true` or
`MLFLOW_INSPECT_LOG_ARTIFACTS=true`), the tracking hook logs the following artifacts:

- `inspect/tasks.json`
- `inspect/samples.json`
- `inspect/messages.json`
- `inspect/sample_scores.json`
- `inspect/events.json`
- `inspect/model_usage.json`
- `sample_results/*.json`
- `eval_logs/*.json`
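
Once a run's artifacts have been downloaded to a local directory (for example with `mlflow.artifacts.download_artifacts`), the tables can be loaded with plain `json`. The sketch below assumes only that each file is a valid JSON document and makes no assumption about its schema. `load_tables` is a hypothetical helper, not part of this package.

```python
import json
from pathlib import Path

# Table artifacts listed above, relative to the run's artifact root.
EXPECTED_TABLES = [
    "inspect/tasks.json",
    "inspect/samples.json",
    "inspect/messages.json",
    "inspect/sample_scores.json",
    "inspect/events.json",
    "inspect/model_usage.json",
]


def load_tables(artifact_root):
    """Load whichever table artifacts exist under a local artifact directory.

    Returns (tables, missing): parsed JSON keyed by relative path, and the
    relative paths that were not found.
    """
    root = Path(artifact_root)
    tables, missing = {}, []
    for rel in EXPECTED_TABLES:
        path = root / rel
        if path.is_file():
            tables[rel] = json.loads(path.read_text(encoding="utf-8"))
        else:
            missing.append(rel)
    return tables, missing
```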

### Evaluation Comparison

Compare results from two evaluation runs to detect score regressions with statistical significance testing.

```python
from inspect_mlflow.comparison import compare_evals

result = compare_evals("logs/baseline.eval", "logs/candidate.eval")
print(result.summary())

# Per-sample regressions
for r in result.regressions:
    print(f"Sample {r.id}: {r.baseline_score} -> {r.candidate_score}")
```

Output:

```
Baseline:  openai/gpt-4o-mini (math_task)
Candidate: openai/gpt-4o-mini (math_task)
Samples:   5 aligned, 0 missing, 0 new

  Metric            Baseline  Candidate             Delta        Sig.
  -------------------------------------------------------------------
  match/accuracy      0.6000     0.4000   -0.2000 (-33.3%)  p=0.048*
  Effect size (match/accuracy): Cohen's d = -0.73 (medium effect)

Regressions: 2, Improvements: 1, Unchanged: 2
Candidate won on 1 of 5 samples (20.0%)
```

The comparison module aligns samples by (id, epoch), automatically selects the right significance test (McNemar's for binary scores, bootstrap CI for continuous), and computes Cohen's d effect size. No scipy required.
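
For readers unfamiliar with McNemar's test, here is a minimal scipy-free sketch of the two-sided exact (binomial) variant for paired binary scores. It is an illustration only: the package may use a different variant or correction, so this function is not expected to reproduce the p-values that `compare_evals` reports. The name `mcnemar_exact` and the sample data are hypothetical.

```python
from math import comb


def mcnemar_exact(baseline: list[int], candidate: list[int]) -> float:
    """Two-sided exact McNemar p-value for paired binary scores (0 or 1)."""
    if len(baseline) != len(candidate):
        raise ValueError("score lists must be aligned and equal in length")
    # Only discordant pairs carry information about a difference.
    regressions = sum(b == 1 and c == 0 for b, c in zip(baseline, candidate))
    improvements = sum(b == 0 and c == 1 for b, c in zip(baseline, candidate))
    n = regressions + improvements
    if n == 0:
        return 1.0
    k = min(regressions, improvements)
    # Under H0 each discordant pair is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Made-up data: 8 regressions, 1 improvement, 3 unchanged samples.
baseline = [1] * 8 + [0] * 1 + [1] * 3
candidate = [0] * 8 + [1] * 1 + [1] * 3
print(round(mcnemar_exact(baseline, candidate), 4))  # 0.0391
```

Samples where both runs agree do not affect the result, which is why the test suits paired eval runs over the same dataset.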

**Features:**
- Sample alignment with string/int ID normalization and multi-epoch support
- Regression threshold to filter noise (`regression_threshold=0.05`)
- Sample filtering (`sample_filter=lambda s: s.id in subset`)
- Win rate tracking across aligned samples
- Works with file paths or `EvalLog` objects directly
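
The alignment and threshold features can be pictured with a short sketch. It is not the package's implementation: `align_samples` is a hypothetical helper that takes plain `{(id, epoch): score}` mappings, and it assumes that a score change counts as a regression or improvement only when its magnitude exceeds `regression_threshold`.

```python
def align_samples(baseline, candidate, regression_threshold=0.0):
    """Align two {(id, epoch): score} mappings and classify each pair."""

    def normalize(scores):
        # Treat ID 1 and ID "1" as the same sample.
        return {(str(sid), epoch): score for (sid, epoch), score in scores.items()}

    base, cand = normalize(baseline), normalize(candidate)
    result = {
        "regressions": [],
        "improvements": [],
        "unchanged": [],
        "missing": sorted(base.keys() - cand.keys()),  # only in baseline
        "new": sorted(cand.keys() - base.keys()),  # only in candidate
    }
    for key in sorted(base.keys() & cand.keys()):
        delta = cand[key] - base[key]
        if delta < -regression_threshold:
            result["regressions"].append(key)
        elif delta > regression_threshold:
            result["improvements"].append(key)
        else:
            result["unchanged"].append(key)  # change is within the noise band
    return result
```

Keying on the `(id, epoch)` pair rather than the ID alone keeps repeated epochs of the same sample from being compared against each other.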

## Configuration

Configuration is loaded from environment variables. When `pydantic-settings` is installed (`pip install inspect-mlflow[config]`), settings are typed and validated with the `INSPECT_MLFLOW_` prefix. Without it, standard `os.getenv()` is used.

| Env var | Required | Default | Description |
|---------|----------|---------|-------------|
| `MLFLOW_TRACKING_URI` | Yes | - | MLflow server URL |
| `MLFLOW_EXPERIMENT_NAME` | No | `inspect_ai` | Experiment name |
| `MLFLOW_INSPECT_TRACING` | No | `false` | Enable execution tracing |
| `MLFLOW_INSPECT_LOG_ARTIFACTS` | No | `true` | Log eval artifacts |
| `INSPECT_MLFLOW_LOG_ARTIFACTS` | No | `true` | Same as above (new prefix, takes priority) |
| `INSPECT_MLFLOW_AUTOLOG_ENABLED` | No | `true` | Enable MLflow provider autolog integrations |
| `INSPECT_MLFLOW_AUTOLOG_MODELS` | No | `openai,anthropic,langchain,litellm` | CSV or JSON array of providers to autolog |
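
The table implies two parsing rules: the `INSPECT_MLFLOW_` name wins when both artifact variables are set, and the provider list accepts either CSV or a JSON array. A minimal sketch of how such a fallback could look with the standard library is below. The helper names and the set of accepted truthy strings are assumptions, not the package's actual code.

```python
import json
import os

# Assumption: which strings count as "true" is not specified in this README.
TRUTHY = {"1", "true", "yes", "on"}


def env_bool(names, default, env=None):
    """Return the first variable in `names` that is set, parsed as a boolean."""
    env = os.environ if env is None else env
    for name in names:  # earlier names take priority
        if name in env:
            return env[name].strip().lower() in TRUTHY
    return default


def env_list(name, default, env=None):
    """Parse a CSV string or a JSON array into a list of names."""
    env = os.environ if env is None else env
    raw = env.get(name, "").strip()
    if not raw:
        return list(default)
    if raw.startswith("["):
        return [str(item) for item in json.loads(raw)]
    return [part.strip() for part in raw.split(",") if part.strip()]


# Example with an explicit mapping instead of the real environment.
env = {
    "MLFLOW_INSPECT_LOG_ARTIFACTS": "true",
    "INSPECT_MLFLOW_LOG_ARTIFACTS": "false",
    "INSPECT_MLFLOW_AUTOLOG_MODELS": '["openai", "groq"]',
}
log_artifacts = env_bool(
    ["INSPECT_MLFLOW_LOG_ARTIFACTS", "MLFLOW_INSPECT_LOG_ARTIFACTS"], True, env
)
models = env_list("INSPECT_MLFLOW_AUTOLOG_MODELS", ["openai"], env)
print(log_artifacts, models)  # False ['openai', 'groq']
```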

## Examples

### Basic eval (tracking + tracing)

```python
from inspect_ai import Task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

# No special imports needed. Hooks auto-register on install.

task = Task(
    dataset=[
        Sample(input="What is 2 + 2?", target="4"),
        Sample(input="What is 3 * 5?", target="15"),
        Sample(input="What is 10 - 7?", target="3"),
    ],
    solver=generate(),
    scorer=match(),
)

logs = eval(task, model="openai/gpt-4o-mini")
# MLflow now has: runs with metrics + traces with span tree
```

### Eval with tool calls

```python
import builtins

from inspect_ai import Task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import tool


@tool
def calculator():
    """Perform arithmetic calculations."""

    async def run(expression: str) -> str:
        """Evaluate a math expression.

        Args:
            expression: A math expression to evaluate, e.g. "47 * 89"
        """
        # `eval` above is inspect_ai.eval, so call the builtin explicitly.
        # Emptying __builtins__ is not a sandbox; acceptable for a demo only.
        allowed = {"__builtins__": {}}
        return str(builtins.eval(expression, allowed))

    return run


task = Task(
    dataset=[
        Sample(
            input="Use the calculator to compute 47 * 89.",
            target="4183",
        ),
        Sample(
            input="Use the calculator to compute 1024 / 16.",
            target="64",
        ),
    ],
    solver=[use_tools([calculator()]), generate()],
    scorer=match(),
)

logs = eval(task, model="openai/gpt-4o-mini")
# Traces now include TOOL spans for each calculator() call
# with function name, arguments, and result
```

## Development

```bash
git clone https://github.com/debu-sinha/inspect-mlflow.git
cd inspect-mlflow
uv sync --group dev
uv run pre-commit install
uv run pytest tests/ -v
```

See [CONTRIBUTING.md](https://github.com/debu-sinha/inspect-mlflow/blob/main/CONTRIBUTING.md) for integration testing and PR guidelines.

## Related

- [Documentation](https://inspect-mlflow.readthedocs.io/) - Full API reference and usage guide
- [Inspect AI](https://inspect.aisi.org.uk/) - AI evaluation framework by the UK AI Security Institute
- [MLflow](https://mlflow.org/) - ML experiment tracking and model management
- [Inspect AI hooks docs](https://inspect.aisi.org.uk/extensions.html#sec-hooks) - How hooks work
- [Issue #3547](https://github.com/UKGovernmentBEIS/inspect_ai/issues/3547) - Original proposal
- [Vector Institute inspect-mlflow](https://github.com/VectorInstitute/inspect-mlflow) - Related extension whose features are being consolidated here

## Contributors

- **Debu Sinha** - Creator and maintainer
- **Vector Institute / National Research Council of Canada (NRC)** - Autolog provider support, contributed on behalf of the Canadian AI Safety Institute (CAISI). Consolidated from [VectorInstitute/inspect-mlflow](https://github.com/VectorInstitute/inspect-mlflow).
