Metadata-Version: 2.4
Name: llmdebug
Version: 2.24.0
Summary: Structured debug snapshots for LLM-assisted debugging
Project-URL: Homepage, https://github.com/NicolasSchuler/llmdebug
Project-URL: Repository, https://github.com/NicolasSchuler/llmdebug
Author-email: Nicolas Schuler <schuler.nicolas@proton.me>
License: MIT
License-File: LICENSE
Keywords: crash-reporting,debugging,llm,pytest
Classifier: Development Status :: 4 - Beta
Classifier: Framework :: Pytest
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Debuggers
Requires-Python: >=3.10
Requires-Dist: filelock>=3.0
Requires-Dist: orjson>=3.10.0
Provides-Extra: cli
Requires-Dist: click>=8.0; extra == 'cli'
Requires-Dist: rich>=13.0; extra == 'cli'
Provides-Extra: dev
Requires-Dist: bandit>=1.8.0; extra == 'dev'
Requires-Dist: click>=8.0; extra == 'dev'
Requires-Dist: deptry>=0.22.0; extra == 'dev'
Requires-Dist: diff-cover>=9.2; extra == 'dev'
Requires-Dist: httpx[http2]>=0.27.0; extra == 'dev'
Requires-Dist: import-linter>=2.0; extra == 'dev'
Requires-Dist: ipython>=8.0; extra == 'dev'
Requires-Dist: mcp>=1.0; extra == 'dev'
Requires-Dist: mutmut>=3.2; extra == 'dev'
Requires-Dist: numpy>=1.20; extra == 'dev'
Requires-Dist: pip-audit>=2.9.0; extra == 'dev'
Requires-Dist: polars>=1.12.0; extra == 'dev'
Requires-Dist: pyright>=1.1.390; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.25.0; extra == 'dev'
Requires-Dist: pytest-benchmark>=4.0; extra == 'dev'
Requires-Dist: pytest-cov>=6.0; extra == 'dev'
Requires-Dist: pytest>=9.0; extra == 'dev'
Requires-Dist: python-semantic-release>=9.0; extra == 'dev'
Requires-Dist: radon>=6.0; extra == 'dev'
Requires-Dist: rich>=13.0; extra == 'dev'
Requires-Dist: ruff>=0.12.0; extra == 'dev'
Requires-Dist: scikit-learn>=1.4.0; extra == 'dev'
Requires-Dist: scipy>=1.13; extra == 'dev'
Requires-Dist: toons>=0.1; extra == 'dev'
Requires-Dist: vulture>=2.14; extra == 'dev'
Requires-Dist: xenon>=0.9.3; extra == 'dev'
Provides-Extra: evals
Requires-Dist: datasets>=2.0; extra == 'evals'
Requires-Dist: httpx[http2]>=0.27.0; extra == 'evals'
Requires-Dist: polars>=1.12.0; extra == 'evals'
Requires-Dist: scikit-learn>=1.4.0; extra == 'evals'
Requires-Dist: scipy>=1.13; extra == 'evals'
Requires-Dist: swebench==4.1.0; extra == 'evals'
Requires-Dist: testcontainers>=4.13.2; extra == 'evals'
Provides-Extra: jupyter
Requires-Dist: ipython>=8.0; extra == 'jupyter'
Provides-Extra: mcp
Requires-Dist: mcp>=1.0; extra == 'mcp'
Provides-Extra: toon
Requires-Dist: toons>=0.1; extra == 'toon'
Description-Content-Type: text/markdown

<p align="center">
  <img src="logo/bird.png" alt="llmdebug logo" width="200">
</p>

# llmdebug

Structured debug snapshots for LLM-assisted debugging.

`llmdebug` captures failure-time evidence as a local artifact: exception details,
prioritized stack frames, local variables, and execution context in a
machine-readable format that works well for both humans and coding agents.
The goal is to make the first failing run useful, rather than reconstructing
state after the fact.

This README documents shipped behavior and operational defaults.
Research context and forward-looking priorities live in
[`docs/research-improvement-roadmap.md`](docs/research-improvement-roadmap.md).

## Why?

LLM-assisted debugging works best when the failing run already preserves the
relevant runtime state.

Without that evidence, a typical loop looks like:
```
fail → infer missing state → guess patch → rerun → repeat
```

With `llmdebug`, the loop becomes:
```
fail → read snapshot → ranked hypotheses → minimal patch → verify
```

The design assumption is simple: baseline instrumentation should already be on,
so the first failure emits an artifact that can support diagnosis,
reproduction, and comparison.

## Installation

```bash
pip install llmdebug          # Core library + pytest plugin
pip install llmdebug[cli]     # CLI for viewing snapshots
pip install llmdebug[mcp]     # MCP server for IDE integration (Claude Code, etc.)
pip install llmdebug[jupyter]  # Jupyter/IPython integration
pip install llmdebug[toon]    # TOON output format for maximum token savings
pip install llmdebug[evals]   # Eval harness + analysis/report dependencies
```

## Quick Start

The fastest path is the pytest plugin. The other entry points use the same
capture model when you need finer control or a different inspection surface.

### Pytest (automatic - recommended)

After installation, failing tests automatically create
`.llmdebug/latest.json`.

```bash
pytest  # Failures create .llmdebug/latest.json
```

### Decorator

```python
from llmdebug import debug_snapshot

@debug_snapshot()
def main():
    data = load_data()
    process(data)

if __name__ == "__main__":
    main()
```

### Context Manager

For targeted instrumentation when you need more detail:

```python
from llmdebug import snapshot_section

with snapshot_section("data_processing"):
    result = transform(data)
```

### Snapshot Privacy Defaults

`debug_snapshot()` and `snapshot_section()` keep their legacy redaction defaults for
backward compatibility. If you pass no redaction settings, llmdebug emits a `UserWarning`
prompting you to opt into a safer profile explicitly.

Recommended:

```python
from llmdebug import debug_snapshot, snapshot_section

@debug_snapshot(redaction_profile="ci")  # safer dev/CI default
def run_job():
    ...

with snapshot_section("checkout", redaction_profile="prod"):  # stricter production profile
    ...
```

### Jupyter / IPython

Automatic snapshot capture in notebooks with rich HTML display:

```python
# In a notebook cell:
%load_ext llmdebug

# Or programmatically:
import llmdebug
llmdebug.load_jupyter()
```

After any cell error, a compact banner shows the exception, crash location, and hints. Use magic commands for deeper analysis:

```python
%llmdebug              # Show full snapshot with locals and context
%llmdebug hypothesize  # Generate ranked debugging hypotheses
%llmdebug diff         # Compare latest vs previous snapshot
%llmdebug list         # List recent snapshots
%llmdebug config       # Show active configuration
```

Requires the `jupyter` extra: `pip install llmdebug[jupyter]`

### Production Hooks

Capture unhandled exceptions automatically in production applications:

```python
import llmdebug

llmdebug.install_hooks(out_dir=".llmdebug")

# Any unhandled exception, thread crash, or unraisable exception
# will now produce a snapshot automatically.

# Optional: uninstall when done
llmdebug.uninstall_hooks()
```

Hooks install into `sys.excepthook`, `threading.excepthook`, and `sys.unraisablehook`. They include rate limiting (default: 10 captures/min) and automatic PII redaction.
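
To make the mechanism concrete, here is a minimal stdlib-only sketch of a chained `sys.excepthook` guarded by a sliding-window rate limiter. It illustrates the general pattern only and is not llmdebug's implementation; `RateLimiter`, `make_hook`, and the `captured` list are hypothetical names introduced for this example.

```python
import sys
import time
from collections import deque


class RateLimiter:
    """Allow at most `limit` events per `window` seconds (sliding window)."""

    def __init__(self, limit=10, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._events = deque()

    def allow(self):
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._events and now - self._events[0] >= self.window:
            self._events.popleft()
        if len(self._events) >= self.limit:
            return False
        self._events.append(now)
        return True


def make_hook(previous_hook, limiter, captured):
    """Build an excepthook that records the exception, then delegates."""

    def hook(exc_type, exc, tb):
        if limiter.allow():
            captured.append({"type": exc_type.__name__, "message": str(exc)})
        previous_hook(exc_type, exc, tb)  # preserve the original behavior

    return hook


captured = []
limiter = RateLimiter(limit=10, window=60.0)
hook = make_hook(sys.__excepthook__, limiter, captured)
# Install with:  sys.excepthook = hook
# Restore with:  sys.excepthook = sys.__excepthook__
```

Delegating to the previous hook keeps the normal traceback output intact, and the limiter bounds how many artifacts a crash loop can produce.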

### Web Middleware

Zero-config crash capture for web frameworks:

```python
# Flask
from llmdebug import LLMDebugWSGIMiddleware
app.wsgi_app = LLMDebugWSGIMiddleware(app.wsgi_app)

# FastAPI
from llmdebug import LLMDebugASGIMiddleware
app.add_middleware(LLMDebugASGIMiddleware)

# Django WSGI
from llmdebug import LLMDebugWSGIMiddleware
application = LLMDebugWSGIMiddleware(application)
```

Middleware captures request context (method, path, query string) alongside the crash snapshot, with automatic PII redaction on query parameters.
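
As a rough illustration of the pattern (not the library's actual middleware), the sketch below wraps a WSGI app, records method, path, and a redacted query string when the app raises, and then re-raises. `capture_middleware`, `redact_query`, and `SENSITIVE_KEYS` are hypothetical names, and the list of sensitive keys is an assumption.

```python
from urllib.parse import parse_qsl, urlencode

SENSITIVE_KEYS = {"token", "password", "api_key"}  # hypothetical list


def redact_query(query_string):
    """Replace values of sensitive query parameters with a placeholder."""
    pairs = parse_qsl(query_string, keep_blank_values=True)
    return urlencode(
        [(k, "REDACTED" if k.lower() in SENSITIVE_KEYS else v) for k, v in pairs]
    )


def capture_middleware(app, sink):
    """Wrap a WSGI app; on an unhandled exception, record request context."""

    def wrapped(environ, start_response):
        try:
            return app(environ, start_response)
        except Exception as exc:
            sink.append({
                "method": environ.get("REQUEST_METHOD", ""),
                "path": environ.get("PATH_INFO", ""),
                "query_string": redact_query(environ.get("QUERY_STRING", "")),
                "exception": type(exc).__name__,
            })
            raise  # let the server or framework handle the error

    return wrapped
```

This simplified version only sees exceptions raised while the app callable runs; errors raised later, while the response iterable is consumed, would need additional handling.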

## CLI

The CLI is the main local inspection surface for reviewing snapshots,
controlling output scope, and comparing runs:

```bash
llmdebug                          # Show latest snapshot (crash-level detail)
llmdebug show --detail full       # Show all stack frames
llmdebug show --detail context    # Everything including repro, git, env
llmdebug show --json              # Output raw expanded JSON
llmdebug show --raw-session       # Output raw DebugSession envelope JSON
llmdebug list                     # List recent snapshots
llmdebug frames -i 0              # Inspect a specific frame
llmdebug diff                     # Compare latest vs previous snapshot
llmdebug git-context              # On-demand enhanced git metadata
llmdebug git-context --json       # Enhanced git metadata as JSON
llmdebug hypothesize              # Auto-generate debugging hypotheses
llmdebug clean -k 5               # Keep only 5 most recent snapshots
```

All commands accept `--dir <path>` to point at a custom snapshot directory.

Requires the `cli` extra: `pip install llmdebug[cli]`

### Detail Levels

The `show` command defaults to **crash** level to keep evidence compact.
Expand to `full` or `context` only when the extra state is needed:

| Level | Content | Typical Size |
|-------|---------|--------------|
| `crash` (default) | Exception + crash frame only | ~2K tokens |
| `full` | All frames + traceback | ~5K tokens |
| `context` | Everything (repro, git, env, coverage) | ~10K tokens |
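
The layering in the table can be sketched as a small filter over a flat snapshot dict. This is an illustrative stand-in, not the shipped `filter_snapshot`; the function name `filter_by_detail` and the `traceback` key are assumptions, and the crash frame is assumed to sit at index 0.

```python
def filter_by_detail(snapshot, detail="crash"):
    """Return a copy of `snapshot` trimmed to the requested detail level."""
    if detail not in ("crash", "full", "context"):
        raise ValueError(f"unknown detail level: {detail!r}")
    if detail == "context":
        return dict(snapshot)  # everything, including repro/git/env
    frames = snapshot.get("frames", [])
    if detail == "crash":
        # Exception plus the crash frame only.
        return {"exception": snapshot.get("exception"), "frames": frames[:1]}
    # "full": all frames plus the traceback, but no surrounding context.
    return {
        "exception": snapshot.get("exception"),
        "frames": list(frames),
        "traceback": snapshot.get("traceback"),
    }
```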

### Snapshot Diff

Compare two snapshots to see what changed between runs:

```bash
llmdebug diff                     # Compare latest vs previous
llmdebug diff old.json new.json   # Compare specific files
llmdebug diff --json              # Output diff as JSON
```
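
For intuition about what a snapshot diff can surface, the following sketch compares the captured locals of two crash frames. It is a simplified illustration under the assumption that each frame carries a `locals` dict as in the sample output; `diff_locals` is a hypothetical helper, not the format `llmdebug diff` emits.

```python
def diff_locals(old_frame, new_frame):
    """Report which local variables were added, removed, or changed."""
    old = old_frame.get("locals", {})
    new = new_frame.get("locals", {})
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


before = {"locals": {"x": {"shape": [32, 64]}, "residual": {"shape": [32, 128]}}}
after = {"locals": {"x": {"shape": [32, 64]}, "residual": {"shape": [32, 64]}, "step": 1}}
print(diff_locals(before, after))
```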

### Git Context

Get richer git-aware debugging metadata on demand (without inflating snapshot capture payloads):

```bash
llmdebug git-context              # Latest snapshot, text view
llmdebug git-context --json       # JSON output for tooling
llmdebug git-context '#2'         # Specific snapshot reference
```

Outputs metadata only:
- crash-line blame metadata
- recent commit metadata + shortstats
- crash-file diffstat metadata

### Hypothesis Engine

Auto-generate ranked debugging hypotheses from snapshot patterns:

```bash
llmdebug hypothesize              # Analyze latest snapshot
llmdebug hypothesize --json       # Output as JSON array
```

The hypothesis engine includes 10 pattern detectors that identify common bug
patterns (empty arrays, shape mismatches, `None` values, off-by-one errors,
and similar issues) and provide actionable suggestions. Treat the results as
triage support rather than proof.
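
As an illustration of what a pattern detector might look like, the sketch below flags pairs of array locals whose shapes cannot be broadcast together under NumPy-style rules. The function names are hypothetical, and the frame layout is assumed to match the sample in the Output section; the real detectors may work differently.

```python
from itertools import combinations


def is_array(value):
    """True for summarized arrays such as {"__array__": ..., "shape": [...]}."""
    return isinstance(value, dict) and "__array__" in value and "shape" in value


def broadcastable(a, b):
    """Align trailing dimensions; each pair must be equal or contain a 1."""
    return all(x == y or x == 1 or y == 1 for x, y in zip(reversed(a), reversed(b)))


def detect_shape_mismatch(frame):
    """Return one finding per pair of array locals with incompatible shapes."""
    arrays = {
        name: value["shape"]
        for name, value in frame.get("locals", {}).items()
        if is_array(value)
    }
    findings = []
    for (n1, s1), (n2, s2) in combinations(sorted(arrays.items()), 2):
        if not broadcastable(s1, s2):
            findings.append(f"{n1}{s1} and {n2}{s2} are not broadcast-compatible")
    return findings
```

A heuristic like this cannot know which operands were actually combined on the failing line, which is one reason its output is a hypothesis to check rather than a conclusion.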

## MCP Server

`llmdebug` includes an MCP server for exposing the same evidence model to
MCP-capable tools such as Claude Code and Cursor:

```bash
llmdebug-mcp  # Start the MCP server (stdio transport)
```

Install with: `pip install llmdebug[mcp]`

### Available Tools

| Tool | Description |
|------|-------------|
| `llmdebug_diagnose` | Concise crash summary optimized for LLM consumption |
| `llmdebug_show` | Full expanded JSON snapshot with detail level control |
| `llmdebug_list` | List available snapshots with metadata |
| `llmdebug_frame` | Detailed view of a specific stack frame |
| `llmdebug_git_context` | On-demand enhanced git metadata for crash triage |
| `llmdebug_diff` | Compare two snapshots to show what changed |
| `llmdebug_hypothesize` | Generate ranked debugging hypotheses |
| `llmdebug_rca_status` | Show latest RCA state for a session |
| `llmdebug_rca_history` | Show RCA attempt history |
| `llmdebug_rca_advance` | Manually advance RCA state machine |

`llmdebug_diagnose`/`llmdebug_show` support strict `detail` controls:
`crash`, `full`, or `context`.
Invalid values are rejected with `INVALID_ARGUMENT` in JSON mode and a plain-text
argument error in text mode.
`llmdebug_list` and RCA state tools keep text defaults for backward compatibility, and can return
the same JSON envelope via `response_format="json"`.

### Evidence-First Defaults

MCP evidence tools are evidence-only by default and optimized for model consumption:

- `response_format="json"` by default on evidence tools (`diagnose`, `show`, `frame`, `git_context`,
  `diff`, `hypothesize`).
- `with_rca=false` by default on evidence tools (RCA coaching/state is opt-in).
- `evidence_schema="summary"` omits heavy payloads by default.

This keeps tool outputs compact and neutral. The protocol transports high-signal evidence and
retry deltas; model-side reasoning and patch selection remain outside the protocol contract.

See [`docs/mcp-reference.md`](docs/mcp-reference.md) for JSON envelope schemas and parameter reference.

### RCA Workflow (Opt-In)

Evidence tools omit RCA metadata unless `with_rca=true`.
Use `llmdebug_rca_status` and `llmdebug_rca_history` to inspect progression, or
`llmdebug_rca_advance` for custom/manual agent workflows.

RCA prompt contract reference: [`docs/rca_prompt_contract.md`](docs/rca_prompt_contract.md).

### Claude Code Configuration

Add to your project's `.mcp.json`:

```json
{
  "mcpServers": {
    "llmdebug": {
      "command": "llmdebug-mcp"
    }
  }
}
```

## Output

On failure, `.llmdebug/latest.json` stores a versioned `DebugSession`
envelope:

```json
{
  "schema_version": "2.0",
  "kind": "llmdebug.debug_session",
  "session": {
    "name": "test_training_step",
    "timestamp_utc": "2026-01-27T14:30:52Z",
    "llmdebug_version": "2.3.0"
  },
  "snapshot": {
    "exception": {
      "type": "ValueError",
      "message": "operands could not be broadcast together..."
    },
    "frames": [
      {
        "file": "training.py",
        "line": 42,
        "function": "train_step",
        "code": "output = model(x) + residual",
        "locals": {
          "x": {"__array__": "jax.Array", "shape": [32, 64], "dtype": "float32"},
          "residual": {"__array__": "jax.Array", "shape": [32, 128], "dtype": "float32"}
        }
      }
    ]
  },
  "context": {
    "env": {"python": "3.12.0", "platform": "Darwin-24.0.0-arm64"}
  }
}
```
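
If you want to consume the envelope without importing llmdebug, a plain `json` reader is enough. The sketch below assumes the file was written with `output_format="json"` (the default compact format abbreviates keys) and that the crash frame is the first entry in `frames`; `summarize` is a hypothetical helper.

```python
import json

# Trimmed copy of the envelope shown above, with unabbreviated keys.
RAW = """
{
  "schema_version": "2.0",
  "kind": "llmdebug.debug_session",
  "snapshot": {
    "exception": {"type": "ValueError", "message": "operands could not be broadcast together..."},
    "frames": [
      {"file": "training.py", "line": 42, "function": "train_step",
       "locals": {"x": {"shape": [32, 64]}, "residual": {"shape": [32, 128]}}}
    ]
  }
}
"""


def summarize(session):
    """Return a one-line crash summary from a DebugSession envelope."""
    if session.get("kind") != "llmdebug.debug_session":
        raise ValueError("not a DebugSession envelope")
    exc = session["snapshot"]["exception"]
    frame = session["snapshot"]["frames"][0]  # assumption: crash frame first
    return f"{exc['type']} at {frame['file']}:{frame['line']} in {frame['function']}()"


session = json.loads(RAW)
print(summarize(session))
```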

For compatibility, `get_latest_snapshot()` and loader APIs return a normalized flat view by default:

```python
from llmdebug import get_latest_snapshot

flat = get_latest_snapshot()  # default: normalized flat snapshot

from llmdebug.output import get_latest_snapshot as get_raw_snapshot
raw = get_raw_snapshot(normalize=False)  # raw DebugSession envelope
```

**Notable properties of the normalized view:**
- Crash frame is at index 0 (most relevant first)
- Arrays summarized with `shape` and `dtype` (not raw data)
- Source snippet around the failing line
- Environment info for reproducibility

### Snapshot Enrichment

Snapshots are automatically enriched with contextual data:

- **Schema metadata**: `schema_version`, `llmdebug_version`, `crash_frame_index`
- **Exception detail**: `qualified_type`, `args`, `notes`, `cause`, `context`, `exceptions` (ExceptionGroup), `error_category` with auto-classification and suggestions
- **Frame metadata**: `module`, `file_rel`, `locals_meta` (type/size hints), truncation markers
- **Git context**: commit hash, branch, dirty status
- **Pytest context**: `longrepr`, `capstdout`, `capstderr`, params, `repro` command
- **Coverage data**: executed/missing lines, branch stats (when pytest-cov is active)
- **Async context**: asyncio task name and state
- **Log records**: recent log entries (opt-in via `capture_logs=True`)
- **Capture config**: frames, locals_mode, truncation limits, redaction patterns

## Scope and Limits

`llmdebug` is an evidence layer, not an automated debugging oracle.

- Snapshots describe an observed failing execution; unexercised paths and
  nondeterministic conditions may still require reruns or extra
  instrumentation.
- Large values can be summarized, truncated, or redacted to keep artifacts
  inspectable and safer to store.
- Hypotheses and RCA metadata help prioritize investigation, but they are not
  proofs of root cause or fix correctness.
- Benchmark workflows and statistical analysis live in
  [`evals/README.md`](evals/README.md); this README focuses on shipped package
  behavior.

## For Claude Code / LLM Users

If you want a ready-to-paste workflow prompt instead of API details, the
project's own [`CLAUDE.md`](CLAUDE.md) includes a "Debug Snapshots
(llmdebug)" section for evidence-based debugging.

## Configuration

```python
from llmdebug import debug_snapshot

@debug_snapshot(
    # -- Capture scope --
    frames=5,                       # Stack frames to capture
    source_context=3,               # Lines of source before/after crash
    source_mode="all",              # "all" | "crash_only" | "none"
    locals_mode="safe",             # "safe" | "meta" | "none"
    include_args=True,              # Separate function arguments from locals
    include_modules=None,           # Filter frames by module prefix (None = all)
    max_exception_depth=5,          # Exception chain recursion limit

    # -- Truncation limits --
    max_str=500,                    # Truncate long strings
    max_items=50,                   # Truncate large collections

    # -- Redaction and privacy --
    redaction_profile="dev",        # Optional: "dev" | "ci" | "prod"
    redact=[r"api_key=.*"],         # Regex patterns to redact
    redact_keys=False,              # Keep dict keys stable by default
    redact_traceback=False,         # Redact traceback text
    redact_exception_strings=False, # Redact exception message/args/notes

    # -- Context enrichment --
    include_env=True,               # Include Python/platform info
    include_git=True,               # Git commit/branch/dirty status
    git_dirty_mode="always",        # "always" | "cached" | "off"
    categorize_errors=True,         # Auto-classify errors with suggestions
    include_async_context=True,     # Asyncio task info
    include_array_stats=False,      # Compute min/max/mean/std for arrays
    capture_logs=False,             # Capture recent log records
    log_max_records=20,             # Max log records to capture
    include_coverage=True,          # Pytest-plugin coverage enrichment toggle

    # -- Output and storage --
    out_dir=".llmdebug",            # Output directory
    max_snapshots=50,               # Auto-cleanup old snapshots (0 = unlimited)
    output_format="json_compact",   # "json" | "json_compact" | "toon"
    lock_timeout=5.0,               # Seconds to wait for file lock
)
def main():
    ...
```

All parameters go in a single `@debug_snapshot(...)` call. Groups are for readability.

Redaction defaults to leaf string values only. This avoids accidental key collisions in nested dicts.
Set `redact_keys=True` only if you explicitly need key-name redaction and can accept possible key merging.
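
The difference between leaf-value and key redaction can be shown with a small recursive sketch. This is an illustration of the idea using only `re`, not llmdebug's redaction code; `redact_leaves` and the `[REDACTED]` placeholder are assumptions.

```python
import re


def redact_leaves(value, patterns, placeholder="[REDACTED]"):
    """Apply regex redaction to string leaves only; dict keys stay untouched."""
    if isinstance(value, str):
        for pattern in patterns:
            value = re.sub(pattern, placeholder, value)
        return value
    if isinstance(value, dict):
        return {k: redact_leaves(v, patterns, placeholder) for k, v in value.items()}
    if isinstance(value, list):
        return [redact_leaves(v, patterns, placeholder) for v in value]
    return value  # numbers, None, and other leaves pass through unchanged


data = {"api_key=a": "keep", "api_key=b": ["token api_key=123", 5]}
print(redact_leaves(data, [r"api_key=.*"]))
```

Had the same pattern been applied to the keys, both `api_key=a` and `api_key=b` would collapse to the same placeholder key and one entry would overwrite the other, which is the key-merging risk described above.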

`redaction_profile` provides preset behavior:
- `dev`: minimal redaction defaults
- `ci`: stronger string redaction for non-local workflows
- `prod`: strictest defaults (includes traceback/exception-string redaction)

Profiles are additive: explicit `redact`/`redact_*` options always take precedence.

`include_coverage` currently applies to pytest-plugin failure captures only. Coverage data is
attached when pytest-cov is active and `LLMDEBUG_INCLUDE_COVERAGE` is enabled.
Decorator/context-manager captures do not currently add coverage payloads.

`include_git` uses a short in-process cache for static repo metadata (commit/branch).
`git_dirty_mode` controls dirty checks:
- `always`: refresh dirty status on every capture (default)
- `cached`: reuse dirty status while cache is valid
- `off`: skip dirty status detection

### Environment Variables

All configuration options can also be set via environment variables for pytest:

```bash
LLMDEBUG_OUTPUT_FORMAT=json pytest               # Use pretty JSON
LLMDEBUG_INCLUDE_GIT=false pytest                # Disable git context
LLMDEBUG_GIT_DIRTY_MODE=cached pytest            # Dirty check strategy
LLMDEBUG_CAPTURE_LOGS=true pytest                # Enable log capture
LLMDEBUG_REDACTION_PROFILE=ci pytest             # Use CI redaction profile
LLMDEBUG_REDACT_TRACEBACK=true pytest            # Redact traceback text
LLMDEBUG_REDACT_EXCEPTION_STRINGS=true pytest    # Redact exception strings
LLMDEBUG_RCA_MAX_RECORDS=5000 pytest             # Cap persisted RCA history records
```
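
When pytest is launched from a script rather than a shell (for example in a CI wrapper), the same variables can be passed through the child environment. The sketch below uses only variable names listed above; `build_env` and `run_pytest` are hypothetical helpers, and the chosen values are just an example.

```python
import os
import subprocess
import sys


def build_env(base=None):
    """Copy the base environment and layer llmdebug settings on top."""
    env = dict(os.environ if base is None else base)
    env.update({
        "LLMDEBUG_REDACTION_PROFILE": "ci",
        "LLMDEBUG_OUTPUT_FORMAT": "json",
        "LLMDEBUG_CAPTURE_LOGS": "true",
    })
    return env


def run_pytest(extra_args=()):
    """Run pytest in a child process with the settings applied."""
    cmd = [sys.executable, "-m", "pytest", *extra_args]
    return subprocess.run(cmd, env=build_env()).returncode
```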

### Output Formats

llmdebug supports multiple output formats to optimize for different use cases:

| Format | Size | Best For |
|--------|------|----------|
| `json` | baseline | Human readability, external tools |
| `json_compact` (default) | ~40% smaller | LLM context efficiency |
| `toon` | ~50% smaller | Maximum token savings |

**Compact JSON** uses abbreviated keys (e.g., `_exc` instead of `exception`) to reduce token usage. The `get_latest_snapshot()` function auto-expands keys and normalizes DebugSession envelopes by default, so your code works identically regardless of format.
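
Key expansion of this kind amounts to a recursive rename. The sketch below shows the idea with a tiny mapping: `_exc` comes from the example above, while `_fr` is a made-up abbreviation used purely for illustration, and `expand_keys` is a hypothetical helper rather than the library's loader.

```python
KEY_MAP = {
    "_exc": "exception",  # documented example
    "_fr": "frames",      # hypothetical abbreviation, for illustration only
}


def expand_keys(value, key_map=KEY_MAP):
    """Recursively replace abbreviated dict keys with their long names."""
    if isinstance(value, dict):
        return {key_map.get(k, k): expand_keys(v, key_map) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_keys(v, key_map) for v in value]
    return value


compact = {"_exc": {"type": "ValueError"}, "_fr": [{"line": 42}]}
print(expand_keys(compact))
```

Keys that are not in the mapping pass through unchanged, so the same function is safe to run on already-expanded JSON.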

### Pytest Opt-out

Skip snapshot capture for specific tests:

```python
import pytest

@pytest.mark.no_snapshot
def test_expected_failure():
    ...
```

## API

```python
from llmdebug import (
    # Capture
    debug_snapshot,          # Decorator for exception capture
    snapshot_section,        # Context manager for targeted capture
    get_latest_snapshot,     # Read the most recent snapshot (auto-expands keys)
    SnapshotConfig,          # Configuration dataclass
    RedactionProfile,        # Type alias: "dev" | "ci" | "prod"
    resolve_redaction_policy,# Resolve profile + explicit redaction settings

    # Analysis
    generate_hypotheses,     # Auto-generate debugging hypotheses from a snapshot
    Hypothesis,              # Hypothesis dataclass (confidence, pattern, evidence, suggestion)
    filter_snapshot,         # Layered disclosure: filter to crash/full/context detail
    DetailLevel,             # Type alias: "crash" | "full" | "context"

    # Production hooks
    install_hooks,           # Install sys.excepthook + thread + unraisable hooks
    uninstall_hooks,         # Restore original hooks
    PII_PATTERNS,            # Default PII redaction patterns (email, API keys, etc.)

    # Jupyter / IPython
    load_jupyter,            # Install into current IPython/Jupyter session

    # Web middleware
    LLMDebugWSGIMiddleware,  # WSGI middleware (Flask, Django)
    LLMDebugASGIMiddleware,  # ASGI middleware (FastAPI, Starlette)

    # Log capture
    enable_log_capture,      # Install log handler to capture recent records
)

# Read the most recent snapshot programmatically
snapshot = get_latest_snapshot()  # Returns dict or None

# Filter to minimal detail for LLM context
from llmdebug import filter_snapshot
filtered = filter_snapshot(snapshot, "crash")  # Exception + crash frame only

# Generate debugging hypotheses
from llmdebug import generate_hypotheses
hypotheses = generate_hypotheses(snapshot)
for h in hypotheses:
    print(f"[{h.confidence:.0%}] {h.description}")
    print(f"  Suggestion: {h.suggestion}")
```

## Developer Guide

See [`CONTRIBUTING.md`](CONTRIBUTING.md) for the development workflow and
quality gates. For benchmark methodology and analysis commands, see
[`evals/README.md`](evals/README.md).

## Project Docs

- MCP reference: [`docs/mcp-reference.md`](docs/mcp-reference.md)
- Research roadmap: [`docs/research-improvement-roadmap.md`](docs/research-improvement-roadmap.md)
- Eval harness: [`evals/README.md`](evals/README.md)
- Contributing guide: [`CONTRIBUTING.md`](CONTRIBUTING.md)
- Security policy: [`SECURITY.md`](SECURITY.md)
- Code of conduct: [`CODE_OF_CONDUCT.md`](CODE_OF_CONDUCT.md)

## License

MIT
