Research Software | Open Source | MIT

Evidence-First Snapshot Capture for LLM-Assisted Debugging

llmdebug captures structured execution evidence at failure time, including exception context, stack frames, and local state, and makes this evidence available through CLI, notebook, and MCP interfaces.

Problem

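Without runtime evidence, LLM-assisted debugging is often underdetermined: the source code and error message alone are usually consistent with several plausible causes.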

Contribution

Reproducible snapshots that prioritize crash-site signal over verbose traces.

$ pip install "llmdebug[cli]"

Abstract

Context. LLM-assisted debugging pipelines often receive insufficient runtime evidence, which limits diagnosis quality.

Method. llmdebug captures structured snapshots at exception boundaries with crash-frame prioritization and local-state summaries.

Output. The captured evidence is exposed through machine-readable and human-readable interfaces (CLI, notebook, MCP) to support iterative analysis.

Scope. The project provides evidence transport and inspection; it does not provide formal guarantees of root-cause correctness.

Key Contributions

This section summarizes the core capabilities that make snapshot-based debugging reproducible and inspectable.

Default failure capture

Pytest failures can emit snapshots without additional project instrumentation.

Structured, inspectable artifacts

JSON snapshots preserve exception context, frames, locals, and environment metadata.
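To make the shape of such an artifact concrete, the following is a minimal sketch of capturing exception context, frames, and locals using only the Python standard library. It is not llmdebug's implementation: the function name `capture_snapshot`, the helper functions, and the field names beyond `exception` and `closest_frame` are assumptions made for illustration.

```python
import json
import platform
import sys
import traceback


def capture_snapshot(exc: BaseException) -> dict:
    """Build a JSON-serializable snapshot from a caught exception (illustrative only)."""
    frames = []
    # walk_tb yields (frame, lineno) pairs from the outermost call to the crash site.
    for frame, lineno in traceback.walk_tb(exc.__traceback__):
        frames.append({
            "file": frame.f_code.co_filename,
            "function": frame.f_code.co_name,
            "line": lineno,
            # repr() keeps every local serializable, at the cost of losing structure.
            "locals": {name: repr(value) for name, value in frame.f_locals.items()},
        })
    return {
        "exception": f"{type(exc).__name__}: {exc}",
        # The last traceback entry is the frame where the exception was raised.
        "closest_frame": frames[-1] if frames else None,
        "frames": frames,
        "environment": {"python": platform.python_version(), "platform": sys.platform},
    }


def divide(a, b):
    return a / b


def failing_call() -> dict:
    """Trigger a failure and return the snapshot taken at the exception boundary."""
    try:
        divide(1, 0)
    except ZeroDivisionError as err:
        return capture_snapshot(err)
    return {}


snapshot = failing_call()
print(json.dumps(snapshot["closest_frame"], indent=2))
```

Storing locals as `repr()` strings is the simplest way to guarantee serializability; a real tool would likely apply type-aware summaries instead.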

Consistent multi-surface access

The same evidence can be accessed from the terminal, notebooks, production hooks, and MCP.

Differential analysis support

Snapshot diffing enables run-to-run comparisons for regression diagnosis.
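As a rough illustration of what a run-to-run comparison involves, the sketch below diffs two snapshot dictionaries and reports every leaf value that changed. The function `diff_snapshots` is hypothetical and does not reflect how `llmdebug diff` is implemented; the sample fields are borrowed from the illustrative snapshot shown elsewhere on this page.

```python
def diff_snapshots(old: dict, new: dict, prefix: str = "") -> dict:
    """Return {dotted_path: (old_value, new_value)} for every differing leaf."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}.{key}" if prefix else str(key)
        # A missing key is reported as None, so absent and null are not distinguished.
        left, right = old.get(key), new.get(key)
        if isinstance(left, dict) and isinstance(right, dict):
            changes.update(diff_snapshots(left, right, path))
        elif left != right:
            changes[path] = (left, right)
    return changes


run_a = {
    "exception": "ValueError: shape mismatch",
    "closest_frame": {"file": "pipeline.py", "line": 47},
}
run_b = {
    "exception": "ValueError: shape mismatch",
    "closest_frame": {"file": "pipeline.py", "line": 52},
}

print(diff_snapshots(run_a, run_b))  # {'closest_frame.line': (47, 52)}
```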

Hypothesis support

A pattern-based hypothesis engine ranks common failure mechanisms for faster triage.
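One way to picture pattern-based ranking is a table of regular expressions with weights, matched against the exception text. The sketch below assumes that design; the patterns, labels, weights, and the name `rank_hypotheses` are invented for illustration and are not taken from llmdebug.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    label: str     # human-readable failure mechanism
    regex: str     # matched case-insensitively against the exception text
    weight: float  # hypothetical prior used only for ordering


# Hypothetical pattern table; a real engine would likely inspect frames and locals too.
PATTERNS = [
    Pattern("array shape mismatch", r"shape|broadcast|dimension", 0.9),
    Pattern("missing key or attribute", r"KeyError|AttributeError", 0.8),
    Pattern("None propagated into computation", r"NoneType", 0.7),
]


def rank_hypotheses(exception_text: str) -> list:
    """Return (label, weight) pairs for matching patterns, highest weight first."""
    matches = [
        (p.label, p.weight)
        for p in PATTERNS
        if re.search(p.regex, exception_text, re.IGNORECASE)
    ]
    return sorted(matches, key=lambda item: item[1], reverse=True)


print(rank_hypotheses("ValueError: shape mismatch"))
```

Because the ranking only reflects which patterns match, its output is a triage ordering rather than a diagnosis.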

Governance-aware controls

Redaction policies and rate limiting support safer operation in production contexts.

Method Overview

This section outlines the failure-triggered pipeline from exception boundary to evidence consumption.

  1. Exception boundary

    def test_transform():
        # transform and data are provided by the project under test
        result = transform(data)
        assert result.shape == (100, 5)
  2. Snapshot serialization

    {
      "exception": "ValueError: shape mismatch",
      "closest_frame": {
        "file": "pipeline.py",
        "line": 47
      }
    }
  3. Evidence consumption

    $ llmdebug show
    $ llmdebug hypothesize
    $ llmdebug diff

Capability Evidence

This section lists currently available capabilities with links to canonical documentation.

Evidence table: features available in the current public release.

| Capability | Status | Documentation |
|---|---|---|
| Pytest failures produce snapshots by default | Available | README: Quick Start |
| CLI inspection (show, list, frames, diff, hypothesize) | Available | README: CLI |
| Detail levels (crash, full, context) for evidence size control | Available | README: Detail Levels |
| Production hooks with rate limiting and redaction controls | Available | README: Production Hooks |
| MCP server with evidence tools and RCA state tools | Available | README: MCP Server |

Reproducible Minimal Example

This section provides a minimal procedure for capturing and inspecting a failure artifact.

terminal
$ pip install "llmdebug[cli]"
$ pytest
$ llmdebug show

Interfaces (Shared Evidence Contract)

This section shows equivalent access patterns for the same snapshot evidence across integration surfaces.

shell
# zero additional setup after installation
$ pip install llmdebug
$ pytest

# failure artifact:
#   .llmdebug/latest.json
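Because the artifact is plain JSON, it can also be read directly from a script. The sketch below loads `.llmdebug/latest.json` and prints a one-line summary. The helper names are hypothetical, and the `exception` and `closest_frame` fields are assumed from the illustrative snapshot in the Method Overview; the actual schema may contain more or differently named fields.

```python
import json
from pathlib import Path
from typing import Optional


def load_latest_snapshot(root: Path = Path(".")) -> Optional[dict]:
    """Return the parsed snapshot, or None if no failure artifact exists under root."""
    path = root / ".llmdebug" / "latest.json"
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def summarize(snapshot: dict) -> str:
    """Format the exception and crash location, tolerating missing fields."""
    frame = snapshot.get("closest_frame") or {}
    exception = snapshot.get("exception", "<unknown>")
    return f'{exception} at {frame.get("file", "?")}:{frame.get("line", "?")}'


latest = load_latest_snapshot()
print(summarize(latest) if latest is not None else "no snapshot found")
```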

Data Governance and Operational Safety

This section summarizes data handling safeguards and operational constraints for practical deployments.

Redaction-aware capture

Built-in redaction controls help reduce accidental leakage of sensitive fields in stored snapshots.
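A common approach to redaction is to mask values whose key names look sensitive before the snapshot is written. The sketch below shows that idea under stated assumptions: the key pattern, the `[REDACTED]` placeholder, and the `redact` function are illustrative and do not describe llmdebug's actual redaction policy format.

```python
import re

# Hypothetical policy: any key containing one of these substrings is masked.
SENSITIVE_KEY = re.compile(r"password|secret|token|api_key", re.IGNORECASE)


def redact(value):
    """Recursively replace values under sensitive-looking keys with a placeholder."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if SENSITIVE_KEY.search(str(key)) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


captured_locals = {
    "user": "ada",
    "api_token": "abc123",
    "sessions": [{"id": 1, "password": "hunter2"}],
}
print(redact(captured_locals))
```

Key-based matching does not catch secrets embedded in string values, which is one reason such controls reduce leakage rather than eliminate it.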

Production rate limiting

Exception hooks apply rate limits to avoid artifact floods during repeated failures.
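A sliding-window limiter is one simple way to bound how many snapshots a repeated failure can produce. The sketch below is an assumption about how such a limit could work, not a description of llmdebug's hook; the class name, parameters, and injectable clock are all hypothetical.

```python
import time
from collections import deque


class SnapshotRateLimiter:
    """Allow at most max_events snapshot writes per sliding window (illustrative)."""

    def __init__(self, max_events: int, window_seconds: float, clock=time.monotonic):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be checked deterministically
        self._timestamps = deque()

    def allow(self) -> bool:
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) < self.max_events:
            self._timestamps.append(now)
            return True
        return False


fake_now = [0.0]
limiter = SnapshotRateLimiter(max_events=2, window_seconds=60.0, clock=lambda: fake_now[0])
decisions = [limiter.allow() for _ in range(3)]  # the third call exceeds the budget
fake_now[0] = 61.0  # advance past the window
decisions.append(limiter.allow())
print(decisions)  # [True, True, False, True]
```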

Local-first artifact handling

Snapshots are stored locally by default, enabling offline and air-gapped debugging workflows.

No automatic causal guarantees

The system provides structured context for diagnosis but does not establish causal correctness.

Limitations

This section states interpretation limits and known boundaries of the current implementation.

  • Snapshots reflect observed failing executions; unexercised paths remain unobserved.
  • Large objects may be summarized for compactness, which can omit low-level detail.
  • Hypothesis ranking is heuristic and should be treated as triage support, not proof.
  • Benchmark and statistical evaluation workflows live in evals/ and are separate from this page.
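The trade-off behind the large-object limitation can be seen in a small sketch. The function below summarizes long sequences and truncates long representations; the name `summarize_value`, the thresholds, and the output shape are assumptions for illustration and do not reflect llmdebug's actual summarization rules.

```python
def summarize_value(value, max_items: int = 5, max_chars: int = 80):
    """Return a compact, JSON-friendly stand-in for a captured local (illustrative)."""
    if isinstance(value, (list, tuple)) and len(value) > max_items:
        # Keep the type, the length, and a short head; everything else is dropped.
        return {
            "type": type(value).__name__,
            "length": len(value),
            "head": [repr(item) for item in value[:max_items]],
        }
    text = repr(value)
    if len(text) > max_chars:
        return text[:max_chars] + "..."
    return text


print(summarize_value(list(range(1000))))
print(summarize_value(42))
```

In this sketch an anomaly at index 700 of the list would not appear in the summary, which is the kind of low-level detail the limitation refers to.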

Citation and Resources

This section provides a software citation template and release-traceable project references.

BibTeX (software citation template)
@software{vadasz2026llmdebug,
  author  = {Vadasz, Nicolas},
  title   = {llmdebug: Structured Debug Snapshots for LLM-Assisted Debugging},
  year    = {2026},
  url     = {https://github.com/nicholasvadasz/llmdebug},
  license = {MIT}
}