Metadata-Version: 2.4
Name: kurral
Version: 0.1.1
Summary: Deterministic LLM agent testing and replay framework
Author-email: Kurral Team <team@kurral.dev>
License: Apache-2.0
Project-URL: Homepage, https://github.com/kurral-dev/kurral-cli
Project-URL: Documentation, https://github.com/kurral-dev/kurral-cli#readme
Project-URL: Repository, https://github.com/kurral-dev/kurral-cli
Project-URL: Issues, https://github.com/kurral-dev/kurral-cli/issues
Project-URL: Changelog, https://github.com/kurral-dev/kurral-cli/blob/main/CHANGELOG.md
Keywords: llm,testing,determinism,langsmith,replay,ai,agents,regression-testing,a-b-testing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.1.7
Requires-Dist: pydantic>=2.5.0
Requires-Dist: pydantic-settings>=2.1.0
Requires-Dist: fastapi>=0.109.0
Requires-Dist: uvicorn>=0.27.0
Requires-Dist: sqlalchemy>=2.0.0
Requires-Dist: psycopg2-binary>=2.9.9
Requires-Dist: boto3>=1.34.0
Requires-Dist: httpx>=0.26.0
Requires-Dist: rich>=13.7.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: langsmith>=0.1.0
Requires-Dist: openai>=1.0.0
Requires-Dist: anthropic>=0.18.0
Requires-Dist: tenacity>=8.2.3
Requires-Dist: python-dateutil>=2.8.2
Provides-Extra: dev
Requires-Dist: pytest>=7.4.3; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=23.12.0; extra == "dev"
Requires-Dist: ruff>=0.1.9; extra == "dev"
Requires-Dist: mypy>=1.8.0; extra == "dev"
Dynamic: license-file

# Kurral 🎯

**Deterministic LLM Agent Testing & Replay Framework**

Kurral lets you capture, replay, and evaluate LLM agent behavior deterministically. It is built for teams that need reliable testing, regression detection, and reproducible AI workflows.

## Features

- 🎬 **Capture & Replay**: Record LLM traces with full tool/MCP calls and replay them deterministically with stubbed responses
- 🔍 **Determinism Scoring**: Automatic analysis of replay reliability (Levels A/B/C)
- 🧬 **LLM State Hydration**: Captures complete sampler state (temperature, top_p, seed, etc.) for Level B promotion
- ✅ **Validation & Diff**: Hash + structural comparison with detailed diff output
- 📊 **Regression Testing**: Compare agent versions with ARS (Agent Regression Score)
- 🪣 **Semantic Buckets**: Organize traces by business logic (e.g., `refund_flow`, `support_chat`)
- 🔌 **LangSmith Integration**: Seamless import from LangSmith traces
- 🔧 **MCP Support**: Full Model Context Protocol tool capture and stubbing
- 💾 **Dual Storage**: PostgreSQL metadata + Cloudflare R2 artifact storage
- 🎨 **Beautiful CLI**: Rich terminal UI with progress bars, diffs, and debug views

## Quick Start

### Installation

```bash
pip install kurral
```

### Usage with Decorator

```python
from kurral import trace_llm
from openai import OpenAI

client = OpenAI()

@trace_llm(semantic_bucket="customer_support", tenant_id="acme_prod")
def handle_customer_query(query: str) -> str:
    """Handle customer support query with LLM"""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": query}],
        seed=12345,  # Important for determinism!
    )
    return response.choices[0].message.content

# This call is automatically traced and exported as a .kurral artifact
result = handle_customer_query("I need a refund for order #12345")
```

### Export from LangSmith

```bash
# Export a specific run
kurral export --run-id ls_abc123 --output trace.kurral

# Export from local trace file
kurral export --input trace.json --output trace.kurral
```

### Replay

```bash
# Basic replay - stubs all tool/MCP calls, reconstructs output stream
kurral replay trace.kurral

# Show diff between original and replayed outputs
kurral replay trace.kurral --diff

# Debug mode - shows LLM sampler state, stub status, graph hashes
kurral replay trace.kurral --debug --verbose

# Replay with prompt override for testing
kurral replay trace.kurral --prompt-override new_prompt.txt --diff
```

**What Replay Does:**
- Loads the `.kurral` artifact with all recorded data
- Primes cache with every tool/MCP call (inputs, outputs, status, latency, hashes)
- Stubs external calls—nothing hits live APIs during replay
- Reconstructs the assistant output stream (items/full_text/stream_map)
- Validates outputs via hash + structural comparison
- Returns enriched metadata: LLM sampler state, replay ID, validation results, graph fingerprint
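
The cache-priming and stubbing steps above can be pictured with a small sketch. It is not Kurral's implementation: `cache_key_for`, `prime_cache`, and `stubbed_call` are hypothetical helpers, and keying on a SHA-256 of the tool name plus canonical JSON input is an assumption made for illustration.

```python
import hashlib
import json


def cache_key_for(tool_name: str, tool_input: dict) -> str:
    """Hypothetical key: SHA-256 over the tool name and canonical JSON input."""
    payload = json.dumps({"tool": tool_name, "input": tool_input}, sort_keys=True)
    return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


def prime_cache(tool_calls: list[dict]) -> dict[str, dict]:
    """Index every recorded tool call by its key so replay can look it up."""
    return {cache_key_for(c["tool_name"], c["input"]): c for c in tool_calls}


def stubbed_call(cache: dict[str, dict], tool_name: str, tool_input: dict) -> dict:
    """Return the recorded output instead of calling the live tool."""
    key = cache_key_for(tool_name, tool_input)
    if key not in cache:
        raise KeyError(f"no recorded call for {tool_name} with input {tool_input}")
    return cache[key]["output"]


# One recorded call, shaped like an entry of `tool_calls` in a .kurral artifact.
recorded = [
    {
        "tool_name": "check_order_status",
        "input": {"order_id": "12345"},
        "output": {"status": "shipped"},
    }
]
cache = prime_cache(recorded)
result = stubbed_call(cache, "check_order_status", {"order_id": "12345"})
print(result)
```

Raising on a cache miss, rather than falling through to the live tool, is what keeps a replay from touching external APIs in this sketch.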

### Backtest (Regression Testing)

```bash
# Test new agent version against golden bucket
kurral backtest \
  --baseline semantic:golden_tests \
  --candidate ./new_config.yaml \
  --threshold 0.90 \
  --output report.json
```

### Query Artifacts

```bash
# List semantic buckets
kurral buckets list

# Show artifacts in a bucket
kurral buckets show --semantic refund_flow

# Filter by tenant and date
kurral buckets show --tenant acme_prod --since 2024-01-01
```
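
For artifacts already loaded as dictionaries, the same kind of filtering can be approximated in Python. This is a minimal sketch and not part of the Kurral API: `filter_artifacts` is a hypothetical helper that relies only on the `semantic_buckets`, `tenant_id`, and `timestamp` fields of the artifact format.

```python
from datetime import datetime, timezone


def filter_artifacts(artifacts, semantic=None, tenant=None, since=None):
    """Keep artifacts matching every filter that is not None.

    `since` must be a timezone-aware datetime (hypothetical convention).
    """
    selected = []
    for art in artifacts:
        if semantic is not None and semantic not in art.get("semantic_buckets", []):
            continue
        if tenant is not None and art.get("tenant_id") != tenant:
            continue
        if since is not None:
            # Rewrite the trailing "Z" so fromisoformat() accepts it everywhere.
            ts = datetime.fromisoformat(art["timestamp"].replace("Z", "+00:00"))
            if ts < since:
                continue
        selected.append(art)
    return selected


artifacts = [
    {"kurral_id": "a1", "tenant_id": "acme_prod",
     "semantic_buckets": ["refund_flow"], "timestamp": "2024-01-15T10:30:00Z"},
    {"kurral_id": "a2", "tenant_id": "acme_prod",
     "semantic_buckets": ["support_chat"], "timestamp": "2023-12-01T08:00:00Z"},
    {"kurral_id": "a3", "tenant_id": "other",
     "semantic_buckets": ["refund_flow"], "timestamp": "2024-02-01T09:00:00Z"},
]
cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)
recent_acme = filter_artifacts(artifacts, tenant="acme_prod", since=cutoff)
refunds = filter_artifacts(artifacts, semantic="refund_flow")
print([a["kurral_id"] for a in recent_acme])
print([a["kurral_id"] for a in refunds])
```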

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│                    Your LLM Agent                       │
│               (with @trace_llm decorator)               │
└─────────────────────┬───────────────────────────────────┘
                      │
         ┌────────────▼────────────┐
         │   Kurral Deep Capture   │
         │  • LLM interactions     │
         │  • Tool calls (I/O)     │
         │  • Timing (start/end)   │
         │  • Execution graphs     │
         │  • Error forensics      │
         └────────────┬────────────┘
                      │
         ┌────────────▼─────────────┐
         │  Optional: LangSmith     │
         │  (parallel enrichment)   │
         └────────────┬─────────────┘
                      │
         ┌────────────▼────────────┐
         │  .kurral Artifact       │
         │  • Complete trace       │
         │  • Determinism score    │
         │  • Replay metadata      │
         └────────────┬────────────┘
                      │
         ┌────────────▼────────────┐
         │    Kurral API Backend   │
         │  • Authentication       │
         │  • Artifact upload      │
         │  • Query & analytics    │
         └────────────┬────────────┘
                      │
         ┌────────────▼────────────┐
         │  Dual Storage System    │
         │  • PostgreSQL (metadata)│
         │  • R2 (full artifacts)  │
         └─────────────────────────┘
```

**Why This Architecture?**

1. **Beyond OpenTelemetry**: We capture LLM-specific data that generic telemetry misses (prompts, token usage, determinism scores, tool call graphs)
2. **Replay-First Design**: Every artifact contains everything needed to reproduce the execution
3. **Security & Compliance**: Immutable artifacts with cryptographic hashes, perfect for audit trails
4. **Cost Attribution**: Track spend by semantic bucket, model, and tenant
5. **Drift Detection**: Graph and prompt hashing detects when agent behavior changes

## .kurral Artifact Format

A `.kurral` file is a JSON artifact containing everything needed for deterministic replay:

```json
{
  "kurral_id": "550e8400-e29b-41d4-a716-446655440000",
  "run_id": "ls_abc123",
  "tenant_id": "acme_prod",
  "semantic_buckets": ["refund_flow", "tier_3_support"],
  "environment": "production",
  "deterministic": true,
  "replay_level": "B",
  "timestamp": "2024-01-15T10:30:00Z",
  "inputs": {"query": "I need a refund"},
  "outputs": {
    "items": ["Refund", " initiated", "..."],
    "full_text": "Refund initiated...",
    "stream_map": [...]
  },
  "llm_config": {
    "model_name": "gpt-4-0613",
    "provider": "openai",
    "parameters": {
      "temperature": 0.0,
      "top_p": 1.0,
      "seed": 12345,
      "max_tokens": 500
    }
  },
  "resolved_prompt": {
    "template": "You are a helpful assistant...",
    "variables": {"tone": "professional"},
    "final_text": "You are a helpful assistant. Be professional..."
  },
  "graph_version": {
    "graph_hash": "d35d48cb...",
    "graph_checksum": "2f2011f2..."
  },
  "tool_calls": [
    {
      "id": "tc_abc123",
      "parent_id": null,
      "tool_name": "check_order_status",
      "type": "http",
      "start_time": "2024-01-15T10:30:00.123Z",
      "end_time": "2024-01-15T10:30:00.265Z",
      "latency_ms": 142,
      "input": {"order_id": "12345"},
      "output": {"status": "shipped"},
      "status": "ok",
      "error_flag": false,
      "cache_key": "sha256:abc123...",
      "output_hash": "sha256:def456...",
      "effect_type": "http",
      "metadata": {
        "cost_usd": 0.001,
        "retry_count": 0,
        "cache_hit": false
      }
    }
  ]
}
```

**Key Fields for Advanced Observability:**
- `llm_config.parameters`: Complete sampler state (temperature, top_p, seed, etc.)
- `tool_calls[]`: Enhanced tool call records with:
  - `id`: Unique identifier for each operation
  - `parent_id`: Build execution graphs for nested/chained calls
  - `start_time`/`end_time`: Precise timing for concurrency analysis
  - `type`: Operation classification (tool, llm, agent, retrieval, etc.)
  - `error_type`/`error_stack`: Comprehensive error forensics
  - `metadata`: Extensible field for cost, tokens, custom attributes
- `graph_version`: Pinned graph fingerprint for version tracking
- `outputs`: Stream reconstruction data (items, full_text, stream_map)
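
As an illustration of how `id` and `parent_id` relate, the sketch below groups calls under their parents. `build_call_tree` is a hypothetical helper, not a Kurral function, and it assumes root calls carry `parent_id: null` as in the example artifact.

```python
from collections import defaultdict
from typing import Optional


def build_call_tree(tool_calls: list[dict]) -> dict[Optional[str], list[str]]:
    """Map each parent id (None for root calls) to the ids of its direct children."""
    children: dict[Optional[str], list[str]] = defaultdict(list)
    for call in tool_calls:
        children[call.get("parent_id")].append(call["id"])
    return dict(children)


calls = [
    {"id": "tc_root", "parent_id": None},
    {"id": "tc_lookup", "parent_id": "tc_root"},
    {"id": "tc_refund", "parent_id": "tc_root"},
]
tree = build_call_tree(calls)
print(tree[None])       # ids of root calls
print(tree["tc_root"])  # ids of calls made under tc_root
```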

**What This Enables:**
- **Execution Graphs**: Reconstruct parent-child call relationships
- **Concurrency Analysis**: Find overlapping operations and bottlenecks
- **Cost Attribution**: Track spend by operation type and business context
- **Error Pattern Detection**: Group and analyze failures by type
- **External Correlation**: Link to infrastructure events (DB outages, API issues)
- **Future-Proof**: Add new metadata fields without schema migrations
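
Two of these analyses can be sketched directly on the `tool_calls` records. The helpers below are hypothetical and only assume the `start_time`, `end_time`, `type`, and `metadata.cost_usd` fields shown in the example artifact.

```python
from collections import defaultdict
from datetime import datetime
from itertools import combinations


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, rewriting a trailing 'Z' as a UTC offset."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def overlapping_pairs(tool_calls: list[dict]) -> list[tuple[str, str]]:
    """Return id pairs whose [start_time, end_time] intervals overlap."""
    spans = [
        (c["id"], parse_ts(c["start_time"]), parse_ts(c["end_time"]))
        for c in tool_calls
    ]
    return [
        (id_a, id_b)
        for (id_a, start_a, end_a), (id_b, start_b, end_b) in combinations(spans, 2)
        if start_a < end_b and start_b < end_a
    ]


def cost_by_type(tool_calls: list[dict]) -> dict[str, float]:
    """Sum metadata.cost_usd per operation type; missing costs count as 0."""
    totals: dict[str, float] = defaultdict(float)
    for call in tool_calls:
        totals[call["type"]] += call.get("metadata", {}).get("cost_usd", 0.0)
    return dict(totals)


calls = [
    {"id": "a", "type": "http", "metadata": {"cost_usd": 0.001},
     "start_time": "2024-01-15T10:30:00.100Z", "end_time": "2024-01-15T10:30:00.300Z"},
    {"id": "b", "type": "http", "metadata": {"cost_usd": 0.002},
     "start_time": "2024-01-15T10:30:00.200Z", "end_time": "2024-01-15T10:30:00.400Z"},
    {"id": "c", "type": "llm", "metadata": {"cost_usd": 0.01},
     "start_time": "2024-01-15T10:30:00.500Z", "end_time": "2024-01-15T10:30:00.600Z"},
]
overlaps = overlapping_pairs(calls)
costs = cost_by_type(calls)
print(overlaps)
print(costs)
```

The pairwise comparison is quadratic in the number of calls, which is fine for a single trace; a sweep over sorted start times would be the usual choice for larger inputs.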

## Determinism & Replay Levels

Kurral classifies every trace by determinism score:

- **Level A** (score ≥ 0.90): Byte-for-byte reproducible (frozen model + seed + tool cache + clock)
- **Level B** (0.50 ≤ score < 0.90): Structurally equivalent (same tool I/O + sampler settings captured)
- **Level C** (score < 0.50): Task-level equivalence only (use recorded tool outputs, validate task metrics)
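
The thresholds above translate into a simple mapping from score to level. `replay_level` is a hypothetical helper written for illustration; how Kurral computes the score itself is not covered here.

```python
def replay_level(score: float) -> str:
    """Map a determinism score to a replay level using the documented thresholds."""
    if score >= 0.90:
        return "A"
    if score >= 0.50:
        return "B"
    return "C"


levels = {score: replay_level(score) for score in (0.95, 0.90, 0.72, 0.50, 0.49)}
print(levels)
```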

**Promoting C → B:**  
Replay hydrates the LLM sampler state (temperature, top_p, top_k, max_tokens, seed, penalties), so you can validate structural equivalence and promote a trace from Level C to Level B when the Level B conditions are met.

**Replay Validation:**  
Every replay returns:
- `ReplayLLMState`: complete sampler config
- `ReplayValidation`: output hash match, structural match, diff
- `ReplayMetadata`: replay_id, record_ref, replay_level, assertion hooks
- Stubbed tool calls flagged with `stubbed_in_replay=True`
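
To make the difference between a hash match and a structural match concrete, here is a minimal sketch. It is one possible reading, not Kurral's algorithm: `output_hash`, `shape`, and `validate` are hypothetical, and "structure" is taken to mean the same keys, list lengths, and value types.

```python
import hashlib
import json


def output_hash(outputs: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(outputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def shape(value):
    """Reduce a JSON value to its structure: keys and types, not content."""
    if isinstance(value, dict):
        return {key: shape(val) for key, val in sorted(value.items())}
    if isinstance(value, list):
        return [shape(item) for item in value]
    return type(value).__name__


def validate(original: dict, replayed: dict) -> dict:
    """Compare two outputs exactly (hash) and loosely (structure)."""
    return {
        "hash_match": output_hash(original) == output_hash(replayed),
        "structural_match": shape(original) == shape(replayed),
    }


original = {"items": ["Refund", " initiated"], "full_text": "Refund initiated"}
replayed = {"items": ["Refund", " approved"], "full_text": "Refund approved"}
report = validate(original, replayed)
print(report)
```

In this example the wording changed but the layout did not, so the hashes differ while the structures agree.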

## Environment Variables

Create a `.env` file:

```bash
# LangSmith (optional)
LANGSMITH_API_KEY=your_key_here

# Storage
DATABASE_URL=postgresql://user:pass@localhost:5432/kurral
KURRAL_R2_BUCKET=kurral-artifacts
KURRAL_R2_ACCOUNT_ID=your_cloudflare_account_id
KURRAL_R2_ACCESS_KEY_ID=your_r2_access_key
KURRAL_R2_SECRET_ACCESS_KEY=your_r2_secret_key

# Application
KURRAL_ENVIRONMENT=production
```
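
A small preflight check can catch missing settings before an agent starts. The sketch below is an assumption-laden illustration: `missing_vars` is a hypothetical helper, and treating all four `KURRAL_R2_*` variables as required for R2 is an assumption rather than documented behavior.

```python
import os

# Assumed to be required whenever the R2 backend is used (not confirmed by Kurral).
R2_VARS = (
    "KURRAL_R2_BUCKET",
    "KURRAL_R2_ACCOUNT_ID",
    "KURRAL_R2_ACCESS_KEY_ID",
    "KURRAL_R2_SECRET_ACCESS_KEY",
)


def missing_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping instead of the real process environment.
example_env = {
    "KURRAL_R2_BUCKET": "kurral-artifacts",
    "KURRAL_R2_ACCOUNT_ID": "",
}
missing = missing_vars(R2_VARS, example_env)
print(missing)
```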

## How to Integrate

### 1. Decorate Your Agent Functions

```python
from kurral import trace_llm

@trace_llm(semantic_bucket="customer_support", tenant_id="acme_prod")
def my_agent(query: str) -> str:
    # Your LLM logic here (`llm` is a placeholder for your own client object)
    response = llm.invoke(query, temperature=0.0, seed=42)
    return response
```

### 2. Configure MCP Servers (Optional)

If using Model Context Protocol tools:

```python
from kurral import trace_llm

@trace_llm(
    semantic_bucket="data_ops",
    tenant_id="acme",
    mcp_servers=["filesystem", "database"]  # Auto-captures MCP tool calls
)
def process_data(request):
    # MCP tools are automatically stubbed during replay
    return agent.run(request)
```

### 3. Configure Storage (Recommended)

**Storage Options:**
- **local** - Save to disk (default: `./artifacts/`)
- **memory** - Store in RAM for fast access, zero I/O overhead (great for testing/dev)
- **r2** - Auto-upload to Cloudflare R2 for production

**Option 1: Interactive Setup (Easiest)**

```bash
kurral config init
```

The wizard:
- Prompts you to choose a storage backend (local/memory/r2)
- Saves credentials to `~/.config/kurral/config.json`
- Enables auto-upload (for r2)

After running it, you no longer need to set environment variables in every session.

**Option 2: Environment Variables**

```bash
# Use in-memory storage (fast, no I/O)
export KURRAL_STORAGE_BACKEND=memory
export KURRAL_MEMORY_MAX_ARTIFACTS=1000  # optional, default: 1000
export KURRAL_MEMORY_MAX_SIZE_MB=500     # optional, default: 500MB

# Or use Cloudflare R2
export KURRAL_STORAGE_BACKEND=r2
export KURRAL_R2_BUCKET=kurral-artifacts
export KURRAL_R2_ACCOUNT_ID=<your_account_id>
export KURRAL_R2_ACCESS_KEY_ID=<your_access_key>
export KURRAL_R2_SECRET_ACCESS_KEY=<your_secret_key>
```

**Option 3: Project-Specific Config**

```bash
# Save config to ./.kurral/config.json (project-specific)
kurral config init --local
```

Once configured, all traces **automatically save/upload**—no manual steps needed!

```bash
# View current config
kurral config show

# Memory storage commands (when using KURRAL_STORAGE_BACKEND=memory)
kurral memory stats              # Show memory usage
kurral memory list               # List all artifacts in memory
kurral memory get <artifact_id>  # Get artifact details
kurral memory export <artifact_id> <output_path>  # Export to disk
kurral memory delete <artifact_id>  # Delete from memory
kurral memory clear              # Clear all artifacts

# R2 commands (when using KURRAL_STORAGE_BACKEND=r2)
kurral list-r2 --tenant-id acme --limit 10
kurral download <kurral_id> <tenant_id> <created_at> ./artifacts/
```

### 4. Replay Your Traces

```bash
# Replay captured artifact
kurral replay path/to/artifact.kurral

# Debug mode shows LLM params + stub status
kurral replay path/to/artifact.kurral --debug
```

## Development

```bash
# Clone repo
git clone https://github.com/kurral-dev/kurral-cli.git
cd kurral-cli

# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black kurral/
ruff check kurral/

# Run CLI
kurral --help
```

## Configuration

```python
from kurral import KurralConfig

config = KurralConfig(
    storage_backend="r2",  # or "local" / "memory"
    r2_bucket="my-bucket",
    r2_account_id="your_account_id",
    database_url="postgresql://...",
    langsmith_enabled=True,
    auto_export=True,  # Automatically export traces
    determinism_threshold=0.90,
)
```

## Troubleshooting

### Missing boto3 Error
`boto3` is needed only for R2 storage. If you see `ModuleNotFoundError: No module named 'boto3'` and do not need R2, switch to local storage:
```python
from kurral import KurralConfig

config = KurralConfig(storage_backend="local")
```

### Artifact Schema Mismatch
If a `.kurral` file fails to load, check `schema_version`. Upgrade with:
```bash
kurral migrate artifact.kurral --to-version 1.0.0
```

### LangSmith Connection Issues
Verify `LANGSMITH_API_KEY` is set and the run ID exists:
```bash
export LANGSMITH_API_KEY=your_key
kurral export --run-id <id> --output test.kurral
```

### Replay Output Doesn't Match
- Check `result.validation.hash_match` and `result.validation.structural_match`
- Use `--diff` to see what changed
- Verify LLM sampler state via `result.llm_state` (temperature, seed, etc.)
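
If you want to inspect a mismatch outside the CLI, the standard library's `difflib` can diff the recorded and replayed text. This is a generic sketch, not what `--diff` does internally; `text_diff` is a hypothetical helper and the two strings stand in for the `full_text` of each output.

```python
import difflib


def text_diff(original: str, replayed: str) -> str:
    """Return a unified diff of two texts, or an empty string if they match."""
    lines = difflib.unified_diff(
        original.splitlines(),
        replayed.splitlines(),
        fromfile="original",
        tofile="replayed",
        lineterm="",
    )
    return "\n".join(lines)


original_text = "Refund initiated\nThanks for your patience"
replayed_text = "Refund approved\nThanks for your patience"
diff = text_diff(original_text, replayed_text)
print(diff)
```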

## Kurral API Backend

For centralized artifact management, deploy the Kurral API:

```bash
cd kurral-api
docker-compose up -d
```

**Services:**
- **FastAPI Backend** (port 8000) - Artifact management, authentication, analytics
- **PostgreSQL** - Metadata storage with indexes for fast queries
- **Cloudflare R2** - Scalable artifact storage

**Features:**
- 🔐 API key authentication with scopes
- 📦 Artifact upload/download via REST API
- 🔍 Advanced filtering (tenant, environment, semantic bucket, model, dates)
- 📊 Real-time statistics and time series analytics
- 💾 Dual storage: metadata in PostgreSQL, full artifacts in R2
- 🚀 Production-ready with health checks and monitoring

**Quick Start:**
```bash
# Configure
cp kurral-api/.env.template kurral-api/.env
# Edit .env with your R2 credentials

# Start services
cd kurral-api && docker-compose up -d

# Create user and get API key
curl -X POST http://localhost:8000/api/v1/auth/register \
  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com", "password": "secure", "tenant_id": "acme"}'

# Upload artifact
curl -X POST http://localhost:8000/api/v1/artifacts/upload \
  -H "X-API-Key: kurral_YOUR_KEY" \
  -d @artifact.kurral
```

See `kurral-api/README.md` for complete API documentation.

### Production Deployment

**CLI:**
```bash
# Install globally
pip install kurral

# Configure to use API backend
export KURRAL_STORAGE_BACKEND=api
export KURRAL_API_URL=https://api.kurral.your-domain.com
export KURRAL_API_KEY=kurral_YOUR_KEY

# Artifacts automatically upload to API
python your_agent.py  # Traces are captured and uploaded

# Run replay in CI
kurral replay artifacts/*.kurral --diff
```

**Docker Compose (Full Stack):**
```bash
# Deploy API + Database + your agents
docker-compose -f docker-compose.production.yml up -d
```

## License

Apache-2.0 - see LICENSE file

## Support

- Documentation: https://drive.google.com/file/d/1iixtUYjMsFKsTp8PUgdI_TNG3XrRxpED/view?usp=sharing

