Metadata-Version: 2.4
Name: harnesskit
Version: 0.2.0
Summary: Fuzzy edit tool for LLM coding agents — never fail a str_replace again
Author-email: Alex Melges <alex@melges.dev>
License-Expression: MIT
Project-URL: Homepage, https://github.com/alexmelges/harnesskit
Project-URL: Repository, https://github.com/alexmelges/harnesskit
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# 🔧 HarnessKit

> **Fuzzy edit tool for LLM coding agents — never fail a `str_replace` again.**

[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-green.svg)](https://python.org)
[![Zero Dependencies](https://img.shields.io/badge/dependencies-zero-brightgreen.svg)](#)

---

## The Problem

Every LLM coding agent has the same Achilles' heel: **edit application**.

When Claude, GPT, or any model tries to modify code, it generates an `old_text` → `new_text` pair. The tool then does an exact string match to find where to apply the change. And it fails. A lot.

- **Whitespace differences** — the model adds a space, drops a tab, or normalizes indentation
- **Minor hallucinations** — a variable name is slightly off, a comment is paraphrased
- **Format fragility** — diffs, patches, and line-number schemes all break in different ways

The result? Up to **50% edit failure rates** on non-native models. Every failed edit wastes a tool call, burns tokens on retries, and breaks agent flow.

## The Solution

HarnessKit (`hk`) is a drop-in edit tool that **fuzzy-matches** the old text before replacing it. It uses a 4-stage matching cascade:

1. **Exact match** — zero overhead when the model is precise
2. **Normalized whitespace** — catches the most common failure mode
3. **Sequence matching** — `difflib.SequenceMatcher` with configurable threshold (default 0.8)
4. **Line-by-line fuzzy** — finds the best contiguous block match for heavily drifted edits

Every edit returns a **confidence score** and **match type**, so your agent knows exactly how the edit was resolved.
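
The cascade above can be illustrated with a minimal sketch. This is not HarnessKit's actual code: `find_match`, `_normalize`, the window strategy, and the fixed `0.95` confidence for whitespace matches are all assumptions made for illustration, and the fourth (line-by-line) stage is left out for brevity.

```python
import difflib
from typing import Optional, Tuple


def _normalize(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    return " ".join(text.split())


def find_match(content: str, old_text: str,
               threshold: float = 0.8) -> Optional[Tuple[str, str, float]]:
    """Return (matched_text, match_type, confidence), or None if nothing qualifies."""
    # Stage 1: exact substring match.
    if old_text in content:
        return old_text, "exact", 1.0

    # Candidate windows: every run of n consecutive lines, n = lines in old_text.
    lines = content.splitlines()
    n = max(1, len(old_text.splitlines()))
    windows = ["\n".join(lines[i:i + n])
               for i in range(max(1, len(lines) - n + 1))]

    # Stage 2: compare after whitespace normalization.
    target = _normalize(old_text)
    for window in windows:
        if _normalize(window) == target:
            return window, "whitespace", 0.95  # placeholder confidence (assumption)

    # Stage 3: pick the window with the best SequenceMatcher ratio.
    best, best_score = None, 0.0
    for window in windows:
        score = difflib.SequenceMatcher(None, window, old_text).ratio()
        if score > best_score:
            best, best_score = window, score
    if best is not None and best_score >= threshold:
        return best, "fuzzy", best_score
    return None
```

Each stage runs only when the previous one fails, so a precise model pays only for a substring check.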

## Quick Start

```bash
pip install harnesskit
```

Or just copy `hk.py` into your project — it's a single file, stdlib only.

### CLI Usage

```bash
# Direct arguments
hk apply --file app.py --old "def hello():\n    print('hi')" --new "def hello():\n    print('hello world')"

# JSON from stdin (perfect for tool_use integration)
echo '{"file": "app.py", "old_text": "def hello():", "new_text": "def greet():"}' | hk apply --stdin

# From a JSON file
hk apply --edit changes.json

# Dry run — see what would change without writing
hk apply --file app.py --old "..." --new "..." --dry-run
```

### JSON Edit Format

```json
{
  "file": "path/to/file.py",
  "old_text": "def hello():\n    print('hi')",
  "new_text": "def hello():\n    print('hello world')"
}
```

Batch multiple edits:

```json
{
  "edits": [
    {"file": "a.py", "old_text": "...", "new_text": "..."},
    {"file": "b.py", "old_text": "...", "new_text": "..."}
  ]
}
```
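
Hand-escaping newlines and quotes in `old_text` is error-prone, so a batch payload is easier to build programmatically. The helper below is a hypothetical sketch (`build_batch` is not part of HarnessKit); it only assumes the batch shape shown above.

```python
import json


def build_batch(edits):
    """Serialize (file, old_text, new_text) tuples into the batch edit format."""
    payload = {
        "edits": [
            {"file": file, "old_text": old, "new_text": new}
            for file, old, new in edits
        ]
    }
    # json.dumps escapes newlines and quotes inside the text fields.
    return json.dumps(payload, indent=2)


batch_json = build_batch([
    ("a.py", "def hello():\n    print('hi')", "def hello():\n    print('hello world')"),
    ("b.py", "x = 1", "x = 2"),
])
```

The resulting string can be written to a file or piped to the CLI, provided the chosen input mode accepts batch payloads.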

### Output

```json
{
  "status": "applied",
  "file": "app.py",
  "match_type": "fuzzy",
  "confidence": 0.92,
  "matched_text": "def hello():\n    print( 'hi' )"
}
```
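
An agent can use these fields to decide when a fuzzy edit deserves a second look. The policy below is a sketch only: `needs_review` and the `0.9` cutoff are hypothetical, and it assumes the result has been parsed into a dict with the `match_type` and `confidence` keys shown above.

```python
def needs_review(result: dict, min_confidence: float = 0.9) -> bool:
    """Flag fuzzy matches whose confidence falls below a chosen cutoff."""
    if result.get("match_type") != "fuzzy":
        # Non-fuzzy match types are trusted as-is in this sketch.
        return False
    return result.get("confidence", 0.0) < min_confidence
```

With the sample output above (`"fuzzy"`, `0.92`) this returns `False` at the default cutoff and `True` if the cutoff is raised to `0.95`.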

### Exit Codes

| Code | Meaning |
|------|---------|
| `0`  | Edit applied successfully |
| `1`  | No match found |
| `2`  | Ambiguous — multiple matches |
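
A caller can branch on these codes without parsing any output. The sketch below assumes `hk` is on `PATH`; the labels `"no_match"`, `"ambiguous"`, and `"unknown"`, and the helper names, are illustrative rather than part of HarnessKit.

```python
import json
import subprocess

# Exit codes from the table above, mapped to illustrative labels.
EXIT_MEANINGS = {0: "applied", 1: "no_match", 2: "ambiguous"}


def describe_exit(code: int) -> str:
    """Translate an exit code into a short label; anything else is 'unknown'."""
    return EXIT_MEANINGS.get(code, "unknown")


def run_hk(payload: dict, cmd=("hk", "apply", "--stdin")) -> str:
    """Send one JSON edit on stdin and report the outcome by exit code."""
    proc = subprocess.run(
        list(cmd),
        input=json.dumps(payload),
        capture_output=True,
        text=True,
    )
    return describe_exit(proc.returncode)
```

On `"ambiguous"`, one option is to have the agent retry with more surrounding context in `old_text` so that only one location matches.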

## MCP Server

HarnessKit ships an [MCP (Model Context Protocol)](https://modelcontextprotocol.io/) server for plug-and-play integration with any MCP-compatible agent.

### Quick Start

Add to your MCP client config (e.g. Claude Desktop or Cursor):

```json
{
  "mcpServers": {
    "harnesskit": {
      "command": "python3",
      "args": ["/path/to/hk_mcp.py"]
    }
  }
}
```

### Tools

| Tool | Description |
|------|-------------|
| `harnesskit_apply` | Apply a fuzzy edit to a file |
| `harnesskit_apply_batch` | Apply multiple edits in one call |
| `harnesskit_match` | Preview the match without modifying (dry run) |

Each tool returns the match type, confidence score, and matched text — giving the agent full visibility into how the edit was resolved.

### Example

```json
{
  "name": "harnesskit_apply",
  "arguments": {
    "file": "app.py",
    "old_text": "def hello():\n    print('hi')",
    "new_text": "def hello():\n    print('hello world')",
    "threshold": 0.8
  }
}
```

Response:
```json
{
  "status": "applied",
  "match_type": "whitespace",
  "confidence": 0.95
}
```
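
MCP clients normally build the transport envelope for you. For anyone testing the server by hand, the sketch below shows how the tool call above would be wrapped in a JSON-RPC 2.0 `tools/call` request, as defined by the MCP specification. The helper name `tools_call_request` is hypothetical.

```python
import json


def tools_call_request(request_id: int, name: str, arguments: dict) -> dict:
    """Wrap a tool name and its arguments in an MCP `tools/call` request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = tools_call_request(1, "harnesskit_apply", {
    "file": "app.py",
    "old_text": "def hello():\n    print('hi')",
    "new_text": "def hello():\n    print('hello world')",
    "threshold": 0.8,
})
wire_message = json.dumps(request)  # one JSON message, ready to send
```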

## Integration

HarnessKit is designed to slot into any agent framework as the edit backend:

```python
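import json
import subprocess


def apply_edit(file, old_text, new_text):
    """Apply one edit through the `hk` CLI and return its parsed JSON output."""
    payload = json.dumps({"file": file, "old_text": old_text, "new_text": new_text})
    result = subprocess.run(
        ["hk", "apply", "--stdin"],
        input=payload,  # the edit is sent to hk as JSON on stdin
        capture_output=True, text=True
    )
    # result.returncode follows the Exit Codes table (0 = applied).
    return json.loads(result.stdout)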
```

Or import directly:

```python
from hk import apply_edit

result = apply_edit("app.py", old_text, new_text, threshold=0.8)
```

## Benchmarks

We tested HarnessKit against **45 realistic edit failure scenarios** — the kind that break `str_replace` and `apply_patch` in production agent workflows.

| Category | Exact Match | HarnessKit | Recovery Rate |
|---|---|---|---|
| **Whitespace** (tabs/spaces, trailing, indentation, CRLF, nesting) | 0/11 | **11/11** | 100% |
| **Hallucinations** (typos, quotes, types, multi-language) | 0/16 | **16/16** | 100% |
| **Line Drift** (shifted context, extra decorators, renames) | 2/5 | **5/5** | 100% |
| **Partial Matches** (subset of target) | 2/2 | **2/2** | — |
| **Real-World** (str_replace failures, docstring diffs) | 0/6 | **6/6** | 100% |
| **Hard** (multi-error combos, brace styles, compression) | 0/5 | **5/5** | 100% |
| **Total** | **4/45 (9%)** | **45/45 (100%)** | **100%** |

> **On this benchmark, exact match succeeds 9% of the time (4/45). HarnessKit succeeds on all 45 scenarios.**
> All 41 edits that failed under exact match were recovered.

Run the benchmarks yourself:

```bash
python3 benchmarks/benchmark.py
```
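
The totals in the table can be re-derived from the per-category rows. The snippet below only redoes that arithmetic. It does not run the benchmark, and the variable names are illustrative.

```python
# (exact-match successes, HarnessKit successes, scenarios) per category,
# copied from the benchmark table.
rows = {
    "Whitespace": (0, 11, 11),
    "Hallucinations": (0, 16, 16),
    "Line Drift": (2, 5, 5),
    "Partial Matches": (2, 2, 2),
    "Real-World": (0, 6, 6),
    "Hard": (0, 5, 5),
}

total = sum(n for _, _, n in rows.values())
exact_ok = sum(e for e, _, _ in rows.values())
hk_ok = sum(h for _, h, _ in rows.values())

failed_exact = total - exact_ok                         # edits exact match could not apply
recovered = sum(h - e for e, h, _ in rows.values())     # of those, edits HarnessKit applied

exact_rate = round(100 * exact_ok / total)              # percent, rounded
recovery_rate = round(100 * recovered / failed_exact)   # percent, rounded
```

Recovery rate here means recovered edits divided by edits that failed under exact match, which is why the Partial Matches row has no value: nothing failed there.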

## Design Principles

- **Single file, stdlib only** — copy it, vendor it, pip install it. No dependency hell.
- **419 lines of Python** — small enough to audit in one sitting
- **Graceful degradation** — exact match when possible, fuzzy only when needed
- **Transparent** — every result tells you *how* it matched and *how confident* it is
- **Model-agnostic** — works with any LLM that can produce old/new text pairs

## Configuration

| Flag | Default | Description |
|------|---------|-------------|
| `--threshold` | `0.8` | Minimum similarity score for fuzzy matching |
| `--dry-run` | `false` | Preview changes without writing to disk |
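
To get a feel for what a given `--threshold` admits, it helps to look at raw `difflib.SequenceMatcher` ratios. The sketch below assumes the threshold is compared directly against that ratio with `>=`; the helper names are hypothetical and HarnessKit's own scoring may differ.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """SequenceMatcher ratio: 2 * matching characters / total characters."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def passes(a: str, b: str, threshold: float = 0.8) -> bool:
    """True when the similarity meets the minimum score."""
    return similarity(a, b) >= threshold


typo_score = similarity("def hello():", "def helo():")  # one dropped character
```

Here `typo_score` is 22/23, roughly 0.96, so a single-character typo in a short line clears the default `0.8` comfortably, while a stricter setting such as `0.99` would reject it.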

## Development

```bash
git clone https://github.com/alexmelges/harnesskit.git
cd harnesskit
python3 -m pytest test_hk.py test_mcp.py -v  # 53 tests
```

## License

MIT — see [LICENSE](LICENSE).

---

**Built for the agents that build everything else.**
