Metadata-Version: 2.4
Name: skill-mem
Version: 0.3.0
Summary: Hybrid routing, utility tracking, and evolution for Agent Skills.
License-Expression: MIT
License-File: LICENSE
Keywords: agent,embeddings,memory,routing,skills
Requires-Python: >=3.11
Requires-Dist: pyyaml>=6.0
Provides-Extra: openrouter
Requires-Dist: openrouter>=0.7.11; extra == 'openrouter'
Description-Content-Type: text/markdown

# skill-mem

Hybrid routing, utility tracking, and evolution for [Agent Skills](https://agentskills.io).

```python
from skill_mem import Library, Outcome, Revision

library = Library(".agents/skills", embed=my_embed_fn, rerank=my_rerank_fn)

# Route — hybrid dense+BM25 retrieval with cross-encoder reranking
query = "analyze this CSV and find outliers"
matches = library.route(query)
top = matches[0]  # Match(name, description, path, score)

# Load the full skill when you need it
skill = library.get(top.name)
response = await agent.run(query, system=skill.content)

# Record what happened — successful queries improve routing over time
library.record(skill.name, query, Outcome(success=True, reason="found 3 outliers"))

# Skills that fail get rewritten — including their scripts and references
library.evolve(skill.name, context="choked on 500k rows", rewrite=my_rewrite_fn)
```

Routing returns lightweight metadata (name, description, path) for prompt injection — not the full skill body. The router sees everything internally for selection, but the consumer gets just the pointer. This is the hidden-body asymmetry from the [SkillRouter paper](https://arxiv.org/abs/2603.22455).

Skills are standard [Agent Skills](https://agentskills.io/specification) `SKILL.md` files with optional `scripts/`, `references/`, and `assets/` directories. Claude Code, Cursor, VS Code, and [30+ other tools](https://agentskills.io) already read them. This library adds routing, tracking, and learning.

## Install

```bash
uv add skill-mem
```

You bring your own embedding and (optionally) reranking functions:

```python
# Embedding — any model works
from fastembed import TextEmbedding
model = TextEmbedding("BAAI/bge-small-en-v1.5")
def embed(text: str) -> list[float]:
    return next(model.embed([text])).tolist()  # numpy array -> list[float]

# Reranking — optional, adds cross-encoder second stage
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def rerank(query: str, docs: list[str]) -> list[float]:
    return reranker.predict([[query, d] for d in docs]).tolist()

library = Library("./skills", embed, rerank=rerank)
```

Without a reranker, routing uses hybrid dense+BM25 retrieval with reciprocal rank fusion. Adding a reranker enables the full two-stage pipeline from the [SkillRouter paper](https://arxiv.org/abs/2603.22455).

## How it works

### Skills on disk

```
.agents/skills/
  csv-analysis/
    SKILL.md              <- standard Agent Skills file
    scripts/              <- optional executable code
      analyze.py
    references/           <- optional documentation
      pandas-guide.md
  web-search/
    SKILL.md
  .skill-meta/            <- added by skill-mem (gitignored)
    embeddings.json       <- cached vectors for routing
    attempts.json         <- full attempt log (query, success, reason)
    versions.json         <- version tracking
    history/              <- archived SKILL.md before each evolution
```

The `SKILL.md` files are portable. The `.skill-meta/` sidecar stores the intelligence layer. Extra frontmatter fields (`license`, `metadata`, `compatibility`, etc.) are preserved through evolution.

### A skill with scripts

```yaml
---
name: csv-analysis
description: Analyze, summarize, or query CSV files and tabular data.
---

Load the file with pandas. Use df.describe() for numeric columns.
Check nulls with df.isnull().sum(). For large files, run
scripts/analyze.py with sampling enabled.
```

The full skill text (name, description, body, scripts, and references) is used for routing. The [SkillRouter paper](https://arxiv.org/abs/2603.22455) reports that the body text is the decisive signal: 91.7% of cross-encoder attention concentrates on it, and removing it degrades accuracy by 29-44 percentage points.

## Routing

Routing uses hybrid retrieval with optional cross-encoder reranking:

1. **Dense retrieval**: embeds the full skill text (name + description + body + files + successful queries) and retrieves candidates by cosine similarity.
2. **BM25 retrieval**: sparse keyword matching over the same full text, catching exact terms that embeddings miss.
3. **Reciprocal rank fusion**: combines dense and sparse rankings into a single candidate list.
4. **Cross-encoder reranking** (optional): scores each candidate against the query with deeper cross-attention, re-sorts, and returns top-`k`.

```python
library = Library(path, embed)                            # hybrid dense+BM25
library = Library(path, embed, rerank=fn)                 # + cross-encoder reranking
library = Library(path, embed, rerank=fn, retrieve_k=30)  # wider retrieval window
```

The reranker sees the same full text as the embedder. `retrieve_k` defaults to 20 (the paper's value for ~80K skill pools).
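
Step 3 can be sketched in a few lines. This is a generic reciprocal rank fusion, not code taken from skill-mem. The smoothing constant `k=60` is the value commonly used in the literature and is an assumption here, since this README does not state which constant the library uses.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several best-first rankings of skill names into one scored list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, name in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); items ranked high in several lists win.
            scores[name] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by name for a deterministic order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["csv-analysis", "web-search", "sql-query"]    # hypothetical dense ranking
sparse = ["sql-query", "csv-analysis", "pdf-extract"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense, sparse])
```

RRF only needs ranks, so the dense and BM25 scores never have to be put on a common scale.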

### Pluggable vector index

For large skill pools, replace the default brute-force search with an ANN backend:

```python
from skill_mem import Library, BruteForce, VectorIndex

# Default — brute-force cosine, fine up to ~10K skills
library = Library(path, embed)

# Custom — any object with index() and search() methods
import faiss
class FaissIndex:
    def index(self, vectors: dict[str, list[float]]) -> None:
        ...
    def search(self, query: list[float], k: int) -> list[tuple[str, float]]:
        ...

library = Library(path, embed, vector_index=FaissIndex())
```

`BruteForce` is the default. `VectorIndex` is a Protocol you can implement with FAISS, Annoy, or any other backend.
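
To make the two-method shape concrete, here is a dependency-free sketch of an index that follows the `index()` / `search()` signatures shown above. `CosineIndex` is a hypothetical name, and two details are assumptions: that `index()` replaces any previously stored vectors, and that the returned score is cosine similarity with higher meaning better.

```python
import math


def _normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


class CosineIndex:
    """Hypothetical in-memory index with index() and search() methods."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def index(self, vectors: dict[str, list[float]]) -> None:
        # Store unit-length copies so search() reduces to a dot product.
        self._vectors = {name: _normalize(v) for name, v in vectors.items()}

    def search(self, query: list[float], k: int) -> list[tuple[str, float]]:
        q = _normalize(query)
        scored = [
            (name, sum(a * b for a, b in zip(q, v)))
            for name, v in self._vectors.items()
        ]
        # Highest cosine similarity first, truncated to the k best.
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:k]
```

An ANN backend would keep the same two methods and swap the linear scan in `search()` for an approximate lookup.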

## API

```python
from skill_mem import Library, Skill, Match, Outcome, Revision, Stats

library = Library(path, embed, rerank=rerank)
```

### Read

```python
library.route(query, k=3) -> list[Match]       # hybrid search, top-k
library.get(name) -> Skill                     # full skill from disk
library.all() -> list[Skill]                   # everything
library.utility(name) -> float                 # success rate (0.0-1.0)
library.stats(name) -> Stats                   # utility, attempts, recent log
```

`Match` contains `name`, `description`, `path`, and `score` — the metadata needed for prompt injection. Call `library.get(match.name)` when you need the full skill content.
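
One way to use that metadata is to render the matches as a short list in the system prompt and let the agent decide which skill to open. The sketch below uses a stand-in `Match` dataclass with the four fields listed above so that it runs on its own. `skills_prompt` and its wording are assumptions, not part of skill-mem.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Stand-in with the four documented fields; not imported from skill_mem."""
    name: str
    description: str
    path: str
    score: float


def skills_prompt(matches: list[Match]) -> str:
    """Render routed matches as a compact block for prompt injection (hypothetical helper)."""
    lines = ["Available skills:"]
    for match in matches:
        # Only the pointer is injected; the skill body stays on disk until requested.
        lines.append(f"- {match.name}: {match.description} (see {match.path})")
    return "\n".join(lines)
```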

### Write

```python
library.add(name, description, content) -> Skill
library.record(name, query, Outcome(success, reason)) -> Skill | None
library.evolve(name, context, rewrite=fn, validate=fn) -> Skill
library.discover(query, context, create=fn) -> Skill
```

`record()` returns an evolved `Skill` if auto-evolution triggered, `None` otherwise.

### Auto-evolution

Pass a `rewrite` function to the library, and a skill evolves automatically once its utility drops below `evolve_threshold`:

```python
library = Library(
    path, embed,
    rewrite=my_rewrite_fn,     # enables auto-evolution
    evolve_threshold=0.3,      # trigger below this utility (default 0.3)
    min_attempts=5,            # minimum attempts before triggering (default 5)
)

# After 5+ failures with utility < 0.3, record() auto-calls evolve()
evolved = library.record(skill_name, query, Outcome(success=False, reason="OOM"))
if evolved:
    print(f"Auto-evolved to v{evolved.version}")
```

After auto-evolution, the cooldown resets: a skill needs `min_attempts` new failures before it can trigger again.
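
The trigger condition can be pictured as a small predicate. This is a sketch under stated assumptions rather than the library's code: `should_evolve` is a hypothetical name, utility is taken to be the plain success rate, and the window is assumed to hold every attempt recorded since the last evolution.

```python
def should_evolve(
    outcomes: list[bool],
    evolve_threshold: float = 0.3,
    min_attempts: int = 5,
) -> bool:
    """Decide whether a skill is due for rewriting.

    `outcomes` holds one success flag per attempt since the last evolution
    (an assumption; the real bookkeeping may differ).
    """
    # Too little evidence yet: never evolve on a handful of attempts.
    if len(outcomes) < min_attempts:
        return False
    # Utility is the fraction of attempts that succeeded.
    utility = sum(outcomes) / len(outcomes)
    return utility < evolve_threshold
```

Clearing the window after each evolution is what produces the cooldown: a freshly rewritten skill starts with no evidence against it.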

### The loop

The core cycle from the [Memento-Skills paper](https://arxiv.org/abs/2603.18743):

```python
matches = library.route(query)
top = matches[0] if matches else None

if top:
    skill = library.get(top.name)
    result = await agent.run(query, skill=skill)
    outcome = judge(query, result)
    library.record(top.name, query, outcome)

    if not outcome.success:
        if library.utility(top.name) < 0.4:
            library.discover(query, outcome.reason, create=my_create_fn)
        else:
            library.evolve(top.name, outcome.reason, rewrite=my_rewrite_fn)
else:
    # No skill matched: run unaided, and keep a successful run as a new skill.
    result = await agent.run(query)
    outcome = judge(query, result)
    if outcome.success:
        library.discover(query, result, create=my_create_fn)

Route. Execute. Judge. Evolve. Every cycle, the library gets better.

## Evolve and discover

`evolve` and `discover` take your functions — the library handles versioning, archival, and re-indexing.

```python
def my_rewrite(skill: Skill, context: str) -> Revision:
    """Called by library.evolve(). Can update content and files."""
    result = llm(f"Rewrite this skill:\n{skill.content}\n\nFailure: {context}")
    return Revision(description=result.description, content=result.content)

def my_create(query: str, context: str) -> tuple[str, str, str]:
    """Called by library.discover(). Returns (name, description, content)."""
    result = llm(f"Create a reusable skill from this interaction:\n{query}\n{context}")
    return (result.name, result.description, result.content)
```

### Evolving scripts and references

Skills can include `scripts/`, `references/`, and `assets/` directories. These files are available on `skill.files` and can be updated through evolution:

```python
def my_rewrite(skill: Skill, context: str) -> Revision:
    # Fix a broken script
    old_script = skill.files["scripts/analyze.py"]
    fixed = llm(f"Fix this script:\n{old_script}\n\nError: {context}")
    return Revision(
        description=skill.description,
        content=skill.content,
        files={"scripts/analyze.py": fixed},
    )
```

Files in the revision overwrite their counterparts; files not mentioned stay unchanged. An optional `validate` hook gates the entire revision:

```python
library.evolve(
    "csv-analysis",
    context,
    rewrite=my_rewrite,
    validate=lambda old, new: "pandas" in new.content,
)
```
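
The overlay and gating rules can be sketched as follows. `SkillState` and `apply_revision` are hypothetical stand-ins written for this example. What the library does when `validate` rejects a revision is not stated here, so returning the old skill unchanged is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class SkillState:
    """Hypothetical stand-in for a skill's content and bundled files."""
    content: str
    files: dict[str, str] = field(default_factory=dict)


def apply_revision(
    old: SkillState,
    content: str,
    files: dict[str, str],
    validate: Optional[Callable[[SkillState, SkillState], bool]] = None,
) -> SkillState:
    """Overlay revised files on the old ones, then gate the result with validate."""
    # Files named in the revision win; everything else carries over untouched.
    merged = {**old.files, **files}
    new = SkillState(content=content, files=merged)
    # Assumption: a rejected revision leaves the skill exactly as it was.
    if validate is not None and not validate(old, new):
        return old
    return new
```

The gate sees the whole candidate, so a check can inspect the new content and the merged files together before anything is accepted.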

When a skill evolves, the old `SKILL.md` is archived to `.skill-meta/history/<name>/v<N>.md`. Successful queries are folded into routing embeddings so skills become easier to find as they accumulate evidence.
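
The archive layout is easy to reproduce by hand. The helper below is a sketch of that path convention only. `archive_skill` is a hypothetical name, and treating `<N>` as the version being replaced is an assumption.

```python
import shutil
from pathlib import Path


def archive_skill(root: Path, name: str, version: int) -> Path:
    """Copy <root>/<name>/SKILL.md to <root>/.skill-meta/history/<name>/v<version>.md."""
    source = root / name / "SKILL.md"
    destination = root / ".skill-meta" / "history" / name / f"v{version}.md"
    # Create the per-skill history directory on first use.
    destination.parent.mkdir(parents=True, exist_ok=True)
    # copy2 keeps the file's timestamps alongside its contents.
    shutil.copy2(source, destination)
    return destination
```

Each version lands in its own file, so any earlier `SKILL.md` can be diffed against the current one or copied back.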

## Examples

```bash
uv run python examples/basic.py       # add skills, route, record outcomes
uv run python examples/evolving.py    # evolution loop with validation gate
uv run python examples/llm.py         # full loop with LLM rewriting and reranking
```

## Papers

Built on two complementary papers:

- [Memento-Skills](https://arxiv.org/abs/2603.18743) — the skill memory lifecycle: routing, utility tracking, evolution, and discovery.
- [SkillRouter](https://arxiv.org/abs/2603.22455) — two-stage retrieve-and-rerank over full skill text. Shows that skill body is the decisive routing signal (91.7% of attention), and a compact 1.2B pipeline outperforms much larger zero-shot alternatives.

Skills are stored in the [Agent Skills](https://agentskills.io) format for ecosystem compatibility.
