Metadata-Version: 2.4
Name: openclaw-mem
Version: 0.2.0
Summary: Lightweight RAG memory system for AI agents — Progressive Disclosure, Auto-Capture, 3-Layer Archive
Author: Jay Lee
License: MIT
Project-URL: Homepage, https://github.com/kjaylee/openclaw-mem
Project-URL: Repository, https://github.com/kjaylee/openclaw-mem
Project-URL: Documentation, https://github.com/kjaylee/openclaw-mem#readme
Project-URL: Bug Tracker, https://github.com/kjaylee/openclaw-mem/issues
Keywords: rag,memory,ai-agents,lancedb,vector-search,progressive-disclosure
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: lancedb>=0.4
Requires-Dist: sentence-transformers>=2.0
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == "openai"
Provides-Extra: ollama
Requires-Dist: ollama>=0.1; extra == "ollama"
Provides-Extra: all
Requires-Dist: openai>=1.0; extra == "all"
Requires-Dist: ollama>=0.1; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Dynamic: license-file

# openclaw-mem

**Local-first AI memory** — No API keys. No cloud. No vendor lock-in.

Lightweight RAG memory system for AI agents — Progressive Disclosure, Auto-Capture, 3-Layer Archive, built-in injection defense.

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![Local-First](https://img.shields.io/badge/local--first-100%25%20offline-brightgreen)](https://github.com/kjaylee/openclaw-mem)

## Features

- **Progressive Disclosure** — 2-step search: summaries first (`--index`), then full content on demand (`--detail`). Saves tokens for LLM agents.
- **Auto-Capture** — Rule-based extraction of decisions, learnings, errors, and insights from session transcripts. No LLM required.
- **3-Layer Archive** — Hot (active files), Warm (indexed in RAG), Cold (archived but still searchable).
- **Observation Logging** — Structured `[tag]` observations with instant indexing.
- **Pluggable Embeddings** — Local sentence-transformers (default, no API key), OpenAI, or Ollama backends.
- **Multilingual** — The default model (`intfloat/multilingual-e5-small`) covers Korean + English; swap in another model via `OPENCLAW_MEM_MODEL` as needed.
- **Zero Config** — Works out of the box with `pip install` and no API keys. Everything is overridable via environment variables.

## Architecture

```
┌─────────────────────────────────────────────────────┐
│                    openclaw-mem                     │
│                                                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────┐   │
│  │  search  │  │  index   │  │   auto-capture   │   │
│  │ (2-step) │  │ (incr.)  │  │   (rule-based)   │   │
│  └────┬─────┘  └────┬─────┘  └────────┬─────────┘   │
│       │             │                 │             │
│       ▼             ▼                 ▼             │
│  ┌──────────────────────────────────────────────┐   │
│  │             LanceDB + Embeddings             │   │
│  │      local (default) │ openai │ ollama       │   │
│  └──────────────────────────────────────────────┘   │
│                                                     │
│  Memory Layers:                                     │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐            │
│  │   HOT   │   │  WARM   │   │  COLD   │            │
│  │(active) │ → │(indexed)│ → │(archive │            │
│  │         │   │ in RAG  │   │ in RAG) │            │
│  └─────────┘   └─────────┘   └─────────┘            │
└─────────────────────────────────────────────────────┘
```

## Installation

```bash
# Default: local embeddings (no API key needed)
pip install openclaw-mem

# With OpenAI backend support
pip install "openclaw-mem[openai]"

# With Ollama backend support
pip install "openclaw-mem[ollama]"

# Everything
pip install "openclaw-mem[all]"
```

### From source

```bash
git clone https://github.com/kjaylee/openclaw-mem.git
cd openclaw-mem
pip install -e ".[dev]"
```

## Quick Start

### 1. Index your markdown files

```bash
# Set your workspace root (where memory/ directory lives)
export OPENCLAW_MEM_ROOT=/path/to/workspace

# Index all configured files
openclaw-mem index --all

# Or index only changed files (incremental)
openclaw-mem index --changed

# Or index a specific file
openclaw-mem index path/to/notes.md
```
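
The `--changed` mode implies some form of change detection. How openclaw-mem tracks changes internally is not documented here, so the following is only a minimal sketch, assuming a content-hash manifest. The names `file_digest`, `changed_files`, and `manifest` are hypothetical and are not part of the package.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_files(paths, manifest):
    """Return the paths whose digest differs from the manifest.

    `manifest` maps str(path) -> digest at last indexing (assumed structure).
    It is updated in place so a second call reports nothing new.
    """
    changed = []
    for path in paths:
        digest = file_digest(path)
        if manifest.get(str(path)) != digest:
            changed.append(path)
            manifest[str(path)] = digest
    return changed
```

Hashing content rather than comparing timestamps avoids re-embedding files that were touched but not modified, which matters when embedding is the expensive step.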

### 2. Search with Progressive Disclosure

```bash
# Step 1: Get summaries (cheap on tokens)
openclaw-mem search "deployment process" --index

# Output:
# 1. [0.8432] memory/2025-01-15.md
#    id: 2025-01-15.md:3:a1b2c3d4
#    Deployed the new API to production...

# Step 2: Get full content for interesting chunks
openclaw-mem search --detail "2025-01-15.md:3:a1b2c3d4"
```
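
The two-step flow can be illustrated without the real store. The sketch below is a toy in-memory model of progressive disclosure, not the package's implementation: `Chunk`, `summarize`, and `detail` are hypothetical names, and the scores are assumed to be precomputed similarity values.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    text: str
    score: float


def summarize(chunks, max_chars=40):
    """Step 1: return only id, score, and a short preview, best score first."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    return [
        {"id": c.chunk_id, "score": c.score, "preview": c.text[:max_chars]}
        for c in ranked
    ]


def detail(chunks, chunk_id):
    """Step 2: return the full text of one chunk chosen from the summaries."""
    for c in chunks:
        if c.chunk_id == chunk_id:
            return c.text
    raise KeyError(chunk_id)
```

An agent reads the short previews first and calls `detail` only for the ids it needs, so the full text of unselected chunks never enters the prompt.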

### 3. Record observations

```bash
openclaw-mem observe "Redis cache reduced latency by 40%" --tag learning
openclaw-mem observe "Switched to Rust for WASM builds" --tag decision
openclaw-mem observe "OOM on 2GB instances with batch size > 100" --tag error
```
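
For readers curious what a tagged observation might look like on disk, here is a minimal sketch. The `- [tag] text` bullet layout is an assumption, as are the helper names `format_observation` and `append_line`; the actual file format written by `openclaw-mem observe` may differ.

```python
from pathlib import Path

# Tags listed in the Observation Tags section of this README.
ALLOWED_TAGS = {"decision", "learning", "error", "insight"}


def format_observation(text: str, tag: str) -> str:
    """Render one observation as a tagged markdown bullet (assumed layout)."""
    if tag not in ALLOWED_TAGS:
        raise ValueError(f"unknown tag: {tag}")
    return f"- [{tag}] {text.strip()}"


def append_line(path: Path, line: str) -> None:
    """Append one line to the observations file, creating parent dirs if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
```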

### 4. Auto-capture from sessions

```bash
# Scan recent session transcripts for observations
openclaw-mem auto-capture --since 6h

# Dry run — see what would be captured
openclaw-mem auto-capture --dry-run
```

### 5. Archive old files

```bash
# See what would be archived (dry run)
openclaw-mem archive

# Actually archive files older than 30 days
openclaw-mem archive --execute

# Re-index archive for search
openclaw-mem archive --reindex
```

## Configuration

All settings can be overridden via environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `OPENCLAW_MEM_ROOT` | Package parent dir | Workspace root directory |
| `OPENCLAW_MEM_DB_PATH` | `$ROOT/lance_db` | LanceDB database path |
| `OPENCLAW_MEM_TABLE` | `openclaw_memory` | LanceDB table name |
| `OPENCLAW_MEM_BACKEND` | `local` | Embedding backend: `local`, `openai`, `ollama` |
| `OPENCLAW_MEM_MODEL` | `intfloat/multilingual-e5-small` | Model name (per backend) |
| `OPENAI_API_KEY` | *(empty)* | Required only for `openai` backend |
| `OPENAI_BASE_URL` | *(empty)* | Custom OpenAI-compatible endpoint |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `OPENCLAW_MEM_CHUNK_SIZE` | `500` | Max chunk size (characters) |
| `OPENCLAW_MEM_CHUNK_OVERLAP` | `50` | Chunk overlap (characters) |
| `OPENCLAW_MEM_ARCHIVE_DIR` | `$ROOT/memory/archive` | Archive directory |
| `OPENCLAW_MEM_ARCHIVE_DAYS` | `30` | Days before archiving |
| `OPENCLAW_MEM_OBSERVATIONS_FILE` | `$ROOT/memory/observations.md` | Observations file |
| `OPENCLAW_MEM_SESSION_DIR` | `~/.openclaw/agents/main/sessions` | Session transcripts dir |
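
To make the chunk settings concrete, the sketch below reads `OPENCLAW_MEM_CHUNK_SIZE` and `OPENCLAW_MEM_CHUNK_OVERLAP` with the defaults from the table and applies a plain sliding window. This is an illustration under assumptions: `env_int` and `chunk_text` are hypothetical helpers, and the real chunker may split on markdown structure rather than fixed character windows.

```python
import os


def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    raw = os.environ.get(name, "")
    return int(raw) if raw.strip() else default


def chunk_text(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of at most `size` chars sharing `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


size = env_int("OPENCLAW_MEM_CHUNK_SIZE", 500)
overlap = env_int("OPENCLAW_MEM_CHUNK_OVERLAP", 50)
pieces = chunk_text("some long markdown text " * 100, size, overlap)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of embedding a few characters twice.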

### Embedding Backends

```bash
# Default: local sentence-transformers (no API key, ~470MB model download)
export OPENCLAW_MEM_BACKEND=local
export OPENCLAW_MEM_MODEL=intfloat/multilingual-e5-small  # default, Korean+English

# English-only lightweight alternative
export OPENCLAW_MEM_MODEL=all-MiniLM-L6-v2

# OpenAI API
export OPENCLAW_MEM_BACKEND=openai
export OPENCLAW_MEM_MODEL=text-embedding-3-small
export OPENAI_API_KEY=sk-...

# Ollama (local server)
export OPENCLAW_MEM_BACKEND=ollama
export OPENCLAW_MEM_MODEL=nomic-embed-text
```

## Python API

```python
from openclaw_mem.search import search, search_index, get_detail
from openclaw_mem.index import index_single, index_observation
from openclaw_mem.observe import append_observation
from openclaw_mem.auto_capture import extract_observations_from_text
from openclaw_mem.embedder import get_embedder, Embedder

# Search
results = search("deployment", top_k=5)
summaries = search_index("deployment", top_k=10)
detail = get_detail("2025-01-15.md:3:a1b2c3d4")  # chunk id as printed by the summary step

# Index
index_single("path/to/file.md")
index_observation("Important finding", tag="learning")

# Observe
append_observation("Cache works great", tag="learning")

# Extract patterns from text
obs = extract_observations_from_text("결정: Redis를 사용한다")
# [{"tag": "decision", "text": "Redis를 사용한다"}]

# Direct embedding access
embedder = get_embedder()  # uses configured backend
vectors = embedder.embed(["text 1", "text 2"])

# Custom backend
embedder = Embedder(backend="openai", model="text-embedding-3-small")
```

## Observation Tags

| Tag | Description | Example patterns |
|-----|-------------|-----------------|
| `decision` | Decisions made | `결정:`, `Decision:`, `→ 채택` |
| `learning` | Things learned | `배움:`, `Learned:`, `발견:`, `✅` |
| `error` | Errors encountered | `에러:`, `Error:`, `FAIL`, `실패` |
| `insight` | TODOs and insights | `TODO:`, `할일:`, `다음에` |
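
Rule-based extraction of this kind can be sketched with a few regular expressions. The block below covers only the prefix-style patterns from the table (`Decision:`, `Learned:`, and so on); the rule list and the function name `extract` are illustrative assumptions, and the shipped rule set is likely broader.

```python
import re

# Prefix rules drawn from the table above; checked in order, first match wins.
RULES = [
    ("decision", re.compile(r"^(?:결정|Decision):\s*(.+)$")),
    ("learning", re.compile(r"^(?:배움|Learned|발견):\s*(.+)$")),
    ("error", re.compile(r"^(?:에러|Error):\s*(.+)$")),
    ("insight", re.compile(r"^(?:TODO|할일):\s*(.+)$")),
]


def extract(text: str) -> list[dict]:
    """Return one {"tag", "text"} dict per line that matches a prefix rule."""
    found = []
    for line in text.splitlines():
        line = line.strip()
        for tag, pattern in RULES:
            match = pattern.match(line)
            if match:
                found.append({"tag": tag, "text": match.group(1).strip()})
                break
    return found
```

Because the rules are plain patterns, extraction is deterministic and needs no model call, which is the point of the "No LLM required" claim.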

## Benchmark — Korean Search Accuracy

Tested with `intfloat/multilingual-e5-small` on Korean+English mixed project data.

| Metric | Result | Target |
|--------|--------|--------|
| **Accuracy** | **10/10 (100%)** | ≥ 80% |
| **Avg Response** | **0.38s** | ≤ 1.0s |
| **Similarity Scores** | 0.83–0.88 | - |

All 10 queries — pure Korean, pure English, and mixed — returned the correct document sections. See [`docs/benchmark.md`](docs/benchmark.md) for full results.

## Why Local-First?

| Cloud-dependent memory | openclaw-mem |
|------------------------|-------------|
| API outage → entire memory offline | **100% offline** — works without internet |
| Data sent to third-party servers | **Data stays on your disk** — zero telemetry |
| API key management & rotation | **No API keys needed** — `pip install` and go |
| Vendor lock-in | **MIT license** — use anywhere, modify freely |

Real-world pain: when OpenAI's embedding API went down, cloud-dependent agents lost all memory access. With openclaw-mem, a **~470MB local model** (`intfloat/multilingual-e5-small`) gives you Korean + English search that keeps working offline once the model has been downloaded.

> **Your data. Your disk. Your rules.**

## Security

openclaw-mem includes a **built-in memory injection sanitizer** that protects against prompt injection attacks stored in memory.

### How it works

- **Observe** — All incoming observations are scanned before storage. Detected injection patterns are filtered out and replaced with `[FILTERED]`.
- **Index** — During indexing, each chunk is scanned and warnings are logged for any detected patterns (non-blocking, since these are existing files).

### Detected patterns

- Direct command injection (`ignore previous instructions`, `you are now`, `system prompt:`, etc.)
- Data exfiltration (`send api key`, `curl https://...`, `fetch(...)`)
- Encoding bypasses (`base64.encode`, `eval()`, `exec()`)
- Role manipulation (`act as`, `pretend`, `jailbreak`, `DAN mode`)
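
The scan-and-replace behaviour described above can be approximated in a few lines. This is a simplified sketch using three of the listed phrases, not the shipped `MemorySanitizer`; `PATTERNS`, `check`, and `sanitize` are hypothetical names chosen for illustration.

```python
import re

# A small, illustrative subset of the listed patterns (case-insensitive).
PATTERNS = [
    re.compile(r"ignore previous instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
    re.compile(r"jailbreak", re.IGNORECASE),
]


def check(text: str):
    """Return (is_safe, matches), where matches lists every matched substring."""
    matches = [m.group(0) for p in PATTERNS for m in p.finditer(text)]
    return (not matches, matches)


def sanitize(text: str) -> str:
    """Replace every matched span with the [FILTERED] placeholder."""
    for p in PATTERNS:
        text = p.sub("[FILTERED]", text)
    return text
```

Pattern matching of this sort catches known phrasings only, so it is best treated as one layer of defense rather than a guarantee.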

### Custom patterns

```python
from openclaw_mem.sanitizer import MemorySanitizer

sanitizer = MemorySanitizer(extra_patterns=[
    r"my_custom_attack_pattern",
    r"another_pattern_to_block",
])
user_input = "Please ignore previous instructions"  # example text to scan
is_safe, matches = sanitizer.check(user_input)
```

## Development

```bash
# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=openclaw_mem
```

## License

MIT — see [LICENSE](LICENSE) for details.
