Metadata-Version: 2.4
Name: rubberduck-index
Version: 0.1.0
Summary: Local project indexer for RubberDuck Semantic Intelligence MCP
Author: RubberDuck Team
License-Expression: MIT
Project-URL: Homepage, https://github.com/Grieco/cpg_query_service
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: requests>=2.28
Requires-Dist: pathspec>=0.11
Provides-Extra: watch
Requires-Dist: watchdog>=3.0; extra == "watch"
Provides-Extra: all
Requires-Dist: watchdog>=3.0; extra == "all"

# rubberduck-index

Local project indexer for [RubberDuck Semantic Intelligence](https://github.com/Grieco/cpg_query_service). Syncs your Python source code to the RubberDuck MCP server where it's analyzed as Code Property Graphs (CPGs) — enabling LLMs to query definitions, data flow, call chains, and more.

## How It Works

```
Your Machine                          EC2 Server (port 8182)
┌──────────────┐    SHA-256 hashes    ┌──────────────────────┐
│  rubberduck  │───────────────────►  │  /index/manifest     │
│  -index      │◄──── need_upload ──  │  (diff check)        │
│              │                      │                      │
│  scanner.py  │──── changed files ─► │  /index/upload       │
│  syncer.py   │    (gzip tar/JSON)   │  (store + build CPG) │
│  watcher.py  │                      │                      │
└──────────────┘                      │  ProjectStore        │
                                      │  LocalProjects/      │
                                      │    u{user_id}/       │
                                      │      {project}/      │
                                      └──────────────────────┘
```

1. **Scan** — Finds Python files, computes SHA-256 hashes, respects `.gitignore`
2. **Diff** — Sends hashes to server; server replies with which files need uploading
3. **Upload** — Sends only changed files (JSON for small batches, gzip tar for large)
4. **Build** — Server stores files in user-scoped directories and builds CPG graphs
5. **Query** — LLMs use MCP tools (`analyze_code`, `trace_variable`, `call_chain`, etc.)
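
The scan step can be pictured with a short, stdlib-only sketch. It is an illustration rather than the actual `scanner.py`: the helper names `sha256_of` and `build_manifest` are hypothetical, and `.gitignore` handling is left out for brevity.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path, patterns=("**/*.py",)) -> dict:
    """Map each matching file's POSIX-style relative path to its SHA-256 hex digest."""
    manifest = {}
    for pattern in patterns:
        for path in root.glob(pattern):
            if path.is_file():
                manifest[path.relative_to(root).as_posix()] = sha256_of(path)
    return manifest
```

Calling `build_manifest(Path("."))` in a project root would yield the kind of path-to-hash mapping that the diff step sends to the server.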

## Install

```bash
pip install -e /path/to/rubberduck_index

# With file watcher support (auto-sync on save):
pip install -e "/path/to/rubberduck_index[watch]"
```

### Requirements

- Python 3.9+
- `requests` (HTTP client)
- `pathspec` (`.gitignore`-compatible pattern matching)
- `watchdog` (optional — for `watch` command)

## Quick Start

```bash
# 1. Initialize a project (first-time setup)
cd ~/my-python-project
rubberduck-index init \
  --server http://54.81.153.13:8182 \
  --project my-app \
  --token YOUR_TOKEN

# 2. Sync changes (incremental — only uploads what changed)
rubberduck-index sync

# 3. Watch for changes (auto-sync on file save)
rubberduck-index watch -d    # daemon mode (background)
rubberduck-index stop        # stop the daemon

# 4. Check status
rubberduck-index status      # local + server status
rubberduck-index list        # all projects on server
```

## Commands

### `init`

Initialize a project directory for indexing. Creates `.rubberduck/config.json`, scans files, and performs the first sync.

```bash
rubberduck-index init --server URL --project NAME [OPTIONS]
```

| Flag | Description |
|------|-------------|
| `--server` | **(required)** MCP server URL, e.g. `http://54.81.153.13:8182` |
| `--project` | Project name on the server (default: directory name) |
| `--token` | Bearer token for auth (or set `RUBBERDUCK_TOKEN` env var) |
| `--directory` | Project directory (default: current directory) |
| `--include` | File patterns to include (default: `**/*.py`) |
| `--watch`, `-w` | Start background watcher after init |

### `sync`

Sync changed files to the server. Compares local hashes with server manifest and uploads only what's different.

```bash
rubberduck-index sync [--force]
```

| Flag | Description |
|------|-------------|
| `--force` | Force full re-upload (ignore hash comparison) |

### `watch`

Watch for file changes and auto-sync. Uses OS-native file system events (FSEvents on macOS, inotify on Linux).

```bash
rubberduck-index watch [-d]
```

| Flag | Description |
|------|-------------|
| `--daemon`, `-d` | Run in background. Stop with `rubberduck-index stop`. |

Changes are debounced (default 500ms) and batched before uploading.
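
As a rough illustration of debouncing and batching (not the actual `watcher.py`), the sketch below collects changed paths and releases them only after a quiet period. The `Debouncer` class and its method names are hypothetical; timestamps are passed in explicitly so the logic is easy to test.

```python
class Debouncer:
    """Collect changed paths and release them once no event arrived for `delay` seconds."""

    def __init__(self, delay: float = 0.5):
        self.delay = delay
        self._pending = set()
        self._last_event = None

    def record(self, path: str, now: float) -> None:
        """Remember a changed path; every new event restarts the quiet period."""
        self._pending.add(path)
        self._last_event = now

    def flush_if_quiet(self, now: float) -> list:
        """Return the sorted batch if the quiet period has elapsed, else an empty list."""
        if not self._pending or now - self._last_event < self.delay:
            return []
        batch = sorted(self._pending)
        self._pending.clear()
        return batch
```

In a real watcher, `now` would typically come from `time.monotonic()`, and a non-empty batch would trigger one sync for all files in it.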

### `stop`

Stop the background watcher daemon.

```bash
rubberduck-index stop
```

### `status`

Show index status for the current project — local file count vs. server state.

```bash
rubberduck-index status
```

### `list`

List all indexed projects on the server.

```bash
rubberduck-index list [--server URL]
```

### `remove`

Remove a project from the server (deletes synced files and CPG graphs).

```bash
rubberduck-index remove [--project NAME] [--server URL]
```

## Authentication

The server requires Bearer token authentication. Obtain a token from the server admin.

**Three ways to provide your token:**

1. **`--token` flag** (init only):
   ```bash
   rubberduck-index init --server http://... --token abc123
   ```

2. **Environment variable** (any command):
   ```bash
   export RUBBERDUCK_TOKEN=abc123
   rubberduck-index sync
   ```

3. **Config file** (auto-saved by init):
   ```json
   // .rubberduck/config.json
   { "token": "abc123", ... }
   ```

Priority: `--token` flag > `RUBBERDUCK_TOKEN` env var > config file.
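
The priority order can be expressed as a small lookup. This is a sketch under the assumption that empty values are treated as "not set"; `resolve_token` is a hypothetical helper, not part of the documented CLI.

```python
import os


def resolve_token(cli_token=None, config=None, environ=None) -> str:
    """Pick the first non-empty token: --token flag, then env var, then config file."""
    environ = os.environ if environ is None else environ
    config = config or {}
    candidates = (cli_token, environ.get("RUBBERDUCK_TOKEN"), config.get("token"))
    for candidate in candidates:
        if candidate:
            return candidate
    return ""  # no token available anywhere
```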

The `.rubberduck/` directory is automatically added to `.gitignore` to prevent accidental token commits.
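
One way such a `.gitignore` update could work is sketched below. The function name `ensure_gitignored` is hypothetical, and the real `init` command may behave differently; the point is that the update should be idempotent and keep existing entries intact.

```python
from pathlib import Path


def ensure_gitignored(project_dir: Path, entry: str = ".rubberduck/") -> bool:
    """Append `entry` to .gitignore unless it is already listed. Returns True if added."""
    gitignore = project_dir / ".gitignore"
    text = gitignore.read_text(encoding="utf-8") if gitignore.exists() else ""
    if entry in (line.strip() for line in text.splitlines()):
        return False
    # Start on a fresh line if the existing file does not end with a newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    gitignore.write_text(text + prefix + entry + "\n", encoding="utf-8")
    return True
```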

## Configuration

All config lives in `.rubberduck/config.json` (created by `init`):

```json
{
  "server": "http://54.81.153.13:8182",
  "project": "my-app",
  "token": "your-bearer-token",
  "include": ["**/*.py"],
  "exclude_defaults": true,
  "max_file_size": 5000000,
  "watch_debounce_ms": 500
}
```

| Field | Default | Description |
|-------|---------|-------------|
| `server` | — | MCP server URL |
| `project` | directory name | Project name on the server |
| `token` | `""` | Bearer token for auth |
| `include` | `["**/*.py"]` | Glob patterns for files to index |
| `exclude_defaults` | `true` | Use built-in exclude list (see below) |
| `max_file_size` | `5000000` (5MB) | Skip files larger than this |
| `watch_debounce_ms` | `500` | Debounce interval for file watcher |
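
To show how the defaults in the table combine with a stored config, here is a minimal sketch. `load_config` and `DEFAULTS` are hypothetical names, and the actual `config.py` may validate or merge fields differently.

```python
import json
from pathlib import Path

# Defaults taken from the table above; `server` has none and must be present.
DEFAULTS = {
    "token": "",
    "include": ["**/*.py"],
    "exclude_defaults": True,
    "max_file_size": 5_000_000,
    "watch_debounce_ms": 500,
}


def load_config(project_dir: Path) -> dict:
    """Read .rubberduck/config.json and fill in defaults for any missing field."""
    path = project_dir / ".rubberduck" / "config.json"
    if not path.exists():
        raise FileNotFoundError("No .rubberduck/config.json found")
    stored = json.loads(path.read_text(encoding="utf-8"))
    # `project` defaults to the directory name; stored values override everything.
    config = {**DEFAULTS, "project": project_dir.resolve().name, **stored}
    if not config.get("server"):
        raise ValueError("config is missing the required 'server' field")
    return config
```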

### Custom ignore patterns

Create `.rubberduck/ignore` with `.gitignore`-style patterns:

```
# Extra excludes
tests/fixtures/**
docs/**
*.generated.py
```

### Built-in excludes

Excluded whenever `exclude_defaults` is `true` (the default): `__pycache__`, `.git`, `.venv`, `venv`, `node_modules`, `.tox`, `.mypy_cache`, `.pytest_cache`, `.eggs`, `*.egg-info`, `dist`, `build`, `.DS_Store`, `.hg`, `.svn`, `.env`.
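
A simple way to picture the built-in list is a per-component match on each relative path. The sketch below assumes a path is skipped when any of its components matches an entry; `is_builtin_excluded` is a hypothetical helper, and the real scanner may match differently.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

BUILTIN_EXCLUDES = (
    "__pycache__", ".git", ".venv", "venv", "node_modules", ".tox",
    ".mypy_cache", ".pytest_cache", ".eggs", "*.egg-info", "dist",
    "build", ".DS_Store", ".hg", ".svn", ".env",
)


def is_builtin_excluded(rel_path: str) -> bool:
    """True if any component of a POSIX-style relative path matches a built-in exclude."""
    parts = PurePosixPath(rel_path).parts
    return any(
        fnmatchcase(part, pattern)
        for part in parts
        for pattern in BUILTIN_EXCLUDES
    )
```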

## Sync Protocol

The indexer uses a two-step incremental sync protocol:

### Step 1: Manifest comparison (`POST /index/manifest`)

```
Client sends:  { "project": "my-app", "files": { "app.py": "sha256...", "utils.py": "sha256..." } }
Server replies: { "need_upload": ["app.py"], "already_synced": ["utils.py"], "deleted_on_client": ["old.py"] }
```
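
The comparison behind this response can be sketched as a plain dictionary diff. This is an illustration of the idea, not the server's code; `diff_manifest` is a hypothetical name, and the server-side hashes are assumed to be stored in the same path-to-hash shape.

```python
def diff_manifest(client_files: dict, server_files: dict) -> dict:
    """Compare path -> SHA-256 maps and classify every path."""
    need_upload = sorted(
        path for path, digest in client_files.items()
        if server_files.get(path) != digest  # new file or changed content
    )
    already_synced = sorted(
        path for path, digest in client_files.items()
        if server_files.get(path) == digest
    )
    deleted_on_client = sorted(
        path for path in server_files if path not in client_files
    )
    return {
        "need_upload": need_upload,
        "already_synced": already_synced,
        "deleted_on_client": deleted_on_client,
    }
```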

### Step 2: File upload (`POST /index/upload`)

- **Small batches** (<10 files, <1MB): JSON mode — `{"project": "my-app", "files": {"app.py": "content..."}}`
- **Large batches**: gzip tar mode — compressed archive with `Content-Type: application/gzip`
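
The choice between the two modes might look roughly like the sketch below. It assumes 1 MB means 1,000,000 bytes and that files are UTF-8 text; `build_upload` is a hypothetical helper. How the project name travels alongside a gzip tar body is not specified here, so the sketch only builds the archive itself.

```python
import io
import json
import tarfile

JSON_MAX_FILES = 10         # assumed threshold: fewer than 10 files
JSON_MAX_BYTES = 1_000_000  # assumed threshold: less than 1 MB in total


def build_upload(project: str, files: dict):
    """`files` maps relative path -> bytes. Returns (content_type, body_bytes)."""
    total = sum(len(data) for data in files.values())
    if len(files) < JSON_MAX_FILES and total < JSON_MAX_BYTES:
        payload = {
            "project": project,
            "files": {path: data.decode("utf-8") for path, data in files.items()},
        }
        return "application/json", json.dumps(payload).encode("utf-8")

    # Large batch: pack everything into an in-memory gzip-compressed tar archive.
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for path, data in sorted(files.items()):
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return "application/gzip", buffer.getvalue()
```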

The server stores files in user-scoped directories (`LocalProjects/u{user_id}/{project}/`), builds CPG graphs for each `.py` file, and returns the list of graphs built.

## User Isolation

Each user's projects are stored in separate directories on the server. Two users can have projects with the same name without conflicts:

```
LocalProjects/
  u1/my-app/       ← User 1's "my-app"
  u2/my-app/       ← User 2's "my-app" (completely isolated)
```

The `user_id` is derived from your Bearer token — the CLI never needs to know it.
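
The directory layout above can be illustrated with a small path helper. This is a sketch, not the server's `ProjectStore`: `project_dir` is a hypothetical function, and the rejection of path separators is an assumed safeguard against a project name escaping its user directory.

```python
from pathlib import Path


def project_dir(base: Path, user_id: int, project: str) -> Path:
    """Build the user-scoped storage path LocalProjects/u{user_id}/{project}/."""
    if not project or project in (".", "..") or "/" in project or "\\" in project:
        raise ValueError(f"invalid project name: {project!r}")
    return base / f"u{user_id}" / project
```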

## File Structure

```
rubberduck_index/
├── __init__.py      # Package metadata, version
├── __main__.py      # CLI entry point (7 commands)
├── config.py        # Config management (.rubberduck/config.json)
├── scanner.py       # File discovery + SHA-256 hashing
├── syncer.py        # HTTP client for /index/* endpoints
├── watcher.py       # File watcher (watchdog) + daemon mode
├── pyproject.toml   # Package definition
└── README.md        # This file
```

## Server Limits

| Limit | Default | Description |
|-------|---------|-------------|
| Max project size | 50 MB | Total size of all files in a project |
| Max file size | 5 MB | Individual file size limit |
| Compressed upload | 100 MB | Max gzip-compressed body size |
| Project TTL | 24 hours | Synced projects expire after this (re-sync to refresh) |
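
A client could check the size limits before uploading. The sketch below assumes decimal megabytes (matching the `5000000` value used for `max_file_size`); `check_limits` is a hypothetical helper, and the server remains the authority on what it accepts.

```python
MAX_FILE_BYTES = 5_000_000      # 5 MB per file (assumed decimal MB)
MAX_PROJECT_BYTES = 50_000_000  # 50 MB per project (assumed decimal MB)


def check_limits(sizes: dict) -> list:
    """`sizes` maps relative path -> size in bytes. Returns a list of problems found."""
    problems = [
        f"{path}: {size} bytes exceeds the per-file limit"
        for path, size in sorted(sizes.items())
        if size > MAX_FILE_BYTES
    ]
    total = sum(sizes.values())
    if total > MAX_PROJECT_BYTES:
        problems.append(f"project total of {total} bytes exceeds the project limit")
    return problems
```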

## Examples

### Index a Django project

```bash
cd ~/django-project
rubberduck-index init \
  --server http://54.81.153.13:8182 \
  --project my-django \
  --token "$RUBBERDUCK_TOKEN" \
  --watch
```

### Index only specific directories

```bash
rubberduck-index init \
  --server http://54.81.153.13:8182 \
  --project my-app \
  --token "$RUBBERDUCK_TOKEN" \
  --include "src/**/*.py" "lib/**/*.py"
```

### Use with Cursor / Claude Code

After indexing, tell the LLM in Cursor or Claude Code:

```
Load the project "my-app" and trace the data flow of the `request` variable.
```

The LLM will use MCP tools:
1. `load_repo(repo="local/my-app")` — loads CPG graphs
2. `analyze_code(statement="trace data flow of request", graph_id="...")` — queries the graph
3. Returns facts about definitions, assignments, and flow paths

## Troubleshooting

**"No .rubberduck/config.json found"**
Run `rubberduck-index init` first, or `cd` into the project directory.

**"Server will reject unauthenticated requests"**
Set your token: `export RUBBERDUCK_TOKEN=your-token` or re-run `init --token ...`.

**"Request body too large"**
The upload exceeds a server limit (by default 50 MB per project and 100 MB per compressed upload; see Server Limits). Use more specific `--include` patterns to reduce the file count, or raise the limit on the server.

**"No matching files found"**
Check your `include` patterns in `.rubberduck/config.json`. Default is `**/*.py`.

**Watcher not detecting changes**
Ensure `watchdog` is installed: `pip install watchdog`. On Linux, check inotify limits: `sysctl fs.inotify.max_user_watches`.
