Metadata-Version: 2.4
Name: ws-ctx-engine
Version: 0.1.9
Summary: Intelligently package codebases into optimized context for Large Language Models
Author-email: zamery <zaob.ogn@gmail.com>
Maintainer-email: zamery <zaob.ogn@gmail.com>
License: GPL-3.0-or-later
Project-URL: Homepage, https://github.com/maemreyo/zmr-ctx-paker
Project-URL: Documentation, https://github.com/maemreyo/zmr-ctx-paker#readme
Project-URL: Repository, https://github.com/maemreyo/zmr-ctx-paker
Project-URL: Bug Tracker, https://github.com/maemreyo/zmr-ctx-paker/issues
Project-URL: Source Code, https://github.com/maemreyo/zmr-ctx-paker
Project-URL: Changelog, https://github.com/maemreyo/zmr-ctx-paker/blob/main/CHANGELOG.md
Keywords: llm,context,codebase,semantic-search,pagerank,code-analysis,ast,vector-search,dependency-graph,code-review,ai,machine-learning,natural-language-processing,embeddings,token-budget
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Software Development :: Code Generators
Classifier: Topic :: Software Development :: Documentation
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Text Processing :: Markup :: XML
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Typing :: Typed
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: AUTHORS.md
Requires-Dist: tiktoken<1.0.0,>=0.5.0
Requires-Dist: PyYAML<7.0,>=6.0
Requires-Dist: lxml<6.0.0,>=4.9.0
Requires-Dist: typer<1.0.0,>=0.9.0
Requires-Dist: rich<14.0.0,>=13.0.0
Requires-Dist: psutil<6.0.0,>=5.9.0
Requires-Dist: numpy<3.0.0,>=1.24.0
Provides-Extra: fast
Requires-Dist: faiss-cpu<2.0.0,>=1.7.4; extra == "fast"
Requires-Dist: networkx<4.0,>=3.0; extra == "fast"
Requires-Dist: scikit-learn<2.0.0,>=1.3.0; extra == "fast"
Provides-Extra: all
Requires-Dist: faiss-cpu<2.0.0,>=1.7.4; extra == "all"
Requires-Dist: networkx<4.0,>=3.0; extra == "all"
Requires-Dist: scikit-learn<2.0.0,>=1.3.0; extra == "all"
Requires-Dist: python-igraph<1.0.0,>=0.11.0; extra == "all"
Requires-Dist: sentence-transformers<4.0.0,>=3.0.0; extra == "all"
Requires-Dist: torch<3.0.0,>=2.0.0; extra == "all"
Requires-Dist: tree-sitter<1.0.0,>=0.20.0; extra == "all"
Requires-Dist: tree-sitter-python<1.0.0,>=0.20.0; extra == "all"
Requires-Dist: tree-sitter-javascript<1.0.0,>=0.20.0; extra == "all"
Requires-Dist: tree-sitter-typescript<1.0.0,>=0.20.0; extra == "all"
Requires-Dist: tree-sitter-rust<1.0.0,>=0.20.0; extra == "all"
Requires-Dist: leann<1.0.0,>=0.3.0; extra == "all"
Provides-Extra: leann
Requires-Dist: leann<1.0.0,>=0.3.0; extra == "leann"
Provides-Extra: full
Requires-Dist: faiss-cpu<2.0.0,>=1.7.4; extra == "full"
Requires-Dist: networkx<4.0,>=3.0; extra == "full"
Requires-Dist: scikit-learn<2.0.0,>=1.3.0; extra == "full"
Requires-Dist: python-igraph<1.0.0,>=0.11.0; extra == "full"
Requires-Dist: sentence-transformers<4.0.0,>=3.0.0; extra == "full"
Requires-Dist: torch<3.0.0,>=2.0.0; extra == "full"
Requires-Dist: tree-sitter<1.0.0,>=0.20.0; extra == "full"
Requires-Dist: tree-sitter-python<1.0.0,>=0.20.0; extra == "full"
Requires-Dist: tree-sitter-javascript<1.0.0,>=0.20.0; extra == "full"
Requires-Dist: tree-sitter-typescript<1.0.0,>=0.20.0; extra == "full"
Requires-Dist: tree-sitter-rust<1.0.0,>=0.20.0; extra == "full"
Requires-Dist: leann<1.0.0,>=0.3.0; extra == "full"
Provides-Extra: dev
Requires-Dist: pytest<9.0.0,>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov<6.0.0,>=4.1.0; extra == "dev"
Requires-Dist: pytest-benchmark<5.0.0,>=4.0.0; extra == "dev"
Requires-Dist: hypothesis<7.0.0,>=6.82.0; extra == "dev"
Requires-Dist: black<25.0.0,>=23.0.0; extra == "dev"
Requires-Dist: ruff<1.0.0,>=0.0.280; extra == "dev"
Requires-Dist: mypy<2.0.0,>=1.4.0; extra == "dev"
Requires-Dist: types-PyYAML<7.0.0,>=6.0.0; extra == "dev"
Dynamic: license-file

# ws-ctx-engine

Intelligently package codebases into optimized context for Large Language Models (LLMs). ws-ctx-engine uses hybrid ranking (semantic search + PageRank) to select the most relevant files within your token budget, with comprehensive fallback strategies for production reliability.

## Features

- **Hybrid Ranking**: Combines semantic search with structural analysis (PageRank) to identify the most important code
- **Token Budget Management**: Precise token counting using tiktoken to fit LLM context windows
- **Dual Output Formats**: 
  - XML for one-shot paste workflows (Claude.ai, ChatGPT)
  - ZIP for multi-turn upload workflows (Cursor, Claude Code)
- **Production Ready**: Automatic fallback strategies for every component
- **Incremental Indexing**: Build indexes once, reuse for fast queries
- **Flexible Configuration**: Customize weights, filters, and backends via YAML

## Installation

ws-ctx-engine offers three installation tiers based on your needs:

### Minimal (Core Only)

Basic functionality with regex-based parsing and file size ranking:

```bash
pip install ws-ctx-engine
```

**Includes**: tiktoken, PyYAML, lxml, typer, rich, psutil, numpy

### Fast (Recommended)

Core + fallback backends for semantic search and graph analysis:

```bash
pip install "ws-ctx-engine[fast]"
```

**Adds**: faiss-cpu (vector search), networkx (graph analysis), scikit-learn

### All (Full Features)

All features including primary backends for optimal performance:

```bash
pip install "ws-ctx-engine[all]"
```

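**Adds**: python-igraph (fast PageRank), sentence-transformers and torch (local embeddings), tree-sitter with Python, JavaScript, TypeScript, and Rust grammars (accurate AST parsing), leann (primary vector index)

The package metadata also defines a `leann` extra (LEANN only) and a `full` extra with the same dependencies as `all`.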

## Quick Start

### 1. Index Your Repository

Build indexes for semantic search and dependency analysis:

```bash
ws-ctx-engine index /path/to/your/repo
```

This creates a `.ws-ctx-engine/` directory with:
- `vector.idx` - Semantic search index
- `graph.pkl` - Dependency graph with PageRank scores
- `metadata.json` - Staleness detection metadata
- `logs/` - Execution logs
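
This README does not describe how `metadata.json` detects a stale index. One common approach, sketched here purely as an assumption, is to store a content hash per file and compare it against the working tree on the next run. The names `snapshot` and `is_stale` are hypothetical and are not part of the ws-ctx-engine API.

```python
import hashlib
from pathlib import Path


def snapshot(repo: Path, pattern: str = "**/*.py") -> dict[str, str]:
    """Map each matching file (relative path) to a SHA-256 digest of its bytes."""
    return {
        str(p.relative_to(repo)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(repo.glob(pattern))
        if p.is_file()
    }


def is_stale(saved: dict[str, str], current: dict[str, str]) -> bool:
    """Treat the index as stale if any file was added, removed, or modified."""
    return saved != current
```

Hashing contents is slower than comparing modification times, but it does not report a change when a file is merely touched.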

### 2. Generate Context Pack

Create an optimized context pack for LLM review:

```bash
# Generate ZIP output (default)
ws-ctx-engine pack /path/to/your/repo

# Generate XML output for paste workflows
ws-ctx-engine pack /path/to/your/repo --format xml

# Specify token budget
ws-ctx-engine pack /path/to/your/repo --budget 50000

# Query with natural language
ws-ctx-engine query "authentication and user management" --format zip
```

### 3. Use the Output

**For XML output**: Copy the contents of the generated `repomix-output.xml` and paste them into Claude.ai or ChatGPT

**For ZIP output**: Upload `ws-ctx-engine.zip` to Cursor or Claude Code. The archive includes:
- `files/` - Selected source files with preserved directory structure
- `REVIEW_CONTEXT.md` - Manifest with importance scores and reading order

## CLI Commands

### `ws-ctx-engine index`

Build and save indexes for later queries:

```bash
ws-ctx-engine index <repo_path> [OPTIONS]
```

**Options**:
- `--config PATH` - Custom configuration file (default: `.ws-ctx-engine.yaml`)
- `--verbose` - Enable detailed logging with timing information

### `ws-ctx-engine query`

Search indexed repository and generate output:

```bash
ws-ctx-engine query <query_text> [OPTIONS]
```

**Options**:
- `--format {xml|zip}` - Output format (default: zip)
- `--budget INT` - Token budget (default: 100000)
- `--config PATH` - Custom configuration file
- `--output PATH` - Output directory (default: ./output)
- `--verbose` - Enable detailed logging

### `ws-ctx-engine pack`

Full workflow: index + query + pack:

```bash
ws-ctx-engine pack <repo_path> [OPTIONS]
```

**Options**:
- `--query TEXT` - Natural language query for semantic search
- `--changed-files PATH` - File with list of changed files (one per line)
- `--format {xml|zip}` - Output format (default: zip)
- `--budget INT` - Token budget (default: 100000)
- `--config PATH` - Custom configuration file
- `--output PATH` - Output directory (default: ./output)
- `--verbose` - Enable detailed logging

## Configuration

Create a `.ws-ctx-engine.yaml` file in your repository root to customize behavior:

```yaml
# Output settings
format: zip  # xml | zip
token_budget: 100000
output_path: ./output

# Scoring weights (must sum to 1.0)
semantic_weight: 0.6
pagerank_weight: 0.4

# File filtering
include_tests: false
include_patterns:
  - "**/*.py"
  - "**/*.js"
  - "**/*.ts"
exclude_patterns:
  - "*.min.js"
  - "node_modules/**"
  - "__pycache__/**"
  - ".git/**"

# Backend selection (auto tries the primary backend first, then the fallback)
backends:
  vector_index: auto  # auto | leann | faiss
  graph: auto         # auto | igraph | networkx
  embeddings: auto    # auto | local | api

# Embeddings configuration
embeddings:
  model: all-MiniLM-L6-v2
  device: cpu
  batch_size: 32
  api_provider: openai
  api_key_env: OPENAI_API_KEY

# Performance tuning
performance:
  max_workers: 4
  cache_embeddings: true
  incremental_index: true
```

See `.ws-ctx-engine.yaml.example` for detailed documentation of all options.
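
The constraint that the two scoring weights sum to 1.0 can be checked before a run. The following is a minimal sketch, not the tool's own validation: `validate_weights` is a hypothetical helper, and the defaults of 0.6 and 0.4 are simply taken from the example above.

```python
import math


def validate_weights(config: dict) -> tuple[float, float]:
    """Return (semantic_weight, pagerank_weight) or raise ValueError."""
    # Defaults mirror the example config; the real defaults are an assumption.
    semantic = float(config.get("semantic_weight", 0.6))
    pagerank = float(config.get("pagerank_weight", 0.4))
    if min(semantic, pagerank) < 0:
        raise ValueError("weights must be non-negative")
    # Compare with a tolerance, since 0.7 + 0.3 is not exactly 1.0 in floats.
    if not math.isclose(semantic + pagerank, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights must sum to 1.0, got {semantic + pagerank}")
    return semantic, pagerank
```

The `config` dictionary could come from `yaml.safe_load` (PyYAML is a core dependency) applied to the contents of `.ws-ctx-engine.yaml`.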

## How It Works

ws-ctx-engine uses a multi-stage pipeline to select the most relevant code:

### 1. AST Parsing

Parse source code into structured chunks with metadata:
- **Primary**: py-tree-sitter (accurate, 40+ languages available; the `all` extra installs grammars for Python, JavaScript, TypeScript, and Rust)
- **Fallback**: Regex patterns (Python, JavaScript, TypeScript)

### 2. Semantic Indexing

Build vector embeddings for semantic search:
- **Primary**: LEANN (97% storage savings, graph-based)
- **Fallback**: FAISS (battle-tested HNSW index)
- **Embeddings**: sentence-transformers (local) or OpenAI API (fallback)

### 3. Dependency Graph

Analyze code structure and compute PageRank:
- **Primary**: python-igraph (C++ backend, <1s for 10k files)
- **Fallback**: NetworkX (pure Python, <10s for 10k files)
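
To illustrate what PageRank computes on a dependency graph, here is a dependency-free power-iteration sketch. It is not the igraph or NetworkX implementation the tool relies on. The edge direction (importer to imported file) and the damping factor of 0.85 are assumptions chosen for the example.

```python
def pagerank(
    edges: list[tuple[str, str]], damping: float = 0.85, iterations: int = 50
) -> dict[str, float]:
    """Power-iteration PageRank over directed edges (importer -> imported)."""
    nodes = sorted({n for edge in edges for n in edge})
    if not nodes:
        return {}
    out_links: dict[str, list[str]] = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for src, targets in out_links.items():
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank
```

With edges pointing from importer to imported file, modules that many other files depend on end up with the highest scores.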

### 4. Hybrid Ranking

Merge semantic and structural scores:
```
importance_score = semantic_weight × semantic_score + pagerank_weight × pagerank_score
```
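
The formula can be sketched in Python as follows. The min-max normalization step is an assumption added so that raw similarity scores and raw PageRank values are on a comparable scale; this README does not say whether or how the tool normalizes. `min_max` and `hybrid_scores` are hypothetical names.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a constant input maps to all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in scores.items()}


def hybrid_scores(
    semantic: dict[str, float],
    pagerank: dict[str, float],
    semantic_weight: float = 0.6,
    pagerank_weight: float = 0.4,
) -> dict[str, float]:
    """Weighted sum of normalized semantic and PageRank scores per file."""
    sem, pr = min_max(semantic), min_max(pagerank)
    return {
        path: semantic_weight * sem.get(path, 0.0) + pagerank_weight * pr.get(path, 0.0)
        for path in set(sem) | set(pr)
    }
```

A file missing from one of the two inputs is scored as 0.0 for that component, so it can still be ranked on the other signal alone.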

### 5. Budget Selection

Greedy knapsack heuristic that adds files in order of importance until the token budget is reached:
- 80% budget for file content
- 20% reserved for metadata and manifest
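
A greedy selection with the 80/20 split might look like the sketch below. It assumes files are taken in descending order of importance and that a file too large for the remaining budget is skipped rather than ending the selection; the actual ordering rule is not specified here. `select_files` is a hypothetical helper and token counts are passed in precomputed.

```python
def select_files(
    scored: list[tuple[str, float, int]],
    token_budget: int = 100_000,
    content_share: float = 0.8,
) -> tuple[list[str], int]:
    """Pick files greedily; `scored` holds (path, importance, token_count)."""
    # Only part of the budget is available for file content.
    content_budget = int(token_budget * content_share)
    selected: list[str] = []
    used = 0
    for path, _score, tokens in sorted(scored, key=lambda item: item[1], reverse=True):
        if used + tokens <= content_budget:
            selected.append(path)
            used += tokens
    return selected, used
```

Sorting by importance per token instead of raw importance is a common variant that favors small, relevant files. Neither ordering guarantees an optimal knapsack solution.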

### 6. Output Generation

Package selected files in chosen format:
- **XML**: Single file with Repomix-style structure
- **ZIP**: Preserved directory structure + manifest
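
The ZIP layout (a `files/` tree plus `REVIEW_CONTEXT.md`) can be approximated with the standard library. This is only a sketch: `build_zip` is a hypothetical function and the manifest wording is invented for illustration, not the format the tool actually writes.

```python
import io
import zipfile


def build_zip(files: dict[str, str], scores: dict[str, float]) -> bytes:
    """Return ZIP bytes with sources under files/ plus a manifest."""
    # Highest-scoring files come first in both the archive and the manifest.
    ordered = sorted(files, key=lambda p: scores.get(p, 0.0), reverse=True)
    manifest = ["# Review Context", "", "Suggested reading order:", ""]
    manifest += [
        f"{i}. `{path}` (importance {scores.get(path, 0.0):.2f})"
        for i, path in enumerate(ordered, start=1)
    ]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in ordered:
            # Relative paths are kept, so the directory structure is preserved.
            archive.writestr(f"files/{path}", files[path])
        archive.writestr("REVIEW_CONTEXT.md", "\n".join(manifest) + "\n")
    return buffer.getvalue()
```

Writing to an in-memory buffer keeps the sketch free of side effects; the returned bytes can be written to disk with `Path.write_bytes`.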

## Fallback Strategy

ws-ctx-engine is designed not to fail because of missing optional dependencies. Each component has an automatic fallback:

```
Level 1: igraph + LEANN + local embeddings (optimal)
  ↓ igraph fails
Level 2: NetworkX + LEANN + local embeddings
  ↓ LEANN fails
Level 3: NetworkX + FAISS + local embeddings
  ↓ local embeddings OOM
Level 4: NetworkX + FAISS + API embeddings
  ↓ API fails
Level 5: NetworkX + TF-IDF (no embeddings)
  ↓ NetworkX too slow
Level 6: File size ranking only (no graph)
```
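
The dependency-detection part of such a chain can be sketched with `importlib.util.find_spec`, which reports whether a top-level module is importable without importing it. `first_available` is a hypothetical helper, and the sketch only covers missing packages, not runtime failures such as out-of-memory errors or API outages.

```python
import importlib.util


def first_available(candidates: list[tuple[str, str]], default: str) -> str:
    """Return the label of the first candidate whose module is importable."""
    for label, module in candidates:
        # Use top-level module names only; find_spec returns None if absent.
        if importlib.util.find_spec(module) is not None:
            return label
        print(f"{label} not available, trying next backend")
    return default


# python-igraph is imported as `igraph`; NetworkX as `networkx`.
graph_backend = first_available(
    [("igraph", "igraph"), ("networkx", "networkx")],
    default="file-size ranking",
)
```

Ordering the candidates from most to least capable mirrors the levels listed above, with the final default standing in for the last resort.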

All fallback transitions are logged with actionable suggestions.

## Performance

Performance targets with primary backends:

- **Indexing**: <5 minutes for 10,000 files
- **Query**: <10 seconds for 10,000 files
- **Parsing**: <5 seconds per 1,000 lines of code
- **Token Counting**: ±2% accuracy vs actual LLM count

Fallback backends aim to keep overall performance within 2x of the primary backends, although individual stages can be slower (see the PageRank figures above).

## Examples

### Code Review Workflow

```bash
# Index your repository once
ws-ctx-engine index ~/projects/myapp

# Generate context for PR review (changed.txt lists one changed file per line)
ws-ctx-engine pack ~/projects/myapp \
  --query "authentication changes" \
  --changed-files changed.txt \
  --format zip \
  --budget 50000

# Upload ws-ctx-engine.zip to Cursor for review
```

### Bug Investigation

```bash
# Find relevant code for a bug
ws-ctx-engine pack ~/projects/myapp \
  --query "database connection pooling and timeout handling" \
  --format xml \
  --budget 30000

# Paste repomix-output.xml into Claude.ai
```

### Documentation Generation

```bash
# Select core API files
ws-ctx-engine pack ~/projects/myapp \
  --query "public API endpoints and data models" \
  --format zip \
  --budget 80000
```

## Development

### Running Tests

```bash
# Install development dependencies
pip install -e ".[dev,all]"

# Run all tests
pytest

# Run with coverage
pytest --cov=ws_ctx_engine --cov-report=html

# Run property-based tests only
pytest -m property

# Run integration tests
pytest -m integration

# Run benchmarks
pytest -m benchmark --benchmark-only
```

### Test Profiles

Hypothesis property tests support multiple profiles:

```bash
# CI profile: 100 examples, verbose output
pytest --hypothesis-profile=ci

# Dev profile: 20 examples, quick feedback
pytest --hypothesis-profile=dev

# Debug profile: 10 examples, maximum verbosity
pytest --hypothesis-profile=debug
```

## Troubleshooting

### "LEANN not available, using FAISS fallback"

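LEANN is an optional primary backend. Install it on its own with the `leann` extra, or together with the other primary backends with `all`:
```bash
# LEANN only
pip install "ws-ctx-engine[leann]"

# All primary backends
pip install "ws-ctx-engine[all]"
```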

### "igraph not available, using NetworkX fallback"

python-igraph is a compiled extension and may need to be built from source on platforms without a prebuilt wheel. Install with:
```bash
pip install "ws-ctx-engine[all]"
```

Or force NetworkX backend in config:
```yaml
backends:
  graph: networkx
```

### "Local embeddings OOM, falling back to API"

Reduce the batch size, or switch to API embeddings:
```yaml
# Option 1: reduce the batch size (default: 32)
embeddings:
  batch_size: 16

# Option 2: use API embeddings instead
backends:
  embeddings: api
```

Set `OPENAI_API_KEY` environment variable for API access.

### "Index is stale, rebuilding"

Files have changed since the last index, so ws-ctx-engine rebuilds it automatically. To force a rebuild manually:
```bash
rm -rf .ws-ctx-engine/
ws-ctx-engine index /path/to/repo
```

## License

GPL-3.0-or-later - see LICENSE file for details.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Under these terms, derivative works that are distributed must also be licensed under the GPL (version 3 or later).

## Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

## AI Agents

See [AI_AGENTS.md](AI_AGENTS.md) for guidelines on how AI agents should use this tool.

## Citation

If you use ws-ctx-engine in research, please cite:

```bibtex
@software{ws_ctx_engine,
  title = {ws-ctx-engine: Intelligent Codebase Packaging for LLMs},
  author = {zamery},
  year = {2024},
  url = {https://github.com/maemreyo/zmr-ctx-paker}
}
```
