Metadata-Version: 2.4
Name: rbsp
Version: 0.1.1
Summary: Resonance-Based Semantic Protocol - Ultra-fast search and indexing
Author-email: yethikrishna <yethikrishna@users.noreply.github.com>
License: MIT
Project-URL: Homepage, https://github.com/yethikrishna/rbsp-framework
Project-URL: Repository, https://github.com/yethikrishna/rbsp-framework
Project-URL: Documentation, https://github.com/yethikrishna/rbsp-framework#readme
Project-URL: Bug Tracker, https://github.com/yethikrishna/rbsp-framework/issues
Keywords: search,indexing,semantic,protocol
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Text Processing :: Indexing
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# RBSP

**Resonance-Based Semantic Protocol** -- a zero-dependency, pure-Python search engine
with BM25 ranking, HNSW vector search, resonance-based hybrid fusion, and a built-in
HTTP API. 85 modules. 3,233 tests. Nothing required beyond Python 3.9+.

[![CI](https://github.com/yethikrishna/rbsp-framework/actions/workflows/ci.yml/badge.svg)](https://github.com/yethikrishna/rbsp-framework/actions/workflows/ci.yml)
[![PyPI version](https://img.shields.io/pypi/v/rbsp)](https://pypi.org/project/rbsp/)
![Python 3.9+](https://img.shields.io/badge/python-3.9%2B-blue)
![License: MIT](https://img.shields.io/badge/license-MIT-green)
![Tests: 3,233](https://img.shields.io/badge/tests-3%2C233-brightgreen)
![Dependencies: 0](https://img.shields.io/badge/dependencies-0-orange)

---

## What is RBSP?

RBSP is a full-featured search engine written entirely in Python's standard library.
It combines classical information retrieval (BM25, inverted index, Porter stemmer)
with modern vector search (HNSW, binary/product quantization) and a novel
resonance-based hybrid fusion algorithm. It ships as a library, CLI tool, and HTTP
server -- all with zero external dependencies.

| | |
|---|---|
| **85 modules** | Full IR stack: indexing, ranking, vector search, hybrid fusion, reranking |
| **3,233 tests** | Comprehensive suite across 85 test files -- no external services needed |
| **0 dependencies** | Pure Python stdlib. `pip install rbsp` and you are done |
| **Python 3.9 -- 3.13** | Tested on every supported CPython release |

---

## Installation

```bash
pip install rbsp
```

Or install from source:

```bash
git clone https://github.com/yethikrishna/rbsp-framework.git
cd rbsp-framework
pip install -e .
```

---

## Quick Start

### Python API

```python
from rbsp import init, index, search

# Initialize and index a project
init("/path/to/project")
stats = index("/path/to/project")
print(f"Indexed {stats.files_indexed} files in {stats.duration:.2f}s")

# Search
results = search("authentication logic", max_results=10)
for r in results[:5]:
    print(f"{r.path}:{r.line}  ({r.score:.2f})")
```

Or use the one-line API -- `resolve` auto-initializes on first call:

```python
from rbsp import resolve

results = resolve("database connection")
for r in results[:5]:
    print(f"{r.path}:{r.line}  ({r.score:.2f})")
```

### CLI

```bash
# Initialize and index
rbsp init .
rbsp index .

# Search
rbsp search "authentication logic"

# Search with options
rbsp search "database" --limit 10 --verbose --json

# Watch for file changes and re-index automatically
rbsp watch .

# View index status
rbsp status

# Manage configuration
rbsp config max_results 50
```

### HTTP API

Start the server:

```python
from rbsp.runtime.server import RBSPServer

server = RBSPServer(host="127.0.0.1", port=7342, auto_init=".")
server.serve_forever()
```

Query via curl:

```bash
# Quick search via query params
curl "http://127.0.0.1:7342/api/v1/search?q=authentication"

# Full query with options
curl -X POST http://127.0.0.1:7342/api/v1/resolve \
  -H "Content-Type: application/json" \
  -d '{"query": "authentication logic", "max_results": 10}'

# Trigger indexing
curl -X POST http://127.0.0.1:7342/api/v1/index \
  -H "Content-Type: application/json" \
  -d '{"path": ".", "force": false}'

# Health check
curl http://127.0.0.1:7342/health
```
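
The same `POST /api/v1/resolve` call can be made from Python with only the standard library. This is a minimal sketch: `build_resolve_request` is a hypothetical helper, not part of RBSP, and the response is printed as-is because its exact JSON shape is not documented here.

```python
import json
import urllib.request


def build_resolve_request(query, max_results=10, base_url="http://127.0.0.1:7342"):
    """Build a POST request for /api/v1/resolve (hypothetical helper)."""
    body = json.dumps({"query": query, "max_results": max_results}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/resolve",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# With a server running on the default host and port:
# with urllib.request.urlopen(build_resolve_request("authentication logic")) as resp:
#     print(json.load(resp))
```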

#### Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/health` | Liveness probe |
| GET | `/api/v1/status` | Engine status, metrics, and throttle stats |
| POST | `/api/v1/resolve` | Full query (JSON body) |
| GET | `/api/v1/search?q=...` | Query via query parameters |
| POST | `/api/v1/index` | Trigger indexing |
| POST | `/api/v1/init` | Initialize the engine |
| POST | `/api/v1/watcher/start` | Start file watcher |
| POST | `/api/v1/watcher/stop` | Stop file watcher |
| GET | `/api/v1/watcher` | Watcher status |

---

## Features

### Classical Information Retrieval

- **Okapi BM25 Ranking** -- k1=1.2, b=0.75 with normalized scoring
- **BM25F** -- Field-weighted BM25 for multi-field documents
- **Inverted Index** -- Term-to-document posting lists with document frequency tracking
- **WAND-style Early Termination** -- Skips non-competitive candidates for top-K
- **LSH (Locality-Sensitive Hashing)** -- MinHash with configurable bands/rows
- **Phrase Queries** -- Exact and near-phrase matching with configurable slop
- **Boolean Queries** -- AND / OR / NOT query composition
- **Wildcard Queries** -- Prefix, suffix, and infix pattern matching
- **Proximity Search** -- Find terms within a positional window
- **Porter Stemmer** -- Algorithmic stemming at index and query time
- **Spell Checking** -- Edit-distance-based query correction
- **Query Suggestions** -- Typeahead / autocomplete from the index
- **Query Expansion** -- Automatic term expansion for broader recall
- **Synonyms** -- Synonym expansion for query enrichment
- **Highlighting** -- Snippet extraction with match highlighting
- **Faceted Search** -- Drill-down aggregations by field, directory, file type
- **Freshness Scoring** -- Time-decay boost for recently modified documents
- **Diversity Ranking** -- MMR-style result diversification
- **Language Model Scoring** -- Statistical language model retrieval
- **Learning to Rank** -- Pluggable LTR scoring with feature extraction
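
For readers unfamiliar with BM25, the following is a minimal sketch of textbook Okapi BM25 using the `k1=1.2`, `b=0.75` defaults listed above. It is not RBSP's implementation: `bm25_scores` is a hypothetical name, the IDF smoothing is one common variant, and the score normalization mentioned above is left out.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` (non-empty) against `query_terms`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1.0" keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["python", "search", "engine"],
    ["cooking", "recipes"],
    ["python", "python", "tutorial"],
]
scores = bm25_scores(["python"], docs)
```

Here the second document scores zero, and the third outranks the first because it has the same length but a higher term frequency.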

### Vector Search

- **HNSW Index** -- Hierarchical Navigable Small World graph for approximately O(log n) ANN search
- **Binary Quantization** -- Hamming distance via XOR + popcount on Python ints
- **Product Quantization** -- Subspace codebook compression for memory efficiency
- **Unified VectorStore** -- Four search modes: exact, hnsw, binary, hybrid
- **FilterExpression DSL** -- Composable metadata filters (`&`, `|`, `~` operators)
- **Filtered HNSW** -- 3-tier adaptive fallback (HNSW -> widened ef -> exact)
- **Cosine and L2 Distance** -- Pluggable distance functions
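
The binary quantization idea (Hamming distance via XOR + popcount on Python ints) can be sketched as follows. The function names and the sign-based thresholding are assumptions for illustration; RBSP's own quantizer may choose bits differently.

```python
def binarize(vector):
    """Pack the sign bits of a float vector into a single Python int."""
    bits = 0
    for i, x in enumerate(vector):
        if x > 0:
            bits |= 1 << i  # bit i is set when component i is positive
    return bits


def hamming(a, b):
    """Number of differing bits: XOR the codes, then count the set bits."""
    # int.bit_count() would also work, but it requires Python 3.10+.
    return bin(a ^ b).count("1")


code_a = binarize([0.5, -0.1, 0.2, -0.3])
code_b = binarize([0.5, 0.4, -0.2, -0.3])
distance = hamming(code_a, code_b)
```

Because a Python `int` has arbitrary precision, one XOR covers the whole vector regardless of its dimension.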

### Hybrid Fusion

- **Resonance Model** -- Novel wave-interference-inspired fusion of BM25 + vector scores
- **3 Interference Modes** -- Constructive, adaptive, and linear
- **Coherence Scoring** -- Based on rank agreement between signals
- **Adaptive Resonance** -- Self-tuning alpha via click feedback (EMA)
- **Multi-Signal Resonance** -- N-signal generalization beyond two rankers
- **Ensemble Methods** -- Linear, RRF, CombScore, CombMax, Borda, Min fusion
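
Of the ensemble methods listed, Reciprocal Rank Fusion (RRF) is the simplest to show. The sketch below is standard RRF, not the resonance model, and `rrf_fuse` with its `k=60` default is an assumption rather than RBSP's API.

```python
def rrf_fuse(rankings, k=60, top_k=10):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the documents it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]


bm25_ranking = ["a", "b", "c"]
vector_ranking = ["b", "c", "d"]
fused = rrf_fuse([bm25_ranking, vector_ranking])
```

Documents that appear in both rankings (`b`, `c`) rise above those that appear in only one, which is the rank-agreement effect that hybrid fusion relies on.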

### Reranking

- **Cross-Encoder Pipeline** -- Two-stage retrieve-then-rerank architecture
- **Pluggable Backends** -- Cohere Rerank, HuggingFace Transformers, or custom encoders
- **DummyCrossEncoder** -- Zero-dependency BM25-based heuristic for testing
- **Score Fusion** -- Linear interpolation with alpha blending

### Embeddings

- **Provider Abstraction** -- Unified interface for OpenAI, Cohere, Ollama, and custom backends
- **Document Chunking** -- Fixed, sentence, and paragraph strategies with configurable overlap
- **EmbeddingPipeline** -- Chunk + embed in a single call
- **DummyProvider** -- Deterministic vectors for testing without API keys

### Advanced Query

- **Query Planning** -- Cost-based query plan optimization
- **Query Rewriting** -- Automatic query normalization and transformation
- **Score Explanation** -- Per-result scoring breakdown for debugging

### HTTP API

- **REST/JSON Server** -- Built on stdlib `http.server` with `ThreadPoolExecutor`
- **HTTP/1.1 Keep-Alive** -- Configurable socket timeout
- **API Key Authentication** -- Optional `X-API-Key` header validation
- **CORS Support** -- Cross-origin headers for browser access
- **Rate Limiting** -- Token bucket + backpressure with `429` / `503` and `Retry-After`
- **Slow Query Log** -- Queries exceeding a latency threshold are logged
- **Metrics** -- Counters, histograms, and gauges for monitoring
- **Graceful Shutdown** -- Drains in-flight requests before stopping
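
A token bucket of the kind named under Rate Limiting can be sketched in a few lines. `TokenBucket` and `try_acquire` are hypothetical names; the returned wait time is what a server could send back in a `Retry-After` header alongside a `429`.

```python
import threading
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()
        self._lock = threading.Lock()

    def try_acquire(self):
        """Return (allowed, retry_after_seconds)."""
        with self._lock:
            now = self._clock()
            # Refill in proportion to elapsed time, capped at capacity.
            elapsed = now - self._last
            self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
            self._last = now
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True, 0.0
            # Time until one full token is available again.
            return False, (1.0 - self._tokens) / self.rate
```

The injectable `clock` argument is a design choice that makes the limiter testable without sleeping.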

### Durability

- **Write-Ahead Log** -- Binary WAL with CRC32 integrity checks and crash recovery
- **7 Operation Types** -- INSERT, DELETE, UPDATE, CHECKPOINT, SEGMENT_CREATE/MERGE/DELETE
- **Configurable Sync** -- FSYNC, FDATASYNC, or NONE modes
- **Checkpointing** -- Periodic WAL truncation at safe sequence numbers
- **Snapshots** -- Point-in-time index snapshots
- **Segments** -- Segmented index architecture with background merging
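
To illustrate how CRC32 checks support crash recovery, here is a sketch of a checksummed binary record format. The layout (CRC, op code, sequence number, payload length) is an assumption made for this example and is not RBSP's on-disk WAL format.

```python
import struct
import zlib

# Assumed layout: crc32 (4 bytes), op code (1), sequence number (8), payload length (4).
HEADER = struct.Struct("<IBQI")


def encode_record(op, seq, payload):
    """Serialize one record, prefixed with a CRC32 over everything after the CRC."""
    body = struct.pack("<BQI", op, seq, len(payload)) + payload
    return struct.pack("<I", zlib.crc32(body)) + body


def decode_records(buf):
    """Yield (op, seq, payload) until the buffer ends or a record fails its check."""
    offset = 0
    while offset + HEADER.size <= len(buf):
        crc, op, seq, length = HEADER.unpack_from(buf, offset)
        start = offset + HEADER.size
        end = start + length
        if end > len(buf):
            break  # torn write: the payload was only partly flushed
        if zlib.crc32(buf[offset + 4:end]) != crc:
            break  # corrupted record: stop replay here
        yield op, seq, buf[start:end]
        offset = end


log = encode_record(1, 1, b"a") + encode_record(2, 2, b"bc")
```

Replay stops at the first truncated or corrupted record, so everything before it is recovered intact.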

### Distribution

- **Jump Consistent Hash** -- Minimal reassignment on shard count changes (Lamping & Veach 2014)
- **Shard Partitions** -- Each wraps its own VectorStore + InvertedIndex
- **Scatter-Gather Queries** -- Parallel query across shards with result merging
- **Replica Manager** -- Read replicas with health tracking
- **Consistency Levels** -- ONE, QUORUM, ALL for reads and writes
- **Online Rebalancing** -- Migrate to a new shard count without downtime
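
The jump consistent hash from Lamping & Veach translates to Python almost directly. The sketch below follows the published algorithm, with explicit 64-bit masking because Python ints do not overflow; `jump_hash` is a hypothetical name and not necessarily how RBSP exposes it.

```python
def jump_hash(key, num_buckets):
    """Map a 64-bit integer key to a bucket in [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    mask = 0xFFFFFFFFFFFFFFFF
    key &= mask
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step, masked to emulate C overflow.
        key = (key * 2862933555777941757 + 1) & mask
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b
```

When the shard count grows from `n` to `n + 1`, each key either keeps its shard or moves to the new shard `n`, which is the minimal-reassignment property. String keys would need a stable 64-bit hash first (for example from `hashlib`), since the built-in `hash()` of a `str` varies between processes.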

### Indexing

- **Incremental Indexing** -- Only re-indexes changed files
- **Concurrent Indexing** -- Thread-pool-based parallel file processing
- **File Watching** -- Live re-indexing on file system changes
- **MMap Store** -- Memory-mapped binary storage for large indexes
- **Positional Postings** -- Term positions stored for phrase and proximity queries
- **Doc Values** -- Columnar storage for sort and aggregation fields
- **Deletion Tracking** -- Soft deletes with periodic compaction
- **Index Warmup** -- Pre-loads hot data into memory on startup

### Data Structures

- **Bloom Filter** -- Probabilistic set membership with configurable false-positive rate
- **Counting Bloom Filter** -- Supports deletions via counters
- **Roaring Bitmaps** -- Compressed integer sets for posting list intersection
- **Skip Lists** -- Sorted set with expected O(log n) operations and ordered iteration
- **Finite State Transducers** -- FST for fast prefix/fuzzy lookups
- **Levenshtein Automata** -- Edit distance computation for fuzzy matching
- **Varint Encoding** -- Variable-length integer encoding for compact storage
- **LZ4/Snappy-style Compression** -- Block compression for stored fields
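
Varint encoding is easy to show concretely. This sketch uses the common LEB128-style scheme (7 data bits per byte, high bit as a continuation flag); whether RBSP uses exactly this byte layout is an assumption.

```python
def encode_varint(n):
    """Encode a non-negative int using 7 bits per byte, least significant group first."""
    if n < 0:
        raise ValueError("varint encoding requires a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, offset=0):
    """Decode one varint from `buf`; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7
```

Small values, which dominate delta-encoded posting lists, fit in a single byte.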

### Integrations

- **File System** -- Recursive directory traversal with gitignore-style filtering
- **Git** -- Git-aware indexing (respects `.gitignore`, tracks blame metadata)
- **Plugin System** -- Extensible resolvers, indexers, and rankers via plugin registry
- **Schema System** -- Typed field definitions with validation
- **Migration System** -- Index schema migrations across versions

---

## Advanced Usage

### Vector Search

```python
from rbsp.indexer.vector_store import VectorStore, FieldCondition

store = VectorStore(dim=128, distance="cosine")
store.insert("doc1", [0.1] * 128, metadata={"category": "tech"})

# HNSW search with metadata filter
results = store.search(
    [0.1] * 128, k=5, mode="hnsw",
    filter_expr=FieldCondition("category", "eq", "tech"),
)
```

### Hybrid Retrieval

```python
from rbsp.query.hybrid import HybridRetriever, ResonanceConfig

retriever = HybridRetriever(ResonanceConfig(interference_mode="adaptive"))
fused = retriever.fuse(bm25_ranking, vector_ranking, top_k=10)
```

### Write-Ahead Log

```python
from rbsp.indexer.wal import WriteAheadLog, WALConfig, WALOperation

with WriteAheadLog(WALConfig(log_dir="/tmp/rbsp-wal")) as wal:
    wal.append(WALOperation.INSERT, b'{"doc_id": "abc"}')
    wal.checkpoint()
    entries = wal.recover()  # Replay after crash
```

### Distributed Sharding

```python
from rbsp.runtime.distributed import ShardedIndex, ShardConfig

idx = ShardedIndex(ShardConfig(num_shards=4, replication_factor=2))
idx.insert("doc1", tokens=["python", "search"], vector=[0.1] * 128)
results = idx.search_vector([0.1] * 128, k=10)
```

---

## Architecture

```
rbsp/
  core/         Classical IR primitives (BM25, stemmer, LSH, bloom, HNSW,
                quantization, embeddings, chunker, roaring bitmaps, FST, ...)
  indexer/      Storage layer (inverted index, vector store, WAL, mmap,
                segments, concurrent indexing, merge, ...)
  query/        Query pipeline (hybrid fusion, reranker, ensemble, boolean,
                phrase, wildcard, spellcheck, facets, learning-to-rank, ...)
  runtime/      Orchestration (engine, config, plugins, CLI, HTTP server,
                distributed sharding, monitoring, throttle, ...)
  integration/  External integrations (filesystem, git)
```

### How a query works

1. **Parse** -- Tokenize and stem via Porter stemmer.
2. **Retrieve** -- Average O(1) per-term candidate lookup via InvertedIndex + LSH.
3. **Score** -- BM25 with trigram fuzzy matching (0.8 / 0.2 blend).
4. **Prune** -- WAND-style early termination skips non-competitive candidates.
5. **Rank** -- `heapq.nlargest` for top-K selection.
6. **Snippet** -- Extract best-matching lines only for final results.

Optionally, vector search (HNSW) retrieves dense-embedding candidates,
resonance fusion merges BM25 and vector rankings via wave interference, and a
cross-encoder reranker refines the final top candidates.
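
The retrieve, score, and rank steps can be sketched with a plain dict as the inverted index. This is a toy illustration: raw term frequency stands in for BM25, stemming and WAND pruning are omitted, and `search` here is a hypothetical standalone function, not `rbsp.search`.

```python
import heapq
from collections import defaultdict


def search(index, query, k=3):
    """Look up each query term, accumulate scores, and keep the top k."""
    scores = defaultdict(float)
    for token in query.lower().split():
        # Retrieve: one hash lookup per term returns its posting list.
        for doc_id, tf in index.get(token, {}).items():
            # Score: raw term frequency as a stand-in for BM25.
            scores[doc_id] += tf
    # Rank: heapq.nlargest avoids sorting every candidate.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# term -> {doc_id: term frequency}
index = {
    "auth": {"a.py": 2, "b.py": 1},
    "login": {"b.py": 3, "c.py": 1},
}
top = search(index, "auth login", k=2)
```

Only documents that share at least one term with the query are ever touched, which is why candidate retrieval does not scan the corpus.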

### How indexing works

1. Files are discovered via recursive directory traversal (respecting `.gitignore`).
2. Each file is tokenized, stemmed, and added to the inverted index and LSH index.
3. Term frequencies, document lengths, and trigrams are stored for BM25 and fuzzy matching.
4. All mutations are recorded in the **WAL** before being applied.
5. Periodic background merging compacts segments for efficient storage.
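
Steps 2 and 3 can be illustrated with a small in-memory indexer that records term positions and document lengths. It is a sketch under simplifying assumptions: whitespace tokenization replaces the real tokenizer and stemmer, and LSH, trigrams, the WAL, and segment merging are left out. `index_documents` is a hypothetical name.

```python
from collections import defaultdict


def index_documents(docs):
    """Build positional postings and document lengths from {doc_id: text}."""
    # term -> doc_id -> list of token positions
    postings = defaultdict(lambda: defaultdict(list))
    doc_lengths = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        doc_lengths[doc_id] = len(tokens)  # needed for BM25 length normalization
        for position, token in enumerate(tokens):
            postings[token][doc_id].append(position)
    return postings, doc_lengths


postings, doc_lengths = index_documents({
    "a": "fast search engine",
    "b": "search fast",
})
```

Term frequency is the length of each position list, so one structure serves both BM25 scoring and phrase or proximity queries.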

---

## Design Philosophy

1. **Zero dependencies** -- stdlib only, no C extensions, no pip install overhead.
2. **O(1) candidate retrieval** -- average-case hash lookups per term via InvertedIndex + LSH instead of brute-force scan.
3. **WAND early termination** -- skips documents that cannot reach top-K.
4. **heapq everywhere** -- `nlargest` / `nsmallest` avoids full sort.
5. **Deferred snippet extraction** -- only reads files for final top-K results.
6. **Binary quantization** -- a Python `int` acts as a wide bit vector, so Hamming distance is one XOR plus one popcount.
7. **Thread-safe by design** -- `threading.Lock` / `RLock` on all shared structures.
8. **Lazy initialization** -- heavy resources allocated only on first use.
9. **MMapStore binary persistence** -- fast index load and save.
10. **Jump consistent hash** -- minimal reassignment on shard count changes.

---

## Benchmarks

RBSP ships a benchmark suite covering core operations, indexing throughput,
search latency, and vector search performance:

```bash
python benchmarks/benchmark_core.py      # Signature, bloom, LSH, registry
python benchmarks/benchmark_index.py     # Indexing throughput
python benchmarks/benchmark_search.py    # Search latency vs corpus size
python benchmarks/benchmark_vector.py    # HNSW and quantized vector search
python benchmarks/benchmark_e2e.py       # End-to-end pipeline
```

---

## Development

### Prerequisites

- Python 3.9+

### Setup

```bash
git clone https://github.com/yethikrishna/rbsp-framework.git
cd rbsp-framework
pip install -e ".[dev]"
```

### Running Tests

```bash
python -m pytest tests/ -v
```

The full suite of 3,233 tests runs without any external services or API keys.

### Python Version Support

| Python | Status |
|--------|--------|
| 3.9 | Supported |
| 3.10 | Supported |
| 3.11 | Supported |
| 3.12 | Supported |
| 3.13 | Supported |

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on development setup,
testing, code style, and the pull request process.

---

## License

[MIT](LICENSE) -- Copyright (c) 2026 kkvin
