Metadata-Version: 2.4
Name: dot-search
Version: 0.1.0
Summary: Augment existing database tables with vector and BM25 search
Project-URL: Homepage, https://gitlab.com/deepika6190303/deepika-open-toolbox/dot-search
Project-URL: Repository, https://gitlab.com/deepika6190303/deepika-open-toolbox/dot-search
Project-URL: Issues, https://gitlab.com/deepika6190303/deepika-open-toolbox/dot-search/-/issues
Author-email: deepika Team <contact@deepika.ai>
License: TODO: TO BE COMPLETED
License-File: LICENSE
Keywords: bm25,deepika,embeddings,hybrid,open-toolbox,search,vector
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.12
Requires-Dist: aiosqlite>=0.20
Requires-Dist: asyncpg>=0.29
Requires-Dist: httpx>=0.27
Requires-Dist: pgvector>=0.3
Requires-Dist: sqlalchemy[asyncio]>=2.0
Description-Content-Type: text/markdown

# dot-search

![Python Version](https://img.shields.io/badge/python-3.12%2B-blue)

**dot-search** augments existing database tables with vector search and BM25 keyword search. Add embedding columns directly to your tables, combine SQL filters with semantic similarity, and fuse results from multiple embeddings via Reciprocal Rank Fusion (RRF).

## Prerequisites

### PostgreSQL (production)

dot-search requires PostgreSQL with two extensions:

- **[pgvector](https://github.com/pgvector/pgvector)** — vector storage and similarity search
- **[ParadeDB `pg_search`](https://github.com/paradedb/paradedb)** — BM25 full-text search

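These are system-level PostgreSQL extensions: they must be installed on the database server before you can enable them in a database. The commands below install pgvector only; install `pg_search` separately by following the ParadeDB documentation. On Arch Linux: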

```bash
yay -S pgvector
```

On Ubuntu/Debian (the package name includes the PostgreSQL major version, here 16):

```bash
sudo apt install postgresql-16-pgvector
```

Then enable them in your database:

```sql
CREATE EXTENSION vector;
CREATE EXTENSION pg_search;
```

### SQLite (testing only)

The SQLite backend uses **[sqlite-vec](https://github.com/asg017/sqlite-vec)** for vector search. Install it as a Python package:

```bash
pip install sqlite-vec
```

Note: BM25 is not supported with SQLite. Use the SQLite backend only for local testing of the indexing and vector search flow.

## Install

```bash
pip install dot-search
```

## Concept

`SearchEngine` orchestrates indexing and search over your existing tables. It uses a pluggable `Store` backend (PostgreSQL with pgvector + ParadeDB, or SQLite for testing).

```
SearchEngine
 ├── EmbeddingConfig  →  which column to embed (provider/model stored, resolved via env)
 └── Store            →  database backend
          │
          ▼
   engine.index(config=...)   # persists config, adds ds_* columns, computes embeddings
          │
          ▼
   engine.search(query, index_id, ...)  # loads config from DB, vector + BM25 + SQL filters → [SearchResult]
```

Config is persisted in a `ds_configs` table managed by dot-search, keyed by `index_id`. After the first `index(config=...)`, you can restart the process and call `search()` or `index(index_id=...)` without re-providing the config.

Multiple configs can target the same table with different `index_id`s — useful for different embedding models or different subsets of columns.

Embedding columns are prefixed with `ds_` and are nullable — they don't interfere with your existing ORM (Django, SQLAlchemy ORM, etc.).
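
The persistence model described above (one stored config per `index_id`, reloaded on later runs) can be pictured with a small standalone sketch. It uses the standard-library `sqlite3` module and a hypothetical two-column `ds_configs` layout; the real table schema and serialization format used by dot-search may differ.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one JSON document per index_id (an assumption, not the real layout)
conn.execute("CREATE TABLE ds_configs (index_id TEXT PRIMARY KEY, config TEXT NOT NULL)")


def save_config(index_id: str, config: dict) -> None:
    """Insert the config, or overwrite the one already stored under index_id."""
    conn.execute(
        "INSERT OR REPLACE INTO ds_configs (index_id, config) VALUES (?, ?)",
        (index_id, json.dumps(config)),
    )


def load_config(index_id: str) -> dict:
    """Return the stored config, or raise KeyError if index_id was never registered."""
    row = conn.execute(
        "SELECT config FROM ds_configs WHERE index_id = ?", (index_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no config registered for index_id={index_id!r}")
    return json.loads(row[0])


# Two independent configs targeting the same table, distinguished by index_id
save_config("documents_multilingual", {"table": "documents", "source_column": "body"})
save_config(
    "documents_longctx",
    {"table": "documents", "source_column": "body", "model": "text-embedding-3-large"},
)
print(load_config("documents_longctx")["model"])
```

Keying by `index_id` rather than by table name is what allows several configs to coexist on one table.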

## Environment variables

dot-search reads its embedder configuration from environment variables:

| Variable | Description |
|---|---|
| `DOT__EMBED_API_KEY` | Bearer token for the embedding API |
| `DOT__EMBED_BASE_URL` | Base URL of the API (e.g. `https://api.openai.com/v1`) |
| `DOT__EMBED_MODEL` | Model name (e.g. `text-embedding-3-small`) |
| `DOT__EMBED_DIMENSION` | Embedding dimension (e.g. `1536`) |

Works with any OpenAI-compatible API: OpenAI, vLLM, TGI, Ollama, etc.
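
As a pre-flight check, the variables above can be validated before starting an indexing run. The sketch below is not part of dot-search: `load_embed_settings` and `build_embeddings_request` are hypothetical helpers, and the request is only built, never sent. It assumes the usual OpenAI-style `POST {base_url}/embeddings` shape with a JSON body containing `model` and `input`.

```python
import json
import os
import urllib.request

REQUIRED = (
    "DOT__EMBED_API_KEY",
    "DOT__EMBED_BASE_URL",
    "DOT__EMBED_MODEL",
    "DOT__EMBED_DIMENSION",
)


def load_embed_settings(env=os.environ) -> dict:
    """Collect the DOT__EMBED_* variables, failing early if any is missing or empty."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "api_key": env["DOT__EMBED_API_KEY"],
        "base_url": env["DOT__EMBED_BASE_URL"].rstrip("/"),
        "model": env["DOT__EMBED_MODEL"],
        "dimension": int(env["DOT__EMBED_DIMENSION"]),
    }


def build_embeddings_request(settings: dict, texts: list[str]) -> urllib.request.Request:
    """Build (but do not send) a POST request to the OpenAI-style /embeddings endpoint."""
    body = json.dumps({"model": settings["model"], "input": texts}).encode("utf-8")
    return urllib.request.Request(
        settings["base_url"] + "/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {settings['api_key']}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Example values only; in practice load_embed_settings() reads os.environ
example_env = {
    "DOT__EMBED_API_KEY": "sk-example",
    "DOT__EMBED_BASE_URL": "https://api.openai.com/v1/",
    "DOT__EMBED_MODEL": "text-embedding-3-small",
    "DOT__EMBED_DIMENSION": "1536",
}
settings = load_embed_settings(example_env)
request = build_embeddings_request(settings, ["hello world"])
print(request.get_method(), request.full_url)
```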

## Quick start

```python
import asyncio
from dot_search import SearchEngine, TableConfig, EmbeddingConfig, BM25Config

# Set DOT__EMBED_* env vars before running (see above)

engine = SearchEngine(db_url="postgresql+asyncpg://user:pass@localhost/mydb")

async def main():
    # First time: provide config — it's saved to ds_configs in the DB
    await engine.index(config=TableConfig(
        table="documents",
        embeddings=[
            EmbeddingConfig(source_column="body"),
        ],
        bm25=[BM25Config(source_column="body")],
    ))

    results = await engine.search("neural networks", "documents")
    for r in results:
        print(r.id, r.score)

asyncio.run(main())
```

After the first run, config is persisted. On subsequent runs you can re-index without re-providing it:

```python
await engine.index(index_id="documents")  # loads config from ds_configs, re-runs indexing
```

## Multiple indexes per table

Use `index_id` to register multiple independent indexes on the same table — for example with different embedding models:

```python
await engine.index(config=TableConfig(
    table="documents",
    index_id="documents_multilingual",
    embeddings=[EmbeddingConfig(source_column="body")],
))

await engine.index(config=TableConfig(
    table="documents",
    index_id="documents_longctx",
    embeddings=[EmbeddingConfig(source_column="body", model="text-embedding-3-large")],
))

results = await engine.search("neural networks", "documents_multilingual")
```

`index_id` defaults to `table` when not specified.

## Multiple embeddings with fusion

Define multiple `EmbeddingConfig`s to fuse results via Reciprocal Rank Fusion:

```python
await engine.index(config=TableConfig(
    table="documents",
    embeddings=[
        EmbeddingConfig(source_column="title", default_weight=0.8),
        EmbeddingConfig(source_column="body", default_weight=1.0),
    ],
    bm25=[BM25Config(source_column="body", default_weight=1.0)],
))

results = await engine.search("machine learning", "documents")
```

## Search with SQL filters

Pass a `where` clause to combine vector search with SQL filters:

```python
results = await engine.search(
    "climate change",
    "articles",
    where="published_at > '2024-01-01' AND category = 'science'",
    limit=10,
)
```
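
Conceptually, a filtered search keeps only the rows that satisfy the SQL condition and then ranks the survivors by similarity to the query vector. The standalone sketch below illustrates that idea with `sqlite3` and a pure-Python cosine similarity. It is an illustration only: the table, the two-dimensional vectors, and `filtered_vector_search` are made up, and the PostgreSQL backend would do the ranking inside the database via pgvector rather than in Python.

```python
import json
import math
import sqlite3


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, category TEXT, published_at TEXT, vec TEXT)"
)
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?, ?)",
    [
        (1, "science", "2024-03-01", json.dumps([0.9, 0.1])),
        (2, "science", "2023-06-15", json.dumps([1.0, 0.0])),    # too old: filtered out
        (3, "sports", "2024-05-20", json.dumps([0.95, 0.05])),   # wrong category: filtered out
        (4, "science", "2024-07-10", json.dumps([0.2, 0.8])),
    ],
)


def filtered_vector_search(query_vec, where, limit=10):
    # In this sketch `where` is spliced into the SQL verbatim,
    # so it must come from trusted code, never from end-user input.
    rows = conn.execute(f"SELECT id, vec FROM articles WHERE {where}").fetchall()
    scored = [(row_id, cosine_similarity(query_vec, json.loads(vec))) for row_id, vec in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


hits = filtered_vector_search(
    [1.0, 0.0], "published_at > '2024-01-01' AND category = 'science'"
)
print([row_id for row_id, _ in hits])
```

Rows 2 and 3 are closer to the query vector than row 4, but they never reach the ranking step because the filter removes them first.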

## Search strategies

Pass `strategy` to `search()` to control which signals are used. The default is `"hybrid"`.

| Strategy | What it uses |
|----------|-------------|
| `"hybrid"` | Vector + BM25 + exact (any configured), fused via RRF (default) |
| `"vector"` | Vector similarity only |
| `"bm25"` | BM25 keyword search only |
| `"exact"` | Substring (`LIKE`) search only |

## Hybrid search (vector + BM25)

Add `bm25` to `TableConfig` to enable BM25 alongside vector search. Multiple BM25 targets are supported — each column produces an independent ranked list, all fused together via RRF:

```python
await engine.index(config=TableConfig(
    table="documents",
    embeddings=[EmbeddingConfig(source_column="body")],
    bm25=[
        BM25Config(source_column="title", default_weight=0.8),
        BM25Config(source_column="body", default_weight=1.0),
    ],
))

# Hybrid by default (vector + BM25 fused via RRF)
results = await engine.search("neural networks", "documents")

# BM25 only
results = await engine.search("neural networks", "documents", strategy="bm25")
```
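
For readers unfamiliar with BM25, the sketch below shows the textbook Okapi BM25 score in pure Python, using whitespace tokenization and the commonly used defaults `k1=1.2` and `b=0.75`. These choices are assumptions for illustration: in dot-search the scoring is delegated to ParadeDB `pg_search`, whose tokenizer and parameters may differ.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25 (Lucene-style IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "neural networks learn representations",
    "cooking pasta at home",
    "deep neural networks and neural architectures",
]
scores = bm25_scores("neural networks", docs)
print(scores)
```

Documents that share no term with the query score zero, which is why a BM25 list and a vector list can rank quite different rows before fusion.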

## Exact (substring) search

Add `exact` to `TableConfig` to enable `LIKE`-based substring matching. Exact results are automatically included in `hybrid` searches and fused via RRF alongside vector and BM25 results. Use `strategy="exact"` to run substring matching only.

```python
await engine.index(config=TableConfig(
    table="documents",
    embeddings=[EmbeddingConfig(source_column="body")],
    bm25=[BM25Config(source_column="body")],
    exact=[ExactConfig(source_column="name")],  # substring match on a separate column
))

# hybrid: vector + BM25 + exact, all fused via RRF
results = await engine.search("Dupont", "documents")

# exact only
results = await engine.search("Dupont", "documents", strategy="exact")
```

Multiple `ExactConfig` entries are supported. Each column produces its own ranked list, fused via RRF.
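
What `LIKE`-based substring matching means can be shown with a standalone `sqlite3` sketch. The helpers `like_pattern` and `exact_search` are hypothetical, and escaping the `%` and `_` wildcards is a choice made in this sketch; whether dot-search escapes them, and how it orders exact matches, is not specified here. Note also that SQLite's `LIKE` ignores ASCII case by default, whereas PostgreSQL's `LIKE` is case-sensitive.

```python
import sqlite3


def like_pattern(term: str) -> str:
    """Wrap term in % wildcards, escaping LIKE metacharacters with a backslash."""
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return f"%{escaped}%"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO documents (id, name) VALUES (?, ?)",
    [(1, "Jean Dupont"), (2, "Marie Durand"), (3, "Dupont & Fils"), (4, "100% Dupond")],
)


def exact_search(term: str) -> list[int]:
    """Return ids of rows whose name contains term as a substring."""
    rows = conn.execute(
        "SELECT id FROM documents WHERE name LIKE ? ESCAPE '\\' ORDER BY id",
        (like_pattern(term),),
    )
    return [row[0] for row in rows]


print(exact_search("Dupont"))
print(exact_search("100%"))
```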

## Overriding weights at search time

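Pass `embedding_weights` and/or `bm25_weights` to override `default_weight` for a single query. As in the example below, `embedding_weights` is keyed by the embedding's target column name and `bm25_weights` by source column name: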

```python
results = await engine.search(
    "neural networks",
    "documents",
    embedding_weights={"ds_documents_body_default": 0.5},
    bm25_weights={"title": 2.0, "body": 1.0},
)
```

## Embedding column naming

By default, `EmbeddingConfig` generates its `target_column` name as `ds_{index_id}_{source_column}_{model}`, with `default` in place of `{model}` when no model is set. For example:

```python
TableConfig(
    table="documents",
    embeddings=[EmbeddingConfig(source_column="body")],
)
# → column: ds_documents_body_default
```

You can override it explicitly:

```python
EmbeddingConfig(source_column="body", target_column="ds_my_custom_col")
```

## Serialization

Rows are serialized to text before embedding. Use `serialize_row()` to see what gets embedded:

```python
from dot_search import serialize_row

text = serialize_row({"title": "Hello", "body": "World", "ds_vec": None})
# "title: Hello\nbody: World"  — ds_* columns and NULLs are skipped
```

When `source_column=None` on an `EmbeddingConfig`, dot-search serializes all non-`ds_*` columns automatically.
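
The documented behaviour (one `column: value` line per field, skipping `ds_*` columns and NULLs) can be approximated as follows. `serialize_row_sketch` is a hypothetical reimplementation written from the example output above; the real `serialize_row` may format some value types differently.

```python
def serialize_row_sketch(row: dict) -> str:
    """Join "column: value" lines, skipping ds_* columns and None values."""
    lines = [
        f"{column}: {value}"
        for column, value in row.items()
        if value is not None and not column.startswith("ds_")
    ]
    return "\n".join(lines)


text = serialize_row_sketch({"title": "Hello", "body": "World", "ds_vec": None})
print(text)
```

Skipping `ds_*` columns keeps previously computed embeddings out of the text that is embedded on the next indexing run.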

## Result fusion

Use `reciprocal_rank_fusion()` directly to merge ranked lists:

```python
from dot_search import reciprocal_rank_fusion

# results_vec and results_bm25 are ranked result lists, one per signal
fused = reciprocal_rank_fusion(
    [results_vec, results_bm25],
    weights=[1.0, 0.8],
)
```
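
To make the fusion step concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion over lists of ids. Each list contributes `weight / (k + rank)` to an item's score, with ranks starting at 1. The function name `weighted_rrf`, the plain-id inputs, and the constant `k=60` (the value commonly used in the RRF literature) are assumptions; the constant and input types used by `reciprocal_rank_fusion` may differ.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights=None, k=60):
    """Fuse ranked lists of ids (best match first) into (id, score) pairs."""
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    if len(weights) != len(ranked_lists):
        raise ValueError("need exactly one weight per ranked list")

    scores = defaultdict(float)
    for ids, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_ids = ["a", "b", "c"]
bm25_ids = ["b", "d", "a"]
fused_ids = weighted_rrf([vector_ids, bm25_ids], weights=[1.0, 0.8])
print([doc_id for doc_id, _ in fused_ids])
```

Here `b` ends up first: it is second in the vector list and first in the BM25 list, which slightly outweighs `a` being first in one list and third in the other. Only ranks matter, so the raw vector distances and BM25 scores never need to be on a comparable scale.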

## Reference

| Import | Description |
|--------|-------------|
| `SearchEngine` | Main class — orchestrates indexing and search |
| `TableConfig` | Config for a table: embeddings, BM25 targets, exact targets, batch size, pk column, index_id |
| `EmbeddingConfig` | One embedding column: source, target, model, default_weight |
| `BM25Config` | One BM25 column: source_column and default_weight |
| `ExactConfig` | One exact (substring) column: source_column and default_weight |
| `SearchResult` | Search result: id, score, row data |
| `make_openai_compatible_embed_fn` | Build an `Embedder` from any OpenAI-compatible API |
| `serialize_row` | Converts a row dict to embeddable text |
| `reciprocal_rank_fusion` | Merges multiple ranked result lists |

## Contributing & Development

See [docs/CONTRIBUTING.md](docs/CONTRIBUTING.md) and [docs/DEVELOPMENT.md](docs/DEVELOPMENT.md).

## License

See [LICENSE](LICENSE) for details.

## Contact

deepika Team: contact@deepika.ai

Project: [gitlab.com/deepika6190303/deepika-open-toolbox/dot-search](https://gitlab.com/deepika6190303/deepika-open-toolbox/dot-search)
