Metadata-Version: 2.4
Name: dot-search
Version: 0.1.1
Summary: Augment existing database tables with vector and BM25 search
Project-URL: Homepage, https://gitlab.com/deepika6190303/deepika-open-toolbox/dot-search
Project-URL: Repository, https://gitlab.com/deepika6190303/deepika-open-toolbox/dot-search
Project-URL: Issues, https://gitlab.com/deepika6190303/deepika-open-toolbox/dot-search/-/issues
Author-email: deepika Team <contact@deepika.ai>
License: TODO: TO BE COMPLETED
License-File: LICENSE
Keywords: bm25,deepika,embeddings,hybrid,open-toolbox,search,vector
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.12
Requires-Dist: aiosqlite>=0.20
Requires-Dist: asyncpg>=0.29
Requires-Dist: httpx>=0.27
Requires-Dist: pgvector>=0.3
Requires-Dist: sqlalchemy[asyncio]>=2.0
Description-Content-Type: text/markdown

# dot-search

![Python Version](https://img.shields.io/badge/python-3.12%2B-blue)

**dot-search** augments existing database tables with vector search, BM25 keyword search, and exact substring matching. Results from multiple strategies are fused via Reciprocal Rank Fusion (RRF).

## Prerequisites

### PostgreSQL (production)

Requires the [pgvector](https://github.com/pgvector/pgvector) and [ParadeDB pg_search](https://github.com/paradedb/paradedb) extensions, enabled in the target database:

```sql
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_search;
```

### SQLite (testing only)

Uses [sqlite-vec](https://github.com/asg017/sqlite-vec) for vector search. BM25 is not supported on SQLite.

## Install

```bash
pip install dot-search
```

## Environment variables

| Variable | Value / example |
|---|---|
| `DOT__EMBED_API_KEY` | Bearer token |
| `DOT__EMBED_BASE_URL` | `https://api.openai.com/v1` |
| `DOT__EMBED_MODEL` | `text-embedding-3-small` |
| `DOT__EMBED_DIMENSION` | `1536` |

These settings work with any OpenAI-compatible embeddings API (OpenAI, vLLM, TGI, Ollama, etc.).
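To catch a missing variable before indexing starts, a small pre-flight check can help. The sketch below is not part of dot-search: `load_embed_settings` and `build_embedding_request` are hypothetical helpers, and treating all four variables as required is an assumption. The request shape (`POST {base_url}/embeddings` with a Bearer token and a `model`/`input` body) follows the OpenAI embeddings convention; how dot-search itself issues the call is not shown here.

```python
import os

# The four variables from the table above; treating all of them as
# required is an assumption made for this sketch.
REQUIRED_VARS = (
    "DOT__EMBED_API_KEY",
    "DOT__EMBED_BASE_URL",
    "DOT__EMBED_MODEL",
    "DOT__EMBED_DIMENSION",
)


def load_embed_settings(env=None):
    """Return the embedding settings as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "api_key": env["DOT__EMBED_API_KEY"],
        "base_url": env["DOT__EMBED_BASE_URL"].rstrip("/"),
        "model": env["DOT__EMBED_MODEL"],
        "dimension": int(env["DOT__EMBED_DIMENSION"]),
    }


def build_embedding_request(settings, texts):
    """Describe the POST that an OpenAI-compatible /embeddings endpoint expects."""
    return {
        "url": settings["base_url"] + "/embeddings",
        "headers": {"Authorization": "Bearer " + settings["api_key"]},
        "json": {"model": settings["model"], "input": list(texts)},
    }


# Example with an explicit mapping instead of the real environment.
example_env = {
    "DOT__EMBED_API_KEY": "sk-test",
    "DOT__EMBED_BASE_URL": "https://api.openai.com/v1",
    "DOT__EMBED_MODEL": "text-embedding-3-small",
    "DOT__EMBED_DIMENSION": "1536",
}
settings = load_embed_settings(example_env)
request = build_embedding_request(settings, ["neural networks"])
```

Calling `load_embed_settings()` with no argument reads from `os.environ`, so the same check can run at application start-up.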

## Minimal usage

```python
import asyncio
from dot_search import SearchEngine

engine = SearchEngine(db_url="postgresql+asyncpg://user:pass@localhost/mydb")

async def main():
    # Index: all columns are serialized automatically; the config is saved to ds_configs
    await engine.index(table="documents")

    # Search
    results = await engine.search("neural networks", "documents", limit=5)
    for r in results:
        print(r.id, r.score)

asyncio.run(main())
```

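The config is persisted in `ds_configs`. After the first `index()` call, you can re-index by `index_id` alone, without passing the config again:

```python
# Inside an async function: reuses the config stored for this index_id
await engine.index(index_id="documents")
```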

## Power usage

```python
import asyncio
from dot_search import SearchEngine, TableConfig, EmbeddingConfig, BM25Config, ExactConfig

engine = SearchEngine(db_url="postgresql+asyncpg://user:pass@localhost/mydb")

async def main():
    # --- Index: multiple embeddings + BM25 + exact ---
    await engine.index(config=TableConfig(
        table="articles",
        embeddings=[
            EmbeddingConfig(source_column="body", default_weight=1.0),
            EmbeddingConfig(source_column="title", default_weight=0.5),
            EmbeddingConfig(
                source_column=None,                    # serialize all columns into one vector
                target_column="ds_articles_all",       # required when source_column=None
                default_weight=0.3,
            ),
        ],
        bm25=[
            BM25Config(source_column="title", default_weight=0.8),
            BM25Config(source_column="body", default_weight=1.0),
        ],
        exact=[ExactConfig(source_column="name")],
    ))

    # --- Hybrid search (vector + BM25 + exact, fused via RRF) ---
    results = await engine.search(
        "fermentation and gut health",
        "articles",
        limit=10,
        where="published_year >= 2022 AND topic = 'health'",
        embedding_weights={
            "ds_articles_body_default": 1.0,
            "ds_articles_title_default": 0.3,
            "ds_articles_all": 0.1,
        },
        bm25_weights={"title": 0.5, "body": 2.0},
    )

    # --- BM25-only ---
    results = await engine.search("fermentation", "articles", strategy="bm25")

    # --- Exact substring only ---
    results = await engine.search("Dupont", "articles", strategy="exact")

    # --- Multiple indexes on the same table ---
    await engine.index(config=TableConfig(
        table="articles",
        index_id="article_titles",
        embeddings=[EmbeddingConfig(source_column="title")],
    ))
    results = await engine.search("gut health", "article_titles", limit=5)

asyncio.run(main())
```
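For intuition about how the hybrid call above can combine its per-strategy rankings, here is a minimal sketch of weighted Reciprocal Rank Fusion. It is not dot-search's internal code: the `rrf_fuse` helper, the constant `k=60` (a common choice for RRF), and the way each weight multiplies its term are assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, weights=None, k=60):
    """Fuse ranked id lists.

    score(id) = sum over rankings of weight / (k + rank), with rank starting at 1.
    Ids missing from a ranking contribute nothing for that ranking.
    """
    weights = weights or {}
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        weight = weights.get(name, 1.0)  # unweighted rankings count once
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], str(item[0])))


# Hypothetical per-strategy results (row ids, best match first).
rankings = {
    "vector": [3, 1, 2],
    "bm25": [1, 3, 4],
    "exact": [1],
}
fused = rrf_fuse(rankings)
fused_ids = [doc_id for doc_id, _ in fused]

# Doubling the weight of the vector ranking boosts its top hit.
vector_heavy = rrf_fuse(rankings, weights={"vector": 2.0})
```

Because RRF only looks at ranks, the raw scores of the individual strategies (cosine distance, BM25 score) never need to be normalized against each other.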

## Search strategies

| Strategy | What it uses |
|----------|-------------|
| `"hybrid"` | Vector + BM25 + exact (any configured), fused via RRF (default) |
| `"vector"` | Vector similarity only |
| `"bm25"` | BM25 keyword search only |
| `"exact"` | Substring (`LIKE`) search only |
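As a rough picture of what a `LIKE`-based substring search involves, the sketch below runs one against an in-memory SQLite table using only the standard library. The `escape_like` and `exact_search` helpers, the table contents, and the choice of `\` as the escape character are assumptions for illustration; the query dot-search actually generates is not shown here.

```python
import sqlite3


def escape_like(term, escape="\\"):
    """Escape LIKE wildcards (% and _) so the term is matched literally."""
    return (
        term.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


def exact_search(conn, table, column, term, limit=10):
    """Return (rowid, value) pairs whose column contains term as a substring."""
    # Identifiers cannot be bound as parameters: table and column must come
    # from trusted configuration, never from user input.
    sql = (
        f"SELECT rowid, {column} FROM {table} "
        f"WHERE {column} LIKE ? ESCAPE '\\' ORDER BY rowid LIMIT ?"
    )
    return conn.execute(sql, (f"%{escape_like(term)}%", limit)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (name TEXT)")
conn.executemany(
    "INSERT INTO articles (name) VALUES (?)",
    [("Marie Dupont",), ("Jean Durand",), ("100%_done",)],
)

by_name = exact_search(conn, "articles", "name", "Dupont")
literal_wildcards = exact_search(conn, "articles", "name", "%_")
```

Without the escaping step, a query containing `%` or `_` would act as a wildcard pattern and match far more rows than the literal substring.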

## Contributing & Development

See [docs/CONTRIBUTING.md](docs/CONTRIBUTING.md) and [docs/DEVELOPMENT.md](docs/DEVELOPMENT.md).

## License

See [LICENSE](LICENSE) for details.

## Contact

deepika Team: contact@deepika.ai

Project: [gitlab.com/deepika6190303/deepika-open-toolbox/dot-search](https://gitlab.com/deepika6190303/deepika-open-toolbox/dot-search)
