Metadata-Version: 2.4
Name: endee-model
Version: 0.1.0
Summary: Endee model for sparse embedding generation
Author-email: Endee Labs <dev@endee.io>
License: MIT
Project-URL: Documentation, https://docs.endee.io
Keywords: vector database,embeddings,machine learning,AI,keyword search,bm25,sparse
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: numpy<2.3,>=1.26.0
Requires-Dist: mmh3>=4.0.0
Requires-Dist: nltk>=3.8.0

# Endee Model

A Python library for generating **sparse text embeddings** using the BM25 algorithm. It is designed for integration with vector databases, enabling efficient keyword-based search alongside dense embeddings.

## Installation

```bash
pip install endee-model
```

## Quick Start

```python
from endee_model import SparseModel

model = SparseModel(model_name="endee/bm25")

documents = [
    "The quick brown fox jumps over the lazy dog",
    "Machine learning enables computers to learn from data",
]

for embedding in model.embed(documents):
    print(embedding.as_dict())  # {token_id: weight, ...}
```

## Usage

### Embed Documents

```python
from endee_model import SparseModel

model = SparseModel(model_name="endee/bm25")

documents = ["first document text", "second document text"]

# Returns a generator — iterate to get SparseEmbedding objects
for embedding in model.embed(documents, batch_size=256):
    sparse_dict = embedding.as_dict()       # {int: float}
    sparse_obj  = embedding.as_object()     # {'indices': array, 'values': array}
```

### Embed Queries

```python
query = "search query text"

for embedding in model.query_embed(query):
    print(embedding.as_dict())
```

### Count Tokens

```python
count = model.token_count("some text here")
print(f"Token count: {count}")
```

### Work with SparseEmbedding Directly

```python
from endee_model import SparseEmbedding

# Create from a {token_id: weight} dictionary
embedding = SparseEmbedding.from_dict({100: 0.5, 200: 0.8, 300: 1.2})

embedding.as_dict()    # {100: 0.5, 200: 0.8, 300: 1.2}
embedding.as_object()  # {'indices': array([100, 200, 300]), 'values': array([0.5, 0.8, 1.2])}
```
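
The two representations carry the same information. As a rough illustration of how they relate, the sketch below converts between them using plain Python lists in place of NumPy arrays. `dict_to_object` and `object_to_dict` are hypothetical helpers written for this example; they are not part of `endee_model`.

```python
def dict_to_object(weights):
    """Split a {token_id: weight} mapping into parallel index/value lists."""
    indices = list(weights.keys())
    values = [weights[i] for i in indices]
    return {"indices": indices, "values": values}


def object_to_dict(obj):
    """Rebuild the {token_id: weight} mapping from the parallel lists."""
    return dict(zip(obj["indices"], obj["values"]))


obj = dict_to_object({100: 0.5, 200: 0.8, 300: 1.2})
roundtrip = object_to_dict(obj)
```

The parallel-list layout is the shape most sparse vector indexes expect on upsert, while the dictionary form is easier to inspect by hand.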

## Configuration

### SparseModel Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `model_name` | required | Model identifier (use `"endee/bm25"`) |
| `cache_dir` | `None` | Custom cache directory (see [Cache](#cache)) |
| `k` | `1.2` | BM25 term-frequency saturation parameter; larger values let repeated terms keep adding weight for longer |
| `b` | `0.75` | Length normalization factor (`0` = none, `1` = full) |
| `language` | `"english"` | Language for Snowball stemmer |
| `max_token_len` | `40` | Tokens longer than this are discarded |
| `disable_stemmer` | `False` | Skip stemming, so languages without a Snowball stemmer can still be processed using NLTK stopword removal alone |

```python
model = SparseModel(
    model_name="endee/bm25",
    k=1.5,
    b=0.8,
    language="english",
)
```

### Available Languages

```python
from endee_model.sparse.bm25 import bm25_languages

print(bm25_languages())  # List of supported Snowball stemmer languages
```

### Cache

NLTK resources and model files are cached locally. The cache location is resolved in this order:

1. `cache_dir` argument passed to `SparseModel`
2. `ENDEE_CACHE_PATH` environment variable
3. Default: `{system_tmp}/endee_cache`

```bash
export ENDEE_CACHE_PATH=/path/to/custom/cache
```
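
The precedence above can be sketched in a few lines. This is only an illustration of the documented order, not the library's actual code: `resolve_cache_dir` and its `env` parameter are hypothetical names introduced here.

```python
import os
import tempfile


def resolve_cache_dir(cache_dir=None, env=None):
    """Return the cache directory following the documented precedence."""
    env = os.environ if env is None else env
    if cache_dir:
        # 1. An explicit argument wins.
        return str(cache_dir)
    env_path = env.get("ENDEE_CACHE_PATH")
    if env_path:
        # 2. Otherwise use the environment variable, if set.
        return env_path
    # 3. Fall back to a folder under the system temp directory.
    return os.path.join(tempfile.gettempdir(), "endee_cache")
```

Passing the environment as a parameter keeps the function easy to test without mutating `os.environ`.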

## Requirements

- Python >= 3.6 as declared in the package metadata (in practice, numpy >= 1.26.0 requires Python 3.9 or newer)
- [numpy](https://numpy.org/) >= 1.26.0, < 2.3
- [mmh3](https://github.com/hajimes/mmh3) >= 4.0.0
- [nltk](https://www.nltk.org/) >= 3.8.0

## How It Works

1. **Normalization**: punctuation is stripped, stopwords removed, oversized tokens discarded
2. **Stemming**: tokens are reduced to stems using the Snowball stemmer (optional)
3. **BM25 weights**: term-frequency weights are computed using the BM25 TF formula:

   ```
   tf_weight = tf * (k + 1) / (tf + k * (1 - b + b * (doc_len / avg_len)))
   ```
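
The steps above can be sketched with the standard library alone. This is a simplified illustration under several assumptions, not the library's implementation: the stopword set is a tiny stand-in for the NLTK list, lowercasing is assumed, stemming is skipped, `doc_len` is assumed to be counted after filtering, and `avg_len` is passed in because this README does not say how it is obtained. The function names are hypothetical.

```python
import string
from collections import Counter

# Tiny illustrative stopword list; the library takes its stopwords from NLTK.
STOPWORDS = {"the", "a", "an", "over", "to", "from", "of", "and"}


def normalize(text, max_token_len=40):
    """Lowercase, strip punctuation, drop stopwords and oversized tokens."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [t for t in tokens if t not in STOPWORDS and len(t) <= max_token_len]


def bm25_tf_weights(tokens, avg_len, k=1.2, b=0.75):
    """Apply the BM25 TF formula to each distinct token of one document."""
    doc_len = len(tokens)  # assumption: length measured after filtering
    norm = k * (1 - b + b * (doc_len / avg_len))
    return {tok: tf * (k + 1) / (tf + norm) for tok, tf in Counter(tokens).items()}


tokens = normalize("The quick brown fox jumps over the lazy dog!")
weights = bm25_tf_weights(tokens, avg_len=len(tokens))
```

When a document has exactly the average length and each term occurs once, the normalization term reduces to `k`, so every weight comes out as `(k + 1) / (1 + k) = 1`.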

> **Note:** BM25 IDF weighting must be applied on the vector index side. This library outputs TF weights only.
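
To show what applying IDF on the index side might look like, here is a minimal sketch. It assumes the index tracks the total number of documents and a per-token document frequency, and it uses one common BM25 IDF variant; a real vector database may use a different formula and scoring path. `idf`, `bm25_score`, and the statistics below are hypothetical.

```python
import math


def idf(num_docs, doc_freq):
    """A common BM25 IDF variant; it is always non-negative."""
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(query_ids, doc_weights, doc_freqs, num_docs):
    """Sum IDF * TF weight over the query tokens present in the document."""
    return sum(
        idf(num_docs, doc_freqs[tid]) * doc_weights[tid]
        for tid in query_ids
        if tid in doc_weights
    )


# Hypothetical index statistics: 1000 documents; token 100 is rare, 200 is common.
doc_freqs = {100: 5, 200: 900}
doc_weights = {100: 1.0, 200: 1.0}  # TF weights of the kind this library emits
score_rare = bm25_score([100], doc_weights, doc_freqs, num_docs=1000)
score_common = bm25_score([200], doc_weights, doc_freqs, num_docs=1000)
```

With equal TF weights, the rare token contributes far more to the score than the common one, which is the effect the TF-only output lacks until the index applies IDF.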

## License

MIT License
