Metadata-Version: 2.4
Name: ottmlt
Version: 0.1.0
Summary: More-Like-This recommendation engine for OTT/Media platforms using NLP
Author-email: Quickplay Media - Data Science <ds@quickplay.com>
License: MIT
Project-URL: Homepage, https://github.com/yourusername/ottmlt
Project-URL: Repository, https://github.com/yourusername/ottmlt
Project-URL: Documentation, https://ottmlt.readthedocs.io
Project-URL: Bug Tracker, https://github.com/yourusername/ottmlt/issues
Keywords: recommendation,ott,media,nlp,more-like-this,content-based-filtering,tfidf,streaming
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Multimedia :: Video
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: scipy>=1.7.0
Provides-Extra: embedding
Requires-Dist: sentence-transformers>=2.2.0; extra == "embedding"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Provides-Extra: all
Requires-Dist: ottmlt[embedding]; extra == "all"
Dynamic: license-file

# ottmlt — More-Like-This for OTT Platforms

> NLP-powered "more like this" recommendations for streaming / media catalogs.
> Built by a Data Scientist who spent years improving recommendations at a major OTT platform.

[![PyPI version](https://badge.fury.io/py/ottmlt.svg)](https://badge.fury.io/py/ottmlt)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Tests](https://github.com/yourusername/ottmlt/actions/workflows/ci.yml/badge.svg)](https://github.com/yourusername/ottmlt/actions)

---

## What is it?

**ottmlt** (**OTT** **M**ore **L**ike **T**his) is a lightweight Python library that answers the question:

> *"A user just finished watching* Inception. *What should we show them next?"*

It analyses your content catalog — titles, descriptions, genres, cast, directors — and
finds the most semantically similar items using NLP techniques.

```
User watches: Inception (2010)
              ↓
ottmlt finds: The Dark Knight (2008)  — same director, same genre
              Interstellar (2014)     — same director, sci-fi
              The Prestige (2006)     — same director, shared cast
              The Matrix (1999)       — same genre, same themes
              ...
```

---

## Why ottmlt?

| Feature | ottmlt |
|---|---|
| Zero GPU required | TF-IDF model runs on any machine |
| Semantic search | Optional embedding model (sentence-transformers) |
| OTT-aware | Designed for title/genre/cast/director metadata |
| Filtering | Restrict results by genre, language, type, etc. |
| Simple API | `fit()` → `recommend()` — that's it |
| PyPI installable | `pip install ottmlt` |
| Open source | MIT licensed |

---

## Installation

```bash
# Minimal install (TF-IDF only — recommended for most use cases)
pip install ottmlt

# With semantic embedding support (requires ~90 MB model download)
pip install "ottmlt[embedding]"
```

**Requirements:** Python 3.8+, numpy, pandas, scikit-learn, scipy

---

## Quick Start

```python
import pandas as pd
from ottmlt import MoreLikeThis
from ottmlt.utils.data import load_sample_catalog

# 1. Load your catalog (must have an 'id' column + text fields)
catalog = load_sample_catalog()          # bundled 50-title sample
# -- or load your own catalog instead --
# catalog = pd.read_csv("my_catalog.csv")  # your real OTT catalog

# 2. Create and fit the recommender
mlt = MoreLikeThis(
    text_fields=["title", "description", "genre", "cast", "director"],
    field_weights={"title": 3, "genre": 2, "director": 2},  # boost important fields
)
mlt.fit(catalog)

# 3. Get recommendations
recs = mlt.recommend("tt0468569", n=5)  # The Dark Knight
print(recs[["id", "title", "genre", "similarity_score"]])
```

**Output:**
```
          id          title             genre  similarity_score
0  tt0372784  Batman Begins  Action|Adventure            0.8123
1  tt1375666      Inception     Action|Sci-Fi            0.6541
2  tt0482571   The Prestige      Drama|Sci-Fi            0.6102
3  tt0816692   Interstellar  Adventure|Sci-Fi            0.5873
4  tt0110912   Pulp Fiction       Crime|Drama            0.3124
```

---

## Catalog Format

Your catalog should be a `pandas.DataFrame`. The only required column is `id`.
Any text columns can be used for similarity.

| Column | Type | Description | Example |
|---|---|---|---|
| `id` | str/int | Unique item identifier | `"tt0468569"` |
| `title` | str | Content title | `"The Dark Knight"` |
| `description` | str | Synopsis or plot summary | `"When the Joker..."` |
| `genre` | str | Genre tags | `"Action\|Crime\|Drama"` |
| `cast` | str | Actor names | `"Christian Bale\|Heath Ledger"` |
| `director` | str | Director name(s) | `"Christopher Nolan"` |
| `language` | str | Content language | `"English"` |
| `content_type` | str | Movie / Series / etc. | `"Movie"` |

Pipe-separated (`|`) or comma-separated values in text fields are handled automatically.
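
As a rough illustration of what splitting multi-valued fields can look like, here is a minimal sketch. The helper name `split_multi_value` is hypothetical and not part of the ottmlt API; the library's own handling may differ (for example, in how it treats commas inside free-text descriptions).

```python
import re


def split_multi_value(value):
    """Split a pipe- or comma-separated string into clean tokens.

    Hypothetical helper for illustration only; not an ottmlt function.
    """
    if not isinstance(value, str):
        return []  # treat None / NaN / non-string values as empty
    parts = re.split(r"[|,]", value)
    return [part.strip() for part in parts if part.strip()]


genres = split_multi_value("Action|Crime|Drama")
cast = split_multi_value("Christian Bale, Heath Ledger")
```

A splitter like this suits tag-like columns such as `genre` or `cast`; applying it to a synopsis would also break sentences at every comma.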

---

## Models

### 1. TF-IDF (default)

The fastest model — no GPU, no large downloads.
Excellent for catalogs where genre tags and director names are consistent.

```python
mlt = MoreLikeThis(model="tfidf")  # default
```

**How it works:**
1. Combines your text fields into a weighted "soup" string per item
2. Builds a TF-IDF matrix (rare tokens like director names get higher weight)
3. Uses cosine similarity to find nearest neighbours
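
The steps above can be sketched directly with scikit-learn. This is an illustration of the general technique under assumed inputs (the three toy soup strings are made up), not the internals of ottmlt's TF-IDF model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy "soup" strings, one per catalog item (illustrative data only).
soups = [
    "inception inception christopher nolan sci-fi dream heist",
    "interstellar christopher nolan sci-fi space",
    "pulp fiction quentin tarantino crime",
]

# Unigrams + bigrams with log-scaled term frequency.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
matrix = vectorizer.fit_transform(soups)  # rows are L2-normalised by default

# For L2-normalised rows, the dot product equals cosine similarity.
scores = linear_kernel(matrix[0], matrix).ravel()
scores[0] = -1.0  # exclude the query item itself
best = int(scores.argmax())
```

Here `best` points at the second soup, which shares the director and genre tokens with the query, while the third soup has no terms in common and scores zero.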

### 2. Embedding (semantic)

Better at finding *thematically* similar content, even when items share no keywords.

```bash
pip install "ottmlt[embedding]"
```

```python
mlt = MoreLikeThis(model="embedding")
```

Uses `all-MiniLM-L6-v2` (384-dim, sentence-transformers) by default.
Configurable via `model_kwargs`:

```python
mlt = MoreLikeThis(
    model="embedding",
    model_name="all-mpnet-base-v2",  # larger, slower, more accurate
    batch_size=128,
)
```

### 3. Hybrid

Best of both worlds — blends TF-IDF keyword overlap with semantic similarity.

```python
mlt = MoreLikeThis(
    model="hybrid",
    alpha=0.6,   # 60% TF-IDF + 40% Embedding
)
```

`alpha=1.0` → pure TF-IDF. `alpha=0.0` → pure Embedding.
If `sentence-transformers` is not installed, the hybrid model automatically falls back to TF-IDF.
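
The blend described by `alpha` amounts to a weighted average of two score vectors. The sketch below shows that arithmetic only; `blend_scores` is a hypothetical name, and whether ottmlt rescales either score before blending is not stated here, so treat that as an open assumption.

```python
import numpy as np


def blend_scores(tfidf_scores, embedding_scores, alpha=0.6):
    """Weighted blend: alpha * TF-IDF + (1 - alpha) * embedding.

    Hypothetical helper for illustration; assumes both inputs are
    similarity scores on a comparable scale.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    tfidf = np.asarray(tfidf_scores, dtype=float)
    embedding = np.asarray(embedding_scores, dtype=float)
    if tfidf.shape != embedding.shape:
        raise ValueError("score arrays must have the same shape")
    return alpha * tfidf + (1.0 - alpha) * embedding


blended = blend_scores([1.0, 0.0], [0.0, 1.0], alpha=0.6)
```

With `alpha=0.6`, an item that only matches on keywords ends up at 0.6 and an item that only matches semantically ends up at 0.4.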

---

## API Reference

### `MoreLikeThis`

```python
MoreLikeThis(
    text_fields=["title", "description", "genre", "cast", "director"],
    field_weights={"title": 3, "genre": 2, "director": 2},
    model="tfidf",          # "tfidf" | "embedding" | "hybrid"
    id_col="id",            # column used as item identifier
    **model_kwargs          # forwarded to the underlying model
)
```

#### `.fit(catalog: pd.DataFrame) → self`

Fit the recommender on your catalog. The catalog must contain the column named by `id_col`.

#### `.recommend(item_id, n=10, filters=None) → pd.DataFrame`

Return the top-N most similar items as catalog rows plus a `similarity_score` column.

```python
# Basic
recs = mlt.recommend("tt0468569", n=10)

# With filters — only recommend English Drama films
recs = mlt.recommend(
    "tt0468569",
    n=10,
    filters={"language": "English", "genre": "Drama"},
)

# filters support lists too
recs = mlt.recommend(
    "tt0468569",
    n=10,
    filters={"language": ["English", "French"]},
)
```

#### `.get_item(item_id) → pd.Series`

Return the catalog row for a single item.

#### `.catalog` property

Access the fitted catalog as a DataFrame.

---

### `Preprocessor`

```python
from ottmlt.core.preprocessor import Preprocessor

prep = Preprocessor(
    text_fields=["title", "genre"],
    field_weights={"title": 3},
)
soups = prep.transform(catalog_df)  # list of str, one per row
```
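
One common way to build a field-weighted soup is to repeat a field's text as many times as its weight, so that TF-IDF sees those tokens more often. The sketch below assumes that approach; `build_soups` is a hypothetical function, and the actual `Preprocessor` may clean or weight text differently.

```python
import pandas as pd


def build_soups(df, text_fields, field_weights=None):
    """Return one lowercase soup string per row.

    Weighting by repetition is an assumption made for illustration,
    not a description of ottmlt's Preprocessor.
    """
    field_weights = field_weights or {}
    soups = []
    for _, row in df.iterrows():
        parts = []
        for field in text_fields:
            value = row.get(field)
            if pd.isna(value):
                continue  # skip missing columns and null cells
            text = str(value).lower().replace("|", " ")
            weight = int(field_weights.get(field, 1))
            parts.extend([text] * weight)
        soups.append(" ".join(parts))
    return soups


demo = pd.DataFrame({"title": ["Heat"], "genre": ["Crime|Drama"]})
demo_soups = build_soups(demo, ["title", "genre"], {"title": 3})
```

The title appears three times in the resulting soup and the genre tags once, which is what gives the title more influence on similarity.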

---

### `TFIDFRecommender` (low-level)

```python
from ottmlt.models.tfidf import TFIDFRecommender

model = TFIDFRecommender(
    max_features=None,      # vocabulary size limit (None = unlimited)
    ngram_range=(1, 2),     # unigrams + bigrams
    min_df=1,               # minimum document frequency
    sublinear_tf=True,      # log-scale TF
)
model.fit(list_of_soups)
results = model.get_similar(query_idx=0, candidate_indices=[1,2,3,...], top_n=10)
# Returns: [(idx, score), ...]
```

---

## Advanced Usage

### Large catalogs (50k+ titles)

For catalogs with 50,000+ titles, limit the vocabulary to keep memory low:

```python
mlt = MoreLikeThis(
    model="tfidf",
    max_features=50_000,   # forwarded to TFIDFRecommender
    ngram_range=(1, 1),    # unigrams only — faster
)
```
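
Memory also stays low when similarities are computed one query row at a time rather than as a full N×N matrix. The following is a sketch of that idea with SciPy and NumPy; `top_n_similar` is a hypothetical name and the function assumes the rows of the matrix are already L2-normalised.

```python
import numpy as np
from scipy import sparse


def top_n_similar(matrix, query_idx, n=10):
    """Top-n neighbours of one row without building an N x N matrix.

    Assumes `matrix` is a CSR matrix with L2-normalised rows, so the
    dot product equals cosine similarity. Illustrative sketch only.
    """
    # One sparse matrix-vector product: an N-length score vector.
    scores = (matrix @ matrix[query_idx].T).toarray().ravel()
    scores[query_idx] = -np.inf  # never recommend the query itself
    n = min(n, scores.size - 1)
    if n <= 0:
        return []
    # argpartition finds the n best in O(N); only those n are sorted.
    top = np.argpartition(-scores, n - 1)[:n]
    top = top[np.argsort(-scores[top])]
    return [(int(i), float(scores[i])) for i in top]


toy = sparse.csr_matrix(np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]))
neighbours = top_n_similar(toy, query_idx=0, n=2)
```

Each call allocates a single vector of N scores, so peak memory grows linearly with catalog size instead of quadratically.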

### Pre-filtering candidates

Use `filters` to restrict the recommendation pool before similarity scoring.
This is useful for "more like this but in the same language":

```python
recs = mlt.recommend(
    seed_id,
    n=10,
    filters={"language": "Hindi", "content_type": "Series"},
)
```
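
Conceptually, pre-filtering is a boolean mask over the catalog applied before scoring. The sketch below assumes exact-match semantics for scalar values and membership for lists; `apply_filters` is a hypothetical helper, and how ottmlt matches multi-valued columns such as `genre` is not specified here.

```python
import pandas as pd


def apply_filters(catalog, filters=None):
    """Keep only rows that satisfy every filter.

    Scalars use equality and lists use membership; both are
    assumptions made for this illustration.
    """
    mask = pd.Series(True, index=catalog.index)
    for column, wanted in (filters or {}).items():
        if isinstance(wanted, (list, tuple, set)):
            mask &= catalog[column].isin(list(wanted))
        else:
            mask &= catalog[column] == wanted
    return catalog[mask]


sample = pd.DataFrame(
    {
        "id": ["a", "b", "c"],
        "language": ["Hindi", "Hindi", "English"],
        "content_type": ["Series", "Movie", "Series"],
    }
)
candidates = apply_filters(sample, {"language": "Hindi", "content_type": "Series"})
```

Filtering first means similarity only has to be ranked over the surviving candidates, which is also cheaper on large catalogs.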

### Validating your catalog

```python
from ottmlt.utils.data import validate_catalog

validate_catalog(
    catalog,
    text_fields=["title", "description", "genre"],
    id_col="id",
)
# Raises ValueError on duplicate IDs or missing id_col
# Warns about missing text fields or high null rates
```
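
For readers who want similar checks in their own pipeline, here is a minimal sketch of the behaviour described in the comments above. The function name `check_catalog` and the `max_null_rate` threshold are hypothetical; `validate_catalog` itself may use different rules and messages.

```python
import warnings

import pandas as pd


def check_catalog(catalog, text_fields, id_col="id", max_null_rate=0.5):
    """Raise on structural problems, warn on data-quality problems.

    Illustrative sketch; the 0.5 null-rate threshold is an assumption.
    """
    if id_col not in catalog.columns:
        raise ValueError(f"catalog has no '{id_col}' column")
    if catalog[id_col].duplicated().any():
        raise ValueError(f"duplicate values found in '{id_col}'")
    for field in text_fields:
        if field not in catalog.columns:
            warnings.warn(f"text field '{field}' is missing from the catalog")
        elif catalog[field].isna().mean() > max_null_rate:
            warnings.warn(f"text field '{field}' is mostly null")


good = pd.DataFrame({"id": [1, 2], "title": ["Heat", "Ronin"]})
check_catalog(good, text_fields=["title"])  # passes silently
```

Separating hard errors (bad IDs) from warnings (sparse text) lets a fit proceed on imperfect metadata while still flagging it.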

---

## How to Publish Your Own Library to PyPI

This section is for contributors and anyone learning to package Python libraries.

### Step 1 — Structure your project

```
ottmlt/
├── ottmlt/            ← your package code
│   └── __init__.py
├── tests/
├── pyproject.toml     ← modern packaging config (PEP 517/518)
├── README.md
└── LICENSE
```

### Step 2 — `pyproject.toml` (the modern way)

```toml
[build-system]
requires = ["setuptools>=61", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "your-package-name"
version = "0.1.0"
description = "One line about what it does"
readme = "README.md"
license = { text = "MIT" }
requires-python = ">=3.8"
dependencies = ["numpy", "pandas"]

[project.urls]
Homepage = "https://github.com/you/your-package"
```

### Step 3 — Build distribution files

```bash
pip install build
python -m build
# Creates: dist/your_package-0.1.0-py3-none-any.whl
#          dist/your_package-0.1.0.tar.gz
```

### Step 4 — Test on TestPyPI first

```bash
pip install twine
twine upload --repository testpypi dist/*
# Anyone can now: pip install --index-url https://test.pypi.org/simple/ your-package
# (add --extra-index-url https://pypi.org/simple/ if dependencies are missing from TestPyPI)
```

### Step 5 — Publish to real PyPI

```bash
# Register at https://pypi.org first and authenticate with an API token (not a password)
twine upload dist/*
```

### Step 6 — Automate with GitHub Actions

Create `.github/workflows/publish.yml`:

```yaml
name: Publish to PyPI

on:
  push:
    tags: ["v*"]   # triggers on git tag like v0.1.0

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install build twine
      - run: python -m build
      - run: twine upload dist/*
        env:
          TWINE_USERNAME: __token__
          TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
```

Tag a release → PyPI auto-updates:

```bash
git tag v0.1.1
git push origin v0.1.1
```

---

## Project Structure

```
ottmlt/
├── ottmlt/
│   ├── __init__.py              # Public API exports
│   ├── core/
│   │   ├── preprocessor.py      # Text cleaning + field-weighted soup builder
│   │   ├── similarity.py        # Efficient top-N cosine similarity (no N×N matrix)
│   │   └── recommender.py       # MoreLikeThis — the high-level API
│   ├── models/
│   │   ├── tfidf.py             # TF-IDF vectorizer-based model
│   │   ├── embedding.py         # Sentence-Transformer based model
│   │   └── hybrid.py            # Weighted blend of TF-IDF + Embedding
│   ├── utils/
│   │   └── data.py              # load_sample_catalog(), validate_catalog()
│   └── data/
│       └── sample_catalog.csv   # 50-title OTT sample dataset
├── tests/
│   ├── test_preprocessor.py
│   ├── test_models.py
│   └── test_recommender.py
├── pyproject.toml
├── LICENSE
└── README.md
```

---

## Development

```bash
git clone https://github.com/yourusername/ottmlt.git
cd ottmlt
pip install -e ".[dev]"
pytest tests/ -v
```

---

## Contributing

Pull requests are welcome! Please:

1. Fork the repo and create a feature branch
2. Add tests for any new functionality
3. Ensure `pytest tests/` passes
4. Submit a PR with a clear description

---

## Roadmap

- [ ] Collaborative filtering support (user-item interactions)
- [ ] BM25 model option
- [ ] Incremental `partial_fit()` for streaming catalog updates
- [ ] REST API / FastAPI example server
- [ ] Read the Docs documentation

---

## License

MIT — see [LICENSE](LICENSE).

---

*Built with love by a Data Scientist who spent years building recommendation systems for OTT platforms.*
