Metadata-Version: 2.4
Name: winnow-ai
Version: 0.1.0
Summary: Open-source data curation toolkit for ML. Filter, deduplicate, score, and explore your training data.
Author: Winnow AI
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/winnow-ai/winnow
Project-URL: Documentation, https://winnow-ai.github.io/winnow
Project-URL: Repository, https://github.com/winnow-ai/winnow
Project-URL: Issues, https://github.com/winnow-ai/winnow/issues
Project-URL: Changelog, https://github.com/winnow-ai/winnow/blob/main/CHANGELOG.md
Keywords: machine-learning,data-curation,data-quality,deduplication,embeddings,mlops,training-data
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyarrow>=12.0
Requires-Dist: pydantic<3,>=2.0
Requires-Dist: rich>=13.0
Requires-Dist: typer>=0.9
Requires-Dist: fsspec>=2023.6
Requires-Dist: tqdm>=4.60
Requires-Dist: pyyaml>=6.0
Provides-Extra: io
Requires-Dist: datasets>=2.16; extra == "io"
Requires-Dist: pandas>=2.0; extra == "io"
Provides-Extra: nlp
Requires-Dist: lingua-language-detector>=2.0; extra == "nlp"
Requires-Dist: tokenizers>=0.15; extra == "nlp"
Provides-Extra: embeddings
Requires-Dist: sentence-transformers>=2.3; extra == "embeddings"
Requires-Dist: faiss-cpu>=1.7; extra == "embeddings"
Provides-Extra: dedup
Requires-Dist: datasketch>=1.6; extra == "dedup"
Provides-Extra: explore
Requires-Dist: fastapi>=0.104; extra == "explore"
Requires-Dist: uvicorn[standard]>=0.24; extra == "explore"
Provides-Extra: attribution
Requires-Dist: traker>=0.3; extra == "attribution"
Requires-Dist: transformers>=4.36; extra == "attribution"
Requires-Dist: torch>=2.1; extra == "attribution"
Provides-Extra: all
Requires-Dist: winnow-ai[attribution,dedup,embeddings,explore,io,nlp]; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.4; extra == "dev"
Requires-Dist: pytest-cov>=4.1; extra == "dev"
Requires-Dist: pytest-xdist>=3.5; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: types-PyYAML>=6.0; extra == "dev"
Requires-Dist: mypy>=1.8; extra == "dev"
Requires-Dist: pre-commit>=3.5; extra == "dev"
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.5; extra == "docs"
Requires-Dist: mkdocstrings[python]>=0.24; extra == "docs"
Dynamic: license-file

<p align="center">
  <!-- TODO: Replace with actual logo -->
  <img src="https://via.placeholder.com/400x120?text=Winnow" alt="Winnow logo" width="400" />
</p>

<h3 align="center">The open-source workbench for ML training data.</h3>

<p align="center">
  <a href="https://pypi.org/project/winnow-ai/"><img alt="PyPI" src="https://img.shields.io/pypi/v/winnow-ai?color=blue" /></a>
  <a href="https://pypi.org/project/winnow-ai/"><img alt="Python 3.10+" src="https://img.shields.io/pypi/pyversions/winnow-ai" /></a>
  <a href="https://github.com/winnow-ai/winnow/blob/main/LICENSE"><img alt="License: Apache 2.0" src="https://img.shields.io/badge/license-Apache--2.0-green" /></a>
  <a href="https://github.com/winnow-ai/winnow/actions"><img alt="CI" src="https://img.shields.io/github/actions/workflow/status/winnow-ai/winnow/ci.yml?label=CI" /></a>
  <a href="https://winnow-ai.github.io/winnow"><img alt="Docs" src="https://img.shields.io/badge/docs-mkdocs-blue" /></a>
  <a href="https://github.com/winnow-ai/winnow/stargazers"><img alt="GitHub stars" src="https://img.shields.io/github/stars/winnow-ai/winnow?style=social" /></a>
</p>

<p align="center">
  <a href="#quick-start">Quick Start</a> &bull;
  <a href="#installation">Install</a> &bull;
  <a href="#features">Features</a> &bull;
  <a href="https://winnow-ai.github.io/winnow">Docs</a> &bull;
  <a href="#comparison">Compare</a> &bull;
  <a href="CONTRIBUTING.md">Contribute</a>
</p>

---

Winnow is a Python toolkit for curating ML training data. Filter junk, deduplicate, compute embeddings, find outliers, and search your dataset from a single package (`winnow-ai`, plus optional extras), using either the CLI or the Python SDK.

<!-- TODO: Add GIF/screenshot of the web UI once ready -->
<!-- <p align="center"><img src="docs/assets/demo.gif" width="700" /></p> -->

## Installation

```bash
pip install winnow-ai
```

That gives you heuristic filters, exact dedup, and the CLI. For the full toolkit:

```bash
pip install "winnow-ai[all]"    # every optional extra: embeddings, fuzzy dedup, NLP, I/O, web UI, attribution
```

Or pick what you need:

| Extra | What it adds |
|---|---|
| `winnow-ai[embeddings]` | Sentence-transformer embeddings, FAISS search, semantic dedup, anomaly detection |
| `winnow-ai[dedup]` | MinHash-LSH fuzzy deduplication |
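| `winnow-ai[nlp]` | Language detection (lingua), HuggingFace `tokenizers` |
| `winnow-ai[io]` | HuggingFace `datasets`, Pandas |
| `winnow-ai[explore]` | FastAPI and Uvicorn for the web UI (coming soon) |
| `winnow-ai[attribution]` | Data attribution dependencies (`traker`, `transformers`, `torch`) |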
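
To see which extras are already usable in an environment, a small check like the one below may help. It is a sketch, not part of Winnow: `EXTRA_MODULES` and `available_extras` are hypothetical names, and the import names are assumptions based on how each dependency is usually imported (for example, `faiss-cpu` as `faiss` and `traker` as `trak`).

```python
from importlib.util import find_spec

# Assumed import names for the packages behind each extra (not read from Winnow itself).
EXTRA_MODULES = {
    "embeddings": ["sentence_transformers", "faiss"],
    "dedup": ["datasketch"],
    "nlp": ["lingua", "tokenizers"],
    "io": ["datasets", "pandas"],
    "explore": ["fastapi", "uvicorn"],
    "attribution": ["trak", "transformers", "torch"],
}


def available_extras(extra_modules=EXTRA_MODULES):
    """Return {extra: bool}; True only if every module for that extra can be found."""
    return {
        extra: all(find_spec(module) is not None for module in modules)
        for extra, modules in extra_modules.items()
    }


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"winnow-ai[{extra}]: {'installed' if ok else 'missing'}")
```

Because `find_spec` only locates modules without importing them, the check stays fast even when heavy packages such as `torch` are present.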

## Quick Start

```python
from winnow.core.pipeline import Pipeline
from winnow.core.filters import LengthFilter, WhitespaceFilter, RepetitionFilter
from winnow.quality.dedup import ExactDedup

# Build a pipeline
pipeline = Pipeline("my-curation")
pipeline.add_filter(LengthFilter(min_length=50, max_length=100_000))
pipeline.add_filter(WhitespaceFilter(max_whitespace_ratio=0.4))
pipeline.add_filter(RepetitionFilter(max_repetition_ratio=0.3))
pipeline.add_filter(ExactDedup())

# Run it
result = pipeline.run("data/raw.jsonl", "data/curated.parquet")
print(f"Kept {result['total_kept']:,} / {result['total_read']:,} documents")
```

Or from the command line:

```bash
winnow curate data/raw.jsonl data/clean.parquet \
  --min-length 50 --max-whitespace 0.4 --dedup
```

Or with a YAML config:

```bash
winnow curate data/raw.jsonl data/clean.parquet --config pipeline.yaml
```

See [`examples/quickstart.py`](examples/quickstart.py) for a full walkthrough including embeddings, search, and outlier detection.

## Features

### Heuristic Filters
Twelve built-in filters, all composable, all configurable via code or YAML:

| Filter | What it catches |
|---|---|
| `LengthFilter` | Too short / too long documents |
| `WordCountFilter` | Documents outside a word-count range |
| `LineCountFilter` | Documents outside a line-count range |
| `WhitespaceFilter` | Excessive whitespace (formatting junk) |
| `RepetitionFilter` | Repeated n-grams (boilerplate, spam) |
| `SpecialCharFilter` | Special character overload (encoding artifacts) |
| `AlphaFilter` | Low alphabetic ratio (numeric spam, base64) |
| `URLFilter` | URL-heavy documents (link farms) |
| `StopwordFilter` | Missing stopwords (keyword spam, code) |
| `LanguageFilter` | Wrong language (lingua or fastText backend) |
| `FieldExistsFilter` | Missing required fields |
| `RegexFilter` | Custom pattern matching (include or exclude) |
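
To give a feel for what ratio-based heuristics measure, here is a standalone sketch in plain Python. It does not use Winnow and is not its implementation. The helper names are hypothetical, `min_alpha_ratio` is an assumed parameter, and the other thresholds mirror the Quick Start values.

```python
def whitespace_ratio(text: str) -> float:
    """Fraction of characters that are whitespace (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def alpha_ratio(text: str) -> float:
    """Fraction of characters that are alphabetic (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isalpha() for ch in text) / len(text)


def keep(text: str, min_length: int = 50, max_whitespace_ratio: float = 0.4,
         min_alpha_ratio: float = 0.5) -> bool:
    """Keep a document only if it passes the length, whitespace, and alphabetic checks."""
    return (
        len(text) >= min_length
        and whitespace_ratio(text) <= max_whitespace_ratio
        and alpha_ratio(text) >= min_alpha_ratio
    )
```

Tuning a threshold then amounts to changing one number and comparing which documents flip between kept and removed.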

### Deduplication
- **Exact dedup** (SHA-256) — zero dependencies, streaming
- **Fuzzy dedup** (MinHash-LSH) — catches near-duplicates at scale
- **Semantic dedup** (embedding cosine similarity) — finds paraphrases and reworded copies
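
The first two strategies can be illustrated with the standard library alone. The sketch below is not Winnow's code, and `exact_dedup` and `jaccard` are hypothetical helper names. It streams documents through a SHA-256 seen-set, and computes the word-shingle Jaccard similarity that MinHash is designed to estimate.

```python
import hashlib
from typing import Iterable, Iterator


def exact_dedup(texts: Iterable[str]) -> Iterator[str]:
    """Yield each distinct text once, in first-seen order, keeping only digests in memory."""
    seen: set[bytes] = set()
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield text


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity between the sets of word n-grams (shingles) of two texts."""
    def shingles(text: str) -> set:
        words = text.split()
        # Texts shorter than n words contribute a single (shorter) shingle.
        return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)
```

Exact hashing only removes byte-identical texts, so near-duplicates need a similarity measure such as the one `jaccard` computes directly. MinHash-LSH approximates that measure without comparing every pair of documents.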

### Embeddings & Search
- **Compute embeddings** with any sentence-transformer model
- **Semantic search** — FAISS-backed nearest-neighbor search over your dataset
- **Anomaly/outlier detection** — k-NN distance scoring to surface unusual documents
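
As a rough illustration of k-NN distance scoring, the sketch below scores each vector by its mean distance to its `k` nearest neighbours. It is a brute-force, standard-library version for small inputs, not Winnow's implementation. The function name, the Euclidean metric, and the default `k` are assumptions, and an index such as FAISS would replace the inner loop at scale.

```python
import math
from typing import Sequence


def knn_outlier_scores(vectors: Sequence[Sequence[float]], k: int = 2) -> list[float]:
    """Mean Euclidean distance from each vector to its k nearest neighbours.

    Higher scores mean the vector sits farther from the rest of the data.
    Requires at least two vectors.
    """
    scores = []
    for i, v in enumerate(vectors):
        # Distances to every other vector, smallest first.
        dists = sorted(math.dist(v, w) for j, w in enumerate(vectors) if j != i)
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
    scores = knn_outlier_scores(points, k=2)
    print(max(range(len(points)), key=scores.__getitem__))  # index of the most isolated point
```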

### Pipeline Orchestration
- Chain any number of filters in a `Pipeline`
- Configure via Python or YAML
- Per-filter removal stats and throughput reporting
- Reads JSONL, Parquet, CSV, and HuggingFace datasets
- Writes JSONL and Parquet
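
Chaining filters with per-filter removal stats can be sketched as below. This is an assumption about the general shape, not Winnow's `Pipeline` internals. `run_filters` and the `removed_by` key are hypothetical, while `total_read` and `total_kept` mirror the keys used in the Quick Start.

```python
from collections import Counter
from typing import Callable, Iterable


def run_filters(docs: Iterable[str], filters: list[tuple[str, Callable[[str], bool]]]):
    """Apply named predicates in order; a document is dropped by the first one it fails."""
    removed: Counter = Counter()
    kept = []
    total_read = 0
    for doc in docs:
        total_read += 1
        for name, predicate in filters:
            if not predicate(doc):
                removed[name] += 1  # attribute the removal to this filter
                break
        else:
            kept.append(doc)  # no filter rejected the document
    stats = {"total_read": total_read, "total_kept": len(kept), "removed_by": dict(removed)}
    return kept, stats


if __name__ == "__main__":
    filters = [
        ("length", lambda d: len(d) >= 50),
        ("whitespace", lambda d: d.count(" ") / len(d) <= 0.4),
    ]
    kept, stats = run_filters(["short", "x" * 60, " " * 50 + "y" * 20], filters)
    print(stats)
```

Attributing each removal to the first failing filter makes filter order matter, so cheap filters are usually placed first.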

### CLI
Seven commands, zero boilerplate:

```
winnow version     # print version
winnow curate      # run a curation pipeline
winnow stats       # dataset statistics
winnow embed       # compute embeddings
winnow search      # semantic search
winnow outliers    # find anomalous documents
winnow explore     # launch web UI (coming soon)
```

## Comparison

Winnow is a **data workbench** — interactive, exploratory, designed for iteration. Pipeline engines like DataTrove and Data-Juicer are great for scheduled batch processing at massive scale. If you need to understand your data, experiment with filter thresholds, and investigate what you are keeping and discarding, Winnow is the right tool.

| | Winnow | DataTrove | Data-Juicer |
|---|:---:|:---:|:---:|
| Interactive exploration | Yes | No | Limited |
| Semantic search | Yes | No | No |
| Outlier detection | Yes | No | No |
| Embedding-based dedup | Yes | MinHash only | MinHash only |
| CLI + Python SDK | Both | Python only | Both |
| YAML config | Yes | No (Python) | Yes |
| HuggingFace datasets | Yes | Yes | Yes |
| Spark/distributed | Roadmap | Yes | Yes |
| Web UI | Coming soon | No | Yes |

## Examples

- [`examples/quickstart.py`](examples/quickstart.py) — Full workflow: load, filter, dedup, embed, search, outliers
- [`examples/basic_pipeline.yaml`](examples/basic_pipeline.yaml) — Simple YAML config
- [`examples/advanced_pipeline.yaml`](examples/advanced_pipeline.yaml) — All available filters

## Documentation

Full docs at **[winnow-ai.github.io/winnow](https://winnow-ai.github.io/winnow)** (coming soon).

## Contributing

We welcome contributions. See [CONTRIBUTING.md](CONTRIBUTING.md) for setup instructions.

## License

Apache 2.0 — see [LICENSE](LICENSE).
