Metadata-Version: 2.1
Name: text-dedup
Version: 0.0.5
Summary: Text deduplication with fuzzy match and more
License: MIT
Author: Chenghao Mou
Author-email: mouchenghao@gmail.com
Requires-Python: >=3.8,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Dist: alive-progress (>=1.6.2,<2.0.0)
Requires-Dist: datasets (>=1.5.0,<2.0.0)
Requires-Dist: pandas (>=1.2.3,<2.0.0)
Requires-Dist: pytest (>=6.2.2,<7.0.0)
Requires-Dist: pytest-benchmark (>=3.2.3,<4.0.0)
Requires-Dist: sentence-transformers (>=1.0.3,<2.0.0)
Requires-Dist: strsimpy (>=0.2.0,<0.3.0)
Description-Content-Type: text/markdown

# text-dedup
![PyPI](https://img.shields.io/pypi/v/text-dedup?style=plastic)

Text deduplication with fuzzy matching and more. This project is a work in progress (WIP).

## Usage

1. Group near duplicates
```python
import pandas as pd
from text_dedup.dedupers import EditDistanceSimilarityDeduper
from text_dedup import group_duplicates

df = pd.read_csv(...)
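# Label each row with the near-duplicate group it belongs to
df_groups = group_duplicates(
    df,
    deduper=EditDistanceSimilarityDeduper(
        similarity_metric="cosine",
        threshold=0.8,
        k=3,
    ),
    column="text",
    target_column="__group_label__",
)

# Inspect the group sizes on the returned frame
df_groups["__group_label__"].value_counts(dropna=False)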
```

2. Remove near duplicates
```python
import pandas as pd
from text_dedup.dedupers import EditDistanceSimilarityDeduper
from text_dedup import drop_duplicates

df = pd.read_csv(...)
df_dedup = drop_duplicates(
    df,
    deduper=EditDistanceSimilarityDeduper(
        similarity_metric="cosine",
        threshold=0.8,
        k=3,
    ),
    column="text",
)

# Holds only when the input actually contained near duplicates
assert df.shape != df_dedup.shape
```

3. Remove semantically similar duplicates
```python
import pandas as pd
from text_dedup.dedupers import PretrainedBERTEmbeddingDeduper
from text_dedup import drop_duplicates

df = pd.read_csv(...)
threshold = 0.8  # example value (an assumption); tune it for your data
df_dedup = drop_duplicates(
    df,
    deduper=PretrainedBERTEmbeddingDeduper(
        model='paraphrase-distilroberta-base-v1',
        threshold=threshold,
    ),
    column="text",
)
```
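
To make the role of `threshold` in the examples above more concrete, here is a minimal, standard-library-only sketch of threshold-based grouping. It is not how `text-dedup` is implemented: the function name `group_near_duplicates` is hypothetical, and it substitutes `difflib.SequenceMatcher` for the library's dedupers. Texts whose similarity ratio reaches the threshold, directly or through a chain of similar texts, receive the same group label.

```python
from difflib import SequenceMatcher


def group_near_duplicates(texts, threshold=0.8):
    """Return one group label per text (hypothetical helper, not part of text-dedup).

    Two texts share a label when their SequenceMatcher ratio is >= threshold,
    either directly or transitively through other texts.
    """
    # Union-find: parent[i] points towards the representative of i's group
    parent = list(range(len(texts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Compare every pair; this is O(n^2) and only suitable for small inputs
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if SequenceMatcher(None, texts[i], texts[j]).ratio() >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller index as the group representative
                    parent[max(ri, rj)] = min(ri, rj)

    return [find(i) for i in range(len(texts))]


texts = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumped over the lazy dog",
    "An entirely different sentence",
]
labels = group_near_duplicates(texts)
print(labels)  # [0, 0, 2]
```

Keeping only the first row of each label gives a deduplicated list; raising the threshold makes the grouping stricter, lowering it merges more loosely related texts.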

## Installation
```bash
pip install text-dedup
```
