Metadata-Version: 2.1
Name: text-dedup
Version: 0.1.1
Summary: All-in-one text deduplication
License: MIT
Author: Chenghao Mou
Author-email: mouchenghao@gmail.com
Requires-Python: >=3.9,<3.12
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Requires-Dist: annoy (>=1.17.1,<2.0.0)
Requires-Dist: datasets (>=2.4.0,<3.0.0)
Requires-Dist: datasketch (>=1.5.8,<2.0.0)
Requires-Dist: hydra-core (>=1.2.0,<2.0.0)
Requires-Dist: mpire (>=2.6.0,<3.0.0)
Requires-Dist: numpy (>=1.23.2,<2.0.0)
Requires-Dist: redis (>=4.3.4,<5.0.0)
Requires-Dist: rich (>=12.5.1,<13.0.0)
Requires-Dist: scipy (==1.9.1)
Requires-Dist: sentencepiece (>=0.1.97,<0.2.0)
Requires-Dist: torch (>=1.12.1,<2.0.0)
Requires-Dist: tqdm (>=4.64.1,<5.0.0)
Requires-Dist: transformers (>=4.21.2,<5.0.0)
Requires-Dist: yaspin (>=2.2.0,<3.0.0)
Description-Content-Type: text/markdown

# text-dedup

[![Codacy Badge](https://app.codacy.com/project/badge/Coverage/cc66178e49d24908ac1fb2b2dbe4e5b3)](https://www.codacy.com/gh/ChenghaoMou/text-dedup/dashboard?utm_source=github.com&utm_medium=referral&utm_content=ChenghaoMou/text-dedup&utm_campaign=Badge_Coverage) [![Codacy Badge](https://app.codacy.com/project/badge/Grade/cc66178e49d24908ac1fb2b2dbe4e5b3)](https://www.codacy.com/gh/ChenghaoMou/text-dedup/dashboard?utm_source=github.com&utm_medium=referral&utm_content=ChenghaoMou/text-dedup&utm_campaign=Badge_Grade)


## Features

-   Hash-based methods such as [SimHash](https://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf) and [MinHash](https://web.archive.org/web/20150131043133/http://gatekeeper.dec.com/ftp/pub/dec/SRC/publications/broder/positano-final-wpnums.pdf) + [LSH](http://infolab.stanford.edu/~ullman/mmds.html) for near-duplicate removal.
-   [Suffix array](http://dl.acm.org/citation.cfm?id=320176.320218)-based method from [Deduplicating Training Data Makes Language Models Better](https://arxiv.org/abs/2107.06499) for exact substring deduplication.
-   In-memory or [Redis](https://redis.io)/[KeyDB](https://docs.keydb.dev)-cached index to handle larger-than-memory datasets.
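
To illustrate the MinHash + LSH idea listed above, here is a minimal standard-library sketch. It is not this package's API or implementation: every name in it (`shingles`, `minhash_signature`, `candidate_pairs`, `estimated_jaccard`) is hypothetical, and the shingle size, number of hash functions, and band count are arbitrary assumptions chosen for readability.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, n=3):
    """Return the set of word n-grams in `text` (hypothetical helper)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(items, num_perm=64):
    """Keep the minimum of each seeded hash; each seed stands in for one permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def candidate_pairs(signatures, bands=16):
    """LSH banding: documents that share any identical band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river shore",
    "c": "completely unrelated sentence about suffix arrays and exact substring matches",
}
signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}

# Only pairs that collide in at least one band are compared in full.
for left, right in sorted(candidate_pairs(signatures)):
    score = estimated_jaccard(signatures[left], signatures[right])
    print(left, right, round(score, 2))
```

Banding trades recall for speed: more bands with fewer rows each surface lower-similarity pairs as candidates, while fewer, wider bands keep only very close matches.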

## Documentation

[GitHub Pages](https://chenghaomou.github.io/text-dedup/index.html)

## Todos

-   [ ] Memory benchmark for streaming processing
-   [ ] Speed benchmark for in-memory processing
-   [ ] Inter-dataset deduplication
-   [ ] Rewrite suffix array in Python

## Thanks

-   [seomoz/simhash-cpp](https://github.com/seomoz/simhash-cpp)
-   [datasketch](http://ekzhu.com/datasketch/index.html)
-   [google-research/deduplicate-text-datasets](https://github.com/google-research/deduplicate-text-datasets)
-   Developed with an OSS license from [JetBrains](https://jb.gg/OpenSourceSupport)
-   This project is heavily influenced by the deduplication work at the BigScience workshop. The original code can be found at [bigscience-workshop/data-preparation](https://github.com/bigscience-workshop/data-preparation/tree/main/preprocessing/filtering/deduplicate).

