Metadata-Version: 2.4
Name: dtbag
Version: 0.1.0
Summary: Data Tool Bag (dtbag) - A Python library for text processing, data cleaning, and similarity-based clustering of textual data
Home-page: https://github.com/yourusername/dtbag
Author: Abderrahmane Sakhi
Author-email: Abderrahmane.Sakhi@gmail.com
Project-URL: Bug Reports, https://github.com/yourusername/dtbag/issues
Project-URL: Source, https://github.com/yourusername/dtbag
Keywords: categorical data unification,text processing,data cleaning,text normalization,string matching,similarity detection,data preprocessing,text unification,duplicate detection,data deduplication,text clustering,levenshtein distance,fuzzy matching,clustering algorithms,pattern recognition,natural language processing,text mining,similarity metrics,edit distance,string similarity,data science,machine learning,data analysis,ETL tools,data wrangling,information retrieval,python-library,open-source,text-utilities,data-tools
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.19.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: project-url
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

**dtbag** (Data Tool Bag) is a Python library for text processing and data cleaning. It identifies similar text entries and unifies them under a single representative, which makes it useful for preprocessing textual data in data science and machine learning pipelines.

## Core Features

### 1. **Text Similarity & Clustering**
- Advanced Levenshtein distance implementation for accurate string matching
- Automatic clustering of similar text entries based on configurable thresholds
- Group-based text unification for consistent data representation
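
To make the threshold idea concrete, here is a minimal pure-Python sketch of Levenshtein-based similarity and a greedy clustering pass. It is an illustration only, not dtbag's actual implementation: the function names `levenshtein`, `similarity`, and `cluster`, the normalization by the longer string's length, the case-insensitive comparison, and the greedy "compare against the first member" strategy are all assumptions made for this example.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def cluster(items: List[str], threshold: float = 0.7) -> List[List[str]]:
    """Greedy pass: an item joins the first cluster whose first member is similar enough."""
    clusters: List[List[str]] = []
    for item in items:
        for group in clusters:
            # Assumption: compare case-insensitively against the cluster's first member.
            if similarity(item.lower(), group[0].lower()) >= threshold:
                group.append(item)
                break
        else:
            clusters.append([item])
    return clusters


print(levenshtein("kitten", "sitting"))  # 3
print(cluster(["Paris", "Pariss", "London"], threshold=0.7))
```

Raising the threshold makes clusters stricter (fewer merges); lowering it merges more aggressively. A greedy pass like this depends on input order, which is one reason a real implementation may choose a different strategy.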

### 2. **Data Cleaning & Normalization**
- Intelligent duplicate detection and removal
- Text unification based on frequency analysis
- Support for multilingual text processing
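
Frequency-based unification can be sketched as follows: once similar entries have been grouped, the most frequent spelling in each group becomes the representative, and every entry is mapped to it. This is a hedged illustration rather than the library's code; `build_mapping` and `unify` are hypothetical helper names, and the groups are written by hand here instead of being computed.

```python
from collections import Counter
from typing import Dict, List


def build_mapping(groups: List[List[str]]) -> Dict[str, str]:
    """Map every entry of a group to that group's most frequent spelling."""
    mapping: Dict[str, str] = {}
    for group in groups:
        if not group:
            continue  # nothing to map in an empty group
        # On a tie, Counter.most_common keeps the spelling seen first.
        representative, _count = Counter(group).most_common(1)[0]
        for item in group:
            mapping[item] = representative
    return mapping


def unify(items: List[str], groups: List[List[str]]) -> List[str]:
    """Replace each item by its representative; unknown items pass through unchanged."""
    mapping = build_mapping(groups)
    return [mapping.get(item, item) for item in items]


# Hand-written groups, standing in for the output of a clustering step.
groups = [["Paris", "Pris", "Paris"], ["Yassine", "Yasin", "Yassine"]]
print(unify(["Paris", "Pris", "Yasin", "Rome"], groups))
# ['Paris', 'Paris', 'Yassine', 'Rome']
```

The output list has the same length and order as the input, so it can be written straight back into the column it came from.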

### 3. **Production-Ready Tools**
- Scikit-learn compatible API design
- Memory-efficient algorithms for large datasets
- Easy integration with existing data pipelines
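
For readers unfamiliar with the scikit-learn convention, the sketch below shows the general shape of a `fit` / `transform` / `fit_transform` object. The class `CaseFoldUnifier` is hypothetical and is not part of dtbag; it groups entries only by case and surrounding whitespace, which is a deliberately simpler rule than similarity-based clustering.

```python
from collections import Counter, defaultdict


class CaseFoldUnifier:
    """Hypothetical transformer illustrating the fit/transform convention."""

    def fit(self, X, y=None):
        # Count each spelling under a normalized (stripped, lowercased) key.
        counts = defaultdict(Counter)
        for value in X:
            counts[value.strip().lower()][value.strip()] += 1
        # Learned state gets a trailing underscore, as in scikit-learn.
        self.mapping_ = {
            key: spellings.most_common(1)[0][0]
            for key, spellings in counts.items()
        }
        return self  # returning self allows method chaining

    def transform(self, X):
        # Values never seen during fit are returned unchanged.
        return [self.mapping_.get(v.strip().lower(), v) for v in X]

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)


unifier = CaseFoldUnifier()
print(unifier.fit_transform(["Paris", "paris", "Paris ", "ROME"]))
# ['Paris', 'Paris', 'Paris', 'ROME']
```

Separating `fit` from `transform` means a mapping learned on one dataset can be reapplied to new data without recomputing it.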



## What's Inside

- **CatLists**: Identifies clusters of similar text entries and returns grouped lists with their most frequent representatives.
- **CatUnifier**: Transforms lists by replacing similar items with their most common representative, maintaining original list structure.



## Quick Start

```python
from dtbag import CatUnifier

# Clean inconsistent data entries
data = ["Yassine", "Parrise", "Yasin", "Pris", "PParis", "Yasine", "yasyne", "Paris"]
unifier = CatUnifier()
clean_data = unifier.fit_transform(data, threshold=0.7)
# Result: ["Yassine", "Paris"]
```



## Installation

```bash
pip install dtbag
```
