Metadata-Version: 2.4
Name: dtbag
Version: 3.1.0
Summary: Data Tool Bag (dtbag) - A Python library for text processing, data cleaning, and similarity-based clustering of textual data
Home-page: https://github.com/abderrahmane-sakhi/dtbag
Author: Abderrahmane Sakhi
Author-email: Abderrahmane.Sakhi@gmail.com
Project-URL: Bug Reports, https://github.com/abderrahmane-sakhi/dtbag/issues
Project-URL: Source, https://github.com/abderrahmane-sakhi/dtbag
Keywords: categorical data unification,text processing,data cleaning,text normalization,string matching,similarity detection,data preprocessing,text cleaning,text unification,duplicate detection,data deduplication,text clustering,levenshtein distance,fuzzy matching,clustering algorithms,pattern recognition,natural language processing,text mining,similarity metrics,edit distance,string similarity,data science,machine learning,data analysis,ETL tools,data wrangling,information retrieval,python-library,open-source,text-utilities,data-tools
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Healthcare Industry
Classifier: Intended Audience :: Manufacturing
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Other Audience
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Office/Business
Classifier: Topic :: Text Processing
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: project-url
Dynamic: requires-python
Dynamic: summary

# dtbag: Data Tools for Business, Analytics & General Tasks

**Streamline your data preparation for Machine Learning and Business Analytics.**

`dtbag` is a Python library of ready-made solutions to common data preparation problems. What sets it apart is its "solution in one line" approach: each function runs all the necessary steps internally and returns the result in a single call.
For instance, the `CatUnifier` function addresses a common problem with categorical data: inconsistent values caused by incorrect entries, spelling mistakes, different ways of writing or mapping the same value, and so on.
A dataset may contain only four real categories yet appear to hold twice that number or more, which distorts charts, machine learning models, and any other downstream analysis.

**Clean inconsistent data entries**

*   **data =**  ["Pencil", "Parrise", "Pencle", "Pris", "PParis", "pencl", "pencyl", "Paris"]
*   **clean_data =**  CatUnifier(data, threshold=0.7)
*   **Result:**  ["Pencil", "Paris"]


## Key Features

*   **Smart Text Unification :**  Intelligently groups and standardizes similar text entries (like names, addresses, product titles) even with typos, different cases, or accents.
*   **Built for Real Data :**  Handles common data issues like inconsistent capitalization (`"New York"`, `"NEW YORK"`), diacritics (`"Café"`, `"Cafe"`), and minor spelling variations.
*   **Multilingual Support :**  Works seamlessly across languages commonly found in business data, including **Arabic, English, French, Spanish, German**, and more.
*   **Zero Dependencies :**  Uses only Python's robust standard library, ensuring lightweight and conflict-free installation.
*   **Preserves Original Data :**  Returns the most frequent *original* version in each group, maintaining data integrity.
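
The case and diacritic handling listed above can be illustrated with the standard library alone. The sketch below is not `dtbag`'s implementation; `normalize` is a hypothetical helper name, and it assumes that comparisons run on a case-folded, accent-stripped copy of each string while the original spelling is kept for the output.

```python
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical helper: build a comparison key for a raw entry."""
    # Decompose accented characters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base letters.
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold and collapse runs of whitespace.
    return " ".join(without_marks.casefold().split())


print(normalize("Café") == normalize("CAFE"))           # True
print(normalize("New York") == normalize("NEW  YORK"))  # True
```

Only the comparison key is normalized here; the original strings stay untouched, which is what allows a library to return an original spelling for each group.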


## What's Inside:

*   **CatLists :** 
   Returns a list of the unique, corrected categories found in the inconsistent input.
*   **CatUnifier :** 
   Returns the entire input as a new list in which every entry is replaced by its corrected category, in the same order as entered, so that it can replace the original data directly.
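
To make the difference between the two return shapes concrete, here is a minimal standard-library sketch. The names `unify_sketch` and `unique_sketch` are hypothetical stand-ins, not `dtbag` functions, and the grouping rule (compare each value with the first member of each group using `difflib.SequenceMatcher`) is an assumption chosen for brevity rather than the library's actual algorithm.

```python
from collections import Counter
from difflib import SequenceMatcher


def unify_sketch(values, threshold=0.7):
    """Hypothetical stand-in: map each value to its group's most frequent original."""
    groups = []  # each group is a list of original spellings
    for value in values:
        for group in groups:
            # Compare case-insensitively against the group's first member.
            score = SequenceMatcher(None, value.casefold(), group[0].casefold()).ratio()
            if score >= threshold:
                group.append(value)
                break
        else:
            groups.append([value])  # no similar group found: start a new one

    representative = {}
    for group in groups:
        # The most frequent original spelling represents the group.
        winner = Counter(group).most_common(1)[0][0]
        for value in group:
            representative[value] = winner
    return [representative[value] for value in values]


def unique_sketch(values, threshold=0.7):
    """Hypothetical stand-in: distinct corrected categories in first-seen order."""
    return list(dict.fromkeys(unify_sketch(values, threshold)))


raw = ["Starbucks", "Sturbuks", "Starbucks", "starbucks", "Paris", "Pariss", "Paris"]
print(unify_sketch(raw))   # same length and order as the input
print(unique_sketch(raw))  # ['Starbucks', 'Paris']
```

The first shape is suited to overwriting the original data in place, while the second is useful for inspecting which categories remain after correction.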


## Quick Start

#### Installation
```bash
pip install dtbag
```


#### Basic Usage: Cleaning Text Data
```python
from dtbag import CatUnifier
from dtbag import CatLists

raw_customers = ["Café Marrakech", "CAFÉ MARRAKECH", 'Café Mrakech', 'Cafe Marakesh', 'Cafffé Marrakech', 'Cfé Markech', 'Sturbuks', 'Starbucks', "Starbucks", "starbucks"]
print(CatUnifier(raw_customers, threshold=0.7))
# Output: ['Café Marrakech', 'Café Marrakech', 'Café Marrakech', 'Café Marrakech', 'Café Marrakech', 'Café Marrakech', 'Starbucks', 'Starbucks', 'Starbucks', 'Starbucks']
```
```python
products = ["laptop-13inch", "Laptop 13 Inch", "Laptop 13\"", "mouse wirlss", "Mouse Wireless"]
print(CatUnifier(products, threshold=0.75))
# Output: ["Laptop 13 Inch", "Laptop 13 Inch", "Laptop 13 Inch", "Mouse Wireless", "Mouse Wireless"]
```
```python
international = ["São Paulo", "Sao Paulo", "Saw Paullow", "München", "Muenchen", "Naïve", "Naive", "Nayvee"]
print(CatUnifier(international, threshold=0.7))
# Output: ["São Paulo", "São Paulo", "São Paulo", "München", "München", "Naïve", "Naïve", "Naïve"]
```


#### Tuning Precision
```python
# `data` is any list of strings; a higher threshold merges only near-identical entries.
strict_cleaning = CatUnifier(data, threshold=0.9)   # High precision
lenient_cleaning = CatUnifier(data, threshold=0.6)  # More grouping
```
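
To build intuition for what a threshold value means, the sketch below scores a pair of strings with a normalized Levenshtein similarity. This is an assumption for illustration only: the package keywords mention Levenshtein distance, but the exact metric and normalization used by `dtbag` are not documented here, and the `levenshtein` and `similarity` helpers are hypothetical names.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical score in [0, 1]: 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


score = similarity("pencil", "pencle")  # 2 edits over 6 characters
print(round(score, 3))  # 0.667
print(score >= 0.9)     # False: kept separate under a strict threshold
print(score >= 0.6)     # True: merged under a lenient threshold
```

Under this assumed metric, raising the threshold trades recall for precision: fewer misspellings are merged, but unrelated categories are less likely to be grouped together.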
