Metadata-Version: 2.4
Name: dtbag
Version: 3.1.1
Summary: Data Tool Bag (dtbag) - A Python library for text processing, data cleaning, and similarity-based clustering of textual data
Home-page: https://github.com/abderrahmane-sakhi/dtbag
Author: Abderrahmane Sakhi
Author-email: Abderrahmane.Sakhi@gmail.com
Project-URL: Bug Reports, https://github.com/abderrahmane-sakhi/dtbag/issues
Project-URL: Source, https://github.com/abderrahmane-sakhi/dtbag
Keywords: categorical data unification,text processing,data cleaning,text normalization,string matching,similarity detection,data preprocessing,text cleaning,text unification,duplicate detection,data deduplication,text clustering,levenshtein distance,fuzzy matching,clustering algorithms,pattern recognition,natural language processing,text mining,similarity metrics,edit distance,string similarity,data science,machine learning,data analysis,ETL tools,data wrangling,information retrieval,python-library,open-source,text-utilities,data-tools
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Healthcare Industry
Classifier: Intended Audience :: Manufacturing
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Other Audience
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Office/Business
Classifier: Topic :: Text Processing
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: project-url
Dynamic: requires-python
Dynamic: summary

# dtbag: Data Tools for Business, Analytics & General Tasks

**Streamline your data preparation for Machine Learning and Business Analytics.**

`dtbag` is a Python library focused on solving common data-preparation problems. What sets it apart is its "solution in one line" approach: each function runs all the necessary steps internally and returns the result in a single output.
For instance, the `CatUnifier` function addresses a common categorical data problem: inconsistent categories caused by incorrect entries, spelling mistakes, different ways of writing the same value, and so on.
A dataset may contain only four real categories, yet a count of its unique values can show twice that number or more, which distorts charts, machine learning models, and other downstream results.

**Clean inconsistent data entries**

*   **data =**  ["Pencil", "Book", "Pencle", "Boook", "Bouk", "pencl", "pencyl", "Sook","Pencil"]
*   **CatLists_output =**  CatLists(data, threshold=0.6)
*   **Result:**  ['Pencil', 'Book']

*   **CatUnifier_output =**  CatUnifier(data, threshold=0.6)
*   **Result:**  ['Pencil','Book','Pencil','Book','Book','Pencil','Pencil','Book','Pencil']


## Key Features

*   **Smart Text Unification :**  Intelligently groups and standardizes similar text entries (like names, addresses, product titles) even with typos, different cases, or accents.
*   **Built for Real Data :**  Handles common data issues like inconsistent capitalization (`"New York"`, `"NEW YORK"`), diacritics (`"Café"`, `"Cafe"`), and minor spelling variations.
*   **Multilingual Support :**  Works seamlessly across languages commonly found in business data, including **Arabic, English, French, Spanish, German**, and more.
*   **Zero Dependencies :**  Uses only Python's robust standard library, ensuring lightweight and conflict-free installation.
*   **Preserves Original Data :**  Returns the most frequent *original* version in each group, maintaining data integrity.
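
The case and diacritic handling described above can be pictured with a small standard-library sketch. This is only an illustration of the general idea, not dtbag's actual code: the helper name `normalize_text` is hypothetical, and the library's internal normalization steps may differ.

```python
import unicodedata


def normalize_text(value: str) -> str:
    """Hypothetical helper: build a comparison key for a raw text entry."""
    # Decompose accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    # Drop the combining marks so "é" compares equal to "e".
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold and collapse runs of whitespace.
    return " ".join(stripped.casefold().split())


print(normalize_text("Café") == normalize_text("CAFE"))          # True
print(normalize_text("NEW YORK") == normalize_text("New York"))  # True
```

A key like this would only be used for comparison. The original spelling is kept separately so that it can still be returned as the group's representative.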


## What's Inside:

*   **CatLists :** 
   Returns a list of the unique corrected categories found in the input.
*   **CatUnifier :** 
   Returns the entire input as a new list in which every entry is replaced by its corrected category, in the original order, so that it can replace the source data directly.
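
To make the relationship between the two functions concrete, here is a minimal standard-library sketch of similarity-based grouping. It is an assumption-laden illustration and not dtbag's implementation: the names `sketch_unify` and `sketch_unique` are hypothetical, the similarity score comes from `difflib.SequenceMatcher` (dtbag's own metric may differ), and the grouping is a simple greedy pass.

```python
from collections import Counter
from difflib import SequenceMatcher


def _similarity(a: str, b: str) -> float:
    # Case-insensitive ratio in [0, 1]; a stand-in for whatever metric dtbag uses.
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def sketch_unify(items, threshold=0.6):
    """Hypothetical analogue of CatUnifier: same length and order as the input."""
    groups = []  # each group is a list of original strings
    for item in items:
        for group in groups:
            # Greedy: join the first group whose first member is similar enough.
            if _similarity(item, group[0]) >= threshold:
                group.append(item)
                break
        else:
            groups.append([item])

    mapping = {}
    for group in groups:
        # The most frequent original spelling represents the whole group.
        representative = Counter(group).most_common(1)[0][0]
        for member in group:
            mapping[member] = representative
    return [mapping[item] for item in items]


def sketch_unique(items, threshold=0.6):
    """Hypothetical analogue of CatLists: unique representatives, first-seen order."""
    return list(dict.fromkeys(sketch_unify(items, threshold)))


shops = ["Starbucks", "starbucks", "Starbucks", "Sturbuks", "Book", "Boook", "Book"]
print(sketch_unify(shops, threshold=0.7))
# ['Starbucks', 'Starbucks', 'Starbucks', 'Starbucks', 'Book', 'Book', 'Book']
print(sketch_unique(shops, threshold=0.7))
# ['Starbucks', 'Book']
```

The unified list has the same length and order as the input, which is what allows it to replace a column directly, while the unique list is just its de-duplicated form.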


## Quick Start

#### Installation
```bash
pip install dtbag
```


#### Basic Usage: Cleaning Text Data
```python
import dtbag

raw_customers = ["Café Marrakech", "CAFÉ MARRAKECH", 'Café Mrakech', 'Cafe Marakesh', 'Cafffé Marrakech', 'Cfé Markech', 'Sturbuks', 'Starbucks', "Starbucks", "starbucks", "Café Marrakech"]

# Replace every entry with its group's representative, keeping the input order
dtbag.CatUnifier(raw_customers, threshold=0.7)
# Output: ['Café Marrakech', 'Café Marrakech', 'Café Marrakech', 'Café Marrakech', 'Café Marrakech', 'Café Marrakech', 'Starbucks', 'Starbucks', 'Starbucks', 'Starbucks', 'Café Marrakech']

# Return only the unique corrected categories
dtbag.CatLists(raw_customers, threshold=0.75)
# Output: ['Café Marrakech', 'Starbucks']
```
```python
products = ["laptop-13inch", "Mouse Wireless", "Laptop 13 Inch", "Laptop 13\"", "mouse wirlss", "Mouse Wireless"]
dtbag.CatUnifier(products, threshold=0.6)
# Output: ['Laptop 13"', 'Mouse Wireless', 'Laptop 13"', 'Laptop 13"', 'Mouse Wireless', 'Mouse Wireless']

dtbag.CatLists(products, threshold=0.6)
# Output: ['Laptop 13"', 'Mouse Wireless']
```
```python
international = ["São Paulo", "Sao Paulo", "Saw Paullow", "München", "Muenchen", "Naïve", "Naive", "Nayve", "Naive"]
dtbag.CatUnifier(international, threshold=0.7)
# Output: ['São Paulo', 'São Paulo', 'São Paulo', 'München', 'München', 'Naive', 'Naive', 'Naive', 'Naive']

dtbag.CatLists(international, threshold=0.7)
# Output: ['São Paulo', 'München', 'Naive']
```


#### Tuning Precision
```python
import dtbag

strict_cleaning = dtbag.CatUnifier(data, threshold=0.9)   # High precision: only near-identical entries are merged
lenient_cleaning = dtbag.CatUnifier(data, threshold=0.6)  # More grouping: looser matches are merged
```
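
To get a feel for what a threshold value means, it can help to score a few pairs by hand. The sketch below assumes a normalized edit-distance score in the range 0 to 1. The package keywords mention Levenshtein distance, but the exact metric dtbag applies is not stated here, so the helpers `levenshtein`, `similarity_score`, and `would_group` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_score(a: str, b: str) -> float:
    """Hypothetical score: 1.0 for identical strings, lower as edits accumulate."""
    a, b = a.casefold(), b.casefold()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def would_group(a: str, b: str, threshold: float) -> bool:
    return similarity_score(a, b) >= threshold


# "Pencil" vs "Pencle": 2 edits over 6 characters, so the score is about 0.67.
print(would_group("Pencil", "Pencle", threshold=0.9))  # False
print(would_group("Pencil", "Pencle", threshold=0.6))  # True
```

Under this assumed metric, a threshold of 0.9 tolerates roughly one edit per ten characters, while 0.6 also merges short words that differ by a couple of letters, at the cost of occasionally merging categories that are genuinely different.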
