Metadata-Version: 2.4
Name: datawrangle
Version: 0.1.0
Summary: Intuitive, chainable data cleaning and validation for data engineers and analysts
Home-page: https://github.com/nii-dhii/DataWrangle
Author: Nidhi Bhagat
Author-email: nidhi.bhagat@datawrangle.io
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.3.0
Requires-Dist: numpy>=1.20.0
Requires-Dist: scipy>=1.7.0
Provides-Extra: nlp
Requires-Dist: nltk>=3.6; extra == "nlp"
Requires-Dist: spacy>=3.0; extra == "nlp"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=3.0; extra == "dev"
Requires-Dist: black>=22.0; extra == "dev"
Requires-Dist: flake8>=4.0; extra == "dev"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# 🧹 DataWrangle

### Transform messy data into pristine datasets with elegance

**Intuitive data cleaning, validation, and profiling for Python.** Reduce boilerplate, catch errors early, and ship reliable data pipelines.

<div align="center">

[![Python Version](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
[![GitHub Stars](https://img.shields.io/github/stars/nii-dhii/DataWrangle?style=social)](https://github.com/nii-dhii/DataWrangle)
[![Downloads](https://static.pepy.tech/badge/datawrangle)](https://pepy.tech/project/datawrangle)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

[Quick Start](#-quick-start) • [Examples](#-examples) • [Features](#-features)

</div>

---

## ⭐ What's New

**v0.1.0 includes must-have utilities for data engineers:**

🔧 **Column Operations** - Standardize names, drop null columns, split/combine  
📅 **Date/Time Utilities** - Multi-format parsing, feature extraction  
🎲 **Data Sampling** - Stratified sampling, train/test splits  
🔍 **Data Comparison** - DataFrame diffs, schema drift detection  
⚡ **Memory Optimization** - Automatic dtype optimization

[See examples below](#-examples) →

---

## 📦 Installation

```bash
pip install datawrangle
```

## 🚀 Quick Start

```python
import datawrangle as dw
import pandas as pd

# Load messy data
df = pd.read_csv('messy_data.csv')

# Clean with a simple config
cleaned = dw.clean_dataframe(df, config={
    'drop_duplicates': {'subset': ['id']},
    'fill_missing': {'age': 'mean', 'name': 'Unknown'},
    'convert_types': {'date': 'datetime', 'price': 'float'},
    'remove_outliers': {'column': 'salary', 'method': 'iqr'}
})

# Validate against schema
schema = {
    'id': {'type': int, 'constraints': {'non_null': True}},
    'email': {'type': str, 'constraints': {'regex': r'^[\w\.-]+@[\w\.-]+\.\w+$'}}
}
report = dw.validate_schema(cleaned, schema)

# Profile data quality
profile = dw.profile_data(cleaned, output='json')
print(f"Cleaned {len(df)} → {len(cleaned)} rows")
print(f"Validation: {report['status']}")
```

**Output:**

```
Cleaned 6 → 4 rows
Validation: pass
```

Or use the chainable API:

```python
result = (
    dw.DataWrangleFrame(df)
    .drop_duplicates(subset=['id'])
    .fill_missing({'age': 'mean'})
    .normalize_text('email', {'lowercase': True})
    .validate(schema)
    .get_dataframe()
)

print(result[['id', 'name', 'email']].head())
```

**Output:**

```
   id     name              email
0   1    Alice  alice@example.com
1   2      Bob       bob@test.com
3   3  Charlie   charlie@demo.com
4   4    Diana  diana@company.com
5   5      Eve       eve@mail.com
```

## ✨ Features

**Data Cleaning**

- Drop duplicates with flexible config
- Fill missing values (mean, median, mode, forward/backward fill, custom)
- Type conversion (int, float, datetime, string)
- Outlier removal (IQR, z-score methods)
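
To make the IQR method concrete, here is a minimal pure-Python sketch of the usual rule: keep values inside `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]`. The helper names and the 1.5 multiplier are illustrative assumptions, not DataWrangle's internals.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences Q1 - k*IQR and Q3 + k*IQR."""
    # Hypothetical helper; the quartile method is an assumption.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers_iqr(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]

salaries = [40_000, 42_000, 45_000, 47_000, 50_000, 1_000_000]
kept = remove_outliers_iqr(salaries)
print(kept)  # the 1,000,000 entry is dropped
```

A z-score variant would swap the fences for `mean ± k * stdev`.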

**Schema Validation**

- Type checking with detailed constraints
- Regex patterns, value ranges, date ranges
- Non-null and positivity checks
- Detailed error reporting
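
The sketch below illustrates how type, non-null, and regex checks could be applied row by row. It reuses the schema shape from Quick Start; the `validate_rows` function and the `errors` key are hypothetical and only approximate what a validator might report.

```python
import re

def validate_rows(rows, schema):
    """Check each row (a dict) against type, non_null, and regex rules."""
    errors = []
    for i, row in enumerate(rows):
        for column, rule in schema.items():
            value = row.get(column)
            constraints = rule.get("constraints", {})
            if value is None:
                if constraints.get("non_null"):
                    errors.append(f"row {i}: {column} is null")
                continue
            if not isinstance(value, rule["type"]):
                errors.append(f"row {i}: {column} is not {rule['type'].__name__}")
                continue
            pattern = constraints.get("regex")
            if pattern and not re.fullmatch(pattern, value):
                errors.append(f"row {i}: {column} does not match pattern")
    # The 'errors' key is an assumption made for this sketch.
    return {"status": "fail" if errors else "pass", "errors": errors}

schema = {
    "id": {"type": int, "constraints": {"non_null": True}},
    "email": {"type": str, "constraints": {"regex": r"^[\w\.-]+@[\w\.-]+\.\w+$"}},
}
rows = [
    {"id": 1, "email": "alice@example.com"},
    {"id": None, "email": "not-an-email"},
]
report = validate_rows(rows, schema)
print(report["status"], report["errors"])
```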

**Text Normalization**

- Case conversion, punctuation removal
- Pattern removal (URLs, phone numbers, etc.)
- Whitespace normalization
- Optional NLP (tokenization, stemming, lemmatization)
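
As a rough illustration of these steps on a single string, the following standard-library sketch removes URLs, lowercases, strips punctuation, and collapses whitespace. The function name and option names are assumptions and may not match the library's `normalize_text` options.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize_string(text, lowercase=True, strip_punctuation=True, remove_urls=True):
    """Apply a fixed sequence of normalization steps to one string."""
    if remove_urls:
        text = URL_RE.sub(" ", text)        # drop URLs before punctuation is stripped
    if lowercase:
        text = text.lower()
    if strip_punctuation:
        text = text.translate(PUNCT_TABLE)  # ASCII punctuation only
    return " ".join(text.split())           # collapse runs of whitespace

cleaned_text = normalize_string("  Hello,   WORLD!  See https://example.com/page now. ")
print(cleaned_text)
```

Order matters here: removing URLs first avoids leaving fragments such as `httpsexamplecompage` behind.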

**Data Profiling**

- Missing value analysis
- Duplicate detection
- Statistical summaries
- Export to JSON, HTML, or Markdown

```python
profile = dw.profile_data(df, output='summary')
print(f"Total rows: {profile['overview']['total_rows']}")
print(f"Duplicates: {profile['overview']['duplicate_rows']}")
# Output:
# Total rows: 5
# Duplicates: 0
```
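
For intuition about what such a profile contains, here is a small sketch that computes the same overview fields over a list of row dicts. The `overview`, `total_rows`, and `duplicate_rows` keys mirror the snippet above; the `missing` key and the function itself are assumptions.

```python
def profile_rows(rows):
    """Summarize row count, exact duplicate rows, and missing values per column."""
    seen = set()
    duplicate_rows = 0
    missing = {}
    for row in rows:
        fingerprint = tuple(sorted(row.items()))  # order-independent row identity
        if fingerprint in seen:
            duplicate_rows += 1
        seen.add(fingerprint)
        for column, value in row.items():
            missing[column] = missing.get(column, 0) + (value is None)
    return {
        "overview": {"total_rows": len(rows), "duplicate_rows": duplicate_rows},
        "missing": missing,  # hypothetical key, for illustration only
    }

sample_rows = [
    {"id": 1, "age": 30},
    {"id": 2, "age": None},
    {"id": 1, "age": 30},
]
sample_profile = profile_rows(sample_rows)
print(sample_profile)
```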

**Plugin System**

- Register custom cleaning functions
- Extend with domain-specific logic
- Reusable validation rules

**Column Operations** ⭐ NEW

- Standardize column names (snake_case, camelCase)
- Drop columns with high null percentages
- Reorder, split, and combine columns
- Memory optimization and dtype inference
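
A snake_case conversion can be sketched with two regular expressions: one to split camelCase boundaries, one to replace separators. This is an illustrative helper under assumed rules; the library may treat edge cases such as acronyms differently.

```python
import re

def to_snake_case(name):
    """Convert a column label such as 'Customer-ID' to 'customer_id'."""
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name.strip())  # camelCase boundary
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)                   # spaces, dashes, etc.
    return name.strip("_").lower()

columns = ["User Name", "Customer-ID", "orderTotal"]
renamed = [to_snake_case(c) for c in columns]
print(renamed)
```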

**Date/Time Utilities** ⭐ NEW

- Parse dates with multiple format support
- Extract date features (year, month, weekday, etc.)
- Calculate date differences
- Timezone handling
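
Multi-format parsing usually means trying each format in turn until one succeeds. The following standard-library sketch shows that idea together with simple feature extraction; the function names and returned keys are assumptions for illustration.

```python
from datetime import datetime

def parse_date(text, formats):
    """Return the first successful parse, or None if no format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

def date_features(dt):
    """Derive a few calendar features from a datetime."""
    return {
        "year": dt.year,
        "month": dt.month,
        "weekday": dt.weekday(),        # Monday is 0
        "is_weekend": dt.weekday() >= 5,
    }

formats = ["%Y-%m-%d", "%d/%m/%Y"]
iso = parse_date("2024-03-09", formats)
slashed = parse_date("09/03/2024", formats)
print(iso == slashed, date_features(iso))
```

Format order matters for ambiguous inputs: `03/04/2024` parses as 3 April under `%d/%m/%Y` but as 4 March under `%m/%d/%Y`.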

**Data Sampling** ⭐ NEW

- Random and stratified sampling
- Train/test splits with stratification
- First/last N rows sampling
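
Stratified splitting keeps each category's share roughly equal in both partitions. Here is a minimal sketch of the technique over a list of row dicts; the function name, the rounding rule, and the seed handling are assumptions rather than the library's behavior.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, test_size=0.2, seed=0):
    """Split rows so each value of `key` contributes ~test_size of its rows to test."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    train, test = [], []
    for members in groups.values():
        members = members[:]            # copy so the caller's order is untouched
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# 75 rows in category "a", 25 rows in category "b"
rows = [{"id": i, "category": "a" if i % 4 else "b"} for i in range(100)]
train, test = stratified_split(rows, key="category", test_size=0.2)
print(len(train), len(test))
```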

**Data Comparison** ⭐ NEW

- Compare two DataFrames
- Detect schema drift
- Find row and value differences
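
Schema drift detection boils down to comparing expected and observed column-to-dtype mappings. The sketch below shows one way to do that with plain dicts; the `has_drift` and `changes` keys follow the drift example in this README, while the function and the exact messages are assumptions.

```python
def find_drift(expected, actual):
    """Compare two {column: dtype_name} mappings and list the differences."""
    changes = []
    for column, dtype in expected.items():
        if column not in actual:
            changes.append(f"Missing column: {column}")
        elif actual[column] != dtype:
            changes.append(f"Type change in {column}: {dtype} → {actual[column]}")
    for column in actual:
        if column not in expected:
            changes.append(f"New column: {column}")
    return {"has_drift": bool(changes), "changes": changes}

expected = {"id": "int64", "price": "int64"}
actual = {"id": "int64", "price": "float64", "note": "object"}
drift_report = find_drift(expected, actual)
print(drift_report)
```

With pandas, a mapping like `actual` could be built from `df.dtypes.astype(str).to_dict()`.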

## 📋 Examples

### E-commerce Data Pipeline

```python
import datawrangle as dw

# Clean and validate customer data
clean_customers = (
    dw.DataWrangleFrame(raw_customers)
    .drop_duplicates(subset=['customer_id'])
    .fill_missing({'lifetime_value': 0})
    .convert_types({'customer_id': 'int', 'registration_date': 'datetime'})
    .normalize_text('email', {'lowercase': True, 'strip': True})
    .remove_outliers('lifetime_value', method='iqr')
    .validate(customer_schema)
    .get_dataframe()
)

# Generate quality report
report = dw.profile_data(clean_customers, output='html')
print(f"Original: {len(raw_customers)} rows → Cleaned: {len(clean_customers)} rows")
```

**Output:**

```
Original: 7 rows → Cleaned: 5 rows
```

### Custom Business Logic

```python
@dw.register_cleaner
def standardize_country_codes(df):
    """Convert country names to ISO codes."""
    country_map = {
        'USA': 'US', 'United States': 'US',
        'UK': 'GB', 'United Kingdom': 'GB'
    }
    if 'country' in df.columns:
        df['country'] = df['country'].replace(country_map)
    return df

# Use your custom cleaner
df = dw.clean_dataframe(raw_data, custom_cleaners=['standardize_country_codes'])
print(df[['name', 'country']].head())
```

**Output:**

```
        name  country
0  Product A       US
1  Product B       US
2  Product C       GB
3  Product D       GB
4  Product E  Germany
```

### Column Standardization & Date Features

```python
import datawrangle as dw

# Standardize messy column names
df = dw.standardize_column_names(df, style='snake_case')
# 'User Name' → 'user_name', 'Customer-ID' → 'customer_id'

# Drop columns with too many nulls
df = dw.drop_columns_with_nulls(df, threshold=0.5)  # >50% nulls

# Parse dates with multiple formats
df = dw.parse_dates(df, 'registration_date', formats=['%Y-%m-%d', '%d/%m/%Y'])

# Extract date features for ML
df = dw.extract_date_features(df, 'timestamp', ['year', 'month', 'is_weekend'])

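# Optimize memory usage and measure the reduction
before = df.memory_usage(deep=True).sum()
df = dw.optimize_dtypes(df)
after = df.memory_usage(deep=True).sum()
savings = 100 * (before - after) / before
print(f"Memory optimized: {savings:.0f}% reduction")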
```

**Output:**

```
Memory optimized: 35% reduction
```

### Data Sampling & Comparison

```python
# Stratified sampling for ML
train, test = dw.create_train_test_split(df, test_size=0.2, stratify_by='category')
print(f"Train: {len(train)}, Test: {len(test)}")

# Check current data against the expected schema
drift = dw.find_schema_drift(current_data, expected_schema)
if drift['has_drift']:
    print(f"⚠️ Schema drift detected: {drift['changes']}")

# Sample 10% of data for quick analysis
sample = dw.sample_data(df, frac=0.1, method='random')
```

**Output:**

```
Train: 800, Test: 200
⚠️ Schema drift detected: ['Type change in price: int64 → float64']
```

## 🎯 Use Cases

- **ETL/ELT Pipelines** - Clean, standardize, validate before loading
- **ML Preprocessing** - Sample, feature engineering, train/test splits
- **API Integration** - Handle inconsistent formats, detect schema drift
- **Data Quality Reports** - Automated profiling with exports
- **Compliance** - Validate schemas, track data lineage
- **Data Migration** - Compare datasets, detect changes, optimize storage
- **Production Monitoring** - Detect schema drift in real time

## 🛠️ Development

```bash
# Clone and install
git clone https://github.com/nii-dhii/DataWrangle.git
cd DataWrangle
pip install -e ".[dev]"  # quotes keep shells like zsh from expanding the brackets

# Run tests
pytest tests/ -v --cov=datawrangle

# Format code
black datawrangle tests

# Lint code
flake8 datawrangle tests
```

## 🤝 Contributing

Contributions are welcome, whether bug reports, features, or docs.

- **Issues**: [Report bugs](https://github.com/nii-dhii/DataWrangle/issues)
- **Pull Requests**: [Contribute code](https://github.com/nii-dhii/DataWrangle/pulls)
- **Discussions**: [Join community](https://github.com/nii-dhii/DataWrangle/discussions)

Built with ❤️ by [@nii-dhii](https://github.com/nii-dhii)

## 📚 Resources

- [Example Scripts](examples/example_usage.py) - Comprehensive usage examples
- [Tests](tests/) - See features in action
- [Sample Data](data/) - CSV files for testing

## 📄 License

MIT License - see the [LICENSE](LICENSE) file for details.

---

<div align="center">

⭐ Star us on [GitHub](https://github.com/nii-dhii/DataWrangle) • Made with ❤️ for the data community

</div>
