Metadata-Version: 2.4
Name: DataCleaningPipeline
Version: 2.1.1
Summary: Production-grade data cleaning and validation pipeline
Author-email: Sahil Sharma <sahilsharmamrp@gmail.com>
Project-URL: Homepage, https://github.com/Developer-Sahil/DataCleaningPipeline
Project-URL: Bug Tracker, https://github.com/Developer-Sahil/DataCleaningPipeline/issues
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: python-dateutil>=2.8.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: scikit-learn; extra == "dev"
Requires-Dist: pyarrow; extra == "dev"
Provides-Extra: ml
Requires-Dist: scikit-learn; extra == "ml"
Provides-Extra: parquet
Requires-Dist: pyarrow; extra == "parquet"
Requires-Dist: fastparquet; extra == "parquet"
Dynamic: license-file

# Data Cleaning Pipeline

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A robust, beta-grade Python package for automating complex data cleaning workflows.

## Features

- **🛡️ Robust Validation**: Validate data against YAML schemas and ensure type integrity.
- **🧹 Advanced Cleaning**:
  - Missing value imputation (Mean, Median, Mode, Fill)
  - Outlier detection (IQR, Z-Score, Isolation Forest)
  - Text standardization (Vectorized operations)
  - Duplicate removal
- **🚀 High Performance**:
  - **Verified Scalability**: Tested on datasets of up to 5 million rows.
  - **Optimized**: ~300k-500k rows/sec processing speed.
  - **Chunking support** for datasets larger than available RAM
- **🔒 Secure & Safe**:
  - Configurable rollback mechanism (Undo last 1-10 steps)
  - Path traversal protection
  - Safe execution context
- **📊 Reporting**:
  - Detailed JSON audit logs
  - Data quality scoring
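
As one illustration of the outlier methods listed above, the IQR rule can be sketched in plain pandas. This is not the package's implementation; the helper name `iqr_mask` and the conventional multiplier `k=1.5` are assumptions made for the example.

```python
import pandas as pd


def iqr_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    # Series.between is inclusive on both ends; NaN values evaluate to False.
    return series.between(q1 - k * iqr, q3 + k * iqr)


# Illustrative data: five typical salaries and one extreme value.
df = pd.DataFrame({"salary": [40_000, 42_000, 45_000, 47_000, 50_000, 1_000_000]})
kept = df[iqr_mask(df["salary"])]
```

With this data the extreme value falls outside the fences and is dropped, while the five typical salaries are kept.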

## Performance
We verified the pipeline's performance on datasets of up to **5 million rows**:
- **50,000 rows**: ~0.12s
- **2,000,000 rows**: ~5.12s
- **5,000,000 rows**: ~16.25s

(Tested on standard hardware. Scaling is close to linear, though throughput drops somewhat at the largest size.)
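
The throughput implied by these timings can be checked with a few lines of arithmetic. The snippet uses only the figures quoted above; the variable names are illustrative.

```python
# Rows processed and wall-clock seconds, as quoted in the list above.
timings = {50_000: 0.12, 2_000_000: 5.12, 5_000_000: 16.25}

# Throughput in rows per second for each dataset size.
throughput = {rows: rows / secs for rows, secs in timings.items()}

for rows, rate in throughput.items():
    print(f"{rows:>9,} rows: {rate:,.0f} rows/sec")
```

The rates work out to roughly 417k, 391k, and 308k rows/sec, which is consistent with the ~300k-500k range quoted under Features.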

## Installation

```bash
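pip install DataCleaningPipeline

# Optional extras: "ml" (scikit-learn) and "parquet" (pyarrow, fastparquet)
pip install "DataCleaningPipeline[ml,parquet]"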
```

## Quick Start

```python
import pandas as pd
from DataCleaningPipeline import DataCleaningPipeline, PipelineConfig

# 1. Load Data
df = pd.read_csv("dirty_data.csv")

# 2. Configure Pipeline
config = PipelineConfig(
    log_level="INFO",
    max_checkpoints=5,
    strict_mode=True
)

# 3. Clean!
pipeline = DataCleaningPipeline(df, config)

clean_df = (pipeline
    .remove_duplicates(subset=['id'])
    .handle_missing_values({'age': 'median', 'city': 'mode'})
    .standardize_text(['city', 'email'], lowercase=True)
    .remove_outliers(['salary'], method='isolation_forest')
    .validate_email('email', remove_invalid=True)
    .get_cleaned_data()
)

# 4. Export (Parquet output needs the "parquet" extra: pyarrow or fastparquet)
pipeline.to_parquet("clean_data.parquet")
pipeline.export_report("audit_log.json")
```

## Advanced Features

### Chunked Processing (Large Files)
Process files larger than memory by streaming them in chunks:

```python
pipeline = DataCleaningPipeline(pd.DataFrame(), config)
# Process large files in chunks (generator-based)
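for chunk_pipeline in pipeline.process_in_chunks("huge_file.csv", chunk_size=50000):
    # Each cleaned chunk is appended (mode='a'); with header=False the
    # output file is written without a header row.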
    (chunk_pipeline
        .remove_duplicates()
        .handle_missing_values({'age': 'median'})
    ).to_csv("clean_output.csv", mode='a', header=False)
```
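
For comparison, the same streaming pattern can be sketched with pandas alone, using `pd.read_csv(..., chunksize=...)`. This is a stand-in for the idea, not the package's implementation: the function name `clean_in_chunks` and the sample data are hypothetical. It also shows two things to keep in mind with chunking: statistics such as the median are computed per chunk, and duplicates that span chunk boundaries are not detected.

```python
import io

import pandas as pd


def clean_in_chunks(source, sink, chunk_size=50_000):
    """Stream `source` in chunks, clean each chunk, and append it to `sink`."""
    rows_written = 0
    for i, chunk in enumerate(pd.read_csv(source, chunksize=chunk_size)):
        cleaned = chunk.drop_duplicates().assign(
            # The median is computed per chunk, not over the whole file.
            age=lambda d: d["age"].fillna(d["age"].median())
        )
        # Write the header once, with the first chunk only.
        cleaned.to_csv(sink, index=False, header=(i == 0))
        rows_written += len(cleaned)
    return rows_written


# In-memory stand-ins for a large input file and the output file.
raw = io.StringIO("id,age\n1,30\n1,30\n2,\n3,50\n4,40\n4,40\n")
out = io.StringIO()
written = clean_in_chunks(raw, out, chunk_size=3)
```

Writing the header only for the first chunk keeps the output self-describing, at the cost of needing to know which chunk is first.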

### Schema Validation
Enforce strict data contracts using YAML schemas:

```python
pipeline.validate_schema("schema.yaml", schema_name="customer_data")
```
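
The package's YAML schema format is not shown here, so the following is only a sketch of what a data contract check involves, written against plain pandas. The schema layout (a dict with `dtype` and `nullable` keys) and the function `check_schema` are hypothetical and may differ from what `validate_schema` expects.

```python
import pandas as pd

# Hypothetical schema layout; the package's real YAML format may differ.
CUSTOMER_SCHEMA = {
    "id": {"dtype": "int64", "nullable": False},
    "age": {"dtype": "float64", "nullable": True},
}


def check_schema(df, schema):
    """Return a list of human-readable violations (empty if the frame conforms)."""
    problems = []
    for column, rule in schema.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
            continue
        if str(df[column].dtype) != rule["dtype"]:
            problems.append(
                f"{column}: expected {rule['dtype']}, got {df[column].dtype}"
            )
        if not rule["nullable"] and df[column].isna().any():
            problems.append(f"{column}: null values not allowed")
    return problems


good = pd.DataFrame({"id": [1, 2], "age": [30.0, None]})
bad = pd.DataFrame({"id": [1.5, None]})

good_problems = check_schema(good, CUSTOMER_SCHEMA)
bad_problems = check_schema(bad, CUSTOMER_SCHEMA)
```

Collecting all violations before reporting makes it easier to fix a dataset in one pass than failing on the first mismatch.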

## Configuration

Control pipeline behavior via `DataCleaningPipeline.config.PipelineConfig` or `config.yaml`:
- `strict_mode`: Enforce strict error handling (raises `PipelineError` on failure instead of logging a warning).
- `log_level`: Set logging verbosity (DEBUG, INFO, WARNING, ERROR).
- `max_checkpoints`: Number of checkpoints kept for rollback; fewer checkpoints use less memory.
- `max_memory_mb`: Strict memory limit (MB) for creating checkpoints. Older checkpoints are dropped if the limit is exceeded.
- `log_file`: Enable file-based logging (JSON formatted).
- `allow_custom_functions`: Enable or disable `apply_custom_function` for security.
- `allow_file_operations`: Enable or disable file system access (restricted to the current working directory).
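
To make the interaction between `max_checkpoints` and `max_memory_mb` concrete, here is a minimal sketch of a bounded rollback history that evicts the oldest snapshots first. The class `CheckpointHistory` and its `sizeof` parameter are hypothetical; the package's internal bookkeeping may differ.

```python
from collections import deque


class CheckpointHistory:
    """Keep recent checkpoints, evicting the oldest when a limit is exceeded."""

    def __init__(self, max_checkpoints, max_bytes, sizeof=len):
        self.max_checkpoints = max_checkpoints
        self.max_bytes = max_bytes
        self.sizeof = sizeof  # callable returning a snapshot's size in bytes
        self._items = deque()
        self._bytes = 0

    def push(self, snapshot):
        self._items.append(snapshot)
        self._bytes += self.sizeof(snapshot)
        # Drop the oldest snapshots first, but always keep the newest one.
        while len(self._items) > 1 and (
            len(self._items) > self.max_checkpoints or self._bytes > self.max_bytes
        ):
            self._bytes -= self.sizeof(self._items.popleft())

    def rollback(self):
        """Remove and return the most recent snapshot."""
        snapshot = self._items.pop()
        self._bytes -= self.sizeof(snapshot)
        return snapshot

    def __len__(self):
        return len(self._items)


# Byte strings stand in for DataFrame snapshots to keep the example small.
history = CheckpointHistory(max_checkpoints=3, max_bytes=10)
for snap in (b"aaaa", b"bbbb", b"cccc", b"dd"):
    history.push(snap)
```

For DataFrame snapshots, one option is to pass `sizeof=lambda df: int(df.memory_usage(deep=True).sum())` and set `max_bytes` to the megabyte limit multiplied by `1024 * 1024`.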

## Security & Reliability Features

- **Secure Logging**: Logs are strictly formatted as JSON to prevent injection.
- **Path Traversal Protection**: File operations are restricted to the Current Working Directory.
- **Memory Guardrails**: Prevents Out-Of-Memory errors by actively managing checkpoint history size.
- **Input Validation**: Strict validation for all configuration parameters.
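
Path traversal protection of the kind described above is commonly implemented by resolving the requested path and checking that it stays under an allowed base directory. The sketch below shows that general technique with `pathlib`; the function `resolve_inside` is hypothetical and is not taken from the package. A temporary directory stands in for the working directory.

```python
import tempfile
from pathlib import Path


def resolve_inside(base_dir, user_path):
    """Resolve user_path against base_dir and reject anything that escapes it."""
    base = Path(base_dir).resolve()
    # Joining with an absolute user_path discards base, so that case is
    # caught by the containment check below as well.
    candidate = (base / user_path).resolve()
    try:
        candidate.relative_to(base)
    except ValueError:
        raise PermissionError(f"path escapes {base}: {user_path}") from None
    return candidate


base = tempfile.mkdtemp()  # stand-in for the current working directory
safe = resolve_inside(base, "data/clean.csv")

try:
    resolve_inside(base, "../secrets.txt")
    escaped = True
except PermissionError:
    escaped = False
```

Resolving before checking matters: comparing raw strings would miss `..` segments and symlinks that point outside the base directory.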



## License

MIT
