Metadata-Version: 2.4
Name: smartclean-py
Version: 0.1.0
Summary: Data cleaning made stupid simple. Outlier detection, clipping, removal, flagging — one line each.
Author-email: Abdisalam Ahmed Ali <jbound287@email.com>
License: MIT
Project-URL: Homepage, https://github.com/HAIDER6190/smart-clean
Project-URL: Documentation, https://github.com/HAIDER6190/smart-clean#readme
Project-URL: Repository, https://github.com/HAIDER6190/smart-clean
Project-URL: Bug Tracker, https://github.com/HAIDER6190/smart-clean/issues
Keywords: data-cleaning,outlier-detection,iqr,pandas,machine-learning,preprocessing,data-science
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.3.0
Requires-Dist: numpy>=1.20.0
Provides-Extra: viz
Requires-Dist: matplotlib>=3.4.0; extra == "viz"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: matplotlib>=3.4.0; extra == "dev"
Dynamic: license-file

# SmartClean

**Data cleaning made stupid simple.**

Outlier detection, clipping, removal, flagging, and replacement — one line each. Built for data scientists who are tired of writing the same 20 lines of outlier handling code in every notebook.

[![PyPI version](https://badge.fury.io/py/smartclean-py.svg)](https://pypi.org/project/smartclean-py/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)

---

## Why SmartClean?

Every data science project has the same boilerplate:

```python
# The old way — boring, repetitive, error-prone
Q1 = df["price"].quantile(0.25)
Q3 = df["price"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
df = df[(df["price"] >= lower) & (df["price"] <= upper)]
# ... repeat for every column ...
```

With SmartClean:

```python
from smartclean import Cleaner

cl = Cleaner(df)
cl.remove("price", method="iqr")
df_clean = cl.get_df()
```

**One call per cleaning step. Done.**

---

## Installation

```bash
pip install smartclean-py
```

With visualization support:

```bash
pip install "smartclean-py[viz]"
```

---

## Quick Start

```python
import pandas as pd
from smartclean import Cleaner

# Load your data
df = pd.read_csv("your_data.csv")

# Create a cleaner
cl = Cleaner(df)

# Chain multiple cleaning operations
df_clean = (
    cl.clip("price", method="iqr")          # Cap extreme prices
      .remove("size", lower=0)              # Remove negative sizes
      .replace("age", upper=120, fill="median")  # Replace age outliers with median
      .flag("income", method="zscore")       # Flag income outliers (keep data)
      .get_df()
)
```

---

## 5 Actions

### 1. `clip` — Cap values at bounds (keep all rows)

```python
cl.clip("price", method="iqr")
cl.clip("price", lower=0, upper=1_000_000)
cl.clip("rating", lower=1, upper=5)
cl.clip(["price", "size"], method="iqr", k=2.0)  # multiple columns
cl.clip_all(method="iqr")  # all numeric columns
```
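
For readers curious what an IQR clip amounts to, here is a minimal plain-Python sketch. It is not SmartClean's implementation: `iqr_bounds` and `clip_values` are illustrative names, and the quartile convention (linear interpolation, the same default pandas uses for `quantile`) is an assumption.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip_values(values, lower, upper):
    """Cap every value at the bounds; the number of values never changes."""
    return [min(max(v, lower), upper) for v in values]


prices = [10, 12, 11, 13, 12, 14, 11, 200]
lower, upper = iqr_bounds(prices)
clipped = clip_values(prices, lower, upper)

print(lower, upper)  # 7.625 16.625
print(clipped)       # the 200 is capped at 16.625, the rest are untouched
```

Raising `k` widens the fences, so fewer values get capped.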

### 2. `remove` — Delete rows with outliers

```python
cl.remove("price", method="iqr")
cl.remove("size", lower=0)
cl.remove("age", lower=0, upper=120)
cl.remove_all(method="zscore", exclude="id")
```
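
Unlike clipping, removal drops whole rows. The sketch below shows the idea with manual bounds on plain dictionaries. `remove_outside` is a hypothetical helper, not part of SmartClean, and treating a missing bound as "no limit on that side" is a simplification made only for this example.

```python
def remove_outside(rows, column, lower=None, upper=None):
    """Keep only rows whose value in `column` lies within [lower, upper]."""
    def inside(value):
        above = lower is None or value >= lower
        below = upper is None or value <= upper
        return above and below

    return [row for row in rows if inside(row[column])]


rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -5},
    {"id": 3, "age": 67},
    {"id": 4, "age": 140},
]

kept = remove_outside(rows, "age", lower=0, upper=120)
print([row["id"] for row in kept])  # [1, 3]
```

The other columns of a removed row disappear too, which is why removal on many columns can shrink a dataset quickly.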

### 3. `flag` — Mark outliers without modifying data

```python
cl.flag("salary")
# Adds column: salary_is_outlier = True/False

cl.flag("salary", suffix="_outlier_flag")
# Adds column: salary_outlier_flag
```
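
As a rough illustration of flagging, the sketch below adds a boolean column and leaves the original values alone. It assumes IQR fences with `k=1.5`; the default method `flag` uses when none is given is not stated here, and `flag_outliers` is an illustrative name only.

```python
from statistics import quantiles


def flag_outliers(rows, column, suffix="_is_outlier", k=1.5):
    """Return copies of the rows with an extra boolean `<column><suffix>` key."""
    values = [row[column] for row in rows]
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    lower, upper = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [
        {**row, column + suffix: not (lower <= row[column] <= upper)}
        for row in rows
    ]


rows = [{"salary": s} for s in (40_000, 42_000, 45_000, 47_000, 50_000, 400_000)]
flagged = flag_outliers(rows, "salary")

for row in flagged:
    print(row)  # only the 400_000 row carries salary_is_outlier=True
```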

### 4. `replace` — Swap outliers with another value

```python
cl.replace("age", fill="median")    # median of non-outlier values
cl.replace("age", fill="mean")      # mean of non-outlier values
cl.replace("score", fill=0)         # custom value
cl.replace("price", fill="null")    # replace with NaN
cl.replace("salary", fill="bound")  # nearest bound value
```
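
The fill options differ mainly in where the replacement value comes from. Below is a minimal sketch under stated assumptions: bounds are passed in explicitly, `"null"` is modelled as `math.nan`, and `replace_outliers` is a hypothetical stand-in rather than the library's code.

```python
import math
from statistics import mean, median


def replace_outliers(values, lower, upper, fill="median"):
    """Return a copy of `values` with out-of-range entries replaced."""
    # Statistics are taken over the in-range values only, so the
    # outliers do not distort their own replacement.
    inliers = [v for v in values if lower <= v <= upper]

    def replacement(value):
        if fill == "median":
            return median(inliers)
        if fill == "mean":
            return mean(inliers)
        if fill == "null":
            return math.nan
        if fill == "bound":
            return lower if value < lower else upper
        return fill  # any other value is used as-is

    return [v if lower <= v <= upper else replacement(v) for v in values]


ages = [25, 31, 47, 52, 300, -4]
print(replace_outliers(ages, 0, 120, fill="median"))  # [25, 31, 47, 52, 39.0, 39.0]
print(replace_outliers(ages, 0, 120, fill="bound"))   # [25, 31, 47, 52, 120, 0]
```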

### 5. `detect` — Read-only analysis

```python
result = cl.detect("salary", method="iqr")
print(result)
print(result.outlier_count)
print(result.to_dict())

# Detect all columns
results = cl.detect_all()
```
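
To make "read-only" concrete, here is a small sketch of a result object that counts outliers without touching the input. Only `outlier_count` and `to_dict()` appear in the example above; the class name `DetectionResult`, the other fields, and the `detect` helper are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass
class DetectionResult:
    column: str
    lower: float
    upper: float
    outlier_count: int
    total: int

    def to_dict(self):
        return asdict(self)


def detect(values, column, lower, upper):
    """Count values outside [lower, upper]; `values` is never modified."""
    count = sum(1 for v in values if not (lower <= v <= upper))
    return DetectionResult(column, lower, upper, count, len(values))


salaries = [40_000, 42_000, 45_000, 400_000]
result = detect(salaries, "salary", lower=30_000, upper=60_000)
print(result.outlier_count)  # 1
print(result.to_dict())
```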

---

## Outlier Detection Methods

| Method | Parameter | Default | Description |
|---|---|---|---|
| `iqr` | `k` | 1.5 | IQR multiplier (1.5 = standard, 3.0 = extreme only) |
| `zscore` | `threshold` | 3 | Number of standard deviations from the mean |
| `percentile` | `lower_pct`, `upper_pct` | 0.01, 0.99 | Lower and upper cutoffs as fractions (1st and 99th percentiles) |
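
The three methods in the table can be sketched as bound calculations in plain Python. This is an illustration under assumptions, not the library's code: the function names are made up, quantiles use linear interpolation, and the z-score version uses the sample standard deviation.

```python
from statistics import mean, stdev


def quantile(values, q):
    """Linearly interpolated quantile for a fraction q between 0 and 1."""
    data = sorted(values)
    pos = q * (len(data) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def iqr_bounds(values, k=1.5):
    q1, q3 = quantile(values, 0.25), quantile(values, 0.75)
    return q1 - k * (q3 - q1), q3 + k * (q3 - q1)


def zscore_bounds(values, threshold=3):
    m, s = mean(values), stdev(values)
    return m - threshold * s, m + threshold * s


def percentile_bounds(values, lower_pct=0.01, upper_pct=0.99):
    return quantile(values, lower_pct), quantile(values, upper_pct)


values = list(range(1, 102))  # 1, 2, ..., 101
print(iqr_bounds(values))
print(zscore_bounds(values))
print(percentile_bounds(values))
```

On this evenly spread sample the IQR and z-score bounds fall outside the data, so they flag nothing. The percentile method always cuts off the tails, whether or not the tail values are unusual.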

You can also set **manual bounds**:

```python
cl.clip("age", lower=0, upper=120)  # override method with custom bounds
cl.clip("age", lower=0)             # manual lower + method-calculated upper
```
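
One way to picture how manual and method-calculated bounds combine is a small merge step in which a manual value, when given, takes priority on its side. `resolve_bounds` is a hypothetical helper and the numbers are made up for the example.

```python
def resolve_bounds(method_bounds, lower=None, upper=None):
    """Prefer a manual bound where one is given, else keep the method's."""
    method_lower, method_upper = method_bounds
    return (
        method_lower if lower is None else lower,
        method_upper if upper is None else upper,
    )


method_bounds = (-12.5, 97.5)  # e.g. what an IQR calculation might return

print(resolve_bounds(method_bounds, lower=0, upper=120))  # (0, 120)
print(resolve_bounds(method_bounds, lower=0))             # (0, 97.5)
```

Comparing against `None` rather than using truthiness matters here, because `lower=0` is a valid manual bound.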

---

## Health Report

Quick scan of your entire dataset:

```python
cl.health()
```

```
══════════════════════════════════════════════════════════════
  📊 DATA HEALTH REPORT
══════════════════════════════════════════════════════════════
  Total Rows:    1,000
  Numeric Cols:  4
  Method:        iqr
──────────────────────────────────────────────────────────────
  Column               Outliers        %    Missing   Status
──────────────────────────────────────────────────────────────
  price                      10     1.0%          0   🟡 Warn
  size                        5     0.5%          0   🟢 OK
  rating                      4     0.4%          0   🟢 OK
  age                         3     0.3%          0   🟢 OK
══════════════════════════════════════════════════════════════
```

---

## Pipeline — Save & Replay

Apply the same cleaning steps to new data (e.g., a test set):

```python
# Save
cl.save_pipeline("cleaning_steps.json")

# Apply to new data
cl_test = Cleaner(df_test)
cl_test.apply_pipeline("cleaning_steps.json")
df_test_clean = cl_test.get_df()
```
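
The save-and-replay idea can be sketched with the standard `json` module. The file layout below (a list of steps with `action`, `column`, `lower`, `upper`) is purely hypothetical, since SmartClean's actual JSON format is not documented here. The sketch also assumes resolved bounds are stored, so new data is cleaned with the bounds learned earlier rather than recomputed ones.

```python
import json
import os
import tempfile

# Hypothetical step records; not SmartClean's real file format.
steps = [
    {"action": "clip", "column": "price", "lower": 0.0, "upper": 500.0},
    {"action": "remove", "column": "age", "lower": 0, "upper": 120},
]


def apply_steps(rows, steps):
    """Replay each recorded step, in order, on a list of row dicts."""
    for step in steps:
        col, lo, hi = step["column"], step["lower"], step["upper"]
        if step["action"] == "clip":
            rows = [{**r, col: min(max(r[col], lo), hi)} for r in rows]
        elif step["action"] == "remove":
            rows = [r for r in rows if lo <= r[col] <= hi]
        else:
            raise ValueError(f"unknown action: {step['action']}")
    return rows


path = os.path.join(tempfile.mkdtemp(), "cleaning_steps.json")
with open(path, "w") as f:
    json.dump(steps, f, indent=2)

with open(path) as f:
    loaded = json.load(f)

test_rows = [{"price": 900.0, "age": 30}, {"price": 20.0, "age": 150}]
cleaned = apply_steps(test_rows, loaded)
print(cleaned)  # [{'price': 500.0, 'age': 30}]
```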

---

## Undo & Reset

```python
cl.undo()   # undo last step
cl.reset()  # reset to original data
```
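
A common way to support undo and reset is to keep a snapshot before each step. The sketch below shows that pattern on a plain list. Whether SmartClean stores snapshots this way is an assumption, and the `History` class is illustrative only.

```python
import copy


class History:
    """Keep the original data plus one snapshot per applied step."""

    def __init__(self, data):
        self._original = copy.deepcopy(data)
        self._snapshots = []
        self.data = data

    def apply(self, func):
        self._snapshots.append(copy.deepcopy(self.data))
        self.data = func(self.data)
        return self

    def undo(self):
        # Restore the state saved before the most recent step, if any.
        if self._snapshots:
            self.data = self._snapshots.pop()
        return self

    def reset(self):
        self._snapshots.clear()
        self.data = copy.deepcopy(self._original)
        return self


h = History([1, 2, 300])
h.apply(lambda xs: [min(x, 10) for x in xs])  # cap values at 10
h.apply(lambda xs: [x for x in xs if x > 1])  # drop values of 1 or less
print(h.data)          # [2, 10]
print(h.undo().data)   # [1, 2, 10]
print(h.reset().data)  # [1, 2, 300]
```

Snapshots trade memory for simplicity: each step costs one copy of the data.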

---

## Visualization

Requires `pip install "smartclean-py[viz]"`

```python
cl.plot("price")                   # before/after comparison
cl.plot_detect("price", method="iqr")  # highlight outlier bounds
```

---

## Summary & Reports

```python
cl.summary()       # overview of all cleaning steps
cl.get_reports()   # detailed report per step
cl.get_history()   # raw history as list of dicts
```

---

## API Reference

### `Cleaner(df)`

| Method | Description | Modifies Data |
|---|---|---|
| `clip(columns, method, lower, upper)` | Cap values at bounds | ✅ |
| `clip_all(method, exclude)` | Clip all numeric columns | ✅ |
| `remove(columns, method, lower, upper)` | Remove outlier rows | ✅ |
| `remove_all(method, exclude)` | Remove from all numeric columns | ✅ |
| `flag(columns, method, suffix)` | Add boolean outlier column | ✅ (new col) |
| `flag_all(method, exclude)` | Flag all numeric columns | ✅ (new cols) |
| `replace(columns, method, lower, upper, fill)` | Replace outlier values | ✅ |
| `detect(column, method)` | Detect outliers (read-only) | ❌ |
| `detect_all(method, exclude)` | Detect in all columns | ❌ |
| `health(method)` | Print health report | ❌ |
| `summary()` | Print cleaning summary | ❌ |
| `plot(column)` | Before/after visualization | ❌ |
| `plot_detect(column, method)` | Outlier bounds plot | ❌ |
| `save_pipeline(filepath)` | Save steps to JSON | ❌ |
| `apply_pipeline(filepath)` | Replay steps from JSON | ✅ |
| `undo()` | Undo last step | ✅ |
| `reset()` | Reset to original data | ✅ |
| `get_df()` | Return cleaned DataFrame | ❌ |

---

## Requirements

- Python >= 3.8
- pandas >= 1.3.0
- numpy >= 1.20.0
- matplotlib >= 3.4.0 *(optional, for visualization)*

---

## License

MIT License — see [LICENSE](LICENSE) for details.

---

## Contributing

Contributions are welcome! Please open an issue or submit a pull request on [GitHub](https://github.com/HAIDER6190/smart-clean).
