Metadata-Version: 2.2
Name: tidyfit
Version: 0.5.0
Summary: Reusable ML utilities focused on leakage-safe preprocessing and reproducibility
Author: Omar Abdalla
License: MIT
Keywords: machine-learning,utilities,reproducibility,preprocessing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.26
Requires-Dist: pandas>=2.1
Provides-Extra: sklearn
Requires-Dist: scikit-learn>=1.4; extra == "sklearn"
Provides-Extra: viz
Requires-Dist: matplotlib>=3.8; extra == "viz"
Provides-Extra: torch
Requires-Dist: torch>=2.2; extra == "torch"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: black>=24.0; extra == "dev"
Requires-Dist: mypy>=1.10; extra == "dev"

# tidyfit

[![CI](https://img.shields.io/github/actions/workflow/status/omaression/tidyfit/ci.yml?branch=main)](https://github.com/omaression/tidyfit/actions/workflows/ci.yml)
[![Python 3.12](https://img.shields.io/badge/python-3.12-blue)](https://www.python.org/downloads/release/python-3120/)
[![License: MIT](https://img.shields.io/badge/license-MIT-green)](./LICENSE)

A lightweight, reproducible ML utilities toolkit focused on **clean data workflows**, **leakage-safe preprocessing**, and **practical evaluation helpers**.

`tidyfit` is designed to work in layers:
- **Core (default):** numpy + pandas only
- **Optional extras:** sklearn / matplotlib / torch when you need them

---

## ✨ Why tidyfit?

Most ML projects repeat the same boilerplate:
- splitting data safely
- tracking class imbalance
- fitting transforms on train and reusing on validation/test
- generating metrics/curve data for evaluation
- maintaining reproducibility across runs

`tidyfit` centralizes those patterns in a reusable package so you can move faster with fewer mistakes.

---

## 📦 Installation

### Core (no sklearn)
```bash
pip install tidyfit
```

### With optional extras
```bash
pip install "tidyfit[sklearn]"
pip install "tidyfit[viz]"
pip install "tidyfit[torch]"
```

### Developer install
```bash
pip install -e ".[dev]"
pre-commit install  # assumes pre-commit is already installed; it is not part of the [dev] extra
```

---

## 🚀 Quick start (core-only)

```python
import pandas as pd
from tidyfit.data import stratified_split
from tidyfit.preprocessing import fit_standard_scaler, apply_standard_scaler
from tidyfit.metrics import threshold_sweep

# Example DataFrame with two features (f1, f2) and a binary target (y)
df = pd.DataFrame(
    {
        "f1": [1.2, 0.5, 3.1, 2.2, 0.1, 1.7],
        "f2": [10, 12, 18, 17, 9, 13],
        "y": [0, 0, 1, 1, 0, 1],
    }
)

(train_X, train_y), (val_X, val_y), (test_X, test_y) = stratified_split(
    df, target="y", random_state=42
)

state = fit_standard_scaler(train_X, cols=["f1", "f2"])
train_X_scaled = apply_standard_scaler(train_X, state)
val_X_scaled = apply_standard_scaler(val_X, state)

# Placeholder scores, one per validation row (replace with your model's output)
val_scores = [0.5] * len(val_y)
sweep = threshold_sweep(val_y, val_scores)
print(sweep.head())
```
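
To make the fit-on-train, apply-everywhere idea concrete, here is a minimal standalone sketch of leakage-safe standardization. It is not tidyfit's implementation: the names `fit_scaler_sketch` and `apply_scaler_sketch` are hypothetical, and rows are plain dicts rather than DataFrames so the example runs with the standard library alone.

```python
from statistics import fmean, pstdev


def fit_scaler_sketch(rows, cols):
    """Learn per-column mean and std from the training rows only."""
    state = {}
    for col in cols:
        values = [row[col] for row in rows]
        std = pstdev(values)
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        state[col] = (fmean(values), std if std > 0 else 1.0)
    return state


def apply_scaler_sketch(rows, state):
    """Scale rows with previously learned statistics; nothing is refit here."""
    scaled = []
    for row in rows:
        new_row = dict(row)
        for col, (mean, std) in state.items():
            new_row[col] = (row[col] - mean) / std
        scaled.append(new_row)
    return scaled


train = [{"f1": 1.0}, {"f1": 3.0}]
val = [{"f1": 5.0}]

# Statistics come from train only, then get reused unchanged on val.
state = fit_scaler_sketch(train, cols=["f1"])
train_scaled = apply_scaler_sketch(train, state)
val_scaled = apply_scaler_sketch(val, state)
```

Because the validation rows never influence `state`, information from held-out data cannot leak into the transform.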

---

## 🧰 CLI usage

After installation, the `tidyfit` command is available:

```bash
tidyfit --help
```

### Data summary
```bash
tidyfit data-summary data.csv
```

### Imbalance report
```bash
tidyfit imbalance-report data.csv --target y
```

### Threshold sweep
```bash
tidyfit threshold-sweep preds.csv --label-col y_true --score-col y_score
```
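
For readers unfamiliar with threshold sweeps, the sketch below shows the general computation: binarize the scores at each threshold and record precision and recall. It is a generic illustration, not tidyfit's code. The function name `threshold_sweep_sketch`, the output keys, and the convention that `score >= threshold` counts as positive are all assumptions.

```python
def threshold_sweep_sketch(y_true, y_score, thresholds):
    """Return one dict per threshold with precision and recall."""
    rows = []
    for t in thresholds:
        # Assumed convention: a score at or above the threshold is positive.
        y_pred = [1 if s >= t else 0 for s in y_score]
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for yt, yp in pairs if yt == 1 and yp == 1)
        fp = sum(1 for yt, yp in pairs if yt == 0 and yp == 1)
        fn = sum(1 for yt, yp in pairs if yt == 1 and yp == 0)
        # Report 0.0 when a metric is undefined (no predicted or no actual positives).
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        rows.append({"threshold": t, "precision": precision, "recall": recall})
    return rows


sweep = threshold_sweep_sketch(
    y_true=[0, 1, 1, 0],
    y_score=[0.2, 0.8, 0.6, 0.7],
    thresholds=[0.5, 0.75],
)
```

In this toy input, raising the threshold from 0.5 to 0.75 trades recall for precision, which is the trade-off a sweep is meant to expose.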

### Stratified split check
```bash
tidyfit stratified-split data.csv --target y
```
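
As background on what a stratified split preserves, here is a small sketch that splits row indices per class so each subset keeps roughly the original class proportions. The function name, the `val_frac`/`test_frac` parameters, and their defaults are hypothetical and do not describe tidyfit's actual options.

```python
import random
from collections import defaultdict


def stratified_split_sketch(labels, val_frac=0.2, test_frac=0.2, seed=42):
    """Split indices into train/val/test while keeping class ratios similar."""
    rng = random.Random(seed)  # local RNG so the global state is untouched
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_frac)
        n_test = round(len(indices) * test_frac)
        # Carve each class separately, then pool the pieces.
        val.extend(indices[:n_val])
        test.extend(indices[n_val:n_val + n_test])
        train.extend(indices[n_val + n_test:])
    return train, val, test


labels = [0] * 10 + [1] * 5
train_idx, val_idx, test_idx = stratified_split_sketch(labels)
```

With 10 negatives and 5 positives, each subset ends up with a 2:1 class ratio, matching the full dataset.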

### Curve data (ROC/PR)
```bash
tidyfit curve-data preds.csv --label-col y_true --score-col y_score
```

### Curve data + exported artifacts
```bash
tidyfit curve-data preds.csv --label-col y_true --score-col y_score --out-dir artifacts/curves
# saves:
# - roc_points.csv
# - pr_points.csv
# - roc.png (if matplotlib is installed, e.g. via the [viz] extra)
# - pr.png (if matplotlib is installed, e.g. via the [viz] extra)
```

### Environment snapshot
```bash
tidyfit env-snapshot --out artifacts/env.json
```
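
The kind of information an environment snapshot typically records can be sketched with the standard library. This is an illustration under assumptions, not the format tidyfit writes: the function name `snapshot_environment_sketch`, the JSON keys, and the default package list are all hypothetical.

```python
import json
import platform
import sys
from importlib import metadata
from pathlib import Path


def snapshot_environment_sketch(path, packages=("numpy", "pandas")):
    """Write interpreter, platform, and package versions to a JSON file."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # record missing packages explicitly

    snapshot = {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": versions,
    }
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(snapshot, indent=2), encoding="utf-8")
    return snapshot
```

Calling `snapshot_environment_sketch("artifacts/env.json")` would create the `artifacts/` directory if needed and write the snapshot there.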

---

## 🧩 API Overview

### `tidyfit.data`
- `train_val_test_split(...)`
- `stratified_split(...)`
- `check_nulls(df)`
- `validate_schema(df, schema)`
- `distribution_summary(df, cols)`
- `imbalance_report(y)`

### `tidyfit.preprocessing`
- `fit_standard_scaler(df, cols)`
- `apply_standard_scaler(df, state)`
- `fit_one_hot(df, cols)`
- `apply_one_hot(df, vocab)`

### `tidyfit.features`
- `interaction_terms(df, cols)`
- `polynomial_features(df, cols, degree=2)`
- `drop_low_variance(df, threshold)`
- `drop_high_correlation(df, threshold)`

### `tidyfit.metrics`
- `classification_report_per_class(y_true, y_pred)`
- `threshold_sweep(y_true, y_score, thresholds)`
- `metric_confidence_interval(values, alpha=0.95)`
- `calibration_table(y_true, y_prob, bins=10)`

### `tidyfit.eval_viz`
- `confusion_matrix_data(y_true, y_pred)`
- `roc_curve_data(y_true, y_score, thresholds=None)`
- `pr_curve_data(y_true, y_score, thresholds=None)`
- `maybe_plot_curve(...)`
- `save_curve_plot(...)`

### `tidyfit.cv`
- `kfold_indices(...)`
- `stratified_kfold_indices(...)`
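
As a rough picture of what k-fold index generation involves, the sketch below partitions `range(n_samples)` into contiguous validation folds and pairs each with the remaining training indices. The name `kfold_indices_sketch` and its parameters are hypothetical; tidyfit's own signature is not shown in this README, and a real implementation may shuffle first.

```python
def kfold_indices_sketch(n_samples, n_splits):
    """Return a list of (train_indices, val_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    fold_sizes = [base + (1 if i < extra else 0) for i in range(n_splits)]

    folds = []
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, val))
        start += size
    return folds


folds = kfold_indices_sketch(n_samples=7, n_splits=3)
```

Every sample lands in exactly one validation fold, so each row is used for validation once and for training in the other folds.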

### `tidyfit.reproducibility`
- `set_global_seed(seed, torch=False)`
- `snapshot_environment(path)`

### `tidyfit.model_io`
- `save_model(obj, path, metadata=None)`
- `load_model(path)`

### `tidyfit.tracking`
- `log_experiment(path, payload)`

### `tidyfit.sklearn_extra` *(optional: requires `[sklearn]`)*
- `sklearn_stratified_split(...)`
- `sklearn_stratified_cv_scores(...)`

---

## 🧪 Optional dependency behavior

The core of `tidyfit` works without scikit-learn installed.

If you import the optional module without installing the extra:
```python
from tidyfit.sklearn_extra import sklearn_stratified_split
```
You will get a clear `ImportError` instructing you to install:
```bash
pip install "tidyfit[sklearn]"
```

This keeps the base install minimal and fast.
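
One common way to get this behavior is a small import guard that converts a missing dependency into an `ImportError` with an install hint. The sketch below shows that general pattern; the helper name `require_extra_sketch` and the message wording are assumptions, not tidyfit's actual code.

```python
import importlib


def require_extra_sketch(module_name, extra):
    """Import an optional dependency or raise ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f'Install it with: pip install "tidyfit[{extra}]"'
        ) from exc


# Demonstrate the failure path with a module name that does not exist.
try:
    require_extra_sketch("module_that_is_not_installed_xyz", extra="sklearn")
except ImportError as err:
    message = str(err)
```

Deferring the import to the optional module keeps `import tidyfit` itself free of heavy dependencies.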

---

## 🔁 Reproducibility pattern

Recommended minimal pattern per experiment:

```python
from tidyfit.reproducibility import set_global_seed, snapshot_environment
from tidyfit.tracking import log_experiment

set_global_seed(42)
snapshot_environment("artifacts/env.json")
log_experiment("artifacts/experiments.jsonl", {"run": "baseline", "f1": 0.84})
```
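
For context, the two ideas behind this pattern (seeding every random number generator in use and appending one JSON record per run) can be sketched as follows. The helper names are hypothetical and the behavior is an assumption about what such utilities usually do, not a description of tidyfit's internals.

```python
import json
import random
from pathlib import Path


def set_seed_sketch(seed):
    """Seed Python's RNG, and NumPy's legacy global RNG if NumPy is available."""
    random.seed(seed)
    try:
        import numpy as np
    except ImportError:
        return
    np.random.seed(seed)


def log_experiment_sketch(path, payload):
    """Append one JSON object per line (JSONL) so earlier runs are never overwritten."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(payload, sort_keys=True) + "\n")


# Reseeding with the same value reproduces the same random draw.
set_seed_sketch(42)
first_draw = random.random()
set_seed_sketch(42)
second_draw = random.random()
```

An append-only JSONL file is easy to diff and can be loaded line by line later, for example with `pandas.read_json(path, lines=True)`.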

---

## ✅ Quality and project hygiene

- Tests: `python -m pytest -q`
- Lint: `python -m ruff check src tests`
- Format check: `python -m ruff format --check src tests`
- Types: `python -m mypy src/tidyfit`

CI workflow: [`.github/workflows/ci.yml`](.github/workflows/ci.yml)

---

## 🚢 Release workflow

Run the local release checks and build:

```bash
./scripts/release.sh
```

This performs:
1. lint/format/type/test gates
2. build artifacts (`dist/`)
3. `twine check`

For TestPyPI publish:
- workflow: [`.github/workflows/publish-testpypi.yml`](.github/workflows/publish-testpypi.yml)
- required secret: `TEST_PYPI_API_TOKEN`

---

## 🧾 Versioning

See [CHANGELOG.md](./CHANGELOG.md) for version history (`0.1.0` → `0.5.0`).

---

## 📄 License

MIT — see [LICENSE](./LICENSE).
