Metadata-Version: 2.4
Name: pwml
Version: 1.2.1
Summary: Python Wrappers for Machine Learning
Home-page: https://github.com/braibaud/pwml
Author: Benjamin Raibaud
Author-email: Benjamin Raibaud <braibaud@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/braibaud/pwml
Project-URL: Repository, https://github.com/braibaud/pwml
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy<3,>=1.20.0
Requires-Dist: pillow<12,>=8.0.0
Requires-Dist: pandas<3,>=1.3.0
Requires-Dist: urllib3<3,>=1.26.0
Requires-Dist: scikit-learn<2,>=1.2.0
Requires-Dist: scipy<2,>=1.7.0
Requires-Dist: joblib<2,>=1.1.0
Requires-Dist: matplotlib<4,>=3.5.0
Requires-Dist: seaborn<1,>=0.13.0
Requires-Dist: sentence-transformers<4,>=2.2.0
Provides-Extra: timeseries
Requires-Dist: prophet<2,>=1.1; extra == "timeseries"
Requires-Dist: statsmodels<1,>=0.13.0; extra == "timeseries"
Provides-Extra: neptune
Requires-Dist: neptune<4,>=3.0; extra == "neptune"
Provides-Extra: mssql
Requires-Dist: pymssql<3,>=2.2.0; extra == "mssql"
Provides-Extra: all
Requires-Dist: prophet<2,>=1.1; extra == "all"
Requires-Dist: statsmodels<1,>=0.13.0; extra == "all"
Requires-Dist: neptune<4,>=3.0; extra == "all"
Requires-Dist: pymssql<3,>=2.2.0; extra == "all"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# pwml
`pwml` stands for `P`ython `W`rappers for `M`achine `L`earning.

[![Quality Gate Status](https://sonarcloud.io/api/project_badges/measure?project=braibaud_pwml&metric=alert_status)](https://sonarcloud.io/dashboard?id=braibaud_pwml) [![PyPI version](https://badge.fury.io/py/pwml.svg)](https://badge.fury.io/py/pwml)

---

## Requirements

- Python >= 3.8
- See `pyproject.toml` for the full dependency list

---

## Installation

```bash
pip install pwml
```

The package metadata also declares optional extras: `timeseries` (prophet, statsmodels), `neptune`, `mssql` (pymssql), and `all`. Install one with, for example, `pip install "pwml[timeseries]"`.

---

## Modules

### `classifiers` - Hierarchical Classification

`HierarchicalClassifierModel` trains a tree of scikit-learn pipelines, one per node in the label hierarchy. Each node's classifier is selected and tuned independently via `GridSearchCV`. At inference time, predictions cascade top-down through the tree: the label predicted at one level selects the child classifier used at the next.

**Features:**
- Text embedding via sentence-transformers (`all-MiniLM-L6-v2` 384-dim, `all-mpnet-base-v2` 768-dim)
- One-hot encoding for categorical features
- Numeric normalisation to [0, 1] with a configurable policy for out-of-range values (`out_of_range='clip'/'warn'/'raise'`)
- Platt calibration with per-class threshold optimization
- Soft routing: descent stops when prediction confidence falls below a configurable threshold
- Batch inference via `predict_dataframe` with pre-computed embeddings
- Per-node inference latency profiling (`profile=True`)
- Model versioning metadata embedded in saved artefacts
- Evaluation and stratified cross-validation

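The numeric normalisation feature listed above can be pictured with a minimal sketch. `minmax_normalise` is a hypothetical helper, not part of `pwml`; it assumes the range is learnt from training data and that the three policies behave as their names suggest. The library's actual behaviour may differ.

```python
import warnings


def minmax_normalise(value, lo, hi, out_of_range="clip"):
    """Scale `value` to [0, 1] given a training range [lo, hi].

    Hypothetical helper for illustration only.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scaled = (value - lo) / (hi - lo)
    if 0.0 <= scaled <= 1.0:
        return scaled
    if out_of_range == "raise":
        raise ValueError(f"value {value} outside training range [{lo}, {hi}]")
    if out_of_range == "warn":
        warnings.warn(f"value {value} outside training range [{lo}, {hi}]")
        return scaled  # assumption: 'warn' leaves the value unclipped
    if out_of_range == "clip":
        return min(1.0, max(0.0, scaled))
    raise ValueError(f"unknown policy: {out_of_range!r}")


print(minmax_normalise(50.0, lo=0.0, hi=100.0))   # 0.5
print(minmax_normalise(150.0, lo=0.0, hi=100.0))  # 1.0 (clipped)
```
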
#### Training

```python
from pwml.classifiers import hierarchical as hc
from pwml.classifiers import features as fe

model = hc.HierarchicalClassifierModel(
    model_name='my_model',
    experiment_name='experiment_1',
    input_features=[
        fe.InputFeature(feature_name='Style',    feature_type='text'),
        fe.InputFeature(feature_name='Gender',   feature_type='text'),
        fe.InputFeature(feature_name='Brand',    feature_type='text'),
        fe.InputFeature(feature_name='Price',    feature_type='numeric'),
        fe.InputFeature(feature_name='Category', feature_type='category'),
    ],
    output_feature_hierarchy=fe.OutputFeature(
        feature_name='Division',
        child_feature=fe.OutputFeature(feature_name='Class')))

model.load_from_dataframe(data=df)
model.save_model(filepath='my_model.joblib')
```

The model trains `n+1` classifiers where `n` is the number of distinct `Division` values:
one classifier for the top-level `Division` prediction, and one per `Division` value for the `Class` prediction within that division.

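As a quick check of the `n+1` count, the sketch below derives it from a handful of made-up `(Division, Class)` label pairs. The label values are illustrative only.

```python
# Toy (Division, Class) label pairs; the values are made up for illustration.
labels = [
    ("Apparel", "Denim"),
    ("Apparel", "Shirts"),
    ("Footwear", "Sneakers"),
    ("Accessories", "Belts"),
]

divisions = {division for division, _ in labels}

# One root classifier for Division, plus one Class classifier per Division value.
n_classifiers = 1 + len(divisions)
print(n_classifiers)  # 4
```
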
#### Single-sample inference

```python
model = hc.HierarchicalClassifierModel.load_model(filepath='my_model.joblib')

result = model.predict(
    input={'Style': 'slim fit jeans', 'Gender': 'men', 'Brand': 'Acme', 'Price': 49.99, 'Category': 'Bottoms'},
    min_routing_confidence=0.6)

# result is a list of dicts, one per hierarchy level:
# [{'feature_name': 'Division', 'value': 'Apparel', 'confidence': 0.91},
#  {'feature_name': 'Class',    'value': 'Denim',   'confidence': 0.78}]
```

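The effect of `min_routing_confidence` can be sketched with toy stand-ins for the node classifiers. `route` and the lambda classifiers below are hypothetical, not the library's implementation. The sketch also assumes that the weak prediction itself is still reported before descent stops, which the library may handle differently.

```python
def route(sample, root, children, min_routing_confidence=0.6):
    """Descend a two-level hierarchy, stopping on a low-confidence prediction.

    `root` and each value in `children` are callables returning
    (label, confidence). Hypothetical sketch for illustration only.
    """
    results = []
    label, confidence = root(sample)
    results.append(
        {"feature_name": "Division", "value": label, "confidence": confidence})
    if confidence < min_routing_confidence or label not in children:
        return results  # soft routing: do not descend on a weak prediction
    label, confidence = children[label](sample)
    results.append(
        {"feature_name": "Class", "value": label, "confidence": confidence})
    return results


children = {"Apparel": lambda sample: ("Denim", 0.78)}

confident = route({}, lambda sample: ("Apparel", 0.91), children)
uncertain = route({}, lambda sample: ("Apparel", 0.40), children)

print(len(confident))  # 2: Division and Class
print(len(uncertain))  # 1: descent stopped after Division
```
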
#### Batch inference

```python
predictions_df = model.predict_dataframe(data=df)
# Returns df with extra columns: Division_predicted, Division_confidence, Class_predicted, Class_confidence

# With per-node latency profiling
predictions_df, latency = model.predict_dataframe(data=df, profile=True)
# latency: {'Division': 0.0012, 'Division/Apparel': 0.0009, ...}
```

#### Evaluation and cross-validation

```python
metrics, predictions_df = model.evaluate(data=df)

summary, per_fold = model.cross_validate(data=df, n_splits=5, search_n_jobs=4)
print(summary)  # {'Division': {'mean': 0.87, 'std': 0.02}, 'Class': {'mean': 0.74, 'std': 0.04}}
```

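To make the shape of `summary` concrete, the sketch below aggregates per-fold scores into a mean and standard deviation per hierarchy level. The `per_fold` structure and the use of the population standard deviation are assumptions; the values returned by `cross_validate` may be organised differently.

```python
from statistics import mean, pstdev

# Hypothetical per-fold accuracies; the real `per_fold` structure may differ.
per_fold = [
    {"Division": 0.85, "Class": 0.70},
    {"Division": 0.89, "Class": 0.78},
    {"Division": 0.87, "Class": 0.74},
]

summary = {
    level: {
        "mean": mean([fold[level] for fold in per_fold]),
        "std": pstdev([fold[level] for fold in per_fold]),
    }
    for level in per_fold[0]
}

for level, stats in summary.items():
    print(level, round(stats["mean"], 2), round(stats["std"], 3))
```
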
---

### `timeseries` - Time Series Utilities

#### Data augmentation

```python
from pwml.timeseries import dataaugmentationhelpers as dah

# Split data before calling prepare_data to avoid scaler leakage
split = int(len(df) * 0.8)  # example: chronological 80/20 split
train_df = df.iloc[:split]
test_df  = df.iloc[split:]

X_train, y_train, index, scaler_in, scaler_out, n_samples = dah.prepare_data(
    data=train_df,
    lags_in=[1, 7],
    cols_in=['feature_a', 'feature_b'],
    steps_in=14,
    cols_out=['target'],
    steps_out=7,
    augmentation_factor=3,
    noise_std=0.05)

# Pass pre-fit scalers for the test set to prevent leakage
X_test, y_test, _, _, _, _ = dah.prepare_data(
    data=test_df,
    lags_in=[1, 7],
    cols_in=['feature_a', 'feature_b'],
    steps_in=14,
    cols_out=['target'],
    steps_out=7,
    scaler_in=scaler_in,
    scaler_out=scaler_out)
```

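The leakage concern and the role of `steps_in` / `steps_out` can be illustrated without the library. The helpers below are hypothetical and use a plain min-max scaler; they do not cover `lags_in`, augmentation, or noise, and `prepare_data` may scale and window differently.

```python
def make_windows(series, steps_in, steps_out):
    """Slice a 1-D series into (input window, output window) pairs."""
    pairs = []
    last_start = len(series) - steps_in - steps_out
    for start in range(last_start + 1):
        x = series[start:start + steps_in]
        y = series[start + steps_in:start + steps_in + steps_out]
        pairs.append((x, y))
    return pairs


def fit_minmax(values):
    """Return the (min, max) range learnt from the given values."""
    return min(values), max(values)


def apply_minmax(values, lo, hi):
    """Scale values with a previously learnt range."""
    return [(v - lo) / (hi - lo) for v in values]


series = list(range(10))              # toy data: 0..9
train, test = series[:7], series[7:]

lo, hi = fit_minmax(train)            # range comes from the training split only
train_scaled = apply_minmax(train, lo, hi)
test_scaled = apply_minmax(test, lo, hi)  # reuses the training range

windows = make_windows(train_scaled, steps_in=3, steps_out=2)
print(len(windows))           # 3
print(test_scaled[0] > 1.0)   # True: unseen values can fall outside [0, 1]
```
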
#### Prophet helpers

```python
from pwml.timeseries import prophethelpers as ph

# Summarise regressor coefficients for a fitted Prophet model
coefs_df = ph.regressor_coefficients(m)

# Plot regressor importance (beta coefficients)
ph.plot_regressors_importance(m, title='Regressor importance')
```

#### Visualization

```python
from pwml.timeseries import visualizationhelpers as vh

vh.plot_time_series(
    title='Forecast',
    training=train_df,
    testing=test_df,
    prediction=forecast_df,
    confidence=forecast_df)

vh.plot_time_series_dist(data=residuals, title='Residual distribution')

vh.plot_seasonal_decomposition(data=series, period=52)

vh.plot_autocorrelation(data=series, lags=50)
```

---

### `utilities`

| Module | Purpose |
|---|---|
| `graphichelpers` | `GraphicsStatics`: matplotlib/seaborn style initialization, color/linestyle palette, `style_plot` |
| `mssqlhelpers` | `execute(proc_name, conn_params, proc_params, commit=True)` - call a stored procedure, returns a DataFrame |
| `neptunehelpers` | `ExperimentManager` - Neptune experiment tracking wrapper (installed via the `neptune` extra) |
| `driftmonitor` | `DriftMonitor` - compute PSI and Jensen-Shannon divergence between reference and live distributions; Neptune integration |
| `httphelpers` | Image download utilities |
| `imagehelpers` | PIL image helpers (resize, crop, batch conversion) |
| `filehelpers` | Pickle serialization helpers |
| `classificationhelpers` | `MulticlassClassifierOptimizer` - Platt calibration + per-class threshold tuning |
| `commonhelpers` | Miscellaneous utilities |

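For readers unfamiliar with the two drift metrics named for `driftmonitor`, the sketch below computes them from already-binned distributions using the textbook definitions. The function names and the `eps` smoothing are assumptions; `DriftMonitor` may bin, smooth, or scale its results differently.

```python
import math


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log), bounded above by ln 2."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


reference = [0.25, 0.25, 0.25, 0.25]
live = [0.10, 0.20, 0.30, 0.40]

print(psi(reference, reference))              # 0.0 for identical distributions
print(psi(reference, live) > 0)               # True
print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # ln 2, about 0.693
```
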
---

### `examples` - Runnable Examples

#### Model Hosting ([`examples/modelhosting.py`](examples/modelhosting.py))

A Flask REST API that serves one or more pre-trained `HierarchicalClassifierModel` instances.

```bash
python examples/modelhosting.py \
    --host 0.0.0.0 \
    --port 5000 \
    --models "v1/division|/path/to/model.joblib"
```

Each loaded model is exposed at `/api/<model-id>` (POST). For production, use a WSGI server such as gunicorn:

```bash
gunicorn -w 4 -b 0.0.0.0:5000 "modelhosting:Statics.g_app"
```

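A client call against the hosted endpoint might look like the sketch below. The JSON request body and the local base URL are assumptions, since the payload format is defined in `examples/modelhosting.py` and may differ. `build_request` is a hypothetical helper; the model id reuses `v1/division` from the command above.

```python
import json
import urllib.request


def build_request(base_url, model_id, features):
    """Build a POST request for the `/api/<model-id>` endpoint.

    Assumption: the server accepts a JSON object of input features.
    """
    url = f"{base_url.rstrip('/')}/api/{model_id}"
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(
    "http://localhost:5000",
    "v1/division",
    {"Style": "slim fit jeans", "Gender": "men", "Brand": "Acme",
     "Price": 49.99, "Category": "Bottoms"},
)
print(request.get_method(), request.full_url)

# Sending is left commented out because it needs a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```
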
#### Streamlit Web App ([`examples/webapp/app.py`](examples/webapp/app.py))

An interactive demo app covering data exploration, batch predictions with confidence heatmaps, per-level accuracy charts, per-node latency profiling, and concept drift monitoring with PSI gauges.

```bash
pip install streamlit
streamlit run examples/webapp/app.py
```

---

## Experiment tracking (Neptune)

```python
from pwml.utilities import neptunehelpers as nh

with nh.ExperimentManager(
        log=True,
        project_name='workspace/project',
        experiment_name='run_001',
        experiment_params={'lr': 0.01, 'epochs': 100},
        experiment_tags=['baseline']) as em:

    em.set_experiment_property('dataset_version', 'v3')
    em.log_data_to_neptune(data=results_df, name='results')
    em.log_chart_to_neptune(figure=fig, name='loss_curve')
```

Requires the `neptune` extra (`pip install "pwml[neptune]"`), which the package metadata pins to `neptune>=3.0,<4`. Set the `NEPTUNE_API_TOKEN` environment variable before running.
