Metadata-Version: 2.4
Name: cv-score-predict
Version: 0.2.2
Summary: Cross-validated ensemble prediction with LGBM, XGBoost, and CatBoost — with safe categorical handling, multi-seed averaging, and artifact return.
Author-email: Danu ANDRIES <danu@andries.lu>
License: MIT
Project-URL: Homepage, https://github.com/Karabush/cv-score-predict
Project-URL: Repository, https://github.com/Karabush/cv-score-predict
Project-URL: Documentation, https://github.com/Karabush/cv-score-predict#readme
Keywords: cross-validation,ensemble learning,model averaging,LightGBM,XGBoost,CatBoost,categorical encoding,OrdinalEncoder,out-of-fold prediction,OOF,multi-seed CV,repeated cross-validation,early stopping,scikit-learn compatible,pandas,machine learning,classification,regression,model validation,kaggle,safe preprocessing,data leakage prevention,boosting ensemble
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21
Requires-Dist: pandas>=1.3
Requires-Dist: scikit-learn>=1.4
Requires-Dist: lightgbm>=3.3
Requires-Dist: xgboost>=1.7
Requires-Dist: catboost>=1.2
Dynamic: license-file

# cv-score-predict

A robust utility for **cross-validated ensemble prediction** that performs per‑fold early stopping and exposes flexible prediction outputs for advanced stacking, diagnostics, or custom ensembling.  
Each fold trains LightGBM, XGBoost, and/or CatBoost with early stopping on its validation split; the resulting estimators generate out-of-fold (OOF) and test predictions with configurable aggregation. The function supports custom preprocessing pipelines, dynamic per-fold categorical encoding, and repeated CV over multiple seeds. When requested, it also returns the trained models along with their fold-specific preprocessors.

Designed for **Kagglers, ML engineers, and data scientists** who need reliable, leakage-free CV with minimal boilerplate.

---

## ✨ Key Features

- **Per‑fold early stopping**: Each fold trains with early stopping on its validation split and uses the early‑stopped estimator for OOF and test predictions.
- **Flexible prediction structures** controlled by `return_raw_test_preds`:
    - **OOF predictions** (`oof_preds_df`): Always one column per **(model, seed)** — predictions from all folds for a given (model, seed) are stitched together into a single complete column.
    - **Test predictions** (`test_preds_df`):
        - *Default (`return_raw_test_preds=False`)*: **Averaged across folds** → one column per **(model, seed)** matching OOF structure. Ideal for direct stacking/blending with OOF predictions.
        - *Raw mode (`return_raw_test_preds=True`)*: **Per-fold predictions** → one column per **(model, seed, fold)**. Preserves fold-level variance for diagnostics or custom aggregation.
- **Multi-model support**: Train LightGBM (`'lgb'`), XGBoost (`'xgb'`), and CatBoost (`'cb'`) in the same CV loop.
- **Safe fold-wise preprocessing**: Accepts any scikit-learn–compatible processor with `fit_transform`/`transform`. Fitted independently per fold to prevent data leakage.
- **Automatic robust categorical handling** (always enabled):
    - Detects object/string/categorical columns **after** the base processor runs,
    - Fits an `OrdinalEncoder` per fold with explicit unseen-category handling:
        - Unseen categories → encoded as `-1`
        - Missing values → encoded as `-1`
        - Training data guaranteed to contain `-1` via `encoded_missing_value=-1`
    - Converts encoded integers to pandas `'category'` dtype for native booster support
    - Automatically sets model-specific flags: `enable_categorical=True` for XGBoost, `cat_features=col_names` for CatBoost. LightGBM auto-detects categories from dtype.
    - **Critical benefit**: Satisfies XGBoost's strict validation (all test categories must exist in training) while handling unseen values gracefully.
- **Repeated CV over seeds**: Accepts a single seed or a list of seeds; CV is repeated for each seed, and all raw predictions are preserved.
- **Custom CV splitter support**: Pass any scikit-learn–compatible splitter (e.g., `GroupKFold`) via `cv_splitter`.
- **Grouped cross-validation support**: Pass `cv_groups` (array/Series of group labels) along with a custom splitter to ensure samples from the same group (e.g., user, time period) stay together. Note: `cv_groups` requires `cv_splitter` to be provided.
- **Flexible scoring and thresholding**: 
    - Custom `scoring_dict` supported (e.g., accuracy, log loss, RMSE).
    - Defaults: ROC AUC for classification, RMSE for regression.
    - For classification, return probabilities (`predict_proba=True`) or binary labels (`predict_proba=False`) using `decision_threshold`.
- **Artifact return**: When `return_trained=True`, returns a list of `(fold_processor, model)` tuples, one per model × fold × seed, where `fold_processor` is the preprocessor fitted on that fold’s training data.
- **Transparent, diagnostic-rich logging**: 
    With `verbose=2` (default), the function prints:
    - Per-fold scores for every model,
    - Stacked (mean of model predictions) score per fold,
    - Per-seed mean scores (by model and stacked),
    - Final cross-seed summary of mean CV performance.
    - → Enables instant diagnosis of model instability, fold bias, or seed sensitivity — no extra code needed.

---

## 📥 Parameters

| Parameter | Type | Default | Description |
|----------|------|--------|-------------|
| `X` | `pd.DataFrame` | — | Training features. |
| `y` | `Union[pd.Series, np.ndarray]` | — | Target values. |
| `X_test` | `Optional[pd.DataFrame]` | `None` | Test set for final prediction. If `None`, no test predictions are returned. |
| `pred_type` | `str` | — | Either `'classification'` or `'regression'` (**required**). |
| `processor` | `Optional[object]` | `None` | Preprocessing pipeline with `fit_transform` and `transform` methods. Must return a `pd.DataFrame` (use `set_output(transform='pandas')`). If `None`, features are passed through unchanged. |
| `models` | `Union[List[str], str]` | `('lgb', 'xgb', 'cb')` | Models to ensemble. Supported: `'lgb'` (LightGBM), `'xgb'` (XGBoost), `'cb'` (CatBoost). |
| `params_dict` | `Optional[Dict[str, dict]]` | `None` | Model-specific hyperparameters. Keys: model names; values: param dicts. |
| `scoring_dict` | `Optional[Dict[str, Callable]]` | `None` | Metrics for evaluation. Keys: metric names; values: scoring functions (e.g., `roc_auc_score`). Defaults: `{'roc_auc': roc_auc_score}` (classification), `{'rmse': rmse_fn}` (regression). |
| `decision_threshold` | `float` | `0.5` | Threshold to convert probabilities to class labels (classification only). |
| `n_splits` | `int` | `5` | Number of cross-validation folds. Ignored if `cv_splitter` is provided. |
| `random_state` | `Union[int, List[int]]` | `42` | Seed(s) for reproducibility. If a list, CV is repeated for each seed; predictions are kept per seed and scores are summarized across seeds. |
| `early_stopping_rounds` | `int` | `50` | Early stopping rounds for boosting models (if not overridden in `params_dict`). |
| `verbose` | `int` | `2` | Logging level: `2` = full per-fold details, `1` = final summary, `0` = silent. |
| `return_trained` | `bool` | `False` | If `True`, returns a list of `(fold_processor, model)` tuples (one per model × fold × seed). |
| `predict_proba` | `bool` | `True` | For classification: if `True`, return probabilities; if `False`, return binary labels (using `decision_threshold`). Ignored for regression. |
| `return_raw_test_preds` | `bool` | `False` | Controls test prediction structure:<br>- `False` (default): Average predictions across folds per (model, seed) → matches OOF structure.<br>- `True`: Return raw per-fold predictions → one column per (model, seed, fold). |
| `cv_splitter` | `Optional[object]` | `None` | Pre-configured CV splitter instance (e.g., `GroupKFold`). If provided, overrides automatic splitter selection and `n_splits`. Must implement `split(X, y, [groups])` method. |
| `cv_groups` | `Optional[Union[np.ndarray, pd.Series, List]]` | `None` | Group labels for grouped cross-validation. Requires `cv_splitter` to be provided. Passed to `splitter.split()` if the splitter accepts groups. |
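
To make the relationship between the two test-prediction layouts concrete, the sketch below collapses raw per-fold columns into per-(model, seed) averages and then applies a threshold. It is an illustration under assumptions, not the package's implementation: the numbers are invented, the column names follow the documented `{model}_seed_{seed}_fold_{fold}` pattern, and whether the library treats the threshold boundary as inclusive is not stated here (the sketch uses `>=`).

```python
import re
import pandas as pd

# Hypothetical raw test predictions, shaped like return_raw_test_preds=True output.
raw = pd.DataFrame({
    "lgb_seed_42_fold_0": [0.2, 0.7],
    "lgb_seed_42_fold_1": [0.4, 0.9],
    "xgb_seed_42_fold_0": [0.1, 0.6],
    "xgb_seed_42_fold_1": [0.3, 0.3],
})

# Group column names by their (model, seed) prefix, dropping the "_fold_k" suffix.
by_model_seed = {}
for col in raw.columns:
    key = re.sub(r"_fold_\d+$", "", col)
    by_model_seed.setdefault(key, []).append(col)

# Average probabilities across folds first ...
averaged = pd.DataFrame(
    {key: raw[cols].mean(axis=1) for key, cols in by_model_seed.items()}
)

# ... then threshold the averaged probabilities (boundary handling is an assumption).
decision_threshold = 0.5
labels = (averaged >= decision_threshold).astype(int)

print(averaged.columns.tolist())  # ['lgb_seed_42', 'xgb_seed_42']
print(labels)
```

Averaging before thresholding matters for the second `xgb` row: one fold predicts 0.6 and the other 0.3, so the averaged probability falls below 0.5 and the label is 0, whereas thresholding each fold first would give conflicting labels.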

---

## 🚀 Installation

```bash
pip install cv-score-predict
```

Requirements:

* Python ≥ 3.9
* Dependencies:
`numpy ≥1.21`, `pandas ≥1.3`, `scikit-learn ≥1.4`, `lightgbm ≥3.3`, `xgboost ≥1.7`, `catboost ≥1.2`

---

## 📌 Basic Usage
```python
import pandas as pd
from cv_score_predict import cv_score_predict

# Simulate data
X = pd.DataFrame({
    "num": [1, 2, 3, 4, 5, 6, 7, 8],
    "cat": ["A", "B", "A", "C", "B", "A", "C", "D"]
})
y = [0, 1, 0, 1, 1, 0, 1, 0]
X_test = pd.DataFrame({"num": [9, 10], "cat": ["B", "E"]})  # 'E' is unseen

# Run CV with 2 seeds → get OOF and *averaged* test predictions
oof_preds_df, test_preds_df, _ = cv_score_predict(
    X=X,
    y=y,
    X_test=X_test,
    pred_type="classification",
    models=["lgb", "xgb"],
    random_state=[42, 123],
    n_splits=2,
    verbose=2,
)

# Analyze prediction structures
print("OOF predictions shape:", oof_preds_df.shape)   # (8, 4) → 2 models × 2 seeds
print("Test predictions shape:", test_preds_df.shape) # (2, 4) → 2 models × 2 seeds (averaged across folds)
print(oof_preds_df.columns.tolist())
# ['lgb_seed_42', 'xgb_seed_42', 'lgb_seed_123', 'xgb_seed_123']
print(test_preds_df.columns.tolist())
# ['lgb_seed_42', 'xgb_seed_42', 'lgb_seed_123', 'xgb_seed_123'] ← matches OOF!

# Simple blend: average across all (model, seed) columns
final_oof = oof_preds_df.mean(axis=1)
final_test = test_preds_df.mean(axis=1)
```
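
Because OOF and averaged test predictions share the same columns, they can also feed a second-level model instead of a plain mean. The following is a rough sketch of that common stacking pattern, not a feature of the package: the `oof` and `test` frames are synthetic stand-ins for `oof_preds_df` and `test_preds_df`, and the choice of `LogisticRegression` as meta-model is an assumption made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y = np.array([0, 1] * 10)
cols = ["lgb_seed_42", "xgb_seed_42"]

# Synthetic stand-ins (hypothetical values): noisy probabilities that loosely track y.
oof = pd.DataFrame(
    {c: np.clip(0.5 + 0.6 * (y - 0.5) + rng.normal(0, 0.1, len(y)), 0, 1) for c in cols}
)
test = pd.DataFrame({c: rng.uniform(0, 1, 5) for c in cols})

# Fit the meta-model on OOF predictions, which each base model produced
# for rows it did not see during training.
meta = LogisticRegression()
meta.fit(oof, y)

# Apply it to the test predictions, using the same column order as in training.
stacked_test = meta.predict_proba(test[oof.columns])[:, 1]
print(stacked_test.shape)  # (5,)
```

A score computed for the meta-model on the same OOF rows it was fitted on will be optimistic; an outer validation split is the usual remedy.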

---

## 🔧 Advanced Usage: Reuse Artifacts for New Data
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.metrics import roc_auc_score, accuracy_score, log_loss
from cv_score_predict import cv_score_predict

# Define a processor that returns a DataFrame
base_processor = make_column_transformer(
    (StandardScaler(), ["num"]),
    remainder="passthrough"
).set_output(transform='pandas')

scoring_dict = {
    "roc_auc": roc_auc_score,
    "accuracy": accuracy_score,
    "log_loss": log_loss,
}
params_dict = {
    "lgb": {"learning_rate": 0.1, "num_leaves": 100},
    "xgb": {"learning_rate": 0.1, "max_depth": 10},
    "cb": {"learning_rate": 0.1, "depth": 8},
}
# Run CV and return artifacts (X, y as in Basic Usage, but with enough rows for 5 folds)
oof_preds_df, _, trained_pipelines = cv_score_predict(
    X, y,
    X_test=None,
    pred_type="classification",
    processor=base_processor,
    models=["lgb", "xgb", "cb"],
    params_dict=params_dict,
    scoring_dict=scoring_dict,
    random_state=[42, 123],
    n_splits=5,
    return_trained=True,
)
# Create new data with unseen category and missing value
X_new = pd.DataFrame({"num": [7, 8], "cat": [None, "Z"]})

# Transform and predict using each trained pipeline
all_new_preds = []
for fold_processor, model in trained_pipelines:
    X_new_proc = fold_processor.transform(X_new)  # Handles None/'Z' → -1 automatically
    pred = model.predict_proba(X_new_proc)[:, 1]
    all_new_preds.append(pred)

# Ensemble by averaging
final_new_pred = np.mean(all_new_preds, axis=0)
```
This gives you a leakage-free stacking pipeline with proper early stopping and categorical handling.

---

## 📝 Notes
* Categorical handling is always active — detection happens after your base processor runs, so processors that create/modify categoricals (e.g., binning) work correctly.
* Column naming conventions:
  - OOF predictions: `{model}_seed_{seed}` (always)
  - Test predictions:
      - Averaged mode (`return_raw_test_preds=False`): `{model}_seed_{seed}` ← matches OOF
      - Raw mode (`return_raw_test_preds=True`): `{model}_seed_{seed}_fold_{fold}`
* Averaging happens before thresholding: Probabilities are averaged across folds first, then thresholded (when `predict_proba=False`). This preserves probability semantics and avoids averaging binary labels.
* Always use `.set_output(transform="pandas")` in sklearn pipelines to preserve column names and dtypes.
* Custom splitters: When `cv_splitter` is provided, it overrides `n_splits`. The splitter is cloned for each seed in `random_state`. Custom splitters must implement a `split(X, y, [groups])` method that yields `(train_idx, val_idx)` tuples. Most scikit-learn splitters are compatible out of the box.
* cv_groups requirement: `cv_groups` must be provided when using a group-based splitter (e.g., `GroupKFold`). If `cv_groups` is provided without `cv_splitter`, a `ValueError` is raised.
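
The splitter contract mentioned in the notes above can be checked with scikit-learn on its own. This is a small sketch with invented data: `GroupKFold.split(X, y, groups)` yields `(train_idx, val_idx)` tuples, and no group label appears on both sides of a fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data (hypothetical): 8 samples belonging to 4 groups of 2.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 1, 0, 1, 0])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])

splitter = GroupKFold(n_splits=2)
folds = list(splitter.split(X, y, groups))

for fold, (train_idx, val_idx) in enumerate(folds):
    train_groups = set(groups[train_idx])
    val_groups = set(groups[val_idx])
    # Samples from the same group never straddle the train/validation boundary.
    assert train_groups.isdisjoint(val_groups)
    print(fold, sorted(train_groups), sorted(val_groups))
```

Following the parameter table, such a splitter and label array would be passed to `cv_score_predict` as `cv_splitter=splitter` and `cv_groups=groups`.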

---

## 📄 License
This project is licensed under the MIT License.
See the LICENSE file for details.
