Metadata-Version: 2.4
Name: surrogate-forest
Version: 0.1.0
Summary: Random Forest with surrogate splits for native missing data handling
Author: Jason Karpeles
License-Expression: MIT
Keywords: random-forest,surrogate-splits,missing-data,machine-learning,scikit-learn
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.21
Requires-Dist: scipy>=1.7
Requires-Dist: scikit-learn>=1.2
Requires-Dist: joblib>=1.1
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pandas>=1.3; extra == "dev"

# surrogate-forest

Random Forest with surrogate splits for native missing data handling.

A Python implementation that combines ideas from CART (surrogate splits), XGBoost (learned NaN directions), LightGBM (histogram binning), and CatBoost (symmetric trees) in a single scikit-learn-compatible library.

## Installation

```bash
pip install -e .
```

## Quick Start

```python
from surrogate_forest import SurrogateRandomForestClassifier
import numpy as np

X = np.random.randn(500, 10)
y = (X[:, 0] > 0).astype(int)  # label from the complete data, before masking
X[np.random.rand(*X.shape) < 0.2] = np.nan  # ~20% missing

clf = SurrogateRandomForestClassifier(n_estimators=100, max_surrogates=5, random_state=42)
clf.fit(X, y)
predictions = clf.predict(X)
probabilities = clf.predict_proba(X)
```

## How It Works

### Missing Data Handling (3-layer fallback)

At each internal node during prediction:

1. **Primary split** — if the split feature is available, use the threshold
2. **Surrogate splits** — if primary is missing, try backup splits on correlated features (first non-missing wins)
3. **Learned direction** — if all surrogates are also missing, go left or right based on the direction that was optimal during training (XGBoost-style)
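
The three layers above can be sketched as a single routing function. This is a minimal illustration, not the library's internals: the `Split` and `Node` classes, the `mirrored` flag, and the "go left when value <= threshold" convention are all assumptions made for the example.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Split:
    """Hypothetical threshold test: go left when x[feature] <= threshold."""
    feature: int
    threshold: float
    mirrored: bool = False  # True flips the direction (negatively correlated surrogate)

    def goes_left(self, x: Sequence[float]) -> bool:
        left = x[self.feature] <= self.threshold
        return (not left) if self.mirrored else left


@dataclass
class Node:
    """Hypothetical internal node holding all three fallback layers."""
    primary: Split
    surrogates: List[Split] = field(default_factory=list)  # ranked best first
    default_left: bool = True  # direction learned during training


def route_left(node: Node, x: Sequence[float]) -> bool:
    """Return True if sample x goes to the left child of node."""
    # Layers 1 and 2: the primary split, then surrogates in rank order.
    for split in [node.primary, *node.surrogates]:
        if not math.isnan(x[split.feature]):
            return split.goes_left(x)
    # Layer 3: every candidate feature is missing.
    return node.default_left


nan = math.nan
node = Node(
    primary=Split(feature=0, threshold=0.5),
    surrogates=[Split(feature=1, threshold=2.0, mirrored=True)],
    default_left=False,
)
print(route_left(node, [0.1, 9.0]))  # True: primary split is available
print(route_left(node, [nan, 9.0]))  # True: mirrored surrogate decides
print(route_left(node, [nan, nan]))  # False: learned default direction
```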

### Surrogate Split Discovery

After choosing the primary split at each node:

1. Determine primary's left/right assignment for all samples
2. For every other feature, find the threshold maximizing concordance with the primary
3. Compute **predictive measure of association**: `λ = (p_naive_error - disagreement) / p_naive_error`
4. Support mirrored surrogates (negatively correlated features)
5. Rank by λ, keep top `max_surrogates`
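
As a rough sketch of steps 2 to 4 for one candidate feature: the code below assumes `p_naive_error` is the error of always sending samples to the primary's majority side (the usual CART definition) and that `disagreement` is the fraction of samples the surrogate routes differently from the primary. The function name `best_surrogate` is hypothetical, and the sketch assumes the candidate feature is fully observed and scans thresholds exhaustively rather than using histograms.

```python
from typing import List, Optional, Tuple


def best_surrogate(
    primary_left: List[bool], values: List[float]
) -> Optional[Tuple[float, bool, float]]:
    """Return (threshold, mirrored, lam) for one candidate feature.

    primary_left[i] is the primary split's direction for sample i;
    values[i] is the candidate feature's value for the same sample.
    """
    n = len(primary_left)
    p_left = sum(primary_left) / n
    naive_error = min(p_left, 1.0 - p_left)  # always guess the majority side
    if naive_error == 0.0:
        return None  # primary sends everything one way; nothing to predict

    best = None
    for threshold in sorted(set(values)):
        agree = sum((v <= threshold) == left for v, left in zip(values, primary_left))
        # A mirrored surrogate sends samples the opposite way, so it agrees
        # exactly where the plain one disagrees.
        for mirrored, hits in ((False, agree), (True, n - agree)):
            disagreement = 1.0 - hits / n
            lam = (naive_error - disagreement) / naive_error
            if best is None or lam > best[2]:
                best = (threshold, mirrored, lam)
    return best


primary_left = [True, True, True, False, False]
values = [1.0, 2.0, 3.0, 4.0, 2.5]
print(best_surrogate(primary_left, values))
print(best_surrogate(primary_left, [-v for v in values]))  # mirrored case
```

Step 5 then amounts to running this over every non-primary feature, sorting the results by `lam` in descending order, and keeping the first `max_surrogates`. A surrogate with `lam <= 0` does no better than the majority-direction guess, so discarding those is a natural choice in this sketch.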

## Estimators

| Estimator | Task | Key Default |
|-----------|------|-------------|
| `SurrogateDecisionTreeClassifier` | Classification | `max_features=None` |
| `SurrogateDecisionTreeRegressor` | Regression | `criterion='squared_error'` |
| `SurrogateRandomForestClassifier` | Classification | `max_features='sqrt'` |
| `SurrogateRandomForestRegressor` | Regression | `max_features='sqrt'` |

## Key Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `max_surrogates` | 5 | Surrogate splits per node |
| `use_histogram` | True | Histogram-based splitting (LightGBM-style) |
| `max_bins` | 255 | Histogram resolution |
| `symmetric_tree` | False | CatBoost-style oblivious trees |
| `max_leaf_nodes` | None | If set, enables best-first (leaf-wise) growth |

All standard sklearn tree/forest parameters are also supported: `max_depth`, `min_samples_split`, `min_samples_leaf`, `max_features`, `n_estimators`, `bootstrap`, `oob_score`, `n_jobs`, etc.

## Feature Importance

Three types of feature importance:

```python
from surrogate_forest import impurity_importance, surrogate_importance, permutation_importance

# fitted_model is an already-fitted estimator, e.g. clf from Quick Start
# Standard impurity-based (MDI)
imp = impurity_importance(fitted_model)

# Surrogate-based (unique to this library)
# Captures feature redundancy/substitutability
surr_imp = surrogate_importance(fitted_model)

# Permutation importance (correctly shuffles NaN patterns)
perm_imp = permutation_importance(fitted_model, X, y, n_repeats=10)
```
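
For intuition, permutation importance with missing values can be sketched in plain Python. This is one plausible reading of "shuffles NaN patterns", not the library's implementation: the whole column is permuted, so NaN entries move together with observed values. The `predict` callable, the `accuracy` helper, and `permutation_importance_sketch` are hypothetical names for this example.

```python
import math
import random
from typing import Callable, List


def accuracy(predict: Callable, X: List[List[float]], y: List[int]) -> float:
    """Fraction of rows whose prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance_sketch(
    predict: Callable, X: List[List[float]], y: List[int],
    n_repeats: int = 10, seed: int = 0,
) -> List[float]:
    """Mean accuracy drop per feature after shuffling that feature's column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # NaN entries are shuffled along with observed values
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def predict(row: List[float]) -> int:
    # Hypothetical stand-in model: uses feature 0 only; NaN > 0 is False, so NaN maps to 0.
    return int(row[0] > 0)


nan = math.nan
X = [[1.0, 5.0], [-1.0, 3.0], [1.0, nan], [-1.0, 7.0],
     [nan, 2.0], [1.0, 8.0], [-1.0, nan], [nan, 4.0]]
y = [predict(row) for row in X]
importances = permutation_importance_sketch(predict, X, y)
print(importances)  # feature 1 is ignored by the model, so its importance is 0.0
```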

## Growth Modes

```python
from surrogate_forest import SurrogateDecisionTreeClassifier

# Depth-first (default CART)
tree = SurrogateDecisionTreeClassifier(max_depth=5)

# Best-first / leaf-wise (LightGBM-style)
tree = SurrogateDecisionTreeClassifier(max_leaf_nodes=31)

# Symmetric / oblivious (CatBoost-style)
tree = SurrogateDecisionTreeClassifier(max_depth=6, symmetric_tree=True)
```

## sklearn Compatibility

The estimators follow the scikit-learn estimator API, so they work with pipelines, cross-validation, and grid search:

```python
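from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV
from surrogate_forest import SurrogateDecisionTreeClassifier, SurrogateRandomForestClassifier

# Works in pipelines
pipe = Pipeline([("clf", SurrogateRandomForestClassifier())])

# Works with cross-validation (clf, X, y as defined in Quick Start)
scores = cross_val_score(clf, X, y, cv=5)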

# Works with grid search
param_grid = {"max_depth": [3, 5, 10], "max_surrogates": [0, 3, 5]}
gs = GridSearchCV(SurrogateDecisionTreeClassifier(), param_grid, cv=3)
```

## Testing

```bash
pytest tests/ -v
```
