Metadata-Version: 2.1
Name: erasplit
Version: 1
Summary: Invariant Gradient Boosted Decision Tree Package - Era Splitting.
Home-page: https://arxiv.org/abs/2309.14496
Download-URL: https://github.com/jefferythewind/erasplit
Maintainer: Tim DeLise
Maintainer-email: tdelise@gmail.com
License: new BSD
Project-URL: Bug Tracker, https://github.com/jefferythewind/erasplit/issues
Project-URL: Documentation, https://github.com/jefferythewind/erasplit
Project-URL: Source Code, https://github.com/jefferythewind/erasplit
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: C
Classifier: Programming Language :: Python
Classifier: Topic :: Software Development
Classifier: Topic :: Scientific/Engineering
Classifier: Development Status :: 5 - Production/Stable
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: COPYING
Requires-Dist: scikit-learn >=1.3.0
Requires-Dist: numpy >=1.17.3
Requires-Dist: scipy >=1.3.2
Requires-Dist: joblib >=1.1.1
Requires-Dist: threadpoolctl >=2.0.0
Provides-Extra: benchmark
Requires-Dist: matplotlib >=3.1.3 ; extra == 'benchmark'
Requires-Dist: pandas >=1.0.5 ; extra == 'benchmark'
Requires-Dist: memory-profiler >=0.57.0 ; extra == 'benchmark'
Provides-Extra: docs
Requires-Dist: matplotlib >=3.1.3 ; extra == 'docs'
Requires-Dist: scikit-image >=0.16.2 ; extra == 'docs'
Requires-Dist: pandas >=1.0.5 ; extra == 'docs'
Requires-Dist: seaborn >=0.9.0 ; extra == 'docs'
Requires-Dist: memory-profiler >=0.57.0 ; extra == 'docs'
Requires-Dist: sphinx >=4.0.1 ; extra == 'docs'
Requires-Dist: sphinx-gallery >=0.7.0 ; extra == 'docs'
Requires-Dist: numpydoc >=1.2.0 ; extra == 'docs'
Requires-Dist: Pillow >=7.1.2 ; extra == 'docs'
Requires-Dist: pooch >=1.6.0 ; extra == 'docs'
Requires-Dist: sphinx-prompt >=1.3.0 ; extra == 'docs'
Requires-Dist: sphinxext-opengraph >=0.4.2 ; extra == 'docs'
Requires-Dist: plotly >=5.10.0 ; extra == 'docs'
Provides-Extra: examples
Requires-Dist: matplotlib >=3.1.3 ; extra == 'examples'
Requires-Dist: scikit-image >=0.16.2 ; extra == 'examples'
Requires-Dist: pandas >=1.0.5 ; extra == 'examples'
Requires-Dist: seaborn >=0.9.0 ; extra == 'examples'
Requires-Dist: pooch >=1.6.0 ; extra == 'examples'
Requires-Dist: plotly >=5.10.0 ; extra == 'examples'
Provides-Extra: tests
Requires-Dist: matplotlib >=3.1.3 ; extra == 'tests'
Requires-Dist: scikit-image >=0.16.2 ; extra == 'tests'
Requires-Dist: pandas >=1.0.5 ; extra == 'tests'
Requires-Dist: pytest >=5.3.1 ; extra == 'tests'
Requires-Dist: pytest-cov >=2.9.0 ; extra == 'tests'
Requires-Dist: flake8 >=3.8.2 ; extra == 'tests'
Requires-Dist: black >=22.3.0 ; extra == 'tests'
Requires-Dist: mypy >=0.961 ; extra == 'tests'
Requires-Dist: pyamg >=4.0.0 ; extra == 'tests'
Requires-Dist: numpydoc >=1.2.0 ; extra == 'tests'
Requires-Dist: pooch >=1.6.0 ; extra == 'tests'

This is the official code base for Era Splitting. Using this repository you can install and run the EraHistGradientBoostingRegressor with the new **era splitting**, **directional era splitting**, or original criterion implemented via simple arguments.

Era Splitting Paper: https://arxiv.org/abs/2309.14496

# Installation

## Clone the Repo

```
git clone --single-branch --branch era_splitting-tiebreaker https://github.com/jefferythewind/scikit-learn-erasplit.git
```

## Install via Pip

```
cd scikit-learn-erasplit/
pip install .
```

# Example Implementation w/ Numerai Data

```python

from pathlib import Path
from numerapi import NumerAPI #pip install numerapi
import json

"""Era Split Model"""
from erasplit.ensemble import EraHistGradientBoostingRegressor

napi = NumerAPI()
Path("./v4").mkdir(parents=False, exist_ok=True)
napi.download_dataset("v4/train.parquet")
napi.download_dataset("v4/features.json")

with open("v4/features.json", "r") as f:
    feature_metadata = json.load(f)
features = feature_metadata["feature_sets"]['small']
TARGET_COL="target_cyrus_v4_20"

training_data = pd.read_parquet('v4/train.parquet')
training_data['era'] = training_data['era'].astype('int')

model = EraHistGradientBoostingRegressor( 
    early_stopping=False, 
    boltzmann_alpha=0, 
    max_iter=5000, 
    max_depth=5, 
    learning_rate=.01, 
    colsample_bytree=.1, 
    max_leaf_nodes=32, 
    gamma=1, #for era splitting
    #blama=1,  #for directional era splitting
    #vanna=1,  #for original splitting criterion
)
model.fit(training_data[ features ], training_data[ TARGET_COL ], training_data['era'].values)
```

## Explanation of Parameters
### Boltzmann Alpha
The Boltzmann alpha parameter varies from -infinity to +infinity. A value of zero recovers the mean, -infinity recovers the minumum and +infinity recovers the maximum. This smooth min/max function is applied to the era-wise impurity scores when evaluating a data split. Negative values here will build more invariant trees.

Read more: https://en.wikipedia.org/wiki/Smooth_maximum

### Gamma
Varies over the interval [0,1]. Indicates weight placed on the  era splitting criterion.

### Blama
Varies over the interval [0,1]. Indicates weight placed on the directional era splitting criterion.

### Vanna
Varies over the interval [0,1]. Indicates weight placed on the original splitting criterion.

Behind the scenes, this is for formula which creates a linear combination of the split criteria. Usually we just set one of these to 1 and leave the other at zero.
```python
gain = gamma * era_split_gain + blama * directional_era_split_gain + vanna * original_gain
```

# Complete (New Updated) Code Notebook Examples Available here:

https://github.com/jefferythewind/era-splitting-notebook-examples

# Citations:

````
@misc{delise2023era,
      title={Era Splitting}, 
      author={Timothy DeLise},
      year={2023},
      eprint={2309.14496},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
````

This code was forked from the official scikit-learn repository and is currently a stand-alone version. All community help is welcome for getting these ideas part of the official scikit learn code base or even better, incorporated in the LightGBM code base.

https://scikit-learn.org/stable/about.html#citing-scikit-learn
