Metadata-Version: 2.0
Name: pyplearnr
Version: 1.0.10
Summary: Pyplearnr is a tool designed to easily and more elegantly build, validate (nested k-fold cross-validation), and test scikit-learn pipelines.
Home-page: http://packages.python.org/pyplearnr
Author: Christopher Shymansky
Author-email: CMShymansky@gmail.com
License: ALv2
Keywords: scikit-learn pipeline k-fold cross-validation model selection
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Topic :: Utilities
Classifier: License :: OSI Approved :: Apache Software License
Requires-Dist: matplotlib
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: sklearn

# What
Pyplearnr is a tool designed to easily and more elegantly build, select, and validate scikit-learn pipelines using nested k-fold cross-validation.

# How
### Use
See the [demo](https://nbviewer.jupyter.org/github/JaggedParadigm/pyplearnr/blob/master/pyplearnr_demo.ipynb) for use of pyplearnr.

### Installation
##### Dependencies

pyplearnr requires:

Python (>= 2.7 or >= 3.3)
scikit-learn (>= 0.18.2)
numpy (>= 1.13.0)
scipy (>= 0.19.1)
pandas (>= 0.20.2)
matplotlib (>= 2.0.2)

For use in Jupyter notebooks and the conda installation, I recommend having nb_conda (>= 2.2.0).

### User installation
Currently, installation is handled by using pip to install from the Github repository. I'm currently working on making this easier. 

For now, from the command line use:

```
pip install git+git://github.com/JaggedParadigm/pyplearnr.git@master
```

For conda, you can issue the same command above or you can include in your environment.yml file this:

```
- pip:
    - git+https://github.com/JaggedParadigm/pyplearnr.git#egg=pyplearnr
```

and then either generate a new environment from the terminal using:

```
conda env create
```

or update an existing one (environment_name) using:

```
conda env update -n=environment_name -f=./environment.yml
```

Another option is to simply clone the respository, link to the location in your code, and import it. 

# bleh

One core aspect of pyplearnr is the combinatorial pipeline schematic, a flexible diagram of every step (e.g. estimator), step option (e.g. knn, logistic regression, etc.), and parameter option (e.g. n_neighbors for knn and C for logistic regression) combination. Any scikit-learn class instance you would use in a normal pipeline can be inserted or one can be chosen from a list of supported ones. 

Here's an example with optional scaling, PCA (directly from the sklearn object), selection of the number of principal components to use, and the use of k-nearest neighbors with different values for the number of neighbors:
```python
pipeline_schematic = [
    {'scaler': {
            'none': {},
            'min_max': {},
            'standard': {}
        }
    },
    {'transform': {
            'pca': {
                'sklo': sklearn.decomposition.PCA,
                'n_components': [feature_count]
            }
        }         
    },
    {'feature_selection': {
            'select_k_best': {
                'k': range(1, feature_count+1)
            }
        }
    },
    {'estimator': {
            'knn': {
                'n_neighbors': range(1,31)
                }
        }
    }
]
```

The core validation method is nested k-fold cross-validation (stratified if for classification). Pyplearnr divides the data into k validation outer-folds and their corresponding training sets into k test inner-folds, picks the best pipeline as that having the highest score (median by default) for the inner-folds for each outer-fold, chooses the winning pipeline as that with the most wins, and uses the validation outer-folds to give an estimate of the ultimate winner's out-of-sample scores. This final pipeline can then be used to make predictions.

# Why
I wanted a way to do what GridSearchCV does for specific estimators with any estimator in a repeatable way.



