Metadata-Version: 2.1
Name: genetic-feature-selection
Version: 0.1.5
Summary: Program for selecting the n best features for your machine learning model.
Author: Magnus P. Nytun
Author-email: magnus.nytun@sb1ostlandet.no
Classifier: Programming Language :: Python :: 3.8

genetic-feature-selection
=========================


This package implements a genetic algorithm used for feature search.
--------------------------------------------------------------------

.. note::

   Note that the package tries to maximize the fitness function. If you want to minimize, for example, MSE, you should multiply it by -1.

Example of use
--------------

.. code::python
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import average_precision_score
    from sklearn.model_selection import train_test_split
    from genetic_feature_selection.genetic_search import GeneticSearch
    from genetic_feature_selection.f_score_generator import FScoreSoftmaxInitPop
    import pandas as pd


    X, y = make_classification(n_samples=1000, n_informative=20, n_redundant=0)
    X = pd.DataFrame(X)
    y = pd.Series(y, name="y", dtype=int)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    X.columns = [f"col_{i}" for i in X.columns]


    gsf = GeneticSearch(
        # the search will do 5 iterations
        iterations = 5, 
        # each generation will have 4 possible solutions
        sol_per_pop = 4, 
        # every iteration will go through 15 generations 
        generations = 15, 
        # in each generation the 4 best individuals will be kept
        keep_n_best_individuals = 4, 
        # we want to find the 5 features that optimize average precision score
        select_n_features = 5,
        # 4 of the parents will be mating, this means the 4 best solutions in
        # each generation will be combined and create the basis for the next
        # generation
        num_parents_mating = 4,
        X_train = X_train,
        y_train = y_train,
        X_test = X_test,
        y_test = y_test,
        clf = LogisticRegression,
        clf_params = dict(max_iter=15),
        probas = True,
        scorer = average_precision_score,
        gen_pop = FScoreSoftmaxInitPop(
            X_train, y_train, tau = 50
        )
    )


    best_cols = gsf.search()
