Metadata-Version: 2.1
Name: ingotdr
Version: 0.0.5
Summary: INGOT-DR (INterpretable GrOup Testing for Drug Resistance)
Home-page: https://github.com/hoomanzabeti/ingotdr
Author: Hooman Zabeti
Author-email: hzabeti@sfu.ca
License: UNKNOWN
Platform: UNKNOWN
Description-Content-Type: text/markdown
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: pulp
Requires-Dist: sklearn

# INGOT-DR

**INGOT-DR** ( **IN**terpretable **G**r**O**up **T**esting for **D**rug **R**esistance) is an interpretable rule-based
predictive model base on **Group Testing** and **Boolean Compressed Sesing**. For more details and citation please see the 
INGOT-DR [paper](#citation). To access scripts used to produce the results in the paper please visit 
[INGOT-DR Project](https://github.com/hoomanzabeti/INGOT_DR_project). To access the data used in the paper
please visit/cite
[M.tuberculosis dataset for drug resistant](#https://github.com/AmirHoseinSafari/M.tuberculosis-dataset-for-drug-resistant).

##Table of content

* [Installation](#installation)
* [Usage](#usage)
  * [Arguments](#arguments)
  * [Methods](#methods)
  * [Training and evaluation](#training-and-evaluation)
  * [Hyper-parameter tuning](#hyper-parameter-tuning)
  * [Optimizing for different target metric](#optimizing-for-different-target-metric)
  * [Choosing the solver](#choosing-the-solver)
 * [Citation](#citation)
## Installation
INGOT-DR can be installed from [PyPI](https://pypi.org/project/ingotdr/). 
```shell
pip install ingotdr
```
## Usage
INGOT-DR is implemented as a [scikit-learn](https://scikit-learn.org/stable/index.html)
classifier. As a result, this classifier is compatible with most of scikit-learn
tools (e.g. cross validation and hyper-parameter tuning tools). In the following section, we provide some usage examples:
### Arguments

```python
ingot.INGOTClassifier( w_weight=1, lambda_p=1, lambda_z=1, lambda_e=1, false_positive_rate_upper_bound=None,
                       false_negative_rate_upper_bound=None, max_rule_size=None, rounding_threshold=1e-5,
                       lp_relaxation=False, only_slack_lp_relaxation=False, lp_rounding_threshold=0,
                       is_it_noiseless=False, solver_name='PULP_CBC_CMD', solver_options=None)
```

|Name|Type|Description|Default|
|:---|:---:|:---|:---:|
|w_weight|vector, float|A vector, float to provide prior weight to _w_.  | 1.0 |
|lambda_p| float| Regularization coefficient for positive labels.|1.0|
|lambda_z| float| Regularization coefficient for negative/zero labels.|1.0|
|lambda_e| float| Regularization coefficient for all slack variables.|1.0|
|false_positive_rate_upper_bound| float| False positive rate (FPR) upper bound.| None|
|false_negative_rate_upper_bound| float| False negative rate(FNR) upper bound.| None|
|max_rule_size| int | Maximum rule size.| None |
|rounding_threshold| float| Threshold for ILP solutions for Rounding to 0 and 1.| 1e-5|
|lp_relaxation| bool | A flag to use the lp relaxed version.| False|
|only_slack_lp_relaxation| bool| A flag to only use the lp relaxed slack variables.| False|
|lp_rounding_threshold| float| Threshold for lp solutions for Rounding to 0 and 1. Range from 0 to 1.| 0.0 |
|is_it_noiseless| bool| A flag to specify whether the problem is noisy or noiseless. |False|
|solver_name| str | Solver's [name](https://coin-or.github.io/pulp/guides/how_to_configure_solvers.html) provided by Pulp. | 'PULP_CBC_CMD' |
|solver_options| dict | Solver's [options](https://coin-or.github.io/pulp/technical/solvers.html) provided by Pulp.| None|

### Methods
|Method|Description|
|---|---|
|`fit(X,y)`|Fit the model with respect to the given data.|
|`get_params_dictionary(variable_type='w')`|Provide a dictionary of individuals with their status obtained by decoder. Type of the variable.e.g. 'w', 'ep' or 'en'|
|`solution()`|Provide a vector of binary features importance. i.e. 1 if feature was used in the model 0 otherwise.|
|`predict(X)`|Provide a predicted labels for X.|
|`score(X,y)`|Provide the accuracy of `self.predict(X)` with respect to `y`|
|`learned_rule(return_type='feature_name')`|Return a list of rules. return_type can be 'feature_name' or 'feature_id'.|
|`write(fileType='mps', **kwargs)`|Create a file from the problem. `fileType` can be 'mps', 'lp', 'json' or 'display'. 'display' shows the ILP/LP problem on screen.|

### Training and evaluation
**Example:**
The following is an example of training a classifier to predict resistance to second line drug _Ciprofloxacin_ in TB isolates. In this example the
feature matrix indicates presence/absence of SNPs in TB isolates, and the label vector represents the drug resistance phenotype.
Sample data is available [here](https://github.com/hoomanzabeti/INGOT_DR_project/tree/master/data).
```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
import pandas as pd
import ingot

feature_matrix = 'SNPsMatrix_ciprofloxacin.csv'
label_vector =  'ciprofloxacinLabel.csv'

X = pd.read_csv(feature_matrix, index_col=0)
y = pd.read_csv(label_vector, index_col=0).to_numpy().ravel()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.2, stratify=y)

clf = ingot.INGOTClassifier(lambda_p=10, lambda_z=0.01, false_positive_rate_upper_bound=0.1,
                            max_rule_size=20, solver_name='CPLEX_PY')
clf.fit(X_train,y_train)

y_pred = clf.predict(X_test)

print("Balanced accuracy: {}".format(balanced_accuracy_score(y_test, y_pred)))
print("Accuracy: {}".format(clf.score(X_test,y_test)))
print("Features in the learned rule: {}".format(clf.learned_rule()))
```
Output:

**Note:** Results may slightly vary for different solvers. Please see [Choosing the solver](#choosing-the-solver).

```shell
Balanced accuracy: 0.8449477351916377
Accuracy: 0.9550561797752809
Features in the learned rule: ['7570, C, T', '7572, T, C', '7581, G, T', '7582, A, C', '7582, A, G']
```

### Hyper-parameter tuning

Hyper-parameter tuning via scikit-learn [Grid Search CV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html):

**Example:**
```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
import pandas as pd
import ingot
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

feature_matrix = 'SNPsMatrix_ciprofloxacin.csv'
label_vector =  'ciprofloxacinLabel.csv'

X = pd.read_csv(feature_matrix, index_col=0)
y = pd.read_csv(label_vector, index_col=0).to_numpy().ravel()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.2, stratify=y)

clf = ingot.INGOTClassifier(false_positive_rate_upper_bound=0.1, max_rule_size=20, solver_name='CPLEX_PY',
                            solver_options={'timeLimit': 1800})

scoring = dict(Accuracy='accuracy', balanced_accuracy=make_scorer(balanced_accuracy_score))
param_grid={'lambda_p': [ 1, 10, 100 ], 'lambda_z': [ 0.01, 0.1, 1 ]}
grid = GridSearchCV(estimator=clf, param_grid= param_grid, scoring=scoring, cv=5, refit ='balanced_accuracy',
                    n_jobs=-1, verbose= 3)
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)

print("Balanced accuracy: {}".format(balanced_accuracy_score(y_test, y_pred)))
print('Best params: {}'.format(grid.best_params_))
```
Output:

```shell
Balanced accuracy: 0.8449477351916377
Best params: {'lambda_p': 10, 'lambda_z': 0.01}
```

### Optimizing for different target metric

**Note:** _w_weight_ and _lambda_e_ are not part of the main ILP ([Eq (11)](#citation)) defined in the INGOT-DR paper. These two variables
 are defined to provide freedom when _Optimizing for different target metric_ ([section 1.4](#citation)) is needed. The 
complete objective function with these two variables would be:

![complete objective function](https://github.com/hoomanzabeti/INGOT_DR_project/blob/master/data/CompleteObjFunc.gif)


**Example:** 
Classifier corresponding to [Eq (16)](#citation) with maximum rule size k=20 and specificity lower bound t= 90% can be defined as following:
```python
clf = ingot.INGOTClassifier(w_weight=0, lambda_z=0, false_positive_rate_upper_bound=0.1, max_rule_size=20,
                            solver_name='CPLEX_PY')
```

The following table shows the combination of arguments needed to define some of ILPs in the [paper](#citation)

|lp_relaxation|only_slack_lp_relaxation|is_it_noiseless|Equation number in the [paper](#citation)|
|---|---|---|:---:|
|False|False|False| Eq (11)|
|False|True|True|Eq (3)|
|False|True|False|Eq (4) with objective function of Eq (11)|
|False|False|True|Eq (3)|
|True|True|False|LP relaxation of Eq (4) with objective function of Eq (11)|
|True|False|False|LP relaxation of Eq (4) with objective function of Eq (11)|
|True|False|True|LP relaxation of Eq (3)|
True|True|True|LP relaxation of Eq (3)|


**Note:** True value of _lp_relaxation_ or _is_it_noiseless_ with override _only_slack_lp_relaxation_. i.e. if one of them is True
then value of _only_slack_lp_relaxation_ is not important.

**Note:** To recreate and work with [Eq (4)](#citation), you only need to use combination in row 3 and use or tune `lambda_e`
instead of `lambda_p` and `lambda_z`. For example:

```python

param_grid={'lambda_e': [0.01, 0.1,  1, 10, 100 ]}
grid = GridSearchCV(estimator=clf, param_grid= param_grid, scoring=scoring, cv=5, refit ='balanced_accuracy',
                    n_jobs=-1, verbose= 3)

```


### Choosing the solver 
INGOT-DR supports a variety of solvers through the [PuLP](https://pypi.org/project/PuLP/) application programming interface (API). 
Solvers such as [GLPK](http://www.gnu.org/software/glpk/glpk.html),
[COIN-OR CLP/CBC](https://github.com/coin-or/Cbc),
[CPLEX](http://www.cplex.com/),
[GUROBI](http://www.gurobi.com/),
[MOSEK](https://www.mosek.com/),
[XPRESS](https://www.fico.com/es/products/fico-xpress-solver),
[CHOCO](https://choco-solver.org/),
[MIPCL](http://mipcl-cpp.appspot.com/),
[SCIP](https://www.scipopt.org/).

List of available solvers on your machine:
```python
import pulp as pl
solver_list = pl.listSolvers(onlyAvailable=True)
```

Name and properties of the solver can be specified via ```solver_name``` and 
```solver_options```. e.g:
```python
clf = ingot.INGOTClassifier(solver_name='CPLEX_PY', solver_options={'timeLimit': 1800})
```
In the [INGOT-DR](#citation) paper, ```'CPLEX_PY'``` is the main solver. Results may slightly vary for different solvers. IBM CPLEX for academic use is available
[here](https://www.ibm.com/academic/technology/data-science). 
## Citation:
For general use please cite our paper: [INGOT-DR: an interpretable classifier forpredicting drug resistance in M. tuberculosis](https://www.biorxiv.org/content/10.1101/2020.05.31.115741v2.full).
([bibtex](https://scholar.googleusercontent.com/scholar.bib?q=info:bQ6FP1AQpvkJ:scholar.google.com/&output=citation&scisdr=CgXpW1OOEIO721E6sjI:AAGBfm0AAAAAYKw_qjKcmF8c1XZV57JWSMoDkwpaXPr8&scisig=AAGBfm0AAAAAYKw_qrApE1nCy1ns_BxQVZG_vrbY2Ot3&scisf=4&ct=citation&cd=-1&hl=en))


