Metadata-Version: 2.1
Name: torchRDS
Version: 0.2
Summary: Reinforced Data Sampling for Model Diversification
Home-page: https://github.com/probeu/RDS
Author: Harry Nguyen
Author-email: harry.nguyen@outlook.com
License: MIT
Download-URL: https://github.com/probeu/RDS/archive/v_02.tar.gz
Keywords: Data-Sampling,Reinforcement-Learning,Machine-Learning
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Build Tools
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Description-Content-Type: text/markdown
Requires-Dist: numpy
Requires-Dist: torch
Requires-Dist: scikit-learn
Requires-Dist: pandas
Requires-Dist: pickle

# RDS

Implementation of [Reinforced Data Sampling for Model Diversification](https://arxiv.org/pdf/2006.07100).

### Requirements

- numpy
- torch
- scikit-learn
- pandas
- tqdm

### Machine Learning Tasks

This repository supports multiple machine learning tasks on multivariate, textual and visual data:

- **Binary Classification**
- **Multi-Class Classification**
- **Regression**

### Real-World Use Cases

Please contact us if you want to be listed here for real-world competitions or use cases.

### Experiment Results

Experiments have been conducted on four datasets as the following.

| Dataset | Task                      | Challenge                             | Size of Data                     | Evaluation | Year |
| :------ | :------------------------ | :------------------------------------ | :------------------------------- | :--------- | ---: |
| MADELON | Binary Classification     | NIPS 2013 Feature Selection Challenge | 2,600 x 500 (multivariate)       | AUC        | 2003 |
| DR      | Regression                | Drug Reviews (Kaggle Hackathon)       | 215,063 x 6 (multivariate, text) | R^2        | 2018 |
| MNIST   | Multiclass Classification | Hand Written Digit Recognition        | 70,000 x 28 x 28 (image)         | Micro-F1   | 1998 |
| KLP     | Binary Classification     | Kalapa Credit Scoring Challenge       | 50,000 x 64 (multivariate, text) | AUC        | 2020 |

#### MADELON - Results

| Sampling   | \#Sample |      | Class Ratio |         |        LR |        RF |       MLP | Ensemble  |  Public   |
| :--------- | -------: | ---: | ----------: | ------: | --------: | --------: | --------: | :-------: | :-------: |
|            |    Train | Test |       Train |    Test |           |           |           |           |           |
| Preset     |     2000 |  600 |     1\.0000 | 1\.0000 |     .6019 | **.8106** |     .5590 |   .6783   |   .9063   |
| Random     |     2000 |  600 |       .9920 | 1\.0270 |     .5742 |     .7729 |     .5774 |   .6453   |   .9002   |
| Stratified |     2000 |  600 |     1\.0000 | 1\.0000 |     .5673 |     .7470 |     .6153 |   .6360   |   .8828   |
| RDS^{DET}  |     2001 |  599 |     1\.0375 |   .9137 | **.6192** |     .8050 | **.6228** | **.6973** |   .8915   |
| RDS^{STO}  |     2021 |  579 |     1\.0010 |   .9966 | **.6192** |     .8050 |     .6050 |   .6947   | **.9106** |

#### DR - Results

| Sampling  |  Train  |  Test  |   Ridge   |    MLP    |    CNN    | Ensemble  |  Public   |
| :-------- | :-----: | :----: | :-------: | :-------: | :-------: | :-------: | :-------: |
| Preset    | 161,297 | 53,766 |   .4580   | **.5787** |   .7282   |   .6660   |   .7637   |
| Random    | 161,297 | 53,766 |   .4597   |   .4179   |   .7353   |   .6485   |   .7503   |
| RDS^{DET} | 162,070 | 52,993 |   .4646   |   .5776   |   .7355   | **.6692** | **.7649** |
| RDS^{STO} | 161,944 | 53,119 | **.4647** |   .5370   | **.7509** |   .6562   |   .7600   |

#### MNIST - Results

| Sampling   | \#Sample |       | Class Ratio |       |    LR     |    RF     |    CNN    | Ensemble  |  Public   |
| :--------- | :------: | :---: | :---------: | :---: | :-------: | :-------: | :-------: | :-------: | :-------: |
|            |  Train   | Test  |    Train    | Test  |           |           |           |           |           |
| Preset     |  60000   | 10000 |    .8571    | .1429 |   .9647   | **.9524** |   .9824   |   .9819   |   .9917   |
| Random     |  59500   | 10500 |    .8500    | .1500 |   .9603   |   .9465   |   .9779   |   .9768   |   .9914   |
| Stratified |  59500   | 10500 |    .8500    | .1500 | **.9625** |   .9510   |   .9795   |   .9792   |   .9901   |
| RDS^{DET}  |  59938   | 10062 |    .8562    | .1438 |   .9495   |   .9382   |   .9757   |   .9769   |   .9927   |
| RDS^{STO}  |  59496   | 10504 |    .8499    | .1501 |   .9583   |   .9486   | **.9851** | **.9830** | **.9931** |

#### KLP - Results

| Sampling   | \#Sample |       | Class Ratio |       |    LR     |    RF     |    MLP    | Ensemble  |  Public   |
| :--------- | :------: | :---: | :---------: | :---: | :-------: | :-------: | :-------: | :-------: | :-------: |
|            |  Train   | Test  |    Train    | Test  |           |           |           |           |           |
| Preset     |  30000   | 20000 |    .0165    | .0186 |   .5799   |   .5517   |   .5635   |   .5723   |   .5953   |
| Simple     |  30000   | 20000 |    .0169    | .0179 |   .5886   |   .5374   |   .5914   |   .5856   |   .6042   |
| Stratified |  30000   | 20000 |    .0173    | .0173 |   .5952   | **.5608** |   .5780   |   .5983   |   .6014   |
| RDS^{DET}  |  29999   | 20001 |    .0180    | .0163 | **.6045** |   .5350   |   .5802   |   .6057   |   .5362   |
| RDS^{STO}  |  30031   | 19969 |    .0172    | .0174 |   .5997   |   .5491   | **.6354** | **.6072** | **.6096** |

### Demos

#### Madelon - Binary Classification

Binary Classification with Deterministic Ensemble

```
python rds.py --data datasets/madelon.csv --target 0 -id MDL_DET --learning deterministic --sampling-ratio 0.7695 --envs models.MDL_RF models.MDL_MLP models.MDL_LR
```

Binary Classification with Stochastic Choice

```
python rds.py --data datasets/madelon.csv --target 0 -id MDL_STO --learning stochastic --sampling-ratio 0.7695 --envs models.MDL_RF models.MDL_MLP models.MDL_LR
```

Evaluating with Public Benchmarking

```
python evaluator.py --data datasets/madelon.csv --target 0 --sample outputs/MDL_DET.npy --task classification --measure auc --envs models.MDL_PS
python evaluator.py --data datasets/madelon.csv --target 0 --sample outputs/MDL_STO.npy --task classification --measure auc --envs models.MDL_PS
```

#### Boston Housing - Regression

Regression with Deterministic Ensemble

```
python rds.py --data datasets/boston.csv --target 0 -id BOS_DET --task regression --learning deterministic --envs models.BOS_MLP models.BOS_Ridge models.BOS_SVM
```

Regression with Stochastic Choice

```
python rds.py --data datasets/boston.csv --target 0 -id BOS_STO --task regression --learning stochastic --envs models.BOS_MLP models.BOS_Ridge models.BOS_SVM
```

Evaluating with Ensemble Benchmarking

```
python evaluator.py --data datasets/boston.csv --target 0 --sample outputs/BOS_DET.npy --task regression --measure auc --envs models.BOS_MLP models.BOS_Ridge models.BOS_SVM
python evaluator.py --data datasets/boston.csv --target 0 --sample outputs/BOS_STO.npy --task regression --measure auc --envs models.BOS_MLP models.BOS_Ridge models.BOS_SVM
```

#### MNIST - Multi-class Classification

Regression with Deterministic Ensemble

```
python rds.py --data-loader datasets.MNIST -id MNIST_DET --task classification --learning deterministic --sampling-ratio 0.8572 --measure f1_micro --envs models.MNIST_CNN models.MNIST_RF models.MNIST_LR
```

Regression with Stochastic Choice

```
python rds.py --data-loader datasets.MNIST -id MNIST_STO --task classification --learning stochastic --sampling-ratio 0.8572 --measure f1_micro --envs models.MNIST_CNN models.MNIST_RF models.MNIST_LR
```

Evaluating with Ensemble Benchmarking

```
python evaluator.py --data-loader datasets.MNIST --sample outputs/MNIST_DET.npy --task classification --measure f1_micro --envs models.MNIST_CNN models.MNIST_RF models.MNIST_LR
python evaluator.py --data-loader datasets.MNIST --sample outputs/MNIST_STO.npy --task classification --measure f1_micro --envs models.MNIST_CNN models.MNIST_RF models.MNIST_LR
```

## Citing this work

Please consider citing us if this work is useful in your research:

```
@misc{nguyen2020reinforced,
    title={Reinforced Data Sampling for Model Diversification},
    author={Hoang D. Nguyen and Xuan-Son Vu and Quoc-Tuan Truong and Duc-Trong Le},
    year={2020},
    eprint={2006.07100},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

## References

- Lee, S., Prakash, S.P.S., Cogswell, M., Ranjan, V., Crandall, D. and Batra, D., 2016. Stochastic multiple choice learning for training diverse deep ensembles. In *Advances in Neural Information Processing Systems* (pp. 2119-2127).
- Peng, M., Zhang, Q., Xing, X., Gui, T., Huang, X., Jiang, Y.G., Ding, K. and Chen, Z., 2019, July. Trainable undersampling for class-imbalance learning. In *Proceedings of the AAAI Conference on Artificial Intelligence* (Vol. 33, pp. 4707-4714).
- Gong, Z., Zhong, P. and Hu, W., 2019. Diversity in machine learning. *IEEE Access*, *7*, pp.64323-64350.



