Metadata-Version: 2.1
Name: pt-datasets
Version: 0.20.13
Summary: Library for loading PyTorch datasets and data loaders.
License: AGPL-3.0 License
Author: Abien Fred Agarap
Author-email: abienfred.agarap@gmail.com
Requires-Python: >=3.11,<4.0
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: gdown (>=5.1.0,<6.0.0)
Requires-Dist: imbalanced-learn (>=0.11.0,<0.12.0)
Requires-Dist: nltk (>=3.8.1,<4.0.0)
Requires-Dist: numba (>=0.58.1,<0.59.0)
Requires-Dist: numpy (<2.0)
Requires-Dist: opencv-python (>=4.7.0.72,<5.0.0.0)
Requires-Dist: opentsne (>=1.0.0,<2.0.0)
Requires-Dist: pymagnitude-lite (>=0.1.143,<0.2.0)
Requires-Dist: scikit-learn (>=1.2.2,<2.0.0)
Requires-Dist: spacy (>=3.7.4,<4.0.0)
Requires-Dist: torch (>=2.2,<3.0)
Requires-Dist: torchvision (>=0.17.0,<0.18.0)
Requires-Dist: umap-learn (>=0.5.3,<0.6.0)
Description-Content-Type: text/markdown

# PyTorch Datasets

[![PyPI version](https://badge.fury.io/py/pt-datasets.svg)](https://badge.fury.io/py/pt-datasets)
[![License: AGPL v3](https://img.shields.io/badge/License-AGPL%20v3-blue.svg)](https://www.gnu.org/licenses/agpl-3.0)
[![Python 3.9](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-3916/)

![](assets/term.png)

## Overview

This repository is meant for easier and faster access to commonly used
benchmark datasets. Using this repository, one can load the datasets in a
ready-to-use fashion for PyTorch models. Additionally, this can be used to load
the low-dimensional features of the aforementioned datasets, encoded using PCA,
t-SNE, or UMAP.

## Datasets

- MNIST
- Fashion-MNIST
- EMNIST-Balanced
- CIFAR10
- SVHN
- MalImg
- AG News
- IMDB
- Yelp
- 20 Newsgroups
- KMNIST
- Wisconsin Diagnostic Breast Cancer
- [COVID19 binary classification](https://github.com/lindawangg/COVID-Net)
- [COVID19 multi-classification](https://github.com/lindawangg/COVID-Net)

_Note on COVID19 datasets: Training models on this is not intended to produce
models for direct clinical diagnosis. Please do not use the model output for
self-diagnosis, and seek help from your local health authorities._

## Usage

It is recommended to use a virtual environment to isolate the project dependencies.

```shell script
$ virtualenv env --python=python3  # we use python 3
$ pip install pt-datasets  # install the package
```

We can then use this package for loading ready-to-use data loaders,

```python
from pt_datasets import load_dataset, create_dataloader

# load the training and test data
train_data, test_data = load_dataset(name="cifar10")

# create a data loader for the training data
train_loader = create_dataloader(
    dataset=train_data, batch_size=64, shuffle=True, num_workers=1
)

...

# use the data loader for training
model.fit(train_loader, epochs=10)
```

We can also encode the dataset features to a lower-dimensional space,

```python
import seaborn as sns
import matplotlib.pyplot as plt
from pt_datasets import load_dataset, encode_features

# load the training and test data
train_data, test_data = load_dataset(name="fashion_mnist")

# get the numpy array of the features
# the encoders can only accept np.ndarray types
train_features = train_data.data.numpy()

# flatten the tensors
train_features = train_features.reshape(
    train_features.shape[0], -1
)

# get the labels
train_labels = train_data.targets.numpy()

# get the class names
classes = train_data.classes

# encode training features using t-SNE
encoded_train_features = encode_features(
    features=train_features,
    seed=1024,
    encoder="tsne"
)

# use seaborn styling
sns.set_style("darkgrid")

# scatter plot each feature w.r.t class
for index in range(len(classes)):
    plt.scatter(
        encoded_train_features[train_labels == index, 0],
        encoded_train_features[train_labels == index, 1],
        label=classes[index],
        edgecolors="black"
    )
plt.legend(loc="upper center", title="Fashion-MNIST classes", ncol=5)
plt.show()
```

![](assets/tsne_fashion_mnist.png)

## Citation

When using the Malware Image classification dataset, kindly use the following
citations,

- BibTex

```
@article{agarap2017towards,
    title={Towards building an intelligent anti-malware system: a deep learning approach using support vector machine (SVM) for malware classification},
    author={Agarap, Abien Fred},
    journal={arXiv preprint arXiv:1801.00318},
    year={2017}
}
```

- MLA

```
Agarap, Abien Fred. "Towards building an intelligent anti-malware system: a
deep learning approach using support vector machine (svm) for malware
classification." arXiv preprint arXiv:1801.00318 (2017).
```

If you use this library, kindly cite it as,

```
@misc{agarap2020pytorch,
    author       = "Abien Fred Agarap",
    title        = "{PyTorch} datasets",
    howpublished = "\url{https://gitlab.com/afagarap/pt-datasets}",
    note         = "Accessed: 20xx-xx-xx"
}
```

## License

```
PyTorch Datasets utility repository
Copyright (C) 2020-2023  Abien Fred Agarap

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published
by the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program.  If not, see <https://www.gnu.org/licenses/>.
```

