Metadata-Version: 2.1
Name: vl-datasets
Version: 0.0.2
Summary: Clean datasets for computer vision.
Home-page: https://github.com/visual-layer/vl-datasets
Author: Visual Layer
Author-email: info@visual-layer.com
License: Apache-2.0
Keywords: machine learning,computer vision,data-centric
Platform: UNKNOWN
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch
Requires-Dist: torchvision
Requires-Dist: pandas



<!-- PROJECT LOGO -->
<br />
<div align="center">

<a href="https://www.visual-layer.com">
  <img alt="Visual Layer Logo" src="https://raw.githubusercontent.com/visual-layer/fastdup/main/gallery/visual_layer_logo.png" width="400">
</a>

<h3 align="center">Open, Clean Datasets for Computer Vision</h3>

  <p align="center">
  <br />
    🔥 We use
    <a href="https://github.com/visual-layer/fastdup">fastdup</a>, a free tool, to clean all datasets shared in this repo.
    <br />
    <a href="https://visual-layer.readme.io/" target="_blank" rel="noopener noreferrer"><strong>Explore the docs »</strong></a>
    <br />
    <a href="https://github.com/visual-layer/vl-datasets/issues" target="_blank" rel="noopener noreferrer">Report Issues</a>
    ·
    <a href="https://medium.com/@amiralush/large-image-datasets-today-are-a-mess-e3ea4c9e8d22" target="_blank" rel="noopener noreferrer">Read Blog</a>
    ·
    <a href="mailto:info@visual-layer.com?subject=Sign-up%20for%20access" target="_blank" rel="noopener noreferrer">Get In Touch</a>
    ·
    <a href="https://visual-layer.com/" target="_blank" rel="noopener noreferrer">About Us</a>
    <br />
    <br /> 
    <a href="https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email" target="_blank" rel="noopener noreferrer">
    <img src="https://img.shields.io/badge/JOIN US ON SLACK-4A154B?style=for-the-badge&logo=slack&logoColor=white" alt="Logo">
    </a>
    <a href="https://visual-layer.readme.io/discuss" target="_blank" rel="noopener noreferrer">
    <img src="https://img.shields.io/badge/Discussion-%20Forum-brightgreen?style=for-the-badge&logo=discourse&logoColor=white" alt="Logo">
    </a>
    <a href="https://www.linkedin.com/company/visual-layer/" target="_blank" rel="noopener noreferrer">
    <img src="https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white" alt="Logo">
    </a>
    <a href="https://twitter.com/visual_layer" target="_blank" rel="noopener noreferrer">
    <img src="https://img.shields.io/badge/Twitter-1DA1F2?style=for-the-badge&logo=twitter&logoColor=white" alt="Logo">
    </a>
    <a href="https://www.youtube.com/@visual-layer4035" target="_blank" rel="noopener noreferrer">
    <img src="https://img.shields.io/badge/-YouTube-black.svg?style=for-the-badge&logo=youtube&colorB=red" alt="Logo">
    </a>
  </p>
</div>

## What?
This repo shares clean versions of publicly available computer vision datasets.

## Why?
Even with the success of generative models, data quality remains a largely overlooked issue.
Training models with erroneous data hurts model accuracy and incurs costs in time, storage, and computational resources.

## How?
In this repo we share clean versions of various computer vision datasets.

The datasets are cleaned using [fastdup](https://github.com/visual-layer/fastdup), a free tool we released.

We hope this effort will also help the community train better models and mitigate various model biases.

Each cleaned image dataset should be free from most, if not all, of the following issues:

+ Duplicates.
+ Broken images.
+ Outliers.
+ Low-information images (dark, bright, or blurry images).
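
To make the first item in the list concrete, here is a minimal sketch of exact-duplicate detection that groups files by a hash of their bytes. It is not how fastdup works and it only catches byte-identical files; the function name `find_exact_duplicates` and the demo file names are hypothetical.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def find_exact_duplicates(root):
    """Group files under `root` by the SHA-256 digest of their bytes.

    Returns a sorted list of groups; each group is a sorted list of
    paths (relative to `root`) that share identical content.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            groups[digest].append(os.path.relpath(path, root))
    # Keep only digests shared by two or more files.
    return sorted(sorted(paths) for paths in groups.values() if len(paths) > 1)


# Tiny demo with stand-in files (hypothetical names, not real images).
with tempfile.TemporaryDirectory() as tmp:
    for name, data in [("a.jpg", b"same"), ("b.jpg", b"same"), ("c.jpg", b"different")]:
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(data)
    duplicate_groups = find_exact_duplicates(tmp)

print(duplicate_groups)
```

A hash comparison like this misses resized or re-encoded copies, which is why a dedicated tool is used for the datasets in this repo.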

## Datasets

Here are some of the datasets we are currently working on. 

| Dataset | Issues |
| -------- | -------- |
| Food-101    | <ul><li>Duplicates - 0.24% (12,345)</li><li>Outliers - 0.85% (456)</li><li>Broken - 0.85% (456)</li><li>Blur - 0.85% (456)</li><li>Dark - 0.85% (456)</li><li>Bright - 0.85% (456)</li></ul> |
| Oxford Pets    | <ul><li>Duplicates - 0.24% (12,345)</li><li>Outliers - 0.85% (456)</li><li>Broken - 0.85% (456)</li><li>Blur - 0.85% (456)</li><li>Dark - 0.85% (456)</li><li>Bright - 0.85% (456)</li></ul> |
| Imagenette   | <ul><li>Duplicates - 0.24% (12,345)</li><li>Outliers - 0.85% (456)</li><li>Broken - 0.85% (456)</li><li>Blur - 0.85% (456)</li><li>Dark - 0.85% (456)</li><li>Bright - 0.85% (456)</li></ul> |
| LAION 1B   | <ul><li>Duplicates - 0.24% (12,345)</li><li>Outliers - 0.85% (456)</li><li>Broken - 0.85% (456)</li><li>Blur - 0.85% (456)</li><li>Dark - 0.85% (456)</li><li>Bright - 0.85% (456)</li></ul> |
| ImageNet-21K   | <ul><li>Duplicates - 0.24% (12,345)</li><li>Outliers - 0.85% (456)</li><li>Broken - 0.85% (456)</li><li>Blur - 0.85% (456)</li><li>Dark - 0.85% (456)</li><li>Bright - 0.85% (456)</li></ul> |
| ImageNet-1K   | <ul><li>Duplicates - 0.24% (12,345)</li><li>Outliers - 0.85% (456)</li><li>Broken - 0.85% (456)</li><li>Blur - 0.85% (456)</li><li>Dark - 0.85% (456)</li><li>Bright - 0.85% (456)</li></ul> |
| KITTI   | <ul><li>Duplicates - 0.24% (12,345)</li><li>Outliers - 0.85% (456)</li><li>Broken - 0.85% (456)</li><li>Blur - 0.85% (456)</li><li>Dark - 0.85% (456)</li><li>Bright - 0.85% (456)</li></ul> |
| DeepFashion   | <ul><li>Duplicates - 0.24% (12,345)</li><li>Outliers - 0.85% (456)</li><li>Broken - 0.85% (456)</li><li>Blur - 0.85% (456)</li><li>Dark - 0.85% (456)</li><li>Bright - 0.85% (456)</li></ul> |
| Places365-standard   | <ul><li>Duplicates - 0.24% (12,345)</li><li>Outliers - 0.85% (456)</li><li>Broken - 0.85% (456)</li><li>Blur - 0.85% (456)</li><li>Dark - 0.85% (456)</li><li>Bright - 0.85% (456)</li></ul> |
| CelebA-HQ   | <ul><li>Duplicates - 0.24% (12,345)</li><li>Outliers - 0.85% (456)</li><li>Broken - 0.85% (456)</li><li>Blur - 0.85% (456)</li><li>Dark - 0.85% (456)</li><li>Bright - 0.85% (456)</li></ul> |
| ADE20K   | <ul><li>Duplicates - 0.24% (12,345)</li><li>Outliers - 0.85% (456)</li><li>Broken - 0.85% (456)</li><li>Blur - 0.85% (456)</li><li>Dark - 0.85% (456)</li><li>Bright - 0.85% (456)</li></ul> |
| COCO   | <ul><li>Duplicates - 0.24% (12,345)</li><li>Outliers - 0.85% (456)</li><li>Broken - 0.85% (456)</li><li>Blur - 0.85% (456)</li><li>Dark - 0.85% (456)</li><li>Bright - 0.85% (456)</li></ul> |



## Getting Started

Install the `vl-datasets` package from PyPI.

```shell
pip install vl-datasets
```

Import the clean version of a dataset.

```python
from vl_datasets import CleanFood101
```

Create the training and validation datasets. Each one can then be passed to a PyTorch `DataLoader`.

```python
# train_transform and valid_transform are torchvision transforms you define beforehand.
train_dataset = CleanFood101('./', split='train', exclude_csv='food_101_vl-datasets_analysis.csv', transform=train_transform)
valid_dataset = CleanFood101('./', split='test', exclude_csv='food_101_vl-datasets_analysis.csv', transform=valid_transform)
```
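
For intuition about what an exclusion list does conceptually, the sketch below filters a list of image paths against a CSV of flagged files. It is not the `vl_datasets` implementation: the `filename` column, the sample paths, and the helpers `load_excluded` and `filter_clean` are assumptions made for illustration, and the real analysis CSV may be laid out differently.

```python
import csv
import io


def load_excluded(csv_file):
    """Return the set of file names found in the `filename` column."""
    reader = csv.DictReader(csv_file)
    return {row["filename"] for row in reader}


def filter_clean(image_paths, excluded):
    """Keep only the paths that are not in the exclusion set, preserving order."""
    return [path for path in image_paths if path not in excluded]


# Hypothetical CSV content and image paths, used only for this demo.
sample_csv = io.StringIO("filename\napple_pie/001.jpg\nsushi/042.jpg\n")
all_images = [
    "apple_pie/001.jpg",
    "apple_pie/002.jpg",
    "sushi/042.jpg",
    "sushi/043.jpg",
]

excluded = load_excluded(sample_csv)
clean_images = filter_clean(all_images, excluded)
print(clean_images)
```

Holding the flagged names in a set keeps each membership check cheap even when the exclusion list is long.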

Now you can use the dataset in a PyTorch training loop. Refer to our sample training notebooks for details.

Sample training notebooks:
+ [PyTorch](./notebooks/train-clean-pytorch.ipynb)
+ [fast.ai](./notebooks/train-fastai.ipynb)


<!-- ### Clean-ImageNet-21K
In the [original ImageNet-21K](https://www.image-net.org/) dataset, we find that up to 15.9% of the images are problematic. Among those are 1.2M redundant duplicates and 104K train-validation leaks.

To use the Clean-ImageNet-21K dataset, you must download the original ImageNet-21K dataset here and run the `analyze.py` script to obtain the list of problematic images. We recommend running the script on a machine with a minimum of 64 CPU cores and 128GB of RAM.

Alternatively, you can get the list of problematic images by signing up [here](https://forms.gle/khZpAGUQJeqgRwwo7).

### Clean-LAION-400M
In the [original LAION-400M dataset](https://laion.ai/blog/laion-400-open-dataset/), we find 10.3M missing images (stale URLs) and 1.63M corrupted images. Common corruptions include over 772k images having format issues and not loading, 443k images smaller than 10x10 pixels, and over 300k images that are 'File not found' placeholders.

To use the Clean-LAION-400M dataset, you must download the original LAION-400M dataset and run the `analyze.py` script to obtain the list of problematic images. We recommend running the script on a machine with a minimum of 64 CPU cores and 128GB of RAM.

Alternatively, you can get the list of problematic images by signing up [here](https://forms.gle/khZpAGUQJeqgRwwo7). -->

<!-- ## Scripts
We provide convenience functions to help you move or delete problematic files. The files are specified in a `.csv` file.

A sample content of the `.csv` file is as follows:
```
filename
buildings/0.jpg
buildings/4.jpg
```

> **Warning**: Proceed with caution. The following operations may be irreversible. Back up your data before proceeding.

Move problematic images to a destination folder:

```shell
python scripts/move-images.py --file_paths_csv problem_images.csv --images_dir sample_images --dest_folder_name problematic_images
```

Delete problematic images:
```shell
python scripts/delete-images.py --file_paths_csv problem_images.csv --images_dir sample_images/
``` -->


## Disclaimer
You are bound by the usage license of the original dataset. It is your responsibility to determine whether you have permission to use the dataset under its license. We provide no warranty or guarantee of accuracy or completeness.


