Metadata-Version: 2.4
Name: cleanvision
Version: 0.3.7
Summary: Find issues in image datasets
Author-email: "Cleanlab Inc." <team@cleanlab.ai>
License-Expression: Apache-2.0
Project-URL: Source, https://github.com/cleanlab/cleanvision
Project-URL: Bug Tracker, https://github.com/cleanlab/cleanvision/issues
Project-URL: Documentation, https://cleanvision.readthedocs.io/
Keywords: computer_vision,cv,image_data,issue_detection,data_quality,image_quality,machine_learning,data_cleaning,image_deduplication
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python
Classifier: Topic :: Software Development
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Mathematics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: Pillow>=9.3
Requires-Dist: numpy>=1.20.0
Requires-Dist: pandas>=1.1.5
Requires-Dist: tabulate>=0.8.3
Requires-Dist: imagehash>=4.2.0
Requires-Dist: tqdm>=4.53.0
Requires-Dist: matplotlib>=3.4
Requires-Dist: fsspec>=2023.1.0
Provides-Extra: huggingface
Requires-Dist: datasets>=2.15.0; python_version > "3.7" and extra == "huggingface"
Requires-Dist: datasets>=2.7.0; python_version < "3.8" and extra == "huggingface"
Provides-Extra: pytorch
Requires-Dist: torchvision>=0.12.0; extra == "pytorch"
Provides-Extra: azure
Requires-Dist: adlfs>=2022.2.0; extra == "azure"
Provides-Extra: gcs
Requires-Dist: gcsfs>=2022.1.0; extra == "gcs"
Provides-Extra: s3
Requires-Dist: s3fs>=2023.1.0; extra == "s3"
Provides-Extra: all
Requires-Dist: cleanvision[azure,gcs,huggingface,pytorch,s3]; extra == "all"
Dynamic: license-file

<p align="center">
  <img src="https://raw.githubusercontent.com/cleanlab/assets/master/cleanlab/cleanvision_logo_open_source_transparent.png" width=50% height=50%>
</p>

<img width="1200" alt="Screen Shot 2023-03-10 at 10 23 33 AM" src="https://user-images.githubusercontent.com/10901697/224394144-bb0e1c85-6851-4828-bcd2-4ed234270a78.png">

CleanVision automatically detects potential issues in image datasets, such as images that are blurry, under- or over-exposed, or (near) duplicates.
This data-centric AI package is a quick first step for any computer vision project: it finds problems in the dataset that you will want to address before applying machine learning.
CleanVision is simple to use: the same few lines of Python code audit any image dataset!

[![Read the Docs](https://readthedocs.org/projects/cleanvision/badge/?version=latest)](https://cleanvision.readthedocs.io/en/latest/)
[![pypi](https://img.shields.io/pypi/v/cleanvision?color=blue)](https://pypi.org/pypi/cleanvision/)
[![os](https://img.shields.io/badge/platform-noarch-lightgrey)](https://pypi.org/pypi/cleanvision/)
[![py\_versions](https://img.shields.io/badge/python-3.10%2B-blue)](https://pypi.org/pypi/cleanvision/)
[![codecov](https://codecov.io/github/cleanlab/cleanvision/branch/main/graph/badge.svg?token=y1N6MluN9H)](https://codecov.io/gh/cleanlab/cleanvision)

## Installation
```shell
pip install cleanvision
```
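
The package metadata also declares optional extras (`huggingface`, `pytorch`, `azure`, `gcs`, `s3`, and `all`), which can be requested with, for example, `pip install "cleanvision[huggingface]"`. The sketch below is one way to check which of the corresponding optional dependencies are importable in the current environment. The mapping from extra name to module name follows the `Requires-Dist` entries; the helper name `available_extras` is made up for this illustration and is not part of CleanVision.

```python
import importlib.util

# Extra name -> top-level module it installs, taken from the Requires-Dist entries.
OPTIONAL_BACKENDS = {
    "huggingface": "datasets",
    "pytorch": "torchvision",
    "azure": "adlfs",
    "gcs": "gcsfs",
    "s3": "s3fs",
}


def available_extras():
    """Return {extra_name: bool} telling whether each optional module can be imported."""
    status = {}
    for extra, module in OPTIONAL_BACKENDS.items():
        try:
            status[extra] = importlib.util.find_spec(module) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or half-initialised installs.
            status[extra] = False
    return status


for extra, installed in available_extras().items():
    print(f"{extra}: {'installed' if installed else 'missing'}")
```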

## Quickstart

Optionally, download an example dataset, or use any collection of image files you already have.

```shell
wget -nc 'https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip'
```
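
The download above is a zip archive, while the quickstart points `Imagelab` at a folder, so the archive presumably needs to be unpacked first. A minimal sketch using only the standard library follows. The function name `extract_images`, the target folder name, and the extension list are assumptions made for illustration, not part of CleanVision.

```python
import zipfile
from pathlib import Path

# Illustrative, non-exhaustive list of image file extensions.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def extract_images(archive_path, target_dir):
    """Unpack a zip archive into target_dir and return the sorted image paths found there."""
    target = Path(target_dir)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    return sorted(
        path
        for path in target.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


# Only runs if the example archive has already been downloaded.
if Path("image_files.zip").exists():
    images = extract_images("image_files.zip", "image_files_extracted")
    print(f"Extracted {len(images)} image files")
```

The folder passed as `target_dir` is then a candidate value for `data_path` in the steps below.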

1. Run CleanVision to audit the images.

```python
from cleanvision import Imagelab

# Specify path to folder containing the image files in your dataset
imagelab = Imagelab(data_path="FOLDER_WITH_IMAGES/")

# Automatically check for a predefined list of issues within your dataset
imagelab.find_issues()

# Produce a neat report of the issues found in your dataset
imagelab.report()
```

2. CleanVision diagnoses many types of issues, but you can also check for only specific issues.

```python
issue_types = {"dark": {}, "blurry": {}}

imagelab.find_issues(issue_types=issue_types)

# Produce a report with only the specified issue_types
imagelab.report(issue_types=issue_types)
```
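
The `issue_types` argument in step 2 is a dict keyed by issue key, with an empty dict as each value in the example above. As a small sketch, the helper below builds such a dict from a list of keys and rejects typos early. The set of valid keys is copied from the "Issue Key" column of the issue table in this README; the helper `build_issue_types` is hypothetical and not part of the CleanVision API.

```python
# Issue keys copied from the "Issue Key" column of the issue table in this README.
KNOWN_ISSUE_KEYS = {
    "exact_duplicates",
    "near_duplicates",
    "blurry",
    "low_information",
    "dark",
    "light",
    "grayscale",
    "odd_aspect_ratio",
    "odd_size",
}


def build_issue_types(keys):
    """Return {key: {}} for each requested key, raising ValueError on unknown keys."""
    unknown = sorted(set(keys) - KNOWN_ISSUE_KEYS)
    if unknown:
        raise ValueError(f"Unknown issue keys: {unknown}")
    return {key: {} for key in keys}


issue_types = build_issue_types(["dark", "blurry"])
print(issue_types)
# The result has the same shape as the dict passed to
# imagelab.find_issues(issue_types=issue_types) in step 2.
```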


## More resources

- [Tutorial](https://cleanvision.readthedocs.io/en/latest/tutorials/tutorial.html)
- [Documentation](https://cleanvision.readthedocs.io/)
- [Blog](https://cleanlab.ai/blog/cleanvision/)
- [Run CleanVision on a HuggingFace dataset](https://cleanvision.readthedocs.io/en/latest/tutorials/huggingface_dataset.html)
- [Run CleanVision on a Torchvision dataset](https://cleanvision.readthedocs.io/en/latest/tutorials/torchvision_dataset.html)
- [Example script](https://github.com/cleanlab/cleanvision/blob/main/docs/source/tutorials/run.py) that can be run from the repository root with: `python docs/source/tutorials/run.py --path <FOLDER_WITH_IMAGES>`
- [Additional example notebooks](https://github.com/cleanlab/cleanvision-examples)
- [FAQ](https://cleanvision.readthedocs.io/en/latest/faq.html)

## *Clean* your data for better Computer *Vision*

The quality of machine learning models hinges on the quality of the data used to train them, but it is hard to manually identify all of the low-quality data in a big dataset. CleanVision helps you automatically identify common types of data issues lurking in image datasets.

This package currently detects issues in the raw images themselves, making it a useful tool for any computer vision
task such as: classification, segmentation, object detection, pose estimation, keypoint detection, [generative modeling](https://openai.com/research/dall-e-2-pre-training-mitigations), etc.
To detect issues in the labels of your image data, you can instead
use the [cleanlab](https://github.com/cleanlab/cleanlab/) package.

In any collection of image files (most [formats](https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html) supported), CleanVision can detect the following types of issues:

|   | Issue Type       | Description                                                     | Issue Key        | Example                                                                                                                                 |
|---|------------------|-----------------------------------------------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| 1 | Exact Duplicates | Images that are identical to each other                         | exact_duplicates | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/exact_duplicates.png)                     |
| 2 | Near Duplicates  | Images that are visually almost identical                       | near_duplicates  | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/near_duplicates.png)                      |
| 3 | Blurry           | Images where details are fuzzy (out of focus)                   | blurry           | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/blurry.png)                               |
| 4 | Low Information  | Images lacking content (little entropy in pixel values)         | low_information  | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/low_information.png)                      |
| 5 | Dark             | Irregularly dark images (*under*exposed)                        | dark             | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/dark.jpg)                                 |
| 6 | Light            | Irregularly bright images (*over*exposed)                       | light            | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/light.jpg)                                |
| 7 | Grayscale        | Images lacking color                                            | grayscale        | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/grayscale.jpg)                            |
| 8 | Odd Aspect Ratio | Images with an unusual aspect ratio (overly skinny/wide)        | odd_aspect_ratio | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/odd_aspect_ratio.jpg)                     |
| 9 | Odd Size         | Images that are abnormally large or small compared to the rest of the dataset | odd_size         | <img src="https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/odd_size.png" width=20% height=20%> |

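To make a few of the descriptions in the table concrete, here is a rough standard-library sketch of the ideas behind exact duplicates, low information, and odd aspect ratio. These are simplified illustrations, not CleanVision's actual implementations: the function names, the choice of SHA-256, and the scoring formulas are all assumptions.

```python
import hashlib
import math
from collections import Counter, defaultdict


def group_exact_duplicates(files):
    """Group names whose raw bytes are identical. `files` maps name -> bytes."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def pixel_entropy(pixels):
    """Shannon entropy in bits of a flat, non-empty sequence of pixel values."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def aspect_ratio_score(width, height):
    """Return a value in (0, 1]: 1.0 is square, values near 0 are very skinny or wide."""
    return min(width / height, height / width)


# Two byte-identical files form one duplicate group.
print(group_exact_duplicates({"a.png": b"x", "b.png": b"x", "c.png": b"y"}))
# A single-colour image carries no information; two equally common values give 1 bit.
print(pixel_entropy([0] * 8), pixel_entropy([0, 255] * 4))
# A 100x400 image is four times taller than it is wide.
print(aspect_ratio_score(100, 400))
```
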
CleanVision supports Linux, macOS, and Windows and runs on Python 3.10+. Learn more from our [blog](https://cleanlab.ai/blog/cleanvision/).

## Community

* Interested in contributing? See the [contributing guide](CONTRIBUTING.md). An easy starting point is to
  consider [issues](https://github.com/cleanlab/cleanvision/labels/good%20first%20issue) marked `good first issue`.

* Ready to start adding your own code? See the [development guide](DEVELOPMENT.md).

* Have an issue? [Search existing issues](https://github.com/cleanlab/cleanvision/issues?q=is%3Aissue)
  or [submit a new issue](https://github.com/cleanlab/cleanvision/issues/new/choose).


[issue]: https://github.com/cleanlab/cleanvision/issues/new
