Metadata-Version: 2.1
Name: dreamsim
Version: 0.1.3
Summary: DreamSim similarity metric
Home-page: https://github.com/ssundaram21/dreamsim
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: open-clip-torch
Requires-Dist: peft (==0.1.0)
Requires-Dist: Pillow
Requires-Dist: torch
Requires-Dist: timm
Requires-Dist: scipy
Requires-Dist: torchvision
Requires-Dist: transformers

<!-- # ![icon](images/figs/icon.png)  DreamSim Perceptual Metric -->
<!-- # DreamSim Perceptual Metric <img src="images/figs/icon.png" align="left" width="50px"/>  -->
# DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data
### [Project Page](https://dreamsim-nights.github.io/) | [Paper](https://arxiv.org/abs/2306.09344) | [Bibtex](#bibtex)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1taEOMzFE9g81D9AwH27Uhy2U82tQGAVI?usp=sharing)

[Stephanie Fu](https://stephanie-fu.github.io)\* $^{1}$, [Netanel Tamir](https://netanel-tamir.github.io)\* $^{2}$, [Shobhita Sundaram](https://ssundaram21.github.io)\* $^{1}$, [Lucy Chai](https://people.csail.mit.edu/lrchai/) $^1$, [Richard Zhang](http://richzhang.github.io) $^3$, [Tali Dekel](https://www.weizmann.ac.il/math/dekel/) $^2$, [Phillip Isola](https://web.mit.edu/phillipi/) $^1$. (*equal contribution)<br>
$^1$ MIT, $^2$ Weizmann Institute of Science, $^3$ Adobe Research.



**Summary**

Current metrics for perceptual image similarity operate at the level of pixels and patches. These metrics compare images in terms of their low-level colors and textures, but fail to capture mid-level differences in layout, pose, semantic content, etc. Models that use image-level embeddings such as DINO and CLIP capture high-level and semantic judgements, but may not be aligned with human perception of more finegrained attributes.

DreamSim is a new metric for perceptual image similarity that bridges the gap between "low-level" metrics (e.g. LPIPS, PSNR, SSIM) and "high-level" measures (e.g. CLIP). Our model was trained by concatenating CLIP, OpenCLIP, and DINO embeddings, and then finetuning on human perceptual judgements. We gathered these judgements on a dataset of ~20k image triplets, generated by diffusion models. Our model achieves better alignment with human similarity judgements than existing metrics, and can be used for downstream applications such as image retrieval.

## Requirements
- Linux
- Python 3
- NVIDIA GPU + CUDA CuDNN (DreamSim is only supported on CUDA devices)

## Setup

**Option 1:** Install using pip: 

```pip install dreamsim```

The package is used for importing and using the DreamSim model.

**Option 2:** Clone our repo and install dependencies.
This is necessary for running our training/evaluation scripts.

```
python3 -m venv ds
source ds/bin/activate
pip install -r requirements.txt
export PYTHONPATH="$PYTHONPATH:$(realpath ./dreamsim)"
```
To install with conda:
```
conda create -n ds
conda activate ds
conda install pip # verify with the `which pip` command
pip install -r requirements.txt
export PYTHONPATH="$PYTHONPATH:$(realpath ./dreamsim)"
```

## Usage
**For walk-through examples of the below use-cases, check out our [Colab demo](https://colab.research.google.com/drive/1taEOMzFE9g81D9AwH27Uhy2U82tQGAVI?usp=sharing).**

### Quickstart: Perceptual similarity metric
The basic use case is to measure the perceptual distance between two images. **A higher score means more different, lower means more similar**. 

The following code snippet is all you need. The first time that you run `dreamsim` it will automatically download the model weights. The default model settings are specified in `./dreamsim/config.py`.
```
from dreamsim import dreamsim
from PIL import Image

model, preprocess = dreamsim(pretrained=True)

img1 = preprocess(Image.open("img1_path")).to("cuda")
img2 = preprocess(Image.open("img2_path")).to("cuda")
distance = model(img1, img2) # The model takes an RGB image from [0, 1], size batch_sizex3x224x224
```

To run on example images, run `demo.py`. The script should produce distances (0.424, 0.34). 

### (new!) Single-branch models
By default, DreamSim uses an ensemble of CLIP, DINO, and OpenCLIP (all ViT-B/16). If you need a lighter-weight model you can use *single-branch* versions of DreamSim where only a single backbone is finetuned. The available options are OpenCLIP-ViTB/32, DINO-ViTB/16, CLIP-ViTB/32, in order of performance. 

To load a single-branch model, use the `dreamsim_type` argument. For example:
```
dreamsim_dino_model, preprocess = dreamsim(pretrained=True, dreamsim_type="dino_vitb16")
```

### Feature extraction
To extract a *single image embedding* using dreamsim, use the `embed` method as shown in the following snippet:
```
img1 = preprocess(Image.open("img1_path")).to("cuda")
embedding = model.embed(img1)
```
The perceptual distance between two images is the cosine distance between their embeddings. If the embeddings are normalized (true by default) L2 distance can also be used.


### Image retrieval
Our model can be used for image retrieval, and plugged into existing such pipelines. The code below ranks a dataset of images based on their similarity to a given query image. 

To speed things up, instead of directly calling `model(query, image)` for each pair, we use the `model.embed(image)` method to pre-compute single-image embeddings, and then take the cosine distance between embedding pairs.
```
import pandas as pd
from tqdm import tqdm
import torch.nn.functional as F

# let query be a sample image.
# let images be a list of images we are searching.

# Compute the query image embedding
query_embed = model.embed(preprocess(query).to("cuda"))
dists = {}

# Compute the (cosine) distance between the query and each search image
for i, im in tqdm(enumerate(images), total=len(images)):
   img_embed = model.embed(preprocess(im).to("cuda"))
   dists[i] = (1 - F.cosine_similarity(query_embed, img_embed, dim=-1)).item()

# Return results sorted by distance
df = pd.DataFrame({"ids": list(dists.keys()), "dists": list(dists.values())})
return df.sort_values(by="dists")
```

### Perceptual loss function
Our model can be used as a loss function for iterative optimization (similarly to the LPIPS metric). These are the key lines; for the full example, refer to the [Colab](https://colab.research.google.com/drive/1taEOMzFE9g81D9AwH27Uhy2U82tQGAVI?usp=sharing).
```
for i in range(n_iters):
    dist = model(predicted_image, reference_image)
    dist.backward()
    optimizer.step()
```


<a name="bibtex"></a>
## Citation

If you find our work or any of our materials useful, please cite our paper:
```
@misc{fu2023dreamsim,
      title={DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data}, 
      author={Stephanie Fu and Netanel Tamir and Shobhita Sundaram and Lucy Chai and Richard Zhang and Tali Dekel and Phillip Isola},
      year={2023},
      eprint={2306.09344},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## Acknowledgements
Our code borrows from the ["Deep ViT Features as Dense Visual Descriptors"](https://dino-vit-features.github.io/) repository for ViT feature extraction, and takes inspiration from the [UniverSeg](https://github.com/JJGO/UniverSeg) respository for code structure.

