Metadata-Version: 2.1
Name: clip_cam
Version: 0.1.5
Summary: A package for visualizing the prompt-image feature matching in ViT-based CLIP models, highlighting the alignment between image features and textual prompts.
Home-page: https://github.com/adityagandhamal/clip_cam
Author: Aditya Gandhamal, Aniruddh Sikdar
Author-email: 
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch
Requires-Dist: numpy
Requires-Dist: einops
Requires-Dist: opencv-python
Requires-Dist: matplotlib
Requires-Dist: Pillow
Requires-Dist: open_clip_torch==2.29.0

# clip_cam

`clip_cam` is a Python package for visualizing **image-prompt feature matching** in ViT-based CLIP models. It shows how CLIP relates an image to a text description by highlighting the image regions whose features align with the prompt.

## 🚀 Features

- Generate Grad-CAM-style heatmaps for image-text matching.
- Support for Vision Transformer (ViT) architectures.
- Easy integration with existing CLIP implementations.
- Custom checkpoint support for fine-tuned models.

## 📦 Installation

You can install `clip_cam` via pip:

```bash
pip install clip_cam
```

## 🔥 Usage

Run the following command to generate a visualization:

```bash
python clip_cam.py --model_name "ViT-B/16" --image_path "path/to/image.jpg" --text "your text prompt"
```

### Arguments:
- `--image_path`: Path to the input image.
- `--text`: Text input/prompt.
- `--model_name`: CLIP model name (default: `ViT-B/16`).
- `--checkpoint`: (Optional) Path to a fine-tuned CLIP model checkpoint.
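
The flags above map naturally onto an `argparse` parser. The following is a minimal sketch of such a parser, not the package's actual source: the function name `build_parser` is hypothetical, and treating `--image_path` and `--text` as required is an assumption, since the list above does not say which arguments are mandatory.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Visualize image-text feature matching for a ViT-based CLIP model."
    )
    # Assumption: the image and the prompt are mandatory inputs.
    parser.add_argument("--image_path", required=True, help="Path to the input image.")
    parser.add_argument("--text", required=True, help="Text input/prompt.")
    # The default model name is taken from the argument list above.
    parser.add_argument("--model_name", default="ViT-B/16", help="CLIP model name.")
    # Optional: no checkpoint means the stock pretrained weights would be used.
    parser.add_argument("--checkpoint", default=None,
                        help="Path to a fine-tuned CLIP model checkpoint.")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--image_path", "cat.jpg", "--text", "a cute kitten"])
print(args.model_name)  # ViT-B/16
```

Omitting `--model_name` and `--checkpoint` falls back to the documented default model and no custom checkpoint.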

### Example:
```bash
python clip_cam.py --model_name "ViT-B/16" --image_path "cat.jpg" --text "a cute kitten"
```

## 🛠️ How It Works

1. **Model Loading**: Loads the specified CLIP model, optionally with a fine-tuned checkpoint.
2. **Feature Extraction**:
   - Extracts dense visual features from the image.
   - Encodes the text prompt and normalizes the embeddings.
3. **Matching & Visualization**:
   - Computes image-text matching scores.
   - Resizes the matching map using bilinear interpolation.
   - Renders the result as a heatmap of image-text matching.
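
The matching and resizing steps above can be sketched in a few lines. This is an illustrative NumPy version under stated assumptions, not the package's implementation: the function names (`matching_map`, `resize_bilinear`, `to_heatmap`) are hypothetical, cosine similarity is assumed as the matching score, and random arrays stand in for the dense patch features and text embedding that a real CLIP model would produce. In a PyTorch pipeline, the resize would more likely use `torch.nn.functional.interpolate(..., mode="bilinear")`.

```python
import numpy as np


def matching_map(patch_feats: np.ndarray, text_feat: np.ndarray, grid: tuple) -> np.ndarray:
    """Cosine similarity between each patch feature and the text embedding, as a 2-D grid."""
    # L2-normalize both sides so the dot product equals cosine similarity.
    patches = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    text = text_feat / np.linalg.norm(text_feat)
    scores = patches @ text  # shape: (num_patches,)
    return scores.reshape(grid)


def resize_bilinear(grid_map: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Upsample a 2-D map with bilinear interpolation (corner-aligned sampling)."""
    in_h, in_w = grid_map.shape
    # Positions of the output pixels expressed in input-grid coordinates.
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, in_h - 1), np.minimum(x0 + 1, in_w - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    # Interpolate along x on the two neighboring rows, then along y.
    top = grid_map[y0][:, x0] * (1 - wx) + grid_map[y0][:, x1] * wx
    bottom = grid_map[y1][:, x0] * (1 - wx) + grid_map[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def to_heatmap(score_map: np.ndarray) -> np.ndarray:
    """Min-max normalize scores to [0, 1] so they can be passed to a colormap."""
    lo, hi = score_map.min(), score_map.max()
    return (score_map - lo) / (hi - lo + 1e-8)


# Stand-in inputs: a 14x14 patch grid (ViT-B/16 at 224x224) with assumed 512-d features.
rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(14 * 14, 512))
text_feat = rng.normal(size=512)

heat = to_heatmap(resize_bilinear(matching_map(patch_feats, text_feat, (14, 14)), 224, 224))
print(heat.shape)  # (224, 224)
```

The normalized map can then be drawn over the input image, for example with `matplotlib.pyplot.imshow(heat, cmap="jet", alpha=0.5)`.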


## 🔥 Example Visualization

![Sample Output](https://raw.githubusercontent.com/adityagandhamal/clip_cam/main/assets/clip_cam.png)
_Example visualization showing the image-text matching heatmap over the image for the provided text prompt._

## 📜 License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

