Metadata-Version: 2.3
Name: face-cluster
Version: 0.1.1
Summary: CLI tool to cluster images by face similarity using InsightFace and Agglomerative Clustering
Author: j-about
Author-email: j-about <142051449+j-about@users.noreply.github.com>
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Environment :: GPU :: NVIDIA CUDA
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Classifier: Topic :: Utilities
Requires-Dist: insightface>=0.7.3
Requires-Dist: numpy>=2.2.6
Requires-Dist: onnxruntime-gpu>=1.23.2 ; (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux') or (platform_machine == 'AMD64' and sys_platform == 'win32') or (platform_machine == 'x86_64' and sys_platform == 'win32')
Requires-Dist: onnxruntime>=1.23.2 ; (platform_machine != 'AMD64' and platform_machine != 'x86_64') or sys_platform == 'darwin'
Requires-Dist: pydantic>=2.12.5
Requires-Dist: pydantic-settings>=2.12.0
Requires-Dist: rich>=14.3.2
Requires-Dist: scikit-learn>=1.7.2
Requires-Dist: typer>=0.21.1
Requires-Python: >=3.10.19
Project-URL: Homepage, https://github.com/j-about/Face-Cluster
Project-URL: Issues, https://github.com/j-about/Face-Cluster/issues
Description-Content-Type: text/markdown

# Face Cluster CLI

[![Python](https://img.shields.io/badge/python-%3E%3D3.10.19-3776ab?logo=python&logoColor=white)](https://www.python.org/)
[![Typing](https://img.shields.io/badge/typing-mypy--strict-blue)](https://mypy-lang.org/)
![Tests](https://img.shields.io/badge/tests-34%20passed-brightgreen)

Command-line tool that **clusters images by face similarity**, using [InsightFace](https://github.com/deepinsight/insightface) for face detection and embedding extraction and [Agglomerative Clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html) for grouping.

---

## Overview

Face Cluster CLI scans a directory of images, detects every face using the InsightFace **antelopev2** model (512-dimensional embeddings), groups them into identity clusters via **agglomerative hierarchical clustering** with **cosine distance**, and exports the results into neatly organised subdirectories — one per person.

## Features

- **InsightFace antelopev2** — state-of-the-art face detection and recognition with automatic model download
- **Agglomerative cosine clustering** — hierarchical clustering with configurable distance threshold and linkage
- **GPU-first inference** — ships `onnxruntime-gpu` on Linux x86_64 / Windows for automatic NVIDIA CUDA acceleration; falls back to CPU on macOS and other platforms
- **Rich terminal UX** — progress bars, coloured logging, and a summary panel at the end
- **Environment variable configuration** — every setting overridable via `FACE_CLUSTER_*` env vars or `.env` file
- **Corrupted image resilience** — unreadable files are logged and skipped without aborting the pipeline
- **Interactive overwrite protection** — prompts before cleaning an existing output directory (`--force` to skip)
- **Strict type safety** — full `mypy --strict` compliance with Pydantic validation at every boundary

## Requirements

| Requirement | Details |
|-------------|---------|
| **Python** | >= 3.10.19 |
| **OS** | Linux, macOS, Windows |
| **GPU** *(auto-detected)* | NVIDIA GPU with CUDA 12.x drivers (Linux x86_64 / Windows). CPU fallback on macOS and other platforms. |

## GPU Support

On **Linux x86_64** and **Windows x86_64**, Face Cluster installs `onnxruntime-gpu`, which is built against CUDA 12.x and cuDNN 9.x. If an NVIDIA GPU with compatible drivers is detected, inference runs on the GPU automatically; otherwise, execution falls back to the CPU transparently.

On **macOS** and other platforms (e.g. Linux aarch64), the CPU-only `onnxruntime` package is installed instead — no NVIDIA GPU wheels exist for these platforms.
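
The platform split above follows the environment markers declared in the package metadata. As a minimal sketch, the function below mirrors those two markers in plain Python; the function name is hypothetical and is not part of Face Cluster.

```python
import platform
import sys
from typing import Optional


def onnxruntime_distribution(machine: str, sys_platform: str) -> Optional[str]:
    """Return the ONNX Runtime package the dependency markers would select.

    Hypothetical helper that mirrors the two `Requires-Dist` markers.
    """
    is_x86_64 = machine in {"AMD64", "x86_64"}
    # GPU wheel: x86_64 machines running Linux or Windows.
    if is_x86_64 and sys_platform in {"linux", "win32"}:
        return "onnxruntime-gpu"
    # CPU wheel: any non-x86_64 machine, or any macOS machine.
    if not is_x86_64 or sys_platform == "darwin":
        return "onnxruntime"
    # Neither marker matches (e.g. x86_64 on another operating system).
    return None


if __name__ == "__main__":
    print(onnxruntime_distribution(platform.machine(), sys.platform))
```

Running the script on the current machine prints the package name that an installer would be expected to pick.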

> **Tip:** Run with `--verbose` to see which ONNX Runtime execution providers are active.
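
Outside the CLI, the same information can be read from ONNX Runtime directly. The sketch below uses `onnxruntime.get_available_providers()`, which lists the providers compiled into the installed build; `preferred_providers` is a hypothetical helper, and how Face Cluster itself orders providers is an assumption here. A provider being listed does not guarantee that a usable GPU is present.

```python
from typing import List, Sequence


def preferred_providers(available: Sequence[str]) -> List[str]:
    """Keep CUDA first and CPU as the fallback (assumed ordering)."""
    order = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [name for name in order if name in available]


if __name__ == "__main__":
    try:
        import onnxruntime as ort
    except ImportError:
        print("onnxruntime is not installed")
    else:
        available = ort.get_available_providers()
        print("available:", available)
        print("would try:", preferred_providers(available))
```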

## Installation

### Run without installing (recommended for one-off use)

```bash
uvx face-cluster ./photos
```

### Install as a tool

```bash
# With uv
uv tool install face-cluster

# With pip
pip install face-cluster
```

### From source

```bash
git clone https://github.com/j-about/Face-Cluster.git
cd Face-Cluster
uv sync
uv run face-cluster --help
```

## Quick Start

```bash
face-cluster ./photos --output ./clusters --verbose
```

This will:

1. Scan `./photos` for supported images
2. Detect all faces and extract 512-d embeddings
3. Cluster faces by identity using agglomerative clustering
4. Copy images into `./clusters/cluster_000/`, `cluster_001/`, etc.
5. Print a summary panel:

```
╭──── Pipeline Summary ────╮
│  Total images          42 │
│  Images with faces     38 │
│  Total faces detected  51 │
│  Clusters               4 │
│  Largest cluster       18 │
│  Outliers               3 │
╰──────────────────────────╯
```

## Usage

```
face-cluster [OPTIONS] INPUT_DIR
```

### Arguments

| Argument | Description |
|----------|-------------|
| `INPUT_DIR` | **Required.** Path to a directory containing images to cluster. Must exist and be readable. |

### Options

| Option | Short | Default | Type | Constraint | Description |
|--------|-------|---------|------|------------|-------------|
| `--output` | `-o` | `./face_clusters` | `PATH` | — | Directory for exported cluster folders. |
| `--distance-threshold` | — | `0.8` | `FLOAT` | > 0 | Cosine distance threshold at or above which clusters are not merged. |
| `--linkage` | — | `complete` | `TEXT` | `average`, `complete`, `single` | Linkage criterion: 'average', 'complete', or 'single'. |
| `--min-cluster-size` | — | `2` | `INT` | >= 2 | Clusters smaller than this are reclassified as outliers. |
| `--batch-size` | — | `32` | `INT` | >= 1 | Number of images per progress-bar tick. |
| `--force` | `-f` | `false` | `FLAG` | — | Overwrite output directory without confirmation. |
| `--verbose` | `-v` | `false` | `FLAG` | — | Enable debug-level logging. |
| `--help` | — | — | — | — | Show help message and exit. |
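
To illustrate what `--min-cluster-size` does, here is a minimal sketch of reclassifying small clusters as outliers. The `-1` sentinel and the function name are assumptions made for this example; the tool's internal representation may differ.

```python
from collections import Counter
from typing import List

OUTLIER = -1  # hypothetical sentinel label for outliers


def reclassify_small_clusters(labels: List[int], min_cluster_size: int = 2) -> List[int]:
    """Replace labels of clusters smaller than `min_cluster_size` with OUTLIER."""
    sizes = Counter(labels)
    return [
        label if sizes[label] >= min_cluster_size else OUTLIER
        for label in labels
    ]


if __name__ == "__main__":
    # Cluster 1 has a single face, so it becomes an outlier with the default of 2.
    print(reclassify_small_clusters([0, 0, 1, 2, 2, 2]))
```

With the default of `2`, only single-face clusters are moved to `outliers/`; raising the value also discards small groups.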

### Examples

```bash
# Basic usage
face-cluster ./photos

# Custom output directory and stricter clustering
face-cluster ./photos -o ./results --distance-threshold 0.5 --linkage complete

# Force overwrite with verbose logging
face-cluster ./photos -o ./results -f -v

# Run from source
uv run face-cluster ./photos --output ./clusters
```

## Output Structure

```
face_clusters/
├── cluster_000/          # Identity A
│   ├── photo_001.jpg
│   ├── photo_007.jpg
│   └── photo_012.jpg
├── cluster_001/          # Identity B
│   ├── photo_003.jpg
│   └── photo_009.jpg
├── cluster_002/          # Identity C
│   └── ...
└── outliers/             # Faces from clusters smaller than min_cluster_size
    ├── photo_022.jpg
    └── photo_035.jpg
```

- Cluster directories use **zero-padded labels** (`cluster_000`, `cluster_001`, ...).
- Images are **copied** (originals are never modified).
- If an image contains faces belonging to **multiple clusters**, it is copied into each relevant directory.
- **Filename collisions** are resolved automatically by appending a counter suffix (`photo_1.jpg`, `photo_2.jpg`, ...).
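
The copy and collision rules above can be sketched as follows. This is an illustration using only the standard library, with hypothetical function names; it is not the code in `exporter.py`.

```python
import shutil
import tempfile
from pathlib import Path


def unique_destination(directory: Path, filename: str) -> Path:
    """Return a path inside `directory` that does not exist yet."""
    original = Path(filename)
    candidate = directory / original.name
    counter = 1
    # Append _1, _2, ... to the stem until the name is free.
    while candidate.exists():
        candidate = directory / f"{original.stem}_{counter}{original.suffix}"
        counter += 1
    return candidate


def copy_without_overwrite(source: Path, directory: Path) -> Path:
    """Copy `source` into `directory`, leaving the original untouched."""
    directory.mkdir(parents=True, exist_ok=True)
    destination = unique_destination(directory, source.name)
    shutil.copy2(source, destination)
    return destination


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "photo.jpg"
        source.write_bytes(b"not a real image")
        cluster_dir = Path(tmp) / "cluster_000"
        for _ in range(3):
            # Prints photo.jpg, photo_1.jpg, photo_2.jpg
            print(copy_without_overwrite(source, cluster_dir).name)
```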

## Configuration

All settings can be overridden via environment variables with the `FACE_CLUSTER_` prefix. A `.env` file in the working directory is also supported.

| Environment Variable | Default | Type | Description |
|----------------------|---------|------|-------------|
| `FACE_CLUSTER_MODEL_NAME` | `antelopev2` | `str` | InsightFace model pack name. |
| `FACE_CLUSTER_DET_SIZE` | `(640, 640)` | `tuple[int, int]` | Detection input size (width, height). |
| `FACE_CLUSTER_BATCH_SIZE` | `32` | `int` | Number of images per progress-bar tick. |
| `FACE_CLUSTER_DISTANCE_THRESHOLD` | `0.8` | `float` | Cosine distance threshold at or above which clusters are not merged (> 0). |
| `FACE_CLUSTER_LINKAGE` | `complete` | `str` | Linkage criterion: `average`, `complete`, or `single`. |
| `FACE_CLUSTER_MIN_CLUSTER_SIZE` | `2` | `int` | Clusters smaller than this are reclassified as outliers (>= 2). |
| `FACE_CLUSTER_OUTPUT_DIR` | `./face_clusters` | `Path` | Output directory for cluster folders. |
| `FACE_CLUSTER_FORCE` | `false` | `bool` | Skip overwrite confirmation. |
| `FACE_CLUSTER_VERBOSE` | `false` | `bool` | Enable debug logging. |

> **Note:** CLI options take priority over environment variables.

**Validation rule:** `FACE_CLUSTER_LINKAGE` must be one of `average`, `complete`, or `single`.
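
The precedence and validation rules can be illustrated with a small standard-library sketch. The project itself uses Pydantic `BaseSettings`; the helper names below are hypothetical, and `.env` loading is left out.

```python
import os
from typing import Mapping, Optional

DEFAULT_DISTANCE_THRESHOLD = 0.8
DEFAULT_LINKAGE = "complete"
ALLOWED_LINKAGES = ("average", "complete", "single")


def resolve_distance_threshold(cli_value: Optional[float], environ: Mapping[str, str]) -> float:
    """CLI value wins, then the environment variable, then the default."""
    if cli_value is not None:
        value = cli_value
    else:
        value = float(environ.get("FACE_CLUSTER_DISTANCE_THRESHOLD", DEFAULT_DISTANCE_THRESHOLD))
    if value <= 0:
        raise ValueError("distance threshold must be > 0")
    return value


def resolve_linkage(cli_value: Optional[str], environ: Mapping[str, str]) -> str:
    """Same precedence as above, plus the allowed-values check."""
    value = cli_value if cli_value is not None else environ.get("FACE_CLUSTER_LINKAGE", DEFAULT_LINKAGE)
    if value not in ALLOWED_LINKAGES:
        raise ValueError(f"linkage must be one of {ALLOWED_LINKAGES}, got {value!r}")
    return value


if __name__ == "__main__":
    print(resolve_distance_threshold(None, os.environ))
    print(resolve_linkage(None, os.environ))
```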

## Supported Image Formats

| Extension |
|-----------|
| `.jpg` |
| `.jpeg` |
| `.png` |
| `.bmp` |
| `.webp` |
| `.tiff` |
| `.tif` |

Image discovery is **non-recursive** (only the top-level directory is scanned). Extension matching is case-insensitive.
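
A minimal sketch of discovery under these rules, using `pathlib` only. The function and constant names are hypothetical; the sorted result follows the "Image paths (sorted)" note in the architecture diagram.

```python
from pathlib import Path
from typing import List

SUPPORTED_EXTENSIONS = frozenset(
    {".jpg", ".jpeg", ".png", ".bmp", ".webp", ".tiff", ".tif"}
)


def discover_images(input_dir: Path) -> List[Path]:
    """Return supported image files directly inside `input_dir`, sorted."""
    return sorted(
        path
        for path in input_dir.iterdir()  # top level only, no recursion
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )


if __name__ == "__main__":
    for image in discover_images(Path(".")):
        print(image)
```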

## Exit Codes

| Code | Meaning | Examples |
|------|---------|---------|
| `0` | Success | Pipeline completed normally. |
| `1` | User error | Input directory does not exist, contains no images, or user aborted overwrite. |
| `2` | System error | Model failed to load, clustering failed, or export I/O error. |
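
When calling the tool from a script, the exit code can be inspected with `subprocess`. The sketch below assumes `face-cluster` is on `PATH`; the helper names are hypothetical. It passes `--force` so that no interactive prompt blocks an unattended run.

```python
import subprocess
from typing import Dict

EXIT_MEANINGS: Dict[int, str] = {
    0: "success",
    1: "user error",
    2: "system error",
}


def describe_exit_code(code: int) -> str:
    """Translate a documented exit code into a short label."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_face_cluster(input_dir: str, output_dir: str) -> int:
    """Run the CLI and return its exit code without raising on failure."""
    completed = subprocess.run(
        ["face-cluster", input_dir, "--output", output_dir, "--force"],
        check=False,
    )
    return completed.returncode
```

A caller would typically run `describe_exit_code(run_face_cluster("./photos", "./clusters"))` and decide whether to retry, report a configuration problem, or continue.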

## Development

### Setup

```bash
git clone https://github.com/j-about/Face-Cluster.git
cd Face-Cluster
uv sync
```

### Run Tests

```bash
uv run pytest
```

### Type Checking

```bash
uv run mypy --strict src/
```

### Run from Source

```bash
uv run face-cluster ./photos --verbose
```

## Architecture

```
INPUT_DIR
    |
    v
 Discover ──> Detect ──> Cluster ──> Export
 (pipeline)   (detector)  (clustering) (exporter)
    |             |            |            |
    v             v            v            v
 Image paths   Embedding    Cluster      Organised
 (sorted)      records      labels       directories
```

### Pipeline Steps

1. **Discover** — scan `INPUT_DIR` for supported image files (non-recursive)
2. **Detect** — extract 512-d face embeddings via InsightFace antelopev2
3. **Cluster** — L2-normalise embeddings, then run agglomerative clustering with cosine distance
4. **Export** — copy images into per-cluster subdirectories
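
To make the Cluster step concrete, here is a naive pure-Python sketch of L2 normalisation followed by complete-linkage merging under a cosine distance threshold. It is a slow stand-in for illustration only, not scikit-learn's `AgglomerativeClustering` and not the code in `clustering.py`; all function names are hypothetical, and label numbering may differ from scikit-learn's.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalise(vector: Vector) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vector]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Cosine distance between two unit vectors: 1 - dot(a, b)."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def complete_linkage_clusters(embeddings: Sequence[Vector], distance_threshold: float) -> List[int]:
    """Merge the closest pair of clusters until none is below the threshold."""
    unit = [l2_normalise(e) for e in embeddings]
    clusters: List[List[int]] = [[i] for i in range(len(unit))]

    def linkage(left: List[int], right: List[int]) -> float:
        # Complete linkage: distance of the farthest pair of members.
        return max(cosine_distance(unit[i], unit[j]) for i in left for j in right)

    while len(clusters) > 1:
        distance, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if distance >= distance_threshold:
            break  # no remaining pair is close enough to merge
        clusters[a].extend(clusters[b])
        del clusters[b]

    labels = [0] * len(unit)
    for label, members in enumerate(clusters):
        for index in members:
            labels[index] = label
    return labels


if __name__ == "__main__":
    faces = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(complete_linkage_clusters(faces, distance_threshold=0.8))
```

Lowering the threshold splits identities more aggressively, while raising it merges more faces into the same cluster.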

### Module Map

| Module | Responsibility |
|--------|----------------|
| `cli.py` | Typer entry point, argument parsing, Rich summary panel |
| `config.py` | Pydantic `BaseSettings` with `FACE_CLUSTER_*` env var support |
| `models.py` | Data models: `EmbeddingRecord`, `ClusterResult`, `PipelineSummary` |
| `exceptions.py` | Custom exception hierarchy (`FaceClusterError` base) |
| `detector.py` | InsightFace lazy singleton, face detection, embedding extraction |
| `clustering.py` | L2 normalisation, agglomerative clustering |
| `exporter.py` | Copy images into per-cluster directories with collision handling |
| `pipeline.py` | Orchestrator connecting all pipeline steps |
| `logging_setup.py` | Rich `RichHandler` configuration with shared `Console` |