Metadata-Version: 2.1
Name: dedupgenie
Version: 1.2.2
Summary: Multilingual desktop duplicate file finder with forensic-grade detection (SHA-256, SimHash + LSH)
Home-page: https://github.com/lemarcgagnon/DuplicateFinder
Author: Marc Gagnon
Author-email: Marc Gagnon <marc@marcgagnon.ca>
License: MIT
Project-URL: Homepage, https://github.com/lemarcgagnon/DuplicateFinder
Keywords: duplicate,files,finder,dedup,forensic,simhash,multilingual,i18n
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: X11 Applications :: Qt
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Desktop Environment :: File Managers
Classifier: Topic :: System :: Filesystems
Classifier: Natural Language :: English
Classifier: Natural Language :: French
Classifier: Natural Language :: German
Classifier: Natural Language :: Italian
Classifier: Natural Language :: Arabic
Classifier: Natural Language :: Portuguese
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: PyQt5>=5.15

<p align="center">
  <img src="https://raw.githubusercontent.com/lemarcgagnon/DuplicateFinder/main/DedupGenie.png" alt="DedupGenie" width="180">
</p>

# DedupGenie

A desktop application that finds and manages duplicate files using forensic-grade detection algorithms. Built with Python and PyQt5.

![Python 3.8+](https://img.shields.io/badge/python-3.8%2B-blue)
![PyQt5](https://img.shields.io/badge/GUI-PyQt5-green)
![License](https://img.shields.io/badge/license-MIT-lightgrey)

## Features

- **Three detection modes:**
  - **Strict** — SHA-256 full-content hash. Zero false positives. Uses a progressive pipeline (size → head → tail → full hash) to skip unnecessary I/O.
  - **Balanced** — Same progressive pipeline but stops at head+tail match. Near-zero false positives, much faster on large files.
  - **Fuzzy** — SimHash with LSH banding. Catches similar-but-not-identical files (~85%+ content overlap). Works on both text and binary files.

- **Side-by-side comparison lab** — Click any file to see a visual diff with its duplicate, including text similarity percentage and SimHash distance.

- **Smart auto-clean** — One-click wizard that identifies redundant copies using path heuristics (detects "copy", "backup", temp folders, etc.) and moves them to quarantine.

- **Safe quarantine workflow** — Files are moved to a `_FORENSIC_QUARANTINE` folder before permanent deletion. Review, restore, or purge at any time.

- **Multilingual** — Full interface in 6 languages: English, Français, Deutsch, Italiano, العربية, Português. Switch instantly from the Language menu. Arabic includes RTL layout support.
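
The Strict and Balanced modes can be pictured as successive filters over groups of candidate files. The following is a minimal sketch, not DedupGenie's actual code: the function names (`find_duplicates`, `_refine`) and the 4 KB head/tail chunk size are assumptions made for illustration.

```python
import hashlib
import os
from collections import defaultdict

CHUNK = 4096  # assumed head/tail size (4 KB)


def _read_head(path):
    """Return the first CHUNK bytes of the file."""
    with open(path, "rb") as f:
        return f.read(CHUNK)


def _read_tail(path):
    """Return the last CHUNK bytes of the file (the whole file if smaller)."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(max(0, size - CHUNK))
        return f.read(CHUNK)


def _sha256(path):
    """Hash the full content in 1 MiB blocks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def _refine(groups, key):
    """Split each group by key(path) and drop groups left with a single file."""
    refined = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        refined.extend(g for g in buckets.values() if len(g) > 1)
    return refined


def find_duplicates(paths, strict=True):
    """Run size -> head -> tail, plus the full hash when strict is True."""
    groups = [list(paths)]
    stages = [os.path.getsize, _read_head, _read_tail]
    if strict:
        stages.append(_sha256)
    for stage in stages:
        groups = _refine(groups, stage)
    return [sorted(g) for g in groups]
```

Calling `find_duplicates(paths, strict=False)` stops after the tail comparison. That is why a Balanced-style scan reads far less data on large files, and also why it can, in rare cases, group files that differ only in the middle.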

## Install

Python 3.8+ and pip are required.

### Option 1 — Install with pip (recommended)

```bash
pip install dedupgenie
```

Then run from anywhere:

```bash
dedupgenie
```

To uninstall: `pip uninstall dedupgenie`

### Option 2 — Download and run directly

```bash
git clone https://github.com/lemarcgagnon/DuplicateFinder.git
cd DuplicateFinder
pip install -r requirements.txt
python3 app.py
```

### Linux note

On some distributions, you may need system packages for Qt:

```bash
# Ubuntu / Debian
sudo apt install python3-pyqt5

# Fedora
sudo dnf install python3-qt5
```

Otherwise, no additional system packages are needed.

## Usage

1. **Set target directory** — Type a path or click **Browse...**
2. **Choose sensitivity** — Strict (exact), Balanced (fast), or Fuzzy (similar content)
3. **Click Analyze** — The scan runs in the background with a progress indicator
4. **Review results** — Click a folder on the left to see its files on the right. Duplicates are listed with their copies as child items.
5. **Compare** — Click any file to see a side-by-side comparison in the bottom panel
6. **Clean up** — Use **Select duplicates** → **Quarantine selected**, or let **Auto-clean duplicates** handle it automatically

## How detection works

```
Strict / Balanced:
  file size → first 4 KB → last 4 KB → full SHA-256 (Strict only)
  Each stage eliminates non-candidates before reading more data.

Fuzzy:
  tokenize text (or byte n-grams for binary)
  → 64-bit SimHash from shingle hashes
  → 8 LSH bands of 8 bits each
  Files sharing any band are candidate duplicates (~85%+ similarity threshold).
```
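
A rough sketch of the Fuzzy pipeline above, under stated assumptions: word 3-shingles for text, 4-byte n-grams for binary data, and BLAKE2b as the shingle hash are illustrative choices, and none of the function names come from `app.py`.

```python
import hashlib
import re
from collections import defaultdict


def _features(data: bytes, n: int = 4):
    """Word 3-shingles for UTF-8 text with 3+ words, else byte n-grams."""
    try:
        words = re.findall(r"\w+", data.decode("utf-8").lower())
        if len(words) >= 3:
            return [" ".join(words[i:i + 3]).encode() for i in range(len(words) - 2)]
    except UnicodeDecodeError:
        pass
    return [data[i:i + n] for i in range(max(1, len(data) - n + 1))]


def simhash64(data: bytes) -> int:
    """64-bit SimHash: each feature hash votes +1/-1 on every bit position."""
    weights = [0] * 64
    for feat in _features(data):
        h = int.from_bytes(hashlib.blake2b(feat, digest_size=8).digest(), "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def bands(fingerprint: int):
    """Split a 64-bit fingerprint into 8 (band index, 8-bit value) keys."""
    return [(i, (fingerprint >> (8 * i)) & 0xFF) for i in range(8)]


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def candidate_pairs(fingerprints: dict):
    """Return name pairs whose fingerprints agree on at least one band."""
    buckets = defaultdict(set)
    for name, fp in fingerprints.items():
        for key in bands(fp):
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        ordered = sorted(names)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs
```

Because the fingerprint is split into 8 bands, any two fingerprints that differ in 7 bits or fewer must agree on at least one whole band, so they always end up as a candidate pair. `hamming()` can then rank candidates by distance.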

## Project structure

```
DedupGenie/
├── app.py              # Algorithms, UI, and logic (single-file app)
├── translations.py     # i18n strings (6 languages)
├── pyproject.toml      # Package metadata (pip install)
├── setup.py            # Fallback for older pip
├── requirements.txt    # Python dependencies (PyQt5)
├── app.log             # Runtime warnings (created on first run)
└── README.md
```

## Safety & Disclaimer

**Back up your data before using this tool.** Always keep a copy of important files before running any cleanup operation.

Several safeguards are built in to prevent accidental data loss:

- Every destructive action requires an explicit confirmation dialog.
- Files are quarantined (moved to `_FORENSIC_QUARANTINE/`) before permanent deletion — you can review and restore them at any time.
- The auto-clean wizard never deletes files directly; it only moves copies to quarantine.
- No file is ever touched without user-initiated action.
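
The quarantine-first behavior described above might look like the following sketch. The keyword pattern, the scoring rule, and the function names are hypothetical; only the `_FORENSIC_QUARANTINE` folder name comes from this README.

```python
import re
import shutil
from pathlib import Path

QUARANTINE_DIR = "_FORENSIC_QUARANTINE"
# Hypothetical, deliberately crude markers; the app's real heuristics are not listed here.
REDUNDANT_HINT = re.compile(r"copy|backup|tmp|temp", re.IGNORECASE)


def redundancy_score(path: Path) -> int:
    """Count path components that look like a copy, backup, or temp location."""
    return sum(1 for part in path.parts if REDUNDANT_HINT.search(part))


def pick_redundant(group):
    """Keep the least suspicious (then shortest) path; return (keep, redundant)."""
    ranked = sorted(group, key=lambda p: (redundancy_score(p), len(str(p)), str(p)))
    return ranked[0], ranked[1:]


def quarantine(path: Path, root: Path) -> Path:
    """Move path under root/_FORENSIC_QUARANTINE, keeping its relative layout."""
    target = root / QUARANTINE_DIR / path.relative_to(root)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), str(target))
    return target


def restore(quarantined: Path, root: Path) -> Path:
    """Move a quarantined file back to its original location under root."""
    original = root / quarantined.relative_to(root / QUARANTINE_DIR)
    original.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(quarantined), str(original))
    return original
```

Nothing in this sketch deletes data: `quarantine()` and `restore()` only move files, so every step can be undone until the quarantine folder itself is purged.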

That said, **no software is infallible**. Depending on your operating system and configuration, deleted files may still be recoverable from your trash or recycle bin, but this is not guaranteed.

> **This software is provided "as is", without warranty of any kind, express or implied. The authors are not responsible for any data loss resulting from the use of this tool. You use it entirely at your own risk.**

## Security audit

This codebase was scanned before publication with two widely used security tools:

- **[Bandit](https://bandit.readthedocs.io/)** (static analysis for Python) — **0 actionable issues.** Six informational findings related to `subprocess` usage were reviewed and confirmed safe: all calls pass arguments as a list (no shell interpolation) and operate on user-selected paths within the app.
- **[pip-audit](https://pypi.org/project/pip-audit/)** (dependency vulnerability scanner) — **0 known vulnerabilities** in project dependencies (PyQt5).
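
To illustrate why list arguments matter, here is a small sketch (not taken from `app.py`) that uses the Python interpreter itself as a stand-in child process. With a list and no `shell=True`, a path containing shell metacharacters reaches the child as one literal argument.

```python
import subprocess
import sys


def run_with_list_args(path: str) -> str:
    """Pass path as a single argv entry; no shell ever parses it."""
    # The child simply echoes its first argument back (illustrative stand-in).
    cmd = [sys.executable, "-c", "import sys; print(sys.argv[1])", path]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.rstrip("\n")
```

A name such as `my file; echo pwned.txt` comes back unchanged, whereas the same string interpolated into a shell command line would be split and partly executed.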

You can re-run these audits yourself at any time:

```bash
pip install bandit pip-audit
bandit -r app.py
pip-audit -r requirements.txt
```

## Credits

Created by **Marc Gagnon** ([marcgagnon.ca](https://marcgagnon.ca)) with **Gemini** and **Claude**.

## License

MIT
