Metadata-Version: 2.4
Name: DedupeCopy
Version: 1.1.2
Summary: Find duplicates / copy and restructure file layout command-line tool
Author-email: Erik Schweller <othererik@gmail.com>
Maintainer-email: Erik Schweller <othererik@gmail.com>
License: BSD-2-Clause
Project-URL: Homepage, https://pypi.python.org/pypi/DedupeCopy/
Project-URL: Repository, https://www.github.com/othererik/dedupe_copy
Keywords: de-duplication,file management,deduplication,file-copy,backup
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: End Users/Desktop
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Utilities
Classifier: Topic :: System :: Filesystems
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: color
Requires-Dist: colorama>=0.4.0; extra == "color"
Provides-Extra: fast-hash
Requires-Dist: xxhash>=3.4.1; extra == "fast-hash"
Dynamic: license-file

# DedupeCopy

A multi-threaded command-line tool for finding duplicate files and copying/restructuring file layouts while eliminating duplicates.

[![License](https://img.shields.io/badge/license-BSD-blue.svg)](LICENSE)
[![Python](https://img.shields.io/badge/python-3.6+-blue.svg)](https://www.python.org/downloads/)

## Table of Contents

- [Overview](#overview)
- [Architecture](#architecture)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Key Concepts](#key-concepts)
- [Usage Examples](#usage-examples)
- [Command-Line Options](#command-line-options)
- [Path Rules](#path-rules)
- [Advanced Workflows](#advanced-workflows)
- [Performance Tips](#performance-tips)
- [Troubleshooting](#troubleshooting)
- [Safety and Best Practices](#safety-and-best-practices)

## Overview

DedupeCopy is designed for consolidating and restructuring sprawling file systems, particularly useful for:

- **Backup consolidation**: Merge multiple backup sources while eliminating duplicates
- **Photo/media library organization**: Consolidate photos from various devices and organize by date
- **File system cleanup**: Identify and remove duplicate files
- **Server migration**: Copy files to new storage while preserving structure
- **Duplicate detection**: Generate reports of duplicate files without copying
- **Deleting duplicates**: Reclaim disk space by removing redundant files.

**The good bits:**
- Uses MD5 checksums for accurate duplicate detection
- Multi-threaded for fast processing
- Manifest system for resuming interrupted operations
- Flexible path restructuring rules
- Can compare against multiple file systems without full re-scans
- Configurable logging with verbosity levels (quiet, normal, verbose, debug)
- Colored output for better readability (optional)
- Helpful error messages with actionable suggestions
- Real-time progress with processing rates

**Note:** This is *not* a replacement for rsync or Robocopy for incremental synchronization. Those are good tools that might work for you, so do try them.

## Architecture

DedupeCopy uses a multi-threaded pipeline architecture to maximize performance when processing large file systems. Understanding this architecture helps explain performance characteristics and tuning options.

### High-Level Design

```
┌─────────────────────────────────────────────────────────────────┐
│                         MAIN THREAD                             │
│  • Orchestrates the entire operation                            │
│  • Manages thread lifecycle and coordination                    │
│  • Handles manifest loading/saving                              │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                     THREAD POOLS (Queues)                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌────────────────┐    ┌────────────────┐    ┌───────────────┐  │
│  │  Walk Threads  │───▶│  Read Threads  │───▶│ Copy/Delete   │  │
│  │  (4 default)   │    │  (8 default)   │    │   Threads     │  │
│  │                │    │                │    │  (8 default)  │  │
│  └────────────────┘    └────────────────┘    └───────────────┘  │
│         │                      │                      │         │
│         ▼                      ▼                      ▼         │
│   Walk Queue            Work Queue            Copy/Delete       │
│   (directories)         (files to hash)         Queue           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      PROGRESS THREAD                            │
│  • Collects status updates from all worker threads              │
│  • Displays progress, rates, and statistics                     │
│  • Logs errors and warnings                                     │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                    RESULT PROCESSOR                             │
│  • Processes file hashes from read threads                      │
│  • Detects duplicate files                                      │
│  • Updates manifest and collision dictionaries                  │
│  • Performs incremental saves every 50,000 files                │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                    PERSISTENT STORAGE                           │
│  • Manifest: Maps hash → list of files with that hash           │
│  • Collision DB: Tracks duplicate files                         │
│  • SQLite-backed with disk caching (LRU-like eviction)          │
│  • Auto-saves every 50,000 files for crash recovery             │
└─────────────────────────────────────────────────────────────────┘
```

### Pipeline Stages

#### 1. **Walk Stage** (WalkThread)
- **Purpose**: Discover files and directories in the source paths
- **Thread Count**: 4 by default (configurable with `--walk-threads`)
- **Output**: Adds directories to `walk_queue` and files to `work_queue`
- **Filtering**: Applies extension filters and ignore patterns

#### 2. **Read Stage** (ReadThread)
- **Purpose**: Hash file contents to detect duplicates
- **Thread Count**: 8 by default (configurable with `--read-threads`)
- **Input**: Files from `work_queue`
- **Output**: Tuple of (hash, size, mtime, filepath) to `result_queue`
- **Algorithm**: `md5` or `xxh64` (configurable with `--hash-algo`)

#### 3. **Result Processing** (ResultProcessor)
- **Purpose**: Aggregate hashes and detect collisions
- **Thread Count**: 1 (single-threaded for data consistency)
- **Input**: Hash results from `result_queue`
- **Output**: Updates `manifest` and `collisions` dictionaries
- **Auto-save**: Incremental saves every 50,000 processed files

#### 4. **Copy/Delete Stage** (CopyThread/DeleteThread)
- **Purpose**: Perform file operations based on duplicate analysis
- **Thread Count**: 8 by default (configurable with `--copy-threads`)
- **Input**: Files from `copy_queue` or `delete_queue`
- **Operations**:
  - Copy unique files to destination (with optional path rules)
  - Delete duplicate files (keeping first occurrence)

#### 5. **Progress Monitoring** (ProgressThread)
- **Purpose**: Centralized logging and status updates
- **Thread Count**: 1 (collects events from all other threads)
- **Features**:
  - Real-time progress with processing rates
  - Queue size monitoring (to detect bottlenecks)
  - Error aggregation and reporting
  - Final summary statistics

### Data Structures

#### Manifest
- **Storage**: Disk-backed dictionary (SQLite + cache layer)
- **Format**: `hash → [(filepath, size, mtime), ...]`
- **Cache**: In-memory LRU cache (10,000 items default)
- **Persistence**: Auto-saved during operation and on completion

#### Disk Cache Dictionary
- **Purpose**: Handle datasets larger than available RAM
- **Backend**: SQLite with Write-Ahead Logging (WAL)
- **Optimization**: Batched commits (100 items) for performance
- **Eviction**: LRU or random eviction when cache is full

### Performance Characteristics

**Queue-Based Throttling**: When queues exceed 100,000 items, the system introduces deliberate delays to prevent memory exhaustion.

**Bottleneck Detection**: Progress thread monitors queue sizes:
- **Walk queue growing**: Too many directories, consider reducing `--walk-threads`
- **Work queue growing**: Hashing is slower than discovery, increase `--read-threads`
- **Copy queue growing**: I/O is slower than hashing, increase `--copy-threads`

**I/O Patterns**:
- **Walk threads**: Mostly metadata operations (directory listings)
- **Read threads**: Sequential reads of entire files
- **Copy threads**: Large sequential writes

### Thread Safety

- **Queues**: Python's `queue.Queue` provides thread-safe operations
- **Manifest**: Uses database-level locking (SQLite RLock)
- **Save operations**: Coordinated via `save_event` to pause workers during saves

## Installation

### Via pip (recommended)

```bash
pip install DedupeCopy
```

### With color support (optional)

For colored console output (errors in red, warnings in yellow, etc.):

```bash
pip install DedupeCopy[color]
```

### With fast hashing (optional)

For faster file hashing using xxhash:
```bash
pip install DedupeCopy[fast_hash]
```

### From source

```bash
git clone https://github.com/othererik/dedupe_copy.git
cd dedupe_copy
pip install -e .
# Or with color support:
pip install -e .[color]
```

### Requirements

- Python 3.8 or later
- Sufficient disk space for manifest files (typically small, but can grow for very large file sets)
- Optional: colorama for colored console output (installed with `[color]` extra)

## Quick Start

### Find duplicates in a directory

```bash
dedupecopy -p /path/to/search -r duplicates.csv
```

This scans `/path/to/search` and creates a CSV report of all duplicate files.

### Copy files while removing duplicates

```bash
dedupecopy -p /source/path -c /destination/path
```

This copies all files from source to destination, skipping any duplicates.

### Copy with manifest (recommended for large operations)

```bash
dedupecopy -p /source/path -c /destination/path -m manifest.db
```

Creates a manifest file that allows you to resume if interrupted.

### Delete Duplicates

```bash
dedupecopy -p /path/to/search --delete --dry-run
```

This will scan the specified path and show you which files would be deleted. Once you are sure, you can run the command again without `--dry-run` to perform the deletion.

## Key Concepts

### Manifests

Manifests are database files that store:
- MD5 checksums of processed files
- File metadata (size, modification time, path)
- Which files have been scanned

**Benefits:**
- Resume interrupted operations
- Compare file systems without re-scanning
- Incremental backup workflows
- Track what has been processed

**Usage:**
- `-m manifest.db` - Save manifest after processing
- `-i manifest.db` - Load existing manifest before processing
- Manifest files are stored in a disk-based cache format (`.db` extension)

### Duplicate Detection

Files are considered duplicates when:
1. They have identical MD5 checksums
2. They have the same file size

**Special case:** Empty (zero-byte) files are treated as duplicates by default. Use `--keep-empty` to treat each empty file as unique.

### Path Rules

Path rules control how files are organized in the destination:

| Rule | Description | Example Output |
|------|-------------|----------------|
| `no_change` | Preserve original directory structure | `/dest/original/path/file.jpg` |
| `mtime` | Organize by modification date (YYYY_MM) | `/dest/2024_03/file.jpg` |
| `extension` | Organize by file extension | `/dest/jpg/file.jpg` |

Rules can be combined and applied per extension pattern (see [Path Rules](#path-rules-1) section).

### Extension Patterns

Extension filters support wildcards:
- `jpg` - Match .jpg files
- `*.jp*g` - Match .jpg, .jpeg, .jpng, etc.
- `*` - Match all extensions

## Usage Examples

### Basic Operations

#### 1. Generate a duplicate file report

```bash
dedupecopy -p /Users/johndoe -r dupes.csv -m manifest.db
```

Creates a CSV report of all duplicates and saves a manifest for future use.

**With quiet output (minimal):**
```bash
dedupecopy -p /Users/johndoe -r dupes.csv -m manifest.db --quiet
```

**With verbose output (detailed progress):**
```bash
dedupecopy -p /Users/johndoe -r dupes.csv -m manifest.db --verbose
```

#### 2. Copy specific file types

```bash
dedupecopy -p /source -c /backup -e jpg -e png -e gif
```

Copy only image files (jpg, png, gif) to the backup directory.

#### 3. Copy with preserved directory structure

```bash
dedupecopy -p /source -c /backup -R "*:no_change"
```

The `*` pattern applies the `no_change` rule to all file types.

### Photo Organization

#### Organize photos by date

```bash
dedupecopy -p C:\pics -p D:\pics -e jpg -R "jpg:mtime" -c X:\organized_photos
```

Copies all JPG files from C: and D: drives, organizing them into folders by year/month (e.g., `2024_03/`).

#### Organize by extension AND date

```bash
dedupecopy -p /media -c /organized -R "*:extension" -R "*:mtime"
```

Creates structure like: `/organized/jpg/2024_03/photo.jpg`

### Multi-Source Consolidation

#### Copy from multiple sources to single destination

```bash
dedupecopy -p /source1 -p /source2 -p /source3 -c /backup -m backup_manifest.db
```

Scans all three source paths and copies unique files to backup.

#### Resume an interrupted copy

```bash
dedupecopy -p /source -c /destination -i manifest.db -m manifest.db
```

Loads the previous manifest and resumes where it left off.

### Advanced Pattern Matching

#### Ignore specific patterns

```bash
dedupecopy -p /source -c /backup --ignore "*.tmp" --ignore "*.cache" --ignore "**/Thumbs.db"
```

Excludes temporary files and thumbnails from processing.

#### Extension-specific rules

```bash
dedupecopy -p /media -c /organized \
  -R "*.jpg:mtime" \
  -R "*.mp4:extension" \
  -R "*.doc*:no_change"
```

Different organization rules for different file types.

## Command-Line Options

### Required Options (one of)

| Option | Description |
|--------|-------------|
| `-p PATH`, `--read-path PATH` | Source path(s) to scan. Can be specified multiple times. |
| `--no-walk` | Skip file system walk; use paths from loaded manifest only. |

### Core Options

| Option | Description |
|--------|-------------|
| `-c PATH`, `--copy-path PATH` | Destination path for copying files. Cannot be used with `--delete`. |
| `--delete` | Deletes duplicate files, keeping the first-seen file. Cannot be used with `--copy-path`. |
| `-r PATH`, `--result-path PATH` | Output path for CSV duplicate report. |
| `-m PATH`, `--manifest-dump-path PATH` | Path to save manifest file. |
| `-i PATH`, `--manifest-read-path PATH` | Path to load existing manifest. Can be specified multiple times. |
| `-e EXT`, `--extensions EXT` | File extension(s) to include (e.g., `jpg`, `*.png`). Can be specified multiple times. |
| `--ignore PATTERN` | File pattern(s) to exclude (supports wildcards). Can be specified multiple times. |
| `-R RULE`, `--path-rules RULE` | Path restructuring rule(s) in format `extension:rule`. Can be specified multiple times. |

### Special Options

| Option | Description |
|--------|-------------|
| `--compare PATH` | Load manifest but don't copy its files (for comparison only). Can be specified multiple times. |
| `--copy-metadata` | Preserve file timestamps and permissions (uses `shutil.copy2` instead of `copyfile`). |
| `--keep-empty` | Treat empty (0-byte) files as unique rather than duplicates. |
| `--ignore-old-collisions` | Only detect new duplicates (ignore duplicates already in loaded manifest). |
| `--dry-run` | Simulate operations without making any changes to the filesystem. |
| `--min-delete-size BYTES` | Minimum size of a file to be considered for deletion (e.g., `1048576` for 1MB). Default: `0`. |

### Output Control Options

| Option | Description |
|--------|-------------|
| `-q`, `--quiet` | Show only warnings and errors (minimal output). |
| `-v`, `--verbose` | Show detailed progress information (same as normal, kept for compatibility). |
| `--debug` | Show debug information including queue states and internal diagnostics. |
| `--no-color` | Disable colored output (useful for logging to files or non-terminal output). |


**Output Verbosity Levels:**
- **Normal** (default): Standard progress updates, errors, and summaries.
- **Quiet** (`--quiet`): Only warnings, errors, and the final summary.
- **Verbose** (`--verbose`): More frequent progress updates that include processing rates and timing details.
- **Debug** (`--debug`): All output including queue states and internal operations for troubleshooting.

### Performance Options

| Option | Default | Description |
|--------|---------|-------------|
| `--walk-threads N` | 4 | Number of threads for file system traversal. |
| `--read-threads N` | 8 | Number of threads for reading and hashing files. |
| `--copy-threads N` | 8 | Number of threads for copying files. |
| `--hash-algo ALGO` | `md5` | Hashing algorithm to use (`md5` or `xxh64`). `xxh64` requires the `fast_hash` extra. |

### Path Conversion Options

| Option | Description |
|--------|-------------|
| `--convert-manifest-paths-from PREFIX` | Original path prefix in manifest to replace. |
| `--convert-manifest-paths-to PREFIX` | New path prefix (useful when drive letters or mount points change). |

## Path Rules

Path rules determine how files are organized in the destination directory. Multiple rules can be applied to create nested structures.

### Available Rules

#### `no_change`
Preserves the original directory structure from the source path.

```bash
dedupecopy -p /source/photos -c /backup -R "*:no_change"
```

Result: `/backup/photos/2023/vacation/photo.jpg`

#### `mtime`
Organizes files by modification date in `YYYY_MM` format.

```bash
dedupecopy -p /source -c /backup -R "*.jpg:mtime"
```

Result: `/backup/2024_03/photo.jpg`

#### `extension`
Organizes files into folders by extension.

```bash
dedupecopy -p /source -c /backup -R "*:extension"
```

Result: `/backup/jpg/photo.jpg`

### Combining Rules

Rules are applied in the order specified and create nested directories:

```bash
dedupecopy -p /source -c /backup -R "*:extension" -R "*:mtime"
```

Result: `/backup/jpg/2024_03/photo.jpg`

### Extension-Specific Rules

Apply different rules to different file types:

```bash
dedupecopy -p /source -c /backup \
  -R "*.jpg:mtime" \
  -R "*.mp4:extension" \
  -R "*.pdf:no_change"
```

- JPG files organized by date
- MP4 files organized by extension
- PDF files keep original structure

### Wildcard Matching

Extension patterns support wildcards:

```bash
-R "*.jp*:mtime"        # Matches .jpg, .jpeg, .jpe, etc.
-R "image*.png:extension"  # Matches image1.png, image_photo.png, etc.
-R "*:no_change"        # Applies to all files
```

The most specific pattern wins when multiple patterns could match.

## Advanced Workflows

### Sequential Multi-Source Backup

When consolidating from multiple sources to a single target while avoiding duplicates between sources:

#### Step 1: Create manifests for all locations

```bash
# Scan target (if it has existing files)
dedupecopy -p /backup/target -m target_manifest.db

# Scan each source
dedupecopy -p /source1 -m source1_manifest.db
dedupecopy -p /source2 -m source2_manifest.db
```

#### Step 2: Copy each source sequentially

```bash
# Copy source1 (comparing against target)
dedupecopy -p /source1 -c /backup/target \
  -i source1_manifest.db \
  --compare target_manifest.db \
  --no-walk

# Copy source2 (comparing against target AND source1)
dedupecopy -p /source2 -c /backup/target \
  -i source2_manifest.db \
  --compare target_manifest.db \
  --compare source1_manifest.db \
  --no-walk
```

**How it works:**
- `--no-walk` skips re-scanning (uses manifest data)
- `--compare` loads manifests for duplicate checking but doesn't copy those files
- Each source is copied only if files aren't already in target or previous sources

### Manifest Path Conversion

If drive letters or mount points change between runs:

```bash
dedupecopy -i old_manifest.db -m new_manifest.db \
  --convert-manifest-paths-from "/Volumes/OldDrive" \
  --convert-manifest-paths-to "/Volumes/NewDrive" \
  --no-walk
```

Updates all paths in the manifest without re-scanning files.

### Incremental Backup

```bash
# Initial backup
dedupecopy -p /photos -c /backup/photos -m backup_manifest.db

# Later, add new photos (resuming/adding to existing backup)
dedupecopy -p /photos -c /backup/photos -i backup_manifest.db -m backup_manifest.db
```

Only new or modified files are processed.

### Comparison Without Copying

Compare two directories to find what's different:

```bash
# Scan both locations
dedupecopy -p /location1 -m manifest1.db
dedupecopy -p /location2 -m manifest2.db

# Generate report of files in location1 not in location2
dedupecopy -p /location1 -i manifest1.db --compare manifest2.db -r unique_files.csv --no-walk
```

## Performance Tips

### Thread Count Tuning

**Default settings (4/8/8)** work well for most scenarios.

**For SSDs/NVMe:**
```bash
--walk-threads 8 --read-threads 16 --copy-threads 16
```

**For HDDs:**
```bash
--walk-threads 2 --read-threads 4 --copy-threads 4
```

**For network shares:**
```bash
--walk-threads 2 --read-threads 4 --copy-threads 2
```
Network latency makes more threads counterproductive.

### Large File Sets

For very large directories (millions of files):

1. **Use manifests** - Essential for resumability
2. **Process in batches** - Use `--ignore` to exclude subdirectories, process separately
3. **Monitor memory** - Manifests use disk-based caching to minimize memory usage
4. **Incremental saves** - Manifests auto-save every 50,000 files

### Network Considerations

- **Network paths may timeout** - Tool retries after 3 seconds
- **SMB/CIFS shares** - Use lower thread counts
- **Bandwidth limits** - Reduce copy threads to avoid saturation
- **VPN connections** - May need much lower thread counts

### Manifest Storage

- Manifest files are stored as Berkeley DB files
- Size is proportional to number of unique files (typically a few MB per 100k files)
- Keep manifests on fast storage (SSD) for best performance
- Manifests are incrementally saved every 50,000 processed files

## Logging and Output Control

### Verbosity Levels

DedupeCopy provides flexible output control to suit different use cases:

#### Normal Mode (Default)
Standard output with progress updates every 1,000 files:

```bash
dedupecopy -p /source -c /destination
```

**Output includes:**
- Pre-flight configuration summary
- Progress updates with file counts and processing rates
- Error messages with helpful suggestions
- Final summary statistics

#### Quiet Mode
Minimal output - only warnings, errors, and final results:

```bash
dedupecopy -p /source -c /destination --quiet
```

**Best for:**
- Cron jobs and automated scripts
- When you only care about problems
- Reducing log file sizes

#### Verbose Mode
Detailed progress information:

```bash
dedupecopy -p /source -c /destination --verbose
```

**Output includes:**
- All normal mode output
- More frequent progress updates
- Detailed timing and rate information

#### Debug Mode
Comprehensive diagnostic information:

```bash
dedupecopy -p /source -c /destination --debug
```

**Output includes:**
- All verbose mode output
- Queue sizes and internal state
- Thread activity details
- Useful for troubleshooting performance issues

### Color Output

By default, DedupeCopy uses colored output when writing to a terminal (if colorama is installed):

- **Errors**: Red text
- **Warnings**: Yellow text
- **Info messages**: Default color
- **Debug messages**: Cyan text

To disable colors (e.g., when logging to a file):

```bash
dedupecopy -p /source -c /destination --no-color
```

Colors are automatically disabled when output is redirected to a file or pipe.

### Enhanced Features

#### Pre-Flight Summary
Before starting operations, you'll see a summary of configuration:

```
======================================================================
DEDUPE COPY - Operation Summary
======================================================================
Source path(s): 2 path(s)
  - /Volumes/Source1
  - /Volumes/Source2
Destination: /Volumes/Backup
Extension filter: jpg, png, gif
Path rules: *.jpg:mtime
Threads: walk=4, read=8, copy=8
Options: keep_empty=False, preserve_stat=True, no_walk=False
======================================================================
```

#### Progress with Rates
During operation, you'll see processing rates:

```
Discovered 5000 files (dirs: 250), accepted 4850. Rate: 142.3 files/sec
Work queue has 234 items. Progress queue has 12 items. Walk queue has 5 items.
...
Copied 4800 items. Skipped 50 items. Rate: 125.7 files/sec
```

#### Helpful Error Messages
Errors include context and suggestions:

```
Error processing '/path/to/file.txt': [PermissionError] Permission denied
  Suggestions: Check file permissions; Ensure you have read access to source files
```

#### Proactive Warnings
The tool warns about potential issues before they become problems:

```
WARNING: Work queue is large (42000 items). Consider reducing thread counts to avoid memory issues.
WARNING: Progress queue is backing up (12000 items). This may indicate slow processing.
```

### Examples

#### Silent operation for scripts
```bash
dedupecopy -p /source -c /backup --quiet 2>&1 | tee backup.log
```

#### Maximum detail for troubleshooting
```bash
dedupecopy -p /source -c /backup --debug --no-color > debug.log 2>&1
```

#### Normal operation with color
```bash
dedupecopy -p /source -c /backup --verbose
```

## Troubleshooting

### Common Issues

#### "Directory disappeared during walk"

**Cause:** Network path timeout or files deleted during scan.

**Solution:**
- Reduce `--walk-threads` for network paths
- Ensure stable network connection
- Exclude volatile directories with `--ignore`

#### Out of Memory Errors

**Cause:** Very large queue sizes.

**Solution:**
- Reduce thread counts
- Process in smaller batches
- Ensure sufficient swap space

#### Permission Errors

**Cause:** Insufficient permissions on source or destination.

**Solution:**
```bash
# Check permissions
ls -la /source/path
ls -la /destination/path

# Run with appropriate user or use sudo (carefully!)
```

#### Resuming Interrupted Runs

If a run is interrupted:

```bash
# Resume using the manifest
dedupecopy -p /source -c /destination -i manifest.db -m manifest.db
```

Files already processed (in manifest) are skipped.

#### Manifest Corruption

If manifest files become corrupted:

**Solution:**
- Delete manifest files and restart
- Manifest files: `manifest.db` and `manifest.db.read`
- Consider keeping backup copies of manifests for very long operations

### Getting Help

Check the output during run:
- Progress updates every 1,000 files with processing rates
- Error messages show problematic files with helpful suggestions
- Warnings alert you to potential issues proactively
- Final summary shows counts and errors

For debugging, use `--debug` mode:

```bash
dedupecopy -p /source -c /destination --debug --no-color > debug.log 2>&1
```

Debug output includes:
- File counts and progress with timing
- Queue sizes and internal state (useful if growing unbounded)
- Thread activity and performance metrics
- Specific error messages with file paths and suggestions

## Safety and Best Practices

### ⚠️ Important Warnings

1. **Test first**: Run with `-r` (report only) before using `-c` (copy) on important data
2. **Backup important data**: Always have backups before restructuring
3. **Use manifests**: They provide a record of what was processed
4. **Verify results**: Check file counts and spot-check files after copy operations
5. **Watch disk space**: Ensure sufficient space on destination

### Recommended Workflow

```bash
# Step 1: Generate report to understand what will happen
dedupecopy -p /source -r preview.csv -m preview_manifest.db

# Step 2: Review the CSV report
# Check duplicate counts, file types, sizes

# Step 3: Run the actual copy with manifest
dedupecopy -p /source -c /destination -i preview_manifest.db -m final_manifest.db

# Step 4: Verify
# Check file counts, spot-check files, verify important files copied
```

### What Gets Copied / Deleted

- **First occurrence** of each unique file (by MD5 hash) is kept.
- Subsequent identical files are either copied to the destination or deleted from the source.
- Files are considered unique if their MD5 hash differs.
- Empty files are treated as duplicates unless `--keep-empty` is used.
- Ignored patterns (`--ignore`) are never copied or deleted.

### What Doesn't Get Copied / Deleted

- The first-seen version of a file is never deleted.
- Files matching `--ignore` patterns.
- Files listed in `--compare` manifests (used for comparison only).
- Extensions not matching the `-e` filter (if specified).

### Preserving Metadata

By default, only file contents are copied. To preserve timestamps and permissions:

```bash
dedupecopy -p /source -c /destination --copy-metadata
```

This uses Python's `shutil.copy2()` which preserves:
- Modification time
- Access time
- File mode (permissions)

**Note:** Not all metadata may transfer across different file systems.

## Output Files

### CSV Duplicate Report

Format: `Collision #, MD5, Path, Size (bytes), mtime`

```csv
Src: ['/source/path']
Collision #, MD5, Path, Size (bytes), mtime
1, d41d8cd98f00b204e9800998ecf8427e, '/path/file1.jpg', 1024, 1633024800.0
1, d41d8cd98f00b204e9800998ecf8427e, '/path/file2.jpg', 1024, 1633024800.0
2, a3d5c12f8b9e4a1c2d3e4f5a6b7c8d9e, '/path/doc1.pdf', 2048, 1633111200.0
2, a3d5c12f8b9e4a1c2d3e4f5a6b7c8d9e, '/path/doc2.pdf', 2048, 1633111200.0
```

### Manifest Files

Binary database files (not human-readable):
- `manifest.db` - MD5 hashes and file metadata
- `manifest.db.read` - List of processed file paths

These enable resuming and incremental operations.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the Simplified BSD License.

## Project Links

- **GitHub**: https://github.com/othererik/dedupe_copy
- **PyPI**: https://pypi.org/project/DedupeCopy/

## Author

Erik Schweller (othererik@gmail.com)

---

**Status**: Tested and seems to work, but use with caution and always backup important data!
