Metadata-Version: 2.4
Name: duplifinder
Version: 8.0.0
Summary: Detect duplicate Python definitions, text patterns, and token similarities for codebase maintainability.
Author-email: Dhruv <dhruv13x@gmail.com>
Maintainer-email: Dhruv <dhruv13x@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/dhruv13x/duplifinder
Project-URL: Repository, https://github.com/dhruv13x/duplifinder.git
Project-URL: Issues, https://github.com/dhruv13x/duplifinder/issues
Project-URL: Changelog, https://github.com/dhruv13x/duplifinder/blob/main/CHANGELOG.md
Project-URL: Release Notes, https://github.com/dhruv13x/duplifinder/releases
Keywords: code-duplicates,python-ast,refactoring,code-analysis,static-analysis,ci-cd
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Code Generators
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Typing :: Typed
Requires-Python: >=3.12
Description-Content-Type: text/markdown
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Requires-Dist: tqdm>=4.66
Requires-Dist: pydantic>=2.5
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: hypothesis>=6.100.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-mock>=3.10.0; extra == "dev"
Requires-Dist: pytest-timeout>=2.3.0; extra == "dev"
Requires-Dist: ruff>=0.6.0; extra == "dev"
Requires-Dist: black>=24.3.0; extra == "dev"
Requires-Dist: mypy>=1.11.0; extra == "dev"
Requires-Dist: types-PyYAML>=6.0.1; extra == "dev"
Requires-Dist: trio>=0.25.0; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=7.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=2.0; extra == "docs"

<div align="center">
  <img src="https://raw.githubusercontent.com/dhruv13x/duplifinder/main/duplifinder_logo.png" alt="duplifinder logo" width="200"/>
</div>

<div align="center">

<!-- Package Info -->
[![PyPI version](https://img.shields.io/pypi/v/duplifinder.svg)](https://pypi.org/project/duplifinder/)
[![Python](https://img.shields.io/badge/python-3.8%2B-blue.svg)](https://www.python.org/)
![Wheel](https://img.shields.io/pypi/wheel/duplifinder.svg)
[![Release](https://img.shields.io/badge/release-PyPI-blue)](https://pypi.org/project/duplifinder/)

<!-- Build & Quality -->
[![Build status](https://github.com/dhruv13x/duplifinder/actions/workflows/publish.yml/badge.svg)](https://github.com/dhruv13x/duplifinder/actions/workflows/publish.yml)
[![Codecov](https://codecov.io/gh/dhruv13x/duplifinder/graph/badge.svg)](https://codecov.io/gh/dhruv13x/duplifinder)
[![Test Coverage](https://img.shields.io/badge/coverage-90%25%2B-brightgreen.svg)](https://github.com/dhruv13x/duplifinder/actions/workflows/test.yml)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Ruff](https://img.shields.io/badge/linting-ruff-yellow.svg)](https://github.com/astral-sh/ruff)
![Security](https://img.shields.io/badge/security-CodeQL-blue.svg)

<!-- Usage -->
![Downloads](https://img.shields.io/pypi/dm/duplifinder.svg)
![OS](https://img.shields.io/badge/os-Linux%20%7C%20macOS%20%7C%20Windows-blue.svg)
[![Python Versions](https://img.shields.io/pypi/pyversions/duplifinder.svg)](https://pypi.org/project/duplifinder/)

<!-- License -->
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

<!-- Docs -->
[![Docs](https://img.shields.io/badge/docs-latest-brightgreen.svg)](https://your-docs-link)

</div>

# Duplifinder

**Detect and refactor duplicate Python code**—classes, functions, async defs, text patterns, and token similarities—for cleaner, more maintainable codebases.

Duplifinder leverages Python's AST for precise scanning, parallelizes for large repos, and integrates seamlessly into CI/CD pipelines to enforce DRY principles and catch regressions early.

## Table of Contents

- [Features](#features)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Configuration](#configuration)
- [CLI Reference](#cli-reference)
- [Output Formats](#output-formats)
- [Advanced Usage](#advanced-usage)
- [Best Practices](#best-practices)
- [Development](#development)
- [Contributing](#contributing)
- [Troubleshooting](#troubleshooting)
- [License](#license)

## Features

- **AST-Powered Detection**: Identifies duplicates in `ClassDef`, `FunctionDef`, `AsyncFunctionDef` (including class methods as `ClassName.method`).
- **Text Pattern Matching**: Regex-based search for arbitrary snippets (e.g., TODOs, FIXMEs).
- **Token Similarity**: Detect near-duplicates via normalized token diffs (e.g., similar function bodies).
- **Search Mode**: Locate all occurrences of specific definitions (singletons or multiples).
- **Parallel Processing**: Threaded or multiprocessing for monorepos (GIL-aware; configurable workers).
- **Rich Outputs**: Human-readable console (with Rich tables), JSON for automation.
- **Configurable Filtering**: Glob excludes, regex name filters, ignore dirs.
- **Audit Logging**: Opt-in JSONL trails for file access/compliance (v6.1.0+).
- **CI-Friendly**: Exit codes for fails, dup thresholds, and metrics export.

## Installation

```bash
# From PyPI (stable)
pip install duplifinder

# From GitHub (latest)
pip install git+https://github.com/dhruv13x/duplifinder.git@main

# Editable dev install
git clone https://github.com/dhruv13x/duplifinder.git
cd duplifinder
pip install -e ".[dev]"
```

Requires Python 3.12+. No additional setup for core usage; dev extras include pytest/mypy/black.

## Quick Start

Scan the current directory for duplicates (defaults to classes/functions/async functions):

```bash
# Basic scan with console output
duplifinder .

# With previews and verbose logs
duplifinder . --preview --verbose

# JSON for CI parsing
duplifinder . --json > duplicates.json

# Fail CI on duplicates
duplifinder . --fail
```

### Text Pattern Mode

Hunt duplicated lines via regex:

```bash
# Find TODOs with min 2 occurrences
duplifinder . --pattern-regex "TODO:" --min 2 --json
```

### Token Similarity (Near-Dups)

For similar but not identical code:

```bash
duplifinder . --token-mode --similarity-threshold 0.85 --preview
```

### Search Specific Definitions

List all occurrences (even singles):

```bash
# Singleton check for a class
duplifinder . -s class UIManager --preview

# Multiple specs
duplifinder . -s class UIManager -s def dashboard_menu --json
```

### Focused Scans

```bash
# Only classes
duplifinder . -f class

# Specific name
duplifinder . -f MyClass

# Types + names
duplifinder . -f class -f HelperFunc
```

## Configuration

CLI flags override a `.duplifinder.yaml` (specified via `--config`).

```yaml
# .duplifinder.yaml
root: .
ignore: ".git,__pycache__,venv"
exclude_patterns: "*.pyc,tests/*"
exclude_names: "^_.*,experimental_.*"
find: ["class", "def"]  # Default: all types
find_regex: ["class UI.*Manager"]
pattern_regex: ["TODO:"]
search: ["class UIManager"]  # For -s mode
json: false
fail: false
min: 2
verbose: true
parallel: true
use_multiprocessing: false  # For CPU-bound (avoids GIL)
max_workers: 8
preview: true
audit: true  # v6.1.0+: Enable file access trails
audit_log: "./logs/audit.jsonl"
similarity_threshold: 0.8
dup_threshold: 0.1  # Alert if >10% dup rate
```

Load with `duplifinder . --config .duplifinder.yaml`.

## CLI Reference

| Flag | Description | Default |
|------|-------------|---------|
| `<root>` | Scan root(s) (dirs/files) | `.` |
| `--config <path>` | YAML config file | None |
| `--ignore <dirs>` | Comma-separated ignores (e.g., `.git,build`) | Built-ins |
| `--exclude-patterns <globs>` | File globs to skip (e.g., `*.pyc`) | None |
| `--exclude-names <regexes>` | Name regexes to filter (e.g., `^_`) | None |
| `-f, --find <items>` | Types/names (e.g., `class`, `MyClass`) | All types |
| `--find-regex <patterns>` | Regex for types/names | None |
| `--pattern-regex <patterns>` | Text mode regexes | None |
| `-s, --search <specs>` | Search specs (e.g., `class Foo`) | None |
| `--token-mode` | Enable token similarity | False |
| `--similarity-threshold <float>` | Token match ratio (0-1) | 0.8 |
| `--dup-threshold <float>` | Alert if dup rate > this | 0.1 |
| `--min <int>` | Min occurrences for dup | 2 |
| `-p, --preview` | Show snippets | False |
| `--json` | JSON output | False |
| `--fail` | Exit 1 on dups | False |
| `--verbose` | Detailed logs | False |
| `--parallel` | Concurrent processing | False |
| `--use-multiprocessing` | Use processes (not threads) | False |
| `--max-workers <int>` | Worker count | CPU cores |
| `--audit` | Enable audit logging | False |
| `--audit-log <path>` | Audit JSONL path | `.duplifinder_audit.jsonl` |
| `--version` | Show version | N/A |

Run `duplifinder --help` for full details.

## Output Formats

### Console (Rich-Enhanced)

Formatted tables with counts and previews:

```
class MyClass (3 occurrences):
┌─────────────────────┬──────────────────────────────────────┐
│ Location            │ Snippet                              │
├─────────────────────┼──────────────────────────────────────┤
│ /src/a.py:10        │ 10 class MyClass:                    │
│ /src/b.py:15        │ 15   def init(self):                 │
│ /src/c.py:8         │ 8     pass                           │
└─────────────────────┴──────────────────────────────────────┘
```

No dups: `[green]No duplicates found.[/green]`

Alerts: `[red]ALERT: Dup rate 15% > 10% threshold[/red]`

Audit nudge (if enabled): `[dim green]Audit trail logged to ./audit.jsonl[/dim green]`

### JSON (Machine-Readable)

```json
{
  "generated_at": "2025-11-01T13:29:00Z",
  "root": "/path/to/repo",
  "scanned_files": 123,
  "skipped_files": ["tests/broken.py"],
  "duplicate_count": 2,
  "duplicates": {
    "class MyClass": [
      {"loc": "/src/a.py:10", "snippet": "class MyClass:\n  pass", "type": "class"}
    ]
  }
}
```

For search: Includes `is_singleton`, `count`, `occurrences`.

## Advanced Usage

### Parallelism for Monorepos

For 10k+ files, enable multiprocessing to bypass GIL:

```bash
duplifinder . --parallel --use-multiprocessing --max-workers 16
```

Benchmark: ~2x speedup on CPU-bound token mode; monitor via `--verbose`.

### Audit Logging (v6.1.0+)

Opt-in JSONL trails for compliance (file access, skips, durations):

```bash
duplifinder . --audit --audit-log ./logs/audit.jsonl --verbose
```

Sample entry:
```json
{"timestamp": "2025-11-01T13:29:00Z", "event_type": "file_parsed", "path": "/src/a.py", "action": "ast_success", "bytes_read": 1024, "lines": 50}
```

Query with `jq`: `jq '.event_type == "scan_completed"' audit.jsonl` for SLOs.

### CI Integration

Gate merges on zero dups:

```yaml
# .github/workflows/ci.yml
- name: Check Duplicates
  run: duplifinder . --fail --json --min 2
```

Threshold alerts:
```bash
duplifinder . --dup-threshold 0.05 --fail  # Fail if >5% dup rate
```

## Best Practices

- **Start Simple**: `--preview --min 2 --verbose` for initial runs; tune excludes iteratively.
- **CI Gating**: `--json --fail` for regressions; parse output for PR comments.
- **Monorepo Scaling**: Use `--use-multiprocessing` for token/AST; cap `--max-workers` at 2x cores.
- **False Positive Tuning**: `--exclude-names "^test_.*"` for fixtures; `--exclude-patterns "migrations/*"`.
- **Compliance**: Enable `--audit` in prod scans; rotate logs via cron.

## Development

Install dev deps:
```bash
pip install -e ".[dev]"
```

Run tests:
```bash
pytest  # With coverage: pytest --cov=src/duplifinder --cov-fail-under=90
```

Lint/format:
```bash
black .
mypy src/duplifinder  # Strict mode
```

Local run:
```bash
duplifinder . --preview
```

Build/release:
```bash
python -m build
twine upload dist/*
```

## Contributing

1. **Issue First**: Describe bug/feature with repro steps or use case.
2. **Branch & PR**: Fork → `feat/your-feature` → PR with rationale/tests.
3. **Standards**: Follow PEP 8 (black), type hints (mypy), 90%+ coverage.
4. **Small Changes**: <300 LOC per PR; link ADRs for architecture.

See [CONTRIBUTING.md](CONTRIBUTING.md) for details.

## Troubleshooting

- **Syntax Errors**: Skipped files logged in `skipped_files` (JSON/verbose); fix or exclude.
- **Encoding Issues**: UTF-8 assumed; add `# coding: utf-8` or use `--exclude-patterns`.
- **High Skip Rate**: >10% triggers exit 3; tune ignores/excludes.
- **Perf Bottlenecks**: Profile with `--max-workers=1`; switch to multiprocessing for AST/token.
- **False Positives**: Layer `--exclude-names` regexes; test with `--preview`.

## License

MIT License. See [LICENSE](LICENSE) for details.

---

*Built with ❤️ for Python devs. Questions? [Open an issue](https://github.com/dhruv13x/duplifinder/issues).*
```
