Metadata-Version: 2.4
Name: chou
Version: 0.1.1
Summary: Academic paper PDF renaming tool - 学术论文PDF重命名工具
Author: cycleuser
License-Expression: MIT
Project-URL: Homepage, https://github.com/cycleuser/Chou
Project-URL: Repository, https://github.com/cycleuser/Chou
Project-URL: Issues, https://github.com/cycleuser/Chou/issues
Keywords: academic,paper,pdf,rename,citation
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Utilities
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: PyMuPDF>=1.23.0
Provides-Extra: gui
Requires-Dist: PySide6>=6.5.0; extra == "gui"
Provides-Extra: test
Requires-Dist: pytest>=7.0; extra == "test"
Provides-Extra: ocr-surya
Requires-Dist: surya-ocr>=0.14.0; extra == "ocr-surya"
Provides-Extra: ocr-paddle
Requires-Dist: paddleocr>=2.7.0; extra == "ocr-paddle"
Requires-Dist: paddlepaddle>=2.5.0; extra == "ocr-paddle"
Provides-Extra: ocr-rapid
Requires-Dist: rapidocr-onnxruntime>=1.3.0; extra == "ocr-rapid"
Provides-Extra: ocr-easy
Requires-Dist: easyocr>=1.7.0; extra == "ocr-easy"
Provides-Extra: ocr-tesseract
Requires-Dist: pytesseract>=0.3.10; extra == "ocr-tesseract"
Requires-Dist: Pillow>=9.0; extra == "ocr-tesseract"
Dynamic: license-file

# Chou (瞅) - Academic Paper PDF Renamer

A Python tool to automatically rename academic PDF papers to citation-style filenames by extracting title, author, and year information from the PDF content.

## Features

- Extracts title and authors from PDF first page using font size analysis
- **OCR support** for scanned PDFs (5 OCR backends available)
- Extracts publication year using 10 different strategies (supports English and Chinese)
- **Chinese name handling** - automatically uses full names for Chinese authors
- **Chinese thesis/dissertation support** - detects labeled fields like "论文题目", "作者姓名"
- Multiple author format options
- Dry-run mode for safe preview
- Handles special characters and Unicode in author names
- Logs all operations and exports results to CSV

## Requirements

- Python >= 3.10
- PyMuPDF (required)
- OCR backend (optional, for scanned PDFs)

## Installation

### From PyPI

```bash
pip install chou
```

### From Source

```bash
git clone https://github.com/cycleuser/Chou.git
cd Chou
pip install -e .
```

### With OCR Support

Choose one or more OCR backends based on your needs:

```bash
# Install with all OCR backends
pip install -e ".[ocr-surya,ocr-paddle,ocr-rapid,ocr-easy,ocr-tesseract]"

# Or install specific backends:
pip install surya-ocr          # Surya - Best accuracy, transformer-based (recommended)
pip install paddleocr paddlepaddle  # PaddleOCR - Good for Chinese
pip install rapidocr-onnxruntime    # RapidOCR - Lightweight, fast
pip install easyocr                 # EasyOCR - Easy to use
pip install pytesseract Pillow      # Tesseract - Classic OCR
```

## Quick Start

After installation, the `chou` command is available:

```bash
# Preview changes (dry-run mode, default)
chou --dir /path/to/papers --dry-run

# Actually rename files
chou --dir /path/to/papers --execute

# Show version
chou --version
```

## Usage

```bash
chou [options]
```

### Options

| Option | Short | Description |
|--------|-------|-------------|
| `--dir DIR` | `-d` | Directory containing PDF files (default: current) |
| `--dry-run` | `-n` | Preview without renaming (default: True) |
| `--execute` | `-x` | Actually rename files |
| `--format FMT` | `-f` | Author name format (see below) |
| `--num-authors N` | `-N` | Number of authors for n_* formats (default: 3) |
| `--recursive` | `-r` | Process subdirectories recursively (default: True) |
| `--no-recursive` | | Only process the specified directory |
| `--ocr-engine` | | Specify OCR engine (default: auto-detect) |
| `--no-ocr` | | Disable OCR fallback |
| `--output FILE` | `-o` | Export results to CSV file |
| `--log-file FILE` | `-l` | Log file path |
| `--verbose` | `-v` | Verbose output |

### Author Format Options (`-f`)

| Format | Example Output |
|--------|----------------|
| `first_surname` | `Wang et al. (2023) - Title.pdf` |
| `first_full` | `Weihao Wang et al. (2023) - Title.pdf` |
| `all_surnames` | `Wang, Zhang, You (2023) - Title.pdf` |
| `all_full` | `Weihao Wang, Rufeng Zhang, Mingyu You (2023) - Title.pdf` |
| `n_surnames` | `Wang, Zhang et al. (2023) - Title.pdf` |
| `n_full` | `Weihao Wang, Rufeng Zhang et al. (2023) - Title.pdf` |

**Note:** For Chinese authors, full names are always used (e.g., `张三` instead of just `张`) since single-character surnames are not meaningful.

### Examples

```bash
# Use first author's full name
chou -d /path/to/papers -f first_full --dry-run

# Use first 2 authors' surnames
chou -d /path/to/papers -f n_surnames -N 2 --dry-run

# Process and export results
chou -d /path/to/papers --execute -o results.csv

# Use specific OCR engine
chou -d /path/to/papers --ocr-engine rapidocr --dry-run

# Disable OCR
chou -d /path/to/papers --no-ocr --dry-run
```

## OCR Support

For scanned PDFs without embedded text, the tool automatically uses OCR. Available backends (in priority order):

| Backend | Install Command | Notes |
|---------|-----------------|-------|
| Surya | `pip install surya-ocr` | Best accuracy, transformer-based |
| PaddleOCR | `pip install paddleocr paddlepaddle` | Good for Chinese |
| RapidOCR | `pip install rapidocr-onnxruntime` | Lightweight, fast |
| EasyOCR | `pip install easyocr` | Easy to use |
| Tesseract | `pip install pytesseract Pillow` | Classic OCR |

The tool automatically selects the best available backend. To disable a specific backend:

```bash
# Disable Surya OCR (e.g., on low-memory systems)
export CHOU_DISABLE_SURYA=1
chou --dry-run
```

## Year Extraction Strategies

The tool uses 10 strategies to extract publication year, ranked by confidence:

1. **Conference + year** (100): `CVPR 2023`, `NeurIPS'22`, `AAAI-23`
2. **Ordinal edition** (90): `Thirty-Seventh AAAI Conference`
3. **Copyright notice** (85): `Copyright 2023`, `(c) 2023`
4. **Publication date** (80): `Published: 2023`, `Accepted: Jan 2023`
5. **Chinese year** (78): `2023年`, `二〇二三年`
6. **arXiv ID** (75): `arXiv:2301.12345`
7. **DOI with year** (75): `10.1109/CVPR.2023.xxx`
8. **Journal volume** (70): `Vol. 35, 2023`
9. **Date pattern** (60-65): `March 2023`, `2023/03`
10. **Frequent year** (20-50): Most common year in text

## Supported Conferences

AAAI, IJCAI, NeurIPS, ICML, ICLR, CVPR, ICCV, ECCV, ACL, EMNLP, NAACL, SIGIR, KDD, WWW, CHI, USENIX, and 50+ more.

## Project Structure

```
Chou/
├── chou/                  # Main package
│   ├── core/             # Core functionality
│   │   ├── processor.py       # PDF processing
│   │   ├── ocr_extractor.py   # OCR backends
│   │   ├── author_parser.py   # Author name parsing
│   │   ├── year_parser.py     # Year extraction
│   │   └── filename_gen.py    # Filename generation
│   ├── cli/              # Command-line interface
│   └── gui/              # GUI (optional)
├── tests/                # pytest tests
├── requirements.txt      # Dependencies
├── pyproject.toml        # Package configuration
├── README.md             # This file
└── README_CN.md          # Chinese documentation
```

## GUI (Optional)

A graphical user interface is available:

```bash
pip install chou[gui]
chou-gui
```

### Screenshots

**1. Initial Window** - Drag & drop PDFs or use toolbar to add files:

![Initial Window](images/01_initial.png)

**2. After Processing** - Extracted title, authors, year with preview of new filenames:

![Processed Results](images/02_processed.png)

**3. Renamed Files** - Files renamed to citation-style format in file manager:

![Renamed Files](images/03-results.png)

## Development

```bash
# Install development dependencies
pip install -e ".[test]"

# Run tests
pytest

# Run with verbose output
pytest -v
```

## Python API

```python
from chou import rename_papers

result = rename_papers(
    "./papers",
    author_format="first_surname",
    dry_run=True,
)
print(result.success)    # True / False
print(result.data)       # list of paper dicts
print(result.metadata)   # summary stats
```

## Agent Integration (OpenAI Function Calling)

Chou exposes an OpenAI-compatible tool for LLM agents:

```python
from chou.tools import TOOLS, dispatch

# Pass TOOLS to the OpenAI chat completion API
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=TOOLS,
)

# Dispatch the tool call
result = dispatch(
    tool_call.function.name,
    tool_call.function.arguments,
)
```

## CLI Help

![CLI Help](images/chou_help.png)

## License

MIT License
