Metadata-Version: 2.4
Name: pdfalive
Version: 0.7.2
Summary: A Python library and CLI tool that uses LLMs to enhance PDF files
Author: Adam Ever-Hadani
License-Expression: MIT
Keywords: pdf,llm,toc,bookmarks,cli
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Text Processing :: General
Requires-Python: <3.14,>=3.13
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.3.1
Requires-Dist: langchain>=1.1.2
Requires-Dist: langsmith>=0.4.55
Requires-Dist: pydantic>=2.12.5
Requires-Dist: pymupdf>=1.26.6
Requires-Dist: tenacity>=9.1.0
Requires-Dist: rich>=14.0.0
Provides-Extra: anthropic
Requires-Dist: langchain-anthropic>=1.2.0; extra == "anthropic"
Provides-Extra: openai
Requires-Dist: langchain-openai>=0.3.0; extra == "openai"
Provides-Extra: ollama
Requires-Dist: ollama; extra == "ollama"
Requires-Dist: langchain-ollama>=0.3.0; extra == "ollama"
Dynamic: license-file

![pdfalive logo](https://github.com/promptromp/pdfalive/raw/main/docs/assets/pdfalive.png)

--------------------------------------------------------------------------------

[![CI](https://github.com/promptromp/pdfalive/actions/workflows/ci.yml/badge.svg)](https://github.com/promptromp/pdfalive/actions/workflows/ci.yml)
[![GitHub License](https://img.shields.io/github/license/promptromp/pdfalive)](https://github.com/promptromp/pdfalive/blob/main/LICENSE)
[![PyPI - Version](https://img.shields.io/pypi/v/pdfalive)](https://pypi.org/project/pdfalive/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pdfalive)](https://pypi.org/project/pdfalive/)

A Python library and CLI toolkit that brings PDF files alive with the power of LLMs.

## Highlights

- 📑 **Automatic TOC Generation** — Generate clickable Table of Contents (PDF bookmarks) using LLM inference. Supports arbitrarily large documents with intelligent batching.
- 🔍 **Smart OCR Detection** — Automatically detects scanned PDFs and performs OCR via [Tesseract](https://github.com/tesseract-ocr/tesseract) when needed.
- 📝 **Intelligent File Renaming** — Batch rename files using natural language instructions with LLM-powered inference.
- 🤖 **Multi-Provider LLM Support** — Use any LLM provider via [LangChain](https://github.com/langchain-ai/langchain): OpenAI, Anthropic, local models via [Ollama](https://ollama.ai/), and more.
- 🔄 **Built-in Resilience** — Automatic retry logic with exponential backoff for handling API rate limits.

## Installation

[Tesseract](https://github.com/tesseract-ocr/tesseract) is required for OCR functionality. On macOS:

```bash
brew install tesseract
```

Install pdfalive via [pip](https://pip.pypa.io/):

```bash
pip install pdfalive
```

Or run directly without installation using [uvx](https://docs.astral.sh/uv/guides/tools/):

```bash
uvx pdfalive generate-toc input.pdf output.pdf
```

## Usage

Use `--help` on any command for detailed options:

```bash
pdfalive --help
pdfalive generate-toc --help
```

### generate-toc

Generate a clickable Table of Contents using PDF bookmarks. The tool extracts font and text features from the PDF and uses an LLM to intelligently identify chapter and section headings.

```bash
pdfalive generate-toc input.pdf output.pdf

# Or modify the file in place
pdfalive generate-toc --inplace input.pdf
```
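
The feature-extraction step described above can be sketched roughly as follows. This is a minimal illustration, not pdfalive's actual implementation: the function name `heading_candidates`, the `ratio` threshold, and the "most common size is body text" heuristic are all assumptions made for the example. The input is expected to have the shape PyMuPDF returns from `page.get_text("dict")`.

```python
from collections import Counter


def heading_candidates(page_dict, page_number, body_size=None, ratio=1.2):
    """Return (page_number, font_size, text) for spans noticeably larger than body text.

    `page_dict` is expected to look like the output of PyMuPDF's
    `page.get_text("dict")`: blocks -> lines -> spans, where each span
    carries "text" and "size" keys. The function name and the `ratio`
    threshold are hypothetical, chosen only for this sketch.
    """
    spans = [
        span
        for block in page_dict.get("blocks", [])
        for line in block.get("lines", [])  # image blocks have no "lines" key
        for span in line.get("spans", [])
        if span.get("text", "").strip()
    ]
    if not spans:
        return []
    if body_size is None:
        # Assumption: the most common rounded font size on the page is body text.
        body_size = Counter(round(s["size"]) for s in spans).most_common(1)[0][0]
    return [
        (page_number, s["size"], s["text"].strip())
        for s in spans
        if s["size"] >= body_size * ratio
    ]
```

With PyMuPDF installed, the per-page dictionaries could come from `pymupdf.open("input.pdf")` followed by `page.get_text("dict")` on each page. In a pipeline like the one described, candidates of this kind would be passed to the LLM, which decides which ones are real headings.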

**Choosing an LLM:**

By default, pdfalive uses an OpenAI model. Pass `--model-identifier` to use any [LangChain-supported model](https://python.langchain.com/docs/integrations/chat/):

```bash
# Use Claude
pdfalive generate-toc --model-identifier 'claude-sonnet-4-5' input.pdf output.pdf

# Use a local model via Ollama
pdfalive generate-toc --model-identifier 'ollama/llama3' input.pdf output.pdf
```

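Set the appropriate API key for your provider (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.). Provider integrations are packaged as optional extras (`anthropic`, `openai`, `ollama`), for example `pip install "pdfalive[anthropic]"`.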

**Scanned PDFs:**

OCR is enabled by default. Scanned documents without extractable text are automatically detected and processed:

```bash
# Default: OCR text layer discarded after TOC generation (preserves file size)
pdfalive generate-toc scanned.pdf output.pdf

# Include OCR text layer in output (makes PDF searchable)
pdfalive generate-toc --ocr-output scanned.pdf output.pdf

# Disable automatic OCR entirely
pdfalive generate-toc --no-ocr input.pdf output.pdf
```
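
To make the idea of "scanned document detection" concrete, here is one possible heuristic, sketched under stated assumptions. pdfalive's real detection logic is not documented here. The function name `looks_scanned` and both thresholds are hypothetical.

```python
def looks_scanned(page_texts, min_chars=20, max_text_fraction=0.1):
    """Guess whether a document is scanned, given the text extracted from each page.

    A page counts as "having text" when it yields at least `min_chars`
    non-whitespace characters. The document is treated as scanned when at
    most `max_text_fraction` of its pages have text. Both thresholds are
    illustrative assumptions, not values taken from pdfalive.
    """
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for text in page_texts if len("".join(text.split())) >= min_chars
    )
    return pages_with_text / len(page_texts) <= max_text_fraction
```

A check like this is cheap compared with OCR, so it makes sense to run it first and only invoke Tesseract when it returns `True`.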

**Postprocessing:**

For documents with a printed table of contents page, enable LLM postprocessing to refine results:

```bash
pdfalive generate-toc --postprocess input.pdf output.pdf
```

Postprocessing uses an additional LLM call to:

- Remove duplicate entries and fix typos
- Cross-reference against any printed TOC found in the document
- Add missing entries and correct page numbers

**Other options:**

| Option | Description |
|--------|-------------|
| `--inplace` | Modify the input file in place instead of creating a new output file |
| `--force` | Overwrite existing TOC if the PDF already has bookmarks |
| `--ocr-language` | Set OCR language (default: `eng`). Use [Tesseract language codes](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html) |
| `--request-delay` | Delay between LLM calls for rate limiting (default: 2s) |
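
The `--request-delay` option is a fixed pause between calls. Retrying with exponential backoff, listed under Highlights, is a separate mechanism. The sketch below shows the general technique using only the standard library. It is not pdfalive's code (the package depends on `tenacity`, which offers similar behavior), and `call_with_backoff` and its parameters are hypothetical names.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call `func()` and retry on failure, doubling the wait after each attempt."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so concurrent clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

In practice `retry_on` would be narrowed to the provider's rate-limit exception so that genuine errors fail fast instead of being retried.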

### extract-text

Extract text from scanned PDFs using OCR and save the result to a new PDF with an embedded text layer:

```bash
pdfalive extract-text input.pdf output.pdf

# Or modify the file in place
pdfalive extract-text --inplace input.pdf
```

This creates a searchable/selectable PDF without generating a TOC.

**Options:**

| Option | Description |
|--------|-------------|
| `--inplace` | Modify the input file in place instead of creating a new output file |
| `--force` | Force OCR even if the document already has text |
| `--ocr-language` | Set OCR language (default: `eng`) |
| `--ocr-dpi` | DPI resolution for OCR processing (default: 300) |

### rename

Rename files using LLM inference. The command analyzes filenames and applies renaming rules based on natural-language instructions.

```bash
pdfalive rename -q "Add 'REVIEWED_' prefix" *.pdf
```

**Custom naming formats:**

Specify the exact target format, including special characters. The LLM preserves brackets, parentheses, dashes, and other formatting:

```bash
pdfalive rename -q "[Author Last Name] - Title (Year).pdf" paper1.pdf paper2.pdf
```

**Workflow:**

1. The tool analyzes each filename and generates rename suggestions
2. A preview table shows original names, proposed names, confidence scores, and reasoning
3. Confirm or cancel the operation (unless `-y` is used)
4. Files are renamed in place
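
The preview, confirm, and rename steps above can be sketched as follows, assuming the LLM's suggestions have already been collected into a mapping. This is an illustration of the workflow only. `apply_renames`, its parameters, and the skip-on-collision behavior are hypothetical and not taken from pdfalive.

```python
from pathlib import Path


def apply_renames(suggestions, confirm=input, assume_yes=False):
    """Preview proposed renames, ask for confirmation, then rename in place.

    `suggestions` maps an existing path to its proposed new filename.
    Returns the list of new paths, or an empty list if the user cancels.
    """
    plan = [(Path(old), Path(old).with_name(new_name))
            for old, new_name in suggestions.items()]
    for old, new in plan:
        print(f"{old.name}  ->  {new.name}")
    if not assume_yes and confirm("Apply renames? [y/N] ").strip().lower() != "y":
        return []
    renamed = []
    for old, new in plan:
        if new.exists():
            # Assumption for this sketch: never overwrite an existing file.
            print(f"skipping {old.name}: {new.name} already exists")
            continue
        old.rename(new)
        renamed.append(new)
    return renamed
```

Passing `assume_yes=True` mirrors the effect of the `-y` flag: the preview is still printed, but no prompt is shown.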

**Automatic confirmation:**

```bash
pdfalive rename -q "Add sequential numbering prefix" -y *.pdf
```

**Options:**

| Option | Description |
|--------|-------------|
| `--model-identifier` | Choose which LLM to use (default: `gpt-5.1`) |
| `-y, --yes` | Automatically apply renames without confirmation |
| `--show-token-usage` | Display token usage statistics (default: enabled) |

## Development

We use [uv](https://docs.astral.sh/uv/) to manage the project:

```bash
# Install dependencies
uv sync

# Install in editable mode
uv pip install -e .
```

**Code quality tools:**

| Tool | Purpose |
|------|---------|
| [ruff](https://docs.astral.sh/ruff/) | Formatting and linting |
| [mypy](https://mypy-lang.org/) | Static type checking |
| [pytest](https://docs.pytest.org/) | Unit testing |
| [pre-commit](https://pre-commit.com/) | Git hooks for quality checks |

```bash
# Run linting and formatting
uv run ruff check .
uv run ruff format .

# Run type checking
uv run mypy pdfalive

# Run tests
uv run pytest
```

## License

pdfalive is distributed under the terms of the [MIT License](LICENSE).
