Metadata-Version: 2.4
Name: pdfalive
Version: 0.2.0
Summary: A Python library and CLI tool that uses LLMs to enhance PDF files
Author: Adam Ever-Hadani
License-Expression: MIT
Keywords: pdf,llm,toc,bookmarks,cli
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Text Processing :: General
Requires-Python: <3.14,>=3.13
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.3.1
Requires-Dist: langchain>=1.1.2
Requires-Dist: pydantic>=2.12.5
Requires-Dist: pymupdf>=1.26.6
Requires-Dist: tenacity>=9.1.0
Requires-Dist: rich>=14.0.0
Provides-Extra: anthropic
Requires-Dist: langchain-anthropic>=1.2.0; extra == "anthropic"
Provides-Extra: openai
Requires-Dist: langchain-openai>=0.3.0; extra == "openai"
Provides-Extra: ollama
Requires-Dist: ollama; extra == "ollama"
Requires-Dist: langchain-ollama>=0.3.0; extra == "ollama"
Dynamic: license-file

# pdfalive

--------------------------------------------------------------------------------


[![CI](https://github.com/promptromp/pdfalive/actions/workflows/ci.yml/badge.svg)](https://github.com/promptromp/pdfalive/actions/workflows/ci.yml)
[![GitHub License](https://img.shields.io/github/license/promptromp/pdfalive)](https://github.com/promptromp/pdfalive/blob/main/LICENSE)
[![PyPI - Version](https://img.shields.io/pypi/v/pdfalive)](https://pypi.org/project/pdfalive/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pdfalive)](https://pypi.org/project/pdfalive/)

A Python library and set of CLI tools to bring PDF files alive with the magic of LLMs.

Features:

* Automatically generate a Table of Contents, stored as PDF bookmarks, for a PDF file using LLMs. Supports arbitrarily large files through intelligent batching.
* Automatically detect whether OCR is needed to extract text from raster data. If so, OCR is performed via the Tesseract OCR library.
* Choose which LLM to use from any vendor. Local models are also supported via `ollama`. Retry logic is included for handling rate limits.
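
To give a feel for the batching mentioned in the feature list, here is a minimal sketch of splitting a long document into overlapping page batches so that each LLM call stays within a context budget. It is an illustration only, not pdfalive's actual implementation: the function name `batch_pages`, the batch size, and the overlap are all hypothetical.

```python
from typing import Iterator


def batch_pages(page_count: int, batch_size: int = 50, overlap: int = 2) -> Iterator[range]:
    """Yield ranges of 0-based page indices, each at most `batch_size` long.

    Consecutive batches share `overlap` pages so that a heading close to a
    batch boundary is still seen with some surrounding context.
    (Hypothetical helper; the defaults are arbitrary.)
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if not 0 <= overlap < batch_size:
        raise ValueError("overlap must be in [0, batch_size)")

    start = 0
    while start < page_count:
        end = min(start + batch_size, page_count)
        yield range(start, end)
        if end == page_count:
            break
        # Step back by `overlap` pages before starting the next batch.
        start = end - overlap


batches = list(batch_pages(page_count=120, batch_size=50, overlap=2))
print([(b.start, b.stop) for b in batches])  # [(0, 50), (48, 98), (96, 120)]
```

The overlap is a design choice rather than a requirement: without it, a section title on the last page of one batch could be separated from the body text that follows it.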

## Installation

The [tesseract](https://github.com/tesseract-ocr/tesseract) library is required for OCR, which is used for PDFs whose text cannot be extracted directly. On macOS, you can install it via Homebrew:

	brew install tesseract

You can then install the pdfalive package, for example via pip:

	pip install pdfalive


## Usage

To use the CLI tools described below, you can install the Python package (`pip install pdfalive`) or run the CLI directly using [uvx](https://docs.astral.sh/uv/guides/tools/):

	uvx pdfalive generate-toc input.pdf output.pdf

More detailed examples of the CLI sub-commands are provided below.
You can also pass `--help` to the main command or to any of the sub-commands to see the options they support.

### generate-toc

Automatically generate a clickable Table of Contents (using PDF bookmarks) for a PDF file by extracting features from the PDF and then calling an LLM to infer the section names and their pages from those features.

Example usage:

	pdfalive generate-toc input.pdf output.pdf
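
For readers curious what such bookmarks look like as data, the sketch below normalizes a list of inferred sections into the `[level, title, page]` rows that PyMuPDF's `Document.set_toc` accepts (1-based pages, first level equal to 1, levels deepening by at most one step at a time). This is an assumption-laden illustration, not pdfalive's internal code: `TocEntry`, `normalize_toc`, and `write_bookmarks` are hypothetical names, and whether pdfalive writes bookmarks this way is not confirmed here.

```python
from dataclasses import dataclass


@dataclass
class TocEntry:
    """Hypothetical record for one section inferred by the LLM."""

    level: int  # 1 = chapter, 2 = section, ...
    title: str
    page: int  # 1-based page number


def normalize_toc(entries: list[TocEntry], page_count: int) -> list[list]:
    """Return [level, title, page] rows that satisfy the outline rules."""
    rows: list[list] = []
    previous_level = 0
    for entry in entries:
        title = entry.title.strip()
        if not title or not 1 <= entry.page <= page_count:
            continue  # drop empty titles and out-of-range pages
        # Clamp the level so it starts at 1 and deepens by at most one step.
        level = max(1, min(entry.level, previous_level + 1))
        rows.append([level, title, entry.page])
        previous_level = level
    return rows


def write_bookmarks(src: str, dst: str, rows: list[list]) -> None:
    """Write the rows as PDF bookmarks (requires the pymupdf package)."""
    import pymupdf  # imported lazily so the helpers above need only the stdlib

    doc = pymupdf.open(src)
    try:
        doc.set_toc(rows)
        doc.save(dst)
    finally:
        doc.close()


entries = [
    TocEntry(2, "Preface", 1),
    TocEntry(1, "Introduction", 5),
    TocEntry(3, " Background ", 7),
    TocEntry(2, "Missing", 999),
]
print(normalize_toc(entries, page_count=20))
# [[1, 'Preface', 1], [1, 'Introduction', 5], [2, 'Background', 7]]
```

Clamping levels rather than rejecting the whole list is one possible choice; it keeps a slightly inconsistent LLM answer usable instead of failing the run.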

By default, we use the latest available OpenAI ChatGPT model (run with `--help` to see which model is the current default), but you can change this by setting the model at invocation:

	pdfalive generate-toc --model-identifier 'claude-sonnet-4-5' input.pdf output.pdf

Model names should match the identifiers used by [LangChain](https://www.langchain.com/), which generally line up with the names used by the various provider APIs themselves.

Note that to use Anthropic models you need to set your API key via the `ANTHROPIC_API_KEY` environment variable. Similar mechanisms apply to OpenAI (`OPENAI_API_KEY`) and other vendors.


## Development

We use `uv` to manage the library. To install it locally, you can run, for example:

	uv sync
	uv pip install -e .

We use `ruff` for formatting and linting, `mypy` for static type checking, and `pytest` for running unit tests. We also use [pre-commit](https://pre-commit.com/) to help ensure high-quality commits.
