pdfmux

@nameet

PDF extraction that checks its own work.

Python API + CLI + MCP server. Built for LLM pipelines.

v0.4.0
$ pip install pdfmux

Python API

import pdfmux

# extract text as markdown
text = pdfmux.extract_text("report.pdf")

# structured dict with locked schema
data = pdfmux.extract_json("report.pdf")
# data["schema_version"]  →  "0.4.0"
# data["ocr_pages"]       →  [2, 5, 8]

# LLM-ready chunks with token estimates
chunks = pdfmux.load_llm_context("report.pdf")
# [{title, text, page_start, page_end, tokens, confidence}]
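One way to use these chunks is to drop the low-confidence ones and pack the rest into a prompt budget. The sketch below works on plain dicts with the fields listed above. `select_chunks`, `min_confidence`, and `max_tokens` are hypothetical names, not part of pdfmux, and the sketch assumes `confidence` is a float between 0 and 1 and `tokens` is an integer.

```python
from typing import Any


def select_chunks(
    chunks: list[dict[str, Any]],
    max_tokens: int,
    min_confidence: float = 0.8,
) -> list[dict[str, Any]]:
    """Keep confident chunks, in document order, until the budget is spent.

    Hypothetical helper, not part of pdfmux. Assumes each chunk carries the
    fields shown above, with `confidence` as a float in [0, 1].
    """
    selected = []
    used = 0
    for chunk in chunks:
        if chunk["confidence"] < min_confidence:
            continue  # skip chunks the extraction was unsure about
        if used + chunk["tokens"] > max_tokens:
            break  # stop at the first chunk that would overflow the budget
        selected.append(chunk)
        used += chunk["tokens"]
    return selected


# Stand-in data shaped like the list described above (not real pdfmux output)
example = [
    {"title": "Intro", "text": "...", "page_start": 1, "page_end": 1,
     "tokens": 120, "confidence": 0.97},
    {"title": "Scan", "text": "...", "page_start": 2, "page_end": 2,
     "tokens": 300, "confidence": 0.55},
    {"title": "Results", "text": "...", "page_start": 3, "page_end": 4,
     "tokens": 400, "confidence": 0.91},
]
picked = select_chunks(example, max_tokens=600)
print([c["title"] for c in picked])
```

Stopping at the first overflow keeps the selected text contiguous. Skipping the oversized chunk and continuing would fit more tokens at the cost of gaps in the document.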

CLI

# convert a pdf — auto-audits every page
$ pdfmux invoice.pdf
✓ invoice.pdf → invoice.md (2 pages, 95% confidence)

# image-heavy pdf — bad pages re-extracted with ocr
$ pdfmux pitch-deck.pdf
✓ pitch-deck.md (12 pages, 85% confidence, 6 pages OCR'd)

# llm-ready chunked json with token estimates
$ pdfmux report.pdf -f llm

# per-page extraction breakdown
$ pdfmux analyze report.pdf

# batch convert a directory
$ pdfmux ./docs/ -o ./output/

# start mcp server for ai agents
$ pdfmux serve

How it works

Extract every page. Audit quality. Re-extract the bad ones.

1. Fast extract — PyMuPDF on every page (instant)
2. Audit — classify each page: good / bad / empty
3. Re-extract — OCR only the bad pages (RapidOCR)
4. Merge — combine good + OCR'd pages in order

All pages good? Zero OCR overhead. You only pay for what's broken.
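The four steps can be sketched in a few lines. This is an illustration of the extract / audit / re-extract / merge idea, not pdfmux's implementation: `classify`, `extract_with_audit`, the 20-character threshold, and the stand-in extractors are all assumptions made for the example.

```python
from typing import Callable

# An extractor maps a 1-based page number to that page's text
Extractor = Callable[[int], str]


def classify(text: str, min_chars: int = 20) -> str:
    """Toy audit: label a page 'good', 'bad', or 'empty' by text length.

    The threshold is an illustrative assumption, not pdfmux's actual rule.
    """
    stripped = text.strip()
    if not stripped:
        return "empty"
    return "good" if len(stripped) >= min_chars else "bad"


def extract_with_audit(
    page_count: int, fast: Extractor, ocr: Extractor
) -> tuple[str, list[int]]:
    """Run the fast extractor everywhere and OCR only the pages audited as bad."""
    pages: dict[int, str] = {}
    ocr_pages: list[int] = []
    for n in range(1, page_count + 1):
        text = fast(n)                  # 1. fast extract
        if classify(text) == "bad":     # 2. audit
            text = ocr(n)               # 3. re-extract the bad page only
            ocr_pages.append(n)
        pages[n] = text
    # 4. merge non-empty pages in page order
    merged = "\n\n".join(pages[n] for n in sorted(pages) if pages[n].strip())
    return merged, ocr_pages


# Stand-in extractors backed by dicts, so the sketch runs without any PDF
fast_text = {1: "A full paragraph of digital text.", 2: "x#", 3: ""}
ocr_text = {2: "Text recovered from the scanned page."}
merged, ocr_pages = extract_with_audit(3, fast_text.__getitem__, ocr_text.__getitem__)
print(ocr_pages)
```

The OCR callable is only invoked for pages that fail the audit, so a document whose pages all classify as good never touches it.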

Output formats

$ pdfmux report.pdf              # → markdown (default)
$ pdfmux report.pdf -f json       # → structured json + metadata
$ pdfmux report.pdf -f llm        # → chunked json + token estimates
$ pdfmux data.pdf   -f csv        # → tables as csv

Stats

Digital PDFs: 0.01s/page
OCR install size: ~200MB
Cost (90% of PDFs): $0
Tests passing: 85

Extractors

Extractor      Handles            Speed        Install
PyMuPDF        Digital text       0.01s/pg     Base
RapidOCR       Scanned / images   0.5-2s/pg    pdfmux[ocr]
Docling        Tables             0.3-3s/pg    pdfmux[tables]
Gemini Flash   Complex layouts    2-5s/pg      pdfmux[llm]
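The per-page speeds above give a rough way to estimate how long a document will take. The sketch below assumes the PyMuPDF pass runs on every page and each re-extracted page adds its extractor's cost on top. `SPEED`, its lowercase keys, and `estimate_seconds` are hypothetical names used only for this arithmetic.

```python
# Per-page speed ranges in seconds, copied from the extractor table above.
# The dictionary keys are labels invented for this example.
SPEED = {
    "pymupdf": (0.01, 0.01),
    "rapidocr": (0.5, 2.0),
    "docling": (0.3, 3.0),
    "gemini_flash": (2.0, 5.0),
}


def estimate_seconds(
    total_pages: int, reextracted: dict[str, int]
) -> tuple[float, float]:
    """Return a (low, high) estimate in seconds.

    Assumes the fast pass touches every page and each re-extracted page
    adds the cost of the extractor that handles it.
    """
    low = high = total_pages * SPEED["pymupdf"][0]
    for name, pages in reextracted.items():
        lo, hi = SPEED[name]
        low += pages * lo
        high += pages * hi
    return round(low, 2), round(high, 2)


# The 12-page pitch deck from the CLI example, with 6 pages sent to OCR
print(estimate_seconds(12, {"rapidocr": 6}))  # (3.12, 12.12)
```

For that deck the fast pass contributes 12 × 0.01 = 0.12s, and the six OCR pages contribute between 3s and 12s, so nearly all of the time goes to the pages that needed re-extraction.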

MCP Server

Give your AI agent the ability to read PDFs:

{ "mcpServers": { "pdfmux": { "command": "pdfmux", "args": ["serve"] } } }

Agents receive confidence scores + warnings when extraction is limited.
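If your MCP client already has a config file with other servers, the entry above can be merged in rather than pasted by hand. This is a minimal sketch: `add_pdfmux_server` is a hypothetical helper, and where the config file lives depends on the client, so the path here is a temporary stand-in.

```python
import json
import tempfile
from pathlib import Path


def add_pdfmux_server(config_path: Path) -> dict:
    """Merge the pdfmux entry into an MCP client config, keeping other servers.

    Hypothetical helper; the config location depends on your MCP client.
    """
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers["pdfmux"] = {"command": "pdfmux", "args": ["serve"]}
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Demonstrate against a throwaway file that already lists another server
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "mcp_config.json"  # stand-in path, not a real client location
    path.write_text(json.dumps({"mcpServers": {"other": {"command": "other", "args": []}}}))
    result = add_pdfmux_server(path)
    written = json.loads(path.read_text())

print(sorted(result["mcpServers"]))
```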

Optional extras

$ pip install "pdfmux[ocr]"     # RapidOCR (~200MB, CPU)
$ pip install "pdfmux[tables]"  # Docling
$ pip install "pdfmux[llm]"     # Gemini Flash
$ pip install "pdfmux[all]"     # everything