PDF extraction that checks its own work

Every page extracted. Every page audited. Bad pages re-extracted with OCR automatically. Three lines of Python to production-grade PDF text for your LLM pipeline.

pip install pdfmux
open source · python 3.11+ · v1.0 stable · MIT licensed

The problem

You drop a PDF into your LLM pipeline and get garbage. Scanned invoices produce empty text. Pitch decks with charts extract as "Figure 1". Multi-column papers read in the wrong order. You build extraction logic, then debugging logic, then fallback logic...

pdfmux handles all of that in one call. It extracts every page, scores each one for quality, and re-extracts the failures with OCR — so you get clean text without writing a single fallback.

How it works

1. Fast extract: PyMuPDF runs on every page (about 0.01 s per page)
2. Audit: five quality checks classify each page as good, bad, or empty
3. Region OCR: targeted OCR on the image regions of bad pages
4. Full OCR: remaining empty pages are re-extracted in full
5. Merge: good and repaired pages are combined in their original order

All pages good? Zero OCR overhead. You only pay for what's broken. Parallel OCR with budget controls keeps costs predictable.
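
The five steps above can be sketched as a small control loop. This is only an illustration of the multi-pass idea, not pdfmux's actual implementation: `run_pipeline`, the callables it takes, and the stub extractors in the demo are all hypothetical names invented for this example.

```python
from typing import Callable


def run_pipeline(
    num_pages: int,
    fast_extract: Callable[[int], str],
    audit: Callable[[str], str],  # returns "good", "bad", or "empty"
    region_ocr: Callable[[int, str], str],
    full_ocr: Callable[[int], str],
) -> tuple[list[str], list[int]]:
    """Return page texts in order, plus the indices of pages that needed OCR."""
    pages: list[str] = []
    ocr_pages: list[int] = []
    for page_no in range(num_pages):
        text = fast_extract(page_no)          # step 1: fast pass on every page
        verdict = audit(text)                 # step 2: classify the result
        if verdict == "bad":
            text = region_ocr(page_no, text)  # step 3: OCR only the image regions
            ocr_pages.append(page_no)
        elif verdict == "empty":
            text = full_ocr(page_no)          # step 4: OCR the whole page
            ocr_pages.append(page_no)
        pages.append(text)                    # step 5: merge in page order
    return pages, ocr_pages


# Demo with stub extractors (hypothetical data, no real PDF involved).
raw = ["Quarterly revenue grew 12%.", "Figure 1", ""]


def demo_audit(text: str) -> str:
    if not text.strip():
        return "empty"
    return "good" if len(text.split()) >= 3 else "bad"


pages, ocr_pages = run_pipeline(
    num_pages=len(raw),
    fast_extract=lambda i: raw[i],
    audit=demo_audit,
    region_ocr=lambda i, text: text + " [region OCR text]",
    full_ocr=lambda i: "[full-page OCR text]",
)
```

Good pages never reach an OCR branch, which is where the "zero OCR overhead" property comes from in this sketch.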

Three lines to production

python
import pdfmux

# extract as markdown — auto-audits every page
text = pdfmux.extract_text("report.pdf")

# structured json with locked schema
data = pdfmux.extract_json("report.pdf")

# LLM-ready chunks with token estimates
chunks = pdfmux.load_llm_context("report.pdf")
# → [{title, text, page_start, page_end, tokens, confidence}]
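
Once you have chunks with the fields shown above, a typical next step is to drop low-confidence chunks and fit the rest into a context window. The sketch below assumes each chunk behaves like a dict with those keys and that `confidence` is a fraction between 0 and 1; neither is confirmed here, and `select_chunks` and the sample data are hypothetical.

```python
def select_chunks(chunks, min_confidence=0.8, token_budget=4000):
    """Keep trusted chunks, in document order, until the token budget is spent."""
    selected, used = [], 0
    for chunk in chunks:
        # Assumption: confidence is a 0-1 fraction, not a percentage.
        if chunk["confidence"] < min_confidence:
            continue  # skip chunks from pages the audit did not trust
        if used + chunk["tokens"] > token_budget:
            break  # stop before overflowing the context window
        selected.append(chunk)
        used += chunk["tokens"]
    return selected, used


# Made-up sample data using the field names from the example above.
sample = [
    {"title": "Summary", "text": "...", "page_start": 1, "page_end": 1,
     "tokens": 300, "confidence": 0.97},
    {"title": "Charts", "text": "...", "page_start": 2, "page_end": 3,
     "tokens": 500, "confidence": 0.55},
    {"title": "Outlook", "text": "...", "page_start": 4, "page_end": 5,
     "tokens": 400, "confidence": 0.91},
]
selected, used = select_chunks(sample, min_confidence=0.8, token_budget=1000)
```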

Built for production

Self-healing extraction
Multi-pass pipeline with region OCR. Bad pages get surgical image extraction. Empty pages get full OCR. No manual fallbacks.
Typed & stable API
Frozen dataclasses, structured error codes, locked JSON schema. API frozen for 1.x — your code won't break on updates.
Per-page confidence
5 quality checks: character density, alphabetic ratio, word structure, whitespace, mojibake detection. Know exactly which pages to trust.
MCP server
3 tools for AI agents: convert, analyze, batch. Give Claude or Cursor the ability to read any PDF with confidence scores.
LangChain & LlamaIndex
Drop-in document loaders for both frameworks. PDFMuxLoader and PDFMuxReader with chunk metadata built in.
Security hardened
File size limits, page count caps, configurable timeouts. Safe for processing untrusted PDFs in production pipelines.
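
To make the five per-page checks listed under "Per-page confidence" concrete, here is a rough sketch of what such an audit could look like. The check names come from the list above; the thresholds, the marker list, and the `audit_page` function are invented for illustration and are not pdfmux's real cutoffs or API.

```python
import re

# Assumption: a few byte sequences that commonly appear when UTF-8 is mis-decoded.
MOJIBAKE_MARKERS = ("Ã", "â€", "\ufffd")


def audit_page(text: str, min_chars: int = 50) -> dict:
    """Classify a page as good / bad / empty using five illustrative checks."""
    stripped = text.strip()
    if not stripped:
        return {"verdict": "empty", "failed": ["character_density"]}

    words = stripped.split()
    non_space = [c for c in stripped if not c.isspace()]
    checks = {
        # Enough characters to plausibly be a real page of text.
        "character_density": len(non_space) >= min_chars,
        # Mostly letters rather than digits, symbols, or debris.
        "alphabetic_ratio": sum(c.isalpha() for c in non_space) / len(non_space) >= 0.6,
        # Average word length in a plausible range.
        "word_structure": 2 <= sum(len(w) for w in words) / len(words) <= 15,
        # Not dominated by whitespace.
        "whitespace": len(re.findall(r"\s", stripped)) / len(stripped) <= 0.5,
        # No tell-tale mis-decoding artifacts.
        "mojibake": not any(m in stripped for m in MOJIBAKE_MARKERS),
    }
    failed = [name for name, ok in checks.items() if not ok]
    return {"verdict": "bad" if failed else "good", "failed": failed}


result = audit_page("Figure 1")
```

Returning the list of failed checks alongside the verdict makes it easier to see why a page was flagged, which is the point of per-page confidence.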

Command line

bash
# convert — auto-audits every page
$ pdfmux invoice.pdf
✓ invoice.md (2 pages, 95% confidence)

# image-heavy pdf — bad pages re-extracted with OCR
$ pdfmux pitch-deck.pdf
✓ pitch-deck.md (12 pages, 85% confidence, 6 OCR'd)

# LLM-ready chunked json
$ pdfmux report.pdf -f llm

# quick triage — per-page quality without extraction
$ pdfmux analyze report.pdf

# start MCP server for AI agents
$ pdfmux serve

Extractors

| Extractor | Handles | Speed | Install |
|---|---|---|---|
| PyMuPDF | Digital text | 0.01 s/page | base |
| RapidOCR | Scanned / images | 0.5-2 s/page | pdfmux[ocr] |
| Docling | Tables | 0.3-3 s/page | pdfmux[tables] |
| Gemini Flash | Complex layouts | 2-5 s/page | pdfmux[llm] |
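
The per-page speeds in the table allow a back-of-envelope time estimate. The sketch below assumes pages are processed one after another and that the listed ranges hold for your documents; parallel OCR would lower the wall-clock time. `estimate_seconds` and the dictionary keys are hypothetical names, not part of pdfmux.

```python
# (fastest, slowest) seconds per page, copied from the table above.
SPEED_RANGE_S = {
    "pymupdf": (0.01, 0.01),
    "rapidocr": (0.5, 2.0),
    "docling": (0.3, 3.0),
    "gemini_flash": (2.0, 5.0),
}


def estimate_seconds(total_pages: int, reprocessed: dict[str, int]) -> tuple[float, float]:
    """Estimate (best, worst) sequential time: fast pass on all pages plus re-extraction."""
    low = high = total_pages * SPEED_RANGE_S["pymupdf"][0]
    for extractor, pages in reprocessed.items():
        fast, slow = SPEED_RANGE_S[extractor]
        low += pages * fast
        high += pages * slow
    return round(low, 2), round(high, 2)


# A 12-page deck where 6 pages fall back to OCR.
low, high = estimate_seconds(12, {"rapidocr": 6})
```

For that 12-page deck the estimate is 12 × 0.01 + 6 × 0.5 = 3.12 s at best and 12 × 0.01 + 6 × 2 = 12.12 s at worst, so nearly all of the cost comes from the pages that need OCR.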

MCP server

Give your AI agent the ability to read PDFs. Three tools: convert_pdf for extraction, analyze_pdf for quick triage, batch_convert for directories.

{
  "mcpServers": {
    "pdfmux": {
      "command": "pdfmux",
      "args": ["serve"]
    }
  }
}
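
If your MCP client already has a config file with other servers, you may prefer to merge the pdfmux entry in rather than paste over it. A minimal sketch: the config file location depends on the client and is not specified here, so it is passed in as a path, and `register_pdfmux` is a hypothetical helper.

```python
import json
from pathlib import Path

# The server entry from the snippet above.
PDFMUX_SERVER = {"command": "pdfmux", "args": ["serve"]}


def register_pdfmux(config_path: Path) -> dict:
    """Add the pdfmux entry to an MCP client config, keeping any other servers."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config.setdefault("mcpServers", {})["pdfmux"] = PDFMUX_SERVER
    config_path.write_text(json.dumps(config, indent=2))
    return config
```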

Framework integrations

python
# LangChain
from pdfmux.integrations.langchain import PDFMuxLoader

loader = PDFMuxLoader("report.pdf")
docs = loader.load()  # → list[Document]

# LlamaIndex
from pdfmux.integrations.llamaindex import PDFMuxReader

reader = PDFMuxReader()
docs = reader.load_data("report.pdf")  # → list[Document]

Optional extras

bash
# add OCR for scanned pages (~200MB, CPU-only)
$ pip install "pdfmux[ocr]"

# add table extraction
$ pip install "pdfmux[tables]"

# add Gemini Flash for complex layouts
$ pip install "pdfmux[llm]"

# add LangChain or LlamaIndex loader
$ pip install "pdfmux[langchain]"
$ pip install "pdfmux[llamaindex]"

# install everything
$ pip install "pdfmux[all]"