Reliable document ingestion for LLM pipelines

Extract, audit, repair, and structure PDFs automatically so your AI systems always receive clean data.

MIT licensed · v1.0 stable
pip install pdfmux

Most RAG pipelines fail before they reach the model

The problem is not the LLM. The problem is document ingestion: broken column ordering, missing pages, OCR failures, flattened tables, and slide decks that return empty text. Most tools extract once and hope for the best.

typical single-tool output
Page 1: (ok)
Page 2: (ok)
Page 3: 
Page 4: Amoun  Dscriptin
        $450   Consltng    Widgt
        $1200  Setp
Page 5: (ok)
No quality info. No way to know which pages are broken.

pdfmux output
Page 1: good 0.98
Page 2: good 0.96
Page 3: bad → OCR'd 0.91
Page 4: bad → OCR'd 0.87
Page 5: good 0.97
✓ 5 pages, 94% avg confidence
  2 re-extracted with OCR
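
The summary line in the sample output can be checked by hand. A minimal sketch, assuming the reported figure is a plain arithmetic mean of the per-page scores (the source does not say how pdfmux aggregates them):

```python
# Per-page confidence scores copied from the sample output above.
page_scores = [0.98, 0.96, 0.91, 0.87, 0.97]

# Assumption: the summary is an unweighted mean of the page scores.
average = sum(page_scores) / len(page_scores)

summary = f"{len(page_scores)} pages, {average:.0%} avg confidence"
print(summary)  # 5 pages, 94% avg confidence
```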

How it works

pdfmux doesn't extract once and hope. It runs a self-healing pipeline that audits every page, detects failures, and re-extracts them automatically.

1. Extract — PyMuPDF on every page (instant)
2. Audit — 5 quality checks per page: good / bad / empty
3. Region OCR — surgical OCR on image regions in bad pages
4. Full OCR — re-extract remaining empty pages completely
5. Merge — combine good + fixed pages in order

All pages good? Zero OCR overhead. You only pay for what's broken. Parallel OCR with budget controls keeps costs predictable.
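
The five steps above can be sketched in plain Python. This is an illustration of the control flow only, not pdfmux's implementation: `audit`, `run_pipeline`, `PageResult`, and the fake OCR callables are hypothetical names, and the word-count heuristic stands in for the real quality checks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PageResult:
    number: int
    text: str
    verdict: str  # audit verdict for the fast pass: "good", "bad", or "empty"
    source: str   # step that produced the final text


def audit(text: str) -> str:
    """Toy audit (hypothetical heuristic, not pdfmux's checks)."""
    if not text.strip():
        return "empty"
    return "good" if len(text.split()) >= 3 else "bad"


def run_pipeline(
    fast_pages: List[str],
    region_ocr: Callable[[int], str],
    full_ocr: Callable[[int], str],
) -> List[PageResult]:
    """Audit each fast-extracted page and re-extract only the failures."""
    merged = []
    for number, text in enumerate(fast_pages, start=1):
        verdict = audit(text)
        if verdict == "bad":
            text, source = region_ocr(number), "region_ocr"
        elif verdict == "empty":
            text, source = full_ocr(number), "full_ocr"
        else:
            source = "fast"  # good pages never touch OCR
        merged.append(PageResult(number, text, verdict, source))
    return merged  # already in page order


ocr_calls = []


def fake_ocr(kind: str) -> Callable[[int], str]:
    """Build a stand-in OCR function that records when it is called."""
    def run(page_number: int) -> str:
        ocr_calls.append((kind, page_number))
        return f"[{kind} text for page {page_number}]"
    return run


fast_pages = ["Quarterly revenue grew twelve percent.", "", "Amoun Dscriptin"]
merged = run_pipeline(fast_pages, fake_ocr("region"), fake_ocr("full"))
for page in merged:
    print(page.number, page.verdict, page.source)
```

In this sketch only pages 2 and 3 trigger an OCR call, which mirrors the "pay only for what's broken" behavior described above.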

Three lines of Python

python
import pdfmux

# extract as markdown — auto-audits every page
text = pdfmux.extract_text("report.pdf")

# structured json with locked schema
data = pdfmux.extract_json("report.pdf")

# LLM-ready chunks with token estimates
chunks = pdfmux.load_llm_context("report.pdf")
# → [{title, text, page_start, page_end, tokens, confidence}]

Built for LLM pipelines

pdfmux outputs structured content designed for RAG pipelines, vector databases, agent workflows, and knowledge retrieval systems.
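
As an illustration of how a RAG pipeline might consume this output, here is a minimal sketch that keeps high-confidence chunks until a token budget is used up. The chunk fields match the shape shown for `load_llm_context` above; `select_chunks`, its thresholds, and the sample data are hypothetical and not part of pdfmux.

```python
def select_chunks(chunks, min_confidence=0.9, token_budget=1000):
    """Keep trustworthy chunks, in document order, until the budget is spent."""
    selected, used = [], 0
    for chunk in chunks:
        if chunk["confidence"] < min_confidence:
            continue  # skip chunks that came from low-confidence pages
        if used + chunk["tokens"] > token_budget:
            break  # stop at the first chunk that would overflow the budget
        selected.append(chunk)
        used += chunk["tokens"]
    return selected


# Hand-written sample data in the documented chunk shape.
chunks = [
    {"title": "Summary", "text": "Revenue grew.", "page_start": 1,
     "page_end": 1, "tokens": 400, "confidence": 0.98},
    {"title": "Scanned appendix", "text": "Amoun Dscriptin", "page_start": 2,
     "page_end": 3, "tokens": 300, "confidence": 0.72},
    {"title": "Results", "text": "Margins held.", "page_start": 4,
     "page_end": 5, "tokens": 500, "confidence": 0.95},
    {"title": "Outlook", "text": "Guidance raised.", "page_start": 6,
     "page_end": 6, "tokens": 300, "confidence": 0.97},
]

picked = select_chunks(chunks)
print([c["title"] for c in picked])  # ['Summary', 'Results']
```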

Command line

bash
# convert — auto-audits every page
$ pdfmux invoice.pdf
✓ invoice.md (2 pages, 95% confidence)

# image-heavy pdf — bad pages re-extracted with OCR
$ pdfmux pitch-deck.pdf
✓ pitch-deck.md (12 pages, 85% confidence, 6 OCR'd)

# quick triage — per-page quality without extraction
$ pdfmux analyze report.pdf

# structured json output
$ pdfmux convert report.pdf --format json

# LLM-ready format with token estimates
$ pdfmux convert report.pdf --format llm

# start MCP server for AI agents
$ pdfmux serve

Extractors

pdfmux picks the best extractor per page automatically. You can also choose manually.

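| Extractor    | Handles          | Speed          | Install        |
|--------------|------------------|----------------|----------------|
| PyMuPDF      | Digital text     | 0.01 s/page    | base           |
| RapidOCR     | Scanned / images | 0.5–2 s/page   | pdfmux[ocr]    |
| Docling      | Tables           | 0.3–3 s/page   | pdfmux[tables] |
| Gemini Flash | Complex layouts  | 2–5 s/page     | pdfmux[llm]    |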
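
Per-page selection can be pictured as a small decision function over page statistics. The sketch below is a guess at the shape of such a router, not pdfmux's actual logic: `PageStats`, `pick_extractor`, and the 0.5 image-area threshold are hypothetical, and only the extractor names come from the table above.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    text_chars: int    # characters found by the fast text pass
    image_area: float  # fraction of the page covered by images, 0.0 to 1.0
    has_table: bool    # whether a table-like structure was detected


def pick_extractor(stats: PageStats) -> str:
    """Route a page to an extractor name (hypothetical thresholds)."""
    if stats.text_chars == 0 or stats.image_area > 0.5:
        return "RapidOCR"  # scanned or image-dominated page
    if stats.has_table:
        return "Docling"   # digital page that contains a table
    return "PyMuPDF"       # plain digital text, the fastest path


print(pick_extractor(PageStats(text_chars=2000, image_area=0.0, has_table=False)))
print(pick_extractor(PageStats(text_chars=0, image_area=0.9, has_table=False)))
print(pick_extractor(PageStats(text_chars=1500, image_area=0.1, has_table=True)))
```

Ordering the checks from most to least expensive failure (no text at all, then tables, then the default) keeps the fast path as the fallback rather than the exception.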

Expected accuracy across document types

For ingestion systems like pdfmux, what matters most is semantic chunk accuracy — correct text in the right order with reliable boundaries for RAG.

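| Document Type     | Text Extraction | Layout Recovery | Table Extraction |
|-------------------|-----------------|-----------------|------------------|
| Simple text PDFs  | 99–100%         | 95–98%          | N/A              |
| Academic papers   | 97–99%          | 90–95%          | 80–90%           |
| Business reports  | 96–98%          | 90–94%          | 75–88%           |
| Slide decks       | 95–98%          | 88–92%          | 60–75%           |
| Financial filings | 95–97%          | 85–92%          | 70–85%           |
| Scanned PDFs      | 85–95%          | 75–88%          | 60–75%           |
| Legal contracts   | 97–99%          | 92–96%          | 80–90%           |
| Forms / gov docs  | 90–96%          | 80–90%          | 65–80%           |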

Aggregate across a mixed dataset:

| Metric                    | Expected Range |
|---------------------------|----------------|
| Text extraction accuracy  | 96–99%         |
| Layout recovery accuracy  | 88–95%         |
| Table extraction accuracy | 70–88%         |
| OCR document accuracy     | 85–94%         |
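
To check figures like these against your own documents, one simple option is a character-level similarity score between extracted text and a hand-verified reference. The sketch below uses the standard library's `difflib.SequenceMatcher`. It is one possible metric, chosen here for illustration; the source does not state how the ranges above were measured, and `text_accuracy` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def text_accuracy(extracted: str, reference: str) -> float:
    """Similarity ratio (0.0 to 1.0) after collapsing runs of whitespace."""
    def normalize(text: str) -> str:
        return " ".join(text.split())
    return SequenceMatcher(None, normalize(extracted), normalize(reference)).ratio()


reference = "Amount Description"
clean = text_accuracy("Amount   Description\n", reference)
garbled = text_accuracy("Amoun Dscriptin", reference)

print(f"clean:   {clean:.2f}")
print(f"garbled: {garbled:.2f}")
```

Whitespace is normalized first so that layout differences do not count as text errors. Reading order and table structure need separate checks.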

Built-in

🔄
Self-healing pipeline
Bad pages detected and re-extracted automatically. Zero manual intervention.
📊
Confidence scoring
5 quality checks per page. Know exactly which pages to trust.
⚙️
Multiple extraction backends
PyMuPDF, RapidOCR, Docling, Gemini Flash. Best extractor per page.
🧭
Per-page routing
Each page gets the right extractor based on content type and quality.
🤖
MCP server
Give AI agents PDF reading ability. Three tools, one command to start.
🔓
MIT licensed
Open source with a frozen API. Your code won't break on updates.

MCP server

Give your AI agent the ability to read PDFs. Three tools: convert_pdf for extraction, analyze_pdf for quick triage, batch_convert for directories.

{ "mcpServers": { "pdfmux": { "command": "pdfmux", "args": ["serve"] } } }
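
If your MCP client already has a configuration with other servers, the pdfmux entry needs to be merged in rather than pasted over it. A minimal sketch using only the standard library: `add_pdfmux_server` and the `other-tool` entry are hypothetical, and where the config file lives depends on your client.

```python
import json


def add_pdfmux_server(config: dict) -> dict:
    """Return a copy of an MCP client config with the pdfmux entry added."""
    servers = dict(config.get("mcpServers", {}))
    # Same entry as the snippet above: run `pdfmux serve` as the server command.
    servers["pdfmux"] = {"command": "pdfmux", "args": ["serve"]}
    return {**config, "mcpServers": servers}


# Hypothetical existing config with one unrelated server.
existing = {"mcpServers": {"other-tool": {"command": "other-tool", "args": []}}}
updated = add_pdfmux_server(existing)

print(json.dumps(updated, indent=2))
```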

LangChain & LlamaIndex

python
# LangChain
from pdfmux.integrations.langchain import PDFMuxLoader

loader = PDFMuxLoader("report.pdf")
docs = loader.load()  # → list[Document]

# LlamaIndex
from pdfmux.integrations.llamaindex import PDFMuxReader

reader = PDFMuxReader()
docs = reader.load_data("report.pdf")  # → list[Document]
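
Because the loaders attach confidence metadata, you can separate trusted documents from ones that need review before indexing. The sketch below uses a stand-in `Document` dataclass so it runs without LangChain installed. The metadata key name `"confidence"` and the helper `split_by_confidence` are assumptions, so check the key your loader actually emits.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for a loader Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_by_confidence(docs, threshold=0.9, key="confidence"):
    """Split docs into (trusted, needs_review) by a metadata score.

    Documents without the key are treated as needing review.
    """
    trusted, needs_review = [], []
    for doc in docs:
        score = doc.metadata.get(key)
        if score is not None and score >= threshold:
            trusted.append(doc)
        else:
            needs_review.append(doc)
    return trusted, needs_review


docs = [
    Document("Clean digital page", {"page": 1, "confidence": 0.98}),
    Document("OCR'd scan", {"page": 2, "confidence": 0.87}),
    Document("No score attached", {"page": 3}),
]
trusted, needs_review = split_by_confidence(docs)
print(len(trusted), len(needs_review))  # 1 2
```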

Optional extras

bash
# add OCR for scanned pages (~200MB, CPU-only)
$ pip install "pdfmux[ocr]"

# add table extraction
$ pip install "pdfmux[tables]"

# add Gemini Flash for complex layouts
$ pip install "pdfmux[llm]"

# add LangChain or LlamaIndex loader
$ pip install "pdfmux[langchain]"
$ pip install "pdfmux[llamaindex]"

# install everything
$ pip install "pdfmux[all]"

What pdfmux is not

Frequently asked questions

How is pdfmux different from just using PyMuPDF?
PyMuPDF is one of pdfmux's backends. pdfmux adds a quality audit on top: it scores every page, detects failures (blank output, mojibake, broken columns), and automatically re-extracts bad pages with a better extractor. PyMuPDF alone gives you text. pdfmux gives you reliable text.

Does it work offline?
Yes. The base install plus the OCR and tables extras all run locally with zero network calls. Only the Gemini Flash extractor requires an API key and internet. You can run a fully air-gapped pipeline with pip install "pdfmux[ocr,tables]".

What about GPU or cloud extractors?
pdfmux is CPU-only by default. RapidOCR uses ONNX Runtime (CPU). Docling runs locally on CPU. The only cloud extractor is Gemini Flash, which is optional and behind the [llm] extra. There's no GPU requirement.

Is it production ready?
v1.0 is stable with a frozen API. Dataclasses, error codes, and JSON schema are locked for the entire 1.x series. File size limits, page caps, and configurable timeouts make it safe for processing untrusted documents at scale.

How do I add a new extractor?
Implement the BaseExtractor interface with an extract() method and register it. The router will include it in per-page quality comparisons automatically. See the docs for the full extractor development guide.

Can I use it with LangChain or LlamaIndex?
Built-in. Install pdfmux[langchain] or pdfmux[llamaindex] and use the native loader classes. They return standard Document objects with confidence metadata attached.
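
On the production-readiness answer above: if you accept uploads from untrusted sources, you may also want a cheap check of your own before a file reaches any extractor. The sketch below is independent of pdfmux's built-in limits. `preflight`, the 50 MB cap, and the header check are hypothetical choices for illustration.

```python
import os
import tempfile

MAX_BYTES = 50 * 1024 * 1024  # hypothetical 50 MB cap, not a pdfmux default


def preflight(path: str, max_bytes: int = MAX_BYTES) -> int:
    """Return the file size, or raise ValueError if the file should be rejected."""
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(f"{path}: {size} bytes exceeds limit of {max_bytes}")
    with open(path, "rb") as handle:
        # PDF files start with the marker %PDF- followed by a version number.
        if handle.read(5) != b"%PDF-":
            raise ValueError(f"{path}: missing %PDF- header")
    return size


# Demonstrate with a tiny throwaway file that carries a valid header.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.pdf")
    with open(sample, "wb") as handle:
        handle.write(b"%PDF-1.7\n")
    print("accepted, bytes:", preflight(sample))
```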