Extract, audit, repair, and structure PDFs automatically so your AI systems always receive clean data.
pip install pdfmux
The problem is not the LLM. The problem is document ingestion: broken column ordering, missing pages, OCR failures, flattened tables, slide decks that return empty text. Most tools extract once and hope for the best.
    Page 1: (ok)
    Page 2: (ok)
    Page 3:
    Page 4: Amoun Dscriptin $450 Consltng Widgt $1200 Setp
    Page 5: (ok)

No quality info. No way to know which pages are broken.
    Page 1: good 0.98
    Page 2: good 0.96
    Page 3: bad → OCR'd 0.91
    Page 4: bad → OCR'd 0.87
    Page 5: good 0.97
    ✓ 5 pages, 94% avg confidence
    2 re-extracted with OCR
pdfmux doesn't extract once and hope. It runs a self-healing pipeline that audits every page, detects failures, and re-extracts them automatically.
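The audit-and-repair loop described above can be sketched roughly as follows. This is a minimal illustration, not pdfmux's actual implementation: `score_page`, `audit_and_repair`, the 0.8 threshold, and the fallback callable are all hypothetical, and the scoring heuristic (share of words found in a small vocabulary) is only a stand-in for whatever quality checks the library really runs.

```python
from typing import Callable

# Hypothetical threshold: pages scoring below it are treated as failed.
THRESHOLD = 0.8


def score_page(text: str, vocabulary: set[str]) -> float:
    """Toy quality score: fraction of alphabetic words found in a vocabulary.

    Empty pages score 0.0 so they are always flagged.
    """
    words = [w.lower() for w in text.split() if w.isalpha()]
    if not words:
        return 0.0
    known = sum(1 for w in words if w in vocabulary)
    return known / len(words)


def audit_and_repair(
    pages: list[str],
    fallback: Callable[[int], str],
    vocabulary: set[str],
) -> list[dict]:
    """Score every page; re-extract failing pages with the fallback extractor."""
    report = []
    for index, text in enumerate(pages):
        score = score_page(text, vocabulary)
        repaired = False
        if score < THRESHOLD:
            candidate = fallback(index)
            candidate_score = score_page(candidate, vocabulary)
            # Keep the re-extraction only if it actually scores higher.
            if candidate_score > score:
                text, score, repaired = candidate, candidate_score, True
        report.append(
            {"page": index + 1, "text": text, "score": score, "repaired": repaired}
        )
    return report


# Made-up data: a clean page, an empty page, and a garbled page.
vocab = {"amount", "description", "consulting", "widget", "setup", "total"}
first_pass = ["Amount Description Consulting", "", "Amoun Dscriptin Consltng Widgt Setp"]
ocr_results = {1: "Total Amount", 2: "Amount Description Consulting Widget Setup"}

report = audit_and_repair(first_pass, lambda i: ocr_results[i], vocab)
for row in report:
    print(row["page"], round(row["score"], 2), "repaired" if row["repaired"] else "kept")
```

The design point worth noting is that the fallback result is audited too, and only replaces the first pass when it scores higher, so a bad re-extraction cannot make a page worse.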
    import pdfmux

    # extract as markdown (auto-audits every page)
    text = pdfmux.extract_text("report.pdf")

    # structured json with locked schema
    data = pdfmux.extract_json("report.pdf")

    # LLM-ready chunks with token estimates
    chunks = pdfmux.load_llm_context("report.pdf")
    # → [{title, text, page_start, page_end, tokens, confidence}]
pdfmux outputs structured content designed for RAG pipelines, vector databases, agent workflows, and knowledge retrieval systems.
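As one illustration of how such chunks might feed a RAG pipeline, the sketch below filters chunks by confidence and packs them into a token budget. It does not call pdfmux: the sample list only mimics the chunk fields the library is described as returning (title, text, page_start, page_end, tokens, confidence), and `select_chunks`, the 0.85 cutoff, and the 300-token budget are hypothetical choices.

```python
def select_chunks(chunks, min_confidence=0.85, token_budget=300):
    """Keep confident chunks, highest confidence first, until the budget is used."""
    selected, used = [], 0
    confident = [c for c in chunks if c["confidence"] >= min_confidence]
    for chunk in sorted(confident, key=lambda c: c["confidence"], reverse=True):
        if used + chunk["tokens"] <= token_budget:
            selected.append(chunk)
            used += chunk["tokens"]
    # Restore document order so the assembled context reads naturally.
    selected.sort(key=lambda c: c["page_start"])
    return selected, used


# Made-up sample data shaped like the chunk dictionaries described above.
sample_chunks = [
    {"title": "Summary", "text": "Quarterly summary.", "page_start": 1,
     "page_end": 1, "tokens": 120, "confidence": 0.98},
    {"title": "Revenue", "text": "Revenue breakdown.", "page_start": 2,
     "page_end": 3, "tokens": 150, "confidence": 0.91},
    {"title": "Appendix scan", "text": "Scanned appendix.", "page_start": 4,
     "page_end": 4, "tokens": 80, "confidence": 0.62},
    {"title": "Outlook", "text": "Next quarter outlook.", "page_start": 5,
     "page_end": 5, "tokens": 90, "confidence": 0.97},
]

context, tokens_used = select_chunks(sample_chunks)
print([c["title"] for c in context], tokens_used)  # ['Summary', 'Outlook'] 210
```

Here the low-confidence scan is dropped outright, and "Revenue" is skipped because adding it would exceed the budget.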
    # convert: auto-audits every page
    $ pdfmux invoice.pdf
    ✓ invoice.md (2 pages, 95% confidence)

    # image-heavy pdf: bad pages re-extracted with OCR
    $ pdfmux pitch-deck.pdf
    ✓ pitch-deck.md (12 pages, 85% confidence, 6 OCR'd)

    # quick triage: per-page quality without extraction
    $ pdfmux analyze report.pdf

    # structured json output
    $ pdfmux convert report.pdf --format json

    # LLM-ready format with token estimates
    $ pdfmux convert report.pdf --format llm

    # start MCP server for AI agents
    $ pdfmux serve
pdfmux picks the best extractor per page automatically. You can also choose manually.
| Extractor | Handles | Speed | Install |
|---|---|---|---|
| PyMuPDF | Digital text | 0.01s/pg | base |
| RapidOCR | Scanned / images | 0.5-2s/pg | pdfmux[ocr] |
| Docling | Tables | 0.3-3s/pg | pdfmux[tables] |
| Gemini Flash | Complex layouts | 2-5s/pg | pdfmux[llm] |
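
The speed column gives a rough way to budget processing time. The sketch below multiplies page counts by the per-page ranges in the table. The `SPEED_RANGES` dictionary copies those figures (PyMuPDF's single value is used as both bounds), while `estimate_seconds` and the 6/6 page split are illustrative assumptions. The estimate ignores auditing overhead and any first-pass extraction of pages that are later re-extracted.

```python
# Seconds per page as (low, high), copied from the extractor table above.
SPEED_RANGES = {
    "PyMuPDF": (0.01, 0.01),
    "RapidOCR": (0.5, 2.0),
    "Docling": (0.3, 3.0),
    "Gemini Flash": (2.0, 5.0),
}


def estimate_seconds(pages_per_extractor):
    """Return (low, high) total seconds for a mapping of extractor -> page count."""
    low = sum(SPEED_RANGES[name][0] * n for name, n in pages_per_extractor.items())
    high = sum(SPEED_RANGES[name][1] * n for name, n in pages_per_extractor.items())
    return low, high


# Hypothetical 12-page deck: 6 digital-text pages, 6 pages needing OCR.
low, high = estimate_seconds({"PyMuPDF": 6, "RapidOCR": 6})
print(f"{low:.2f}s to {high:.2f}s")  # prints 3.06s to 12.06s
```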
For ingestion systems like pdfmux, what matters most is semantic chunk accuracy: correct text, in the right order, with reliable chunk boundaries for RAG.
| Document Type | Text Extraction | Layout Recovery | Table Extraction |
|---|---|---|---|
| Simple text PDFs | 99–100% | 95–98% | N/A |
| Academic papers | 97–99% | 90–95% | 80–90% |
| Business reports | 96–98% | 90–94% | 75–88% |
| Slide decks | 95–98% | 88–92% | 60–75% |
| Financial filings | 95–97% | 85–92% | 70–85% |
| Scanned PDFs | 85–95% | 75–88% | 60–75% |
| Legal contracts | 97–99% | 92–96% | 80–90% |
| Forms / gov docs | 90–96% | 80–90% | 65–80% |
Aggregate across a mixed dataset:
| Metric | Expected Range |
|---|---|
| Text extraction accuracy | 96–99% |
| Layout recovery accuracy | 88–95% |
| Table extraction accuracy | 70–88% |
| OCR document accuracy | 85–94% |
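
The tables do not say how these percentages were measured. One common way to score text extraction against a ground-truth transcript is a character-level similarity ratio, sketched below with Python's standard `difflib.SequenceMatcher`. The function name `text_accuracy` and the sample strings are hypothetical, and this metric is an assumption rather than the method behind the figures above.

```python
from difflib import SequenceMatcher


def text_accuracy(extracted: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means the strings are identical."""
    # Normalise whitespace so line-wrapping differences are not counted as errors.
    a = " ".join(extracted.split())
    b = " ".join(reference.split())
    return SequenceMatcher(None, a, b).ratio()


reference = "Amount Description"
clean = text_accuracy("Amount  Description", reference)  # identical after normalisation
garbled = text_accuracy("Amoun Dscriptin", reference)    # dropped characters score lower
print(f"clean={clean:.2f} garbled={garbled:.2f}")
```

A character-level ratio rewards correct text but says little about reading order or table structure, which is presumably why layout recovery and table extraction are reported as separate columns.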
Give your AI agent the ability to read PDFs. Three tools: convert_pdf for extraction, analyze_pdf for quick triage, batch_convert for directories.
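Many MCP clients register servers through a JSON file containing an `mcpServers` map. Assuming your client follows that convention and that `pdfmux` is on your PATH, generating an entry for the `pdfmux serve` command might look like the sketch below. The key name `"pdfmux"` and the helper `mcp_entry` are illustrative; check your client's documentation for the exact file location and schema.

```python
import json


def mcp_entry(name: str, command: str, args: list[str]) -> dict:
    """Build a client config fragment in the common `mcpServers` layout."""
    return {"mcpServers": {name: {"command": command, "args": args}}}


# Assumes the server is started with `pdfmux serve`, as in the CLI examples.
config = mcp_entry("pdfmux", "pdfmux", ["serve"])
print(json.dumps(config, indent=2))
```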
    # LangChain
    from pdfmux.integrations.langchain import PDFMuxLoader

    loader = PDFMuxLoader("report.pdf")
    docs = loader.load()  # → list[Document]

    # LlamaIndex
    from pdfmux.integrations.llamaindex import PDFMuxReader

    reader = PDFMuxReader()
    docs = reader.load_data("report.pdf")  # → list[Document]
    # add OCR for scanned pages (~200MB, CPU-only)
    $ pip install "pdfmux[ocr]"

    # add table extraction
    $ pip install "pdfmux[tables]"

    # add Gemini Flash for complex layouts
    $ pip install "pdfmux[llm]"

    # add LangChain or LlamaIndex loader
    $ pip install "pdfmux[langchain]"
    $ pip install "pdfmux[llamaindex]"

    # install everything
    $ pip install "pdfmux[all]"
… pip install "pdfmux[ocr,tables]".

… [llm] extra. There's no GPU requirement.

Implement the BaseExtractor interface with an extract() method and register it. The router will include it in per-page quality comparisons automatically. See the docs for the full extractor development guide.

Install pdfmux[langchain] or pdfmux[llamaindex] and use the native loader classes. They return standard Document objects with confidence metadata attached.