Every page extracted. Every page audited. Bad pages re-extracted with OCR automatically. Three lines of Python to production-grade PDF text for your LLM pipeline.
pip install pdfmux
You drop a PDF into your LLM pipeline and get garbage. Scanned invoices produce empty text. Pitch decks with charts extract as "Figure 1". Multi-column papers read in the wrong order. You build extraction logic, then debugging logic, then fallback logic...
pdfmux handles all of that in one call. It extracts every page, scores each one for quality, and re-extracts the failures with OCR — so you get clean text without writing a single fallback.
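The extract, score, re-extract loop described above can be pictured with a small sketch. Nothing here is pdfmux's actual implementation: `quality_score`, `extract_with_fallback`, the 0.5 threshold, and the stand-in page dictionaries are all hypothetical, chosen only to show the shape of the pattern.

```python
from typing import Callable, Dict, List


def quality_score(text: str) -> float:
    """Hypothetical heuristic (not pdfmux's scorer): alphanumeric density,
    scaled down for pages with fewer than 50 characters."""
    stripped = text.strip()
    if not stripped:
        return 0.0
    density = sum(ch.isalnum() for ch in stripped) / len(stripped)
    length_factor = min(len(stripped) / 50, 1.0)
    return density * length_factor


def extract_with_fallback(
    pages: List[Dict[str, str]],
    fast: Callable[[Dict[str, str]], str],
    ocr: Callable[[Dict[str, str]], str],
    threshold: float = 0.5,  # assumed cut-off, purely illustrative
) -> List[Dict[str, object]]:
    """Run the fast extractor on every page, then retry low-scoring pages with OCR."""
    results = []
    for number, page in enumerate(pages, start=1):
        text = fast(page)
        used_ocr = False
        if quality_score(text) < threshold:
            candidate = ocr(page)
            # Keep the OCR output only if it actually scores better.
            if quality_score(candidate) > quality_score(text):
                text, used_ocr = candidate, True
        results.append({"page": number, "text": text, "ocr": used_ocr})
    return results


# Stand-in pages: one with a text layer, one scanned (empty text layer).
pages = [
    {"text_layer": "Invoice total: 1,250.00 USD due within thirty days of receipt.",
     "image_text": ""},
    {"text_layer": "",
     "image_text": "Scanned receipt: two items, paid in cash at the front desk."},
]
results = extract_with_fallback(
    pages,
    fast=lambda p: p["text_layer"],   # stands in for a digital-text extractor
    ocr=lambda p: p["image_text"],    # stands in for an OCR pass
)
```

In this sketch only the second page falls below the threshold, so only that page pays the cost of the slower extractor.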
    import pdfmux

    # extract as markdown, auto-audits every page
    text = pdfmux.extract_text("report.pdf")

    # structured JSON with locked schema
    data = pdfmux.extract_json("report.pdf")

    # LLM-ready chunks with token estimates
    chunks = pdfmux.load_llm_context("report.pdf")
    # → [{title, text, page_start, page_end, tokens, confidence}]
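Once you have chunks in the shape shown above, a common next step is to pack them into a prompt under a token budget. The following is a minimal sketch that works on plain dictionaries with the `tokens` and `confidence` keys; `select_chunks`, the sample data, and the assumption that `confidence` is a float between 0 and 1 are illustrative and not part of pdfmux.

```python
from typing import Dict, List


def select_chunks(
    chunks: List[Dict],
    token_budget: int,
    min_confidence: float = 0.8,  # assumes confidence is reported on a 0-1 scale
) -> List[Dict]:
    """Keep confident chunks in document order until the token budget is used up."""
    selected = []
    used = 0
    for chunk in chunks:
        if chunk["confidence"] < min_confidence:
            continue  # skip low-confidence chunks
        if used + chunk["tokens"] > token_budget:
            break  # stop at the first chunk that would exceed the budget
        selected.append(chunk)
        used += chunk["tokens"]
    return selected


# Hypothetical chunks mimicking the documented keys.
sample_chunks = [
    {"title": "Summary", "text": "...", "page_start": 1, "page_end": 1,
     "tokens": 400, "confidence": 0.95},
    {"title": "Chart page", "text": "...", "page_start": 2, "page_end": 2,
     "tokens": 300, "confidence": 0.40},
    {"title": "Financials", "text": "...", "page_start": 3, "page_end": 4,
     "tokens": 500, "confidence": 0.90},
]
context = select_chunks(sample_chunks, token_budget=1000)
titles = [c["title"] for c in context]
```

With the sample data, the low-confidence chart page is dropped and the remaining two chunks fit within the 1,000-token budget.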
PDFMuxLoader (LangChain) and PDFMuxReader (LlamaIndex) ship with chunk metadata built in.

    # convert: auto-audits every page
    $ pdfmux invoice.pdf
    ✓ invoice.md (2 pages, 95% confidence)

    # image-heavy PDF: bad pages re-extracted with OCR
    $ pdfmux pitch-deck.pdf
    ✓ pitch-deck.md (12 pages, 85% confidence, 6 OCR'd)

    # LLM-ready chunked JSON
    $ pdfmux report.pdf -f llm

    # quick triage: per-page quality without extraction
    $ pdfmux analyze report.pdf

    # start MCP server for AI agents
    $ pdfmux serve
| Extractor | Handles | Speed | Install |
|---|---|---|---|
| PyMuPDF | Digital text | 0.01s/pg | base |
| RapidOCR | Scanned / images | 0.5-2s/pg | pdfmux[ocr] |
| Docling | Tables | 0.3-3s/pg | pdfmux[tables] |
| Gemini Flash | Complex layouts | 2-5s/pg | pdfmux[llm] |
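The per-page speeds in the table make it easy to estimate a rough processing-time range for a document. The sketch below is back-of-the-envelope arithmetic only; `estimate_seconds` is a hypothetical helper, and it assumes that a re-extracted page pays for both the initial PyMuPDF pass and the fallback extractor.

```python
from typing import Dict, Tuple

# (low, high) seconds per page, copied from the table above.
SECONDS_PER_PAGE: Dict[str, Tuple[float, float]] = {
    "PyMuPDF": (0.01, 0.01),
    "RapidOCR": (0.5, 2.0),
    "Docling": (0.3, 3.0),
    "Gemini Flash": (2.0, 5.0),
}


def estimate_seconds(pages_per_extractor: Dict[str, int]) -> Tuple[float, float]:
    """Sum the low and high per-page costs for the given page counts."""
    low = sum(SECONDS_PER_PAGE[name][0] * count
              for name, count in pages_per_extractor.items())
    high = sum(SECONDS_PER_PAGE[name][1] * count
               for name, count in pages_per_extractor.items())
    return low, high


# A 12-page deck where 6 pages are re-extracted with OCR (assumed workload):
# every page gets the fast pass, and 6 of them also get the OCR pass.
low, high = estimate_seconds({"PyMuPDF": 12, "RapidOCR": 6})
```

Under these assumptions the deck takes somewhere between about 3 and 12 seconds, and almost all of that time is spent on the OCR pages.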
Give your AI agent the ability to read PDFs. Three tools: convert_pdf for extraction, analyze_pdf for quick triage, batch_convert for directories.
    # LangChain
    from pdfmux.integrations.langchain import PDFMuxLoader

    loader = PDFMuxLoader("report.pdf")
    docs = loader.load()  # → list[Document]

    # LlamaIndex
    from pdfmux.integrations.llamaindex import PDFMuxReader

    reader = PDFMuxReader()
    docs = reader.load_data("report.pdf")  # → list[Document]
    # add OCR for scanned pages (~200MB, CPU-only)
    $ pip install "pdfmux[ocr]"

    # add table extraction
    $ pip install "pdfmux[tables]"

    # add Gemini Flash for complex layouts
    $ pip install "pdfmux[llm]"

    # add LangChain or LlamaIndex loader
    $ pip install "pdfmux[langchain]"
    $ pip install "pdfmux[llamaindex]"

    # install everything
    $ pip install "pdfmux[all]"