PDF extraction that checks its own work.
Python API + CLI + MCP server. Built for LLM pipelines.
v0.4.0

    $ pip install pdfmux
    import pdfmux

    # extract text as markdown
    text = pdfmux.extract_text("report.pdf")

    # structured dict with locked schema
    data = pdfmux.extract_json("report.pdf")
    # data["schema_version"] → "0.4.0"
    # data["ocr_pages"] → [2, 5, 8]

    # LLM-ready chunks with token estimates
    chunks = pdfmux.load_llm_context("report.pdf")
    # [{title, text, page_start, page_end, tokens, confidence}]
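The chunk list from `load_llm_context` can be trimmed before it goes into a prompt. The following is a minimal sketch, not part of pdfmux: it assumes each chunk is a dict with the fields listed above, that `confidence` is a float between 0 and 1, and that `tokens` is an integer. The names `select_chunks`, `min_confidence`, and `max_tokens` are hypothetical, and the sample data stands in for a real call.

```python
from typing import Any


def select_chunks(
    chunks: list[dict[str, Any]],
    min_confidence: float = 0.8,
    max_tokens: int = 4000,
) -> list[dict[str, Any]]:
    """Keep chunks, in document order, that meet the confidence floor and fit the token budget."""
    selected: list[dict[str, Any]] = []
    used = 0
    for chunk in chunks:
        if chunk["confidence"] < min_confidence:
            continue  # skip chunks whose extraction looks unreliable
        if used + chunk["tokens"] > max_tokens:
            break  # stop at the first chunk that would exceed the budget
        selected.append(chunk)
        used += chunk["tokens"]
    return selected


# Stand-in data shaped like the fields above (assumed value types).
# With pdfmux installed, this list would come from pdfmux.load_llm_context("report.pdf").
sample_chunks = [
    {"title": "Summary", "text": "Revenue grew.", "page_start": 1, "page_end": 1,
     "tokens": 300, "confidence": 0.95},
    {"title": "Scanned appendix", "text": "Rev3nue gr0w", "page_start": 2, "page_end": 3,
     "tokens": 500, "confidence": 0.55},
    {"title": "Results", "text": "Detailed figures.", "page_start": 4, "page_end": 6,
     "tokens": 900, "confidence": 0.90},
]

picked = select_chunks(sample_chunks, min_confidence=0.8, max_tokens=1500)
print([c["title"] for c in picked])  # ['Summary', 'Results']
```

Stopping at the first over-budget chunk, rather than skipping it and continuing, keeps the selected context contiguous; whether that is the right trade-off depends on the pipeline.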
    # convert a pdf: auto-audits every page
    $ pdfmux invoice.pdf
    ✓ invoice.pdf → invoice.md (2 pages, 95% confidence)

    # image-heavy pdf: bad pages re-extracted with ocr
    $ pdfmux pitch-deck.pdf
    ✓ pitch-deck.md (12 pages, 85% confidence, 6 pages OCR'd)

    # llm-ready chunked json with token estimates
    $ pdfmux report.pdf -f llm

    # per-page extraction breakdown
    $ pdfmux analyze report.pdf

    # batch convert a directory
    $ pdfmux ./docs/ -o ./output/

    # start mcp server for ai agents
    $ pdfmux serve
Extract every page. Audit quality. Re-extract the bad ones.
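The extract, audit, re-extract loop can be illustrated with a toy version. This is a sketch of the general idea only, not pdfmux's actual implementation: the quality check (a minimum count of non-whitespace characters), the function names, and the stub extractors are all assumptions made for illustration.

```python
from typing import Callable

# An extractor takes a zero-based page number and returns that page's text.
Extractor = Callable[[int], str]


def audit_page(text: str, min_chars: int = 50) -> bool:
    """Toy quality check: pass if the page has enough non-whitespace characters."""
    return len("".join(text.split())) >= min_chars


def extract_with_audit(
    page_count: int, fast: Extractor, fallback: Extractor
) -> tuple[list[str], list[int]]:
    """Run the fast extractor on every page, then re-extract pages that fail the audit."""
    pages: list[str] = []
    reextracted: list[int] = []
    for page_no in range(page_count):
        text = fast(page_no)
        if not audit_page(text):
            text = fallback(page_no)  # slower path, only for bad pages
            reextracted.append(page_no)
        pages.append(text)
    return pages, reextracted


# Hypothetical stand-ins: even pages have a text layer, odd pages are scans.
def fast_stub(page_no: int) -> str:
    return "Digital text layer. " * 5 if page_no % 2 == 0 else ""


def ocr_stub(page_no: int) -> str:
    return f"OCR text for page {page_no}"


pages, reextracted = extract_with_audit(4, fast_stub, ocr_stub)
print(reextracted)  # [1, 3]
```

Only the pages that fail the audit pay the cost of the slower extractor, which is the point of auditing per page rather than per document.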
    $ pdfmux report.pdf           # → markdown (default)
    $ pdfmux report.pdf -f json   # → structured json + metadata
    $ pdfmux report.pdf -f llm    # → chunked json + token estimates
    $ pdfmux data.pdf -f csv      # → tables as csv
| Extractor | Handles | Speed | Install |
|---|---|---|---|
| PyMuPDF | Digital text | 0.01s/pg | Base |
| RapidOCR | Scanned / images | 0.5-2s/pg | pdfmux[ocr] |
| Docling | Tables | 0.3-3s/pg | pdfmux[tables] |
| Gemini Flash | Complex layouts | 2-5s/pg | pdfmux[llm] |
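The per-page speeds in the table allow a rough time estimate for a document. The sketch below only does arithmetic on the table's figures. It assumes that every page first goes through PyMuPDF and that flagged pages are then processed again by a second extractor; `estimate_seconds` and `SECONDS_PER_PAGE` are hypothetical names, not part of pdfmux.

```python
# (low, high) seconds per page, copied from the table above.
SECONDS_PER_PAGE: dict[str, tuple[float, float]] = {
    "PyMuPDF": (0.01, 0.01),
    "RapidOCR": (0.5, 2.0),
    "Docling": (0.3, 3.0),
    "Gemini Flash": (2.0, 5.0),
}


def estimate_seconds(page_counts: dict[str, int]) -> tuple[float, float]:
    """Return a (low, high) estimate in seconds given pages handled per extractor."""
    low = sum(SECONDS_PER_PAGE[name][0] * n for name, n in page_counts.items())
    high = sum(SECONDS_PER_PAGE[name][1] * n for name, n in page_counts.items())
    return low, high


# A 12-page deck where 6 pages are re-extracted with OCR.
low, high = estimate_seconds({"PyMuPDF": 12, "RapidOCR": 6})
print(f"{low:.2f}s to {high:.2f}s")
```

For this example the OCR pages dominate: the PyMuPDF pass contributes 0.12 s, while six OCR pages contribute between 3 s and 12 s.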
Give your AI agent the ability to read PDFs:
Agents receive confidence scores + warnings when extraction is limited.
    $ pip install "pdfmux[ocr]"      # RapidOCR (~200MB, CPU)
    $ pip install "pdfmux[tables]"   # Docling
    $ pip install "pdfmux[llm]"      # Gemini Flash
    $ pip install "pdfmux[all]"      # everything