Metadata-Version: 2.4
Name: paper2md
Version: 0.1.0
Summary: Precision PDF-to-Markdown converter for research papers
Project-URL: Homepage, https://github.com/mohanjeyasankar/paper2md
Project-URL: Repository, https://github.com/mohanjeyasankar/paper2md
License: MIT
License-File: LICENSE
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Text Processing :: Markup :: Markdown
Requires-Python: >=3.10
Requires-Dist: pymupdf>=1.24.0
Requires-Dist: typer>=0.9.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == 'dev'
Provides-Extra: mcp
Requires-Dist: fastmcp<1.0,>=0.3; extra == 'mcp'
Provides-Extra: ocr
Requires-Dist: pillow>=9.0; extra == 'ocr'
Requires-Dist: pytesseract>=0.3.10; extra == 'ocr'
Description-Content-Type: text/markdown

# paper2md

Precision PDF-to-Markdown converter for research papers.

## Features

- **Title, author, and abstract extraction** from diverse paper formats
- **Heading hierarchy** detection via font size, weight, and all-caps analysis
- **Math rendering** with CM font-to-LaTeX mapping (~120 symbols)
- **Tables** detected via line-based layout, output in pipe format
- **Figures** from raster (xref), vector (drawings), and clustered composites
- **References** with bracket and alphanumeric key parsing
- **OCR fallback** for scanned PDFs (PyMuPDF OCR or pytesseract)
- **Multi-column support** via 1D clustering with adaptive thresholds
- **MCP server** with tools for PDF conversion, structured extraction, and metadata

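To make the multi-column bullet above concrete, here is a minimal sketch of 1D gap clustering over the left edges of text blocks. It is an illustration only, not paper2md's actual implementation: the function name `cluster_columns`, the `min_gap_ratio` parameter, and the specific threshold rule are all assumptions chosen for the example.

```python
from statistics import median


def cluster_columns(x_starts, page_width, min_gap_ratio=0.1):
    """Group left edges of text blocks into columns (hypothetical helper).

    Sorts the x-coordinates and starts a new cluster wherever the gap to the
    previous coordinate exceeds an adaptive threshold.
    """
    if not x_starts:
        return []
    xs = sorted(x_starts)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    # Assumed threshold rule: at least a fixed fraction of the page width,
    # and at least three times the typical (median) gap between edges.
    typical_gap = median(gaps) if gaps else 0.0
    threshold = max(page_width * min_gap_ratio, 3 * typical_gap)

    clusters = [[xs[0]]]
    for x, gap in zip(xs[1:], gaps):
        if gap > threshold:
            clusters.append([x])  # large jump: a new column starts here
        else:
            clusters[-1].append(x)
    return clusters
```

On a 612 pt wide page, left edges near 72 pt and near 310 pt would fall into two clusters, one per column, while a page whose blocks all start near 72 pt would yield a single cluster.
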
## Installation

```bash
pip install paper2md
```

With MCP server support:

```bash
pip install "paper2md[mcp]"
```

With OCR support for scanned PDFs:

```bash
pip install "paper2md[ocr]"
```

## Usage

### CLI

```bash
paper2md paper.pdf -d output/
```

This writes the Markdown file and all extracted figure images to the output directory.

### Python API

```python
from paper2md import convert

result = convert("paper.pdf")
print(result.markdown)
```

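Building on the call shown above, the following sketch converts every PDF in a directory and writes one Markdown file per paper. The helper `convert_directory` and its `convert_fn` parameter are hypothetical and not part of paper2md; the sketch relies only on `convert` accepting a path string and the result exposing `.markdown`, as in the example above.

```python
from pathlib import Path


def convert_directory(pdf_dir, out_dir, convert_fn=None):
    """Convert each *.pdf in pdf_dir and write <stem>.md into out_dir.

    Hypothetical helper; convert_fn defaults to paper2md.convert.
    """
    if convert_fn is None:
        # Imported lazily so the helper can also be exercised with a stub.
        from paper2md import convert as convert_fn

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        result = convert_fn(str(pdf))
        target = out / f"{pdf.stem}.md"
        target.write_text(result.markdown, encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_directory("papers/", "output/")` would return the list of Markdown paths it wrote. Unlike the CLI, this sketch does not save extracted figure images.
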
### MCP Server

paper2md exposes three MCP tools: `convert_pdf`, `convert_pdf_structured`, and `extract_metadata`. Configure your MCP client to launch `paper2md.mcp_server`.

## Tested Formats

paper2md is tested against papers from the following venues, publishers, and research labs:

arXiv, NeurIPS, CVPR, ICLR, IEEE, ACM, NAACL, Meta AI, DeepMind, JMLR, Nature, Springer

## Requirements

- Python 3.10+
- PyMuPDF 1.24+
- Typer 0.9+

## License

MIT
