Metadata-Version: 2.4
Name: iflow-mcp_labeveryday_mcp_pdf_reader
Version: 0.1.1
Summary: MCP server for reading PDFs with text extraction, image extraction, and OCR
Project-URL: Homepage, https://github.com/labeveryday/mcp_pdf_reader
Project-URL: Repository, https://github.com/labeveryday/mcp_pdf_reader
Project-URL: Issues, https://github.com/labeveryday/mcp_pdf_reader
Author-email: Your Name <your.email@example.com>
License: MIT
Keywords: fastmcp,mcp,ocr,pdf,text-extraction
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Multimedia :: Graphics
Classifier: Topic :: Text Processing
Requires-Python: >=3.11
Requires-Dist: fastmcp>=0.2.0
Requires-Dist: pillow>=10.0.0
Requires-Dist: pymupdf>=1.23.0
Requires-Dist: pytesseract>=0.3.10
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: enhanced
Requires-Dist: numpy>=1.24.0; extra == 'enhanced'
Requires-Dist: opencv-python>=4.8.0; extra == 'enhanced'
Description-Content-Type: text/markdown

# MCP PDF Reader Server (Python + FastMCP)

A Model Context Protocol (MCP) server built with FastMCP for processing PDFs: it extracts page text, extracts embedded images, and runs OCR on those images to read the text they contain.

## Features

- **Text Extraction**: Extract text content from PDF pages
- **Image Extraction**: Extract all images from PDF files
- **OCR Capabilities**: Read text from images using Tesseract OCR
- **Comprehensive Analysis**: Get detailed PDF structure and metadata
- **Page Range Support**: Process specific page ranges
- **Multiple Languages**: OCR support for multiple languages

## Prerequisites

### System Dependencies

#### Tesseract OCR
You need to install Tesseract OCR on your system:

**Ubuntu/Debian:**
```bash
sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-eng
```

**macOS:**
```bash
brew install tesseract
```

**Windows:**
1. Download from: https://github.com/UB-Mannheim/tesseract/wiki
2. Install and add to PATH
3. Or use: `conda install -c conda-forge tesseract`

#### Additional Language Packs (Optional)
```bash
# Ubuntu/Debian: French, German, and Spanish language data
sudo apt install tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-spa
```

## Installation

### Quick Start with UV

1. **Install UV (if not already installed):**
```bash
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
```

2. **Clone/Create the project:**
```bash
mkdir mcp-pdf-reader-server
cd mcp-pdf-reader-server
```

3. **Initialize and install with UV:**
```bash
# Copy the files (pdf_reader_server.py and pyproject.toml)
# Then install dependencies
uv sync
```

4. **Verify installation:**
```bash
uv run python -c "import pytesseract; print(pytesseract.get_tesseract_version())"
```

### Alternative: Manual Setup

If you prefer traditional setup:

1. **Create virtual environment:**
```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

2. **Install dependencies:**
```bash
pip install fastmcp PyMuPDF pytesseract Pillow
```

## Usage

### Running the Server

With UV:
```bash
uv run python pdf_reader_server.py
```

Or if you have the environment activated:
```bash
python pdf_reader_server.py
```

The server will start and listen for MCP requests on stdin/stdout.
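
To make the stdin/stdout exchange concrete, here is a minimal sketch of what a single tool-call request could look like on the wire, assuming the standard MCP JSON-RPC 2.0 framing with one JSON message per line. The helper name `build_tool_call` is hypothetical, and a real session must complete the MCP `initialize` handshake before sending this, which an MCP client normally handles for you.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize one MCP `tools/call` request as a single JSON-RPC 2.0 line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # json.dumps without `indent` emits no newlines, so the message fits on one line.
    return json.dumps(message)


request_line = build_tool_call(
    1, "read_pdf_text", {"file_path": "/path/to/document.pdf"}
)
print(request_line)
```

In normal use you would not write these messages by hand; the sketch only shows the shape of the traffic when inspecting or debugging the server.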

### Available Tools

#### 1. `read_pdf_text`
Extract text content from PDF pages.

**Parameters:**
- `file_path` (string, required): Path to the PDF file
- `page_range` (object, optional): Dict with `start` and `end` page numbers

**Example:**
```json
{
  "file_path": "/path/to/document.pdf",
  "page_range": {"start": 1, "end": 5}
}
```
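
The sketch below shows one way a `page_range` object could be turned into page indices. It is not taken from the server's code: it assumes page numbers are 1-based and inclusive, that missing keys default to the first and last page, and that out-of-bounds values are clamped. The function name `resolve_page_range` is hypothetical.

```python
from typing import Optional


def resolve_page_range(page_range: Optional[dict], total_pages: int) -> range:
    """Turn an optional {"start", "end"} dict into zero-based page indices.

    Assumption: page numbers are 1-based and inclusive.
    """
    page_range = page_range or {}
    # Clamp the requested bounds to the pages that actually exist.
    start = max(1, int(page_range.get("start", 1)))
    end = min(total_pages, int(page_range.get("end", total_pages)))
    if start > end:
        raise ValueError(f"empty page range: start={start}, end={end}")
    # Shift to zero-based indices; range() excludes its upper bound.
    return range(start - 1, end)


print(list(resolve_page_range({"start": 1, "end": 5}, total_pages=10)))
```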

#### 2. `extract_pdf_images`
Extract all images from a PDF file.

**Parameters:**
- `file_path` (string, required): Path to the PDF file
- `output_dir` (string, optional): Directory to save images
- `page_range` (object, optional): Page range to process

**Example:**
```json
{
  "file_path": "/path/to/document.pdf",
  "output_dir": "/path/to/images/",
  "page_range": {"start": 1, "end": 3}
}
```

#### 3. `read_pdf_with_ocr`
Extract text from both regular text and images using OCR.

**Parameters:**
- `file_path` (string, required): Path to the PDF file
- `page_range` (object, optional): Page range to process
- `ocr_language` (string, optional): OCR language code (default: "eng")

**Example:**
```json
{
  "file_path": "/path/to/document.pdf",
  "ocr_language": "eng+fra",
  "page_range": {"start": 1, "end": 10}
}
```

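**Example OCR language codes** (each requires the matching Tesseract language pack to be installed):
- `eng` - English
- `fra` - French
- `deu` - German
- `spa` - Spanish
- `eng+fra` - English and French together; join codes with `+`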
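
Before passing a combined code such as `eng+fra`, it can help to check that every part is installed. The following is a minimal sketch with a hypothetical helper, `missing_languages`; the set of installed codes is passed in by the caller.

```python
def missing_languages(ocr_language: str, installed: set[str]) -> list[str]:
    """Return the codes in an `eng+fra`-style string that are not installed."""
    # Split on "+" and drop empty parts from a stray leading or trailing "+".
    requested = [code for code in ocr_language.split("+") if code]
    return [code for code in requested if code not in installed]


# Illustrative set; on a real system this comes from Tesseract itself.
installed_codes = {"eng", "osd"}
print(missing_languages("eng+fra", installed_codes))
```

On a machine with Tesseract installed, the installed set can be obtained with `set(pytesseract.get_languages())`, or read from the output of `tesseract --list-langs`.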

#### 4. `get_pdf_info`
Get comprehensive metadata and statistics about a PDF.

**Parameters:**
- `file_path` (string, required): Path to the PDF file

#### 5. `analyze_pdf_structure`
Analyze the structure and content distribution of a PDF.

**Parameters:**
- `file_path` (string, required): Path to the PDF file

## Configuration with Claude Desktop

### With UV
Add this to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "pdf-reader": {
      "command": "uv",
      "args": ["run", "python", "/path/to/your/pdf_reader_server.py"],
      "cwd": "/path/to/your/mcp-pdf-reader-server"
    }
  }
}
```
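
Because the script path and `cwd` must point at the same project, generating the entry from a single directory can prevent mismatches. This is a sketch only: the helper name `uv_server_entry` is hypothetical, and the printed JSON still has to be merged into your existing `claude_desktop_config.json` by hand.

```python
import json
from pathlib import Path


def uv_server_entry(project_dir: str, script_name: str = "pdf_reader_server.py") -> dict:
    """Build the `mcpServers` entry shown above from one project directory."""
    project = Path(project_dir)
    return {
        "mcpServers": {
            "pdf-reader": {
                "command": "uv",
                # The script path and cwd are derived from the same directory.
                "args": ["run", "python", str(project / script_name)],
                "cwd": str(project),
            }
        }
    }


print(json.dumps(uv_server_entry("/path/to/your/mcp-pdf-reader-server"), indent=2))
```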

### With Virtual Environment
```json
{
  "mcpServers": {
    "pdf-reader": {
      "command": "/path/to/your/.venv/bin/python",
      "args": ["/path/to/your/pdf_reader_server.py"]
    }
  }
}
```

### System Python
```json
{
  "mcpServers": {
    "pdf-reader": {
      "command": "python",
      "args": ["/path/to/your/pdf_reader_server.py"],
      "env": {
        "PYTHONPATH": "/path/to/your/.venv/lib/python3.x/site-packages"
      }
    }
  }
}
```

## Example Responses

### Text Extraction Response
```json
{
  "success": true,
  "file_path": "/path/to/document.pdf",
  "pages_processed": "1-3",
  "total_pages": 10,
  "pages_text": [
    {
      "page_number": 1,
      "text": "Page 1 content...",
      "word_count": 125
    }
  ],
  "combined_text": "All text combined...",
  "total_word_count": 1250,
  "total_character_count": 8750
}
```

### OCR Response
```json
{
  "success": true,
  "file_path": "/path/to/document.pdf",
  "pages_processed": "1-2",
  "ocr_language": "eng",
  "pages_data": [
    {
      "page_number": 1,
      "text": "Regular text from PDF...",
      "ocr_text": "Text extracted from images...",
      "images_with_text": [
        {
          "image_index": 1,
          "ocr_text": "Text from image 1",
          "confidence": "high"
        }
      ],
      "combined_text": "Combined text and OCR...",
      "text_word_count": 100,
      "ocr_word_count": 25
    }
  ],
  "summary": {
    "total_text_word_count": 200,
    "total_ocr_word_count": 50,
    "combined_word_count": 250,
    "images_processed": 3
  },
  "all_text_combined": "All extracted text..."
}
```
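
A client consuming this response might want only the text recovered from images, grouped by page. The sketch below assumes the field names shown in the example response above; the helper name `image_text_by_page` is hypothetical, and the sample data is trimmed from that example.

```python
def image_text_by_page(response: dict) -> dict[int, list[str]]:
    """Group non-empty per-image OCR text by page number."""
    if not response.get("success"):
        raise RuntimeError("OCR tool call did not succeed")
    result: dict[int, list[str]] = {}
    for page in response.get("pages_data", []):
        texts = [
            image["ocr_text"]
            for image in page.get("images_with_text", [])
            if image.get("ocr_text")
        ]
        # Pages without any recognized image text are left out.
        if texts:
            result[page["page_number"]] = texts
    return result


sample_response = {
    "success": True,
    "pages_data": [
        {
            "page_number": 1,
            "images_with_text": [
                {"image_index": 1, "ocr_text": "Text from image 1", "confidence": "high"}
            ],
        },
        {"page_number": 2, "images_with_text": []},
    ],
}
print(image_text_by_page(sample_response))
```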

## Performance Considerations

### OCR Performance
- OCR processing can be slow for large images
- Consider processing smaller page ranges for faster results
- Images smaller than 50x50 pixels are automatically skipped

### Memory Usage
- Large PDFs with many images may consume significant memory
- The server processes pages sequentially to manage memory usage
- Extracted images are saved to disk to reduce memory pressure

### Optimization Tips
1. **Use page ranges** for large documents
2. **Specify output directories** for image extraction to avoid temp file buildup
3. **Choose appropriate OCR languages** to improve accuracy and speed
4. **Preprocess images** if OCR quality is poor (OpenCV and NumPy are available through the `enhanced` extra)
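
The first tip can be applied by splitting a long document into consecutive `page_range` objects and issuing one tool call per chunk. A minimal sketch, assuming 1-based inclusive page numbers; `page_chunks` is a hypothetical helper, not part of the server.

```python
from typing import Iterator


def page_chunks(total_pages: int, chunk_size: int) -> Iterator[dict]:
    """Yield consecutive {"start", "end"} ranges that cover every page once."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(1, total_pages + 1, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield {"start": start, "end": min(start + chunk_size - 1, total_pages)}


for page_range in page_chunks(total_pages=10, chunk_size=4):
    print(page_range)
```

Each yielded dict can be passed as the `page_range` argument of `read_pdf_text` or `read_pdf_with_ocr`.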

## Troubleshooting

### Common Issues

1. **Tesseract not found:**
   ```
   TesseractNotFoundError: tesseract is not installed
   ```
   - Install Tesseract OCR system package
   - Ensure it's in your PATH

2. **Permission errors:**
   - Ensure the Python process has read access to PDF files
   - Ensure write access to output directories

3. **Poor OCR results:**
   - Try different OCR language codes
   - Consider image preprocessing
   - Check if images are high enough resolution

4. **Memory errors:**
   - Process smaller page ranges
   - Close other applications
   - Consider increasing available RAM
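
For the first issue above (Tesseract not found), a quick check of whether the binary is on `PATH` can narrow things down before involving the server at all. This sketch uses only the standard library; `find_tesseract` is a hypothetical helper name.

```python
import shutil
from typing import Optional


def find_tesseract(binary_name: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(binary_name)


path = find_tesseract()
if path is None:
    print("tesseract not found on PATH; install it or adjust PATH")
else:
    print(f"tesseract found at {path}")
```

If Tesseract is installed in a location that is not on `PATH`, pytesseract also lets you point at it explicitly by setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable.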

### Debug Mode

Setting `PYTHONUNBUFFERED=1` disables Python's output buffering, so log and error messages appear immediately. With UV:
```bash
PYTHONUNBUFFERED=1 uv run python pdf_reader_server.py
```

Or with regular Python:
```bash
PYTHONUNBUFFERED=1 python pdf_reader_server.py
```

### Testing OCR
Test Tesseract directly:
```bash
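# List the installed language packs
tesseract --list-langs
# OCR an image; Tesseract appends .txt to the output base, so this writes output.txt
tesseract image.png output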
```

## Dependencies

- **fastmcp**: Modern MCP server framework
- **PyMuPDF**: Fast PDF processing and rendering
- **pytesseract**: Python wrapper for Tesseract OCR
- **Pillow**: Image processing library
- **tesseract-ocr**: System OCR engine

## Advanced Features

### Custom OCR Configuration
You can modify the OCR configuration in the code:
```python
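ocr_text = pytesseract.image_to_string(
    pil_image,
    lang=ocr_language,
    # --psm 6: treat the image as a single uniform block of text.
    # The whitelist restricts recognition to the listed characters.
    config='--psm 6 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz '
)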
```
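
If you experiment with several settings, assembling the `config` string in one place keeps the options readable. A minimal sketch: `build_ocr_config` is a hypothetical helper, it covers only the two options shown above, and unlike the snippet above its example whitelist does not include a space character.

```python
import string


def build_ocr_config(psm: int = 6, whitelist: str = "") -> str:
    """Assemble a Tesseract config string from a page segmentation mode and whitelist."""
    # Tesseract defines page segmentation modes 0 through 13.
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    parts = [f"--psm {psm}"]
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


alphanumeric = string.digits + string.ascii_uppercase + string.ascii_lowercase
config = build_ocr_config(psm=6, whitelist=alphanumeric)
print(config)
```

The resulting string can be passed as the `config` argument of `pytesseract.image_to_string`.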

### Image Preprocessing
For better OCR results, consider adding image preprocessing:
```python
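# Requires opencv-python and numpy (the package's `enhanced` extra)
import cv2
import numpy as np
from PIL import Image

def preprocess_image(image):
    """Binarize a PIL image with Otsu thresholding before OCR."""
    # Normalize to RGB first so grayscale or RGBA inputs also convert cleanly
    gray = cv2.cvtColor(np.array(image.convert("RGB")), cv2.COLOR_RGB2GRAY)
    # With THRESH_OTSU the threshold is chosen automatically; the 0 is ignored
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    return Image.fromarray(thresh)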
```

## Contributing

1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Submit a pull request

## License

MIT License - see LICENSE file for details.