Metadata-Version: 2.4
Name: stache-tools-ocr
Version: 0.1.1
Summary: OCR support for stache-tools CLI using Tesseract
License: MIT
License-File: LICENSE
Requires-Python: >=3.10
Requires-Dist: stache-ai-ocr>=0.1.1
Requires-Dist: stache-tools>=0.1.0
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Description-Content-Type: text/markdown

# stache-tools-ocr

OCR support for the stache-tools CLI. It automatically detects scanned PDFs and images and processes them with Tesseract.

## Quick Start

```bash
# 1. Install system dependencies (one-time)
# Ubuntu/Debian:
sudo apt install ocrmypdf tesseract-ocr

# macOS:
brew install ocrmypdf

# Windows:
choco install ocrmypdf

# 2. Install Python package
pip install stache-tools-ocr

# 3. Use it automatically
stache ingest scanned.pdf
stache ingest *.jpg *.png
```

That's it! OCR loaders automatically register with the CLI.

## Architecture

stache-tools-ocr is a thin adapter that wraps [stache-ai-ocr](https://pypi.org/project/stache-ai-ocr/) to provide OCR capabilities for the stache-tools CLI.

**Design**: This package adapts stache-ai-ocr's file path interface to stache-tools' BinaryIO interface, enabling seamless integration with the `stache ingest` command.

**Benefits**:
- Single source of truth for OCR logic (in stache-ai-ocr)
- Simplified installation (dependencies pulled automatically)
- Rich metadata from OCR operations
- Works with all stache-tools features (`--parallel`, `--dry-run`, `--namespace`)
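
The adapter idea described under **Design** can be sketched in a few lines. This is a minimal illustration, not the package's actual code: `load_via_path` and the `path_loader` callable are hypothetical names, and the real stache-ai-ocr entry point may look different. The sketch spools the incoming binary stream to a temporary file, hands the path to a path-based loader, and removes the file afterwards.

```python
import io
import os
import tempfile
from typing import BinaryIO, Callable


def load_via_path(
    stream: BinaryIO,
    suffix: str,
    path_loader: Callable[[str], str],
) -> str:
    """Spool a binary stream to a temp file and pass its path to a path-based loader.

    `path_loader` is a hypothetical stand-in for whatever path-based
    function the underlying OCR library exposes.
    """
    # delete=False lets the loader reopen the file by name on any platform;
    # the file is removed explicitly in the finally block below.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(stream.read())
        tmp_path = tmp.name
    try:
        return path_loader(tmp_path)
    finally:
        os.unlink(tmp_path)


def read_back(path: str) -> str:
    """Toy path-based loader used only to demonstrate the adapter."""
    with open(path, "rb") as handle:
        return handle.read().decode("utf-8")


if __name__ == "__main__":
    print(load_via_path(io.BytesIO(b"hello"), ".pdf", read_back))
```

Keeping the suffix on the temporary file matters when the path-based loader decides how to handle a file from its extension.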

## Installation

### Python Package

```bash
pip install stache-tools-ocr
```

This automatically installs `stache-ai-ocr>=0.1.1` and its Python dependencies (pdfplumber, ocrmypdf, tesseract bindings).

### System Dependencies

pip does not install the system binaries. You still need to install ocrmypdf and Tesseract OCR separately:

**Ubuntu/Debian:**
```bash
sudo apt install ocrmypdf tesseract-ocr
```

**macOS:**
```bash
brew install ocrmypdf
```

**Windows:**
```bash
choco install ocrmypdf
```

### Development

```bash
cd stache-tools-ocr
# Editable install with the "dev" extra (pytest, pytest-cov)
pip install -e ".[dev]"
```

## Usage

Once installed, OCR loaders automatically register and activate:

```bash
# Scanned PDFs automatically use OCR
stache ingest scanned.pdf

# Image OCR (requires pytesseract)
stache ingest photo.jpg screenshot.png

# Works with all CLI features
stache ingest *.pdf *.jpg --parallel 4 --namespace books

# Dry run to preview processing
stache ingest scanned.pdf image.jpg --dry-run
```

## How It Works

### PDF Loader
- **Priority**: 10 (overrides BasicPDFLoader at priority 0)
- **Smart Detection**: Attempts text extraction first, falls back to OCR if document appears scanned
- **Method**: ocrmypdf CLI tool
- **Metadata**: Provides `ocr_used`, `ocr_failed`, `page_count`, `chars_per_page`, and `ocr_method`
- **Graceful Fallback**: Returns empty text with a warning if ocrmypdf is not installed
- **Thread Safe**: Works with `--parallel` mode
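
The smart-detection step and the metadata fields above could look roughly like the sketch below. The threshold value, the function names, and the `"ocrmypdf"` string used for `ocr_method` are assumptions made for illustration; the real heuristics live in stache-ai-ocr and may differ.

```python
from typing import Optional, Sequence

# Hypothetical threshold: pages averaging fewer characters than this
# are treated as scanned. The real value may differ.
MIN_CHARS_PER_PAGE = 50


def chars_per_page(page_texts: Sequence[str]) -> float:
    """Average number of non-whitespace-trimmed characters per page."""
    if not page_texts:
        return 0.0
    return sum(len(text.strip()) for text in page_texts) / len(page_texts)


def needs_ocr(
    page_texts: Sequence[str],
    min_chars_per_page: int = MIN_CHARS_PER_PAGE,
) -> bool:
    """Return True when extracted text is too sparse to be a digital PDF."""
    if not page_texts:
        return False  # no pages, nothing to OCR
    return chars_per_page(page_texts) < min_chars_per_page


def build_metadata(
    page_texts: Sequence[str],
    ocr_used: bool,
    ocr_failed: bool = False,
) -> dict:
    """Assemble the metadata fields listed for the PDF loader."""
    method: Optional[str] = "ocrmypdf" if ocr_used else None  # assumed value
    return {
        "ocr_used": ocr_used,
        "ocr_failed": ocr_failed,
        "page_count": len(page_texts),
        "chars_per_page": chars_per_page(page_texts),
        "ocr_method": method,
    }
```

In this sketch the text-extraction pass always runs first, so digital PDFs never pay the OCR cost.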

### Image Loader
- **Priority**: 5 (standard)
- **Formats**: JPG, JPEG, PNG, TIFF, TIF, BMP, GIF
- **Method**: pytesseract (Tesseract OCR)
- **Metadata**: Provides `ocr_used`, `ocr_method`, `image_format`, `image_size`
- **Graceful Fallback**: Returns empty text with a warning if pytesseract is not installed
- **Thread Safe**: Works with `--parallel` mode
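
To illustrate how the priorities above decide which loader handles a file, here is a small sketch of priority-based selection. It is not the stache-tools registry: `LoaderEntry`, `pick_loader`, and the name `OcrImageLoader` are hypothetical, while `OcrPdfLoader` and `BasicPDFLoader` are the names used in the Configuration section.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class LoaderEntry:
    """One registered loader: its name, handled extensions, and priority."""
    name: str
    extensions: FrozenSet[str]
    priority: int


REGISTRY = [
    LoaderEntry("BasicPDFLoader", frozenset({"pdf"}), 0),
    LoaderEntry("OcrPdfLoader", frozenset({"pdf"}), 10),
    # "OcrImageLoader" is an assumed name for the image loader.
    LoaderEntry(
        "OcrImageLoader",
        frozenset({"jpg", "jpeg", "png", "tiff", "tif", "bmp", "gif"}),
        5,
    ),
]


def pick_loader(
    entries: Iterable[LoaderEntry], filename: str
) -> Optional[LoaderEntry]:
    """Return the highest-priority loader that handles the file's extension."""
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    candidates = [entry for entry in entries if extension in entry.extensions]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry.priority)
```

With these entries, a `.pdf` file resolves to the OCR loader because 10 outranks 0, and a file with an unregistered extension resolves to no loader at all.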

## OCR Behavior

For details on OCR heuristics, timeout configuration, and complete metadata fields, see the [stache-ai-ocr documentation](https://github.com/yourusername/stache/tree/main/packages/stache-ai-ocr).

## Configuration

**Environment Variables:**

| Variable | Default | Purpose |
|----------|---------|---------|
| `STACHE_OCR_TIMEOUT` | 300 | OCR timeout in seconds |

**Override Loaders:**
```bash
# Force use of OCR loader
export STACHE_LOADER_PDF=OcrPdfLoader
stache ingest document.pdf

# Use basic loader (skip OCR)
export STACHE_LOADER_PDF=BasicPDFLoader
stache ingest document.pdf
```
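
For reference, a setting such as `STACHE_OCR_TIMEOUT` is typically read along the lines of the sketch below. The function name and the fallback-on-invalid-value behaviour are assumptions; only the variable name and the default of 300 seconds come from the table above.

```python
import os
from typing import Mapping, Optional

DEFAULT_OCR_TIMEOUT = 300  # seconds, per the configuration table


def get_ocr_timeout(environ: Optional[Mapping[str, str]] = None) -> int:
    """Read STACHE_OCR_TIMEOUT, falling back to the default when unset or invalid."""
    env = os.environ if environ is None else environ
    raw = env.get("STACHE_OCR_TIMEOUT", "")
    try:
        value = int(raw)
    except ValueError:
        # Unset, empty, or non-numeric: use the default (assumed behaviour).
        return DEFAULT_OCR_TIMEOUT
    # Zero or negative timeouts make no sense for a subprocess call.
    return value if value > 0 else DEFAULT_OCR_TIMEOUT
```

The returned value could then be passed as the `timeout` argument of `subprocess.run` when invoking an external OCR tool.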

## System Requirements

- **Python**: 3.10+
- **ocrmypdf**: System binary (for PDF OCR)
- **tesseract-ocr**: System binary (for image OCR; ocrmypdf also relies on it for PDFs)
- **pdfplumber, pytesseract, pillow**: Installed via pip

## Cost & Performance

- **Free** - no API costs
- **Offline** - works without internet
- **Speed**: ~1-3 seconds per page (CPU-bound)
- **Quality**: roughly 99% accuracy on clean scans, 70-90% on poor-quality scans

## Troubleshooting

**"ocrmypdf not found" error:**
```bash
# Verify installation
which ocrmypdf
tesseract --version

# Reinstall if needed
sudo apt install --reinstall ocrmypdf tesseract-ocr
```
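
The same check can be done from Python, which is handy on Windows where `which` is not available. This is a small standalone sketch using only the standard library; `missing_binaries` is an illustrative helper, not part of stache-tools-ocr.

```python
import shutil
from typing import Iterable, List

# Binaries named in the System Requirements section.
REQUIRED_BINARIES = ("ocrmypdf", "tesseract")


def missing_binaries(names: Iterable[str] = REQUIRED_BINARIES) -> List[str]:
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Missing system dependencies: " + ", ".join(missing))
    else:
        print("All OCR system dependencies found.")
```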

**Tesseract not found (image OCR):**
```bash
# Install tesseract
# Ubuntu/Debian:
sudo apt install tesseract-ocr

# macOS:
brew install tesseract

# Windows:
choco install tesseract
```

**OCR timeout on large PDFs:**
```bash
# Increase timeout to 10 minutes
export STACHE_OCR_TIMEOUT=600
stache ingest large-document.pdf
```

**Poor OCR quality:**
- Ensure the scan is at least 300 DPI
- Try pre-processing with image tools (deskew, denoise)
- See stache-ai-ocr docs for advanced OCR tuning

## License

MIT
