Metadata-Version: 2.4
Name: rostaing-ocr
Version: 1.2.1
Summary: High-Precision OCR Extraction for LLMs and RAG Systems: PDFs, Scanned PDFs, and Images
Home-page: https://github.com/Rostaing/rostaing-ocr
Author: Davila Rostaing
Author-email: rostaingdavila@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Multimedia :: Graphics :: Capture :: Scanners
Classifier: Topic :: Scientific/Engineering :: Image Processing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Intended Audience :: Developers
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENCE
Requires-Dist: pymupdf>=1.20.0
Requires-Dist: python-doctr[torch]>=0.7.0
Requires-Dist: pillow>=9.0.0
Requires-Dist: numpy>=1.21.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

<p align="center">
  <a href="https://pypi.org/project/rostaing-ocr/"><img src="https://img.shields.io/pypi/v/rostaing-ocr?color=blue&label=PyPI%20version" alt="PyPI version"></a>
  <a href="https://pypi.org/project/rostaing-ocr/"><img src="https://img.shields.io/pypi/pyversions/rostaing-ocr.svg" alt="Python versions"></a>
  <a href="https://github.com/Rostaing/rostaing-ocr/blob/main/LICENSE"><img src="https://img.shields.io/pypi/l/rostaing-ocr.svg" alt="License"></a>
  <a href="https://pepy.tech/project/rostaing-ocr"><img src="https://static.pepy.tech/badge/rostaing-ocr" alt="Downloads"></a>
</p>

# rostaing-ocr

**Production-Grade Layout-Aware OCR for LLMs and RAG Systems**

`rostaing-ocr` is a Python library that extracts text from PDFs, scanned PDFs, and images while **preserving complex layouts**. Where standard OCR tools often output an unordered "soup" of words, this library combines **Deep Learning** with geometric reconstruction to keep tables, columns, and document structure intact.

It is specifically optimized for **Retrieval-Augmented Generation (RAG)** pipelines where maintaining the visual structure of data (like invoice tables) is critical for LLM comprehension.

## Key Features

- **Layout-Aware:** Uses geometric clustering to reconstruct tables and columns. Data stays on the correct line, visually aligned.
- **🧹 Noise Filtering:** Automatically detects and removes low-confidence text such as **messy handwriting, signatures, and stamps** to keep the output clean.
- **⚡ Local Processing:** Runs 100% locally (CPU or GPU). No external APIs, no data leaving your server.
- **📄 Universal Input:** Handles PDFs (digital and scanned) and common image formats by normalizing every input to Base64 before OCR.
- **🔒 Privacy Focused:** Temporary files are handled securely and deleted immediately after extraction.

## Installation

```bash
pip install rostaing-ocr
```

<!-- ## Dependencies -->

<!-- This package relies on modern Deep Learning libraries:
- `python-doctr[torch]` (The OCR Engine) -->
<!-- - `pymupdf` (PDF rendering)
- `numpy` (Matrix operations) -->

*(Note: The first run automatically downloads the required OCR models, about 300 MB.)*

## Usage

### 1. Basic Usage (Default Behavior)
By default, the extractor prints the result to the console and saves it to `output.txt`.

```python
from rostaing_ocr import ocr_extractor

# This immediately runs the extraction using DocTR
extractor = ocr_extractor("documents/invoice.pdf")

# The extracted text is now in 'output.txt'
print(extractor) # Prints status summary (Time taken, pages processed)
```

### 2. Custom Output File
You can specify a different filename. The file will be created or overwritten automatically.

```python
from rostaing_ocr import ocr_extractor

extractor = ocr_extractor(
    "data/report.png",
    output_file="results/report_analysis.txt"
)
```

### 3. Silent Mode (Background Processing)
Useful for batch processing or server backends where you don't want console logs.

```python
from rostaing_ocr import ocr_extractor

extractor = ocr_extractor(
    "financial_statement.pdf",
    print_to_console=False,
    save_file=True
)
```
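
For batch jobs, a small wrapper can collect the text of every successful extraction. The sketch below is illustrative only: `extract_batch` and its `run_ocr` parameter are hypothetical names, not part of `rostaing-ocr`. It relies only on the `status` and `extracted_text` attributes shown in this README, and the commented usage assumes that `save_file=False` disables writing `output.txt`.

```python
from typing import Callable, Dict, Iterable


def extract_batch(paths: Iterable[str], run_ocr: Callable) -> Dict[str, str]:
    """Run `run_ocr` on each path and keep the text of successful extractions.

    `run_ocr` is any callable that returns an object exposing `.status` and
    `.extracted_text` (hypothetical helper, not part of the library).
    """
    results: Dict[str, str] = {}
    for path in paths:
        extractor = run_ocr(str(path))
        # Keep only documents the extractor reports as successful
        if extractor.status == "Success":
            results[str(path)] = extractor.extracted_text
    return results


# Possible usage with the library (assumes save_file=False skips output.txt):
# from pathlib import Path
# from rostaing_ocr import ocr_extractor
# texts = extract_batch(
#     Path("documents").glob("*.pdf"),
#     lambda p: ocr_extractor(p, print_to_console=False, save_file=False),
# )
```

Passing the extractor in as a callable keeps the loop independent of the OCR engine, so it can be exercised with a stub before the models are downloaded.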

### 4. Direct Integration (RAG Pipelines)
Read the extracted text directly from the `extracted_text` attribute instead of reading the output file.

```python
from rostaing_ocr import ocr_extractor

extractor = ocr_extractor("scan.jpg", print_to_console=False)

if extractor.status == "Success":
    clean_text = extractor.extracted_text
    # Send 'clean_text' to GPT-4, Mistral, Gemini, Claude, Grok, Llama... or your Vector DB
```
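
Before indexing the text in a vector database, it usually has to be split into chunks. Because the layout is encoded line by line, one option is to split only at line boundaries so that table rows are never cut in half. This is a minimal sketch of that idea; `chunk_by_lines` and `max_chars` are hypothetical names, not part of `rostaing-ocr`.

```python
from typing import List


def chunk_by_lines(text: str, max_chars: int = 1000) -> List[str]:
    """Split text into chunks of at most `max_chars`, breaking only between lines.

    A single line longer than `max_chars` is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for line in text.splitlines():
        added = len(line) + 1  # +1 for the newline that joins lines
        # Flush the current chunk if this line would push it over the limit
        if current and size + added > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += added
    if current:
        chunks.append("\n".join(current))
    return chunks
```

With the example above, `chunk_by_lines(clean_text)` would yield a list of strings ready to embed.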

## How It Works (Architecture)

1. **Input Normalization:** Converts PDF pages or images into high-resolution Base64 streams.
2. **Deep Learning Inference:** Runs DBNet for text detection and CRNN for text recognition.
3. **Noise Filtering:** Checks the confidence scores. Text with low confidence (e.g., `< 0.4`), such as signatures or stamps, is discarded.
4. **Geometric Reconstruction:**
   - Flattens the document hierarchy.
   - Clusters words into visual lines based on Y-axis alignment.
   - Measures the horizontal gaps between words and inserts dynamic spacing (tabs or spaces) to simulate columns.
5. **Output:** Returns a clean, structured string that looks like the original document.
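
Steps 3 and 4 can be illustrated with a small, dependency-free sketch. It is not the library's actual implementation: the `Word` type, the `reconstruct` function, and the `y_tol` and `column_gap` thresholds are assumptions made for illustration, and coordinates are assumed to be normalized to the 0 to 1 range. Only the `0.4` confidence cutoff comes from the description above.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """A recognized word (hypothetical structure for this sketch)."""
    text: str
    x0: float    # left edge
    x1: float    # right edge
    y: float     # vertical center
    conf: float  # recognition confidence, 0..1


def reconstruct(words: List[Word], min_conf: float = 0.4,
                y_tol: float = 0.01, column_gap: float = 0.05) -> str:
    # Step 3: drop low-confidence words (signatures, stamps, ...)
    kept = sorted((w for w in words if w.conf >= min_conf), key=lambda w: w.y)

    # Step 4a: cluster words into visual lines by Y-axis alignment
    lines: List[List[Word]] = []
    for w in kept:
        if lines and abs(w.y - lines[-1][0].y) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])

    # Step 4b: order each line left to right; wide gaps become tabs
    rendered = []
    for line in lines:
        line.sort(key=lambda w: w.x0)
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            gap = cur.x0 - prev.x1
            parts.append(("\t" if gap > column_gap else " ") + cur.text)
        rendered.append("".join(parts))
    return "\n".join(rendered)


words = [
    Word("Total", 0.10, 0.20, 0.500, 0.98),
    Word("42.00", 0.70, 0.80, 0.502, 0.95),
    Word("Item", 0.10, 0.18, 0.300, 0.99),
    Word("Price", 0.70, 0.80, 0.301, 0.97),
    Word("scrawl", 0.40, 0.60, 0.900, 0.20),  # below 0.4, filtered out
]
print(reconstruct(words))
```

In this toy input the two table rows come out in reading order with a tab between the label and the value, and the low-confidence word is dropped.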

## License

MIT License
