Metadata-Version: 2.4
Name: leapocr
Version: 2.0.0
Summary: Official Python SDK for LeapOCR - Transform documents into structured data using AI-powered OCR
Project-URL: Homepage, https://leapocr.com
Project-URL: Documentation, https://docs.leapocr.com
Project-URL: Repository, https://github.com/leapocr/leapocr-python
Project-URL: Issues, https://github.com/leapocr/leapocr-python/issues
Project-URL: Logo, https://raw.githubusercontent.com/LeapOCR/leapocr-python/refs/heads/main/logo/web-app-manifest-512x512.png
Author-email: LeapOCR <support@leapocr.com>
Maintainer-email: LeapOCR <support@leapocr.com>
License: Apache-2.0
License-File: LICENSE
Keywords: ai,api-client,artificial-intelligence,async,asyncio,data-extraction,document-extraction,document-processing,ocr,optical-character-recognition,pdf,sdk,structured-data,text-extraction
Classifier: Development Status :: 4 - Beta
Classifier: Framework :: AsyncIO
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP :: Dynamic Content
Classifier: Topic :: Office/Business
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Markup
Classifier: Typing :: Typed
Requires-Python: >=3.9
Requires-Dist: httpx>=0.28.0
Requires-Dist: pydantic<2.0,>=1.10.24
Requires-Dist: typing-extensions>=4.12.0
Provides-Extra: dev
Requires-Dist: mypy>=1.18.0; extra == 'dev'
Requires-Dist: pre-commit<4.4.0,>=4.3.0; extra == 'dev'
Requires-Dist: pytest-asyncio<1.3.0,>=1.2.0; extra == 'dev'
Requires-Dist: pytest-cov>=7.0.0; extra == 'dev'
Requires-Dist: pytest<9.0.0,>=8.4.0; extra == 'dev'
Requires-Dist: ruff>=0.14.0; extra == 'dev'
Description-Content-Type: text/markdown

# LeapOCR Python SDK

<div align="center">
  <img src="https://raw.githubusercontent.com/LeapOCR/leapocr-python/refs/heads/main/logo/web-app-manifest-512x512.png" alt="LeapOCR Logo" width="200" height="200">
</div>

<div align="center">

[![PyPI version](https://badge.fury.io/py/leapocr.svg)](https://badge.fury.io/py/leapocr)
[![Python Support](https://img.shields.io/pypi/pyversions/leapocr.svg)](https://pypi.org/project/leapocr/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

Official Python SDK for [LeapOCR](https://www.leapocr.com/) - Transform documents into structured data using AI-powered OCR.

</div>

## Overview

LeapOCR provides enterprise-grade document processing with AI-powered data extraction. This SDK offers a Python-native, async-first interface for seamless integration into your applications.

## Installation

```bash
pip install leapocr
```

Or using [uv](https://github.com/astral-sh/uv):

```bash
uv add leapocr
```

## Quick Start

### Prerequisites

- Python 3.9 or higher
- LeapOCR API key ([sign up here](https://www.leapocr.com/signup))

### Basic Example

```python
import asyncio
import os
from leapocr import LeapOCR, ProcessOptions, Format, Model

async def main():
    # Initialize the SDK with your API key
    async with LeapOCR(os.getenv("LEAPOCR_API_KEY")) as client:
        # Submit document for processing
        job = await client.ocr.process_url(
            "https://example.com/document.pdf",
            options=ProcessOptions(
                format=Format.STRUCTURED,
                model=Model.STANDARD_V2,
                schema={
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                    },
                    "required": ["title"],
                },
            ),
        )

        print(f"Job created: {job.job_id}")

        # Wait for processing to complete
        result = await client.ocr.wait_until_done(job.job_id)

        print(f"Credits used: {result.credits_used}")
        print(f"Pages processed: {len(result.pages)}")

        # Access extracted data (string for markdown, dict for structured)
        first_page = result.pages[0].result
        if isinstance(first_page, str):
            print(f"Extracted text: {first_page[:200]}...")
        else:
            print(f"Extracted data: {first_page}")

        # Optional: Delete job after processing (auto-deleted after 7 days)
        await client.ocr.delete_job(job.job_id)

asyncio.run(main())
```

## Key Features

- **Async-First Design** - Built on asyncio with httpx for high-performance concurrent processing
- **Type-Safe API** - Full type hints and mypy strict mode support
- **Multiple Processing Formats** - Structured data extraction (dict) or markdown output (string)
- **Flexible Result Types** - Results are automatically typed: `str` for markdown, `dict` for structured data
- **Flexible Model Selection** - Choose from standard, pro, or custom AI models
- **Custom Schema Support** - Define extraction schemas for your specific use case
- **Built-in Retry Logic** - Automatic exponential backoff for transient failures
- **Context Manager Support** - Proper resource cleanup with async context managers
- **Direct File Upload** - Efficient multipart uploads for local files
- **Progress Tracking** - Real-time callbacks for long-running operations
- **Automatic Cleanup** - Jobs and files are automatically deleted after 7 days, or delete immediately with `delete_job()`

## Processing Models

Use the `model` parameter in `ProcessOptions` to specify a model. Defaults to `Model.STANDARD_V2`.

You can also use custom model strings for organization-specific models:

```python
from leapocr import ProcessOptions

options = ProcessOptions(
    model="my-custom-model-v2",  # Custom string model
)
```

## Usage Examples

### Processing from URL

```python
import asyncio
import os
from leapocr import LeapOCR, ProcessOptions, Format, Model

async def process_url():
    async with LeapOCR("your-api-key") as client:
        # Submit job
        job = await client.ocr.process_url(
            "https://example.com/invoice.pdf",
            options=ProcessOptions(
                format=Format.STRUCTURED,
                model=Model.PRO_V2,  # Use premium model for better accuracy
                schema={
                    "type": "object",
                    "properties": {
                        "invoice_number": {"type": "string"},
                        "invoice_date": {"type": "string"},
                        "total_amount": {"type": "number"},
                    },
                    "required": ["invoice_number", "total_amount"],
                },
            ),
        )

        # Wait for completion
        result = await client.ocr.wait_until_done(job.job_id)
        print(f"Credits used: {result.credits_used}")
        print(f"Pages: {len(result.pages)}")

        await client.ocr.delete_job(job.job_id)

asyncio.run(process_url())
```

### Processing Local Files

```python
import os
from pathlib import Path

async def process_file():
    async with LeapOCR("your-api-key") as client:
        # Submit file
        job = await client.ocr.process_file(
            Path("invoice.pdf"),
            options=ProcessOptions(
                format=Format.STRUCTURED,
                model=Model.STANDARD_V2,  # Default general-purpose model
                schema={
                    "type": "object",
                    "properties": {
                        "invoice_number": {"type": "string"},
                        "total_amount": {"type": "number"},
                    },
                    "required": ["invoice_number", "total_amount"],
                },
            ),
        )

        # Wait for completion
        result = await client.ocr.wait_until_done(job.job_id)

        for page in result.pages:
            # result can be string (markdown) or dict (structured)
            result_length = len(page.result) if isinstance(page.result, str) else len(str(page.result))
            print(f"Page {page.page_number}: {result_length} characters")

        await client.ocr.delete_job(job.job_id)

asyncio.run(process_file())
```

### Custom Schema Extraction

```python
async def extract_with_schema():
    schema = {
        "type": "object",
        "properties": {
            "patient_name": {"type": "string"},
            "date_of_birth": {"type": "string"},
            "medications": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "dosage": {"type": "string"},
                    },
                },
            },
        },
    }

    async with LeapOCR("your-api-key") as client:
        # Submit with schema
        job = await client.ocr.process_file(
            "medical-record.pdf",
            options=ProcessOptions(
                format=Format.STRUCTURED,
                model=Model.STANDARD_V2,
                schema=schema,
                instructions="Extract patient information and medications",
            ),
        )

        # Wait for completion
        result = await client.ocr.wait_until_done(job.job_id)

        # Access the structured data
        import json
        page_result = result.pages[0].result

        # Handle both dict (already parsed) and string (JSON)
        if isinstance(page_result, dict):
            data = page_result
        else:
            data = json.loads(page_result)

        print(f"Patient: {data['patient_name']}")
        print(f"Medications: {len(data['medications'])}")

        await client.ocr.delete_job(job.job_id)

asyncio.run(extract_with_schema())
```

### Output Formats

| Format                       | Description        | Use Case                                       | Result Type |
| ---------------------------- | ------------------ | ---------------------------------------------- | ----------- |
| `Format.STRUCTURED`          | Single JSON object | Extract specific fields across entire document | `dict`      |
| `Format.MARKDOWN`            | Text per page      | Convert document to readable text              | `str`       |

**Accessing Results:**

The `result` field in `PageResult` can be either:
- **String** (`str`) - for markdown format, contains the extracted text
- **Dictionary** (`dict`) - for structured format, contains the parsed JSON data

```python
# Handle both result types
for page in result.pages:
    if isinstance(page.result, dict):
        # Structured data - already parsed
        print(f"Invoice #: {page.result.get('invoice_number')}")
    else:
        # Markdown text
        print(f"Text: {page.result[:100]}...")
```

### Manual Job Management

```python
async def manual_polling():
    async with LeapOCR("your-api-key") as client:
        # Start processing
        job = await client.ocr.process_url(
            "https://example.com/document.pdf",
            options=ProcessOptions(format=Format.MARKDOWN, model=Model.STANDARD_V2),
        )

        print(f"Job created: {job.job_id}")

        # Poll for status
        import asyncio
        while True:
            status = await client.ocr.get_job_status(job.job_id)
            print(f"Status: {status.status.value} - {status.progress:.1f}%")

            if status.status.value == "completed":
                break

            await asyncio.sleep(2)

        # Get results
        result = await client.ocr.get_results(job.job_id)
        print(f"Processing complete: {len(result.pages)} pages")

asyncio.run(manual_polling())
```

### Progress Tracking

```python
from leapocr import PollOptions

async def track_progress():
    def progress_callback(status):
        print(f"Progress: {status.progress:.1f}% "
              f"({status.processed_pages}/{status.total_pages} pages)")

    async with LeapOCR("your-api-key") as client:
        # Submit job
        job = await client.ocr.process_file(
            "large-document.pdf",
            options=ProcessOptions(format=Format.MARKDOWN, model=Model.STANDARD_V2),
        )

        # Wait with progress tracking
        result = await client.ocr.wait_until_done(
            job.job_id,
            poll_options=PollOptions(
                poll_interval=2.0,
                max_wait=300.0,
                on_progress=progress_callback,
            ),
        )

asyncio.run(track_progress())
```

### Using Templates

Use pre-configured templates for common document types. Templates include predefined schemas, instructions, and model settings:

```python
async def use_template():
    async with LeapOCR("your-api-key") as client:
        # Use a pre-configured template by its slug
        job = await client.ocr.process_file(
            "invoice.pdf",
            options=ProcessOptions(
                template_slug="invoice-extraction",  # Reference existing template
            ),
        )

        result = await client.ocr.wait_until_done(job.job_id)

        # Access result (dict for structured templates, string for markdown)
        print(f"Extracted data: {result.pages[0].result}")

        await client.ocr.delete_job(job.job_id)

asyncio.run(use_template())
```

**Common Template Use Cases:**

- Invoice extraction
- Receipt processing
- Medical records
- Identity documents
- Custom organizational templates

### Deleting Jobs

**Automatic Deletion**: All jobs and their associated files are automatically deleted after **7 days** for security and storage management.

**Manual Deletion**: Delete jobs immediately after processing to remove sensitive data or free up resources sooner:

```python
async def delete_after_processing():
    async with LeapOCR("your-api-key") as client:
        # Submit and process document
        job = await client.ocr.process_file(
            "sensitive-doc.pdf",
            options=ProcessOptions(format=Format.MARKDOWN, model=Model.STANDARD_V2),
        )
        result = await client.ocr.wait_until_done(job.job_id)

        # Use the extracted data
        print(f"Extracted data: {result.pages[0].result}")

        # Delete job immediately (redacts content and marks as deleted)
        await client.ocr.delete_job(result.job_id)
        print("Job deleted successfully")

asyncio.run(delete_after_processing())
```

**Best Practices:**

- **Sensitive Data**: Delete jobs containing PII/PHI immediately after use
- **Compliance**: Manual deletion helps meet data retention requirements
- **Auto-Cleanup**: All jobs are automatically purged after 7 days regardless
- **Irreversible**: Deleted jobs cannot be recovered

### Concurrent Batch Processing

```python
import asyncio
import os
from leapocr import LeapOCR, ProcessOptions, Format, Model

async def batch_process():
    urls = [
        "https://example.com/doc1.pdf",
        "https://example.com/doc2.pdf",
        "https://example.com/doc3.pdf",
    ]

    async with LeapOCR("your-api-key") as client:
        # Submit all documents concurrently
        jobs = await asyncio.gather(*[
            client.ocr.process_url(
                url,
                options=ProcessOptions(format=Format.MARKDOWN, model=Model.STANDARD_V2),
            )
            for url in urls
        ])

        # Wait for all to complete
        results = await asyncio.gather(*[
            client.ocr.wait_until_done(job.job_id)
            for job in jobs
        ])

        total_credits = sum(r.credits_used for r in results)
        total_pages = sum(len(r.pages) for r in results)

        print(f"Processed {len(results)} documents")
        print(f"Total credits: {total_credits}")
        print(f"Total pages: {total_pages}")

        await asyncio.gather(*[
            client.ocr.delete_job(job.job_id)
            for job in jobs
        ])

asyncio.run(batch_process())
```

For more examples, see the [`examples/`](./examples) directory.

## Configuration

### Custom Configuration

```python
from leapocr import LeapOCR, ClientConfig

config = ClientConfig(
    base_url="https://api.leapocr.com/api/v1",
    timeout=120.0,
    max_retries=5,
    retry_delay=2.0,
    retry_multiplier=2.0,
)

async with LeapOCR("your-api-key", config) as client:
    # Use client with custom configuration
    pass
```

### Environment Variables

```bash
export LEAPOCR_API_KEY="your-api-key"
export OCR_BASE_URL="https://api.leapocr.com/api/v1"  # optional
```

## Error Handling

The SDK provides a comprehensive error hierarchy for robust error handling:

```python
from leapocr import (
    LeapOCRError,
    AuthenticationError,
    ValidationError,
    APIError,
    JobError,
    NetworkError,
)

async def handle_errors():
    try:
        async with LeapOCR("your-api-key") as client:
            result = await client.ocr.process_and_wait("document.pdf")

    except AuthenticationError as e:
        print(f"Authentication failed: {e.message}")

    except ValidationError as e:
        print(f"Invalid input: {e.message}")
        if e.field:
            print(f"  Field: {e.field}")

    except JobError as e:
        print(f"Job failed: {e.message}")
        print(f"  Job ID: {e.job_id}")

    except NetworkError as e:
        print(f"Network error: {e.message}")
        # Implement retry logic

    except APIError as e:
        print(f"API error: {e.message}")
        print(f"  Status code: {e.status_code}")

    except LeapOCRError as e:
        print(f"SDK error: {e.message}")

asyncio.run(handle_errors())
```

### Error Types

- `LeapOCRError` - Base class for all SDK errors
- `AuthenticationError` - Invalid or missing API key
- `RateLimitError` - Rate limit exceeded
- `ValidationError` - Invalid input parameters
- `FileError` - File-related errors (not found, too large, etc.)
- `JobError` - Job processing errors
- `JobFailedError` - Job processing failed
- `JobTimeoutError` - Job processing timed out
- `NetworkError` - Network connectivity issues
- `APIError` - API returned an error response
- `InsufficientCreditsError` - Not enough credits

## API Reference

### Core Classes

#### `LeapOCR`

Main client class for interacting with the LeapOCR API.

```python
class LeapOCR:
    def __init__(self, api_key: str, config: ClientConfig | None = None)
    async def close(self) -> None
    async def health(self) -> bool

    # Use as async context manager
    async with LeapOCR(api_key) as client:
        ...
```

#### `OCRService`

OCR operations service accessible via `client.ocr`.

```python
# Process file or URL (submit job)
async def process_file(
    file: str | Path | BinaryIO,
    options: ProcessOptions | None = None,
) -> ProcessResult

async def process_url(
    url: str,
    options: ProcessOptions | None = None,
) -> ProcessResult

# Wait for job completion
async def wait_until_done(
    job_id: str,
    poll_options: PollOptions | None = None,
) -> JobResult

# Job management
async def get_job_status(job_id: str) -> JobStatus
async def get_results(job_id: str, page: int = 1, limit: int = 100) -> JobResult
async def delete_job(job_id: str) -> dict[str, Any]
```

### Data Models

#### `ProcessOptions`

```python
@dataclass
class ProcessOptions:
    format: Format | None = None
    model: Model | None = None
    schema: dict[str, Any] | None = None
    instructions: str | None = None
    template_slug: str | None = None  # Use existing template by slug (handles format/model/schema/instructions)
    metadata: dict[str, str] = field(default_factory=dict)
```

#### `PollOptions`

```python
@dataclass
class PollOptions:
    poll_interval: float = 2.0  # seconds
    max_wait: float = 300.0  # seconds (5 minutes)
    on_progress: Callable[[JobStatus], None] | None = None
```

#### `ClientConfig`

```python
@dataclass
class ClientConfig:
    base_url: str = "https://api.leapocr.com/api/v1"
    timeout: float = 30.0
    max_retries: int = 3
    retry_delay: float = 1.0
    retry_multiplier: float = 2.0
```

#### `JobResult`

```python
@dataclass
class JobResult:
    job_id: str
    status: JobStatusType
    pages: list[PageResult]
    file_name: str
    total_pages: int
    processed_pages: int
    credits_used: int
    model: str
    result_format: str
    completed_at: datetime
    pagination: PaginationInfo | None = None
```

#### `PageResult`

```python
@dataclass
class PageResult:
    page_number: int
    result: str | dict[str, Any]  # String for markdown, dict for structured
    id: str | None = None
    confidence: float | None = None
```

**Note**: The `result` field type depends on the format:
- `Format.MARKDOWN` → `str` (plain text)
- `Format.STRUCTURED` → `dict` (parsed JSON object)

## Development

### Prerequisites

- Python 3.9+
- [uv](https://github.com/astral-sh/uv) (recommended) or pip
- [OpenAPI Generator](https://openapi-generator.tech/) (for code generation)
- Java 21+ (for OpenAPI Generator)

### Setup

```bash
# Clone the repository
git clone https://github.com/leapocr/leapocr-python.git
cd leapocr-python

# Install dependencies
make dev-install

# Or using uv directly
uv sync
```

### Common Tasks

```bash
make test               # Run unit tests
make test-cov           # Run tests with coverage
make test-integration   # Run integration tests (requires API key)
make lint               # Run linters
make format             # Format code
make type-check         # Run type checking
make check              # Run all checks (format, lint, type-check)
```

### Code Generation

The SDK includes generated client code from the OpenAPI specification:

```bash
make fetch-spec         # Download OpenAPI spec
make generate           # Generate client code
make regenerate         # Fetch + generate (full refresh)
```

### Running Tests

```bash
# Unit tests only
pytest tests/unit/

# Integration tests (requires API key)
export LEAPOCR_API_KEY="your-api-key"
pytest tests/integration/

# All tests with coverage
pytest tests/ --cov=leapocr --cov-report=html
```

### Project Structure

```
leapocr-python/
├── leapocr/               # Main package
│   ├── __init__.py        # Public API
│   ├── client.py          # Main client class
│   ├── ocr.py             # OCR service
│   ├── models.py          # Data models
│   ├── errors.py          # Error classes
│   ├── config.py          # Configuration
│   ├── _internal/         # Internal utilities
│   │   ├── retry.py       # Retry logic
│   │   ├── upload.py      # File upload
│   │   ├── polling.py     # Status polling
│   │   ├── validation.py  # Input validation
│   │   └── utils.py       # Common utilities
│   └── generated/         # Generated OpenAPI client
├── tests/
│   ├── unit/              # Unit tests
│   └── integration/       # Integration tests
├── examples/              # Usage examples
├── scripts/               # Development scripts
└── Makefile               # Common development tasks
```

## Contributing

We welcome contributions! Please follow these guidelines:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'feat: add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

### Code Standards

- Follow PEP 8 style guide
- Add type hints for all functions
- Write docstrings for public APIs
- Add tests for new features
- Update documentation as needed

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Support & Resources

- **Documentation**: [docs.leapocr.com](https://docs.leapocr.com)
- **API Reference**: [API Documentation](https://docs.leapocr.com/api)
- **Issues**: [GitHub Issues](https://github.com/leapocr/leapocr-python/issues)
- **Examples**: [examples/](./examples)
- **Website**: [leapocr.com](https://www.leapocr.com)
- **Support**: support@leapocr.com

---

**Version**: 0.0.4
