Metadata-Version: 2.4
Name: retab
Version: 0.0.110
Summary: Retab official Python library
Home-page: https://github.com/retab-dev/retab
Author: Retab
Author-email: contact@retab.com
Project-URL: Team website, https://retab.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS
Classifier: Intended Audience :: Science/Research
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: Pillow
Requires-Dist: httpx
Requires-Dist: pydantic
Requires-Dist: pydantic_core
Requires-Dist: requests
Requires-Dist: backoff
Requires-Dist: numpy
Requires-Dist: rich
Requires-Dist: puremagic
Requires-Dist: fastapi
Requires-Dist: pycountry
Requires-Dist: phonenumbers
Requires-Dist: email_validator
Requires-Dist: python-stdnum
Requires-Dist: nanoid
Requires-Dist: openai
Requires-Dist: anthropic
Requires-Dist: google-genai
Requires-Dist: tiktoken
Requires-Dist: truststore
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: project-url
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Retab Python SDK

Official Python SDK for Retab document extraction.

## Installation

```bash
pip install retab
```

The client reads `RETAB_API_KEY` from the environment by default.
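
If the variable may be missing (for example in CI), it can help to fail early with a clear message. The helper below is a minimal sketch and not part of the SDK; `require_api_key` is a hypothetical name.

```python
import os


def require_api_key(env_var: str = "RETAB_API_KEY") -> str:
    """Return the API key from the environment or raise a clear error."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before creating the client")
    return key
```

The returned string can then be passed as `api_key=` when constructing the client.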

## Quick Start

```python
import os

from retab import Retab

client = Retab(api_key=os.environ["RETAB_API_KEY"])

invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number", "total_amount"],
}

result = client.documents.extract(
    json_schema=invoice_schema,
    document="invoice.pdf",
    model="retab-micro",
)

print(result.data)
print(result.text)
print(result.likelihoods)
print(result.extraction_id)
```

`documents.extract(...)` returns a `RetabParsedChatCompletion`.

- `result.data` is the parsed structured output
- `result.text` is the raw JSON string
- `result.likelihoods` mirrors the extracted structure with confidence signals
- `result.extraction_id` can be used with the `extractions` API later
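
Because `result.likelihoods` mirrors the extracted structure, one way to use it is to walk it and flag weak fields for review. The sketch below assumes the leaves are numeric scores between 0 and 1, which this README does not specify; `low_confidence_fields` and the `0.8` threshold are illustrative choices, not SDK features.

```python
from typing import Any, Iterator, Tuple


def low_confidence_fields(
    likelihoods: Any, threshold: float = 0.8, prefix: str = ""
) -> Iterator[Tuple[str, float]]:
    """Yield (path, score) for every numeric leaf below `threshold`."""
    if isinstance(likelihoods, dict):
        for key, value in likelihoods.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from low_confidence_fields(value, threshold, path)
    elif isinstance(likelihoods, list):
        for index, value in enumerate(likelihoods):
            yield from low_confidence_fields(value, threshold, f"{prefix}[{index}]")
    elif isinstance(likelihoods, (int, float)) and not isinstance(likelihoods, bool):
        # Assumption: leaves are scores in [0, 1]; lower means less certain.
        if likelihoods < threshold:
            yield prefix, float(likelihoods)


# Stand-in data shaped like an extraction result, for illustration only.
sample = {"invoice_number": 0.99, "total_amount": 0.62, "line_items": [{"price": 0.4}]}
flagged = list(low_confidence_fields(sample))
print(flagged)
```

With a real result, `low_confidence_fields(result.likelihoods)` would be the equivalent call, provided the likelihoods have been converted to plain dicts and lists.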

## What `extract` Accepts

`json_schema` can be:

- a Python `dict`
- a path to a JSON schema file
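
Since both forms are accepted, a small pre-flight check can catch schema mistakes before any request is sent. This is a sketch under stated assumptions: `load_schema` is a hypothetical helper, and the requirement that the top-level type be `object` simply reflects the examples in this README rather than a documented rule.

```python
import json
from pathlib import Path
from typing import Any, Dict, Union


def load_schema(schema: Union[Dict[str, Any], str, Path]) -> Dict[str, Any]:
    """Return the schema as a dict, reading it from disk when given a path."""
    if isinstance(schema, dict):
        loaded = schema
    else:
        loaded = json.loads(Path(schema).read_text(encoding="utf-8"))
    # Assumption of this sketch: extraction schemas describe a top-level object.
    if not isinstance(loaded, dict) or loaded.get("type") != "object":
        raise ValueError("expected a JSON schema whose top-level type is 'object'")
    # Every required field should also be declared under "properties".
    missing = set(loaded.get("required", [])) - set(loaded.get("properties", {}))
    if missing:
        raise ValueError(f"required fields not declared in properties: {sorted(missing)}")
    return loaded
```

The returned dict can be passed directly as `json_schema`.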

`document` can be:

- a local file path
- a file-like object
- a URL
- `MIMEData`

Useful extraction options:

- `n_consensus`: run multiple passes and reconcile the result
- `image_resolution_dpi`: control image rendering quality for vision models
- `metadata`: attach your own tags for later filtering
- `additional_messages`: add extra instructions or context after the document content
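
The options above can be combined in a single call. The wrapper below is a sketch: the parameter names come from this README, while the values (`3`, `150`, the metadata tag, and the message text) are illustrative assumptions, and `extract_with_options` is a hypothetical helper.

```python
from typing import Any, Dict


def extract_with_options(client: Any, schema: Dict[str, Any], document: str) -> Any:
    """Call `documents.extract` with each optional argument set explicitly."""
    return client.documents.extract(
        json_schema=schema,
        document=document,
        model="retab-micro",
        n_consensus=3,  # illustrative: several passes, reconciled into one result
        image_resolution_dpi=150,  # illustrative value
        metadata={"batch_id": "march-2026"},  # your own tags, for later filtering
        additional_messages=[
            {"role": "user", "content": "Dates are written as DD/MM/YYYY."},
        ],
    )
```

Pass a `Retab` client as the first argument; the function returns whatever `documents.extract` returns.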

## Async Extraction

```python
import asyncio
import os

from retab import AsyncRetab


async def main() -> None:
    client = AsyncRetab(api_key=os.environ["RETAB_API_KEY"])

    async with client:
        result = await client.documents.extract(
            json_schema={
                "type": "object",
                "properties": {
                    "booking_reference": {"type": "string"},
                    "guest_name": {"type": "string"},
                },
            },
            document="booking-confirmation.pdf",
            model="retab-micro",
        )

    print(result.data)


asyncio.run(main())
```
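
The async client is most useful when several documents are processed at once. The following is a minimal sketch of bounded concurrency with `asyncio`; `extract_many` and `max_concurrency` are hypothetical names, and `extract_one` stands for any coroutine function you write around `client.documents.extract(...)`.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Sequence


async def extract_many(
    extract_one: Callable[[str], Awaitable[Any]],
    documents: Sequence[str],
    max_concurrency: int = 4,
) -> List[Any]:
    """Run `extract_one` over every document, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(document: str) -> Any:
        async with semaphore:
            return await extract_one(document)

    # gather preserves input order, so results line up with `documents`.
    return await asyncio.gather(*(guarded(doc) for doc in documents))
```

The cap keeps a large batch from opening every request at once; a suitable limit depends on your account's rate limits, which this README does not state.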

## Streaming Extraction

`extract_stream(...)` yields partial `RetabParsedChatCompletion` objects as the JSON fills in.

```python
from retab import Retab

client = Retab()

with client.documents.extract_stream(
    json_schema={
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "total_amount": {"type": "number"},
        },
    },
    document="invoice.pdf",
    model="retab-micro",
) as stream:
    for partial in stream:
        print(partial.data)
```

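For async code, use `async with` and `async for` inside a coroutine, with `client` an `AsyncRetab` instance and `invoice_schema` defined as in Quick Start: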

```python
async with client.documents.extract_stream(
    json_schema=invoice_schema,
    document="invoice.pdf",
    model="retab-micro",
) as stream:
    async for partial in stream:
        print(partial.data)
```
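
If only the finished result matters, the partials can be consumed and the last one kept. This sketch assumes the final partial carries the complete data, which the README implies but does not state outright; `final_data` is a hypothetical helper.

```python
from typing import Any, Iterable, Optional


def final_data(stream: Iterable[Any]) -> Optional[Any]:
    """Consume a stream of partial results and return the last `.data` seen."""
    last = None
    for partial in stream:
        # Each partial replaces the previous one as more of the JSON fills in.
        last = partial.data
    return last
```

Inside the `with ... as stream:` block, `final_data(stream)` would replace the explicit `for` loop. An async variant would follow the same shape with `async for`.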

## Adding Context with `additional_messages`

`additional_messages` accepts the chat message structure exercised in the SDK's test suite: plain text messages, system or developer guidance, and multipart content.

```python
result = client.documents.extract(
    json_schema=invoice_schema,
    document="invoice.pdf",
    model="retab-micro",
    additional_messages=[
        {
            "role": "developer",
            "content": "Extract values exactly as written. Do not normalize vendor names.",
        },
        {
            "role": "user",
            "content": "Focus on invoice number, invoice date, and total amount due.",
        },
    ],
)
```

## Working with Stored Extractions

Every extraction can be retrieved later through `client.extractions`.

```python
result = client.documents.extract(
    json_schema=invoice_schema,
    document="invoice.pdf",
    model="retab-micro",
    metadata={"batch_id": "march-2026"},
)

stored = client.extractions.get(result.extraction_id)
print(stored.predictions)

page_sources = client.extractions.sources(result.extraction_id)
print(page_sources.sources)

recent = client.extractions.list(limit=20, metadata={"batch_id": "march-2026"})
for item in recent.items:
    print(item.id, item.file.filename)
```

`client.extractions.download(...)` returns a pre-signed download URL for `jsonl`, `csv`, or `xlsx` exports.
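
Once you have the URL as a string, saving the export needs only the standard library. This README does not show how the URL is exposed on the return value of `download(...)`, so the sketch starts from a plain URL; `save_export` is a hypothetical helper.

```python
import shutil
import urllib.request
from pathlib import Path


def save_export(url: str, destination: str) -> Path:
    """Stream the content behind `url` into `destination` and return its path."""
    target = Path(destination)
    # A pre-signed URL carries its own authorization, so a plain GET is enough.
    with urllib.request.urlopen(url) as response, target.open("wb") as handle:
        shutil.copyfileobj(response, handle)
    return target
```

Pre-signed URLs typically expire, so it is safest to download soon after requesting one.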

## Workflows

The Python SDK also supports workflow discovery, execution, and step inspection.

```python
from pathlib import Path

from retab import Retab

client = Retab()

workflow = client.workflows.get_entities("wf_abc123")
document_start_id = workflow.start_nodes[0].id

run = client.workflows.runs.create(
    workflow_id=workflow.workflow.id,
    documents={document_start_id: Path("invoice.pdf")},
)

run = client.workflows.runs.wait_for_completion(run.id, poll_interval_seconds=1.0)
run.raise_for_status()

print(run.output)

step = client.workflows.runs.steps.get(run.id, "extract-node-id")
print(step.extracted_data)
```

Useful workflow helpers:

- `client.workflows.get_entities(workflow_id)` returns the workflow graph and exposes `.start_nodes` and `.start_json_nodes`
- `client.workflows.runs.wait_for_completion(run.id)` polls until the run reaches `completed`, `error`, or `cancelled`
- `client.workflows.runs.steps.get(run.id, node_id)` returns typed handle inputs and outputs
- `client.workflows.runs.steps.get_all(run)` fetches step outputs for every node in one call
- `client.workflows.blocks.*` and `client.workflows.edges.*` let you create or update workflow graphs from code
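
To make the polling behavior concrete, here is a generic sketch of the pattern `wait_for_completion` describes. It is not the SDK's implementation: `poll_until_terminal`, `fetch_status`, and `timeout_seconds` are hypothetical, and only the three terminal statuses come from this README.

```python
import time
from typing import Callable

# Terminal statuses as listed for `wait_for_completion`.
TERMINAL_STATUSES = {"completed", "error", "cancelled"}


def poll_until_terminal(
    fetch_status: Callable[[], str],
    poll_interval_seconds: float = 1.0,
    timeout_seconds: float = 300.0,
) -> str:
    """Call `fetch_status` repeatedly until it returns a terminal status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still '{status}' after {timeout_seconds} seconds")
        time.sleep(poll_interval_seconds)
```

A shorter interval notices completion sooner at the cost of more requests.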

## Notes

- `n_consensus=1` is the fastest option
- higher `n_consensus` usually improves robustness on noisy or ambiguous documents
- if schema validation fails, `result.choices[0].message.parsed` may be `None`
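
The last note suggests checking for `None` before using the parsed output. A minimal guard might look like the sketch below; `require_parsed` is a hypothetical helper, and the attribute path is the one quoted in the note.

```python
from typing import Any


def require_parsed(result: Any) -> Any:
    """Return the parsed output, or raise if schema validation left it empty."""
    parsed = result.choices[0].message.parsed
    if parsed is None:
        raise ValueError(
            "extraction returned no parsed output; inspect result.text for the raw JSON"
        )
    return parsed
```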

## Links

- Docs: [https://docs.retab.com](https://docs.retab.com)
- API reference: [https://docs.retab.com/api-reference/introduction](https://docs.retab.com/api-reference/introduction)
- Repository: [https://github.com/retab-dev/retab](https://github.com/retab-dev/retab)
