Metadata-Version: 2.4
Name: langchain-docdigitizer
Version: 0.2.0
Summary: LangChain document loader for the DocDigitizer document processing API
Project-URL: Homepage, https://github.com/DocDigitizer/dd-v3-integrations
Project-URL: Documentation, https://github.com/DocDigitizer/dd-v3-integrations/tree/main/integrations/langchain/python
Project-URL: Repository, https://github.com/DocDigitizer/dd-v3-integrations
Author-email: DocDigitizer <support@docdigitizer.com>
License-Expression: MIT
Keywords: docdigitizer,document-loader,extraction,langchain,ocr,pdf
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Requires-Dist: docdigitizer>=0.2.0
Requires-Dist: langchain-core>=0.3.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: pyyaml>=6.0; extra == 'dev'
Description-Content-Type: text/markdown

# langchain-docdigitizer

LangChain document loader for the [DocDigitizer](https://docdigitizer.com) document processing API.

> **v0.1.x is deprecated.** Upgrade to v0.2.0+ for the new API endpoint. The previous endpoint (`https://apix.docdigitizer.com/sync`) will be removed in a future release.

## Installation

```bash
pip install langchain-docdigitizer
```

## Usage

```python
from langchain_docdigitizer import DocDigitizerLoader

# Load a single PDF
loader = DocDigitizerLoader(api_key="dd_live_...")
docs = loader.load("invoice.pdf")

print(docs[0].page_content)       # JSON with extracted fields
print(docs[0].metadata)           # document_type, confidence, country_code, etc.

# Load all PDFs from a directory
loader = DocDigitizerLoader(api_key="dd_live_...", file_path="invoices/")
docs = loader.load()

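# Use in a RAG pipeline (also requires the langchain-text-splitters,
# langchain-openai, langchain-community, and faiss-cpu packages)
from langchain_text_splitters import RecursiveCharacterTextSplitter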
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
chunks = splitter.split_documents(docs)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())
```

## Configuration

| Parameter | Environment Variable | Default |
|-----------|---------------------|---------|
| `api_key` | `DOCDIGITIZER_API_KEY` | — |
| `base_url` | `DOCDIGITIZER_BASE_URL` | `https://api.docdigitizer.com/v3/docingester` |
| `timeout` | `DOCDIGITIZER_TIMEOUT` | `300` |
| `max_retries` | — | `3` |
| `pipeline` | — | `None` |
| `content_format` | — | `"json"` |
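
If you prefer to resolve settings explicitly rather than rely on the loader reading the environment, the table can be mirrored in a small helper. This is a minimal sketch: `loader_kwargs_from_env` is a hypothetical function, not part of the package, and treating the timeout as an integer is an assumption.

```python
import os

# Defaults copied from the configuration table.
DEFAULT_BASE_URL = "https://api.docdigitizer.com/v3/docingester"
DEFAULT_TIMEOUT = 300


def loader_kwargs_from_env(env=None):
    """Collect loader settings from environment variables.

    Hypothetical helper for illustration; it only covers the three
    parameters that the table lists with an environment variable.
    """
    env = os.environ if env is None else env

    api_key = env.get("DOCDIGITIZER_API_KEY")
    if not api_key:
        # api_key has no default, so fail early with a clear message.
        raise ValueError("DOCDIGITIZER_API_KEY is not set")

    return {
        "api_key": api_key,
        "base_url": env.get("DOCDIGITIZER_BASE_URL", DEFAULT_BASE_URL),
        # Assumption: the timeout is an integer value.
        "timeout": int(env.get("DOCDIGITIZER_TIMEOUT", DEFAULT_TIMEOUT)),
    }
```

The result can then be passed on as `DocDigitizerLoader(**loader_kwargs_from_env())`.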

### Content Formats

- `"json"` (default): `page_content` is a JSON string of extracted fields
- `"text"`: `page_content` is `key: value` pairs separated by newlines
- `"kv"`: `page_content` is `key=value` pairs separated by newlines
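
To make the three shapes concrete, the sketch below renders one dictionary in each format. It only illustrates the descriptions above and is not the loader's own serializer; the `render` function and the field names are hypothetical, and the exact JSON layout the loader emits may differ.

```python
import json

# Hypothetical extracted fields; real names depend on the document type.
fields = {"invoice_number": "INV-001", "total": "100.00"}


def render(fields, content_format="json"):
    """Render extracted fields in one of the three documented shapes."""
    if content_format == "json":
        return json.dumps(fields)
    if content_format == "text":
        return "\n".join(f"{key}: {value}" for key, value in fields.items())
    if content_format == "kv":
        return "\n".join(f"{key}={value}" for key, value in fields.items())
    raise ValueError(f"unknown content_format: {content_format!r}")


for name in ("json", "text", "kv"):
    print(f"--- {name} ---")
    print(render(fields, name))
```

The `"json"` form is the easiest to parse back with `json.loads`, while the two line-oriented forms tend to read more naturally when the content is embedded or shown to a model.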

## Document Metadata

Each returned LangChain `Document` carries the following metadata:

| Field | Type | Description |
|-------|------|-------------|
| `source` | `str` | File path of the processed PDF |
| `document_type` | `str` | Detected document type (e.g., "Invoice") |
| `confidence` | `float` | Classification confidence (0-1) |
| `country_code` | `str` | Detected country code (e.g., "PT") |
| `pages` | `list[int]` | Page numbers where the document was found |
| `page_range` | `dict` | Start and end of the page range |
| `trace_id` | `str` | Unique trace identifier |
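
These fields can be used to filter results before indexing them. The sketch below keeps only documents of a given type above a confidence threshold. `filter_documents` is a hypothetical helper, the sample values (including the "Receipt" type) are made up, and `SimpleNamespace` objects stand in for real `Document` instances so the example runs without the package installed.

```python
from types import SimpleNamespace


def filter_documents(docs, document_type=None, min_confidence=0.0):
    """Keep documents matching the type and the confidence threshold."""
    kept = []
    for doc in docs:
        meta = doc.metadata
        if document_type is not None and meta.get("document_type") != document_type:
            continue
        # Treat a missing confidence as 0 so such documents are dropped
        # whenever a positive threshold is requested.
        if meta.get("confidence", 0.0) < min_confidence:
            continue
        kept.append(doc)
    return kept


# Stand-ins for loader output; all values are invented for illustration.
docs = [
    SimpleNamespace(page_content="{}", metadata={
        "source": "a.pdf", "document_type": "Invoice", "confidence": 0.97}),
    SimpleNamespace(page_content="{}", metadata={
        "source": "b.pdf", "document_type": "Invoice", "confidence": 0.41}),
    SimpleNamespace(page_content="{}", metadata={
        "source": "c.pdf", "document_type": "Receipt", "confidence": 0.93}),
]

invoices = filter_documents(docs, document_type="Invoice", min_confidence=0.8)
print([doc.metadata["source"] for doc in invoices])
```

Because the helper only reads the `metadata` attribute, it should work unchanged on the list returned by `loader.load()`.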

## License

MIT
