Metadata-Version: 2.4
Name: langchain-opendataloader-pdf
Version: 1.2.0
Summary: A LangChain integration for OpenDataLoader PDF
Project-URL: Homepage, https://github.com/opendataloader-project/opendataloader-pdf
Project-URL: Repository, https://github.com/opendataloader-project/langchain-opendataloader-pdf
Project-URL: Issues, https://github.com/opendataloader-project/langchain-opendataloader-pdf/issues
Author-email: opendataloader-project <open.dataloader@hancom.com>
License-Expression: MIT
License-File: LICENSE
Keywords: ai,dataloader,document,document loader,document-parsing,langchain,pdf
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Python: <4.0,>=3.10
Requires-Dist: langchain-core<2.0,>=1.0
Requires-Dist: opendataloader-pdf>=1.5.1
Description-Content-Type: text/markdown

# langchain-opendataloader-pdf

LangChain document loader for [OpenDataLoader PDF](https://github.com/opendataloader-project/opendataloader-pdf) — parse PDFs into structured `Document` objects for RAG pipelines.

[![PyPI version](https://img.shields.io/pypi/v/langchain-opendataloader-pdf.svg)](https://pypi.org/project/langchain-opendataloader-pdf/)
[![License](https://img.shields.io/pypi/l/langchain-opendataloader-pdf.svg)](https://github.com/opendataloader-project/langchain-opendataloader-pdf/blob/main/LICENSE)

## Features

- **Accurate reading order** — XY-Cut++ algorithm handles multi-column layouts correctly
- **Table extraction** — Preserves table structure in output
- **Multiple formats** — Text, Markdown, JSON, HTML
- **100% local** — No cloud APIs, your documents never leave your machine
- **Fast** — Rule-based extraction, no GPU required

## Requirements

- Python >= 3.10
- Java 11+ available on system `PATH`

## Installation

```bash
pip install -U langchain-opendataloader-pdf
```

## Quick Start

```python
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

# Load a PDF as text
loader = OpenDataLoaderPDFLoader(
    file_path="document.pdf",
    format="text"
)
documents = loader.load()

print(documents[0].page_content)
```

## Usage Examples

### Basic Usage

```python
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

# Single file
loader = OpenDataLoaderPDFLoader(file_path="report.pdf")
docs = loader.load()

# Multiple files and directories
loader = OpenDataLoaderPDFLoader(
    file_path=["report1.pdf", "report2.pdf", "documents/"]
)
docs = loader.load()
```
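
The loader accepts directories directly, but if only some files in a directory should be parsed, the path list can be built up front. The sketch below assumes a hypothetical helper `collect_pdfs` and a hypothetical `draft_` filename prefix to skip; neither comes from this package.

```python
from pathlib import Path


def collect_pdfs(root: str, exclude_prefix: str = "draft_") -> list[str]:
    """Return sorted PDF paths under `root`, skipping names that start with `exclude_prefix`."""
    paths = [
        str(p)
        for p in Path(root).rglob("*")  # walk subdirectories as well
        if p.is_file()
        and p.suffix.lower() == ".pdf"  # match .pdf and .PDF
        and not p.name.startswith(exclude_prefix)
    ]
    return sorted(paths)
```

The result can then be passed as `file_path=collect_pdfs("documents/")`, since `file_path` accepts a list of strings.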

### Output Formats

```python
# Plain text (default) - best for simple RAG
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="text")

# Markdown - preserves headings, lists, tables
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="markdown")

# JSON - structured data with bounding boxes
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="json")

# HTML - styled output
loader = OpenDataLoaderPDFLoader(file_path="doc.pdf", format="html")
```

### Tagged PDF Support

For accessible PDFs with structure tags (common in government/legal documents):

```python
loader = OpenDataLoaderPDFLoader(
    file_path="accessible_document.pdf",
    use_struct_tree=True  # Use native PDF structure
)
```

### Table Detection

For documents with complex tables:

```python
loader = OpenDataLoaderPDFLoader(
    file_path="financial_report.pdf",
    format="markdown",
    table_method="cluster"  # Better for borderless tables
)
```

### Password-Protected PDFs

```python
loader = OpenDataLoaderPDFLoader(
    file_path="encrypted.pdf",
    password="secret123"
)
```

### Image Handling

```python
# Images are excluded by default (image_output="off")
# This is optimal for text-based RAG pipelines

# Embed images as Base64 (for multimodal RAG)
loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    format="markdown",
    image_output="embedded",
    image_format="jpeg"  # or "png"
)

# Save images as files to a local directory
loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    format="markdown",
    image_output="external",
    image_dir="./images",   # images saved here; defaults to temp dir if not set
    image_format="png"
)
```

### Suppress Logging

```python
loader = OpenDataLoaderPDFLoader(
    file_path="doc.pdf",
    quiet=True  # No console output
)
```

## RAG Pipeline Example

```python
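# Extra packages needed for this example:
#   pip install langchain-text-splitters langchain-openai langchain-community faiss-cpu
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter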
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Load PDF
loader = OpenDataLoaderPDFLoader(
    file_path="knowledge_base.pdf",
    format="markdown",
    quiet=True
)
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(documents)

# Create vector store (OpenAIEmbeddings reads OPENAI_API_KEY from the environment)
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)

# Query
results = vectorstore.similarity_search("What is the main topic?")
```
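
To see how `chunk_size` and `chunk_overlap` interact, the sketch below cuts a string into fixed character windows that share an overlap. It is a simplified illustration only: `RecursiveCharacterTextSplitter` splits on separators and merges the pieces, so its chunk boundaries will differ. The name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters sharing `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

With the values used above (`chunk_size=1000`, `chunk_overlap=200`), each window in this simplified model would advance by 800 characters.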

## Parameters Reference

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `file_path` | `str \| List[str]` | — | **(Required)** PDF file path(s) or directories |
| `format` | `str` | `"text"` | Output format: `"text"`, `"markdown"`, `"json"`, `"html"` |
| `split_pages` | `bool` | `True` | Split into separate Documents per page |
| `quiet` | `bool` | `False` | Suppress console logging |
| `password` | `str` | `None` | Password for encrypted PDFs |
| `use_struct_tree` | `bool` | `False` | Use PDF structure tree (tagged PDFs) |
| `table_method` | `str` | `"default"` | `"default"` (border-based) or `"cluster"` (border + clustering) |
| `reading_order` | `str` | `"xycut"` | `"xycut"` or `"off"` |
| `keep_line_breaks` | `bool` | `False` | Preserve original line breaks |
| `image_output` | `str` | `"off"` | `"off"`, `"embedded"` (Base64), or `"external"` |
| `image_format` | `str` | `"png"` | `"png"` or `"jpeg"` |
| `image_dir` | `str` | `None` | Directory for extracted images when using `image_output="external"` |
| `content_safety_off` | `List[str]` | `None` | Disable safety filters: `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"`, `"all"` |
| `replace_invalid_chars` | `str` | `None` | Replacement for invalid characters |

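Several parameters in the table accept only a fixed set of strings. When loader options come from user input or a config file, they can be checked before constructing the loader. The sketch below copies the allowed values from the table; `ALLOWED_VALUES` and `checked_loader_kwargs` are hypothetical names, and the loader may perform its own validation independently of this.

```python
# Enumerated options, copied from the Parameters Reference table.
ALLOWED_VALUES = {
    "format": {"text", "markdown", "json", "html"},
    "table_method": {"default", "cluster"},
    "reading_order": {"xycut", "off"},
    "image_output": {"off", "embedded", "external"},
    "image_format": {"png", "jpeg"},
}


def checked_loader_kwargs(**kwargs) -> dict:
    """Return `kwargs` unchanged after checking enumerated options."""
    for name, value in kwargs.items():
        allowed = ALLOWED_VALUES.get(name)
        # Parameters without a fixed set of values (e.g. file_path) pass through.
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return kwargs


kwargs = checked_loader_kwargs(
    file_path="doc.pdf", format="markdown", table_method="cluster"
)
print(kwargs)
```

The checked dictionary can then be unpacked into the constructor as `OpenDataLoaderPDFLoader(**kwargs)`.
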
## Document Metadata

Each returned `Document` includes metadata:

```python
doc.metadata
# {'source': 'document.pdf', 'format': 'text', 'page': 1}
```
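
When several files are loaded at once, the `source` and `page` keys make it possible to regroup the results per file. The sketch below uses stand-in objects with the same `page_content` and `metadata` attributes as a LangChain `Document`, so it runs without a PDF. `pages_by_source` is a hypothetical helper, and it assumes `page` may be absent (for example, if pages are not split).

```python
from collections import defaultdict
from types import SimpleNamespace


def pages_by_source(documents) -> dict[str, list[str]]:
    """Group page texts by metadata["source"], ordered by metadata["page"]."""
    grouped = defaultdict(list)
    for doc in documents:
        page = doc.metadata.get("page", 0)  # assumption: "page" may be missing
        grouped[doc.metadata["source"]].append((page, doc.page_content))
    return {
        source: [text for _, text in sorted(pages, key=lambda item: item[0])]
        for source, pages in grouped.items()
    }


# Stand-ins mimicking Document objects returned by loader.load().
docs = [
    SimpleNamespace(page_content="second page", metadata={"source": "a.pdf", "format": "text", "page": 2}),
    SimpleNamespace(page_content="first page", metadata={"source": "a.pdf", "format": "text", "page": 1}),
    SimpleNamespace(page_content="only page", metadata={"source": "b.pdf", "format": "text", "page": 1}),
]
print(pages_by_source(docs))
# {'a.pdf': ['first page', 'second page'], 'b.pdf': ['only page']}
```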

## License

MIT License. See [LICENSE](LICENSE) for details.

## Links

- [OpenDataLoader PDF](https://github.com/opendataloader-project/opendataloader-pdf) — Core PDF parsing engine
- [LangChain Python Docs](https://python.langchain.com/docs/integrations/document_loaders/opendataloader_pdf/) — Python API reference
- [LangChain Integration Guide](https://docs.langchain.com/oss/python/integrations/document_loaders/opendataloader_pdf) — Integration documentation
- [PyPI Package](https://pypi.org/project/langchain-opendataloader-pdf/)
