Metadata-Version: 2.1
Name: docprompt
Version: 0.1.4
Summary: Documents and large language models.
Home-page: https://github.com/Page-Leaf/docprompt
License: Apache-2.0
Author: Frankie Colson
Author-email: frank@pageleaf.io
Requires-Python: >=3.9,<3.13
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.8
Provides-Extra: azure
Provides-Extra: dev
Provides-Extra: doc
Provides-Extra: google
Provides-Extra: modeling
Provides-Extra: test
Requires-Dist: azure-ai-formrecognizer (>=3.3.0) ; extra == "azure"
Requires-Dist: black (>=23.10.0,<24.0.0) ; extra == "test"
Requires-Dist: bump2version (>=1.0.1,<2.0.0) ; extra == "dev"
Requires-Dist: flake8 (>=6.1.0,<7.0.0) ; extra == "test"
Requires-Dist: flake8-docstrings (>=1.7.0,<2.0.0) ; extra == "test"
Requires-Dist: fsspec (>=2023.10.0,<2024.0.0)
Requires-Dist: google-cloud-documentai (>=2.20.0) ; extra == "google"
Requires-Dist: isort (>=5.12.0,<6.0.0) ; extra == "test"
Requires-Dist: mkdocs (>=1.1.2,<2.0.0) ; extra == "doc"
Requires-Dist: mkdocs-autorefs (>=0.2.1,<0.3.0) ; extra == "doc"
Requires-Dist: mkdocs-include-markdown-plugin (>=1.0.0,<2.0.0) ; extra == "doc"
Requires-Dist: mkdocs-material (>=6.1.7,<7.0.0) ; extra == "doc"
Requires-Dist: mkdocs-material-extensions (>=1.0.1,<2.0.0)
Requires-Dist: mkdocstrings (>=0.15.2,<0.16.0) ; extra == "doc"
Requires-Dist: mypy (>=1.6.1,<2.0.0) ; extra == "test"
Requires-Dist: numpy (>=1.26.1,<2.0.0) ; extra == "modeling"
Requires-Dist: pikepdf (>=8.11.2,<9.0.0)
Requires-Dist: pillow (>=9.0.1)
Requires-Dist: pip (>=20.3.1,<21.0.0) ; extra == "dev"
Requires-Dist: pre-commit (>=2.12.0,<3.0.0) ; extra == "dev"
Requires-Dist: pydantic (>=2.1.0)
Requires-Dist: pypdf (>=3.16.4,<4.0.0)
Requires-Dist: pytest (>=7.4.2,<8.0.0) ; extra == "test"
Requires-Dist: pytest-cov (>=4.1.0,<5.0.0) ; extra == "test"
Requires-Dist: python-dateutil (>=2.8.2,<3.0.0)
Requires-Dist: python-magic (>=0.4.24)
Requires-Dist: tenacity (>=8.2.3,<9.0.0)
Requires-Dist: toml (>=0.10.2,<0.11.0) ; extra == "dev"
Requires-Dist: torch (>=2.1.0,<3.0.0) ; extra == "modeling"
Requires-Dist: tox (>=3.20.1,<4.0.0) ; extra == "dev"
Requires-Dist: tqdm (>=4.61.0)
Requires-Dist: transformers (>=4.34.1,<5.0.0) ; extra == "modeling"
Requires-Dist: twine (>=3.3.0,<4.0.0) ; extra == "dev"
Requires-Dist: virtualenv (>=20.2.2,<21.0.0) ; extra == "dev"
Description-Content-Type: text/markdown

# Docprompt

Docprompt is a library for Document AI. It aims to make enterprise-level document analysis easy by using the zero-shot capabilities of large language models, and it provides a toolset for working with various document formats.

## Supercharged Document Analysis

* Common utilities for interacting with PDFs
  * PDF loading and serialization
  * PDF byte compression using Ghostscript :ghost:
  * Fast rasterization :fire: :rocket:
  * Page splitting, re-export with PikePDF
* Support for most OCR providers with batched inference
  * Google :white_check_mark:
  * Azure Document Intelligence :red_circle:
  * Amazon Textract :red_circle:
  * Tesseract :red_circle:



[![pypi](https://img.shields.io/pypi/v/docprompt.svg)](https://pypi.org/project/docprompt/)
[![python](https://img.shields.io/pypi/pyversions/docprompt.svg)](https://pypi.org/project/docprompt/)
[![Build Status](https://github.com/psu3d0/docprompt/actions/workflows/dev.yml/badge.svg)](https://github.com/psu3d0/docprompt/actions/workflows/dev.yml)
[![codecov](https://codecov.io/gh/psu3d0/docprompt/branch/main/graphs/badge.svg)](https://codecov.io/github/psu3d0/docprompt)



Documents and large language models


* Documentation: <https://docprompt.dev>
* GitHub: <https://github.com/Page-Leaf/docprompt>
* PyPI: <https://pypi.org/project/docprompt/>
* Free software: Apache-2.0


## Features

* Representations for common document layout types such as `TextBlock` and `BoundingBox`
* Generic implementations of OCR providers
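
To make the idea of a layout representation concrete, here is a minimal sketch of what a bounding box and a text block might look like. The `SketchBoundingBox` and `SketchTextBlock` dataclasses are hypothetical and use only the standard library; they are not Docprompt's actual `BoundingBox` or `TextBlock` classes, whose fields and methods may differ. The sketch also assumes that coordinates are normalized to the page (0.0 to 1.0).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchBoundingBox:
    """Hypothetical box with page-normalized corners (0.0 to 1.0)."""

    x0: float
    top: float
    x1: float
    bottom: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.bottom - self.top

    def to_pixels(self, page_width: int, page_height: int) -> tuple[int, int, int, int]:
        """Scale the normalized corners to pixel coordinates of a raster."""
        return (
            round(self.x0 * page_width),
            round(self.top * page_height),
            round(self.x1 * page_width),
            round(self.bottom * page_height),
        )


@dataclass(frozen=True)
class SketchTextBlock:
    """Hypothetical text block: recognized text plus where it sits on the page."""

    text: str
    bounding_box: SketchBoundingBox


block = SketchTextBlock(
    text="Total due",
    bounding_box=SketchBoundingBox(x0=0.25, top=0.5, x1=0.75, bottom=0.625),
)

# Map the block onto a page rasterized at 1000 x 800 pixels
print(block.bounding_box.to_pixels(1000, 800))
```

Normalized coordinates keep a block's position independent of the DPI used for rasterization, which is one plausible reason to store them this way.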

## Installation

Use the package manager [pip](https://pip.pypa.io/en/stable/) to install Docprompt.

```bash
pip install docprompt
```

To include the dependencies for an OCR provider, install the matching extra:

```bash
pip install "docprompt[google]"
```
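
Because provider dependencies are optional extras, code that relies on one may want to confirm it is installed before importing a provider. The following is a minimal sketch using only the standard library. The helper name `module_available` is hypothetical and not part of Docprompt, and the check assumes that the `google` extra makes `google.cloud.documentai` importable, which is the import name of the `google-cloud-documentai` package.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the module `name` can be located.

    For a dotted name, find_spec imports the parent packages but not
    the target module itself.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package (for example `google`) is missing.
        return False


if module_available("google.cloud.documentai"):
    print("Google OCR dependencies found")
else:
    print('Google OCR dependencies missing; try: pip install "docprompt[google]"')
```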


## Usage


### Simple Operations
```python
from docprompt import load_document

# Load a document
document = load_document("path/to/my.pdf")

# Rasterize a single page using Ghostscript
page_number = 5
rastered = document.rasterize_page(page_number, dpi=120)

# Split a PDF based on a page range
document_2 = document.split(start=125, stop=130)
```

### Performing OCR
```python
from docprompt import load_document, DocumentNode
from docprompt.tasks.ocr.gcp import GoogleOcrProvider

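# Placeholder values; replace with your own GCP project, processor, and key file
my_project_id = "my-gcp-project"
my_processor_id = "my-processor-id"
path_to_service_file = "path/to/service_account.json"

provider = GoogleOcrProvider.from_service_account_file(
  project_id=my_project_id,
  processor_id=my_processor_id,
  service_account_file=path_to_service_file
)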

document = load_document("path/to/my.pdf")

# A DocumentNode is a container for data derived from a document, such as OCR or classification results
document_node = DocumentNode.from_document(document)

provider.process_document_node(document_node) # Caches results on the document_node

document_node[0].ocr_result # Access OCR results
```

