Metadata-Version: 2.4
Name: harvestor
Version: 0.0.1
Summary: Harvest intelligence from any document - AI-powered data extraction and validation
Keywords: ai,llm,data-extraction,pdf,document-processing,ocr,anthropic,openai
Author: THUAUD Simon
Author-email: THUAUD Simon <sim.thuaud@gmail.com>
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: General
Requires-Dist: langchain>=0.1.0
Requires-Dist: langchain-anthropic>=0.1.0
Requires-Dist: langchain-openai>=0.0.5
Requires-Dist: anthropic>=0.18.0
Requires-Dist: openai>=1.10.0
Requires-Dist: pymupdf>=1.23.0
Requires-Dist: pdfplumber>=0.10.0
Requires-Dist: pillow>=10.2.0
Requires-Dist: pytesseract>=0.3.10
Requires-Dist: opencv-python-headless>=4.8.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: sqlalchemy>=2.0.0
Requires-Dist: click>=8.1.0
Requires-Dist: rich>=13.7.0
Requires-Dist: tqdm>=4.66.0
Requires-Dist: pytest>=7.4.0 ; extra == 'dev'
Requires-Dist: pytest-mock>=3.12.0 ; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0 ; extra == 'dev'
Requires-Dist: ruff>=0.1.0 ; extra == 'dev'
Requires-Python: >=3.13, <3.14
Project-URL: Homepage, https://github.com/SIMOUNIX/harvestor
Project-URL: Repository, https://github.com/SIMOUNIX/harvestor
Project-URL: Issues, https://github.com/SIMOUNIX/harvestor/issues
Provides-Extra: dev
Description-Content-Type: text/markdown

# 🌾 Harvestor

**AI-powered document data extraction toolkit**

Extract structured data from documents (invoices, receipts, forms) using Claude's vision API. Harvestor integrates into Python applications through flexible input options and built-in cost tracking.

> ⚠️ **Early Development**: This project is in active development. Core functionality is working, but many features are still being built.

## What Works Now

- ✅ **Vision API Integration**: Extract data from images (.jpg, .png, .gif, .webp)
- ✅ **Flexible Input**: Accepts file paths, bytes, or file-like objects (for example, data coming from PIL or requests)
- ✅ **Cost Tracking**: Built-in monitoring and limits for API usage
- ✅ **Structured Output**: Returns Pydantic-validated data models that you can define
- 🚧 **Multi-strategy Extraction**: Cost-optimized cascade to reduce API calls (planned)
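
To illustrate what "monitoring and limits" can mean in practice, here is a minimal standalone sketch of a spending guard. It is not Harvestor's implementation: `CostTracker`, `CostLimitExceeded`, `record`, and the dollar figures are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass, field


class CostLimitExceeded(RuntimeError):
    """Raised when recording a call would push spending past the limit."""


@dataclass
class CostTracker:
    # Hypothetical helper for illustration, not part of Harvestor's API.
    limit_usd: float
    spent_usd: float = 0.0
    calls: list[tuple[str, float]] = field(default_factory=list)

    def record(self, label: str, cost_usd: float) -> None:
        """Add one call's cost, refusing it if the limit would be exceeded."""
        if cost_usd < 0:
            raise ValueError("cost_usd must be non-negative")
        if self.spent_usd + cost_usd > self.limit_usd:
            raise CostLimitExceeded(
                f"{label}: {cost_usd:.4f} USD would exceed the "
                f"{self.limit_usd:.4f} USD limit"
            )
        self.spent_usd += cost_usd
        self.calls.append((label, cost_usd))

    @property
    def remaining_usd(self) -> float:
        """Budget still available before the limit is reached."""
        return self.limit_usd - self.spent_usd


# Usage: the second call is rejected because it would exceed the limit.
tracker = CostTracker(limit_usd=0.05)
tracker.record("invoice-1", 0.02)
try:
    tracker.record("invoice-2", 0.04)
except CostLimitExceeded as exc:
    print(f"Blocked: {exc}")
print(f"Spent so far: ${tracker.spent_usd:.4f}")
```

Checking the limit before the call is recorded keeps a rejected call out of the running total, so the tracker's state stays consistent after an error.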

## Quick Start

```bash
# Install dependencies
uv sync

# Set up environment
cp .env.template .env
# Add your Anthropic API key to .env

# Run a test
uv run python example.py
```

## Basic Usage

```python
from harvestor import harvest

# From file path
result = harvest("invoice.jpg")
print(f"Invoice #: {result.data.get('invoice_number')}")
print(f"Total: ${result.data.get('total_amount')}")
print(f"Cost: ${result.total_cost:.4f}")

# From bytes (e.g., API upload)
with open("invoice.jpg", "rb") as f:
    data = f.read()
result = harvest(data, filename="invoice.jpg")

# From file-like object (wrapping the bytes read above)
from io import BytesIO
buffer = BytesIO(data)
result = harvest(buffer, filename="invoice.jpg")

# Display summary output
print(result.to_summary())
```
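
The three input styles above can be reduced to a single bytes payload before any API call. The following is a rough sketch of that idea under stated assumptions, not a description of Harvestor's internals: `read_source` and `media_type_for` are hypothetical helpers, and using the filename to pick a media type is an assumption about why a `filename` might be useful.

```python
from io import BytesIO
from pathlib import Path
from typing import BinaryIO, Union

Source = Union[str, Path, bytes, BinaryIO]

# Media types for the image extensions listed under "What Works Now".
MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def read_source(source: Source) -> bytes:
    """Return the raw bytes behind a path, a bytes object, or a binary stream."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    if isinstance(source, (str, Path)):
        return Path(source).read_bytes()
    if hasattr(source, "read"):
        data = source.read()
        if not isinstance(data, bytes):
            raise TypeError("file-like objects must be opened in binary mode")
        return data
    raise TypeError(f"unsupported source type: {type(source).__name__}")


def media_type_for(filename: str) -> str:
    """Map a filename extension to an image media type (hypothetical helper)."""
    suffix = Path(filename).suffix.lower()
    if suffix not in MEDIA_TYPES:
        raise ValueError(f"unsupported image extension: {suffix or '(none)'}")
    return MEDIA_TYPES[suffix]


# Usage: bytes and file-like inputs yield the same payload.
payload = b"\x89PNG placeholder bytes"
assert read_source(payload) == read_source(BytesIO(payload))
print(media_type_for("invoice.JPG"))
```

Normalizing early means the rest of a pipeline only has to handle `bytes`, and unsupported inputs fail with a clear error before any paid API call is made.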

## Testing

```bash
# Install test dependencies
uv sync --extra dev

# Run tests
make test

# Run with coverage
make test-cov
```

## Requirements

- Python 3.13
- Anthropic API key (for Claude vision API)

## Citation

For testing and evaluation, we use the following dataset:

> Limam, M., et al. FATURA Dataset. Zenodo, 13 Dec. 2023, https://doi.org/10.5281/zenodo.10371464.

## License

MIT
