Metadata-Version: 2.3
Name: invocr
Version: 1.0.3
Summary: Invoice OCR System - Convert invoices between PDF, JSON, XML, HTML formats using OCR
License: Apache
Author: InvOCR Team
Author-email: team@invocr.com
Requires-Python: >=3.9,<4.0
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: aiofiles (>=23.2.1,<24.0.0)
Requires-Dist: click (>=8.1.7,<9.0.0)
Requires-Dist: easyocr (>=1.7.0,<2.0.0)
Requires-Dist: fastapi (>=0.104.1,<0.105.0)
Requires-Dist: jinja2 (>=3.1.2,<4.0.0)
Requires-Dist: loguru (>=0.7.2,<0.8.0)
Requires-Dist: lxml (>=4.9.3,<5.0.0)
Requires-Dist: numpy (>=1.24.3,<2.0.0)
Requires-Dist: opencv-python (>=4.8.1.78,<5.0.0.0)
Requires-Dist: pdf2image (>=1.16.3,<2.0.0)
Requires-Dist: pdfplumber (>=0.9.0,<0.10.0)
Requires-Dist: pillow (>=10.1.0,<11.0.0)
Requires-Dist: pydantic (>=2.5.0,<3.0.0)
Requires-Dist: pydantic-settings (>=2.1.0,<3.0.0)
Requires-Dist: pytesseract (>=0.3.10,<0.4.0)
Requires-Dist: python-multipart (>=0.0.6,<0.0.7)
Requires-Dist: uvicorn[standard] (>=0.24.0,<0.25.0)
Requires-Dist: weasyprint (>=60.2,<61.0)
Project-URL: Documentation, https://invocr.readthedocs.io
Project-URL: Homepage, https://github.com/invocr/invocr
Project-URL: Repository, https://github.com/invocr/invocr
Description-Content-Type: text/markdown

# InvOCR - Intelligent Invoice Processing

> 🔍 Enterprise-grade document processing with advanced OCR for invoices, receipts, and financial documents

[![Python 3.9+](https://img.shields.io/badge/Python-3.9%2B-blue.svg)](https://www.python.org/downloads/)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.104%2B-green.svg)](https://fastapi.tiangolo.com/)
[![Docker](https://img.shields.io/badge/Docker-Ready-blue.svg)](https://www.docker.com/)
[![License](https://img.shields.io/badge/License-Apache%202.0-yellow.svg)](https://opensource.org/licenses/Apache-2.0)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

**InvOCR** is a document processing system that automates the extraction and conversion of financial documents. It accepts PDF and image input, produces JSON, XML, HTML, and PDF output, and runs OCR in multiple languages.

## 🚀 Key Features

### 📄 Document Processing Pipeline
- **Input Formats**: PDF, PNG, JPG, TIFF
- **Output Formats**: JSON, XML, HTML, PDF
- **Conversion Workflows**:
  - PDF/Image → Text (OCR)
  - Text → Structured Data
  - Data → Standard Formats (EU XML, HTML, PDF)
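
The three workflow stages above can be pictured as a chain of small functions. The sketch below is illustrative only: `ocr_stub`, `extract_fields`, and `to_json` are hypothetical names, not part of InvOCR's API, and the OCR stage is replaced by a hard-coded string so the example runs without Tesseract installed.

```python
import json
import re


def ocr_stub(path: str) -> str:
    """Stand-in for the OCR stage: returns fixed text instead of reading `path`."""
    return "Invoice No: INV-001\nDate: 2024-01-15\nTotal: 123.45 EUR"


def extract_fields(text: str) -> dict:
    """Text -> structured data, using simple regular expressions."""
    patterns = {
        "invoice_number": r"Invoice No:\s*(\S+)",
        "date": r"Date:\s*(\d{4}-\d{2}-\d{2})",
        "total": r"Total:\s*([\d.]+)",
    }
    data = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        # Missing fields are recorded as None rather than raising.
        data[field] = match.group(1) if match else None
    return data


def to_json(data: dict) -> str:
    """Structured data -> one of the standard output formats (JSON here)."""
    return json.dumps(data, indent=2)


print(to_json(extract_fields(ocr_stub("invoice.pdf"))))
```

Real invoices need far more robust extraction than three regular expressions; the point is only the shape of the pipeline, where each stage consumes the previous stage's output.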

### 🔍 Advanced OCR Capabilities
- **Multi-engine Support**: Tesseract OCR + EasyOCR
- **Language Support**: English, Polish, German, French, Spanish, Italian
- **Smart Features**:
  - Auto-language detection
  - Layout analysis
  - Table extraction
  - Signature detection

### 🛠️ Technical Highlights
- **REST API**: FastAPI-based, async-ready
- **CLI**: Intuitive command-line interface
- **Docker Support**: Easy deployment
- **Batch Processing**: Process multiple documents
- **Templating System**: Customizable output formats
- **Validation**: Built-in data validation

### 📋 Supported Document Types
| Type | Description | Key Features |
|------|-------------|--------------|
| **Invoices** | Commercial invoices | Line items, totals, tax details |
| **Receipts** | Retail receipts | Merchant info, items, totals |
| **Bills** | Utility bills | Account info, payment details |
| **Bank Statements** | Account statements | Transactions, balances |
| **Custom** | Any document | Configurable templates |

## 🛠️ Basic Usage

### Using the CLI

```bash
# Convert PDF to JSON
invocr convert invoice.pdf output.json

# Process image with specific languages
invocr img2json receipt.jpg --languages en,pl,de

# Start the API server
invocr serve

# Run batch processing
invocr batch ./invoices/ ./output/ --format xml
```

### Using the API

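```python
import requests

# Upload a document and request JSON output
with open("document.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/convert",
        files={"file": f},
        data={"target_format": "json"},
        timeout=60,
    )
response.raise_for_status()  # Fail loudly on HTTP errors
print(response.json())
```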

## 🏗️ Project Structure

```
invocr/
├── 📁 invocr/                 # Main package
│   ├── 📁 core/               # Core processing modules
│   │   ├── ocr.py            # OCR engine (Tesseract + EasyOCR)
│   │   ├── converter.py      # Universal format converter
│   │   ├── extractor.py      # Data extraction logic
│   │   └── validator.py      # Data validation
│   │
│   ├── 📁 formats/            # Format-specific handlers
│   │   ├── pdf.py           # PDF operations
│   │   ├── image.py         # Image processing
│   │   ├── json_handler.py  # JSON operations
│   │   ├── xml_handler.py   # EU XML format
│   │   └── html_handler.py  # HTML generation
│   │
│   ├── 📁 api/               # REST API
│   │   ├── main.py          # FastAPI application
│   │   ├── routes.py        # API endpoints
│   │   └── models.py        # Pydantic models
│   │
│   ├── 📁 cli/               # Command line interface
│   │   └── commands.py      # CLI commands
│   │
│   └── 📁 utils/             # Utilities
│       ├── config.py        # Configuration
│       ├── logger.py        # Logging setup
│       └── helpers.py       # Helper functions
│
├── 📁 tests/                 # Test suite
├── 📁 scripts/               # Installation scripts
├── 📁 docs/                  # Documentation
├── 🐳 Dockerfile             # Docker configuration
├── 🐳 docker-compose.yml     # Docker Compose
├── 📋 pyproject.toml         # Poetry configuration
└── 📖 README.md              # This file
```

## 🚀 Quick Start

### Prerequisites
- Python 3.9+
- Tesseract OCR 4.0+
- Poppler Utils
- Docker (optional)

### Installation

#### Option 1: Using Docker (Recommended)
```bash
# Clone repository
git clone https://github.com/fin-officer/invocr.git
cd invocr

# Build and start services
docker-compose up -d --build

# Access the API at http://localhost:8000
# View API docs at http://localhost:8000/docs
```

#### Option 2: Local Installation

1. Install system dependencies (Ubuntu/Debian):
```bash
sudo apt update
sudo apt install -y tesseract-ocr tesseract-ocr-pol tesseract-ocr-deu \
    tesseract-ocr-fra tesseract-ocr-spa tesseract-ocr-ita \
    poppler-utils libpango-1.0-0 libharfbuzz0b python3-dev build-essential
```

2. Install Python dependencies:
```bash
# Install Poetry if not installed
curl -sSL https://install.python-poetry.org | python3 -

# Install project dependencies
poetry install

# Set up the environment file
cp .env.example .env
```

#### Option 3: Docker (Manual Build)

```bash
# Using Docker Compose (easiest)
docker-compose up

# Or build manually
docker build -t invocr .
docker run -p 8000:8000 invocr
```

## 🚀 Development

### Running Tests
```bash
# Run all tests
poetry run pytest

# Run tests with coverage
poetry run pytest --cov=invocr --cov-report=html
```

### Code Quality
```bash
# Run linters
poetry run flake8 invocr/
poetry run mypy invocr/

# Format code
poetry run black invocr/ tests/
poetry run isort invocr/ tests/
```

### Building the Package
```bash
# Build package
poetry build

# Publish to PyPI (requires credentials)
poetry publish
```

## 📚 Documentation

For detailed documentation, see:
- [API Reference](./docs/api.md)
- [CLI Usage](./docs/cli.md)
- [Development Guide](./docs/development.md)

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details.

## 📄 License

This project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details.

## 📞 Support

For support, please open an issue in the [issue tracker](https://github.com/fin-officer/invocr/issues).

## 📊 Project Status

![GitHub last commit](https://img.shields.io/github/last-commit/fin-officer/invocr)
![GitHub issues](https://img.shields.io/github/issues/fin-officer/invocr)
![GitHub pull requests](https://img.shields.io/github/issues-pr/fin-officer/invocr)

---

<div align="center">
  Made with ❤️ by the InvOCR Team
</div>

## 📚 Usage Examples

### CLI Commands

```bash
# Convert PDF to JSON
invocr convert invoice.pdf invoice.json

# Convert with specific languages
invocr convert -l en,pl,de document.pdf output.json

# PDF to images
invocr pdf2img document.pdf ./images/ --format png --dpi 300

# Image to JSON (OCR)
invocr img2json scan.png data.json --doc-type invoice

# JSON to EU XML format
invocr json2xml data.json invoice.xml

# Batch processing
invocr batch ./input_files/ ./output/ --format json --parallel 4

# Full pipeline: PDF → IMG → JSON → XML → HTML → PDF
invocr pipeline document.pdf ./results/

# Start API server
invocr serve --host 0.0.0.0 --port 8000
```

### REST API

```bash
# Start server
invocr serve

# Convert file
curl -X POST "http://localhost:8000/convert" \
  -F "file=@invoice.pdf" \
  -F "target_format=json" \
  -F "languages=en,pl"

# Check job status
curl "http://localhost:8000/status/{job_id}"

# Download result
curl "http://localhost:8000/download/{job_id}" -o result.json
```
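
The status and download calls above suggest an asynchronous job flow: submit, poll, then fetch the result. A minimal polling sketch in Python follows, using only the standard library. The shape of the status response is an assumption (a JSON object with a `status` key whose terminal values are `completed` and `failed`); check the interactive docs at `/docs` for the actual schema. `fetch_status` and `wait_for_job` are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8000"  # default address used in the examples above


def fetch_status(job_id: str) -> dict:
    """GET /status/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/status/{job_id}") as response:
        return json.load(response)


def wait_for_job(
    job_id: str,
    fetch: Callable[[str], dict] = fetch_status,
    interval: float = 1.0,
    timeout: float = 60.0,
) -> dict:
    """Poll until the job reports a terminal state or the timeout expires.

    ASSUMPTION: the payload has a "status" key and the terminal values
    are "completed" and "failed". Adjust to the real API schema.
    """
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch(job_id)
        if payload.get("status") in ("completed", "failed"):
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
        time.sleep(interval)
```

Passing the fetch function as a parameter keeps the loop testable without a running server, and a fixed interval is the simplest choice; a backoff schedule may suit long-running jobs better.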

### Python API

```python
from invocr import create_converter

# Create converter instance
converter = create_converter(languages=['en', 'pl', 'de'])

# Convert PDF to JSON
result = converter.pdf_to_json('invoice.pdf')
print(result)

# Convert image to JSON with OCR
data = converter.image_to_json('scan.png', document_type='invoice')

# Convert JSON to EU XML
xml_content = converter.json_to_xml(data, format='eu_invoice')

# Full conversion pipeline
result = converter.convert('input.pdf', 'output.json', 'auto', 'json')
```

## 🌐 API Documentation

When running the API server, visit:
- **Interactive docs**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
- **OpenAPI JSON**: http://localhost:8000/openapi.json

### Key Endpoints

- `POST /convert` - Convert single file
- `POST /convert/pdf2img` - PDF to images
- `POST /convert/img2json` - Image OCR to JSON
- `POST /batch/convert` - Batch processing
- `GET /status/{job_id}` - Job status
- `GET /download/{job_id}` - Download result
- `GET /health` - Health check
- `GET /info` - System information

## 🔧 Configuration

### Environment Variables

Key configuration options in `.env`:

```bash
# OCR Settings
DEFAULT_OCR_ENGINE=auto          # tesseract, easyocr, auto
DEFAULT_LANGUAGES=en,pl,de,fr,es # Supported languages
OCR_CONFIDENCE_THRESHOLD=0.3     # Minimum confidence

# Processing
MAX_FILE_SIZE=52428800          # 50MB limit
PARALLEL_WORKERS=4              # Concurrent processing
MAX_PAGES_PER_PDF=10           # Page limit

# Storage
UPLOAD_DIR=./uploads
OUTPUT_DIR=./output
TEMP_DIR=./temp
```
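
To show how these variables might be consumed, here is a standard-library sketch that reads them with the defaults listed above. It is not InvOCR's actual configuration loader (the package depends on `pydantic-settings`, which likely plays this role); `Settings` and `load_settings` are hypothetical names. Note that a `.env` file is not loaded into the process environment automatically, so the values must be exported or loaded by a tool first.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Settings:
    default_ocr_engine: str
    default_languages: List[str]
    ocr_confidence_threshold: float
    max_file_size: int
    parallel_workers: int
    max_pages_per_pdf: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables listed above, falling back to the values shown there."""
    languages = env.get("DEFAULT_LANGUAGES", "en,pl,de,fr,es")
    return Settings(
        default_ocr_engine=env.get("DEFAULT_OCR_ENGINE", "auto"),
        # "en, pl" and "en,pl" are treated the same; empty entries are dropped.
        default_languages=[c.strip() for c in languages.split(",") if c.strip()],
        ocr_confidence_threshold=float(env.get("OCR_CONFIDENCE_THRESHOLD", "0.3")),
        max_file_size=int(env.get("MAX_FILE_SIZE", "52428800")),
        parallel_workers=int(env.get("PARALLEL_WORKERS", "4")),
        max_pages_per_pdf=int(env.get("MAX_PAGES_PER_PDF", "10")),
    )
```

Converting each value at load time means a malformed entry such as `PARALLEL_WORKERS=four` fails at startup with a `ValueError` rather than later during processing.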

### Supported Languages

| Code | Language | Tesseract | EasyOCR |
|------|----------|-----------|---------|
| `en` | English | ✅ | ✅ |
| `pl` | Polish | ✅ | ✅ |
| `de` | German | ✅ | ✅ |
| `fr` | French | ✅ | ✅ |
| `es` | Spanish | ✅ | ✅ |
| `it` | Italian | ✅ | ✅ |
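
The two engines name languages differently: Tesseract uses three-letter codes (matching the `tesseract-ocr-pol`, `tesseract-ocr-deu`, and similar packages in the installation step) joined with `+`, while EasyOCR's `Reader` takes a list of two-letter codes such as `['en', 'pl']`. The sketch below shows one way to translate the codes in the table for Tesseract; `to_tesseract_lang` is a hypothetical helper, and how InvOCR performs this mapping internally is not shown in this document.

```python
from typing import Iterable

# Two-letter codes from the table above mapped to Tesseract language names.
TESSERACT_CODES = {
    "en": "eng",
    "pl": "pol",
    "de": "deu",
    "fr": "fra",
    "es": "spa",
    "it": "ita",
}


def to_tesseract_lang(codes: Iterable[str]) -> str:
    """Build the value for Tesseract's `-l` option from two-letter codes."""
    codes = list(codes)  # allow generators, which can only be consumed once
    unknown = [code for code in codes if code not in TESSERACT_CODES]
    if unknown:
        raise ValueError(f"unsupported language code(s): {', '.join(unknown)}")
    return "+".join(TESSERACT_CODES[code] for code in codes)


print(to_tesseract_lang(["en", "pl", "de"]))  # eng+pol+deu
```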

## 📊 Supported Formats

### Input Formats
- **PDF** (.pdf)
- **Images** (.png, .jpg, .jpeg, .tiff, .bmp)
- **JSON** (.json)
- **XML** (.xml)
- **HTML** (.html)

### Output Formats
- **JSON** - Structured data
- **XML** - EU Invoice standard
- **HTML** - Responsive templates
- **PDF** - Professional documents

## 🧪 Testing

```bash
# Run all tests
poetry run pytest

# Run with coverage
poetry run pytest --cov=invocr

# Run specific test file
poetry run pytest tests/test_ocr.py

# Run API tests
poetry run pytest tests/test_api.py
```

## 🚀 Deployment

### Production with Docker

```yaml
# docker-compose.prod.yml
version: '3.8'
services:
  invocr:
    image: invocr:latest
    ports:
      - "80:8000"
    environment:
      - ENVIRONMENT=production
      - WORKERS=4
    volumes:
      - ./data:/app/data
```

### Kubernetes

```yaml
# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: invocr
spec:
  replicas: 3
  selector:
    matchLabels:
      app: invocr
  template:
    metadata:
      labels:
        app: invocr
    spec:
      containers:
      - name: invocr
        image: invocr:latest
        ports:
        - containerPort: 8000
```

## 🤝 Contributing

1. Fork the repository
2. Create feature branch (`git checkout -b feature/amazing-feature`)
3. Make changes
4. Add tests
5. Run tests (`poetry run pytest`)
6. Commit changes (`git commit -m 'Add amazing feature'`)
7. Push to branch (`git push origin feature/amazing-feature`)
8. Open Pull Request

### Development Setup

```bash
# Install development dependencies
poetry install --with dev

# Install pre-commit hooks
poetry run pre-commit install

# Run linting
poetry run black invocr/
poetry run isort invocr/
poetry run flake8 invocr/

# Run type checking
poetry run mypy invocr/
```

## 📈 Performance

### Benchmarks

| Operation | Time | Memory |
|-----------|------|--------|
| PDF → JSON (1 page) | ~2-3s | ~50MB |
| Image OCR → JSON | ~1-2s | ~30MB |
| JSON → XML | ~0.1s | ~10MB |
| JSON → HTML | ~0.2s | ~15MB |
| HTML → PDF | ~1-2s | ~40MB |

### Optimization Tips

- Use `--parallel` for batch processing
- Set `IMAGE_ENHANCEMENT=false` for faster OCR
- Use the `tesseract` engine (`DEFAULT_OCR_ENGINE=tesseract`) for better performance
- Configure `MAX_PAGES_PER_PDF` for large documents
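
To illustrate what parallel batch processing amounts to, the sketch below fans a per-file conversion function out over a pool of workers. It is not the implementation behind `--parallel`; `convert_batch` and the `convert_one` callable are hypothetical. Threads are used here on the assumption that the heavy work happens outside the interpreter (pytesseract, for example, runs the `tesseract` binary as a subprocess); for CPU-bound work inside Python, `ProcessPoolExecutor` would be the usual substitute.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, Iterable


def convert_batch(
    paths: Iterable[Path],
    convert_one: Callable[[Path], object],
    workers: int = 4,
) -> Dict[Path, object]:
    """Run `convert_one` on every path concurrently.

    Returns a mapping from path to either the conversion result or the
    exception raised for that file, so one bad document does not abort
    the whole batch.
    """
    results: Dict[Path, object] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(convert_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # record the failure and keep going
                results[path] = exc
    return results
```

With the Python API shown earlier, `convert_one` could be something like `converter.pdf_to_json`, provided the converter instance is safe to share between threads, which this document does not state.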

## 🔒 Security

- File upload validation
- Size limits enforced
- Input sanitization
- No execution of uploaded content
- Rate limiting available
- CORS configuration
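
As an illustration of the first three items, the sketch below checks an upload's name and size before any processing. It is a simplified stand-in, not InvOCR's validation code: `validate_upload` is a hypothetical helper, the size limit reuses the `MAX_FILE_SIZE` value from the configuration section, and the extension list is taken from the supported input formats.

```python
from pathlib import Path

MAX_FILE_SIZE = 52_428_800  # 50 MB, same value as MAX_FILE_SIZE in .env
ALLOWED_EXTENSIONS = {
    ".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".json", ".xml", ".html",
}


def validate_upload(filename: str, size: int) -> str:
    """Return a sanitized file name, or raise ValueError to reject the upload."""
    # Keep only the final path component so names like "../../etc/x.pdf"
    # cannot point outside the upload directory.
    safe_name = Path(filename.replace("\\", "/")).name
    if not safe_name:
        raise ValueError("empty file name")
    extension = Path(safe_name).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {extension or '(none)'}")
    if size <= 0 or size > MAX_FILE_SIZE:
        raise ValueError(f"file size {size} is outside 1..{MAX_FILE_SIZE} bytes")
    return safe_name
```

An extension check alone does not prove what a file contains; inspecting the leading bytes of the upload is a common additional safeguard.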

## 📋 Requirements

### System Requirements
- **Python**: 3.9+
- **Memory**: 1GB+ RAM
- **Storage**: 500MB+ free space
- **OS**: Linux, macOS, Windows (Docker)

### Dependencies
- **Tesseract OCR**: Text recognition
- **EasyOCR**: Neural OCR engine
- **WeasyPrint**: HTML to PDF conversion
- **FastAPI**: Web framework
- **Pydantic**: Data validation

## 🐛 Troubleshooting

### Common Issues

**OCR not working:**
```bash
# Check Tesseract installation
tesseract --version

# Install missing languages
sudo apt install tesseract-ocr-pol
```

**WeasyPrint errors:**
```bash
# Install system dependencies
sudo apt install libpango-1.0-0 libharfbuzz0b
```

**Import errors:**
```bash
# Recreate the virtual environment and reinstall dependencies
poetry env remove --all
poetry install
```

**Permission errors:**
```bash
# Fix file permissions
chmod -R 755 uploads/ output/
```

## 📞 Support

- 📧 **Email**: support@invocr.com
- 🐛 **Issues**: [GitHub Issues](https://github.com/fin-officer/invocr/issues)
- 💬 **Discussions**: [GitHub Discussions](https://github.com/fin-officer/invocr/discussions)
- 📚 **Wiki**: [Project Wiki](https://github.com/fin-officer/invocr/wiki)

## 📄 License

This project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) - OCR engine
- [EasyOCR](https://github.com/JaidedAI/EasyOCR) - Neural OCR
- [FastAPI](https://fastapi.tiangolo.com/) - Web framework
- [WeasyPrint](https://weasyprint.org/) - HTML/CSS to PDF
- [Poetry](https://python-poetry.org/) - Dependency management

---

**Made with ❤️ for the open source community**

⭐ **Star this repository if you find it useful!**