Metadata-Version: 2.4
Name: nanonets-extractor
Version: 0.1.4
Summary: A unified document extraction library supporting local CPU, GPU, and cloud processing
Home-page: https://github.com/nanonets/document-extractor
Author: Nanonets
Author-email: Nanonets <support@nanonets.com>
Maintainer-email: Nanonets <support@nanonets.com>
License: MIT
Project-URL: Homepage, https://github.com/nanonets/document-extractor
Project-URL: Documentation, https://docs.nanonets.com
Project-URL: Repository, https://github.com/nanonets/document-extractor
Project-URL: Bug Tracker, https://github.com/nanonets/document-extractor/issues
Project-URL: API Keys, https://app.nanonets.com/#/keys
Keywords: document,extraction,ocr,pdf,ai,machine-learning
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: General
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.25.0
Requires-Dist: Pillow>=8.0.0
Requires-Dist: PyPDF2>=2.0.0
Requires-Dist: python-docx>=0.8.11
Requires-Dist: openpyxl>=3.0.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: pydantic>=1.9.0
Requires-Dist: typing-extensions>=4.0.0
Provides-Extra: cpu
Requires-Dist: numpy<2.0.0,>=1.21.0; extra == "cpu"
Requires-Dist: opencv-python>=4.5.0; extra == "cpu"
Requires-Dist: pytesseract>=0.3.8; extra == "cpu"
Requires-Dist: easyocr>=1.6.0; extra == "cpu"
Provides-Extra: gpu
Requires-Dist: numpy<2.0.0,>=1.21.0; extra == "gpu"
Requires-Dist: torch>=1.12.0; extra == "gpu"
Requires-Dist: torchvision>=0.13.0; extra == "gpu"
Requires-Dist: transformers>=4.20.0; extra == "gpu"
Requires-Dist: opencv-python>=4.5.0; extra == "gpu"
Provides-Extra: all
Requires-Dist: numpy<2.0.0,>=1.21.0; extra == "all"
Requires-Dist: opencv-python>=4.5.0; extra == "all"
Requires-Dist: pytesseract>=0.3.8; extra == "all"
Requires-Dist: easyocr>=1.6.0; extra == "all"
Requires-Dist: torch>=1.12.0; extra == "all"
Requires-Dist: torchvision>=0.13.0; extra == "all"
Requires-Dist: transformers>=4.20.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=6.0.0; extra == "dev"
Requires-Dist: pytest-cov>=2.10.0; extra == "dev"
Requires-Dist: black>=21.0.0; extra == "dev"
Requires-Dist: flake8>=3.8.0; extra == "dev"
Requires-Dist: mypy>=0.910; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# Nanonets Document Extractor

A Python library for extracting data from any document using AI. It supports cloud API, local CPU, and local GPU processing.

## Quick Start

### Installation

```bash
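# For cloud processing only (recommended)
pip install nanonets-extractor

# For local CPU processing (quote the extra so shells like zsh do not glob the brackets)
pip install "nanonets-extractor[cpu]"

# For local GPU processing
pip install "nanonets-extractor[gpu]"

# For both local CPU and GPU processing
pip install "nanonets-extractor[all]"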
```

### Get Your Free API Key
Cloud mode requires an API key. You can get one for free at [https://app.nanonets.com/#/keys](https://app.nanonets.com/#/keys).

## Usage

### Basic Example

```python
from nanonets_extractor import DocumentExtractor

# Initialize extractor
extractor = DocumentExtractor(
    mode="cloud",
    api_key="your_api_key_here"
)

# Extract data from any document
result = extractor.extract(
    file_path="invoice.pdf",
    output_type="flat-json"
)

print(result)
```

## Initialization Parameters

### DocumentExtractor()

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `mode` | str | Yes | Processing mode: `"cloud"`, `"cpu"`, or `"gpu"` |
| `api_key` | str | Yes (cloud mode) | Your Nanonets API key; can also be supplied through the `NANONETS_API_KEY` environment variable |
| `model_path` | str | No | Custom model path for local processing |
| `device` | str | No | GPU device (e.g., `"cuda:0"`) for GPU mode |

### Processing Modes

#### 1. Cloud Mode (Recommended)
```python
extractor = DocumentExtractor(
    mode="cloud",
    api_key="your_api_key"
)
```
- ✅ No setup required
- ✅ Fastest processing
- ✅ Most accurate
- ✅ Supports all document types

#### 2. CPU Mode
```python
extractor = DocumentExtractor(mode="cpu")
```
- ✅ Works offline
- ⚠️ Slower processing
- ⚠️ Requires local dependencies

#### 3. GPU Mode
```python
extractor = DocumentExtractor(
    mode="gpu",
    device="cuda:0"  # optional
)
```
- ✅ Faster than CPU
- ✅ Works offline
- ⚠️ Requires CUDA-capable GPU

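The trade-offs above suggest a simple selection rule: prefer cloud mode when an API key is available, fall back to GPU when CUDA is usable, and use CPU otherwise. The sketch below is one way to encode that rule. `choose_mode` and `cuda_available` are hypothetical helpers, not part of the library, and the CUDA check assumes the `gpu` extra (which installs `torch`) is present.

```python
from typing import Optional


def cuda_available() -> bool:
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch  # installed by the "gpu" extra
    except ImportError:
        return False
    return torch.cuda.is_available()


def choose_mode(api_key: Optional[str], has_cuda: bool) -> str:
    """Pick a processing mode: cloud if a key is set, else gpu, else cpu."""
    if api_key:
        return "cloud"
    return "gpu" if has_cuda else "cpu"
```

For example, `choose_mode(os.environ.get("NANONETS_API_KEY"), cuda_available())` yields a string that can be passed as `mode` to `DocumentExtractor`.
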
## Extract Method

### extractor.extract()

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `file_path` | str | Yes | Path to your document |
| `output_type` | str | No | Output format (default: `"flat-json"`) |
| `specified_fields` | list | No | Field names to extract; used with `"specified-fields"` |
| `json_schema` | dict | No | Custom JSON schema for the output; used with `"specified-json"` |

### Output Types

| Type | Description | Example or requirement |
|------|-------------|---------|
| `"flat-json"` | Simple key-value pairs | `{"invoice_number": "123", "total": "100.00"}` |
| `"markdown"` | Formatted markdown text | `# Invoice\n**Total:** $100.00` |
| `"specified-fields"` | Only requested fields | Requires the `specified_fields` parameter |
| `"specified-json"` | Custom JSON structure | Requires the `json_schema` parameter |

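The table above implies two pairing rules: `"specified-fields"` needs `specified_fields`, and `"specified-json"` needs `json_schema`. A small pre-flight check can catch a mismatch before any document is processed. `check_extract_args` below is a hypothetical helper, not a library function; how the library itself reacts to a missing parameter is not documented here.

```python
from typing import Optional

# Output types listed in the table above.
OUTPUT_TYPES = {"flat-json", "markdown", "specified-fields", "specified-json"}


def check_extract_args(
    output_type: str = "flat-json",
    specified_fields: Optional[list] = None,
    json_schema: Optional[dict] = None,
) -> None:
    """Raise ValueError if the arguments contradict the output-type table."""
    if output_type not in OUTPUT_TYPES:
        raise ValueError(f"unknown output_type: {output_type!r}")
    if output_type == "specified-fields" and not specified_fields:
        raise ValueError('"specified-fields" requires specified_fields')
    if output_type == "specified-json" and not json_schema:
        raise ValueError('"specified-json" requires json_schema')
```
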
## Supported Document Types

Works with **any document type**:
- 📄 **PDFs** - Invoices, contracts, reports
- 🖼️ **Images** - Screenshots, photos, scans
- 📊 **Spreadsheets** - Excel, CSV files
- 📝 **Text Documents** - Word docs, text files
- 🆔 **ID Documents** - Passports, licenses, certificates
- 🧾 **Receipts** - Any receipt or bill
- 📋 **Forms** - Tax forms, applications, surveys

## Examples

### Extract Invoice Data
```python
extractor = DocumentExtractor(mode="cloud", api_key="your_key")

result = extractor.extract(
    file_path="invoice.pdf",
    output_type="flat-json"
)
# Returns: {"invoice_number": "INV-001", "total": "150.00", "date": "2024-01-15", ...}
```
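
The example return value above shows every field as a string. If downstream code needs typed values, they can be converted after extraction. The sketch below assumes the keys `total` and `date` and the formats shown in that comment (a plain decimal string and an ISO `YYYY-MM-DD` date); real documents may yield other keys or formats. `normalize_invoice` is a hypothetical helper, not part of the library.

```python
from datetime import date
from decimal import Decimal


def normalize_invoice(result: dict) -> dict:
    """Return a copy of a flat-json result with typed `total` and `date`."""
    typed = dict(result)
    if "total" in typed:
        # Decimal avoids the rounding surprises of float for money values.
        typed["total"] = Decimal(typed["total"])
    if "date" in typed:
        # Assumes an ISO 8601 date string such as "2024-01-15".
        typed["date"] = date.fromisoformat(typed["date"])
    return typed


sample = {"invoice_number": "INV-001", "total": "150.00", "date": "2024-01-15"}
print(normalize_invoice(sample))
```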

### Extract Specific Fields
```python
result = extractor.extract(
    file_path="resume.pdf",
    output_type="specified-fields",
    specified_fields=["name", "email", "phone", "experience"]
)
# Returns: {"name": "John Doe", "email": "john@email.com", ...}
```
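
A document may not contain every requested field. Whether the library omits such fields or returns them empty is not stated here, so the sketch below treats both cases as missing. `missing_fields` is a hypothetical helper, not part of the library.

```python
from typing import Iterable, List, Mapping


def missing_fields(result: Mapping[str, object], requested: Iterable[str]) -> List[str]:
    """List requested fields that are absent from the result or have no value."""
    return [name for name in requested if result.get(name) in (None, "")]
```

Calling `missing_fields(result, ["name", "email", "phone", "experience"])` after the extraction above gives the names that still need manual review.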

### Get Markdown Output
```python
result = extractor.extract(
    file_path="report.pdf",
    output_type="markdown"
)
# Returns formatted markdown text
```

### Custom JSON Schema
```python
schema = {
    "personal_info": {
        "name": "string",
        "email": "string"
    },
    "skills": ["string"]
}

result = extractor.extract(
    file_path="resume.pdf",
    output_type="specified-json",
    json_schema=schema
)
```
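
The schema above reads like a template: nested dicts for objects, a one-element list for repeated values, and `"string"` as a type name. Assuming those conventions hold (the source does not spell them out), a result can be checked against the template before it is used. `matches_schema` is a hypothetical helper, not a library function.

```python
def matches_schema(value, schema) -> bool:
    """Check that `value` has the shape described by `schema`.

    Assumed conventions (not confirmed by the library): a dict describes
    an object, a one-element list describes a list of that element type,
    and the string "string" describes a str value.
    """
    if isinstance(schema, dict):
        return isinstance(value, dict) and all(
            key in value and matches_schema(value[key], sub)
            for key, sub in schema.items()
        )
    if isinstance(schema, list):
        if not isinstance(value, list):
            return False
        if not schema:
            return True  # an empty template list places no constraint on items
        return all(matches_schema(item, schema[0]) for item in value)
    if schema == "string":
        return isinstance(value, str)
    return True  # unknown type names are not checked
```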

## Command Line Usage

```bash
# Extract to JSON
nanonets-extractor document.pdf --output-type flat-json

# Extract specific fields
nanonets-extractor invoice.pdf --output-type specified-fields --fields invoice_number,total,date

# Use cloud API
nanonets-extractor document.pdf --mode cloud --api-key your_key

# Save to file
nanonets-extractor document.pdf --output result.json
```
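
To drive the CLI from a script, for example to process a folder of files, the command can be assembled as an argument list and run with `subprocess`. The sketch uses only the flags shown above. `build_command` and `run_extractor` are hypothetical helpers; they assume the `nanonets-extractor` executable is on `PATH`, that the flags can be combined, and that the CLI writes its result to standard output when `--output` is not given.

```python
import subprocess
from typing import List, Optional, Sequence


def build_command(
    path: str,
    output_type: str = "flat-json",
    fields: Optional[Sequence[str]] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Assemble a nanonets-extractor invocation from the flags shown above."""
    cmd = ["nanonets-extractor", path, "--output-type", output_type]
    if fields:
        cmd += ["--fields", ",".join(fields)]
    if output:
        cmd += ["--output", output]
    return cmd


def run_extractor(path: str, **options) -> str:
    """Run the CLI and return its standard output; raises on a non-zero exit."""
    completed = subprocess.run(
        build_command(path, **options),
        check=True,            # raise CalledProcessError on failure
        capture_output=True,   # collect stdout and stderr
        text=True,             # decode output as str
    )
    return completed.stdout
```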

## Error Handling

```python
from nanonets_extractor import DocumentExtractor
from nanonets_extractor.exceptions import ExtractionError, APIError

try:
    extractor = DocumentExtractor(mode="cloud", api_key="your_key")
    result = extractor.extract("document.pdf")
    print(result)
except APIError as e:
    print(f"API error: {e}")
except ExtractionError as e:
    print(f"Extraction failed: {e}")
```
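
Transient failures such as network errors are often worth retrying, while malformed input is not. The sketch below wraps any callable in a retry loop with exponential backoff. `with_retries` is a hypothetical helper; which exception classes are actually transient for this library is an assumption the caller has to make (for instance, passing `APIError` in `retry_on`).

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... between attempts.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

A call might look like `with_retries(lambda: extractor.extract("document.pdf"), retry_on=(APIError,))`.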

## Environment Variables

Set your API key as an environment variable:

```bash
export NANONETS_API_KEY="your_api_key_here"
```

Then create the extractor without passing `api_key`:
```python
extractor = DocumentExtractor(mode="cloud")  # Uses env variable
```
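
If the variable might be unset (for example in CI), checking for it up front gives a clearer failure than an error raised deep inside a request. The sketch below reads `NANONETS_API_KEY` with the standard library; `load_api_key` is a hypothetical helper, not part of the package.

```python
import os
from typing import Mapping


def load_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key from NANONETS_API_KEY, or raise if it is missing."""
    key = env.get("NANONETS_API_KEY", "").strip()
    if not key:
        raise RuntimeError("NANONETS_API_KEY is not set")
    return key
```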

## License

MIT License - see LICENSE file for details.

## Support

- 📧 Email: support@nanonets.com
- 🌐 Website: https://nanonets.com
- 📖 Documentation: https://nanonets.com/documentation
