Metadata-Version: 2.4
Name: socr
Version: 1.0.1
Summary: Multi-engine document OCR with cascading fallback
Project-URL: Homepage, https://github.com/r-uben/socr
Project-URL: Repository, https://github.com/r-uben/socr
Project-URL: Issues, https://github.com/r-uben/socr/issues
Author-email: Ruben Fernandez-Fuertes <fernandezfuertesruben@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: deepseek,document-processing,gemini,marker,mistral,nougat,ocr,pdf
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: <3.13,>=3.11
Requires-Dist: click>=8.1.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: pillow>=10.0.0
Requires-Dist: pymupdf>=1.24.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: rich>=13.0.0
Provides-Extra: all
Requires-Dist: deepseek-ocr-cli>=0.1.0; extra == 'all'
Requires-Dist: gemini-ocr-cli>=0.2.0; extra == 'all'
Requires-Dist: marker-ocr-cli>=0.2.0; extra == 'all'
Requires-Dist: mistral-ocr-cli>=0.1.0; extra == 'all'
Requires-Dist: nougat-ocr-cli>=0.1.2; extra == 'all'
Provides-Extra: cloud
Requires-Dist: gemini-ocr-cli>=0.2.0; extra == 'cloud'
Requires-Dist: mistral-ocr-cli>=0.1.0; extra == 'cloud'
Provides-Extra: deepseek
Requires-Dist: deepseek-ocr-cli>=0.1.0; extra == 'deepseek'
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.8.0; extra == 'dev'
Provides-Extra: gemini
Requires-Dist: gemini-ocr-cli>=0.2.0; extra == 'gemini'
Provides-Extra: local
Requires-Dist: deepseek-ocr-cli>=0.1.0; extra == 'local'
Requires-Dist: marker-ocr-cli>=0.2.0; extra == 'local'
Requires-Dist: nougat-ocr-cli>=0.1.2; extra == 'local'
Provides-Extra: marker
Requires-Dist: marker-ocr-cli>=0.2.0; extra == 'marker'
Provides-Extra: mistral
Requires-Dist: mistral-ocr-cli>=0.1.0; extra == 'mistral'
Provides-Extra: nougat
Requires-Dist: nougat-ocr-cli>=0.1.2; extra == 'nougat'
Description-Content-Type: text/markdown

# socr

[![PyPI](https://img.shields.io/pypi/v/socr)](https://pypi.org/project/socr/)
[![Python 3.11–3.12](https://img.shields.io/pypi/pyversions/socr)](https://pypi.org/project/socr/)
[![License](https://img.shields.io/github/license/r-uben/socr)](LICENSE)

Multi-engine document OCR with cascading fallback and quality audit.

`socr` orchestrates multiple OCR engines: it calls each one as a CLI subprocess, audits the output quality, and falls back to a different engine when the results are poor. Each engine is a standalone CLI tool (`gemini-ocr`, `deepseek-ocr`, `marker-ocr`, etc.) that can also be used independently.

## Install

```bash
pip install socr

# With specific engine backends
# (quotes keep shells such as zsh from globbing the brackets)
pip install "socr[gemini]"        # Google Gemini (cloud)
pip install "socr[local]"         # DeepSeek + Marker + Nougat (local/free)
pip install "socr[all]"           # All engines
```

Engines are installed separately because they have different dependencies (torch, cloud SDKs, etc.). Install only what you need.

## Usage

```bash
# Process a PDF
socr paper.pdf

# Choose engine
socr paper.pdf --primary gemini
socr paper.pdf --primary marker

# Save extracted figures
socr paper.pdf --save-figures

# Batch process a directory
socr batch ~/Papers/ -o ./results/
socr batch ~/Papers/ --dry-run        # preview what would be processed
socr batch ~/Papers/ --reprocess      # force reprocess all

# Check which engines are available
socr engines
```

## How it works

```
PDF → Primary OCR → Quality Audit → (Fallback OCR if needed) → Markdown
```

1. **Primary OCR** — Calls the primary engine CLI on the whole PDF
2. **Quality audit** — Heuristic checks (word count, garbage ratio, repetition)
3. **Fallback** — If audit fails, tries a different engine

Each engine is a separate CLI binary. `socr` calls it as a subprocess, reads the output markdown, and applies the quality pipeline.
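
The three steps above can be sketched in Python as follows. This is a minimal illustration, not `socr`'s actual implementation: the `audit` and `ocr_with_fallback` functions, the garbage and repetition thresholds, and the `run_engine` callable (which stands in for the real subprocess call) are all assumptions. Only the 50-word minimum mirrors the `audit_min_words` value shown in the configuration example.

```python
from collections import Counter
from typing import Callable, Optional


def audit(text: str, min_words: int = 50, max_garbage: float = 0.3,
          max_repeat: float = 0.5) -> bool:
    """Return True when the OCR text passes three heuristic checks.

    max_garbage and max_repeat are illustrative thresholds (assumptions).
    """
    words = text.split()
    if not words or len(words) < min_words:
        return False  # too little text was extracted
    # Garbage ratio: share of tokens with no letter or digit at all.
    garbage = sum(1 for w in words if not any(ch.isalnum() for ch in w))
    if garbage / len(words) > max_garbage:
        return False
    # Repetition: share of non-empty lines taken by the most common line.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    most_common_count = Counter(lines).most_common(1)[0][1]
    if len(lines) > 3 and most_common_count / len(lines) > max_repeat:
        return False
    return True


def ocr_with_fallback(pdf: str, run_engine: Callable[[str, str], str],
                      primary: str = "gemini",
                      fallback: Optional[str] = "marker") -> tuple[str, str]:
    """Run the primary engine; if its output fails the audit, try the fallback.

    run_engine(engine, pdf) is a hypothetical hook that returns markdown text.
    Returns (engine_used, text).
    """
    text = run_engine(primary, pdf)
    if fallback is None or audit(text):
        return primary, text
    # Assumption: the fallback output is returned without a second audit.
    return fallback, run_engine(fallback, pdf)
```

Passing the engine call in as a function keeps the cascade logic testable without any OCR engine installed.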

## Engines

| Engine | Package | Type | Notes |
|--------|---------|------|-------|
| Gemini | `gemini-ocr-cli` | Cloud | Google Gemini, ~$0.0002/page |
| Mistral | `mistral-ocr-cli` | Cloud | Mistral AI, ~$0.001/page |
| Marker | `marker-ocr-cli` | Local | Layout-aware (Surya + Texify) |
| DeepSeek | `deepseek-ocr-cli` | Local | Via Ollama |
| Nougat | `nougat-ocr-cli` | Local | Academic papers, Python <3.13 |

Check availability:
```
$ socr engines

  [+] gemini       cloud, ~$0.0002/page
  [+] marker       local, layout-aware (Surya + Texify)
  [+] mistral      cloud, ~$0.001/page
  [+] deepseek     local via Ollama
  [x] nougat       local, academic papers
```
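
A similar check can be approximated from Python by looking for each engine's binary on `PATH`. This is a rough sketch and not necessarily how `socr engines` decides availability. The binary names `gemini-ocr`, `deepseek-ocr`, and `marker-ocr` come from this README; `mistral-ocr` and `nougat-ocr` are assumed to follow the same pattern, and `available_engines` is a hypothetical helper.

```python
import shutil

# Engine name -> CLI binary. The last two names are assumptions.
ENGINE_BINARIES = {
    "gemini": "gemini-ocr",
    "deepseek": "deepseek-ocr",
    "marker": "marker-ocr",
    "mistral": "mistral-ocr",  # assumed name
    "nougat": "nougat-ocr",    # assumed name
}


def available_engines(which=shutil.which) -> dict[str, bool]:
    """Map each engine name to whether its binary is found on PATH."""
    return {name: which(binary) is not None
            for name, binary in ENGINE_BINARIES.items()}


if __name__ == "__main__":
    for name, ok in available_engines().items():
        print(f"  [{'+' if ok else 'x'}] {name}")
```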

## CLI reference

```
socr process <PDF> [OPTIONS]
  -o, --output-dir PATH       Output directory
  --primary ENGINE             Primary OCR engine (gemini, mistral, marker, deepseek, nougat)
  --fallback ENGINE            Fallback engine
  --no-audit                   Skip quality audit
  --save-figures               Save extracted figure images
  --timeout SECONDS            Subprocess timeout (default: 300)
  --profile NAME               Load ~/.config/socr/{name}.yaml
  --config PATH                Custom YAML config file
  -q, --quiet                  Suppress non-error output
  -v, --verbose                Verbose output
  --dry-run                    List files without processing
  --reprocess                  Force reprocess already-done files

socr batch <DIR> [OPTIONS]
  Same options as process, plus:
  --limit N                    Process first N files

socr engines                   Show available engines
```
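
When driving `socr` from a script, the options above map directly onto an argument list. The sketch below uses only flags listed in this reference; the helper names `build_socr_command` and `run_socr` are hypothetical, and the meaning of the exit code is not documented here, so it is returned unchanged.

```python
import subprocess
from typing import Optional


def build_socr_command(pdf: str, output_dir: Optional[str] = None,
                       primary: Optional[str] = None,
                       fallback: Optional[str] = None,
                       timeout: Optional[int] = None,
                       save_figures: bool = False) -> list[str]:
    """Build the argv for `socr process`, adding only the options that are set."""
    cmd = ["socr", "process", pdf]
    if output_dir:
        cmd += ["-o", output_dir]
    if primary:
        cmd += ["--primary", primary]
    if fallback:
        cmd += ["--fallback", fallback]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    if save_figures:
        cmd.append("--save-figures")
    return cmd


def run_socr(pdf: str, **options) -> int:
    """Run socr as a subprocess and return its exit code as-is."""
    return subprocess.run(build_socr_command(pdf, **options)).returncode
```

For example, `run_socr("paper.pdf", primary="gemini", fallback="marker")` runs `socr process paper.pdf --primary gemini --fallback marker`.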

## Output

```
output/<doc_stem>/
├── <doc_stem>.md        # OCR text
├── metadata.json        # Processing stats
└── figures/             # With --save-figures
    └── figure_1_page3.png
```
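
Downstream code can read a result directory with the standard library alone. The sketch below assumes only the layout shown above; `load_result` is a hypothetical helper, and because the fields inside `metadata.json` are not documented here, the file is returned as a plain dict without interpreting any keys.

```python
import json
from pathlib import Path


def load_result(output_root: str, doc_stem: str) -> dict:
    """Load the markdown, metadata, and figure paths for one processed document."""
    doc_dir = Path(output_root) / doc_stem
    text = (doc_dir / f"{doc_stem}.md").read_text(encoding="utf-8")

    # metadata.json is treated as optional in this sketch.
    meta_path = doc_dir / "metadata.json"
    metadata = {}
    if meta_path.exists():
        metadata = json.loads(meta_path.read_text(encoding="utf-8"))

    # figures/ exists only when --save-figures was used.
    fig_dir = doc_dir / "figures"
    figures = sorted(fig_dir.glob("*.png")) if fig_dir.is_dir() else []

    return {"text": text, "metadata": metadata, "figures": figures}
```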

## Configuration

Create `~/.config/socr/config.yaml`:

```yaml
primary_engine: gemini
fallback_engine: marker
timeout: 300
save_figures: false
audit_enabled: true
audit_min_words: 50
```

Or use named profiles: save one as `~/.config/socr/fast.yaml` and select it with `socr paper.pdf --profile fast`.
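
The file locations involved can be summarised in a short Python sketch. The precedence used here (an explicit `--config` path first, then `--profile`, then the default `config.yaml`) is an assumption rather than documented behaviour, and `resolve_config_path` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional

CONFIG_DIR = Path.home() / ".config" / "socr"


def resolve_config_path(config: Optional[str] = None,
                        profile: Optional[str] = None,
                        config_dir: Path = CONFIG_DIR) -> Path:
    """Pick the YAML file to load (assumed order: --config, --profile, default)."""
    if config:
        return Path(config).expanduser()
    if profile:
        return config_dir / f"{profile}.yaml"
    return config_dir / "config.yaml"
```

With this helper, `resolve_config_path(profile="fast")` points at `~/.config/socr/fast.yaml`, matching the profile example.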

## Engine CLIs

Each backend is an independent CLI tool:

- [gemini-ocr-cli](https://github.com/r-uben/gemini-ocr-cli) — Google Gemini
- [deepseek-ocr-cli](https://github.com/r-uben/deepseek-ocr-cli) — DeepSeek via Ollama
- [mistral-ocr-cli](https://github.com/r-uben/mistral-ocr-cli) — Mistral AI
- [marker-ocr-cli](https://github.com/r-uben/marker-ocr-cli) — Marker (Surya + Texify)
- [nougat-ocr-cli](https://github.com/r-uben/nougat-ocr-cli) — Meta Nougat

## License

MIT
