Metadata-Version: 2.1
Name: hocr-spec
Version: 0.2.0
Summary: Implementation of the hOCR specs
Home-page: https://github.com/kba/hocr-spec-python
Author: Konstantin Baierer
Author-email: konstantin.baierer@gmail.com
License: MIT
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3.3
Classifier: Programming Language :: Python :: 3.4
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Multimedia :: Graphics :: Graphics Conversion
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Classifier: Topic :: Utilities
Description-Content-Type: text/markdown
Requires-Dist: lxml (>=3.5.0)
Requires-Dist: future

# hocr-spec-python
Validation of hOCR close to the specs

<!-- BEGIN-MARKDOWN-TOC -->
* [Rationale](#rationale)
* [Installation](#installation)
* [Command line interface](#command-line-interface)
* [API example](#api-example)

<!-- END-MARKDOWN-TOC -->

## Rationale

[hOCR](https://github.com/kba/hocr-spec) is a flavor of HTML for encoding
the results of Optical Character Recognition (OCR) engines. It is supported by
most OCR engines, such as
[tesseract](https://github.com/tesseract-ocr/tesseract),
[ocropus/ocropy](https://github.com/tmbdev/ocropy) and
[kraken](https://github.com/mittagessen/kraken).

The [hOCR specifications](https://github.com/kba/hocr-spec) is at the same time
very simple (hOCR is just HTML) and hard to implement, due to its terseness and
lack of up-to-date code samples. This project aims to implement the rules
defined by the specs from the ground up to serve as a validation tool and
reference implementation. It is meant to help hOCR implementers and support
tools like [hocr-tools](https://github.com/tmbdev/hocr-tools).

## Installation

Use pip:

```sh
# System-wide:
sudo pip install [--user] hocr-spec
# For current user:
pip install --user hocr-spec
```

From source:

```sh
git clone https://github.com/kba/hocr-spec-python
cd hocr-spec-python
# System-wide:
sudo python setup.py install
# For current user:
python setup.py install --user
```

## Command line interface

<!-- BEGIN-EVAL echo; ./hocr-spec -h |sed 's/^/    /' -->

    usage: hocr-spec [-h] [--format {text,bool,ansi,xml}]
                     [--profile {relaxed,standard}]
                     [--implicit_capabilities CAPABILITY]
                     [--skip-check {attributes,classes,metadata,properties}]
                     [--parse-strict] [--silent]
                     sources [sources ...]

    positional arguments:
      sources               hOCR file to check or '-' to read from STDIN

    optional arguments:
      -h, --help            show this help message and exit
      --format {text,bool,ansi,xml}, -f {text,bool,ansi,xml}
                            Report format
      --profile {relaxed,standard}, -p {relaxed,standard}
                            Validation profile
      --implicit_capabilities CAPABILITY, -C CAPABILITY
                            Enable this capability. Use '*' to enable all
                            capabilities. In addition to the 'ocr*' classes, you
                            can use ['ocrp_dir', 'ocrp_font', 'ocrp_lang',
                            'ocrp_nlp', 'ocrp_poly']
      --skip-check {attributes,classes,metadata,properties}, -X {attributes,classes,metadata,properties}
                            Skip one check
      --parse-strict        Parse HTML with less tolerance for errors
      --silent, -s          Don't produce any output but signal success with exit
                            code.

<!-- END-EVAL -->

## API example

```python
from hocr_spec import HocrValidator

validator = HocrValidator()
report = validator.validate('/path/to/sample.hocr')
print(report.format('xml'))
# <report valid='false'>...</report>
```



