Metadata-Version: 2.1
Name: htrvx
Version: 0.0.2
Summary: HTRVX, HTR Validation with XSD
Home-page: https://github.com/htr-united/htrvx
Author: Thibault Clérice & Ariane Pinche
License: MIT
Platform: UNKNOWN
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.6.0
Description-Content-Type: text/markdown
Requires-Dist: click (<=9.0)
Requires-Dist: lxml
Requires-Dist: requests


# HTRVX : HTR Validation for eXtra-quality controlled documents

HTRVX (pronounced Ashterux) provides quality control for XML files through XSD schema validation, Segmonto validation and other checks.

## How to install

Simply run `pip install htrvx`

## How to run

The basic way to run the script is `htrvx PATHTOFILES --format FORMAT`, e.g. `htrvx ./tests/test_data/page/*.xml --format page`

Every verification is opt-in: a check only runs if you pass the corresponding flag.

- `--segmonto` will check for Segmonto compliance
- `--xsd` will check if the data are compliant with XML Schemas
- `--check-empty` will check if regions have no lines or if lines have no text
    - `--check-empty` can be refined with `--raise-empty` to throw an error if empty elements are found; without it, they are only reported.
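
To make the idea behind `--check-empty` concrete, here is a minimal sketch of such a check on an ALTO document. It is not HTRVX's actual implementation: it uses only the standard library, the function names `local_name` and `find_empty_elements` and the `empty-region` / `empty-line` labels are hypothetical, and it assumes regions are `TextBlock` elements, lines are `TextLine` elements and text is carried by the `CONTENT` attribute of `String` elements.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def find_empty_elements(alto_xml: str) -> list:
    """Return (kind, ID) pairs for regions without lines and lines without text.

    Hypothetical helper for illustration only, not part of HTRVX.
    """
    root = ET.fromstring(alto_xml)
    problems = []
    for element in root.iter():
        name = local_name(element.tag)
        if name == "TextBlock":
            # A region counts as empty when it holds no TextLine at all.
            has_line = any(
                local_name(child.tag) == "TextLine" for child in element.iter()
            )
            if not has_line:
                problems.append(("empty-region", element.get("ID")))
        elif name == "TextLine":
            # A line counts as empty when its String children carry no text.
            text = "".join(
                child.get("CONTENT", "")
                for child in element.iter()
                if local_name(child.tag) == "String"
            )
            if not text.strip():
                problems.append(("empty-line", element.get("ID")))
    return problems


# Tiny hand-written ALTO fragment used only to exercise the sketch.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="r1"/>
    <TextBlock ID="r2">
      <TextLine ID="l1"><String CONTENT="Lorem ipsum"/></TextLine>
      <TextLine ID="l2"><String CONTENT=""/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    for kind, element_id in find_empty_elements(SAMPLE):
        print(kind, element_id)
```

In this sketch, reporting the returned pairs corresponds to the default behaviour, while raising an exception when the list is not empty would mirror what `--raise-empty` is described as doing.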

Other parameters mainly have to do with verbosity: `--verbose` displays details about errors, `--group` groups errors (instead of showing one line per error, groups by error types).
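
As an illustration of what grouping by error type means, the sketch below collapses a list of individual errors into one entry per type. The `(error_type, filename)` tuple shape and the `group_errors` name are assumptions made for this example; HTRVX's internal error representation and output format may differ.

```python
from collections import defaultdict


def group_errors(errors):
    """Group (error_type, filename) pairs by error type.

    Hypothetical helper: the tuple layout is an assumption for illustration.
    """
    grouped = defaultdict(list)
    for error_type, filename in errors:
        grouped[error_type].append(filename)
    return dict(grouped)


# Made-up errors, one entry per finding (the ungrouped view).
EXAMPLE_ERRORS = [
    ("empty-line", "page1.xml"),
    ("empty-region", "page1.xml"),
    ("empty-line", "page2.xml"),
]

if __name__ == "__main__":
    for error_type, files in group_errors(EXAMPLE_ERRORS).items():
        # One summary line per error type instead of one line per error.
        print(f"{error_type}: {len(files)} occurrence(s) in {', '.join(files)}")
```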

| Parameters               | Default | Function                                                    |
|--------------------------|---------|-------------------------------------------------------------|
| -v, --verbose            | False   | Prints more information                                     |
| -f, --format [alto,page] | ALTO    | Format of files                                             |
| -s, --segmonto           | False   | Apply Segmonto Zoning verification                          |
| -e, --check-empty        | False   | Check for empty lines or empty zones                        |
| -r, --raise-empty        | False   | Fail, rather than only warn, on empty lines or empty zones  |
| -x, --xsd                | False   | Apply XSD Schema verification                               |
| -g, --group              | False   | Group error types (reduce verbosity)                        |
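
If you call HTRVX from a Python script, the flags in the table above can be assembled programmatically. The following is a minimal sketch, assuming `htrvx` is installed and on the `PATH`; `build_htrvx_command` is a hypothetical helper, not part of the package, and its refusal of `--raise-empty` without `--check-empty` is a choice of this sketch rather than documented CLI behaviour.

```python
import shlex


def build_htrvx_command(paths, fmt="alto", segmonto=False, xsd=False,
                        check_empty=False, raise_empty=False,
                        verbose=False, group=False):
    """Assemble an argv list for the htrvx CLI from the documented flags.

    Hypothetical helper for illustration only.
    """
    if fmt not in ("alto", "page"):
        raise ValueError("fmt must be 'alto' or 'page'")
    if raise_empty and not check_empty:
        # --raise-empty is documented as a refinement of --check-empty,
        # so this sketch rejects it on its own (the real CLI may not).
        raise ValueError("raise_empty requires check_empty")

    argv = ["htrvx", "--format", fmt]
    # Map each boolean option to its long flag, in a fixed order.
    flags = [
        ("--verbose", verbose),
        ("--group", group),
        ("--segmonto", segmonto),
        ("--xsd", xsd),
        ("--check-empty", check_empty),
        ("--raise-empty", raise_empty),
    ]
    argv.extend(flag for flag, enabled in flags if enabled)
    argv.extend(paths)
    return argv


if __name__ == "__main__":
    command = build_htrvx_command(
        ["data/page1.xml", "data/page2.xml"],  # placeholder file names
        fmt="page", xsd=True, check_empty=True,
    )
    # Show the command as it would be typed in a shell.
    print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to actually run the validation.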

## GitHub Action code

If you want to add this to your GitHub repository as a continuous integration workflow, add a file `htrux.yml` in the `.github/workflows` directory of your repository.


```yaml
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: HTRVX

on: [push, pull_request] # You can edit this of course !

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python 3.8
      uses: actions/setup-python@v2
      with:
        python-version: 3.8
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install htrvx
    - name: Run HTRVX
      run: |
        htrvx --verbose --group --format alto --segmonto --xsd --check-empty --raise-empty UNIX/Path/to/**/your/*.xml

```

