Metadata-Version: 2.2
Name: iiif2annos
Version: 0.0.3
Summary: OCR the IIIF images in a manifest and generate annotations
Home-page: https://github.com/glenrobson/iiif2annos
Author: Glen Robson
Author-email: glen.robson@gmail.com
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pillow<12.0.0
Requires-Dist: requests<3.0.0
Requires-Dist: pytesseract<4.0.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# iiif2annos
Read a manifest, OCR the images, create AnnotationLists, and add them to a copy of the manifest.

This tool uses the [tesseract](https://tesseract-ocr.github.io/) OCR engine. Ensure it is installed and on your `$PATH` before running the tool.
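
A quick way to confirm that tesseract is reachable is to look it up on the `$PATH` from Python. This is only a minimal sketch using the standard library; `find_tesseract` is a hypothetical helper name and is not part of iiif2annos.

```python
import shutil


def find_tesseract(binary="tesseract"):
    """Return the full path of the tesseract executable, or None if it is not on PATH.

    `find_tesseract` is an illustrative helper, not a function provided by iiif2annos.
    """
    return shutil.which(binary)


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH; install it before running ocr.py")
else:
    print(f"tesseract found at {path}")
```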

```
usage: ocr.py [-h] [--base-output-uri OUTPUTURI] [--lang LANG] [-c] manifest output

Read a manifest, OCR all the pages then adds the results as annotation lists

positional arguments:
  manifest              URL to Manifest file
  output                Output directory for annotation lists

options:
  -h, --help            show this help message and exit
  --base-output-uri OUTPUTURI
                        Output URI for annotations and annotation list
  --lang LANG           Language to pass to the OCR engine see: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html
  -c, --confidence      Include OCR confidence value in text of the annotation?
```

This should work with both v2 and v3 manifests. For v2 manifests, AnnotationLists are created; for v3 manifests, AnnotationPages are created.
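
One way to tell the two versions apart is to inspect the manifest's `@context`. The sketch below assumes the standard IIIF Presentation context URIs and is not necessarily how `ocr.py` makes the decision; `presentation_version` and `annotation_container_type` are hypothetical names.

```python
def presentation_version(manifest):
    """Guess the IIIF Presentation API version (2 or 3) from the manifest's @context.

    Returns None when no recognised Presentation context is found.
    """
    context = manifest.get("@context", [])
    # @context may be a single string or a list of entries.
    if isinstance(context, str):
        context = [context]
    for entry in context:
        if not isinstance(entry, str):
            continue
        if "/presentation/3/" in entry:
            return 3
        if "/presentation/2/" in entry:
            return 2
    return None


def annotation_container_type(version):
    """Map a Presentation version to the type of annotation container it uses."""
    return {2: "sc:AnnotationList", 3: "AnnotationPage"}.get(version)


manifest = {"@context": "http://iiif.io/api/presentation/3/context.json"}
print(annotation_container_type(presentation_version(manifest)))
```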

## Example

```
python iiif2annos/ocr.py --lang frk --base-output-uri http://localhost:5500/newspaper https://preview.iiif.io/cookbook/update_newspaper/recipe/0068-newspaper/newspaper_issue_1-manifest.json  newspaper
```
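
To give a rough idea of what happens to each OCR'd word, the sketch below turns word-level OCR output into v3-style annotations. It assumes the input has the dictionary shape that pytesseract's `image_to_data` returns with `output_type=Output.DICT` (parallel lists under `text`, `conf`, `left`, `top`, `width` and `height`), and that image pixels map directly onto canvas coordinates. The function name, the annotation `id` scheme and the confidence format are all illustrative assumptions, not the tool's actual output.

```python
def words_to_annotations(data, canvas_id, base_uri, include_confidence=False):
    """Build one annotation per recognised word.

    `data` is assumed to follow pytesseract's image_to_data DICT layout.
    The id pattern and the "(NN%)" confidence suffix are hypothetical.
    """
    annotations = []
    for i, text in enumerate(data["text"]):
        word = text.strip()
        conf = float(data["conf"][i])
        # Skip empty entries and layout rows, which carry a confidence of -1.
        if not word or conf < 0:
            continue
        x, y, w, h = (data[key][i] for key in ("left", "top", "width", "height"))
        value = f"{word} ({conf:.0f}%)" if include_confidence else word
        annotations.append({
            "id": f"{base_uri}/annotation/{len(annotations)}",
            "type": "Annotation",
            "motivation": "supplementing",
            "body": {"type": "TextualBody", "value": value, "format": "text/plain"},
            # The word's bounding box is expressed as an xywh fragment on the canvas.
            "target": f"{canvas_id}#xywh={x},{y},{w},{h}",
        })
    return annotations


sample = {
    "text": ["", "Hello"],
    "conf": ["-1", "96"],
    "left": [0, 10], "top": [0, 20], "width": [100, 30], "height": [100, 12],
}
print(words_to_annotations(sample, "https://example.org/canvas/1",
                           "http://localhost:5500/newspaper"))
```

Passing `include_confidence=True` mirrors the idea behind the `-c` flag: the confidence value is appended to the text of the annotation rather than stored separately.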


This tool was written using the following resources as a guide:

 * https://nanonets.com/blog/ocr-with-tesseract/#ocr-with-pytesseract-and-opencv 
 * https://pypi.org/project/pytesseract/
