Metadata-Version: 2.4
Name: cthulhu-eval
Version: 0.3.0
Summary: Model-agnostic Python package for evaluating HTR / OCR models.
Author: Lucas Terriel, Alix Chagué
Author-email: lucas.terriel@inria.fr, alix.chague@inria.fr
License: MIT
Keywords: HTR,OCR,evaluation,metrics,handwritten text recognition,optical character recognition
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: python-Levenshtein>=0.21
Requires-Dist: termcolor>=1.1.0
Requires-Dist: Unidecode>=1.3
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: keywords
Dynamic: license
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Cthulhu

**Cthulhu** is a lightweight Python library for evaluating the quality of HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition) transcriptions.

It compares a **ground truth** against a **prediction** and returns a set of standard metrics (WER, CER, Levenshtein distance, etc.), with optional text normalisation transforms applied before scoring.

> Originally derived from [kami-lib](https://gitlab.inria.fr/dh-projects/kami/kami-lib) — redesigned as a standalone evaluation toolkit with no deep-learning dependencies.

---

## Features

- **Flexible input** — accepts plain strings, `.txt` files, ALTO XML, and PAGE-XML
- **Rich metrics** — WER, CER, WACC, MER, CIP, CIL, ROUGE-1/2/L, Levenshtein, Hamming, weighted variants
- **Text transforms** — normalise both sequences before scoring (remove digits, punctuation, diacritics, change case)
- **No heavy dependencies** — no Kraken, no PyTorch; pure Python + three lightweight packages

---

## Installation

```bash
pip install cthulhu-eval
```

**Dependencies:**

| Package | Role |
|---|---|
| `python-Levenshtein >= 0.21` | Fast C-extension for edit distance |
| `Unidecode >= 1.3` | Diacritics removal |
| `termcolor >= 1.1` | Coloured log output |

**Python:** 3.9 or later

---

## Quick start

```python
from cthulhu.Cthulhu import Cthulhu

# Two plain strings
k = Cthulhu(["ground truth text", "predicted text"])
print(k.scores.board)   # full metrics dict
print(k.scores.wer)     # word error rate
print(k.scores.cer)     # character error rate
```

---

## Input formats

`data`, the first argument to `Cthulhu`, must be a list of exactly two elements: the ground truth first, then the prediction. Each element can be:

| Type | Example |
|---|---|
| Plain string | `"mon texte de référence"` |
| Text file | `"./gt.txt"` |
| ALTO XML (v2–v4) | `"./document_alto.xml"` |
| PAGE-XML (PcGts) | `"./document_page.xml"` |

Mix and match freely:

```python
# XML ground truth vs plain-text prediction
k = Cthulhu(["./gt_alto.xml", "./prediction.txt"])

# Two XML files
k = Cthulhu(["./gt_page.xml", "./pred_page.xml"])

# String vs text file
k = Cthulhu(["reference string", "./pred.txt"])
```
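
To give an idea of what reading the XML formats involves, here is a minimal sketch that pulls line text out of ALTO and PAGE-XML with the standard library only. It is not the code in `parser_xml.py`: the names `local_name` and `extract_text` are hypothetical, and the sketch assumes ALTO stores words in `String` elements with a `CONTENT` attribute and PAGE-XML stores line text in `TextLine/TextEquiv/Unicode`.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def extract_text(xml_source: str) -> str:
    """Return one line of text per TextLine (hypothetical helper)."""
    root = ET.fromstring(xml_source)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # ALTO: words are String elements carrying a CONTENT attribute.
        words = [s.get("CONTENT", "") for s in elem.iter()
                 if local_name(s.tag) == "String"]
        if words:
            lines.append(" ".join(words))
            continue
        # PAGE-XML: the line text sits in TextEquiv/Unicode under the TextLine.
        for child in elem:
            if local_name(child.tag) == "TextEquiv":
                for uni in child:
                    if local_name(uni.tag) == "Unicode" and uni.text:
                        lines.append(uni.text)
    return "\n".join(lines)


ALTO_SAMPLE = (
    '<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#"><Layout><Page>'
    '<PrintSpace><TextBlock><TextLine>'
    '<String CONTENT="Six"/><SP/><String CONTENT="semaines"/>'
    '</TextLine></TextBlock></PrintSpace></Page></Layout></alto>'
)
PAGE_SAMPLE = (
    '<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">'
    '<Page><TextRegion><TextLine><TextEquiv><Unicode>plus tard</Unicode>'
    '</TextEquiv></TextLine></TextRegion></Page></PcGts>'
)

print(extract_text(ALTO_SAMPLE))  # Six semaines
print(extract_text(PAGE_SAMPLE))  # plus tard
```

Matching on the local tag name rather than a fixed namespace is one way to cope with several schema versions at once; the library may handle versions differently.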

---

## Metrics

All metrics are available via `k.scores.board` (dict) or as individual attributes on `k.scores`.

| Metric | Attribute | Description |
|---|---|---|
| Levenshtein distance (char) | `lev_distance_char` | Edit distance at character level |
| Levenshtein distance (word) | `lev_distance_words` | Edit distance at word level |
| Hamming distance | `hamming` | Char-level; `"Ø"` if lengths differ |
| WER | `wer` | Word Error Rate |
| CER | `cer` | Character Error Rate |
| WACC | `wacc` | Word Accuracy (`1 − WER`) |
| WER Hunt | `wer_hunt` | WER with halved insertion/deletion costs |
| MER | `mer` | Match Error Rate |
| CIP | `cip` | Character Information Preserved |
| CIL | `cil` | Character Information Lost |
| ROUGE-1 | `rouge_1` | Unigram overlap (precision, recall, F1) |
| ROUGE-2 | `rouge_2` | Bigram overlap (precision, recall, F1) |
| ROUGE-L | `rouge_l` | Longest Common Subsequence (precision, recall, F1) |
| Hits | `hits` | Matching characters |
| Substitutions | `substs` | Character substitutions |
| Deletions | `deletions` | Character deletions |
| Insertions | `insertions` | Character insertions |
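
As a rough illustration of how the core metrics relate to each other, the sketch below computes Levenshtein distance, CER, WER, WACC and Hamming distance in plain Python. It is not the library's implementation (which relies on `python-Levenshtein`): the function names are hypothetical, rates are normalised by the reference length (the usual definition), and a non-empty reference is assumed.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def cer(reference, prediction):
    """Character Error Rate: character edits per reference character."""
    return levenshtein(reference, prediction) / len(reference)


def wer(reference, prediction):
    """Word Error Rate: word edits per reference word."""
    ref_words = reference.split()
    return levenshtein(ref_words, prediction.split()) / len(ref_words)


def hamming(reference, prediction):
    """Count of differing positions, or "Ø" when the lengths differ."""
    if len(reference) != len(prediction):
        return "Ø"
    return sum(a != b for a, b in zip(reference, prediction))


ref = "Six semaines plus tard"
pred = "Six semaiNEs plus tard"
print(levenshtein(ref, pred))  # 2 character edits
print(wer(ref, pred))          # 0.25 (1 wrong word out of 4)
print(1 - wer(ref, pred))      # 0.75, i.e. WACC
print(hamming(ref, pred))      # 2
```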

### Custom error weights

```python
k = Cthulhu(
    ["reference", "prediction"],
    insertion_cost=0.5,
    deletion_cost=0.5,
    substitution_cost=1.0,
)
```
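
The weights change what each edit operation costs inside the edit-distance computation. The following is a minimal sketch of that idea, not the library's code: `weighted_distance` is a hypothetical function whose parameters simply mirror the keyword arguments above, and how the library normalises the weighted distance into a rate is not covered.

```python
def weighted_distance(ref, hyp, insertion_cost=1.0,
                      deletion_cost=1.0, substitution_cost=1.0):
    """Edit distance where each operation has its own cost."""
    # Turning an empty reference into hyp[:j] takes j insertions.
    prev = [j * insertion_cost for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        # Turning ref[:i] into an empty prediction takes i deletions.
        curr = [i * deletion_cost]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + deletion_cost,
                curr[j - 1] + insertion_cost,
                prev[j - 1] + (0.0 if r == h else substitution_cost),
            ))
        prev = curr
    return prev[-1]


# One extra character in the prediction costs a single insertion.
print(weighted_distance("abc", "abcd", insertion_cost=0.5))  # 0.5
# One missing character costs a single deletion.
print(weighted_distance("abcd", "abc", deletion_cost=0.5))   # 0.5
# A substitution keeps its full cost.
print(weighted_distance("abc", "abd", insertion_cost=0.5,
                        deletion_cost=0.5))                  # 1.0
```

With insertions and deletions at 0.5 and substitutions at 1.0, the costs match the halved insertion/deletion weighting described for `wer_hunt` in the table above.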

### Display options

```python
k = Cthulhu(
    ["reference", "prediction"],
    percent=True,        # express rates as percentages (e.g. 17.2 instead of 0.172)
    truncate=True,       # truncate floats
    round_digits=".001", # precision as a decimal string (".001" keeps three decimal places)
)
```
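
One way to picture how these three options interact is the sketch below, built on the standard `decimal` module. It rests on assumptions: `format_score` is a hypothetical helper, `truncate` is taken to mean "cut off extra digits instead of rounding", and nothing here confirms that the library uses `decimal` internally.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP


def format_score(value, percent=False, truncate=False, round_digits=".01"):
    """Scale, then round or truncate a score to the given precision."""
    number = Decimal(str(value))
    if percent:
        number *= 100
    # ROUND_DOWN drops the extra digits; ROUND_HALF_UP rounds to nearest.
    rounding = ROUND_DOWN if truncate else ROUND_HALF_UP
    return float(number.quantize(Decimal(round_digits), rounding=rounding))


print(format_score(0.1725, round_digits=".001"))                 # 0.173
print(format_score(0.1725, truncate=True, round_digits=".001"))  # 0.172
print(format_score(0.172496, percent=True, truncate=True,
                   round_digits=".001"))                         # 17.249
```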

---

## Text transforms

Apply normalisation steps before computing metrics.
Each transform is scored individually **and** all together, letting you see the impact of each choice.

```python
k = Cthulhu(
    ["Déjà 13 fois, Maxime !", "deja 13 fois maxime"],
    apply_transforms="XP",  # remove diacritics + punctuation
)

# k.scores.board contains:
# {
#   "default":           {...},  # raw scores
#   "remove_diacritics": {...},  # after X only
#   "remove_punctuation":{...},  # after P only
#   "all_transforms":    {...},  # after X + P combined
#   "Length_reference":  ...,
#   "Total_diacritics_removed_from_reference": ...,
#   ...
# }
```

**Transform codes:**

| Code | Name | Effect |
|---|---|---|
| `D` | Remove digits | `"1871"` → `""` |
| `U` | Uppercase | `"texte"` → `"TEXTE"` |
| `L` | Lowercase | `"TEXTE"` → `"texte"` |
| `P` | Remove punctuation | `"Bonjour !"` → `"Bonjour "` |
| `X` | Remove diacritics | `"étaient"` → `"etaient"` |

Codes can be combined freely: `"XP"`, `"DLP"`, `"XPLU"`, etc.
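
The effect of each code can be approximated with the standard library, as in the sketch below. This is an illustration only: `apply_codes` and the helper functions are hypothetical names, codes are applied left to right (an assumption about ordering), diacritics are stripped with `unicodedata` rather than `Unidecode`, and punctuation is limited to the ASCII set in `string.punctuation`.

```python
import string
import unicodedata


def remove_diacritics(text):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_digits(text):
    return "".join(ch for ch in text if not ch.isdigit())


TRANSFORMS = {
    "D": remove_digits,
    "U": str.upper,
    "L": str.lower,
    "P": remove_punctuation,
    "X": remove_diacritics,
}


def apply_codes(text, codes):
    """Apply each one-letter code in order (hypothetical helper)."""
    for code in codes:
        text = TRANSFORMS[code](text)
    return text


print(repr(apply_codes("étaient", "X")))                  # 'etaient'
print(repr(apply_codes("Bonjour !", "P")))                # 'Bonjour '
print(repr(apply_codes("Déjà 13 fois, Maxime !", "XP")))  # 'Deja 13 fois Maxime '
```

Note that `Unidecode` transliterates more aggressively than this decomposition approach, so results can differ for characters that have no combining-mark form.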

You can also use the transformation classes directly:

```python
from cthulhu.preprocessing.transformation import (
    ToCompose,
    RemoveDiacritics,
    RemovePunctuation,
    ToLowerCase,
    RemoveNonUsefulWords,
    RemoveDigits,
    RemoveSpecificWords,
    SubRegex,
    Strip,
)

# Apply a chain of transforms to any string or list of strings
result = ToCompose(
    ["Déjà 13 fois, Maxime !", ""],
    [RemoveDiacritics(), RemovePunctuation(), ToLowerCase(), RemoveNonUsefulWords()]
)
print(result.reference)  # "deja 13 fois maxime"
```

---

## Using `Scorer` directly

For lower-level access, use the `Scorer` class without the `Cthulhu` facade:

```python
from cthulhu.metrics.evaluation import Scorer

scorer = Scorer(
    reference="Six semaines plus tard",
    prediction="Six semaiNEs plus tard",
    show_percent=True,
    truncate_score=True,
    round_digits=".001",
)

print(scorer.wer)
print(scorer.cer)
print(scorer.board)
```

---

## Project structure

```
cthulhu/
├── cthulhu/
│   ├── Cthulhu.py           # Main facade class
│   ├── metrics/
│   │   ├── _base_metrics.py # Encoding helpers, rounding utilities
│   │   └── evaluation.py    # Scorer class
│   ├── parser/
│   │   ├── parser_xml.py    # ALTO / PAGE-XML parser (stdlib only)
│   │   └── parser_text.py   # Plain-text file reader
│   ├── preprocessing/
│   │   └── transformation.py # Transform classes
│   └── utils/
│       └── _utils.py        # Logging, timing decorator
├── tests/
├── datatest/
├── requirements.txt
└── setup.py
```

---

## Running tests

```bash
python -m pytest tests/ -v
```

The suite contains 19 tests and runs in about 0.06 s.

---

## Roadmap

- [ ] API-based HTR inference — send an image to an external model endpoint and compare the result against a ground-truth XML, without any local model

---

## Authors & licence

Original work by **Alix Chagué** and **Lucas Terriel** (Inria).
MIT Licence.
