Metadata-Version: 2.4
Name: ocr_stringdist
Version: 0.0.3
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Operating System :: OS Independent
License-File: LICENSE
Summary: String distances considering OCR errors.
Author: Niklas von Moers <niklasvmoers@protonmail.com>
Author-email: Niklas von Moers <niklasvmoers@protonmail.com>
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: repository, https://github.com/NiklasvonM/ocr-stringdist

# OCR-StringDist

A Python library for string distance calculations that account for common OCR (optical character recognition) errors.

[![PyPI](https://img.shields.io/badge/PyPI-Package-blue)](https://pypi.org/project/ocr-stringdist/)
[![License](https://img.shields.io/badge/License-MIT-green)](LICENSE)

## Overview

OCR-StringDist provides specialized string distance algorithms that account for optical character recognition (OCR) errors. Unlike traditional string comparison algorithms, it considers common OCR confusions (such as "0" vs "O" or "6" vs "G") when calculating the distance between strings.

> **Note:** This project is in early development. APIs may change in future releases.

## Installation

```bash
pip install ocr-stringdist
```

## Features

- **Weighted Levenshtein Distance**: An adaptation of the classic Levenshtein algorithm with custom substitution costs for character pairs that OCR models commonly confuse.
- **Pre-defined OCR Distance Map**: A built-in distance map for common OCR confusions (e.g., "0" vs "O", "1" vs "l", "5" vs "S").
- **Customizable Cost Maps**: Create your own substitution cost maps for specific OCR systems or domains.
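
To illustrate the idea behind a weighted Levenshtein distance, here is a minimal pure-Python sketch. It is not the library's implementation: the function name `weighted_levenshtein_sketch` is hypothetical, and the sketch assumes that insertions and deletions always cost 1.0 and that only substitutions are looked up in the cost map.

```python
from typing import Dict, Optional, Tuple

CostMap = Dict[Tuple[str, str], float]


def weighted_levenshtein_sketch(
    s1: str,
    s2: str,
    cost_map: Optional[CostMap] = None,
    default_cost: float = 1.0,
) -> float:
    """Illustrative weighted Levenshtein distance (hypothetical helper).

    Assumptions: insertions and deletions cost 1.0; a substitution costs
    cost_map[(a, b)] if present, otherwise default_cost.
    """
    cost_map = cost_map or {}
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = [float(j) for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        curr = [float(i)]
        for j, c2 in enumerate(s2, start=1):
            if c1 == c2:
                substitution = 0.0
            else:
                substitution = cost_map.get((c1, c2), default_cost)
            curr.append(
                min(
                    prev[j] + 1.0,               # delete c1
                    curr[j - 1] + 1.0,           # insert c2
                    prev[j - 1] + substitution,  # substitute c1 with c2
                )
            )
        prev = curr
    return prev[-1]
```

With a cost map such as `{("5", "S"): 0.3}` (an illustrative value, not one taken from the built-in map), the sketch scores `"OCR5"` against `"OCRS"` at 0.3 instead of the 1.0 that the unweighted algorithm would give.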

## Usage

```python
import ocr_stringdist as osd

# Using default OCR distance map
distance = osd.weighted_levenshtein_distance("OCR5", "OCRS")
print(f"Distance between 'OCR5' and 'OCRS': {distance}")  # Will be less than 1.0

# Custom cost map
custom_map = {("f", "t"): 0.2, ("m", "n"): 0.1}
distance = osd.weighted_levenshtein_distance(
    "first", "tirst",
    cost_map=custom_map,
    symmetric=True,
    default_cost=1.0
)
print(f"Distance with custom map: {distance}")
```
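
The `symmetric=True` argument above presumably makes the cost of confusing "f" with "t" apply in both directions. The following sketch shows one way such an expansion could work; the helper `symmetrize` is hypothetical and is not part of the library's API, and the library's actual handling may differ.

```python
from typing import Dict, Tuple

CostMap = Dict[Tuple[str, str], float]


def symmetrize(cost_map: CostMap) -> CostMap:
    """Return a copy of cost_map that also contains each reversed pair.

    Hypothetical helper: an explicitly listed reverse entry keeps its own cost.
    """
    expanded = dict(cost_map)
    for (a, b), cost in cost_map.items():
        # Only add the reversed pair if the caller did not define it.
        expanded.setdefault((b, a), cost)
    return expanded


custom_map = {("f", "t"): 0.2, ("m", "n"): 0.1}
symmetric_map = symmetrize(custom_map)
```

After this call, `symmetric_map` contains `("t", "f")` and `("n", "m")` with the same costs as their counterparts.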

## Acknowledgements

This project is inspired by [jellyfish](https://github.com/jamesturk/jellyfish), which provides the base implementations of the algorithms used here.

