Metadata-Version: 2.4
Name: bambara-text-normalizer
Version: 1.0.0
Summary: Text normalization (TN/ITN) and ASR evaluation framework for Bambara (Bamanankan) language processing
Author-email: sudoping01 <sudoping01@gmail.com>
Maintainer-email: sudoping01 <sudoping01@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/MALIBA-AI/bambara-normalizer
Project-URL: Documentation, https://github.com/MALIBA-AI/bambara-normalizer
Project-URL: Repository, https://github.com/MALIBA-AI/bambara-normalizer
Project-URL: Issues, https://github.com/MALIBA-AI/bambara-normalizer/issues
Keywords: bambara,bamanankan,asr,speech-recognition,text-normalization,wer,nlp,african-languages,manding,mali
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: jiwer>=3.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: isort>=5.0; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=6.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.0; extra == "docs"
Dynamic: license-file

<h1 align="center">
    Bambara Text Normalizer
</h1>
<p align="center">
    <em>Text Normalization & ASR Evaluation Framework for Bambara (Bamanankan)</em>
</p>
<p align="center">
    <a href="#installation">Installation</a> •
    <a href="#text-normalization">Normalization</a> •
    <a href="#asr-evaluation-framework">ASR Evaluation</a> •
    <a href="#contraction-modes">Modes</a> •
    <a href="#command-line-interface">CLI</a> •
    <a href="#linguistic-decisions">Linguistics</a> •
    <a href="#references">References</a>
</p>

---

## Purpose

This tool serves **two complementary purposes** for Bambara language processing:

| Purpose | Description |
|---------|-------------|
| **Text Normalization** | Standardize Bambara text for any downstream NLP task (TTS, MT, NER, etc.) |
| **ASR Evaluation** | Fair WER/CER computation that accounts for valid orthographic variation |

> [!NOTE]
> Bambara orthography allows variation: the same utterance can be written as `k'a ta` or `ka a ta`  both are correct. Without normalization, evaluation metrics unfairly penalize models for human writing inconsistencies rather than actual recognition errors.

---

## Installation

```bash
pip install git+https://github.com/sudoping01/bambara-text-normalization.git
```

---

## Text Normalization



```python
from bambara_normalizer import normalize

# Default: expand contractions
normalize("⁠Ne k’a ma ko ayi")           
normalize("⁠K’ale t’a fɛ k’a kɛ")    
normalize("⁠K’i k’i janto i yɛrɛ la")        


# Contract mode: collapse expanded forms
normalize("Ne ko a ma ko ayi", mode="contract")    
normalize("Ko ale tɛ a fɛ ka o kɛ", mode="contract")    
normalize("Ko i ka i janto i yɛrɛ la", mode="contract")    

# Preserve mode: don't touch contractions
normalize("K’i k’i janto i yɛrɛ la", mode="preserve")     
```

### Custom Settings

```python
from bambara_normalizer import normalize

# Full control over normalization
text = normalize(
    "Ka na son k’o k’a la",
    mode="expand",                      # "expand" | "contract" | "preserve"
    preserve_tones=False,               
    normalize_legacy_orthography=True, 
    lowercase=True,                     
    remove_punctuation=False,           
    normalize_whitespace=True,         
    normalize_apostrophes=True,         
    normalize_special_chars=True,    
    expand_dates = False,
    expand_numbers=False,  
    expand_times=False,            
    remove_diacritics_except_tones=False,  
    handle_french_loanwords=True,   
    strip_repetitions=False,       
    normalize_compounds=True, 
)
```

### Using BambaraNormalizer Class

For repeated normalization with consistent settings:

```python
from bambara_normalizer import BambaraNormalizer, BambaraNormalizerConfig


config = BambaraNormalizerConfig(contraction_mode="expand") # change it to "contract" or preserve
normalizer = BambaraNormalizer(config)
normalizer("A y'a fɔ")      
normalizer("k'a la")   

# Contraction mode
config = BambaraNormalizerConfig(contraction_mode="contract")
normalizer = BambaraNormalizer(config)
normalizer("bɛ a fɔ")     
normalizer("ka a ta")     
```

### Predefined Configuration Presets

```python
from bambara_normalizer import BambaraNormalizer, BambaraNormalizerConfig

# For WER evaluation (aggressive normalization, removes tones)
normalizer = BambaraNormalizer(BambaraNormalizerConfig.for_wer_evaluation())

# For WER with contract mode
normalizer = BambaraNormalizer(BambaraNormalizerConfig.for_wer_evaluation(mode="contract"))

# For CER evaluation
normalizer = BambaraNormalizer(BambaraNormalizerConfig.for_cer_evaluation())

# Preserve tone marks
normalizer = BambaraNormalizer(BambaraNormalizerConfig.preserving_tones())

# Minimal normalization (only essential fixes)
normalizer = BambaraNormalizer(BambaraNormalizerConfig.minimal())
```


---

## Number Normalization

The normalizer supports bidirectional number conversion between digits and Bambara words (TN/ITN).

### With Normalizer
```python
from bambara_normalizer import normalize

normalize("A ye 100 sɔrɔ", expand_numbers=True)   # => "a ye kɛmɛ sɔrɔ"
normalize("A ye 100 sɔrɔ", expand_numbers=False)  # => "a ye 100 sɔrɔ"

# WER preset has expand_numbers=True by default
normalize("A ye 5 ta", preset="wer")  # => "a ye duuru ta"
```

### Digits to Words (Text Normalization)
```python
from bambara_normalizer import number_to_bambara, normalize_numbers_in_text


number_to_bambara(5)        # => "duuru"
number_to_bambara(123)      # => "kɛmɛ ni mugan ni saba"
number_to_bambara(1000)     # => "waa kelen"
number_to_bambara(5.3)      # => "duuru tomi saba"

# In text
normalize_numbers_in_text("A ye 5 wari di")      # => "A ye duuru wari di"
normalize_numbers_in_text("Mɔgɔ 100 nana")       # => "Mɔgɔ kɛmɛ nana"
normalize_numbers_in_text("N ye shekɛ 1000 sɔrɔ") # => "N ye shekɛ waa kelen sɔrɔ"
```

### Words to Digits (Inverse Text Normalization)
```python
from bambara_normalizer import bambara_to_number, denormalize_numbers_in_text

bambara_to_number("duuru")                    # => 5
bambara_to_number("kɛmɛ ni mugan ni saba")    # => 123
bambara_to_number("duuru tomi saba")          # => 5.3

# In text
denormalize_numbers_in_text("A ye duuru di a ma")  # => "A ye 5 di a ma"
denormalize_numbers_in_text("Mɔgɔ kɛmɛ nana")      # => "Mɔgɔ 100 nana"
```



### Number Vocabulary

| Value | Bambara | Value | Bambara |
|-------|---------|-------|---------|
| 0 | fu | 10 | tan |
| 1 | kelen | 20 | mugan |
| 2 | fila | 30 | bi saba |
| 3 | saba | 40 | bi naani |
| 4 | naani | 50 | bi duuru |
| 5 | duuru | 100 | kɛmɛ |
| 6 | wɔɔrɔ | 1000 | waa |
| 7 | wolonwula | 1,000,000 | miliyɔn |
| 8 | seegin | decimal | tomi |
| 9 | kɔnɔntɔn | connector | ni |

---
---

## Date Normalization

The normalizer supports bidirectional date conversion between standard formats and Bambara expressions (TN/ITN).

### With Normalizer
```python
from bambara_normalizer import normalize

normalize("A bɛ na 13-10-2024 la", expand_dates=True)   # ==> "a bɛ na oktɔburu tile tan ni saba san Baa fila ni mugan ni naani la"
normalize("A bɛ na 13-10-2024 la", expand_dates=False)  # => "a bɛ na 13-10-2024 la"

# WER preset has expand_dates=True by default
normalize("A bɛ na 25-01-2008 la", preset="wer")  # => "a bɛ na zanwuye tile mugan ni duuru san Baa fila ni seegin la"
```

### Date to Bambara (Text Normalization)
```python
from bambara_normalizer import date_to_bambara, format_date_bambara, normalize_dates_in_text
from datetime import date

# Single dates
date_to_bambara(2024, 10, 13)      # => "Oktɔburu tile tan ni saba san baa fila ni mugan ni naani"
date_to_bambara(2008, 1, 25)       # => "Zanwuye tile mugan ni duuru san baa fila ni seegin"

# With "kalo" (month) included
date_to_bambara(2024, 10, 13, include_kalo=True)  # => "Oktɔburu kalo tile tan ni saba san ..."

# With day of week
date_to_bambara(2024, 10, 13, include_day_of_week=True)  # => "Kari Oktɔburu tile ..." (Sunday)

# From date object or string
format_date_bambara(date(2024, 10, 13))  # => "Oktɔburu tile tan ni saba san ..."
format_date_bambara("13-10-2024")        # => "Oktɔburu tile tan ni saba san ..."

# In text
normalize_dates_in_text("A bɛ na 13-10-2024 la")  # => "A bɛ na Oktɔburu tile tan ni saba san baa fila ni mugan ni naani la"
```

### Bambara to Date (Inverse Text Normalization)
```python
from bambara_normalizer import bambara_to_date

bambara_to_date("Oktɔburu tile tan ni saba san baa fila ni mugan ni naani")
# => datetime.date(2024, 10, 13)

bambara_to_date("Zanwuye tile mugan ni duuru san baa fila ni seegin")
# => datetime.date(2008, 1, 25)
```

### Date Format

Bambara dates follow this structure:
```
[Month] (kalo) tile [day] san [year]
```

Example: **13-10-2024** => `Oktɔburu tile tan ni saba san baa fila ni mugan ni naani`

Literal translation: "October day thirteen year two thousand twenty-four"

### Days of the Week

| Day | Bambara | Day | Bambara |
|-----|---------|-----|---------|
| Monday | Tɛnɛn | Friday | Juma |
| Tuesday | Tarata | Saturday | Sibiri |
| Wednesday | Araba | Sunday | Kari |
| Thursday | Alamisa | | |

### Months of the Year

| Month | Bambara | Month | Bambara |
|-------|---------|-------|---------|
| January | Zanwuye | July | Zuluye |
| February | Feburuye | August | Uti |
| March | Marsi | September | Sɛtanburu |
| April | Awirili | October | Oktɔburu |
| May | Mɛ | November | Nɔwanburu |
| June | Zuwen | December | Desanburu |



---

## Time Normalization

The normalizer supports bidirectional time and duration conversion between standard formats and Bambara expressions (TN/ITN).

### With Normalizer
```python
from bambara_normalizer import normalize

normalize("A nana 7:30 la", expand_times=True)   # => "a nana nɛgɛ kaɲɛ wolonwula ni sanga bi saba la"
normalize("A nana 7:30 la", expand_times=False)  # => "a nana 7:30 la"

# WER preset has expand_times=True by default
normalize("A nana 13:50 la", preset="wer")  # => "a nana nɛgɛ kaɲɛ tan ni saba ni sanga bi duuru la"
```

### Clock Time to Bambara (Text Normalization)
```python
from bambara_normalizer import time_to_bambara, format_time_bambara, normalize_times_in_text
from datetime import time

# Clock times
time_to_bambara(1, 0)       # => "Nɛgɛ kaɲɛ kelen"
time_to_bambara(1, 5)       # => "Nɛgɛ kaɲɛ kelen ni sanga duuru"
time_to_bambara(7, 30)      # => "Nɛgɛ kaɲɛ wolonwula  ni sanga bi saba"
time_to_bambara(13, 50)     # => "Nɛgɛ kaɲɛ tan ni saba ni sanga bi duuru"

# From time object or string
format_time_bambara(time(7, 30))  # => "Nɛgɛ kaɲɛ wolonwula ni sanga bi saba"
format_time_bambara("13:50")      # => "Nɛgɛ kaɲɛ tan ni sab ni sanga bi duuru"

# In text
normalize_times_in_text("A nana 7:30 la")  # => "A nana nɛgɛ kaɲɛ wolonwula ni sanga bi saba la"
```

### Bambara to Clock Time (Inverse Text Normalization)
```python
from bambara_normalizer import bambara_to_time

bambara_to_time("Nɛgɛ kaɲɛ kelen")
# => datetime.time(1, 0)

bambara_to_time("Nɛgɛ kaɲɛ wolonwula ni sanga bi saba")
# => datetime.time(7, 30)
```

### Duration to Bambara
```python
from bambara_normalizer import duration_to_bambara, format_duration_bambara

# Durations
duration_to_bambara(minutes=30)                      # => "miniti bi saba"
duration_to_bambara(hours=1, minutes=30)             # => "lɛrɛ kelen ni miniti bi saba"
duration_to_bambara(hours=1, minutes=30, seconds=10) # => "lɛrɛ kelen ni miniti bi saba ni segɔni tan"

# From string format
format_duration_bambara("30min")      # => "miniti bi saba"
format_duration_bambara("1h30min")    # => "lɛrɛ kelen ni miniti bi saba"
format_duration_bambara("1h30min10s") # => "lɛrɛ kelen ni miniti bi saba ni segɔni tan"
```

### Bambara to Duration (Inverse Text Normalization)
```python
from bambara_normalizer import bambara_to_duration

bambara_to_duration("miniti bi saba")
# => (0, 30, 0)  # (hours, minutes, seconds)

bambara_to_duration("lɛrɛ kelen ni miniti bi saba")
# => (1, 30, 0)

bambara_to_duration("lɛrɛ kelen ni miniti bi saba ni segɔni tan")
# => (1, 30, 10)
```

### Time Format

Clock time follows this structure:
```
Nɛgɛ kaɲɛ [hour] ( ni sanga [minutes])
```

Example: **7:30** => `Nɛgɛ kaɲɛ wolonwula ni sanga bi saba`

Literal translation: "Clock needle seven passed with minute thirty"

Duration follows this structure:
```
(lɛrɛ [hours] ni) (miniti [minutes] ni) (segɔni [seconds])
```

Example: **1h30min10s** => `lɛrɛ kelen ni miniti bi saba ni segɔni tan`


## ASR Evaluation Framework

### Quick Evaluation

```python
from bambara_normalizer import evaluate


result = evaluate(
    reference="B'a fɔ ka taa",
    hypothesis="bɛ a fɔ ka taa"

)

print(f"WER: {result.wer:.2%}")  
print(f"CER: {result.cer:.2%}")
print(f"MER: {result.mer:.2%}")
```

### Evaluator with Mode Selection

> [!IMPORTANT]
> The `mode` parameter determines how contractions are handled during evaluation. This significantly impacts WER scores when reference and hypothesis use different orthographic conventions.

```python
from bambara_normalizer import evaluate


result = evaluate(
    reference="k'a ta", 
    hypothesis="ka a ta"
    mode="expand" # contract | preserve 
    )
print(f"WER: {result.wer:.2%}")   
```

### Flexible Configuration

For full control use the evalution class and define the normalization configuration:

```python
from bambara_normalizer import (
    BambaraNormalizer, 
    BambaraNormalizerConfig, 
    BambaraEvaluator
)

# Define custom normalizer: same then the config we did upside
config = BambaraNormalizerConfig(
    contraction_mode="contract",
    preserve_tones=False,
    lowercase=True,
    remove_punctuation=True,
    normalize_legacy_orthography=True,
)


evaluator = BambaraEvaluator(config=config)


result = evaluator.evaluate(
    reference="K'a fɔ́!",
    hypothesis="ka a fo"
)
print(f"WER: {result.wer:.2%}")
```

### Batch Evaluation

```python
from bambara_normalizer import BambaraEvaluator

evaluator = BambaraEvaluator(mode="contract")

references = ["k'a ta", "b'a fɔ", "n'a ma"]
hypotheses = ["ka a ta", "bɛ a fɔ", "na a ma"]

aggregate, individual = evaluator.evaluate_batch(references, hypotheses)

print(f"Overall WER: {aggregate.wer:.2%}")
for i, result in enumerate(individual):
    print(f"  [{i}] WER: {result.wer:.2%}")
```

### Available Metrics

| Metric | Method | Description |
|--------|--------|-------------|
| **WER** | `evaluator.wer(ref, hyp)` | Word Error Rate |
| **CER** | `evaluator.cer(ref, hyp)` | Character Error Rate |
| **MER** | `evaluator.mer(ref, hyp)` | Match Error Rate |
| **WIL** | `evaluator.wil(ref, hyp)` | Word Information Lost |
| **WIP** | `evaluator.wip(ref, hyp)` | Word Information Preserved |
| **DER** | `result.der` | Diacritic Error Rate (tone accuracy) |

---

## Contraction Modes

> [!WARNING]
> Choosing the right mode is critical for fair ASR evaluation. Using the wrong mode can inflate or deflate WER scores artificially.

Version 2.0 introduces **three contraction modes** to handle bidirectional Bambara orthography:

| Mode | Direction | When to Use |
|------|-----------|-------------|
| `expand` | `b'a` => `bɛ a` | Default. Full linguistic analysis with k'/n' disambiguation |
| `contract` | `bɛ a` => `b'a` | Simpler, more forgiving. No disambiguation ambiguity |
| `preserve` | No change | Debugging, or when you want raw comparison |

### Why Contract Mode Matters

**Expansion is complex**  the contraction `k'` can expand to three different words:

| Contraction | Possible Expansions | Meaning |
|-------------|---------------------|---------|
| `k'a` | `ka a` | infinitive marker |
| `k'a` | `kɛ a` | verb "to do" |
| `k'a` | `ko a` | verb "to say" |

The normalizer uses context to disambiguate, but some cases are genuinely ambiguous.

**Contraction is simple**  all variants collapse to the same form:

```
ka a  ─┐
kɛ a  ─┼─>  k'a
ko a  ─┘
```

> [!TIP]
> For ASR evaluation, **contract mode is more forgiving** because it doesn't penalize the model for disambiguation differences when both forms are linguistically valid.

### Contraction Mappings

| Expanded | Contracted | Function |
|----------|------------|----------|
| `bɛ` + vowel | `b'` | Affirmative imperfective |
| `tɛ` + vowel | `t'` | Negative imperfective |
| `ye` + vowel | `y'` | Perfective marker |
| `ni` + vowel | `n'` | Conjunction |
| `na` + vowel | `n'` | Verb "come" |
| `ka` + vowel | `k'` | Infinitive marker |
| `kɛ` + vowel | `k'` | Verb "to do" |
| `ko` + vowel | `k'` | Verb "to say" |

---

## Command Line Interface

### Basic Usage

```bash

# default mode is expand
bambara-normalize "B'a fɔ́"
# Output: bɛ a fɔ

# Contract mode
bambara-normalize --mode contract "bɛ a fɔ"
# Output: b'a fɔ

# Preserve mode
bambara-normalize --mode preserve "B'a fɔ"
# Output: b'a fɔ
```

### With Presets

```bash
# WER preset (aggressive normalization)
bambara-normalize --preset wer "K'a fɔ́!"
# Output: ka a fɔ

# WER preset with contract mode
bambara-normalize --preset wer --mode contract "Ka a fɔ"
# Output: k'a fɔ

# CER preset
bambara-normalize --preset cer "B'a fɔ"
```

### File Evaluation

```bash
# Evaluate reference vs hypothesis files
bambara-normalize --evaluate reference.txt hypothesis.txt

# With contract mode
bambara-normalize --evaluate --mode contract ref.txt hyp.txt

# Output detailed metrics
bambara-normalize --evaluate --detailed ref.txt hyp.txt
```

### Batch Processing

```bash
# Process file line by line
bambara-normalize --input corpus.txt --output normalized.txt

# With specific mode
bambara-normalize --input corpus.txt --output normalized.txt --mode contract
```

---

## Linguistic Decisions

### Why Normalize?

Bambara orthography allows variation. For the same spoken utterance:
- Annotator A writes: `k'a ta`
- Annotator B writes: `ka a ta`

**Both are correct.** Without normalization, we penalize models for human writing inconsistencies, not recognition errors.

### n' Disambiguation

| Pattern | Expansion | Meaning |
|---------|-----------|---------|
| `n' + pronoun + ma` | `na` | Verb "to come" |
| `n' + other` | `ni` | Conjunction (default) |

**Examples:**
- `n'a ma` => `na a ma` (come to him)
- `n'a ta` => `ni a ta` (if he takes)

### k' Disambiguation Rules

Applied in priority order (derived from [Daba grammar](https://github.com/maslinych/daba)):

| Priority | Pattern | Result | Example |
|----------|---------|--------|---------|
| 1 | `k' + pronoun + ma + X + ye` | `kɛ` | `k'a ma hɛrɛ ye` => `kɛ a ma hɛrɛ ye` |
| 2 | `k' + pronoun + ma +` speech marker | `ko` | `k'anw ma ko` => `ko anw ma ko` |
| 3 | `k' + pronoun +` postposition | `kɛ` | `k'a la` => `kɛ a la` |
| 4 | `k' + pronoun +` clause marker | `ko` | `k'an ka ta` => `ko an ka ta` |
| 5 | Default | `ka` | `k'a ta` => `ka a ta` |

**Postpositions:** `la`, `na`, `ye`, `fɛ`, `kɔnɔ`, `kɔ`, `kɔrɔ`, `kan`, `kun`, `ɲɛ`, `bolo`

**Clause markers:** `ka`, `kana`, `bɛ`, `tɛ`, `bɛna`, `tɛna`, `tun`, `mana`

### Legacy Orthography Conversion

| Legacy | Modern | Notes |
|--------|--------|-------|
| `è` | `ɛ` | Pre-standard spelling |
| `ò` | `ɔ` | Pre-standard spelling |
| `ny` | `ɲ` | Digraph => single character |
| `ng` | `ŋ` | Digraph => single character |
| `ñ` | `ɲ` | Spanish/Senegalese variant |

---

## Known Limitations

### Inherent Linguistic Ambiguity

> [!CAUTION]
> Some Bambara constructions are **genuinely ambiguous** and cannot be resolved without broader context. This is not a bug  it reflects real ambiguity in the language.

#### The `ye` Problem

The word `ye` has **five** grammatical functions:

| Function | Example | Meaning |
|----------|---------|---------|
| Postposition | `à fɔ́ ń yé` | say it **to** me |
| Perfective | `ù ye ɲɔ̀ gòsi` | they **have** beaten |
| Copula | `ò yé kɔ̀nɔ yé` | it **is** a bird |
| Verb "see" | `ka a ye` | **to see** it |
| Imperative | `á' yé nà!` | come! (plural) |

This creates genuine ambiguity for `k'a ye`:

| Interpretation | Expansion | Meaning |
|----------------|-----------|---------|
| Postposition | `kɛ a ye` | do it for him |
| Verb "see" | `ka a ye` | to see it |

**Default behavior:** The normalizer chooses `kɛ a ye` (postposition is more frequent).

**Solution:** Use `mode="contract"` for ASR evaluation to avoid disambiguation penalties:

```python
evaluator = BambaraEvaluator(mode="contract")
# Both "kɛ a ye" and "ka a ye" => "k'a ye" 
```

### Scope

The normalizer uses **local context** (1-3 word lookahead). It does not:
- Parse full sentence structure
- Use dictionary/lexicon for POS tagging
- Consider discourse-level context

---

## Utility Functions

```python
from bambara_normalizer import (
    is_contraction,
    can_contract,
    find_contractions,
    find_contractable_sequences,
    compare_normalization_modes,
    analyze_text,
    is_bambara_vowel,
    get_tone,
    remove_tones,
    number_to_bambara,
    bambara_to_number,
    normalize_numbers_in_text,
    denormalize_numbers_in_text,
    is_number_word,
    
    bambara_to_date,
    bambara_to_day_of_week,
    bambara_to_month,
    date_to_bambara,
    day_of_week_to_bambara,
    denormalize_dates_in_text,
    format_date_bambara,
    is_bambara_day,
    is_bambara_month,
    month_to_bambara,
    normalize_dates_in_text,

    time_to_bambara,
    bambara_to_time,
    format_time_bambara,
    duration_to_bambara,
    bambara_to_duration,
    format_duration_bambara,
    normalize_times_in_text,
    is_time_word,
)


is_contraction("b'a")                   
is_contraction("bɛ")                     
can_contract("bɛ a")                      

# Find patterns in text
find_contractions("B'a fɔ k'a ta")       # ["b'", "k'"]
find_contractable_sequences("bɛ a fɔ")   # [('bɛ', 'a')]

# Compare modes side-by-side
compare_normalization_modes("b'a fɔ")
# {'original': "b'a fɔ", 'expand': 'bɛ a fɔ', 'contract': "b'a fɔ", 'preserve': "b'a fɔ"}

# Full text analysis
analyze_text("B'a fɔ k'a la")
# {'word_count': 4, 'contractions_found': ["b'", "k'"], 'has_tone_marks': False, ...}

# Tone handling
get_tone("fɔ́")                           # "high"
remove_tones("fɔ́ bɛ̀")                    # "fɔ bɛ"

# Number conversion: digits => Bambara words
number_to_bambara(5)                     # "duuru"
number_to_bambara(23)                    # "mugan ni saba"
number_to_bambara(100)                   # "kɛmɛ"
number_to_bambara(123)                   # "kɛmɛ ni mugan ni saba"
number_to_bambara(1000)                  # "waa kelen"
number_to_bambara(5.3)                   # "duuru tomi saba"

# Number conversion: Bambara words => digits
bambara_to_number("duuru")               # 5
bambara_to_number("mugan ni saba")       # 23
bambara_to_number("kɛmɛ")                # 100
bambara_to_number("waa kelen")           # 1000
bambara_to_number("duuru tomi saba")     # 5.3

# Number normalization in text
normalize_numbers_in_text("A ye 5 di")       # "A ye duuru  di"
normalize_numbers_in_text("Mɔgɔ 100 nana")        # "Mɔgɔ kɛmɛ nana"
normalize_numbers_in_text("A be san 25 bɔ")       # "A be san mugan ni duuru bɔ"

# Inverse: Bambara words => digits in text
denormalize_numbers_in_text("A ye duuru di")  # "A ye 5  di"
denormalize_numbers_in_text("Mɔgɔ kɛmɛ nana")      # "Mɔgɔ 100 nana"

# Check if word is a number word
is_number_word("duuru")                  # True
is_number_word("kɛmɛ")                   # True
is_number_word("fɔ")                     # False


# Date conversion: dates => Bambara
date_to_bambara(2024, 10, 13)            # "Oktɔburu tile tan ni saba san baa fila ni mugan ni naani"
format_date_bambara("13-10-2024")        # Same as above

# Date conversion: Bambara => dates
bambara_to_date("Oktɔburu tile tan ni saba san baa fila ni mugan ni naani")  # datetime.date(2024, 10, 13)

# Day/Month helpers
day_of_week_to_bambara(0)                # "Tɛnɛn" (Monday)
day_of_week_to_bambara(6)                # "Kari" (Sunday)
month_to_bambara(10)                     # "Oktɔburu"
bambara_to_month("Oktɔburu")             # 10

# Date normalization in text
normalize_dates_in_text("A bɛ na 13-10-2024 la")  # "A bɛ na Oktɔburu tile ... la"

# Check if word is date-related
is_bambara_month("Oktɔburu")             # True
is_bambara_day("Juma")                   # True


# Time conversion: clock times → Bambara
time_to_bambara(1, 0)                    # "Nɛgɛ kaɲɛ kelen"
time_to_bambara(7, 30)                   # "Nɛgɛ kaɲɛ wolonwula ni sanga bi saba"
format_time_bambara("13:50")             # "Nɛgɛ kaɲɛ tan ni saba ni sanga bi duuru"

# Time conversion: Bambara → clock times
bambara_to_time("Nɛgɛ kaɲɛ wolonwula ni sanga bi saba")  # datetime.time(7, 30)

# Duration conversion: durations → Bambara
duration_to_bambara(minutes=30)          # "miniti bi saba"
duration_to_bambara(hours=1, minutes=30) # "lɛrɛ kelen ni miniti bi saba"
format_duration_bambara("1h30min10s")    # "lɛrɛ kelen ni miniti bi saba ni segɔni tan"

# Duration conversion: Bambara → durations
bambara_to_duration("lɛrɛ kelen ni miniti bi saba")  # (1, 30, 0)

# Time normalization in text
normalize_times_in_text("A nana 7:30 la")  # "A nana Nɛgɛ kaɲɛ wolonwula ... la"

# Check if word is time-related
is_time_word("lɛrɛ")                      # True
is_time_word("miniti")                    # True
is_time_word("segɔni")                    # True
```


---

## Evaluation Metrics

| Metric | Description | Range |
|--------|-------------|-------|
| **WER** | Word Error Rate | 0.0 – ∞ |
| **CER** | Character Error Rate | 0.0 – ∞ |
| **MER** | Match Error Rate | 0.0 – 1.0 |
| **WIL** | Word Information Lost | 0.0 – 1.0 |
| **WIP** | Word Information Preserved | 0.0 – 1.0 |
| **DER** | Diacritic Error Rate (tone accuracy) | 0.0 – ∞ |

---



## References

#### Linguistic Resources
- [Bambara Reference Corpus](http://cormand.huma-num.fr/)  Primary corpus
- [Daba Morphological Analyzer](https://github.com/maslinych/daba)  Grammar rules
- [Bamadaba Dictionary](http://cormand.huma-num.fr/bamadaba.html)  Lexical database
- DNAFLA / AMALAN  Bambara standardization body

#### Standards
- UNESCO Bamako Meeting (1966)
- Niamey African Reference Alphabet (1978)

#### Tools
- [jiwer](https://github.com/jitsi/jiwer)  ASR evaluation metrics

#### Related Work
- [NVIDIA NeMo Text Normalization](https://github.com/NVIDIA/NeMo)
- [NeMo TN/ITN Documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/wfst_text_normalization.html)

---

<p align="center">
    <a href="https://github.com/MALIBA-AI">MALIBA-AI</a>
</p>
