Metadata-Version: 2.4
Name: cadar
Version: 0.1.13
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Rust
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Operating System :: OS Independent
License-File: LICENSE
Summary: Canonicalization and Darija Representation - Bidirectional transliteration for Darija
Keywords: darija,arabic,transliteration,moroccan,nlp,darija-moroccan,romanization
Author: Ouail LAAMIRI
License: Apache-2.0
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Bug Tracker, https://github.com/Oit-Technologies/CaDaR/issues
Project-URL: Changelog, https://github.com/Oit-Technologies/CaDaR/blob/main/CHANGELOG.md
Project-URL: Documentation, https://oit-technologies.github.io/CaDaR/
Project-URL: Homepage, https://github.com/Oit-Technologies/CaDaR
Project-URL: Repository, https://github.com/Oit-Technologies/CaDaR

# CaDaR: Canonicalization and Darija Representation

<div align="center">

**High-performance bidirectional transliteration for Darija (Moroccan Arabic)**

[![License: Apache-2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![Rust](https://img.shields.io/badge/rust-stable-orange.svg)](https://www.rust-lang.org/)

</div>

## Overview

CaDaR is a transliteration library in the style of a finite-state transducer (FST), designed specifically for Darija (Moroccan Arabic). It converts in both directions between Arabic script and Latin (Romanized/Bizi) script, and it can also standardize text within either script.

### Key Features

- **Bidirectional Transliteration**: Convert seamlessly between Arabic and Latin scripts
- **Intelligent Normalization**: Handles noise, diacritics, and common variations
- **Darija-Aware Processing**: Respects Darija-specific linguistic patterns
- **Intermediate Canonical Representation (ICR)**: Unified internal representation for accurate conversion
- **High Performance**: Written in Rust with Python bindings for optimal speed
- **Extensible**: Designed to support multiple Darija dialects (currently Moroccan)

## Architecture

CaDaR uses a 6-stage pipeline with an Intermediate Canonical Representation (ICR):

```
Raw Input
   ↓
Stage 1: Script Detection
   ↓
Stage 2: Noise Cleaning & Normalization
   ↓
Stage 3: Tokenization (Darija-aware)
   ↓
Stage 4: Canonical Darija Representation (ICR)
   ↓
Stage 5: Target Script Generation
   ↓
Stage 6: Post-validation & Fixes
   ↓
Clean Standard Output
```
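
To make the stage boundaries concrete, here is a minimal Python sketch that chains six toy stages for the Arabic-to-Latin direction. Every function name and the four-letter mapping tables are illustrative assumptions; they are not CaDaR's Rust internals or its actual rule set.

```python
import re
import unicodedata
from typing import List

# Toy tables covering four letters only (assumption, not CaDaR's rules).
ARABIC_TO_ICR = {"س": "S", "ل": "L", "ا": "A", "م": "M"}
ICR_TO_LATIN = {"S": "s", "L": "l", "A": "a", "M": "m"}

def detect_script(text: str) -> str:
    """Stage 1: compare counts of Arabic-block and ASCII letters."""
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return "arabic" if arabic > latin else "latin"

def clean(text: str) -> str:
    """Stage 2: drop combining marks and tatweel, squeeze runs of 3+."""
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.replace("\u0640", "")
    return re.sub(r"(.)\1{2,}", r"\1", text)

def tokenize(text: str) -> List[str]:
    """Stage 3: whitespace split (a real tokenizer would handle clitics)."""
    return text.split()

def to_icr(token: str) -> List[str]:
    """Stage 4: map each known letter to a script-neutral symbol."""
    return [ARABIC_TO_ICR[ch] for ch in token if ch in ARABIC_TO_ICR]

def generate_latin(icr: List[str]) -> str:
    """Stage 5: render the ICR symbols in the target script."""
    return "".join(ICR_TO_LATIN[sym] for sym in icr)

def validate(text: str) -> str:
    """Stage 6: final cleanup; here it only trims whitespace."""
    return text.strip()

def arabic_to_latin(raw: str) -> str:
    """Run all six toy stages in order."""
    if detect_script(raw) != "arabic":
        raise ValueError("this toy pipeline only accepts Arabic input")
    tokens = tokenize(clean(raw))
    return validate(" ".join(generate_latin(to_icr(tok)) for tok in tokens))

print(arabic_to_latin("سَلام"))  # prints "slam"
```

Keeping each stage a plain function of its input means any single stage can be swapped or tested in isolation, which is the main practical benefit of routing everything through one intermediate form.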

### What is ICR?

The **Intermediate Canonical Representation (ICR)** is the core innovation of CaDaR. It's a script-independent, phonologically-grounded representation that:

- Abstracts away script-specific quirks
- Preserves Darija phonological distinctions
- Enables lossless round-trip conversions
- Allows for consistent normalization across scripts
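
The round-trip property listed above can be checked empirically. The helper below is a small sketch under stated assumptions: `round_trip_failures` is a hypothetical name, and the two toy dictionaries are made-up stand-ins so the snippet runs without CaDaR installed.

```python
from typing import Callable, Iterable, List, Tuple

def round_trip_failures(
    words: Iterable[str],
    forward: Callable[[str], str],
    backward: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (word, intermediate, restored) for each word that is not restored."""
    failures = []
    for word in words:
        intermediate = forward(word)
        restored = backward(intermediate)
        if restored != word:
            failures.append((word, intermediate, restored))
    return failures

# Stand-in converters (assumption): a tiny invertible lookup table.
TOY_FORWARD = {"سلام": "slam", "بزاف": "bzaf"}
TOY_BACKWARD = {latin: arabic for arabic, latin in TOY_FORWARD.items()}

failures = round_trip_failures(
    TOY_FORWARD,
    lambda w: TOY_FORWARD[w],
    lambda w: TOY_BACKWARD[w],
)
print(failures)  # prints []
```

With the library installed, the same helper could be called with `lambda w: cadar.ara2bizi(w, darija="Ma")` and `lambda w: cadar.bizi2ara(w, darija="Ma")` as the two converters, using a word list of your own.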

## Installation

### From PyPI

```bash
pip install cadar
```

### From Source

#### Prerequisites

- Python 3.8 or higher
- Rust toolchain (`rustc` and `cargo`)
- Maturin (builds the Rust extension and packages it for Python)

#### Build and Install

```bash
# Clone repository
git clone https://github.com/Oit-Technologies/CaDaR.git
cd CaDaR

# Install Maturin
pip install maturin

# Build and install into the active virtual environment (development mode)
maturin develop

# Or build a wheel for distribution
maturin build --release
```

## Usage

### Python API

CaDaR provides four main functions. Each takes the input text and a `darija` dialect code; the examples below pass `"Ma"`.

#### 1. `ara2bizi()` - Arabic to Latin/Bizi

```python
import cadar

result = cadar.ara2bizi("سلام", darija="Ma")
print(result)
# Output: "slam"
```

#### 2. `bizi2ara()` - Latin/Bizi to Arabic

```python
import cadar

result = cadar.bizi2ara("salam", darija="Ma")
print(result)
# Output: "سلام"
```

#### 3. `ara2ara()` - Arabic Standardization

```python
import cadar

result = cadar.ara2ara("أنَا", darija="Ma")
print(result)
# Output: "انا"
```

#### 4. `bizi2bizi()` - Latin/Bizi Standardization

```python
import cadar

result = cadar.bizi2bizi("salaaaam", darija="Ma")
print(result)
# Output: "salam"
```
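
For intuition about what standardization involves, the two documented outputs above can be reproduced with a few lines of plain Python. This is only an approximation: `toy_ara2ara`, `toy_bizi2bizi`, and the folding table are hypothetical, and CaDaR's real rules are more extensive and may behave differently on other inputs.

```python
import re
import unicodedata

# Hypothetical folding table: hamza-carrying alef variants to bare alef.
ALEF_VARIANTS = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا"})

def toy_ara2ara(text: str) -> str:
    """Strip combining marks (harakat) and fold alef variants."""
    text = unicodedata.normalize("NFC", text)
    stripped = "".join(ch for ch in text if not unicodedata.combining(ch))
    return stripped.translate(ALEF_VARIANTS)

def toy_bizi2bizi(text: str) -> str:
    """Lowercase and collapse runs of three or more identical characters."""
    return re.sub(r"(.)\1{2,}", r"\1", text.lower())

print(toy_ara2ara("أنَا"))          # prints "انا"
print(toy_bizi2bizi("salaaaam"))   # prints "salam"
```

Collapsing only runs of three or more leaves doubled letters untouched, so a geminate spelled with two consonants is not flattened by the elongation rule.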

### Convenience Functions

CaDaR also provides convenience functions that auto-detect the script of the input:

```python
import cadar

# Auto-detect and transliterate
text = "سلام"  # Arabic
result = cadar.transliterate(text, target="latin", darija="Ma")
print(result)  # "slam"

# Auto-detect and standardize
text = "salaaaam"
result = cadar.standardize(text, script="auto", darija="Ma")
print(result)  # "salam"
```

### CaDaR Class

To run many operations with the same dialect setting, create a `CaDaR` instance once and reuse it:

```python
import cadar

# Create processor
processor = cadar.CaDaR(darija="Ma")

# Use for multiple operations
texts = ["سلام", "كيفاش", "بزاف"]
for text in texts:
    result = processor.ara2bizi(text)
    print(f"{text} → {result}")
```
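
When a corpus repeats the same words many times, it can help to convert each distinct string only once. The sketch below assumes nothing about the processor beyond the `ara2bizi(text)` method shown above; `transliterate_unique` and `EchoProcessor` are hypothetical names, and the stand-in class exists only so the snippet runs without CaDaR installed.

```python
from typing import Dict, Iterable, Protocol

class ArabicToLatin(Protocol):
    """Anything exposing ara2bizi(text) -> str, such as a processor instance."""
    def ara2bizi(self, text: str) -> str: ...

def transliterate_unique(processor: ArabicToLatin, texts: Iterable[str]) -> Dict[str, str]:
    """Convert each distinct text once and return {source: result}."""
    results: Dict[str, str] = {}
    for text in texts:
        if text not in results:
            results[text] = processor.ara2bizi(text)
    return results

class EchoProcessor:
    """Stand-in processor (assumption): wraps the input and counts calls."""
    def __init__(self) -> None:
        self.calls = 0

    def ara2bizi(self, text: str) -> str:
        self.calls += 1
        return f"<{text}>"

stub = EchoProcessor()
table = transliterate_unique(stub, ["سلام", "بزاف", "سلام"])
print(len(table), stub.calls)  # prints "2 2"
```

In real use, the `processor` created with `cadar.CaDaR(darija="Ma")` would take the place of the stub.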

## Version History

See [CHANGELOG.md](https://github.com/Oit-Technologies/CaDaR/blob/main/CHANGELOG.md) for details.

## License

CaDaR is licensed under the [Apache License 2.0](LICENSE).

## Contributing

Contributions are welcome! Please see [CONTRIBUTING.md](https://github.com/Oit-Technologies/CaDaR/blob/main/CONTRIBUTING.md) for guidelines.

## Links

- [Documentation](https://oit-technologies.github.io/CaDaR/)
- [GitHub Repository](https://github.com/Oit-Technologies/CaDaR)
- [PyPI Package](https://pypi.org/project/cadar/)
- [Issue Tracker](https://github.com/Oit-Technologies/CaDaR/issues)
