Metadata-Version: 2.1
Name: anltk
Version: 1.0.7
Summary: Arabic Natural Language Toolkit (ANLTK)
Keywords: NLP,Arabic,python,arabic,c++
Author-Email: Abdullah Alattar <abdullah.mohammad.alattar@gmail.com>
License: Boost Software License 1.0
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Software Development
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Natural Language :: Arabic
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS
Project-URL: Homepage, https://github.com/Abdullah-AlAttar/anltk
Project-URL: Source, https://github.com/Abdullah-AlAttar/anltk
Project-URL: Bug Tracker, https://github.com/Abdullah-AlAttar/anltk/issues
Requires-Python: >=3.7
Description-Content-Type: text/markdown

![example workflow](https://github.com/Abdullah-AlAttar/anltk/actions/workflows/c-cpp.yml/badge.svg)
![example workflow](https://github.com/Abdullah-AlAttar/anltk/actions/workflows/wheels.yml/badge.svg)
[![PyPI version](https://badge.fury.io/py/anltk.svg)](https://badge.fury.io/py/anltk)
[![License](https://img.shields.io/badge/License-Boost_1.0-lightblue.svg)](https://www.boost.org/LICENSE_1_0.txt)
[![Downloads](https://static.pepy.tech/personalized-badge/anltk?period=total&units=international_system&left_color=blue&right_color=orange&left_text=Downloads)](https://pepy.tech/project/anltk)

# Arabic Natural Language Toolkit (ANLTK)

ANLTK is a set of Arabic natural language processing tools, developed with a focus on simplicity and performance.

## ANLTK is a C++ library with Python bindings

## Installation

For Python:

```
pip install anltk
```

## Building

Note: Currently tested only on Linux. Prebuilt Python wheels are available for Linux, Windows, and macOS on [PyPI](https://pypi.org/project/anltk/).

### Dependencies

* [utfcpp](https://github.com/nemtrif/utfcpp.git), automatically downloaded.
* [utf8proc](https://github.com/JuliaStrings/utf8proc), automatically downloaded.
* A C++ compiler that supports C++17.
* Python 3, [meson](https://mesonbuild.com/), [ninja](https://ninja-build.org/)
* [Task](https://taskfile.dev/) (optional, for simplified build commands)

### Building C++ Library

```bash
git clone https://github.com/Abdullah-AlAttar/anltk.git
cd anltk/

# Using taskfile (recommended)
task configure
task build
task test

# Or manually with meson
meson build --buildtype=release -Dbuild_tests=false
cd build
ninja
```

### Building Python Bindings

```bash
# Complete setup (creates venv, installs deps, builds package)
task py:setup

# Or step by step:
task py:venv              # Create virtual environment
task py:deps              # Install build dependencies
task py:install           # Install in development mode

# Test the installation
task py:test              # Run quick tests

# Build wheel for distribution
task py:wheel             # Build wheel package

# Clean build artifacts
task clean                # Clean all build artifacts
```

### Manual Python Build (without taskfile)

```bash
python3 -m venv .venv
.venv/bin/pip install --upgrade pip meson-python build pybind11 ninja patchelf
.venv/bin/pip install -e .
```

## Usage Examples

### C++ API

```c++
#include "anltk/anltk.hpp"
#include <iostream>
#include <string>

int main()
{

    std::string ar_text = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ";

    std::cout << anltk::transliterate(ar_text, anltk::CharMapping::AR2BW) << '\n';
    // >bjd hwz HTy klmn sEfS qr$t vx* DZg

    std::string text = "فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ.";

    std::cout << anltk::remove_tashkeel(text) << '\n';
    // فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.

    // The second parameter is a stop list; characters in this list won't be removed
    std::cout << anltk::remove_non_alpha(text, " ") << '\n';
    // فراشة ملونة تطير في البستان حلوة مهندمة تدهش الإنسان

    anltk::TafqitOptions opts;
    std::cout << anltk::tafqit(15000120, opts) << '\n';
    // خمسة عشر مليونًا ومائة وعشرون
}

```

### Python API

```python
import anltk


ar = "أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ"
bw = anltk.transliterate(ar, anltk.AR2BW)
print(bw)
# >bjd hwz HTy klmn sEfS qr$t vx* DZg

print(anltk.remove_tashkeel("فَرَاشَةٌ مُلَوَّنَةٌ تَطِيْرُ في البُسْتَانِ، حُلْوَةٌ مُهَنْدَمَةٌ تُدْهِشُ الإِنْسَانَ."))

# فراشة ملونة تطير في البستان، حلوة مهندمة تدهش الإنسان.

print(anltk.tafqit(15000120))
# خمسة عشر مليونًا ومائة وعشرون
```

**For a list of features, see [Features.md](Features.md).**

## Benchmarks

The benchmark processes a file containing 500,000 lines, 6,787,731 words, and 112,704,541 characters. The tasks are removing diacritics and transliterating to Buckwalter.

### **Buckwalter transliteration**

| Method           | Time          |
|------------------|---------------|
| anltk python-api | 1.379 seconds |
| python [camel_tools](https://github.com/CAMeL-Lab/camel_tools) | 11.46 seconds |

### **Remove Diacritics**

| Method           | Time          |
|------------------|---------------|
| anltk python-api | 0.989 seconds |
| python [camel_tools](https://github.com/CAMeL-Lab/camel_tools) | 4.892 seconds |
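
The measurement setup behind these numbers is not described here, so the following is only a sketch of how a comparable line-by-line timing could be run. The helper names `strip_tashkeel_py` and `time_over_lines` are hypothetical, the pure-Python baseline is a stand-in that removes only the marks U+064B to U+0652, and the sample data is synthetic rather than the benchmark file.

```python
import time

# Assumption: treat U+064B..U+0652 (fathatan .. sukun) as the diacritics to remove.
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_tashkeel_py(text: str) -> str:
    """Pure-Python stand-in that drops the diacritics listed in TASHKEEL."""
    return "".join(ch for ch in text if ch not in TASHKEEL)


def time_over_lines(fn, lines) -> float:
    """Return the wall-clock seconds spent calling fn on every line."""
    start = time.perf_counter()
    for line in lines:
        fn(line)
    return time.perf_counter() - start


# Synthetic input: one diacritized word repeated many times.
sample = ["\u0641\u064E\u0631\u064E\u0627\u0634\u064E\u0629\u064C"] * 10_000

elapsed = time_over_lines(strip_tashkeel_py, sample)
print(f"pure-Python baseline: {elapsed:.3f} seconds")
```

With ANLTK installed, passing `anltk.remove_tashkeel` as `fn` over the same lines gives a like-for-like comparison on your own data.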
