Metadata-Version: 2.4
Name: altatk
Version: 3.1.0
Summary: ALTA tokenizer for encoding and decoding Kinyarwanda language text
Home-page: https://github.com/Nschadrack/Kin-Tokenizer
Author: Yali Labs
Author-email: yalilabs24@gmail.com
License: MIT
Project-URL: Source Code, https://github.com/Nschadrack/Kin-Tokenizer
Keywords: tokenizer,kinyarwanda,bpe,byte-pair-encoding,nlp,ALTA Model,alta
Classifier: Development Status :: 4 - Beta
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Operating System :: OS Independent
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: regex>=2024.7.24
Requires-Dist: requests>=2.32.3
Requires-Dist: numpy>=1.24.0
Requires-Dist: alta-acceleration>=0.3.1
Provides-Extra: training
Requires-Dist: wandb; extra == "training"
Provides-Extra: dev
Requires-Dist: maturin<2.0,>=1.0; extra == "dev"
Requires-Dist: wandb; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# ALTA Tokenizer

[![PyPI version](https://img.shields.io/pypi/v/altatk)](https://pypi.org/project/altatk/)
[![Python](https://img.shields.io/pypi/pyversions/altatk)](https://pypi.org/project/altatk/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)

A Byte Pair Encoding (BPE) tokenizer built for **Kinyarwanda**, with a built-in Rust-accelerated backend for large-scale training and encoding. Developed by [Yali Labs](mailto:info@yalilabs.com).

`altatk` ships a pre-trained vocabulary (50,251 tokens) and can encode and decode out of the box. It also works with other languages (English, French, etc.), though with a lower compression rate, since the vocabulary was learned from Kinyarwanda text.

## Table of Contents

- [Key Features](#key-features)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [API Reference](#api-reference)
  - [KinTokenizer](#kintokenizer)
  - [Utility Functions](#utility-functions)
- [Training Your Own Tokenizer](#training-your-own-tokenizer)
  - [In-Memory Training (Python)](#in-memory-training-python)
  - [Streamed Rust Training (Large Datasets)](#streamed-rust-training-large-datasets)
  - [Retraining / Extending an Existing Tokenizer](#retraining--extending-an-existing-tokenizer)
- [Dataset Creation for LLM Training](#dataset-creation-for-llm-training)
- [CLI Reference](#cli-reference)
- [Evaluation & Comparison Tools](#evaluation--comparison-tools)
- [Vocabulary Design](#vocabulary-design)
- [Rust Extension](#rust-extension)
- [Compression Rate](#compression-rate)
- [Project Structure](#project-structure)
- [License](#license)

---

## Key Features

- **Pre-trained Kinyarwanda vocabulary** — 50,251 BPE tokens (256 byte tokens + ~49,989 learned merges + 6 special tokens), ready to use.
- **Encode & Decode** — Lossless roundtrip: `decode(encode(text)) == text`.
- **Two space strategies** — *Metaspace* (`▁` prefix, SentencePiece-style, v3.0 default) and *GPT-2* (space-as-prefix, legacy v2.0) with automatic detection from checkpoints.
- **Train custom tokenizers** — In-memory Python path for small corpora; Rust-streamed path for 100 MB+ datasets with memory-mapped I/O.
- **Rust-accelerated backend** (`kin_merge`) — Parallel BPE merge loops, file streaming, encoding, and preprocessing via PyO3 + Rayon. Installed by default; falls back to pure Python if unavailable on your platform.
- **LLM dataset pipeline** — Generate overlapping token sequences (`.npy` memmap) for language-model training.
- **HuggingFace export** — `export_huggingface()` writes `vocab.json`, `merges.txt`, and tokenizer config compatible with HuggingFace Transformers.
- **Evaluation suite** — Built-in scripts for compression ratio, roundtrip fidelity, speed benchmarks, and checkpoint-vs-production comparison.
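
To make the two space strategies listed above concrete, here is a minimal sketch of how each one could attach spaces to the following word before BPE runs. The function names and regular expressions are illustrative assumptions, not the patterns used inside `kin_tokenizer`, and the sketch does not add a leading `▁` to the first word the way some SentencePiece configurations do.

```python
import re

METASPACE = "\u2581"  # the "▁" marker


def metaspace_pieces(text: str) -> list[str]:
    """Hypothetical Metaspace-style split: spaces become "▁" prefixes."""
    marked = text.replace(" ", METASPACE)
    return re.findall(f"{METASPACE}*[^{METASPACE}]+|{METASPACE}+", marked)


def gpt2_pieces(text: str) -> list[str]:
    """Hypothetical GPT-2-style split: the literal space stays as a prefix."""
    return re.findall(r" *[^ ]+| +", text)


print(metaspace_pieces("Muraho neza"))  # ['Muraho', '▁neza']
print(gpt2_pieces("Muraho neza"))       # ['Muraho', ' neza']

# Both splits keep every character, so joining the pieces restores the input
# (assuming the input does not already contain the "▁" character).
restored = "".join(metaspace_pieces("Muraho neza")).replace(METASPACE, " ")
print(restored == "Muraho neza")        # True
```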

---

## Installation

### From PyPI

```bash
pip install altatk
```

This gives you the full tokenizer with encode/decode, the pre-trained checkpoint, **and** the Rust-accelerated backend (`alta-acceleration`). No Rust toolchain needed — pre-built binary wheels are provided for all major platforms.

### From Source (development)

```bash
git clone https://github.com/Nschadrack/Kin-Tokenizer.git
cd Kin-Tokenizer
pip install -e ".[dev]"

# Build the Rust extension locally
cd merger
maturin develop --release
```

> Building from source requires a [Rust toolchain](https://rustup.rs) (1.80+). See [Rust Extension](#rust-extension) for details.

---

## Quick Start

```python
from kin_tokenizer import KinTokenizer

tokenizer = KinTokenizer()  # auto-loads pre-trained checkpoint

# Encode
text = "Nagiye gusura abanyeshuri."
tokens = tokenizer.encode(text)
print(tokens)  # e.g. [1835, 7412, 3029, ...]

# Decode
decoded = tokenizer.decode(tokens)
print(decoded)  # "Nagiye gusura abanyeshuri."

# Compression rate
print(f"Compression rate: {len(text) / len(tokens):.2f}X")

# Vocabulary size
print(f"Vocab size: {tokenizer.vocab_size}")
```
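
The roundtrip property `decode(encode(text)) == text` can be spot-checked on your own samples. The sketch below is one way to do that; `find_roundtrip_failures` and `ByteStubTokenizer` are hypothetical names introduced here, and the stub exists only so the example runs without the package installed.

```python
from typing import Iterable, List, Protocol


class TokenizerLike(Protocol):
    """Anything exposing encode/decode the way the Quick Start uses them."""

    def encode(self, text: str) -> List[int]: ...
    def decode(self, token_ids: List[int]) -> str: ...


class ByteStubTokenizer:
    """Hypothetical stand-in: one token per UTF-8 byte, no merges."""

    def encode(self, text: str) -> List[int]:
        return list(text.encode("utf-8"))

    def decode(self, token_ids: List[int]) -> str:
        return bytes(token_ids).decode("utf-8")


def find_roundtrip_failures(tokenizer: TokenizerLike, samples: Iterable[str]) -> List[str]:
    """Return every sample for which decode(encode(text)) != text."""
    return [s for s in samples if tokenizer.decode(tokenizer.encode(s)) != s]


samples = ["Nagiye gusura abanyeshuri.", "Muraho neza!", "  two  spaces  "]
failures = find_roundtrip_failures(ByteStubTokenizer(), samples)
print(failures)  # an empty list means every sample survived the roundtrip
```

To check the real tokenizer, pass `KinTokenizer()` in place of the stub.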

---

## API Reference

### KinTokenizer

```python
from kin_tokenizer import KinTokenizer

tokenizer = KinTokenizer(autoload=True)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `autoload` | `bool` | `True` | Automatically load the pre-trained checkpoint |

#### Methods

| Method | Description |
|--------|-------------|
| `encode(text, lowercase=None, nbr_processes=None)` | BPE-encode text → list of token IDs |
| `decode(token_ids, return_eos=False)` | Decode token IDs → UTF-8 string |
| `train(text, vocab_size, verbose=True, ...)` | Full BPE training loop (in-memory) |
| `save(path, save_legacy_pickle=False)` | Save tokenizer to JSON checkpoint |
| `load(tokenizer_path, allow_pickle=False)` | Load checkpoint (JSON or legacy pickle) |
| `export_huggingface(output_dir)` | Export as HuggingFace `PreTrainedTokenizerFast` |
| `token_to_id(token)` | Look up token string → ID |
| `id_to_token(token_id)` | Look up ID → token string |

#### Properties

| Property | Type | Description |
|----------|------|-------------|
| `vocab` | `dict[int, str]` | Full vocabulary (ID → decoded token) |
| `vocab_size` | `int` | Total number of tokens |
| `merged_tokens` | `dict` | BPE merge rules `(left_id, right_id) → new_id` |
| `space_strategy` | `str` | `"metaspace"` or `"gpt2"` |

### Utility Functions

```python
from kin_tokenizer.utils import (
    train_kin_tokenizer,
    train_kin_tokenizer_streamed_from_file,
    create_dataset,
    preprocess_text,
)
```

| Function | Description |
|----------|-------------|
| `train_kin_tokenizer(text, vocab_size, ...)` | Train BPE tokenizer in-memory |
| `train_kin_tokenizer_streamed_from_file(dataset_path, ...)` | Train via Rust-streamed file I/O (memory-efficient) |
| `create_dataset(text_file_path, ...)` | Tokenize a text file and write sliding-window sequence files for LM training, split into disjoint train and eval sets |
| `preprocess_text(text, is_lowercase_text=False)` | Regex-based text cleaning (NFC normalization, URL removal, whitespace collapse) |

---

## Training Your Own Tokenizer

The vocabulary is initialized with 256 base tokens: `<|PAD|>` at ID 0 and the byte tokens at IDs 1–255. BPE merges are learned on top of that. The `vocab_size` you request is an upper bound; the actual size depends on your corpus.
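
As a rough illustration of why the requested size is only an upper bound, here is a toy BPE training loop. It is a sketch under simplifying assumptions, not the library's implementation: it uses raw byte values as base IDs (which differs from the ID layout described above), and it stops when no pair occurs at least twice. `toy_bpe_train` and `merge_pair` are hypothetical names.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def merge_pair(ids: List[int], pair: Pair, new_id: int) -> List[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def toy_bpe_train(text: str, vocab_size: int, base_size: int = 256) -> Dict[Pair, int]:
    """Learn merges until vocab_size is reached or no pair repeats."""
    ids = list(text.encode("utf-8"))  # toy layout: raw byte values as base IDs
    merges: Dict[Pair, int] = {}
    next_id = base_size
    while next_id < vocab_size:
        counts = Counter(zip(ids, ids[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # nothing left worth merging, so the vocab stays below target
        merges[pair] = next_id  # same shape as (left_id, right_id) -> new_id
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    return merges


merges = toy_bpe_train("abababab", vocab_size=300)
print(len(merges), 256 + len(merges))  # 2 258
```

The tiny corpus supports only two merges, so the resulting vocabulary has 258 entries even though 300 were requested.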

### In-Memory Training (Python)

Best for corpora that fit comfortably in RAM (roughly under 1 GB).

```python
from kin_tokenizer.utils import train_kin_tokenizer

tokenizer = train_kin_tokenizer(
    text,                     # full corpus as a string
    vocab_size=512,           # target vocabulary size
    save=True,                # save checkpoint after training
    tokenizer_path="./my_checkpoint",
    retrain=False,            # True to continue from existing checkpoint
    nbr_processes=8,          # parallel workers (None = auto)
    lowercase=False,          # case-fold during training
)

tokens = tokenizer.encode("Muraho neza!")
```

### Streamed Rust Training (Large Datasets)

For large corpora (100 MB+), the Rust-streamed path avoids loading the entire file into Python memory. Requires the [Rust extension](#rust-extension) (installed by default).

```bash
python training.py \
    --action train \
    --dataset_path /path/to/corpus.txt \
    --vocab_size 50251 \
    --streamed-rust \
    --stream-lines 200000 \
    --num_processes 8 \
    -d data/checkpoint
```

| Flag | Description |
|------|-------------|
| `--streamed-rust` | Enable Rust file streaming + pretokenization + batched merge loop |
| `--stream-lines N` | Lines per streaming batch (default: 200,000) |
| `--num_processes N` | Rayon thread count for the Rust backend |
| `--save-token-chunks` | Persist `current_token_chunks.bin` for resume support (increases disk usage) |

### Retraining / Extending an Existing Tokenizer

```bash
python training.py \
    --action retrain \
    -d data/checkpoint \
    --dataset_path /path/to/corpus.txt \
    --vocab_size 100000 \
    --num_processes 8
```

Retraining loads the existing checkpoint and continues the BPE merge loop from where it left off.

---

## Dataset Creation for LLM Training

Generate overlapping token sequences suitable for next-token-prediction training,
split into train and eval files. The train/eval split applies only to these
model-training sequences. It does not change tokenizer training, and you do not
need to retrain the tokenizer to use the sequence files.

```python
from kin_tokenizer.utils import create_dataset

create_dataset(
    text_file_path="data/dataset.txt",
    nbr_processes=8,
    sequence_length=512,     # tokens per sequence
    step_size=256,           # sliding-window stride (50% overlap)
    eval_ratio=0.15,         # hold out 15% of paragraphs for evaluation
    eval_seed=42,            # deterministic split
    destination_dir="data/sequences",
)
```

Or via CLI:

```bash
python training.py \
    --action create-dataset \
    --dataset_path data/alta_dataset.txt \
    --sequence_length 512 \
    --step_size 256 \
    --eval-ratio 0.15 \
    --eval-seed 42 \
    --output_dir data/sequences
```

**Outputs:**

- `train_sequences.npy` — the main training split.
- `eval_sequences.npy` — held-out evaluation split created from paragraphs not used in train.

Each file is a memory-mapped NumPy array of shape `(num_sequences, sequence_length + 1)`, where the last column is the prediction target.

Set `eval_ratio=0` or `--eval-ratio 0` to disable the eval split and produce only `train_sequences.npy`.
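
The sketch below shows one way the paragraph-level holdout and the sliding window could work together. It is an illustration under stated assumptions, not the code behind `create_dataset`: the helper names are hypothetical, and how the library handles leftover tokens or paragraph boundaries is not specified here.

```python
import random
from typing import List, Sequence, Tuple


def split_paragraphs(
    paragraphs: Sequence[str], eval_ratio: float, seed: int
) -> Tuple[List[str], List[str]]:
    """Deterministically hold out a fraction of paragraphs for evaluation."""
    order = list(range(len(paragraphs)))
    random.Random(seed).shuffle(order)  # same seed, same split
    n_eval = int(round(len(paragraphs) * eval_ratio))
    eval_idx = set(order[:n_eval])
    train = [p for i, p in enumerate(paragraphs) if i not in eval_idx]
    held_out = [p for i, p in enumerate(paragraphs) if i in eval_idx]
    return train, held_out


def sliding_windows(
    token_ids: Sequence[int], sequence_length: int, step_size: int
) -> List[List[int]]:
    """Cut rows of sequence_length + 1 tokens, advancing step_size each time."""
    width = sequence_length + 1
    return [
        list(token_ids[start:start + width])
        for start in range(0, len(token_ids) - width + 1, step_size)
    ]


rows = sliding_windows(list(range(10)), sequence_length=4, step_size=2)
print(rows)  # [[0, 1, 2, 3, 4], [2, 3, 4, 5, 6], [4, 5, 6, 7, 8]]

train, held_out = split_paragraphs(["a", "b", "c", "d"], eval_ratio=0.5, seed=42)
print(len(train), len(held_out))  # 2 2
```

In this toy version, tokens that do not fill a complete final window are dropped, and `eval_ratio=0` leaves every paragraph in the train split.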

Built-in sequence profiles (`--profile`):

| Profile | Sequence Length | Step Size |
|---------|----------------|-----------|
| `small` | 256 | 128 |
| `medium` (default) | 512 | 256 |
| `large` | 1024 | 512 |

---

## CLI Reference

```
python training.py --action ACTION [options]
```

| Argument | Description |
|----------|-------------|
| `-a, --action` | `train`, `retrain`, or `create-dataset` |
| `-d, --save_checkpoint_path` | Checkpoint directory (default: `data/checkpoint`) |
| `--dataset_path` | Path to dataset file (or URL with `--dataset_url`) |
| `--vocab_size` | Target vocabulary size (default: 50,251, minimum: 257) |
| `--num_processes` | Worker/thread count |
| `--streamed-rust` | Use Rust-streamed training path |
| `--stream-lines` | Lines per Rust streaming batch |
| `--save-token-chunks` | Persist binary token chunks for resume |
| `--lowercase / --preserve-case` | Force case-folding or preserve case |
| `--sequence_length` | Tokens per sequence for `create-dataset` |
| `--step_size` | Sliding-window stride for `create-dataset` |
| `--output_dir` | Output directory for sequences |
| `--profile` | Sequence preset: `small`, `medium`, `large` |
| `--allow-legacy-pickle` | Load legacy `.pkl` checkpoints (unsafe) |
| `-p, --project_name` | Weights & Biases project name |
| `-w, --wandb_run` | Weights & Biases run name |

---

## Evaluation & Comparison Tools

### Evaluate a Tokenizer

```bash
python evaluate_tokenizer.py [--checkpoint data/checkpoint] [--dataset data/alta_dataset.txt] [--with-production]
```

Measures: vocabulary stats, roundtrip fidelity, compression ratio, encoding speed, morpheme coverage, and space-handling edge cases.

### Compare Two Tokenizers

```bash
python compare_tokenizers.py [--checkpoint data/checkpoint] [--input "Muraho neza"] [--show-tokens]
```

Side-by-side comparison of production (PyPI) vs. checkpoint: token counts, compression ratio, timing, and vocabulary overlap.

### Interactive REPL

```bash
python read_tokenizer.py -c data/checkpoint
```

Commands: `encode <text>`, `decode <ids>`, `vocab <id>`, `find <substring>`, `info`, `exit`.

---

## Vocabulary Design

| Range | Content |
|-------|---------|
| ID 0 | `<\|PAD\|>` — padding |
| IDs 1–255 | Raw UTF-8 byte tokens (fallback for unseen bytes) |
| IDs 256+ | Learned BPE merge tokens (lower ID = higher merge priority) |
| Last 6 IDs | Special tokens: `<\|EOS\|>`, `<\|BOS\|>`, `<\|SEP\|>`, `<\|MASK\|>`, `<\|UNK\|>`, `<\|CLS\|>` |
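
The table can be turned into ID ranges for any vocabulary size. This is a small sketch that assumes `vocab_size` counts the six special tokens, as the 50,251-token breakdown in Key Features suggests; `vocab_layout` is a hypothetical helper, and the order of the special tokens within the last six IDs is not modeled.

```python
from typing import Dict

NUM_SPECIAL_TOKENS = 6   # EOS, BOS, SEP, MASK, UNK, CLS
FIRST_MERGE_ID = 256     # IDs 0-255 hold PAD plus the byte tokens


def vocab_layout(vocab_size: int) -> Dict[str, range]:
    """Map each vocabulary region from the table to its ID range."""
    first_special = vocab_size - NUM_SPECIAL_TOKENS
    return {
        "pad": range(0, 1),
        "bytes": range(1, FIRST_MERGE_ID),
        "merges": range(FIRST_MERGE_ID, first_special),
        "special": range(first_special, vocab_size),
    }


layout = vocab_layout(50_251)
print(len(layout["merges"]))                            # 49989
print(layout["special"][0], layout["special"][-1])      # 50245 50250
```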

---

## Compression Rate

The compression rate measures encoding efficiency — characters per token:

$$\text{Compression Rate} = \frac{\text{number of characters}}{\text{number of tokens}}$$

**Example:** `"Nagiye gusura abanyeshuri."` (26 characters) → 11 tokens → **2.36X** compression.

Higher is better. The pre-trained Kinyarwanda tokenizer achieves significantly higher compression on Kinyarwanda text than general-purpose tokenizers (GPT-2, etc.) because its merges were learned from Kinyarwanda text.
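
The formula is simple enough to compute by hand, but a small helper keeps the calculation consistent across experiments. The function below is a hypothetical convenience, not part of the package; it reproduces the worked example above using the token count quoted there.

```python
def compression_rate(text: str, num_tokens: int) -> float:
    """Characters per token, as defined by the formula above."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return len(text) / num_tokens


text = "Nagiye gusura abanyeshuri."
rate = compression_rate(text, num_tokens=11)
print(f"{len(text)} characters / 11 tokens = {rate:.2f}X")
# 26 characters / 11 tokens = 2.36X
```

With the real tokenizer, pass `len(tokenizer.encode(text))` as `num_tokens`.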

---

## Rust Extension

The `kin_merge` Rust extension provides 3–5x speedups for training and encoding via [PyO3](https://pyo3.rs) + [Rayon](https://github.com/rayon-rs/rayon) parallelism. It is installed by default as the `alta-acceleration` package. The library falls back to pure Python if the extension is unavailable on your platform.
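
To see which backend is likely to be used in your environment, you can check whether the extension module is importable. This is a sketch that assumes the extension is exposed under the module name `kin_merge`; the library's own fallback logic may differ, and `rust_backend_available` is a hypothetical helper.

```python
import importlib.util


def rust_backend_available(module_name: str = "kin_merge") -> bool:
    """Return True if the named extension module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


backend = "rust" if rust_backend_available() else "python"
print(f"Using {backend} backend")
```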

### Building

```bash
# Requires: Rust toolchain (rustup.rs), Python ≥ 3.9
cd merger
pip install maturin
maturin develop --release
```

### What It Accelerates

| Operation | Rust Function | Speedup |
|-----------|--------------|---------|
| BPE merge loop | `rust_bpe_train_loop`, `RustBpeTrainer.step` | 3–5x |
| File streaming + pretokenization | `rust_stream_pretokenize_file` | Memory-efficient |
| Text encoding | `rust_encode_text_raw_batch` | 4–5x (zero-copy bytes) |
| Text preprocessing | `rust_preprocess_text` | 3–5x |
| Pair counting | `rust_count_pairs` | Parallel via Rayon |

Key optimizations: FxHashMap for integer pair keys, in-place merges, incremental pair counting, GIL release for true parallelism, and u32-LE byte output to avoid Python integer object overhead.

---

## Project Structure

```
Kin-Tokenizer/
├── kin_tokenizer/              # Main Python package
│   ├── __init__.py             # Exports KinTokenizer
│   ├── tokenizer.py            # KinTokenizer class (encode, decode, train, save, load)
│   ├── utils.py                # Training, dataset creation, preprocessing, Rust integration
│   ├── params.py               # Constants, regex patterns, special tokens, config
│   ├── version.py              # Package version
│   └── data/
│       └── non_kinyarwanda_words.json  # ~244K foreign words for training-time filtering
├── merger/                     # Rust extension (kin_merge)
│   ├── src/lib.rs              # PyO3 bindings: BPE training, encoding, preprocessing
│   ├── Cargo.toml              # Rust dependencies (pyo3, rayon, fxhash, regex)
│   └── pyproject.toml          # Maturin build config
├── training.py                 # CLI for training, retraining, and dataset creation
├── evaluate_tokenizer.py       # Tokenizer quality / speed evaluation
├── compare_tokenizers.py       # Side-by-side tokenizer comparison
├── read_tokenizer.py           # Interactive REPL for inspecting tokenizers
├── helpers.py                  # Email notifications & Weights & Biases setup
├── data/
│   ├── checkpoint/             # Pre-trained tokenizer checkpoint (kin_tokenizer.json)
│   └── sequences/              # Generated LM sequences (train_sequences.npy, eval_sequences.npy)
├── setup.py                    # Package setup (PyPI: altatk)
├── pyproject.toml              # Build system config
└── requirements.txt            # Python dependencies
```

---

## License

MIT License. See [LICENSE](LICENSE) for details.


