Metadata-Version: 2.4
Name: ion-tokenizer
Version: 1.0.3
Summary: Phrase-first semantic tokenizer compiler for neural language models
Requires-Python: <3.14,>=3.9
Description-Content-Type: text/markdown
License-File: LICENSE.md
Requires-Dist: click>=8.0
Requires-Dist: datasets>=2.14
Requires-Dist: spacy>=3.5
Requires-Dist: tqdm>=4.65
Requires-Dist: orjson>=3.9
Requires-Dist: gensim>=4.3
Dynamic: license-file

<img width="186.25" height="198.25" alt="i-ion-light" src="https://github.com/user-attachments/assets/a6b0bcc5-99e7-4d06-8a79-9a0cee664086" />

# Ion

A phrase-first semantic tokenizer compiler for neural language models. Shorter sequences = fewer tokens = lower inference costs.

Ion discovers multi-word phrases ("going to", "in front of", "machine learning") and collapses them into single tokens. The result: **14-44% fewer tokens** than BPE on the same text, with the same or smaller vocabulary.

## Install

```bash
pip install ion-tokenizers
```

Python 3.9 - 3.13 supported.

## Quickstart

```bash
# Build a tokenizer from any text corpus
ion tokenize corpus.txt -o tokenizer.json

# Build from a HuggingFace dataset
ion tokenize wikitext -o tokenizer.json

# Check how well it compresses
ion stats -t tokenizer.json -c corpus.txt
```

## Benchmarks

All benchmarks train Ion and BPE on the same corpus and compare token counts on held-out data.

### Natural Language

| Domain | Vocab Size | Ion Advantage | Ion chars/tok | BPE chars/tok |
|--------|-----------|---------------|---------------|---------------|
| WikiText-103 | 10K | **22.1%** | 5.53 | 4.30 |
| AG News | 10K | **20.3%** | 5.48 | 4.37 |
| Legal Text | 10K | **38.1%** | 9.40 | 5.82 |
| Scientific Abstracts | 10K | **39.0%** | 9.54 | 5.82 |
| Conversational | 10K | **33.9%** | 6.11 | 4.04 |
| Medical Abstracts | 10K | **35.7%** | 9.27 | 5.95 |
| Technical Docs | 10K | **40.1%** | 8.54 | 5.12 |
| IMDB Reviews | 10K | **19.2%** | 5.22 | 4.22 |
| Literary Text | 10K | **37.6%** | 7.42 | 4.62 |

### Source Code (GitHub Repos)

| Repository | Language | Ion Advantage |
|-----------|----------|---------------|
| psf/requests | Python | **22.6%** |
| iluwatar/java-design-patterns | Java | **22.5%** |
| jekyll/jekyll | Ruby | **19.0%** |
| fastapi/fastapi | Python | **18.3%** |
| lodash/lodash | JavaScript | **14.5%** |
| expressjs/express | JavaScript | **12.0%** |
| pallets/flask | Python | **8.1%** |
| BurntSushi/ripgrep | Rust | **6.4%** |

Ion wins on all 14 repositories tested. Average advantage across all benchmarks: **~15-30%** depending on domain.

Run your own:

```bash
ion benchmark corpus.txt --vocab-size 10000
```

## How It Works

Ion uses a **phrase-first** architecture with a strict priority hierarchy:

```
Phrases > Words > BPE subwords > Characters
```

1. **Phrase discovery** — PMI-based multi-layer detection finds n-grams (2-5+ words) that co-occur more than chance
2. **Compression ranking** — Phrases are scored by `tokens_saved * length_bonus * frequency` and greedily selected
3. **Greedy longest-match encoding** — At encode time, the longest matching phrase wins. Deterministic, same input always produces same output
4. **Fallback hierarchy** — Unknown words fall back to BPE subwords (default), or optionally character-level or dynamic vocab

### Fallback Modes

| Mode | Flag | Description |
|------|------|-------------|
| BPE (default) | `--fallback-mode bpe` | Unknown words split into BPE subwords |
| Character | `--fallback-mode character` | Fall back to individual characters |
| Newword | `--fallback-mode newword` | Dynamically add unknown words to vocab |
| Word | `--fallback-mode word` | Treat unknown words as single `<unk>` tokens |

### Vocabulary Modes

| Mode | Flag | Description |
|------|------|-------------|
| Standard | `--max-vocab N` | Limit vocabulary to N tokens (default: 20,000) |
| Take-Needed | `-tn` | Include ALL words and ALL discovered phrases, no cap |

## CLI Reference

```bash
ion                              # Launch interactive GUI
ion tokenize <source>            # Build tokenizer from corpus
ion stats -t tokenizer.json      # Show tokenizer statistics
ion benchmark <corpus>           # Ion vs BPE comparison
ion sweep <corpus>               # Vocab size sweep
ion compare t1.json t2.json      # Compare two tokenizers
ion iterate <corpus> -t t.json   # Adapt tokenizer to new data
ion retokenize <text> -t t.json  # Re-encode text with tokenizer
ion clean <text>                 # Preprocess text for compression
ion export -t t.json --format hf # Export to other formats
```

### Key Options

```bash
# Tokenizer building
--max-vocab 32000          # Vocabulary budget
--fallback-mode bpe        # bpe | character | newword | word
--phrase-target-pct 60     # % of vocab allocated to phrases
--max-phrase-layers 4      # Depth of n-gram discovery
--preserve-case            # Don't lowercase (useful for code)
--language en              # spaCy language code

# Corpus loading
--hf-split train           # HuggingFace dataset split
--hf-text-field text       # Text field name
--hf-config sample-10BT    # Dataset configuration
--max-samples 100000       # Max samples to load
```

## Python API

```python
from ion.tokenizer import IonTokenizer
from ion.builder import build_tokenizer

# Build from corpus
stats = build_tokenizer("corpus.txt", output="tokenizer.json", max_vocab_size=20000)

# Load and use
tokenizer = IonTokenizer.from_file("tokenizer.json")
ids = tokenizer.encode("the machine learning model")
text = tokenizer.decode(ids)
```

## HuggingFace Integration

```python
from ion.integrations import IonTokenizerHF

tokenizer = IonTokenizerHF.from_pretrained("tokenizer.json")
encoded = tokenizer("Hello world", return_tensors="pt")

# Works with HuggingFace Trainer
from transformers import Trainer
trainer = Trainer(model=model, tokenizer=tokenizer, ...)
```

```python
# PyTorch Dataset
from ion.integrations import IonDataset, ion_collate_fn
from torch.utils.data import DataLoader

dataset = IonDataset(texts, tokenizer, max_length=512)
loader = DataLoader(dataset, collate_fn=ion_collate_fn)
```

## Interactive GUI

Run `ion` with no arguments to launch the interactive terminal GUI with visual menu navigation, vocabulary configuration, and animated ASCII ion atom.

```bash
ion
```

## Licensing

Ion is **free for open-source models under 10B parameters**. Paid tiers:

| Use Case | Fee |
|----------|-----|
| Open-source, ≤10B params | **Free** |
| Closed-source, ≤10B params | **$100** (one-time) |
| Open-source, >10B params | **$1,000** (one-time) |
| Closed-source, >10B params | **$3,000/B over 10B** (one-time) |

Source-available: you may examine and redistribute with attribution, but no modifications or forks. See [LICENSE.md](LICENSE.md) for full terms.

Run `ion register` to set up your license.
