Metadata-Version: 2.4
Name: ion-tokenizer
Version: 1.0.1
Summary: Phrase-first semantic tokenizer compiler for neural language models
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: click>=8.0
Requires-Dist: datasets>=2.14
Requires-Dist: spacy>=3.5
Requires-Dist: gensim>=4.3
Requires-Dist: tqdm>=4.65
Requires-Dist: orjson>=3.9

<img width="186.25" height="198.25" alt="i-ion-light" src="https://github.com/user-attachments/assets/a6b0bcc5-99e7-4d06-8a79-9a0cee664086" /> 

# Ion
A phrase-first semantic tokenizer compiler for neural language models. Outperforms BPE in all metrics tested, and has benchmarked at 1.099x on our 2m pretrained tokenizer, meaning this architecture is capable of producing fewer tokens than there are words in the tokenized text.

## What ion Is

ion is a CLI that:

- Ingests large English text corpora
- Discovers and formalizes high-frequency semantic units
- Collapses those units into atomic tokens
- Produces a tokenizer that minimizes sequence length with configurable vocabulary size

ion exists to reduce attention cost, training cost, and sequence length for small to mid-sized language models by replacing surface-level English structure with a compact, explicit symbolic substrate.

The output tokenizer is intended to be used as the only language representation seen by the model during training and inference.

## What ion Is Not

ion is not:

- A BPE tokenizer
- A subword tokenizer
- A character tokenizer
- A human-facing language tool

ion does not preserve English spelling, aesthetics, or expressiveness beyond what is required for semantic coverage.

## Primary Objective

**Maximize sequence-length reduction.**

This is achieved by collapsing as many high-frequency multi-word phrases as possible into single tokens. Vocabulary size is configurable (default: 20,000); alternatively, **take-needed mode** (`-tn`) lifts the limit for complete coverage.

## Core Concept

Natural language contains many multi-word expressions that function as single semantic units:

- "a lot"
- "have to"
- "going to"
- "let's go"
- "kind of"
- "as soon as"
- "in front of"
- "out of"

ion formalizes these expressions as single atomic tokens.

**The guiding rule: one semantic unit = one token**

This rule applies to words, phrases, operators, and modifiers.

## Installation

```bash
pip install -e .
```

### Dependencies

- Python 3.9+ (matches `Requires-Python: >=3.9` in the package metadata)
- gensim (phrase detection)
- spacy (tokenization)
- datasets (Hugging Face corpus loading)
- click (CLI)
- orjson (fast JSON serialization)
- tqdm (progress bars)

## Commands

### ion tokenize

Build a tokenizer from a corpus.

```bash
# From Hugging Face dataset
ion tokenize wikitext -o tokenizer.json

# From local file
ion tokenize corpus.txt -o tokenizer.json

# From directory of text files
ion tokenize ./data/ -o tokenizer.json

# With custom vocabulary size (no hard cap!)
ion tokenize wikitext --max-vocab 50000

# TAKE-NEEDED mode: include ALL words and ALL phrases
ion tokenize corpus.txt -tn -o full_tokenizer.json

# With full configuration
ion tokenize wikitext \
    --max-vocab 32000 \
    --min-phrase-freq 10 \
    --phrase-threshold 0.001 \
    --max-phrase-layers 4 \
    --max-sentences 1000000
```

**Vocabulary Modes:**

| Mode | Flag | Description |
|------|------|-------------|
| Standard | `--max-vocab N` | Limit vocabulary to N tokens (default: 20000) |
| Take-Needed | `-tn` | Include ALL words and ALL discovered phrases |

Take-needed mode ensures complete coverage with zero character fallback for known words.

### ion clean

Preprocess text for maximum sequence-length reduction.

```bash
# Basic cleaning
ion clean input.txt -o output.txt

# Standard aggressive cleaning
ion clean input.txt --aggressive

# Maximum compression (all options enabled)
ion clean input.txt --maximum

# EXTREME compression (includes stopword removal - may lose meaning)
ion clean input.txt --extreme

# Directory processing
ion clean ./data/ -o ./cleaned/
```

**Cleaning Presets:**

| Preset | Description |
|--------|-------------|
| `--aggressive` | URLs, HTML, emails, emphasis, contractions, punctuation |
| `--maximum` | ALL semantic-preserving options for maximum reduction |
| `--extreme` | MAXIMUM + stopword removal (may lose semantic meaning) |

**Cleaning Options:**

| Flag | Description |
|------|-------------|
| `--remove-emphasis` | Remove intensifiers (very, really, extremely) |
| `--remove-fillers` | Remove filler words (um, like, you know) |
| `--remove-discourse` | Remove discourse markers (however, therefore) |
| `--semantic-normalize` | Normalize phrases (gonna->going to, a lot of->many) |
| `--normalize-numbers` | Replace numbers with generic tokens |
| `--remove-parens` | Remove parenthetical expressions |
| `--simplify-punct` | Simplify punctuation to canonical forms |
| `--deduplicate` | Remove adjacent duplicate words |
| `--remove-redundant` | Remove redundant phrases (very unique -> unique) |
| `--remove-starters` | Remove sentence starters (well, so, anyway) |
| `--simplify-redundant` | Simplify redundant phrases (end result -> result) |
| `--strip-quotes` | Remove quotation marks |
| `--flatten-case` | Flatten ALL CAPS and MiXeD CaSe |
| `--remove-stopwords` | Aggressively remove stopwords (the, a, is) |
| `--expand-contractions` | Expand contractions (can't->cannot) |
| `--collapse-punctuation` | Collapse repeated punctuation |
| `--strip-urls` | Remove URLs |
| `--strip-emails` | Remove email addresses |
| `--strip-html` | Remove HTML tags |
| `--strip-special` | Remove special characters |
| `--strip-numbers` | Remove standalone numbers |

### ion stats

Report tokenizer statistics and compression metrics.

```bash
# Basic stats
ion stats -t tokenizer.json

# With corpus analysis
ion stats -t tokenizer.json -c wikitext
```

### ion iterate

Adapt an existing tokenizer to new corpus data.

```bash
ion iterate new_corpus.txt -t tokenizer.json -o updated_tokenizer.json
```

### ion retokenize

Re-encode text using a tokenizer.

```bash
# Output token IDs
ion retokenize document.txt -t tokenizer.json -o document.tokens

# Output token text
ion retokenize document.txt -t tokenizer.json --format tokens

# Output both IDs and tokens
ion retokenize document.txt -t tokenizer.json --format both
```

### ion compare

Compare two tokenizers with optional corpus analysis.

```bash
# Basic comparison
ion compare tokenizer1.json tokenizer2.json

# With corpus compression analysis
ion compare tokenizer1.json tokenizer2.json -c corpus.txt

# With Hugging Face dataset (full URL or identifier)
ion compare tokenizer1.json tokenizer2.json -c https://huggingface.co/datasets/wikitext --max-samples 10000
```

### ion help

Show detailed help and usage information.

```bash
ion help
```

### ion uninstall

Remove ion from the system.

```bash
ion uninstall
```

### ion benchmark

Benchmark ion against a BPE tokenizer on a corpus.

```bash
# Basic benchmark on local corpus
ion benchmark corpus.txt --vocab-size 10000

# Benchmark on Hugging Face dataset (full URL)
ion benchmark https://huggingface.co/datasets/wikitext --vocab-size 10000 --max-samples 50000

# With specific configuration
ion benchmark https://huggingface.co/datasets/HuggingFaceFW/fineweb --hf-config sample-10BT --vocab-size 15000
```

This command trains both an ion tokenizer and a BPE tokenizer on the same corpus,
then compares their compression performance. Requires the `tokenizers` package for BPE.

## Token Boundaries

- Whitespace is not a token
- Every token implicitly includes its own boundary
- Tokens may include leading-space semantics internally
- Rendering back to English is handled outside ion

## Open-Vocabulary Requirement

The tokenizer never fails. Language outside the known vocabulary remains representable under these constraints:

- A fixed fallback alphabet covers out-of-vocabulary text
- The vocabulary is never expanded at runtime
- No tokens are created dynamically
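
A minimal sketch of this behavior, assuming a hypothetical fallback alphabet of lowercase letters and digits plus `<unk>` (ion's actual alphabet and IDs are not specified here). The read-only mapping stands in for "no runtime vocabulary expansion":

```python
import string
from types import MappingProxyType
from typing import List

# Hypothetical fallback alphabet: "<unk>", then a-z, then 0-9.
_alphabet = ["<unk>"] + list(string.ascii_lowercase + string.digits)

# MappingProxyType gives a read-only view, so nothing can add tokens at runtime.
FALLBACK = MappingProxyType({ch: idx for idx, ch in enumerate(_alphabet)})

def encode_unknown(word: str) -> List[int]:
    """Spell an out-of-vocabulary word with fallback characters."""
    return [FALLBACK.get(ch, FALLBACK["<unk>"]) for ch in word]

print(encode_unknown("zq9"))  # [26, 17, 36]
```

Any input string yields some ID sequence, so encoding cannot fail, and the mapping itself never changes.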

## Phrase Discovery

Phrase discovery is the core mechanism of ion. The system:

- Aggressively discovers phrases using gensim Phrases and PMI scoring
- Aggressively collapses them into single tokens
- In standard mode: selects best phrases within vocab limit
- In take-needed mode (`-tn`): includes ALL discovered phrases
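
For intuition about PMI scoring, the sketch below computes pointwise mutual information for adjacent word pairs in pure Python. It is not gensim's scorer and not ion's code; the function name `bigram_pmi` and the `min_count` parameter are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def bigram_pmi(sentences: List[List[str]], min_count: int = 2) -> Dict[Tuple[str, str], float]:
    """Return log2(p(a, b) / (p(a) * p(b))) for bigrams seen at least `min_count` times."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_ab = count / n_bi
        p_a, p_b = unigrams[a] / n_uni, unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores

sentences = [
    ["we", "have", "to", "go"],
    ["you", "have", "to", "stay"],
    ["go", "to", "them"],
]
scores = bigram_pmi(sentences)
print(scores)  # only ("have", "to") passes min_count, with a positive score
```

A positive score means the pair co-occurs more often than its individual word frequencies would predict, which is the signal used to propose it as a phrase candidate.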

### Selection Algorithm

1. Extract n-grams of 2 to 5 or more words using layered phrase detection
2. Score by total tokens saved: `frequency * (length - 1)`
3. Apply exponential length bonus: longer phrases heavily favored
4. Apply frequency density weighting
5. Greedily select phrases until vocab cap or diminishing returns
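
Steps 2, 3, and 5 can be sketched as follows. This is a simplified illustration under stated assumptions, not ion's selection code: `bonus_base=1.5` is a hypothetical value (the list only says the bonus is exponential), frequency density weighting (step 4) is omitted, and only the vocabulary cap is applied as a stopping rule.

```python
from typing import Dict, List, Tuple

Phrase = Tuple[str, ...]

def tokens_saved(phrase: Phrase, frequency: int) -> int:
    """Step 2: each occurrence replaces len(phrase) tokens with one."""
    return frequency * (len(phrase) - 1)

def select_phrases(
    counts: Dict[Phrase, int],
    budget: int,
    bonus_base: float = 1.5,  # hypothetical; the source does not give a value
) -> List[Phrase]:
    """Rank candidates by bonus-weighted savings and keep the top `budget`."""
    def score(item: Tuple[Phrase, int]) -> float:
        phrase, freq = item
        # Step 3: the bonus grows exponentially with phrase length.
        return tokens_saved(phrase, freq) * bonus_base ** (len(phrase) - 2)

    ranked = sorted(counts.items(), key=score, reverse=True)
    return [phrase for phrase, _ in ranked[:budget]]

counts = {
    ("going", "to"): 900,
    ("as", "soon", "as"): 350,
    ("in", "front", "of"): 120,
    ("kind", "of"): 400,
}
print(select_phrases(counts, budget=2))
# [('as', 'soon', 'as'), ('going', 'to')]
```

With these numbers the three-word phrase outranks the more frequent two-word phrase (350 × 2 × 1.5 = 1050 versus 900 × 1 × 1 = 900), showing how a length bonus shifts selection toward longer units.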

## Tokenization Rules

- **Deterministic**: Same input always produces same output
- **Greedy longest-match**: Phrase tokens override word tokens
- **Priority order**: Phrase > Word > Fallback character
- **No BPE or unigram LM methods**
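
The rules above can be sketched as a greedy longest-match loop. This is an illustrative sketch rather than ion's encoder: it assumes whitespace-separated words, phrases stored in the vocabulary as space-joined strings, a hypothetical `max_phrase_len` of 5, and a toy vocabulary.

```python
from typing import Dict, List

def tokenize(text: str, vocab: Dict[str, int], max_phrase_len: int = 5) -> List[int]:
    """Greedy longest-match with priority phrase > word > fallback character."""
    words = text.split()
    ids: List[int] = []
    i = 0
    while i < len(words):
        # Try the longest span first, shrinking down to a single word.
        for n in range(min(max_phrase_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in vocab:
                ids.append(vocab[candidate])
                i += n
                break
        else:
            # Neither a phrase nor the word matched: spell it out per character.
            ids.extend(vocab.get(ch, vocab["<unk>"]) for ch in words[i])
            i += 1
    return ids

# Toy vocabulary (assumption): one phrase, a few words, two fallback characters.
vocab = {"<unk>": 0, "going to": 1, "going": 2, "to": 3, "i": 4, "am": 5, "x": 6, "y": 7}
print(tokenize("i am going to xy", vocab))  # [4, 5, 1, 6, 7]
```

Note that "going to" becomes the single ID 1 even though "going" and "to" are both in the vocabulary, and the unknown word "xy" falls back to characters. Because the loop has no randomness and always prefers the longest match, the same input always yields the same output.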

## Output Format

ion outputs a single artifact: `tokenizer.json`

```json
{
    "version": "1.0",
    "vocab_size": 19847,
    "vocab": {
        "<pad>": 0,
        "<unk>": 1,
        "<bos>": 2,
        "<eos>": 3,
        "going to": 4,
        "have to": 5,
        ...
    },
    "token_classes": {
        "special": [...],
        "phrase": [...],
        "word": [...],
        "fallback": [...]
    }
}
```
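
A minimal sketch of consuming this file from Python, assuming only the `vocab` mapping shown above (token string to integer ID). The helper name `invert_vocab` is hypothetical, and the inline JSON string stands in for a real `tokenizer.json`.

```python
import json
from typing import Dict, List

def invert_vocab(tokenizer: dict) -> Dict[int, str]:
    """Build an ID -> token lookup from the "vocab" mapping."""
    return {idx: token for token, idx in tokenizer["vocab"].items()}

def ids_to_tokens(ids: List[int], id_to_token: Dict[int, str]) -> List[str]:
    """Map IDs back to token strings; unknown IDs become "<unk>"."""
    return [id_to_token.get(i, "<unk>") for i in ids]

# Stand-in for json.load(open("tokenizer.json", encoding="utf-8")).
tokenizer = json.loads(
    '{"version": "1.0", "vocab": {"<pad>": 0, "<unk>": 1, "<bos>": 2,'
    ' "<eos>": 3, "going to": 4, "have to": 5}}'
)
id_to_token = invert_vocab(tokenizer)
print(ids_to_tokens([2, 4, 3], id_to_token))  # ['<bos>', 'going to', '<eos>']
```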

## Priority Order

1. **Sequence-length reduction** (primary objective)
2. **Determinism**
3. **Vocabulary size control** (configurable, default: 20,000)
4. **Coverage** (use `-tn` for complete coverage)

Human readability is not a priority.

## Data Input

ion accepts English corpora from:

- Hugging Face datasets (URLs or dataset identifiers)
- Local text files
- Local directories of text files

### Hugging Face Options

```bash
--hf-split train          # Dataset split
--hf-text-field text      # Text field name
--hf-config sample-10BT   # Dataset configuration
--max-samples 100000      # Maximum samples to load
```
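
As a rough guide to what these options correspond to, the sketch below maps them onto the Hugging Face `datasets` API. How ion wires them internally is not documented here, so the function names and the use of streaming are assumptions.

```python
from itertools import islice
from typing import Iterable, Iterator, Mapping, Optional

def iter_texts(
    records: Iterable[Mapping],
    text_field: str = "text",
    max_samples: Optional[int] = None,
) -> Iterator[str]:
    """Yield non-empty `text_field` values from at most `max_samples` records."""
    for record in islice(records, max_samples):
        text = record.get(text_field)
        if text:
            yield text

def load_hf_texts(name, config=None, split="train", text_field="text", max_samples=None):
    """Stream a Hugging Face dataset; mirrors --hf-config, --hf-split, --hf-text-field."""
    from datasets import load_dataset  # lazy import; needs the `datasets` package

    dataset = load_dataset(name, config, split=split, streaming=True)
    return iter_texts(dataset, text_field, max_samples)

# Offline demonstration with in-memory records instead of a real dataset.
records = [{"text": "a"}, {"text": ""}, {"text": "b"}, {"text": "c"}]
print(list(iter_texts(records, max_samples=3)))  # ['a', 'b']
```

In this sketch `max_samples` caps the number of records read, so empty records still count toward the limit.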

## Example Workflow

```bash
# 1. Clean your corpus for maximum compression
ion clean raw_corpus.txt --maximum -o cleaned_corpus.txt

# 2. Build tokenizer (standard mode with custom size)
ion tokenize cleaned_corpus.txt -o tokenizer.json --max-vocab 32000

# 2b. OR build with take-needed mode for complete coverage
ion tokenize cleaned_corpus.txt -tn -o tokenizer_full.json

# 3. Check compression stats
ion stats -t tokenizer.json -c cleaned_corpus.txt

# 4. Benchmark against BPE
ion benchmark cleaned_corpus.txt --vocab-size 10000

# 5. Iterate with additional data
ion iterate more_data.txt -t tokenizer.json -o tokenizer.json
```

## Interactive GUI

Launch the interactive GUI by running `ion` without arguments:

```bash
ion
```

The GUI provides:

- Visual menu navigation (arrow keys or j/k)
- Vocabulary size configuration with visual slider
- Take-needed mode selection
- Command help with `?` key
- Animated ASCII ion atom

## License

MIT
