Metadata-Version: 2.4
Name: untoken
Version: 0.1.0
Summary: Token compression for LLM prompts
Home-page: https://github.com/pacifio/untoken
Author: Adib Mohsin
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.1.0
Requires-Dist: transformers>=4.36.0
Requires-Dist: tokenizers>=0.15.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: tqdm>=4.66.0
Provides-Extra: train
Requires-Dist: datasets>=2.16.0; extra == "train"
Requires-Dist: sentence-transformers>=2.2.0; extra == "train"
Requires-Dist: accelerate>=0.25.0; extra == "train"
Requires-Dist: wandb>=0.16.0; extra == "train"
Requires-Dist: bert-score>=0.3.13; extra == "train"
Requires-Dist: rouge-score>=0.1.2; extra == "train"
Requires-Dist: scipy>=1.11.0; extra == "train"
Requires-Dist: scikit-learn>=1.3.0; extra == "train"
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# UNTOKEN

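Token compression for LLM prompts. At a retention ratio of 0.3, UNTOKEN drops roughly 70% of prompt tokens while preserving most of the semantic content (0.878 cosine similarity on CNN/DailyMail; see Evaluation).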

UNTOKEN uses a learned token selector: given a sequence of N tokens, it scores each token for contextual importance and returns the ~0.3N highest-scoring tokens as a subsequence. The model is a fine-tuned DistilBERT encoder with a lightweight importance head trained via an adversarial autoencoder objective.

## Install

```bash
pip install untoken
```

Requires Python 3.10+ and PyTorch 2.1+. Works on CPU and GPU.

## Quick Start

```python
from untoken import Untoken

# Load from HuggingFace Hub
ut = Untoken("pacifio/untoken-v1")

text = """
The quarterly earnings report showed a significant increase in revenue,
driven primarily by strong performance in the cloud computing division.
Operating margins improved by 3.2 percentage points year-over-year,
reflecting continued efficiency gains and disciplined cost management
across all business segments. The board approved a share buyback program
worth $2 billion, signaling confidence in the company's long-term outlook.
"""

compressed = ut.compress(text)
print(compressed)
# quarterly earnings report showed significant increase revenue driven
# strong performance cloud computing division operating margins improved
# 3.2 percentage points year-over-year efficiency gains disciplined cost
# management business segments board approved share buyback $2 billion
# confidence company long-term outlook
```

## Return Stats

```python
compressed, stats = ut.compress(text, ratio=0.3, return_stats=True)

print(compressed)
print(stats)
# {
#   "original_tokens": 128,
#   "compressed_tokens": 39,
#   "ratio": 0.305,
#   "savings_pct": 69.5
# }
```
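
The `ratio` and `savings_pct` fields appear to follow directly from the two token counts. A minimal sketch of that arithmetic, assuming `ratio` is rounded to three decimals and `savings_pct` to one (the helper `compression_stats` is hypothetical and not part of the `untoken` API):

```python
def compression_stats(original_tokens: int, compressed_tokens: int) -> dict:
    """Derive ratio and savings from two token counts (illustrative helper)."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    # Fraction of tokens retained after compression
    ratio = compressed_tokens / original_tokens
    return {
        "original_tokens": original_tokens,
        "compressed_tokens": compressed_tokens,
        "ratio": round(ratio, 3),                    # rounding is an assumption
        "savings_pct": round(100 * (1 - ratio), 1),  # percent of tokens removed
    }


print(compression_stats(128, 39))
# {'original_tokens': 128, 'compressed_tokens': 39, 'ratio': 0.305, 'savings_pct': 69.5}
```

The same arithmetic reproduces the CLI example: 512 → 154 tokens gives 69.9% savings.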

## Adjustable Compression Ratio

The `ratio` parameter controls the fraction of tokens retained. Lower values compress more aggressively.

```python
# Keep 50% of tokens (lighter compression)
compressed = ut.compress(text, ratio=0.5)

# Keep 20% of tokens (aggressive compression)
compressed = ut.compress(text, ratio=0.2)
```

No retraining is required: the ratio is applied at inference time via top-k selection over the importance scores.
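
To illustrate why no retraining is needed, here is a rough sketch of ratio-based top-k selection over per-token importance scores. The function `select_top_k`, the example scores, and the choice to round `k` up are assumptions for illustration only; the library's actual rounding and tie-breaking may differ.

```python
import math


def select_top_k(tokens, scores, ratio=0.3):
    """Keep the ceil(ratio * N) highest-scoring tokens, in original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    if not tokens:
        return []
    # Number of tokens to keep (rounding up is an assumption)
    k = max(1, math.ceil(ratio * len(tokens)))
    # Indices of the k largest scores; ties go to the earlier token
    top = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))[:k]
    # Restore original order so the output is a subsequence of the input
    return [tokens[i] for i in sorted(top)]


tokens = ["the", "board", "approved", "a", "share", "buyback"]
scores = [0.05, 0.80, 0.70, 0.02, 0.60, 0.90]  # made-up importance scores
print(select_top_k(tokens, scores, ratio=0.5))
# ['board', 'approved', 'buyback']
```

Because the scores are computed once and only `k` changes with `ratio`, the same model can serve any compression level.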

## CLI

```bash
# Compress a file
untoken --model pacifio/untoken-v1 --input prompt.txt --ratio 0.3

# Output includes compression stats
# [512 → 154 tokens, 69.9% savings]
```

## Long Documents

Documents longer than 480 tokens are automatically split into chunks at sentence boundaries, and each chunk is compressed independently. No truncation occurs.

```python
with open("long_document.txt") as f:
    text = f.read()

# Works on arbitrarily long inputs
compressed = ut.compress(text, ratio=0.3)
```
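
The chunking happens inside `compress`, so no extra code is needed. For intuition, sentence-boundary chunking could look roughly like the sketch below. It assumes a regex sentence splitter and a whitespace word count standing in for the real tokenizer; `chunk_by_sentence` is a hypothetical helper, not the library's implementation.

```python
import re


def chunk_by_sentence(text, max_tokens=480, count_tokens=None):
    """Greedily pack whole sentences into chunks of at most max_tokens."""
    # Stand-in token counter: whitespace-separated words (an assumption)
    count_tokens = count_tokens or (lambda s: len(s.split()))
    # Split after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk if adding this sentence would exceed the budget
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        # A single sentence longer than max_tokens becomes its own chunk
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_by_sentence("One two three. Four five six. Seven eight.", max_tokens=6))
# ['One two three. Four five six.', 'Seven eight.']
```

Every sentence lands in exactly one chunk, which is consistent with the "no truncation" guarantee above.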

## Batch Usage

```python
# Any list of strings; doc1, doc2, doc3 stand in for your own documents
texts = [doc1, doc2, doc3]

compressed_texts = [ut.compress(t, ratio=0.3) for t in texts]
```
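
When compressing many documents, it can help to track total savings across the batch. A small sketch that relies only on the `compress(..., return_stats=True)` call and the stats keys shown under Return Stats; the wrapper `compress_batch` itself is hypothetical:

```python
def compress_batch(compressor, texts, ratio=0.3):
    """Compress each text and total the token counts across the batch."""
    outputs, total_in, total_out = [], 0, 0
    for text in texts:
        compressed, stats = compressor.compress(text, ratio=ratio, return_stats=True)
        outputs.append(compressed)
        total_in += stats["original_tokens"]
        total_out += stats["compressed_tokens"]
    # Overall fraction of tokens retained; 0.0 for an empty batch
    overall = total_out / total_in if total_in else 0.0
    totals = {
        "original_tokens": total_in,
        "compressed_tokens": total_out,
        "ratio": round(overall, 3),
    }
    return outputs, totals


# Usage with a loaded model:
# compressed_texts, totals = compress_batch(ut, texts, ratio=0.3)
```

The loop processes texts one at a time, as in the list comprehension above; it does not assume any batched inference support in the library.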

## Evaluation (CNN/DailyMail, n=200, ratio=0.3)

| Method | Cosine Sim | ROUGE-L | Compression Ratio |
|--------|-----------|---------|-------------------|
| **UNTOKEN** | **0.878** | **0.459** | **0.304** |
| Random drop | 0.723 | 0.429 | 0.303 |
| Stopword removal | 0.933 | 0.824 | 0.761 |

UNTOKEN scores 0.155 higher cosine similarity than random token dropping (0.878 vs. 0.723) at an equivalent compression ratio. Stopword removal retains 76% of tokens, so it is not a comparable operating point.
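
The table does not state how the cosine similarity column is computed. One common setup, assumed here, is to embed the original and compressed texts with a sentence encoder and compare the two vectors. The metric itself can be sketched in plain Python; the embedding step is omitted and the vectors below are made up:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up 2-d "embeddings": identical direction vs. orthogonal
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Under this reading, a score near 1.0 means the compressed text points in nearly the same embedding direction as the original.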

## Model

The shipped artifact is a single ~300MB model: a DistilBERT encoder (66M parameters) with a 2-layer MLP importance head. The reconstructor and discriminator used during training are discarded at inference.

Model on HuggingFace: [pacifio/untoken-v1](https://huggingface.co/pacifio/untoken-v1)

## License

Apache 2.0
