Metadata-Version: 2.4
Name: pretok
Version: 0.2.0
Summary: Universal pre-token language adaptation layer for text-based LLMs
Project-URL: Homepage, https://github.com/yen0304/pretok
Project-URL: Documentation, https://github.com/yen0304/pretok#readme
Project-URL: Repository, https://github.com/yen0304/pretok
Project-URL: Issues, https://github.com/yen0304/pretok/issues
Author-email: yen0304 <asce55123@gmail.com>
License: MIT
License-File: LICENSE
Keywords: language-detection,llm,nlp,pre-tokenization,translation
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: all
Requires-Dist: deepl>=1.0; extra == 'all'
Requires-Dist: fasttext-wheel>=0.9.2; extra == 'all'
Requires-Dist: google-cloud-translate>=3.0; extra == 'all'
Requires-Dist: langdetect>=1.0.9; extra == 'all'
Requires-Dist: openai>=1.0; extra == 'all'
Requires-Dist: redis>=4.0; extra == 'all'
Requires-Dist: sentencepiece>=0.1.99; extra == 'all'
Requires-Dist: torch>=2.0; extra == 'all'
Requires-Dist: transformers>=4.30; extra == 'all'
Provides-Extra: api
Requires-Dist: deepl>=1.0; extra == 'api'
Requires-Dist: google-cloud-translate>=3.0; extra == 'api'
Requires-Dist: openai>=1.0; extra == 'api'
Provides-Extra: deepl
Requires-Dist: deepl>=1.0; extra == 'deepl'
Provides-Extra: detection
Requires-Dist: fasttext-wheel>=0.9.2; extra == 'detection'
Requires-Dist: langdetect>=1.0.9; extra == 'detection'
Provides-Extra: dev
Requires-Dist: hypothesis>=6.0; extra == 'dev'
Requires-Dist: mypy>=1.8; extra == 'dev'
Requires-Dist: pre-commit>=3.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.0; extra == 'docs'
Requires-Dist: mkdocs>=1.5; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.24; extra == 'docs'
Provides-Extra: fasttext
Requires-Dist: fasttext-wheel>=0.9.2; extra == 'fasttext'
Provides-Extra: google
Requires-Dist: google-cloud-translate>=3.0; extra == 'google'
Provides-Extra: langdetect
Requires-Dist: langdetect>=1.0.9; extra == 'langdetect'
Provides-Extra: local
Requires-Dist: sentencepiece>=0.1.99; extra == 'local'
Requires-Dist: torch>=2.0; extra == 'local'
Requires-Dist: transformers>=4.30; extra == 'local'
Provides-Extra: m2m100
Requires-Dist: sentencepiece>=0.1.99; extra == 'm2m100'
Requires-Dist: torch>=2.0; extra == 'm2m100'
Requires-Dist: transformers>=4.30; extra == 'm2m100'
Provides-Extra: nllb
Requires-Dist: sentencepiece>=0.1.99; extra == 'nllb'
Requires-Dist: torch>=2.0; extra == 'nllb'
Requires-Dist: transformers>=4.30; extra == 'nllb'
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == 'openai'
Provides-Extra: redis
Requires-Dist: redis>=4.0; extra == 'redis'
Description-Content-Type: text/markdown

<p align="center">
  <img src="https://raw.githubusercontent.com/yen0304/pretok/main/logo.png" alt="pretok logo" width="640">
</p>

<h1 align="center">pretok</h1>

<p align="center">
  <a href="https://github.com/yen0304/pretok/actions/workflows/ci.yml"><img src="https://github.com/yen0304/pretok/actions/workflows/ci.yml/badge.svg?branch=main" alt="CI"></a>
  <a href="https://codecov.io/gh/yen0304/pretok"><img src="https://codecov.io/gh/yen0304/pretok/branch/main/graph/badge.svg" alt="codecov"></a>
  <a href="https://pypi.org/project/pretok/"><img src="https://img.shields.io/pypi/v/pretok.svg" alt="PyPI version"></a>
  <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.11+-blue.svg" alt="Python 3.11+"></a>
  <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
  <a href="https://github.com/astral-sh/ruff"><img src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json" alt="Ruff"></a>
</p>

> Universal pre-token language adaptation layer for text-based LLMs.

**pretok** lets any Large Language Model accept input in any human language by automatically translating the input text into a language the model supports. Translation happens on raw text before tokenization, so neither the model nor the tokenizer is modified.

## ✨ Features

- **Model-Agnostic**: Works with any text-based LLM (local, remote, open-source, proprietary)
- **Pre-Token Boundary**: All transformations occur on raw text before tokenization
- **Prompt Structure Preservation**: Role markers, delimiters, code blocks, and control tokens are preserved
- **Flexible Translation**: Use any LLM via OpenAI-compatible APIs (OpenRouter, Ollama, vLLM, etc.)
- **Pluggable Backends**: Support for multiple detection and translation engines
- **Explicit Capability Contracts**: Models declare their supported languages

## 🚀 Installation

```bash
pip install pretok
```

Or with uv:

```bash
uv add pretok
```

### Optional Dependencies

```bash
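# Quote the extras so shells such as zsh do not expand the brackets

# Language detection
pip install "pretok[fasttext]"      # FastText (high accuracy)
pip install "pretok[langdetect]"    # langdetect (pure Python)

# Translation backends
pip install "pretok[nllb]"          # Meta's NLLB model (local)
pip install "pretok[m2m100]"        # M2M100 model (local)
pip install "pretok[openai]"        # OpenAI API
pip install "pretok[deepl]"         # DeepL API
pip install "pretok[google]"        # Google Cloud Translation API

# Other extras
pip install "pretok[redis]"         # Redis client

# All features
pip install "pretok[all]"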
```

## 📖 Quick Start

```python
from pretok import Pretok, create_pretok

# Create with default settings
pretok = Pretok(target_language="en")

# Process text
result = pretok.process("Bonjour, comment ça va ?")

print(result.processed_text)  # "Hello, how are you?"
print(result.was_modified)    # True
```

### With Model-Specific Optimization

```python
from pretok import create_pretok

# Auto-detect optimal language from model capabilities
pretok = create_pretok(model_id="gpt-4")     # Uses English
pretok = create_pretok(model_id="qwen-7b")   # Uses Chinese
```

### With Custom Translation Backend

```python
from pretok import Pretok
from pretok.config import LLMTranslatorConfig
from pretok.translation.llm import LLMTranslator

# Use any OpenAI-compatible API
config = LLMTranslatorConfig(
    base_url="https://api.openai.com/v1",  # Or OpenRouter, Ollama, vLLM
    model="gpt-4o-mini",
)
translator = LLMTranslator(config)
pretok = Pretok(target_language="en", translator=translator)
```

### Preserving Prompt Structure

```python
from pretok import Pretok

pretok = Pretok(target_language="en")

prompt = """<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
What is the capital of Japan?
<|im_end|>"""

result = pretok.process(prompt)
# Role markers preserved, only content translated
```
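
To make the idea concrete, here is a minimal standalone sketch of how control tokens and role names could be kept verbatim while only the message bodies are handed to a translator. It is not pretok's actual implementation: the `CONTROL_TOKEN` pattern, the `translate_preserving_markers` helper, and the assumption that the first line after `<|im_start|>` is the role name are all hypothetical and chosen for illustration.

```python
import re
from typing import Callable

# Hypothetical pattern for ChatML-style control tokens such as <|im_start|>.
CONTROL_TOKEN = re.compile(r"(<\|[^|>]+\|>)")


def _translate_body(text: str, translate: Callable[[str], str]) -> str:
    """Translate the non-whitespace core of `text`, keeping its outer whitespace."""
    core = text.strip()
    if not core:
        return text
    start = len(text) - len(text.lstrip())
    return text[:start] + translate(core) + text[start + len(core):]


def translate_preserving_markers(prompt: str, translate: Callable[[str], str]) -> str:
    """Copy control tokens and role names verbatim; translate everything else."""
    pieces = []
    previous = ""
    # Splitting on a capturing group keeps the control tokens in the result list.
    for part in CONTROL_TOKEN.split(prompt):
        if CONTROL_TOKEN.fullmatch(part):
            pieces.append(part)
        elif previous == "<|im_start|>":
            # Assumption: the first line after <|im_start|> is the role name.
            role, newline, body = part.partition("\n")
            pieces.append(role + newline + _translate_body(body, translate))
        else:
            pieces.append(_translate_body(part, translate))
        previous = part
    return "".join(pieces)


if __name__ == "__main__":
    # str.upper stands in for a real translator so the effect is easy to see.
    demo = "<|im_start|>user\nbonjour\n<|im_end|>"
    print(translate_preserving_markers(demo, str.upper))
```

Passing the translator in as a plain callable keeps the sketch independent of any particular backend; a real pipeline would also need to handle code blocks and other delimiters.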

### Configuration

Create a `pretok.yaml`:

```yaml
version: "1.0"

pipeline:
  default_detector: langdetect
  cache_enabled: true

translation:
  llm:
    base_url: "https://api.openai.com/v1"
    model: "gpt-4o-mini"

cache:
  memory:
    max_size: 1000
    ttl: 3600
```

```python
from pretok import Pretok
from pretok.config import load_config

config = load_config("pretok.yaml")
pretok = Pretok(config=config)
```

## 🏗️ Architecture

```
Input Text (any language)
        ↓
Segment Parsing (roles, code, text)
        ↓
Language Detection
        ↓
Translation Decision
        ↓
Translation (if needed)
        ↓
Prompt Reconstruction
        ↓
Tokenizer (unchanged)
        ↓
LLM Inference
```
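
The stages in the diagram can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the library's code: `Segment`, `parse_segments`, and `run_pipeline` are hypothetical names, segment parsing only recognizes balanced Markdown code fences, and the detector and translator are supplied by the caller as plain functions.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical rule: anything between a pair of triple backticks is code.
FENCE = re.compile(r"(```.*?```)", re.DOTALL)


@dataclass(frozen=True)
class Segment:
    text: str
    translatable: bool


def parse_segments(prompt: str) -> list[Segment]:
    """Segment parsing: fenced code is kept as-is, everything else is text."""
    # With a capturing group, odd-indexed chunks are the fenced blocks.
    chunks = FENCE.split(prompt)
    return [Segment(c, translatable=(i % 2 == 0)) for i, c in enumerate(chunks) if c]


def run_pipeline(
    prompt: str,
    *,
    detect: Callable[[str], str],
    translate: Callable[[str, str, str], str],
    supported: frozenset[str],
    target: str,
) -> str:
    """Detect, decide, translate if needed, then reconstruct the prompt."""
    rebuilt = []
    for segment in parse_segments(prompt):
        if segment.translatable and segment.text.strip():
            language = detect(segment.text)
            # Translation decision: skip languages the model already supports.
            if language not in supported:
                rebuilt.append(translate(segment.text, language, target))
                continue
        rebuilt.append(segment.text)
    return "".join(rebuilt)


if __name__ == "__main__":
    # Toy detector and translator standing in for real backends.
    toy_detect = lambda text: "fr" if "Bonjour" in text else "en"
    toy_translate = lambda text, src, dst: text.replace("Bonjour", "Hello")
    print(
        run_pipeline(
            "Bonjour\n```print('Bonjour')```\nthanks",
            detect=toy_detect,
            translate=toy_translate,
            supported=frozenset({"en"}),
            target="en",
        )
    )
```

The output of `run_pipeline` is still raw text, so it can be passed to an unchanged tokenizer, which is the point of keeping every step on the pre-token side.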

## 📚 Documentation

- [Installation Guide](docs/getting-started/installation.md)
- [Quickstart Tutorial](docs/getting-started/quickstart.md)
- [Configuration Reference](docs/getting-started/configuration.md)
- [API Documentation](docs/api/pipeline.md)

## 🛠️ Development

```bash
# Clone the repository
git clone https://github.com/yen0304/pretok.git
cd pretok

# Install dependencies
uv sync --dev

# Run tests
uv run pytest

# Run linting
uv run ruff check src/ tests/

# Run type checking
uv run mypy src/
```

## 📄 License

MIT License - see [LICENSE](LICENSE) for details.

## 🤝 Contributing

Contributions are welcome! Please see [CONTRIBUTING.md](docs/development/contributing.md) for guidelines.
