Metadata-Version: 2.3
Name: kitoken
Version: 0.10.0
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Environment :: Console
Classifier: Topic :: File Formats
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: BSD License
Classifier: Typing :: Typed
Summary: Fast and versatile tokenizer for language models, supporting BPE, Unigram and WordPiece tokenization
Author-email: Christian Sdunek <me@systemcluster.me>
License: BSD-2-Clause
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: repository, https://github.com/Systemcluster/kitoken

# kitoken

**Tokenizer for language models.**

```py
from kitoken import Kitoken

const encoder = Kitoken.from_file("models/llama2.kit")

const tokens = encoder.encode("hello world!", True)
const string = encoder.decode(tokens).decode("utf-8")

assert string == "hello world!"
```

## Features

- **Fast encoding and decoding**\
  Faster than most other tokenizers in both common and uncommon scenarios.
- **Support for a wide variety of tokenizer formats and tokenization strategies**\
  Including support for Tokenizers, SentencePiece, Tiktoken and more.
- **Compatible with many systems and platforms**\
  Runs on Windows, Linux, macOS and embedded, and comes with bindings for Web, Node and Python.
- **Compact data format**\
  Definitions are stored in an efficient binary format and without merge list.
- **Support for normalization and pre-tokenization**\
  Including unicode normalization, whitespace normalization, and many others.

## Overview

Kitoken is a fast and versatile tokenizer for language models. Multiple tokenization algorithms are supported:

- **BytePair**: A variation of the BPE algorithm, merging byte or character pairs.
- **Unigram**: The Unigram subword algorithm.
- **WordPiece**: The WordPiece subword algorithm.

Kitoken is compatible with many existing tokenizers,
including [SentencePiece](https://github.com/google/sentencepiece), [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers), [OpenAI Tiktoken](https://github.com/openai/tiktoken) and [Mistral Tekken](https://docs.mistral.ai/guides/tokenization).

See the main [README](//github.com/Systemcluster/kitoken) for more information.

