Metadata-Version: 2.3
Name: burmese-tokenizer
Version: 0.1.2
Summary: A simple tokenizer for Burmese text
Keywords: burmese,tokenizer,nlp,myanmar,text-processing
Author: janakhpon
Author-email: janakhpon <jnovaxer@gmail.com>
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Dist: sentencepiece>=0.1.99
Requires-Dist: click>=8.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: twine>=6.1.0
Requires-Dist: pytest>=7.0.0 ; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0 ; extra == 'dev'
Requires-Dist: black>=23.0.0 ; extra == 'dev'
Requires-Dist: isort>=5.12.0 ; extra == 'dev'
Requires-Dist: mypy>=1.0.0 ; extra == 'dev'
Requires-Dist: ruff>=0.1.0 ; extra == 'dev'
Requires-Dist: sphinx>=7.0.0 ; extra == 'docs'
Requires-Dist: sphinx-rtd-theme>=1.3.0 ; extra == 'docs'
Requires-Python: >=3.11
Project-URL: Changelog, https://github.com/Code-Yay-Mal/burmese_tokenizer/blob/main/CHANGELOG.md
Project-URL: Documentation, https://github.com/Code-Yay-Mal/burmese_tokenizer#readme
Project-URL: Homepage, https://github.com/Code-Yay-Mal/burmese_tokenizer
Project-URL: Issues, https://github.com/Code-Yay-Mal/burmese_tokenizer/issues
Project-URL: Repository, https://github.com/Code-Yay-Mal/burmese_tokenizer
Provides-Extra: dev
Provides-Extra: docs
Description-Content-Type: text/markdown

# Burmese Tokenizer

Tokenize Burmese text like a pro. No fancy stuff, just gets the job done.

## Quick Start

```bash
# Using pip
pip install burmese-tokenizer

# Using uv (faster)
uv add burmese-tokenizer
```

```python
from burmese_tokenizer import BurmeseTokenizer

tokenizer = BurmeseTokenizer()
text = "မင်္ဂလာပါ။ နေကောင်းပါသလား။"

# Tokenize
result = tokenizer.encode(text)
print(result["pieces"])  # ['▁မင်္ဂလာ', '▁ပါ', '။', '▁နေ', '▁ကောင်း', '▁ပါ', '▁သလား', '။']

# Decode
decoded = tokenizer.decode(result["pieces"])
print(decoded)  # မင်္ဂလာပါ။ နေကောင်းပါသလား။
```

## CLI

```bash
# Tokenize
burmese-tokenizer "မင်္ဂလာပါ။"

# Verbose mode (shows all the details)
burmese-tokenizer -v "မင်္ဂလာပါ။"

# Decode tokens back to text
burmese-tokenizer -d -t "▁မင်္ဂလာ,▁ပါ,။"
```

## API

- `encode(text)` - Chop text into tokens (returns a dict; the token strings live under `"pieces"`)
- `decode(pieces)` - Glue tokens back together
- `decode_ids(ids)` - Convert IDs back to text
- `get_vocab_size()` - How many tokens we know
- `get_vocab()` - The whole vocabulary
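
A quick way to sanity-check `encode` and `decode` together is a round trip. The sketch below is an illustration, not part of the package: `round_trip` and `StubTokenizer` are made-up names, and the stub only mimics the `encode(text)["pieces"]` / `decode(pieces)` shape from Quick Start so the snippet runs without the model. Whether the real tokenizer gives back the exact input for every string is not something this README promises, so treat the `matches` flag as a check, not a guarantee.

```python
def round_trip(tokenizer, text):
    """Encode `text`, decode the pieces, and report whether the text survived."""
    pieces = tokenizer.encode(text)["pieces"]
    decoded = tokenizer.decode(pieces)
    return pieces, decoded, decoded == text


class StubTokenizer:
    """Hypothetical stand-in, NOT part of burmese-tokenizer.

    It splits on spaces and only mimics the encode/decode shape
    shown in Quick Start, so this snippet runs without the model.
    """

    def encode(self, text):
        # Prefix every chunk with "▁", the way SentencePiece marks word starts.
        return {"pieces": ["▁" + chunk for chunk in text.split(" ")]}

    def decode(self, pieces):
        # Turn the markers back into spaces, then drop the single leading one.
        return "".join(pieces).replace("▁", " ")[1:]


if __name__ == "__main__":
    text = "မင်္ဂလာပါ။ နေကောင်းပါသလား။"
    pieces, decoded, matches = round_trip(StubTokenizer(), text)
    print(len(pieces), matches)
    # With the real thing, swap in:
    #   from burmese_tokenizer import BurmeseTokenizer
    #   round_trip(BurmeseTokenizer(), text)
```

Passing the tokenizer in as an argument keeps the check reusable: the same function works with the stub in tests and with `BurmeseTokenizer()` once the package is installed.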

## Dev Setup

```bash
git clone git@github.com:Code-Yay-Mal/burmese_tokenizer.git
cd burmese_tokenizer
uv sync --dev
uv run pytest

uv build
uv build --no-sources
# make sure you have a ~/.pypirc set up for twine
uv run twine upload dist/*
# or publish with uv instead
uv publish

# bump version
uv version --bump patch
uv version --short

# or publish with the GitHub Action by pushing a tag
git tag v0.1.2
git push origin v0.1.2

# if something goes wrong, delete the tag locally and on the remote, then start over
git tag -d v0.1.2 && git push origin :refs/tags/v0.1.2
```

## License

MIT - do whatever you want with it.
