Metadata-Version: 2.4
Name: tamil-utils
Version: 0.2.1
Summary: Tiny Tamil text utilities: normalize, tokenize, stopword removal, graphemes
Author: Arulnidhi Karunanidhi
License: MIT
Project-URL: Homepage, https://github.com/arulnidhii/tamil-utils
Project-URL: Issues, https://github.com/arulnidhii/tamil-utils/issues
Keywords: tamil,indic,nlp,unicode
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: regex>=2023.0
Dynamic: license-file

# tamil-utils

Tiny **Tamil‑first** text utilities for Unicode‑safe text handling and preprocessing.

[![PyPI](https://img.shields.io/pypi/v/tamil-utils.svg)](https://pypi.org/project/tamil-utils/)
[![CI](https://github.com/arulnidhii/tamil-utils/actions/workflows/ci.yml/badge.svg)](https://github.com/arulnidhii/tamil-utils/actions/workflows/ci.yml)
[![Docs](https://img.shields.io/badge/docs-mkdocs--material-blue)](https://arulnidhii.github.io/tamil-utils/)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

---

## Features

* **Core:** `normalize`, `tokens`, `remove_stopwords`, `graphemes`
* **Sentences & numerals:** `sents`, `to_arabic_numerals`, `to_tamil_numerals`
* **Script & transliteration:** `script_of`, `token_scripts`, `transliterate_iso15919`
* **v0.2:** `ngrams`, `bigrams`, `trigrams`, `word_counts`, `syllables`, `sort_tamil`

---

## Install

```bash
pip install tamil-utils
```

---

## Quick start (Python)

```python
from tamil_utils import (
    normalize, tokens, remove_stopwords, graphemes,
    sents, to_arabic_numerals, to_tamil_numerals,
    script_of, token_scripts, transliterate_iso15919,
    word_counts, sort_tamil
)

s = "இது ஒரு சோதனை. இது இரண்டாம்? சரி! ௨௦௨௫"
print(tokens(s))                                # ['இது','ஒரு','சோதனை','இது','இரண்டாம்','சரி','௨௦௨௫']
print(remove_stopwords(tokens(s), preset="ta")) # stopwords removed
print(graphemes("👩‍🚀"))                        # emoji‑safe graphemes
print(sents(s))                                  # sentence split
print(to_arabic_numerals("௨௦௨௫"))                 # "2025"
print(transliterate_iso15919("தமிழ்"))             # "tamiḻ"
print(token_scripts(tokens("கோட்123 hello")))     # [('கோட்123','Mixed'), ('hello','Latin')]

# v0.2
print(word_counts("தமிழ் NLP தமிழ் பயன்பாடு தமிழ் NLP", n=2, top=2))  # bigram freq
print(sort_tamil(["இலங்கை","ஆதி","அடி"]))                         # ['அடி','ஆதி','இலங்கை']
```

---

## CLI

```bash
# tokens / stopwords / graphemes / sents
python -m tamil_utils.cli tokens "இது ஒரு சோதனை"
python -m tamil_utils.cli tokens --rmstop "இது ஒரு சோதனை"
python -m tamil_utils.cli graphemes "👩‍🚀"
python -m tamil_utils.cli sents "இது ஒரு வாக்கியம். இது இரண்டாம்? சரி!"

# numerals
python -m tamil_utils.cli to-arabic "௨௦௨௫"   # -> 2025
python -m tamil_utils.cli to-tamil  "123"     # -> ௧௨௩

# transliteration & script tags
python -m tamil_utils.cli to-iso "தமிழ்"
python -m tamil_utils.cli script "கோட்123 hello"

# v0.2: n‑grams, frequency, syllables, sort
python -m tamil_utils.cli ngrams -n 3 "தமிழ் NLP பயன்பாடு"
python -m tamil_utils.cli freq -n 2 --top 5 "தமிழ் NLP தமிழ் பயன்பாடு தமிழ் NLP"
python -m tamil_utils.cli syllables "தமிழ் test 123"
# sort: pass words as args or via stdin
python -m tamil_utils.cli sort இலங்கை ஆதி அடி
type words.txt | python -m tamil_utils.cli sort
```

---

## Links

* **PyPI:** [https://pypi.org/project/tamil-utils/](https://pypi.org/project/tamil-utils/)
* **Docs:** [https://arulnidhii.github.io/tamil-utils/](https://arulnidhii.github.io/tamil-utils/)
* **Issues:** [https://github.com/arulnidhii/tamil-utils/issues](https://github.com/arulnidhii/tamil-utils/issues)

---

## Contributing

PRs welcome! Please run tests locally and ensure linting passes.

---

## License

MIT © Arulnidhi Karunanidhi
