Metadata-Version: 2.4
Name: tamil-utils
Version: 0.1.0
Summary: Tiny Tamil text utilities: normalize, tokenize, stopword removal, graphemes
Author: Arulnidhi Karunanidhi
License: MIT
Project-URL: Homepage, https://github.com/arulnidhii/tamil-utils
Project-URL: Issues, https://github.com/arulnidhii/tamil-utils/issues
Keywords: tamil,indic,nlp,unicode
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: regex>=2023.0
Dynamic: license-file

# tamil-utils

Tiny **Tamil** text utilities: `normalize`, `tokens`, `remove_stopwords`, `graphemes`.

```python
from tamil_utils import normalize, tokens, remove_stopwords, graphemes

s = "இது ஒரு சோதனை 👋🏽"
print(tokens(s))                              # ['இது', 'ஒரு', 'சோதனை']
print(remove_stopwords(tokens(s), preset="ta"))  # ['சோதனை']
print(graphemes("👩🏽‍💻"))                       # ['👩🏽‍💻']
                      # ['👩🏽‍💻']

```
## Installation

```python
pip install tamil-utils
```
# Windows CLI (module form)
```python
python -m tamil_utils.cli tokens "இது ஒரு சோதனை"
```
# Sentence split
```python
python -m tamil_utils.cli sents "இது ஒரு வாக்கியம். இது இரண்டாம்? சரி!"
```

# Numerals
```python
python -m tamil_utils.cli to-arabic "௨௦௨௫"   # -> 2025
python -m tamil_utils.cli to-tamil "123"     # -> ௧௨௩
```

# Tokens with stopwords removed (preset)
python -m tamil_utils.cli tokens --rmstop "இது ஒரு சோதனை"

## Status
[![PyPI](https://img.shields.io/pypi/v/tamil-utils)](https://pypi.org/project/tamil-utils/)
[![CI](https://github.com/arulnidhii/tamil-utils/actions/workflows/ci.yml/badge.svg)](https://github.com/arulnidhii/tamil-utils/actions)
