Metadata-Version: 2.1
Name: khmernormalizer
Version: 0.0.3
Summary: A missing toolkit for Khmer Natural Language Processing.
Home-page: https://github.com/seanghay/khmernormalizer
Author: Seanghay Yath
Author-email: seanghay.dev@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: regex
Requires-Dist: emoji (==2.6.0)
Requires-Dist: ftfy (==6.1.1)

## Khmer Normalizer

A missing toolkit for **Khmer Natural Language Processing**.

- Character reordering
- Collapse duplicate whitespace
- Remove zero-width spaces
- Remove emojis
- Fix common misspellings
- Fix Unicode issues
- Fix Khmer trailing vowels
- URL replacement
- Unicode normalization (NFKC)
- Quote symbol normalization
- Collapse repeated punctuation
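
To make a few of these steps concrete, here is a minimal standard-library sketch of NFKC normalization, zero-width space removal, whitespace collapsing, and repeated-punctuation collapsing. The function name `basic_clean` and the punctuation set are assumptions made for illustration; this is not how `khmernormalizer` is implemented, and it covers none of the Khmer-specific rules.

```python
import re
import unicodedata

ZWSP = "\u200b"  # zero-width space

def basic_clean(text: str) -> str:
    """Illustrative cleanup only; not the khmernormalizer implementation."""
    # Fold compatibility characters (e.g. fullwidth forms) into canonical ones.
    text = unicodedata.normalize("NFKC", text)
    # Drop zero-width spaces entirely.
    text = text.replace(ZWSP, "")
    # Collapse runs of spaces and tabs into a single space (newlines are kept).
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse a run of the same punctuation mark into one (assumed set: ! ? . ,).
    text = re.sub(r"([!?.,])\1+", r"\1", text)
    return text.strip()

print(basic_clean("a\u200bb   c!!!!  ??"))
```

The order matters in this sketch: NFKC runs first so that fullwidth punctuation is folded to ASCII before the punctuation pattern is applied.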

### Installation

```shell
pip install khmernormalizer
```

### Usage

```python
from khmernormalizer import normalize

input_str = """
តាម៖៖​សេចក្តី​រាយ​ការណ៍​​ឲ្យ​ដឹង​ថា!!!!!
https://google.com/a?x=1
កាល 😂 ពីវេលាម៉ោង    ៗ      ប្រមាណ១១យប់ថ្ងៃទី៤ 😂😂😂😂😂 ??
កាាាាត់
មិិិិិន 
មួយរយះះះះះះះ
រយះពេល
""".strip()

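# Replace emojis and URLs with empty strings and remove zero-width spaces.
result = normalize(
    input_str,
    emoji_replacement="",
    remove_zwsp=True,
    url_replacement="",
)
print(result)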
```

Result:
```
តាម៖សេចក្តីរាយការណ៍ឱ្យដឹងថា!

កាល ពីវេលាម៉ោងៗ ប្រមាណ១១យប់ថ្ងៃទី៤?
កាត់
មិន 
មួយរយៈះ
រយៈពេល
```
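
In the result above, runs of a repeated vowel sign such as `កាាាាត់` come back with a single sign. One way to approximate that single behavior with the standard library is sketched below. The helper name `collapse_repeated_marks` and the chosen code point range (U+17B6 to U+17D1, Khmer dependent vowels and signs) are assumptions for illustration. The library applies further rules, such as the `រយៈ` spelling fix, that this sketch does not attempt to reproduce.

```python
import re

# Khmer dependent vowels and signs, U+17B6..U+17D1 (range assumed for this sketch;
# the coeng sign U+17D2 is deliberately excluded).
_REPEATED_MARK = re.compile(r"([\u17B6-\u17D1])\1+")

def collapse_repeated_marks(text: str) -> str:
    """Collapse consecutive duplicates of the same Khmer vowel or sign into one."""
    return _REPEATED_MARK.sub(r"\1", text)

print(collapse_repeated_marks("កាាាាត់"))
print(collapse_repeated_marks("មិិិិិន"))
```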
