Metadata-Version: 2.1
Name: khmercut
Version: 0.0.2
Summary: A (fast) Khmer word segmentation toolkit.
Home-page: https://github.com/seanghay/khmercut
Author: Seanghay Yath
Author-email: seanghay.dev@gmail.com
License: Apache License 2.0
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Requires-Python: >3.5
Description-Content-Type: text/markdown
Requires-Dist: python-crfsuite (==0.9.9)
Requires-Dist: khmernormalizer (==0.0.4)
Requires-Dist: tqdm (==4.65.0)
Provides-Extra: test
Requires-Dist: pytest ; extra == 'test'
Requires-Dist: coverage ; extra == 'test'

### khmercut

A (fast) Khmer word segmentation toolkit. 

- A single Python file
- Uses `pycrfsuite` only
- Includes Khmer text normalization
- CLI support
- Multiprocessing support

```shell
pip install khmercut
```

### Python

```python
from khmercut import tokenize

tokenize("ឃាត់ខ្លួនជនសង្ស័យ០៤នាក់ ករណីលួចខ្សែភ្លើង នៅស្រុកព្រៃនប់")
# => ['ឃាត់ខ្លួន', 'ជនសង្ស័យ', '០៤', 'នាក់', ' ', 'ករណី', 'លួច', 'ខ្សែភ្លើង', ' ', 'នៅ', 'ស្រុក', 'ព្រៃនប់']
```
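As the output above shows, `tokenize` returns a plain list of strings and keeps spaces as separate tokens. The sketch below shows one possible way to turn that list into a delimited string. The names `join_tokens`, `keep_spaces`, and `demo` are hypothetical helpers introduced here for illustration and are not part of khmercut; the tokenizer is passed in as an argument so the helper does not depend on the package.

```python
from typing import Callable, List


def join_tokens(
    text: str,
    tokenizer: Callable[[str], List[str]],
    separator: str = "|",
    keep_spaces: bool = False,
) -> str:
    """Segment `text` with `tokenizer` and join the tokens with `separator`.

    Hypothetical helper, not part of khmercut. Whitespace-only tokens are
    dropped unless `keep_spaces` is True.
    """
    tokens = tokenizer(text)
    if not keep_spaces:
        # Drop tokens such as ' ' that contain nothing but whitespace.
        tokens = [token for token in tokens if token.strip()]
    return separator.join(tokens)


def demo() -> str:
    """Example wiring with khmercut (assumes `pip install khmercut`)."""
    # Imported lazily so join_tokens can be used without the package.
    from khmercut import tokenize

    return join_tokens("ឃាត់ខ្លួនជនសង្ស័យ០៤នាក់", tokenize)
```

Calling `demo()` requires khmercut to be installed; `join_tokens` itself works with any callable that maps a string to a list of strings.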

### CLI

For example:

```shell
khmercut large_km.txt --jobs 20 --normalize -d out/ -s "|"
```

Available options:

```
usage: khmercut [-h] [-d DIRECTORY] [-s SEPARATOR] [-j JOBS] [-q] [-n] files [files ...]

A fast Khmer word segmentation toolkit.

positional arguments:
  files                 Path to text files

optional arguments:
  -h, --help            show this help message and exit
  -d DIRECTORY, --directory DIRECTORY
                        Output folder
  -s SEPARATOR, --separator SEPARATOR
                        Specify token separator
  -j JOBS, --jobs JOBS  Number of processors
  -q, --quiet           Disable progress output
  -n, --normalize       Normalize input text before processing
```
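For cases where the segmentation needs to run from Python rather than the shell, the following is a rough sketch of a comparable per-file pipeline: read a text file line by line, segment each line, and write the joined tokens into an output folder. It is not the CLI's actual implementation. The function name `segment_file`, the choice to reuse the input file name for the output, and the line-by-line processing are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, List, Union


def segment_file(
    src: Union[str, Path],
    out_dir: Union[str, Path],
    tokenizer: Callable[[str], List[str]],
    separator: str = "|",
) -> Path:
    """Segment each line of `src` and write the result into `out_dir`.

    Hypothetical helper, not part of khmercut. The output file reuses the
    input file name (an assumption, not documented CLI behavior).
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    dst = out_dir / src.name

    with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            # Strip the trailing newline so it does not become a token.
            tokens = tokenizer(line.rstrip("\n"))
            fout.write(separator.join(tokens) + "\n")
    return dst
```

With khmercut installed, `segment_file("large_km.txt", "out/", tokenize)` would pass the library's `tokenize` function as the tokenizer. This sketch runs sequentially; it does not attempt to reproduce the `--jobs` or `--normalize` options.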

### Reference

- [Khmer language processing toolkit](https://github.com/VietHoang1512/khmer-nltk)
