Metadata-Version: 2.1
Name: ZiTokenizer
Version: 0.0.0
Summary: ZiTokenizer: tokenize word as Zi
Home-page: https://github.com/laohur/ZiCutter
Author: laohur
Author-email: laohur@gmail.com
License: [Anti-996 License](https://github.com/996icu/996.ICU/blob/master/LICENSE)
Keywords: UnicodeTokenizer,ZiCutter,ZiTokenizer,Tokenizer,Unicode,laohur
Platform: UNKNOWN
Requires-Python: >=3.0
Description-Content-Type: text/markdown
Requires-Dist: logzero
Requires-Dist: UnicodeTokenizer
Requires-Dist: ZiCutter

# ZiTokenizer

ZiTokenizer tokenizes words as Zi: each word is read as a Zi and split into three parts.

word = prefix + root + suffix
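
The decomposition above can be illustrated with a toy splitter. This is a minimal sketch and not ZiTokenizer's actual algorithm: the function `split_word` and the `roots` vocabulary are hypothetical, and the rule of picking the longest known root is an assumption made only for illustration.

```python
def split_word(word, roots):
    """Split word into (prefix, root, suffix) around the longest known root.

    Hypothetical helper for illustration; not part of the ZiTokenizer API.
    """
    best = None  # (start, end) of the longest root found so far
    for start in range(len(word)):
        # Try the longest candidate first for this start position.
        for end in range(len(word), start, -1):
            if word[start:end] in roots:
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                break
    if best is None:
        # No known root: treat the whole word as the root.
        return ("", word, "")
    start, end = best
    return (word[:start], word[start:end], word[end:])


roots = {"token", "read", "able"}  # toy vocabulary, assumed for the example
parts = split_word("unreadable", roots)
print(parts)
# The three parts always concatenate back to the original word.
assert "".join(parts) == "unreadable"
```

Whatever rule selects the root, the invariant is that prefix, root and suffix join back into the original word, so the split loses no information.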


## Usage
* Install: `pip install ZiTokenizer`
* Tokenize your corpus and count word frequencies (example script: https://github.com/laohur/UnicodeTokenizer/blob/master/test/count_lang/count_word.py)
```python
from ZiTokenizer.ZiTokenizer import ZiTokenizer

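# Build the vocabulary (run once).
# data_dir is a placeholder: point it at a directory that includes "word_frequency.tsv".
data_dir = "path/to/data"
tokenizer = ZiTokenizer(data_dir)
tokenizer.build(min_ratio=2e-6, min_freq=1)

# Tokenize with the built vocabulary.
tokenizer = ZiTokenizer(data_dir)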
line = "'〇㎡[คุณจะจัดพิธีแต่งงานเมื่อไรคะัีิ์ื็ํึ]Ⅷpays-g[ran]d-blanc-élevé » (白高大夏國)😀熇'\x0000𧭏"
tokens = tokenizer.tokenize(line)
print(tokens)
```


