Metadata-Version: 2.1
Name: kodoc-tokenizer
Version: 0.2.0rc1
Summary: Tokenizer for kodoc
Home-page: https://github.com/kodoc/kodoc-tokenizer
Author: Jangwon Park
Author-email: adieujw@gmail.com
License: Apache License 2.0
Platform: UNKNOWN
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: numpy
Requires-Dist: packaging
Requires-Dist: dataclasses ; python_version < "3.7"
Requires-Dist: importlib-metadata ; python_version < "3.8"

# kodoc-tokenizer

- Tokenizer for kodoc
- Based on `transformers==4.7.0`

## Installation

```bash
pip3 install kodoc-tokenizer
```

## How to Use

### Version

```python
import kodoc_tokenizer

print(kodoc_tokenizer.__version__)  # 0.2.0rc1
```

### clean_text

```python
from kodoc_tokenizer import clean_text

text = "Today a::: : \t\t \x00I \x00a  朝 三暮四 [MASK] m \na fool \n\nbecause I am a fool. \n [SEP][CLS]  "
assert clean_text(text) == "Today a::: : I a 朝 三暮四 [MASK] m a fool because I am a fool. [SEP][CLS]"
```
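
Judging from the assertion above, `clean_text` appears to drop control characters such as `\x00` and to collapse runs of whitespace into single spaces. The sketch below is a rough standard-library approximation of that behavior, not the library's actual implementation. `approx_clean_text` is a hypothetical name used only for illustration.

```python
import unicodedata


def approx_clean_text(text: str) -> str:
    """Hypothetical approximation of clean_text (an assumption, not the real code)."""
    # Drop control characters (Unicode category "Cc"), but keep whitespace
    # controls so they can still act as word separators below.
    kept = [
        ch for ch in text
        if ch in "\t\n\r" or unicodedata.category(ch) != "Cc"
    ]
    # Collapse every whitespace run into one space and trim both ends.
    return " ".join("".join(kept).split())


sample = "a::: : \t\t \x00I \x00a  [MASK] m \na fool. \n [SEP][CLS]  "
assert approx_clean_text(sample) == "a::: : I a [MASK] m a fool. [SEP][CLS]"
```

The real function may handle cases this sketch does not, so use `clean_text` itself in practice.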

### Basic Usage

```python
from kodoc_tokenizer import get_kodoc_tokenizer

tokenizer = get_kodoc_tokenizer()
# Sample Korean input: noisy text extracted from a document about dieting.
text = "다이어트마침표_1부 2013.7.25 02:24 PM 페이지1 제1부 다이어트 핵심 바이블 A`2`Z 다이어트에 실패하는 원인 중 하나는 잘못된 상식도 크게 한몫을 한다."
tokens = tokenizer.tokenize(text)
```


