Metadata-Version: 2.1
Name: pywordseg
Version: 0.1.2
Summary: Open source state-of-the-art Chinese word segmentation toolkit
Home-page: https://github.com/voidism/pywordseg
Author: Jexus Chuang
License: MIT
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.6
Description-Content-Type: text/markdown
Requires-Dist: torch
Requires-Dist: h5py
Requires-Dist: numpy
Requires-Dist: overrides

# Pywordseg
基於 BiLSTM 及 ELMo 的 State-of-the-art 開源中文斷詞系統。  
An open source state-of-the-art Chinese word segmentation system with BiLSTM and ELMo.  

- arXiv paper link: https://arxiv.org/abs/1901.05816
- PyPI page: https://pypi.org/project/pywordseg/

## Performance
![](https://i.imgur.com/4WflkYS.png)
- 此專案提供圖中的 "character level ELMo" model 以及 "baseline" model，其中 "character level ELMo" model 是當前準確率最高。這兩個 model 都贏過目前常用的斷詞系統 [Jieba](https://github.com/fxsjy/jieba) (HMM-based) 及 [CKIP](http://ckipsvr.iis.sinica.edu.tw/) (rule-based) 許多。  
- This repo provides the "character level ELMo" model and "baseline" model in the figure. Our "character level ELMo" model outperforms the previous state-of-the-art Chinese word segmentation (Ma et al. 2018), and also largely outerform "[Jieba](https://github.com/fxsjy/jieba)" and "[CKIP](http://ckipsvr.iis.sinica.edu.tw/)", which are most popular toolkits in processing simplified/traditional Chinese text.

![](https://i.imgur.com/0vCz0ui.png)
- 當處理訓練時未見過的詞時，"character level ELMo" model 仍然保有不錯的正確率，相較於"baseline" model。  
- When considering OOV accuracy, our "character level ELMo" model outperforms our "baseline" model about 5%.

## Usage
### Requirements
- python >= 3.6 (do not use 3.5)
- pytorch 0.4
- overrides

### Install with Pip
  - `$ pip install pywordseg`
  - the module will automatically download the models while your first import within 1 minute.
  - if you use MacOS and encounter the [urllib.error.URLError](https://stackoverflow.com/questions/49183801/ssl-certificate-verify-failed-with-urllib) problem when downloading your models,  
  try `$ sudo /Applications/Python\ 3.6/Install\ Certificates.command` to bypass the certificate issue.

### Install manually
  - `$ git clone https://github.com/voidism/pywordseg`
  - download [ELMoForManyLangs.zip](https://www.dropbox.com/s/eiya6ztmjopprsm/ELMoForManyLangs.zip?dl=0) and unzip it to the `pywordseg/pywordseg` (the code of the ELMo model is from [HIT-SCIR](https://github.com/HIT-SCIR/ELMoForManyLangs), training by myself in character-level)
  - `$ pip install .` under the main directory

### Segment!
  ```python
  # import the module
  from pywordseg import *

  # declare the segmentor.
  seg = Wordseg(batch_size=64, device="cuda:0", embedding='elmo', elmo_use_cuda=True, mode="TW")

  # input is a list of raw sentences.
  seg.cut(["今天天氣真好啊!", "潮水退了就知道，誰沒穿褲子。"])

  # will return a list of lists of the segmented sentences.
  # [['今天', '天氣', '真', '好', '啊', '!'], ['潮水', '退', '了', '就', '知道', ',', '誰', '沒', '穿', '褲子', '。']]
  ```
### External Dictionary (Optional)
This feature was inspired by [CKIPTagger](https://github.com/ckiplab/ckiptagger#3-optional-create-dictionary).
  ```python
  # import the module
  from pywordseg import *

  # declare the segmentor.
  seg = Wordseg(batch_size=64, device="cuda:0", embedding='elmo', elmo_use_cuda=True, mode="TW")

  # create dictionary with their relative weights to prioritize.
  word_to_weight = {
    "來辦": 2.0,
    "你本人": 1.0,
    "或者是": 1.0,
    "有興趣": 1.0,
    "有興趣的": "2.0",
  }
  dictionary = construct_dictionary(word_to_weight)
  print(dictionary)
  # [(2, {'來辦': 2.0}), (3, {'你本人': 1.0, '或者是': 1.0, '有興趣': 1.0}), (4, {'有興趣的': 2.0})]

  # 1) segment without dictionary.
  seg.cut(["你本人或者是親屬有興趣的話都可以來辦理"])
  # [['你', '本人', '或者', '是', '親屬', '有', '興趣', '的話', '都', '可以', '來', '辦理']]

  # 2) segment with dictionary to merge words (only merge words that will not break existing words).
  seg.cut(["你本人或者是親屬有興趣的話都可以來辦理"], merge_dict=dictionary)
  # [['你本人', '或者是', '親屬', '有興趣', '的話', '都', '可以', '來', '辦理']]
  # merged: '你', '本人' --> '你本人'
  # merged: '或者', '是' --> '或者是'
  # merged: '有', '興趣' --> '有興趣'
  # not merged: '來', '辦理' -x-> '來辦', '理' because it breaks existing words

  # 3) segment with dictionary that force words to be segmented (ignore existing words).
  seg.cut(["你本人或者是親屬有興趣的話都可以來辦理"], force_dict=dictionary)
  # [['你本人', '或者是', '親屬', '有興趣的', '話', '都', '可以', '來辦', '理']]
  # merged: '你', '本人' --> '你本人'
  # merged: '或者', '是' --> '或者是'
  # change: '有興趣', '的話' --> '有興趣的', '話'
  # change: '來', '辦理' --> '來辦', '理'
  ```
#### Parameters:
  - **batch_size**: batch size for the word segmentation model, default: `64`.
  - **device**: the CPU/GPU device to run you model, default: `'cpu'`.
  - **embedding**: (default: `'w2v'`) 
    - `'elmo'`: the loaded model will be the "character level ELMo" model above, which runs slow.
    - `'w2v'`: the loaded model will be the "baseline model" above, which runs faster than `'elmo'`.
  - **elmo_use_cuda**: if you want your ELMo model be accelerated on GPU, use `True`, otherwise the ELMo model will be run on CPU. This param is no use when `embedding='w2v'`. default: `True`.
  - **mode**: `WordSeg` will load different model according to the mode as listed below: (default: `TW`)
    - `TW`: trained on AS corpus, from CKIP, Academia Sinica, Taiwan.
    - `HK`: trained on CityU corpus, from City University of Hong Kong, Hong Kong SAR.
    - `CN_MSR`: trained on MSR corpus, from Microsoft Research, China.
    - `CN_PKU` or `CN`: trained on PKU corpus, from Peking University, China.

### Include External Dictionary (Optional)
This feature was inspired by [CKIPTagger](https://github.com/ckiplab/ckiptagger#3-optional-create-dictionary).
  ```python
  # import the module
  from pywordseg import *

  # declare the segmentor.
  seg = Wordseg(batch_size=64, device="cuda:0", embedding='elmo', elmo_use_cuda=True, mode="TW")

  # create dictionary with their relative weights to prioritize.
  word_to_weight = {
    "來辦": 2.0,
    "你本人": 1.0,
    "或者是": 1.0,
    "有興趣": 1.0,
    "有興趣的": "2.0",
  }
  dictionary = construct_dictionary(word_to_weight)
  print(dictionary)
  # [(2, {'來辦': 2.0}), (3, {'你本人': 1.0, '或者是': 1.0, '有興趣': 1.0}), (4, {'有興趣的': 2.0})]

  # 1) segment without dictionary.
  seg.cut(["你本人或者是親屬有興趣的話都可以來辦理"])
  # [['你', '本人', '或者', '是', '親屬', '有', '興趣', '的話', '都', '可以', '來', '辦理']]

  # 2) segment with dictionary to merge words (only merge words that will not break existing words).
  seg.cut(["你本人或者是親屬有興趣的話都可以來辦理"], merge_dict=dictionary)
  # [['你本人', '或者是', '親屬', '有興趣', '的話', '都', '可以', '來', '辦理']]
  # merged: '你', '本人' --> '你本人'
  # merged: '或者', '是' --> '或者是'
  # merged: '有', '興趣' --> '有興趣'
  # not merged: '來', '辦理' -x-> '來辦', '理' because it breaks existing words

  # 3) segment with dictionary that force words to be segmented (ignore existing words).
  seg.cut(["你本人或者是親屬有興趣的話都可以來辦理"], force_dict=dictionary)
  # [['你本人', '或者是', '親屬', '有興趣的', '話', '都', '可以', '來辦', '理']]
  # merged: '你', '本人' --> '你本人'
  # merged: '或者', '是' --> '或者是'
  # change: '有興趣', '的話' --> '有興趣的', '話'
  # change: '來', '辦理' --> '來辦', '理'
  ```

## TODO
- 目前只支援繁體中文(即使選擇CN mode，文字也要轉換成繁體才能運作，目前訓練資料都是經過 [OpenCC](https://github.com/BYVoid/OpenCC) 轉換的)，日後會加入簡體中文。

## Citation
If you use the code in your paper, then please cite it as:

    @article{Chuang2019,
      archivePrefix = {arXiv},
      arxivId       = {1901.05816},
      author        = {Chuang, Yung-Sung},
      eprint        = {1901.05816},
      title         = {Robust Chinese Word Segmentation with Contextualized Word Representations},
      url           = {http://arxiv.org/abs/1901.05816},
      year          = {2019}
    }


