Metadata-Version: 2.1
Name: connlp
Version: 0.0.18
Summary: A bunch of python codes to analyze text data in the construction industry. Mainly reconstitute the pre-exist python libraries for Natural Language Processing (NLP)
Home-page: https://github.com/blank54/connlp.git
Author: Seonghyeon Boris Moon
Author-email: boris.moon514@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Description-Content-Type: text/markdown

# connlp
A bunch of python codes to analyze text data in the construction industry.  
Mainly reconstitute the pre-exist python libraries for Natural Language Processing (NLP).

## _Project Information_
- Supported by C!LAB (@Seoul Nat'l Univ.)

## _Contributors_
- Seonghyeon Boris Moon (blank54@snu.ac.kr, https://github.com/blank54/)
- Sehwan Chung (hwani751@snu.ac.kr)
- Jungyeon Kim (janykjy@snu.ac.kr)

# Initialize

## _Setup_

Install _**connlp**_ with _pip_.

```shell
pip install connlp
```

Install _requirements.txt_.

```shell
cd WORKSPACE
wget -O requirements_connlp.txt https://raw.githubusercontent.com/blank54/connlp/master/requirements.txt
pip install -r requirements_connlp.txt
```

## _Test_

If the code below runs with no error, _**connlp**_ is installed successfully.

```python
from connlp.test import hello
hello()

# 'Helloworld'
```

# Preprocess

Preprocessing module supports English and Korean.  
NOTE: No plan for other languages (by 2021.04.02.).

## _Normalizer_

_**Normalizer**_ normalizes the input text by eliminating trash characters and remaining numbers, alphabets, and punctuation marks.

```python
from connlp.preprocess import Normalizer
normalizer = Normalizer()

normalizer.normalize(text='I am a boy!')

# 'i am a boy'
```

## _EnglishTokenizer_

_**EnglishTokenizer**_ tokenizes the input text in English based on word spacing.  
The ngram-based tokenization is in preparation.

```python
from connlp.preprocess import EnglishTokenizer
tokenizer = EnglishTokenizer()

tokenizer.tokenize(text='I am a boy!')

# ['I', 'am', 'a', 'boy!']
```

## _KoreanTokenizer_

_**KoreanTokenizer**_ tokenizes the input text in Korean, and is based on either pre-trained or unsupervised approaches.

You are recommended to use pre-trained method unless you have a large size of corpus. This is the default setting.

If you want to use a pre-trained tokenizer, you have to select which analyzer you want to use. Available analyzers are based on KoNLPy (https://konlpy.org/ko/latest/api/konlpy.tag/), a python package for Korean language processing. The default analyzer is _**Hannanum**_

```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=True, analyzer='Hannanum')
```

If your corpus is big, you may use an unsupervised method, which is based on _**soynlp**_ (https://github.com/lovit/soynlp), an unsupervised text analyzer in Korean.

```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=False)
```

### _train_

If your _**KoreanTokenizer**_ are pre-trained, you can neglect this step.

Otherwhise (i.e. you are using an unsupervised approach), the _**KoreanTokenizer**_ object first needs to be trained on (unlabeled) corpus. 'Word score' is calculated for every subword in the corpus.

```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=False)

docs = ['코퍼스의 첫 번째 문서입니다.', '두 번째 문서입니다.', '마지막 문서']

tokenizer.train(text=docs)
print(tokenizer.word_score)

# {'서': 0.0, '코': 0.0, '째': 0.0, '.': 0.0, '의': 0.0, '마': 0.0, '막': 0.0, '번': 0.0, '문': 0.0, '코퍼': 1.0, '번째': 1.0, '마지': 1.0, '문서': 1.0, '코퍼스': 1.0, '문서입': 0.816496580927726, '마지막': 1.0, '코퍼스의': 1.0, '문서입니': 0.8735804647362989, '문서입니다': 0.9036020036098448, '문서입니다.': 0.9221079114817278}
```

### _tokenize_

If you are using a pre-trained _**KoreanTokenizer**_, the selected KoNLPy analyzer will tokenize the input sentence based on morphological analysis.

```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=True, analyzer='Hannanum')
doc = docs[0] # '코퍼스의 첫 번째 문서입니다.'
tokenizer.tokenize(doc)

# ['코퍼스', '의', '첫', '번째', '문서', '입니다', '.']
```

If you are using an unsupervised _**KoreanTokenizer**_, tokenization is based on the 'word score' calculated from _**KoreanTokenizer.train**_ method.

For each blank-separated token, a subword that has the maximum 'word score' is selectd as an individual 'word' and separated with the remaining part.

```python
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(pre_trained=False)
doc = docs[0] # '코퍼스의 첫 번째 문서입니다.'
tokenizer.tokenize(doc)

# ['코퍼스의', '첫', '번째', '문서', '입니다.']
```

## _StopwordRemover_

_**StopwordRemover**_ removes stopwords from a given sentence based on the user-customized stopword list.  
Before utilizing _**StopwordRemover**_ the user should normalize and tokenize the docs.

```python
from connlp.preprocess import Normalizer, EnglishTokenizer, StopwordRemover
normalizer = Normalizer()
eng_tokenizer = EnglishTokenizer()
stopword_remover = StopwordRemover()

docs = ['I am a boy!', 'He is a boy..', 'She is a girl?']
tokenized_docs = []

for doc in eng_docs:
    normalized_doc = normalizer.normalize(text=doc)
    tokenized_doc = eng_tokenizer.tokenize(text=normalized_doc)
    tokenized_docs.append(tokenized_doc)

print(docs)
print(tokenized_docs)

# ['I am a boy!', 'He is a boy..', 'She is a girl?']
# [['i', 'am', 'a', 'boy'], ['he', 'is', 'a', 'boy'], ['she', 'is', 'a', 'girl']]
```

The user should prepare a customized stopword list (i.e., _stoplist_).  
The _stoplist_ should include user-customized stopwords divided by '\n' and the file should be in ".txt" format.

```text
a
is
am
```

Initiate the _**StopwordRemover**_ with appropriate filepath of user-customized stopword list.  
If the stoplist is absent at the filepath, the stoplist would be ramain as a blank list.

```python
fpath_stoplist = 'test/thesaurus/stoplist.txt'
stopword_remover.initiate(fpath_stoplist=fpath_stoplist)

print(stopword_remover)

# <connlp.preprocess.StopwordRemover object at 0x7f163e70c050>
```

The user can count the word frequencies and figure out additional stopwords based on the results.

```python
stopword_remover.count_freq_words(docs=tokenized_docs)

# ========================================
# Word counts
#   | [1] a: 3
#   | [2] boy: 2
#   | [3] is: 2
#   | [4] i: 1
#   | [5] am: 1
#   | [6] he: 1
#   | [7] she: 1
#   | [8] girl: 1
```

After finally updating the _stoplist_, use _**remove**_ method to remove the stopwords from text.

```python
stopword_removed_docs = []
    for doc in tokenized_docs:
        stopword_removed_docs.append(stopword_remover.remove(sent=doc))

print(stopword_removed_docs)

# [['i', 'boy'], ['he', 'boy'], ['she', 'girl']]
```

The user can check which stopword was removed with _**check_removed_words**_ methods.

```python
stopword_remover.check_removed_words(docs=tokenized_docs, stopword_removed_docs=stopword_removed_docs)

# ========================================
# Check stopwords removed
#   | [1] BEFORE: a(3) ->
#   | [2] BEFORE: boy -> AFTER: boy(2)
#   | [3] BEFORE: is(2) ->
#   | [4] BEFORE: i -> AFTER: i(1)
#   | [5] BEFORE: am(1) ->
#   | [6] BEFORE: he -> AFTER: he(1)
#   | [7] BEFORE: she -> AFTER: she(1)
#   | [8] BEFORE: girl -> AFTER: girl(1)
```

# Embedding

## _Vectorizer_

_**Vectorizer**_ includes several text embedding methods that have been commonly used for decades.  

### _tfidf_

TF-IDF is the most commonly used technique for word embedding.  
The TF-IDF model counts the term frequency(TF) and inverse document frequency(IDF) from the given documents.  
The results included the followings.  
- TF-IDF Vectorizer (a class of sklearn.feature_extraction.text.TfidfVectorizer')
- TF-IDF Matrix
- TF-IDF Vocabulary

```python
from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()

docs = ['I am a boy', 'He is a boy', 'She is a girl']
tfidf_vectorizer, tfidf_matrix, tfidf_vocab = vectorizer.tfidf(docs=docs)
type(tfidf_vectorizer)

# <class 'sklearn.feature_extraction.text.TfidfVectorizer'>
```

The user can get a document vector by indexing the _**tfidf_matrix**_.

```python
tfidf_matrix[0]

# (0, 2)    0.444514311537431
# (0, 0)    0.34520501686496574
# (0, 1)    0.5844829010200651
# (0, 5)    0.5844829010200651
```

The _**tfidf_vocab**_ returns an index for every token.

```python
print(tfidf_vocab)

# {'i': 5, 'am': 1, 'a': 0, 'boy': 2, 'he': 4, 'is': 6, 'she': 7, 'girl': 3}
```

### _word2vec_

Word2Vec is a distributed representation language model for word embedding.  
The Word2vec model trains tokenized docs and returns word vectors.  
The result is a class of 'gensim.models.word2vec.Word2Vec'.

```python
from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()

docs = ['I am a boy', 'He is a boy', 'She is a girl']
tokenized_docs = [tokenizer.tokenize(text=doc) for doc in docs]
w2v_model = vectorizer.word2vec(docs=tokenized_docs)
type(w2v_model)

# <class 'gensim.models.word2vec.Word2Vec'>
```

The user can get a word vector by _**.wv**_ method.

```python
w2v_model.wv['boy']

# [-2.0130998e-03 -3.5652996e-03  2.7793974e-03 ...]
```

The Word2Vec model provides the _topn_-most similar word vectors.

```python
w2v_model.wv.most_similar('boy', topn=3)

# [('He', 0.05311150848865509), ('a', 0.04154288396239281), ('She', -0.029122961685061455)]
```

### _word2vec (update)_

The user can update the Word2Vec model with new data.

```python
new_docs = ['Tom is a man', 'Sally is not a boy']
tokenized_new_docs = [tokenizer.tokenize(text=doc) for doc in new_docs]
w2v_model_updated = vectorizer.word2vec_update(w2v_model=w2v_model, new_docs=tokenized_new_docs)

w2v_model_updated.wv['man']

# [4.9649975e-03  3.8002312e-04 -1.5773597e-03 ...]
```

### _doc2vec_

Doc2Vec is a distributed representation language model for longer text (e.g., sentence, paragraph, document) embedding.  
The Doc2vec model trains tokenized docs with tags and returns document vectors.  
The result is a class of 'gensim.models.doc2vec.Doc2Vec'.

```python
from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()

docs = ['I am a boy', 'He is a boy', 'She is a girl']
tagged_docs = [(idx, tokenizer.tokenize(text=doc)) for idx, doc in enumerate(docs)]
d2v_model = vectorizer.doc2vec(tagged_docs=tagged_docs)
type(d2v_model)

# <class 'gensim.models.doc2vec.Doc2Vec'>
```

The Doc2Vec model can infer a new document.

```python
test_doc = ['My', 'name', 'is', 'Peter']
d2v_model.infer_vector(doc_words=test_doc)

# [4.8494316e-03 -4.3647490e-03  1.1437446e-03 ...]
```

# Analysis

## _TopicModel_

_**TopicModel**_ is a class for topic modeling based on gensim LDA model.  
It provides a simple way to train lda model and assign topics to docs.  

Before using LDA topic modeling, the user should install the following packages.

```shell
pip install pyldavis==2.1.2
```

_**TopicModel**_ requires two instances.  
- a dict of docs whose keys are the tag
- the number of topics for modeling

```python
from connlp.analysis_lda import TopicModel

num_topics = 2
docs = {'doc1': ['I', 'am', 'a', 'boy'],
        'doc2': ['He', 'is', 'a', 'boy'],
        'doc3': ['Cat', 'on', 'the', 'table'],
        'doc4': ['Mike', 'is', 'a', 'boy'],
        'doc5': ['Dog', 'on', 'the', 'table'],
        }

lda_model = TopicModel(docs=docs, num_topics=num_topics)
```

### _learn_

The user can train the model with _learn_ method.
Unless parameters being provided by the user, the model trains based on default parameters.  

After _learn_, _**TopicModel**_ provides _model_ instance that is a class of <'gensim.models.ldamodel.LdaModel'>


```python
parameters = {
    'iterations': 100,
    'alpha': 0.7,
    'eta': 0.05,
}
lda_model.learn(parameters=parameters)
type(lda_model.model)

# <class 'gensim.models.ldamodel.LdaModel'>
```

### _coherence_

_**TopicModel**_ provides coherence value for model evaluation.  
The coherence value is automatically calculated right after model training.

```python
print(lda_model.coherence)

# 0.3607990279229385
```

### _assign_

The user can easily assign the most proper topic to each doc using _assign_ method.  
After _assign_, the _**TopicModel**_ provides _tag2topic_ and _topic2tag_ instances for convenience.

```python
lda_model.assign()

print(lda_model.tag2topic)
print(lda_model.topic2tag)

# defaultdict(<class 'int'>, {'doc1': 1, 'doc2': 1, 'doc3': 0, 'doc4': 1, 'doc5': 0})
# defaultdict(<class 'list'>, {1: ['doc1', 'doc2', 'doc4'], 0: ['doc3', 'doc5']})
```

## _NamedEntityRecognition_

Before using NER modules, the user should install proper versions of TensorFlow and Keras.  

```shell
pip install config==0.4.2 gensim==3.8.1 gpustat==0.6.0 GPUtil==1.4.0 h5py==2.10.0 JPype1==0.7.1 Keras==2.2.4 konlpy==0.5.2 nltk==3.4.5 numpy==1.18.1 pandas==1.0.1 scikit-learn==0.22.1 scipy==1.4.1 silence-tensorflow==1.1.1 soynlp==0.0.493 tensorflow==1.14.0 tensorflow-gpu==1.14.0
```

The modules might require the module of _keras-contrib_.  
The user can install the module by following the below.  

```shell
git clone https://www.github.com/keras-team/keras-contrib.git 
cd keras-contrib 
python setup.py install
```

### _Labels_

_**NER_Model**_ is a class to conduct named entity recognition using Bi-directional Long-Short Term Memory (Bi-LSTM) and Conditional Random Field (CRF).  

At the beginning, appropriate labels are required.  
The labels should be numbered with start of 0.

```python
from connlp.analysis_ner import NER_Labels

label_dict = {'NON': 0,     #None
              'PER': 1,     #PERSON
              'FOD': 2,}    #FOOD

ner_labels = NER_Labels(label_dict=label_dict)
```

### _Corpus_

Next, the user should prepare data including sentences and labels, of which each data being matched by the same tag.  
The tokenized sentences and labels are then combined via _**NER_LabeledSentence**_.  
With the data, labels, and a proper size of _max_sent_len_ (i.e., the maximum length of sentence for analysis), _**NER_Corpus**_ would be developed.  
Once the corpus was developed, every data of sentences and labels would be padded with the length of _max_sent_len_.  

```python
from connlp.preprocess import EnglishTokenizer
from connlp.analysis_ner import NER_LabeledSentence, NER_Corpus
tokenizer = EnglishTokenizer()

data_sents = {'sent1': 'Sam likes pizza',
              'sent2': 'Erik eats pizza',
              'sent3': 'Erik and Sam are drinking soda',
              'sent4': 'Flora cooks chicken',
              'sent5': 'Sam ordered a chicken',
              'sent6': 'Flora likes chicken sandwitch',
              'sent7': 'Erik likes to drink soda'}
data_labels = {'sent1': [1, 0, 2],
               'sent2': [1, 0, 2],
               'sent3': [1, 0, 1, 0, 0, 2],
               'sent4': [1, 0, 2],
               'sent5': [1, 0, 0, 2],
               'sent6': [1, 0, 2, 2],
               'sent7': [1, 0, 0, 0, 2]}

docs = []
for tag, sent in data_sents.items():
    words = [str(w) for w in tokenizer.tokenize(text=sent)]
    labels = data_labels[tag]
    docs.append(NER_LabeledSentence(tag=tag, words=words, labels=labels))

max_sent_len = 10
ner_corpus = NER_Corpus(docs=docs, ner_labels=ner_labels, max_sent_len=max_sent_len)
type(ner_corpus)

# <class 'connlp.analysis_ner.NER_Corpus'>
```

### _Word Embedding_

Every word in the _**NER_Corpus**_ should be embedded into numeric vector space.  
The user can conduct embedding with Word2Vec which is provided in _**Vectorizer**_ of _**connlp**_.  
Note that the embedding process of _**NER_Corpus**_ only requires the dictionary of word vectors and the feature size.  

```python
from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()

tokenized_sents = [tokenizer.tokenize(sent) for sent in data_sents.values()]
w2v_model = vectorizer.word2vec(docs=tokenized_sents)

word2vector = vectorizer.get_word_vectors(w2v_model)
feature_size = w2v_model.vector_size
ner_corpus.word_embedding(word2vector=word2vector, feature_size=feature_size)
print(ner_corpus.X_embedded)

# [[[-2.40120804e-03  1.74632657e-03  ...]
#   [-3.57543468e-03  2.86567654e-03  ...]
#   ...
#   [ 0.00000000e+00  0.00000000e+00  ...]] ...]
```

### _Model Initialization_

The parameters for Bi-LSTM and model training should be provided, however, they can be composed of a single dictionary.  
The user should initialize the _**NER_Model**_ with _**NER_Corpus**_ and the parameters.

```python
from connlp.analysis_ner import NER_Model

parameters = {
    # Parameters for Bi-LSTM.
    'lstm_units': 512,
    'lstm_return_sequences': True,
    'lstm_recurrent_dropout': 0.2,
    'dense_units': 100,
    'dense_activation': 'relu',

    # Parameters for model training.
    'test_size': 0.3,
    'batch_size': 1,
    'epochs': 100,
    'validation_split': 0.1,
}

ner_model = NER_Model()
ner_model.initialize(ner_corpus=ner_corpus, parameters=parameters)
type(ner_model)

# <class 'connlp.analysis_ner.NER_Model'>
```

### _Model Training_

The user can train the _**NER_Model**_ with customized parameters.  
The model automatically gets the dataset from the _**NER_Corpus**_.  

```python
ner_model.train(parameters=parameters)

# Train on 3 samples, validate on 1 samples
# Epoch 1/100
# 3/3 [==============================] - 3s 1s/step - loss: 1.4545 - crf_viterbi_accuracy: 0.3000 - val_loss: 1.0767 - val_crf_viterbi_accuracy: 0.8000
# Epoch 2/100
# 3/3 [==============================] - 0s 74ms/step - loss: 0.8602 - crf_viterbi_accuracy: 0.7000 - val_loss: 0.5287 - val_crf_viterbi_accuracy: 0.8000
# ...
```

### _Model Evaluation_

The model performance can be shown in the aspects of confusion matrix and F1 score.

```python
ner_model.evaluate()

# |--------------------------------------------------
# |Confusion Matrix:
# [[ 3  0  3  6]
#  [ 1  3  0  4]
#  [ 0  0  2  2]
#  [ 4  3  5 12]]
# |--------------------------------------------------
# |F1 Score: 0.757
# |--------------------------------------------------
# |    [NON]: 0.600
# |    [PER]: 0.857
# |    [FOD]: 0.571
```

### _Save_

The user can save the _**NER_Model**_.  
The model would save the model itself ("\<FileName\>.pk") and the dataset ("\<FileName\>-dataset.pk") that was used in model development.  
Note that the directory should exist before saving the model.  

```python
from connlp.util import makedir

fpath_model = 'test/ner/model.pk'
makedir(fpath=fpath_model)
ner_model.save(fpath_model=fpath_model)
```

### _Load_

If the user wants to load the already trained model, just call the model and load.  

```python
fpath_model = 'test/ner/model.pk'
ner_model = NER_Model()
ner_model.load(fpath_model=fpath_model, ner_corpus=ner_corpus, parameters=parameters)
```

### _Prediction_

_**NER_Model**_ can conduct a new NER task on the given sentence.  
The result is a class of _**NER_Result**_.  

```python
from connlp.preprocess import EnglishTokenizer
vectorizer = Vectorizer()

new_sent = 'Tom eats apple'
tokenized_sent = tokenizer.tokenize(new_sent)
ner_result = ner_model.predict(sent=tokenized_sent)
print(ner_result)

# Tom/PER eats/NON apple/FOD
```

## _Web Crawling_

The _**connlp**_ currently provides web crawling for Naver news articles.

### _Query_

The user should prepare the proper queries first.  
A single text file(.txt) should include every information of the query as below.  
- Date Start
- Date End
- Keywords

The web crawler utilizes the keywords separated with '\n\n' in the same time.  
Meanwhile, the web crawler utilizes the keywords separated with '\n' as a different queries.

For example, if the queries are determined as below, the web crawler would search the articles with six queries: "smart+construction+safety at 20210718", "smart+construction+management at 20210718", "smart+construction+safety at 20210719", ...

```plain
20210718
20210720

smart

construction

safety
management
```

The _**NewsQueryParser**_ parses the queries into appropriate formats.

```python
from connlp.web_crawling import NewsQueryParser
query_parser = NewsQueryParser()

fpath_query = 'FILEPATH_OF_YOUR_QUERY'
query_list, date_list = query_parser.parse(fpath_query=fpath_query)
```

### _URLs_

For the second step, the web crawler parses the web page that shows the list of news articles.  
_**NaverNewsListScraper**_ provides the function of parsing the list page.  
The user is recommended to save the url lists and load them later.

```python
from connlp.web_crawling import NaverNewsListScraper
list_scraper = NaverNewsListScraper()

for date in sorted(date_list, reverse=False):
    for query in query_list:
        url_list = list_scraper.get_url_list(query=query, date=date)
```

### _Articles_

The last step is to parse the article page and get information from the article.  
_**NaverNewsArticleParser**_ returns a class of _**Article**_ for a given article.  
Remember to extend the query list of the article.

```python
from connlp.web_crawling import NaverNewsArticleParser
article_parser = NaverNewsArticleParser()

query_list, _ = query_parser.urlname2query(fname_url_list=fname_url_list)
for url in url_list:
    article = article_parser.parse(url=url)
    article.extend_query(query_list)
```

### _Status_

_**NewsStatus**_ provides the status of the crawled corpus for given directories.

```python
from connlp.web_crawling import NewsStatus
news_status = NewsStatus()

fdir_queries = 'DIRECTORY_FOR_QUERIES'
fdir_url_list = 'DIRECTORY_FOR_URLS'
fdir_article = 'DIRECTORY_FOR_ARTICLES'

news_status.queries(fdir_queries=fdir_queries)
news_status.urls(fdir_urls=fdir_url_list)
news_status.articles(fdir_articles=fdir_article)
```

# Visualization

## _Visualizer_

_**Visualizer**_ includes several simple tools for text visualization.

Install the following packages.

```
pip install networkx wordcloud
```

### _network_

_**network**_ method provides a word network for tokenized docs.

```python
from connlp.preprocess import EnglishTokenizer
from connlp.visualize import Visualizer
tokenizer = EnglishTokenizer()
visualizer = Visualizer()

docs = ['I am a boy', 'She is a girl']
tokenized_docs = [tokenizer.tokenize(text=doc) for doc in docs]
word_network = visualizer.network(docs=tokenized_docs, show=True)
```

The word network is a _matplotlib.pyplot_ object.  
The user can save the figure by _.savefig()_ method.

```python
word_network.savefig(FILEPATH)
```

### _wordcloud_

_**wordcloud**_ method provides a word cloud for tokenized docs.

```python
from connlp.preprocess import EnglishTokenizer
from connlp.visualize import Visualizer
tokenizer = EnglishTokenizer()
visualizer = Visualizer()

docs = ['I am a boy', 'She is a girl']
tokenized_docs = [tokenizer.tokenize(text=doc) for doc in docs]
wordcloud = visualizer.wordcloud(docs=tokenized_docs, show=True)
```

The wordcloud is a _matplotlib.pyplot_ object.  
The user can save the figure by _.savefig()_ method.

```python
wordcloud.savefig(FILEPATH)
```

# Extracting Text

## _TextConverter_

_**TextConverter**_ includes several methods that extract raw text from various types of files (e.g. PDF, HWP) and/or converts the files into plain text files (e.g. TXT).

### _hwp2txt_

_**hwp2txt**_ method converts a HWP file into a plain text file.
Dependencies: pyhwp package

Install pyhwp (you need to install the pre-release version)

```
pip install --pre pyhwp
```

Example

```python
from connlp.text_extract import TextConverter
converter = TextConverter()

hwp_fpath = '/data/raw/hwp_file.hwp'
output_fpath = '/data/processed/extracted_text.txt'

converter.hwp2txt(hwp_fpath, output_fpath) # returns 0 if no error occurs
```

# GPU Utils

## _GPUMonitor_

_**GPUMonitor**_ generates a class to monitor and display the GPU status based on nvidia-smi.  
Refer to "https://github.com/anderskm/gputil" and "https://data-newbie.tistory.com/561" for usages.

Install _GPUtils_ module with _pip_.

```
pip install GPUtil
```

Write your code between the initiation of the _**GPUMonitor**_ and _**monitor.stop()**_.

```python
from connlp.util import GPUMonitor

monitor = GPUMonitor(delay=3)
# >>>Write your code here<<<
monitor.stop()

# | ID | GPU | MEM |
# ------------------
# |  0 |  0% |  0% |
# |  1 |  1% |  0% |
# |  2 |  0% | 94% |
```


