Metadata-Version: 2.1
Name: trankit
Version: 0.2.7
Summary: Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing
Home-page: https://github.com/nlp-uoregon/trankit
Author: NLP Group at the University of Oregon
Author-email: thien@cs.uoregon.edu
License: Apache License 2.0
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Information Technology
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Libraries
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Description-Content-Type: text/markdown
Requires-Dist: numpy
Requires-Dist: protobuf
Requires-Dist: requests
Requires-Dist: torch (>=1.6.0)
Requires-Dist: tqdm
Requires-Dist: adapter-transformers (>=1.0.1)

<h2 align="center">Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing</h2>

<div align="center">
    <a href="https://github.com/minhvannguyen/trankit/blob/main/LICENSE">
        <img alt="GitHub" src="https://img.shields.io/github/license/minhvannguyen/trankit.svg?color=blue">
    </a>
    <a href='https://trankit.readthedocs.io/en/latest/?badge=latest'>
    <img src='https://readthedocs.org/projects/trankit/badge/?version=latest' alt='Documentation Status' />
    </a>
    <a href="http://nlp.uoregon.edu/trankit">
        <img alt="Demo Website" src="https://img.shields.io/website/http/trankit.readthedocs.io/en/latest/index.html.svg?down_color=red&down_message=offline&up_message=online">
    </a>
    <a href="https://pypi.org/project/trankit/">
        <img alt="PyPI Version" src="https://img.shields.io/pypi/v/trankit?color=blue">
    </a>
    <a href="https://pypi.org/project/trankit/">
        <img alt="Python Versions" src="https://img.shields.io/pypi/pyversions/trankit?colorB=blue">
    </a>
</div>

Trankit is a **light-weight Transformer-based Python** toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks over [100 languages](https://trankit.readthedocs.io/en/latest/pkgnames.html#trainable-languages), and 90 pretrained pipelines for [56 languages](https://trankit.readthedocs.io/en/latest/pkgnames.html#pretrained-languages-their-code-names). Trankit can process inputs that are untokenized (raw) or pretokenized strings, at both sentence and document level. Currently, Trankit supports the following tasks:
- Sentence segmentation.
- Tokenization.
- Multi-word token expansion.
- Part-of-speech tagging.
- Morphological feature tagging.
- Dependency parsing.
- Named entity recognition.

Built on the state-of-the-art multilingual pretrained transformer [XLM-Roberta](https://arxiv.org/abs/1911.02116), Trankit significantly *outperforms* prior multilingual NLP pipelines (e.g., UDPipe, Stanza) in many tasks over 90 [Universal Dependencies v2.5 treebanks](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3105) while still being efficient in memory usage and
speed, making it *usable for general users*. Below is the performance comparison between Trankit and other NLP toolkits on Arabic, Chinese, and English.

<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky">Treebank</th>
    <th class="tg-0pky">System</th>
    <th class="tg-0pky">Tokens</th>
    <th class="tg-0pky">Sents.</th>
    <th class="tg-0pky">Words</th>
    <th class="tg-0pky">UPOS</th>
    <th class="tg-0pky">XPOS</th>
    <th class="tg-0pky">UFeats</th>
    <th class="tg-0pky">Lemmas</th>
    <th class="tg-0pky">UAS</th>
    <th class="tg-0pky">LAS</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky" rowspan="2"><br>Overall (90 treebanks)</td>
    <td class="tg-0pky">Trankit</td>
    <td class="tg-c3ow">99.23</td>
    <td class="tg-7btt"><b>91.82</b></td>
    <td class="tg-7btt"><b>99.02</b></td>
    <td class="tg-7btt"><b>95.65</b></td>
    <td class="tg-7btt"><b>94.05</b></td>
    <td class="tg-7btt"><b>93.21</b></td>
    <td class="tg-7btt"><b>94.27</b></td>
    <td class="tg-7btt"><b>87.06</b></td>
    <td class="tg-7btt"><b>83.69</b></td>
  </tr>
  <tr>
    <td class="tg-0pky">Stanza</td>
    <td class="tg-7btt"><b>99.26</b></td>
    <td class="tg-c3ow">88.58</td>
    <td class="tg-c3ow">98.90</td>
    <td class="tg-c3ow">94.21</td>
    <td class="tg-c3ow">92.50</td>
    <td class="tg-c3ow">91.75</td>
    <td class="tg-c3ow">94.15</td>
    <td class="tg-c3ow">83.06</td>
    <td class="tg-c3ow">78.68</td>
  </tr>
  <tr>
    <td class="tg-0pky" rowspan="3"><br><br>Arabic-PADT<br></td>
    <td class="tg-0pky">Trankit</td>
    <td class="tg-c3ow">99.93</td>
    <td class="tg-7btt"><b>96.59</b></td>
    <td class="tg-7btt"><b>99.22</b></td>
    <td class="tg-7btt"><b>96.31</b></td>
    <td class="tg-7btt"><b>94.08</b></td>
    <td class="tg-7btt"><b>94.28</b></td>
    <td class="tg-7btt"><b>94.65</b></td>
    <td class="tg-7btt"><b>88.39</b></td>
    <td class="tg-7btt"><b>84.68</b></td>
  </tr>
  <tr>
    <td class="tg-0pky">Stanza</td>
    <td class="tg-7btt"><b>99.98</b></td>
    <td class="tg-c3ow">80.43</td>
    <td class="tg-c3ow">97.88</td>
    <td class="tg-c3ow">94.89</td>
    <td class="tg-c3ow">91.75</td>
    <td class="tg-c3ow">91.86</td>
    <td class="tg-c3ow">93.27</td>
    <td class="tg-c3ow">83.27</td>
    <td class="tg-c3ow">79.33</td>
  </tr>
  <tr>
    <td class="tg-0pky">UDPipe</td>
    <td class="tg-c3ow">99.98</td>
    <td class="tg-c3ow">82.09</td>
    <td class="tg-c3ow">94.58</td>
    <td class="tg-c3ow">90.36</td>
    <td class="tg-c3ow">84.00</td>
    <td class="tg-c3ow">84.16</td>
    <td class="tg-c3ow">88.46</td>
    <td class="tg-c3ow">72.67</td>
    <td class="tg-c3ow">68.14</td>
  </tr>
  <tr>
    <td class="tg-0pky" rowspan="3"><br><br>Chinese-GSD</td>
    <td class="tg-0pky">Trankit</td>
    <td class="tg-7btt"><b>97.01</b></td>
    <td class="tg-7btt"><b>99.70</b></td>
    <td class="tg-7btt"><b>97.01</b></td>
    <td class="tg-7btt"><b>94.21</b></td>
    <td class="tg-7btt"><b>94.02</b></td>
    <td class="tg-7btt"><b>96.59</b></td>
    <td class="tg-7btt"><b>97.01</b></td>
    <td class="tg-7btt"><b>85.19</b></td>
    <td class="tg-7btt"><b>82.54</b></td>
  </tr>
  <tr>
    <td class="tg-0pky">Stanza</td>
    <td class="tg-c3ow">92.83</td>
    <td class="tg-c3ow">98.80</td>
    <td class="tg-c3ow">92.83</td>
    <td class="tg-c3ow">89.12</td>
    <td class="tg-c3ow">88.93</td>
    <td class="tg-c3ow">92.11</td>
    <td class="tg-c3ow">92.83</td>
    <td class="tg-c3ow">72.88</td>
    <td class="tg-c3ow">69.82</td>
  </tr>
  <tr>
    <td class="tg-0pky">UDPipe</td>
    <td class="tg-c3ow">90.27</td>
    <td class="tg-c3ow">99.10</td>
    <td class="tg-c3ow">90.27</td>
    <td class="tg-c3ow">84.13</td>
    <td class="tg-c3ow">84.04</td>
    <td class="tg-c3ow">89.05</td>
    <td class="tg-c3ow">90.26</td>
    <td class="tg-c3ow">61.60</td>
    <td class="tg-c3ow">57.81</td>
  </tr>
  <tr>
    <td class="tg-0pky" rowspan="4"><br><br><br>English-EWT</td>
    <td class="tg-0pky">Trankit</td>
    <td class="tg-c3ow">98.48</td>
    <td class="tg-7btt"><b>88.35</b></td>
    <td class="tg-c3ow">98.48</td>
    <td class="tg-7btt"><b>95.95</b></td>
    <td class="tg-7btt"><b>95.71</b></td>
    <td class="tg-7btt"><b>96.26</b></td>
    <td class="tg-c3ow">96.84</td>
    <td class="tg-7btt"><b>90.14</b></td>
    <td class="tg-7btt"><b>87.96</b></td>
  </tr>
  <tr>
    <td class="tg-0pky">Stanza</td>
    <td class="tg-7btt"><b>99.01</b></td>
    <td class="tg-c3ow">81.13</td>
    <td class="tg-7btt"><b>99.01</b></td>
    <td class="tg-c3ow">95.40</td>
    <td class="tg-c3ow">95.12</td>
    <td class="tg-c3ow">96.11</td>
    <td class="tg-7btt"><b>97.21</b></td>
    <td class="tg-c3ow">86.22</td>
    <td class="tg-c3ow">83.59</td>
  </tr>
  <tr>
    <td class="tg-0pky">UDPipe</td>
    <td class="tg-c3ow">98.90</td>
    <td class="tg-c3ow">77.40</td>
    <td class="tg-c3ow">98.90</td>
    <td class="tg-c3ow">93.26</td>
    <td class="tg-c3ow">92.75</td>
    <td class="tg-c3ow">94.23</td>
    <td class="tg-c3ow">95.45</td>
    <td class="tg-c3ow">80.22</td>
    <td class="tg-c3ow">77.03</td>
  </tr>
  <tr>
    <td class="tg-0pky">spaCy</td>
    <td class="tg-c3ow">97.30</td>
    <td class="tg-c3ow">61.19</td>
    <td class="tg-c3ow">97.30</td>
    <td class="tg-c3ow">86.72</td>
    <td class="tg-c3ow">90.83</td>
    <td class="tg-c3ow">-</td>
    <td class="tg-c3ow">87.05</td>
    <td class="tg-c3ow">-</td>
    <td class="tg-c3ow">-</td>
  </tr>
</tbody>
</table>


Performance comparison between Trankit and these toolkits on other languages can be found [here](https://trankit.readthedocs.io/en/latest/performance.html#universal-dependencies-v2-5) on our documentation page.

We also created a Demo Website for Trankit, which is hosted at: http://nlp.uoregon.edu/trankit

Technical details about Trankit are presented in [our following paper](). Please cite the paper if you use Trankit in your software or research.

```bibtex
@article{unset,
  title={Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing},
  author={unset},
  journal={arXiv preprint arXiv:},
  year={2021}
}
```


### Installation
Trankit can be easily installed via one of the following methods:
#### Using pip
```
pip install trankit
```
This command installs Trankit and all of its dependencies automatically.

#### From source
```
git clone https://github.com/minhvannguyen/trankit.git
cd trankit
pip install -e .
```
This clones our GitHub repository and then installs Trankit from the local copy in editable mode.

### Quick Examples

#### Initialize a pretrained pipeline
The following code initializes a pretrained pipeline for English. It is instructed to run on GPU, automatically download the pretrained models, and store them in the specified cache directory. Trankit will not download pretrained models if they already exist.
```python
from trankit import Pipeline

# initialize a pretrained pipeline for English
p = Pipeline(lang='english', gpu=True, cache_dir='./cache')
```

#### Basic functions
Trankit can process inputs that are untokenized (raw) or pretokenized strings, at both sentence and document level. A pretokenized input can be a list of strings (i.e., a tokenized sentence) or a list of lists of strings (i.e., a tokenized document with multiple tokenized sentences); both forms are automatically recognized by Trankit. If the input is a sentence, the argument `is_sent` must be set to True.
```python
from trankit import Pipeline

p = Pipeline(lang='english', gpu=True, cache_dir='./cache')

######## document-level processing ########
untokenized_doc = '''Hello! This is Trankit.'''
pretokenized_doc = [['Hello', '!'], ['This', 'is', 'Trankit', '.']]

# perform all tasks on the input
processed_doc1 = p(untokenized_doc)
processed_doc2 = p(pretokenized_doc)

######## sentence-level processing ########
untokenized_sent = '''This is Trankit.'''
pretokenized_sent = ['This', 'is', 'Trankit', '.']

# perform all tasks on the input
processed_sent1 = p(untokenized_sent, is_sent=True)
processed_sent2 = p(pretokenized_sent, is_sent=True)

# perform separate tasks on the input
sents = p.ssplit(untokenized_doc) # sentence segmentation
tokenized_doc = p.tokenize(untokenized_doc) # sentence segmentation and tokenization
tokenized_sent = p.tokenize(untokenized_sent, is_sent=True) # tokenization only
posdeps = p.posdep(untokenized_doc) # upos, xpos, ufeats, dep parsing
ners = p.ner(untokenized_doc) # ner tagging
lemmas = p.lemmatize(untokenized_doc) # lemmatization
```
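The three accepted input shapes can be told apart by type alone. The sketch below illustrates that distinction in plain Python; `describe_input` is a hypothetical helper written for this README, not part of Trankit, and it is not a description of Trankit's internal logic.

```python
from typing import List, Union

# The three input shapes described above.
TrankitInput = Union[str, List[str], List[List[str]]]


def describe_input(data: TrankitInput) -> str:
    """Name the shape of `data` (hypothetical helper, not a Trankit API)."""
    if isinstance(data, str):
        # Raw text: may be a sentence or a document, so `is_sent` decides.
        return 'raw'
    if isinstance(data, list) and data:
        if all(isinstance(tok, str) for tok in data):
            return 'pretokenized sentence'
        if all(
            isinstance(sent, list) and sent
            and all(isinstance(tok, str) for tok in sent)
            for sent in data
        ):
            return 'pretokenized document'
    raise ValueError(
        'expected a string, a list of strings, or a list of lists of strings'
    )


print(describe_input('Hello! This is Trankit.'))
print(describe_input(['This', 'is', 'Trankit', '.']))
print(describe_input([['Hello', '!'], ['This', 'is', 'Trankit', '.']]))
```

A raw string cannot be classified as sentence or document from its type, which is why `is_sent=True` has to be passed explicitly for sentence-level input.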
WARNING: Although pretokenized inputs can always be processed, they may not be appropriate for languages that require multi-word token expansion, such as Arabic or French. Check the documentation to see whether a particular language requires multi-word token expansion.  
For more detailed examples, please check out our [documentation page](https://trankit.readthedocs.io/en/latest/overview.html).
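Once a document has been processed, the result usually needs to be flattened for inspection. The sketch below assumes the output is a dictionary with a `sentences` list, where each sentence holds a `tokens` list and each token carries fields such as `text`, `upos`, `head`, and `deprel`. That layout is an assumption here, so check it against the documentation page. `token_rows` is a hypothetical helper, and the `sample` dictionary is hand-written for illustration, not real model output.

```python
def token_rows(doc):
    """Flatten an assumed Trankit-style output dict into one tuple per token."""
    rows = []
    for sent in doc.get('sentences', []):
        for tok in sent.get('tokens', []):
            rows.append((
                sent.get('id'),     # sentence index
                tok.get('text'),    # surface form
                tok.get('upos'),    # universal POS tag
                tok.get('head'),    # index of the syntactic head (0 = root)
                tok.get('deprel'),  # dependency relation
            ))
    return rows


# Hand-written sample in the assumed layout (NOT produced by a model).
sample = {
    'text': 'This is Trankit.',
    'sentences': [{
        'id': 1,
        'text': 'This is Trankit.',
        'tokens': [
            {'id': 1, 'text': 'This', 'upos': 'PRON', 'head': 3, 'deprel': 'nsubj'},
            {'id': 2, 'text': 'is', 'upos': 'AUX', 'head': 3, 'deprel': 'cop'},
            {'id': 3, 'text': 'Trankit', 'upos': 'PROPN', 'head': 0, 'deprel': 'root'},
            {'id': 4, 'text': '.', 'upos': 'PUNCT', 'head': 3, 'deprel': 'punct'},
        ],
    }],
}

for row in token_rows(sample):
    print(row)
```

Using `.get()` throughout keeps the helper tolerant of missing fields, for example when only tokenization was run and no tags are present.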

#### Multilingual usage
To process inputs in different languages, we need to initialize a multilingual pipeline. The following code initializes a multilingual pipeline for Arabic, Chinese, Dutch, and English.
```python
from trankit import Pipeline

# initialize a multilingual pipeline
p = Pipeline(lang='english', gpu=True, cache_dir='./cache')

langs = ['arabic', 'chinese', 'dutch']
for lang in langs:
    p.add(lang)

# tokenize English input
p.set_active('english')
en = p.tokenize('Rich was here before the scheduled time.')

# get ner tags for Arabic input
p.set_active('arabic')
ar = p.ner('وكان كنعان قبل ذلك رئيس جهاز الامن والاستطلاع للقوات السورية العاملة في لبنان.')
```
In this example, `.set_active()` is used to switch between languages.
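When inputs in several languages arrive interleaved, the switching logic can be kept in one place. The sketch below groups inputs by language so that `.set_active()` is called once per language, then returns the results in the original order. `tokenize_by_language` is a hypothetical helper, not part of Trankit. It only assumes a pipeline object exposing the `.set_active()` and `.tokenize()` methods shown above, with every language already added via `.add()`. Whether grouping affects speed is not measured here.

```python
from collections import defaultdict


def tokenize_by_language(pipeline, items):
    """Tokenize (lang, text) pairs, switching language once per group.

    `pipeline` is expected to behave like the multilingual pipeline above.
    Results are returned in the same order as `items`.
    """
    # Remember each text's original position so order can be restored.
    by_lang = defaultdict(list)
    for index, (lang, text) in enumerate(items):
        by_lang[lang].append((index, text))

    results = [None] * len(items)
    for lang, group in by_lang.items():
        pipeline.set_active(lang)  # one switch per language
        for index, text in group:
            results[index] = pipeline.tokenize(text)
    return results
```

With the pipeline `p` from the example above, a call could look like `tokenize_by_language(p, [('english', 'Rich was here.'), ('dutch', 'Hallo wereld.')])`.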

### Training your own pipelines
Training customized pipelines is easy with Trankit via the class `TPipeline`. Below is an example of training a token and sentence splitter with Trankit.
```python
from trankit import TPipeline

tp = TPipeline(training_config={
    'task': 'tokenize',
    'save_dir': './saved_model',
    'train_txt_fpath': './train.txt',
    'train_conllu_fpath': './train.conllu',
    'dev_txt_fpath': './dev.txt',
    'dev_conllu_fpath': './dev.conllu'
    }
)

# start training with the TPipeline object created above
tp.train()
```
Detailed guidelines for training customized pipelines can be found [here](https://trankit.readthedocs.io/en/latest/training.html).
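Because the training configuration points at several files, it can help to confirm they exist before starting a run. The sketch below is a plain-Python check; `missing_training_files` and `PATH_KEYS` are hypothetical names, not part of Trankit. The key list is taken from the tokenizer example above, and other tasks may need a different set of files.

```python
import os

# File-path keys used in the tokenizer training example above.
PATH_KEYS = (
    'train_txt_fpath',
    'train_conllu_fpath',
    'dev_txt_fpath',
    'dev_conllu_fpath',
)


def missing_training_files(config):
    """Return the keys whose path is absent from `config` or is not a file."""
    return [
        key for key in PATH_KEYS
        if not os.path.isfile(config.get(key, ''))
    ]


training_config = {
    'task': 'tokenize',
    'save_dir': './saved_model',
    'train_txt_fpath': './train.txt',
    'train_conllu_fpath': './train.conllu',
    'dev_txt_fpath': './dev.txt',
    'dev_conllu_fpath': './dev.conllu',
}

problems = missing_training_files(training_config)
if problems:
    print('Missing or unreadable training files for keys:', problems)
else:
    print('All training files found.')
```

Running the check first surfaces a mistyped path immediately instead of partway through pipeline setup.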

### Acknowledgements
We use the [AdapterHub](https://github.com/Adapter-Hub/adapter-transformers) to implement our plug-and-play mechanism with Adapters. To speed up the development process, the implementations for the MWT expander and the lemmatizer are adapted from [Stanza](https://github.com/stanfordnlp/stanza).


