Metadata-Version: 2.1
Name: tok
Version: 0.0.4
Summary: Fast and customizable tokenizer
Home-page: https://github.com/kootenpv/tok
Author: Pascal van Kooten
Author-email: kootenpv@gmail.com
License: MIT
Platform: any
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Customer Service
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: Microsoft
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Unix
Classifier: Operating System :: POSIX
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Build Tools
Classifier: Topic :: Software Development :: Debuggers
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: System :: Software Distribution
Classifier: Topic :: System :: Systems Administration
Classifier: Topic :: Utilities
Description-Content-Type: text/markdown
Requires-Dist: textsearch
Requires-Dist: tldextract
Requires-Dist: contractions

## tok

[![PyPI](https://img.shields.io/pypi/v/tok.svg?style=flat-square)](https://pypi.python.org/pypi/tok/)
[![PyPI](https://img.shields.io/pypi/pyversions/tok.svg?style=flat-square)](https://pypi.python.org/pypi/tok/)

Fastest and most complete/customizable tokenizer in Python.

It is roughly 25x faster than spaCy's and NLTK's regex-based tokenizers.

It is built on the Aho-Corasick algorithm, which is a novel approach for a tokenizer and makes every split both fast and explainable.

The heavy lifting is done by [textsearch](https://github.com/kootenpv/textsearch) and [pyahocorasick](https://github.com/WojciechMula/pyahocorasick), allowing this to be written in only ~200 lines of code.
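To give a feel for why this approach is fast, here is a minimal pure-Python sketch of Aho-Corasick multi-pattern matching. It is not the code used by `textsearch` or `pyahocorasick`; the names `build_automaton` and `find_all` are hypothetical and exist only for this illustration. The point is that all patterns are found in a single pass over the text, no matter how many patterns there are.

```python
from collections import deque


def build_automaton(patterns):
    """Build a trie with failure links (illustrative sketch, not tok's code)."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix in the trie
    out = [[]]    # out[node] -> patterns that end at this node
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)

    # Breadth-first pass: depth-1 nodes keep fail = 0, deeper nodes inherit
    # their failure link from the parent's failure chain.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Patterns ending at the suffix node also end here.
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every match, in one pass over text."""
    goto, fail, out = automaton
    node = 0
    matches = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            matches.append((i - len(pat) + 1, pat))
    return matches


automaton = build_automaton(["he", "she", "his", "hers"])
matches = find_all("ushers", automaton)
print(matches)  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```

Because each match is tied to a concrete pattern, a tokenizer built this way can report which pattern caused a given split.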

### Installation

    pip install tok

### Usage

By default, it handles contractions, http URLs, (float) numbers, and currencies.

```python
from tok import word_tokenize
word_tokenize("I wouldn't do that.... would you?")
['I', 'would', 'not', 'do', 'that', '...', 'would', 'you', '?']
```

Or configure it yourself:

```python
from tok import Tokenizer
tokenizer = Tokenizer(protected_words=["some.thing"]) # still using the defaults
tokenizer.word_tokenize("I want to protect some.thing")
['I', 'want', 'to', 'protect', 'some.thing']
```

Split by sentences:

```python
from tok import sent_tokenize
sent_tokenize("I wouldn't do that.... would you?")
[['I', 'would', 'not', 'do', 'that', '...'], ['would', 'you', '?']]
```

For more options, check the documentation of the `Tokenizer`.

### Further customization

Given:

```python
from tok import Tokenizer
t = Tokenizer(protected_words=["some.thing"]) # still using the defaults
```

You can add your own rules to the tokenizer by using:

- `t.keep(x, reason)`: Whenever it finds `x`, it will not add whitespace. This protects `x` from being split into tokens.
- `t.split(x, reason)`: Whenever it finds `x`, it will surround it with whitespace, thus creating a token.
- `t.drop(x, reason)`: Whenever it finds `x`, it will remove it but add a split.
- `t.strip(x, reason)`: Whenever it finds `x`, it will remove it without splitting.

```python
t.drop("bla", "bla is not needed")
t.word_tokenize("Please remove bla, thank you")
['Please', 'remove', ',', 'thank', 'you']
```

### Explainable

Explain what happened:

```python
t.explain("bla")
[{'from': 'bla', 'to': ' ', 'explanation': 'bla is not needed'}]
```

See everything in there (will help you understand how it works):

```python
t.explain_dict
```
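The four actions and the explanation records can be pictured as plain rewrite rules: each pattern maps to a replacement string plus a reason, and the rewritten text is finally split on whitespace. The sketch below is an assumption-laden illustration, not tok's implementation: `RuleTokenizer` is a hypothetical class, and it uses a naive longest-match scan instead of Aho-Corasick to stay short.

```python
class RuleTokenizer:
    """Hypothetical stand-in for illustration only; not tok's Tokenizer."""

    def __init__(self):
        self.rules = {}  # pattern -> (replacement, reason)

    def _add(self, x, replacement, reason):
        if not x:
            raise ValueError("pattern must be non-empty")
        self.rules[x] = (replacement, reason)

    def keep(self, x, reason):
        self._add(x, x, reason)              # emit x unchanged, no whitespace

    def split(self, x, reason):
        self._add(x, " " + x + " ", reason)  # surround x with whitespace

    def drop(self, x, reason):
        self._add(x, " ", reason)            # remove x, leave a split

    def strip(self, x, reason):
        self._add(x, "", reason)             # remove x, no split

    def explain(self, x):
        to, reason = self.rules[x]
        return [{"from": x, "to": to, "explanation": reason}]

    def word_tokenize(self, text):
        # Longer patterns win, so a kept "some.thing" beats a split on ".".
        patterns = sorted(self.rules, key=len, reverse=True)
        pieces = []
        i = 0
        while i < len(text):
            for pat in patterns:
                if text.startswith(pat, i):
                    pieces.append(self.rules[pat][0])
                    i += len(pat)
                    break
            else:
                pieces.append(text[i])  # no rule applies at this position
                i += 1
        return "".join(pieces).split()


t = RuleTokenizer()
t.keep("some.thing", "protected word")
t.split(".", "punctuation")
t.split(",", "punctuation")
t.drop("bla", "bla is not needed")
t.strip("-", "hyphens are noise")
tokens = t.word_tokenize("Please re-move bla, but keep some.thing.")
print(tokens)  # ['Please', 'remove', ',', 'but', 'keep', 'some.thing', '.']
```

Note that this naive version matches patterns anywhere, including inside longer words, so `drop("bla", ...)` would also affect "blah"; a real tokenizer needs boundary handling on top of this idea.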

### Contributing

Contributions to this library are greatly appreciated.

It would also be great to add [contractions](https://github.com/kootenpv/contractions) for other languages.


