Metadata-Version: 2.1
Name: tftokenizers
Version: 0.1.1
Summary: Use Huggingface Transformer and Tokenizers as Tensorflow Reusable SavedModels.
Home-page: https://github.com/Hugging-Face-Supporter/tftokenizers
License: Apache-2.0
Keywords: huggingface,transformers,tokenizers,tensorflow,text
Author: MarkusSagen
Author-email: markus.john.sagen@gmail.com
Requires-Python: >=3.8,<4.0
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Dist: Sphinx (==4.1.2)
Requires-Dist: datasets (>=1.17.0,<2.0.0)
Requires-Dist: myst-parser (==0.15.2)
Requires-Dist: pydantic (>=1.9.0,<2.0.0)
Requires-Dist: python-decouple (>=3.5,<4.0)
Requires-Dist: readthedocs-sphinx-search (==0.1.1)
Requires-Dist: recommonmark (>=0.7.1,<0.8.0)
Requires-Dist: requests (==2.26.0)
Requires-Dist: rich[jupyter] (>=10.14.0,<11.0.0)
Requires-Dist: sentencepiece (>=0.1.96,<0.2.0)
Requires-Dist: sphinx-copybutton (==0.4.0)
Requires-Dist: sphinx-markdown-tables (==0.0.15)
Requires-Dist: sphinx-rtd-theme (==1.0.0)
Requires-Dist: sphinxemoji (>=0.2.0,<0.3.0)
Requires-Dist: sphinxext-opengraph (==0.4.2)
Requires-Dist: tensorflow (==2.5.2)
Requires-Dist: tensorflow-datasets (>=4.4.0,<5.0.0)
Requires-Dist: tensorflow-hub (>=0.12.0,<0.13.0)
Requires-Dist: tensorflow-text (==2.5.0)
Requires-Dist: tf-sentencepiece (>=0.1.92,<0.2.0)
Requires-Dist: tomlkit (==0.7.2)
Requires-Dist: torch (>=1.10.1,<2.0.0)
Requires-Dist: transformers (>=4.15.0,<5.0.0)
Requires-Dist: unzip (>=1.0.0,<2.0.0)
Requires-Dist: wget (>=3.2,<4.0)
Project-URL: Repository, https://github.com/Hugging-Face-Supporter/tftokenizers
Description-Content-Type: text/markdown

# TFtokenizers
Convert Huggingface tokenizers to Tensorflow tokenizers. The main motivation is to bundle the tokenizer and the model into a single Reusable SavedModel.

---

**Source Code**: <a href="https://github.com/Hugging-Face-Supporter/tftokenizers" target="_blank">https://github.com/Hugging-Face-Supporter/tftokenizers</a>

---


## Example
This example shows how a Huggingface model and tokenizer can be bundled together as a [Reusable SavedModel](https://www.tensorflow.org/hub/reusable_saved_models) that yields the same result as using the model and tokenizer from Huggingface 🤗


```python
import tensorflow as tf
from tftokenizers import TFModel
from tftokenizers import TFAutoTokenizer
from transformers import TFAutoModel

# Load base models from Huggingface
model_name = "bert-base-cased"
model = TFAutoModel.from_pretrained(model_name)

# Load converted TF tokenizer
tokenizer = TFAutoTokenizer(model_name)

# Create a TF Reusable SavedModel
custom_model = TFModel(model=model, tokenizer=tokenizer)

# Tokenizer and model can handle `tf.Tensors` or regular strings
tf_string = tf.constant(["Hello from Tensorflow"])
s1 = "SponGE bob SQuarePants is an avenger"
s2 = "Huggingface to Tensorflow tokenizers"
s3 = "Hello, world!"

output = custom_model(tf_string)
output = custom_model([s1, s2, s3])

# You can also pass arguments, similar to Huggingface tokenizers
output = custom_model(
    [s1, s2, s3],
    max_length=512,
    padding="max_length",
)
print(output)

# Save the bundled tokenizer and model as a SavedModel
saved_name = "reusable_bert_tf"
tf.saved_model.save(custom_model, saved_name)

# Load the saved model
reloaded_model = tf.saved_model.load(saved_name)
output = reloaded_model([s1, s2, s3])
print(output)
```
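
The `padding="max_length"` argument in the example mirrors the Huggingface tokenizer option of the same name: each sequence is right-padded up to `max_length`, and an attention mask marks real tokens with 1 and padding with 0. The snippet below is a minimal pure-Python sketch of that behavior, not this library's implementation. `pad_to_max_length` is a hypothetical helper, the token IDs are made up, and a pad ID of 0 is assumed (the `[PAD]` ID in the standard BERT vocabularies).

```python
from typing import Dict, List


def pad_to_max_length(
    batch: List[List[int]], max_length: int, pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Right-pad each sequence to max_length and build the attention mask."""
    input_ids, attention_mask = [], []
    for ids in batch:
        if len(ids) > max_length:
            # This sketch only pads; truncation is a separate step.
            raise ValueError("sequence longer than max_length; truncate first")
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs, for illustration only.
padded = pad_to_max_length([[101, 7, 8, 102], [101, 9, 102]], max_length=6)
print(padded["input_ids"])
print(padded["attention_mask"])
```

With `max_length=512`, as in the example above, every row of the batch would have 512 entries regardless of the original sentence length.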

### `Setup`
```bash
git clone https://github.com/Hugging-Face-Supporter/tftokenizers.git
cd tftokenizers
poetry install
poetry shell
```

### `Run`
To convert a Huggingface tokenizer to Tensorflow, first choose a model or tokenizer from the Huggingface hub to download.

**NOTE**
> Currently only BERT models work with the converter.

#### `Download`
First, download tokenizers from the hub by name. Either download a single tokenizer with the Python script or run the bash script to download multiple tokenizers.

The eventual goal is for tokenizers to be downloaded and converted automatically.

```bash
python tftokenizers/download.py -n bert-base-uncased
bash scripts/download_tokenizers.sh
```

#### `Convert`
Convert the downloaded tokenizer from the Huggingface format to Tensorflow:
```bash
python tftokenizers/convert.py
```

### `Before Commit`
```bash
make build
```



## WIP
- [x] Convert a BERT tokenizer from Huggingface to Tensorflow
- [x] Make a TF Reusable SavedModel with Tokenizer and Model in the same class. Emulate how the TF Hub example for BERT works.
- [x] Find methods for identifying the base tokenizer model and map those settings and special tokens to new tokenizers
- [x] Extend the tokenizers to more tokenizer types and identify them from a Huggingface model name
- [ ] Document how others can use the library and document the different stages in the process
- [ ] Convert other tokenizers. Identify limitations
- [ ] Improve the conversion pipeline (such as downloading and exporting files if they are not passed in or available locally)
- [ ] Support encoding of two sentences at a time [Ref](https://www.tensorflow.org/text/guide/bert_preprocessing_guide)
- [ ] Allow the tokenizers to be used for Masking (MLM) [Ref](https://www.tensorflow.org/text/guide/bert_preprocessing_guide)
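
For context on the open item about encoding two sentences at a time: BERT expects a sentence pair laid out as `[CLS] A [SEP] B [SEP]`, with segment IDs of 0 for the first part and 1 for the second. The snippet below is a minimal pure-Python sketch of that layout, assuming both sentences are already tokenized. `pack_pair` is a hypothetical helper and is not part of this library.

```python
from typing import Dict, List


def pack_pair(tokens_a: List[str], tokens_b: List[str]) -> Dict[str, List]:
    """Combine two tokenized segments into the BERT sentence-pair layout."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], segment A and its [SEP];
    # segment 1 covers segment B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return {"tokens": tokens, "token_type_ids": token_type_ids}


packed = pack_pair(["hello", "world"], ["good", "bye"])
print(packed["tokens"])
print(packed["token_type_ids"])
```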

