Metadata-Version: 2.1
Name: tokenizer-adapter
Version: 0.1.1
Summary: A simple tool to adapt a pretrained language model to a new vocabulary
Home-page: https://github.com/ccdv-ai/tokenizer-adapter
Author: Charles Condevaux
Author-email: charles.condevaux@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch >=1.8
Requires-Dist: tokenizers >=0.15.0
Requires-Dist: tqdm

# Tokenizer Adapter

A simple tool to adapt a pretrained Hugging Face model to a new, domain-specific vocabulary with (almost) no training. \
It should work with most language models from the Hugging Face Hub, although more testing is needed.

## Install

```
pip install tokenizer-adapter --upgrade
```

## Usage
```python
from tokenizer_adapter import TokenizerAdapter
from transformers import AutoTokenizer, AutoModelForMaskedLM

BASE_MODEL_PATH = "camembert-base"

# A simple corpus
corpus = ["A first sentence", "A second sentence", "blablabla"]

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)

# Train new vocabulary from the old tokenizer
new_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=300)

# Default params should work in most cases
adapter = TokenizerAdapter()

# Patch the model with the new tokenizer
model = adapter.adapt_from_pretrained(new_tokenizer, model, tokenizer)

# Save the model and the new tokenizer
model.save_pretrained("my_new_model/")
new_tokenizer.save_pretrained("my_new_model/")
```
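
For intuition about how a model can be adapted to a new vocabulary with almost no training, here is a minimal, dependency-free sketch of one common initialization strategy: each new token is split into pieces with the old vocabulary, and its embedding is set to the mean of the old embeddings of those pieces. This is an illustrative assumption, not a description of what `TokenizerAdapter` does internally; `greedy_tokenize`, `init_new_embeddings`, and the toy vocabularies are hypothetical names and data made up for the example.

```python
from typing import Dict, List


def greedy_tokenize(text: str, vocab: Dict[str, int]) -> List[str]:
    """Toy longest-match-first tokenizer (hypothetical stand-in for the old tokenizer)."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink until one matches
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text!r} at position {i}")
    return pieces


def init_new_embeddings(
    new_vocab: List[str],
    old_vocab: Dict[str, int],
    old_embeddings: List[List[float]],
) -> List[List[float]]:
    """Build one embedding row per new token by averaging old sub-token rows."""
    new_embeddings = []
    for token in new_vocab:
        pieces = greedy_tokenize(token, old_vocab)
        rows = [old_embeddings[old_vocab[p]] for p in pieces]
        dim = len(rows[0])
        # Element-wise mean over the old rows that make up this new token
        mean = [sum(row[d] for row in rows) / len(rows) for d in range(dim)]
        new_embeddings.append(mean)
    return new_embeddings


# Toy data (assumed for illustration): 4 old tokens with 2-dimensional embeddings
old_vocab = {"token": 0, "izer": 1, "a": 2, "dapt": 3}
old_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]

# New vocabulary whose entries are covered by the old pieces
new_vocab = ["tokenizer", "adapt", "token"]
new_embeddings = init_new_embeddings(new_vocab, old_vocab, old_embeddings)
print(new_embeddings)
```

Tokens that already exist in the old vocabulary (such as `"token"` here) simply keep their old embedding, which is why a short fine-tuning run, or none at all, can be enough after this kind of initialization.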
