Metadata-Version: 2.1
Name: transformers-domain-adaptation
Version: 0.3.1
Summary: Adapt Transformer-based language models to new text domains
Home-page: https://github.com/georgianpartners/Transformers-Domain-Adaptation
Author: Christopher Tee
Author-email: chris@georgian.io
License: MIT
Keywords: transformers,tokenizers,huggingface,pytorch,domain-adaptation,transfer-learning,natural-language-processing
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: General
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: transformers (<5,>=4)
Requires-Dist: tokenizers (<0.10,>=0.9)
Requires-Dist: datasets (<1.3,>=1.2)
Requires-Dist: pandas
Requires-Dist: torch (<1.8,>=1.7)
Requires-Dist: scipy (==1.5.4)
Requires-Dist: scikit-learn
Requires-Dist: tqdm

<div align="center">

<h1 style="text-align:center">Transformers Domain Adaptation</h1>
<p align="center">
    <a href="https://transformers-domain-adaptation.readthedocs.io/en/latest/content/introduction.html">Documentation</a> •
    <a href="https://colab.research.google.com/github/georgianpartners/Transformers-Domain-Adaptation/blob/master/notebooks/GuideToTransformersDomainAdaptation.ipynb">Colab Guide</a>
</p>

[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/transformers-domain-adaptation)](https://pypi.org/project/transformers-domain-adaptation/)
[![PyPI version](https://badge.fury.io/py/transformers-domain-adaptation.svg)](https://badge.fury.io/py/transformers-domain-adaptation)
![Python package](https://github.com/georgianpartners/Transformers-Domain-Adaptation/workflows/Python%20package/badge.svg)
[![Documentation Status](https://readthedocs.org/projects/transformers-domain-adaptation/badge/?version=latest)](https://transformers-domain-adaptation.readthedocs.io/en/latest/?badge=latest)

</div>

This toolkit improves the performance of HuggingFace transformer models on downstream NLP tasks
by adapting the models to the target domain of those tasks (e.g. BERT -> LawBERT).

![Domain adaptation diagram](docs/source/domain_adaptation_diagram.png)

The overall Domain Adaptation framework can be broken down into three phases:
1. Data Selection
    > Select a relevant subset of documents from the in-domain corpus that is likely to benefit domain pre-training
2. Vocabulary Augmentation
    > Extend the vocabulary of the transformer model with domain-specific terminology
3. Domain Pre-Training
    > Continue pre-training the transformer model on the in-domain corpus to learn the linguistic nuances of the target domain
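
As a rough illustration of the Data Selection phase, the sketch below ranks in-domain documents by their token overlap with a small sample of task text and keeps the top fraction. It is a toy in plain Python, not the toolkit's `DataSelector`: the function names, the Jaccard scoring, and the `keep` parameter are assumptions made for this example.

```python
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenization; a stand-in for a real tokenizer."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_documents(corpus: List[str], task_texts: List[str], keep: float = 0.5) -> List[str]:
    """Return the fraction `keep` of `corpus` most similar to the task vocabulary."""
    # Pool the tokens of all task texts into one reference vocabulary.
    task_vocab: Set[str] = set()
    for text in task_texts:
        task_vocab |= tokenize(text)

    # Rank documents from most to least similar to the task vocabulary.
    ranked = sorted(
        corpus,
        key=lambda doc: jaccard(tokenize(doc), task_vocab),
        reverse=True,
    )
    n_keep = max(1, round(len(corpus) * keep)) if corpus else 0
    return ranked[:n_keep]


# Hypothetical toy data: a mixed corpus and one sentence from a legal task.
corpus = [
    "the court granted the motion to dismiss",
    "the recipe calls for two cups of flour",
    "the plaintiff filed an appeal with the court",
    "the team won the match in extra time",
]
task_texts = ["the court dismissed the appeal filed by the plaintiff"]

selected = select_documents(corpus, task_texts, keep=0.5)
for doc in selected:
    print(doc)
```

With this toy data, the two legal sentences are kept and the cooking and sports sentences are dropped.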

After a model is domain-adapted, it can be fine-tuned on the downstream NLP task of choice, like any pre-trained transformer model.

### Components
This toolkit provides two classes, `DataSelector` and `VocabAugmentor`, to simplify the Data Selection and Vocabulary Augmentation steps, respectively.
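
The idea behind Vocabulary Augmentation can be sketched in plain Python: count the words in the in-domain corpus that the current vocabulary lacks and keep the most frequent ones. This is an illustration only and does not use `VocabAugmentor`; `find_new_tokens`, `max_new_tokens`, and `min_count` are hypothetical names, and a real tokenizer works on subwords rather than whitespace-separated words.

```python
from collections import Counter
from typing import Iterable, List, Set


def find_new_tokens(
    corpus: Iterable[str],
    existing_vocab: Set[str],
    max_new_tokens: int = 3,
    min_count: int = 2,
) -> List[str]:
    """Return the most frequent corpus words that are missing from the vocabulary."""
    counts: Counter = Counter()
    for doc in corpus:
        for word in doc.lower().split():
            if word not in existing_vocab:
                counts[word] += 1

    # Drop rare words, then sort by descending count (ties broken alphabetically).
    frequent = [(word, n) for word, n in counts.items() if n >= min_count]
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in frequent[:max_new_tokens]]


# Hypothetical toy data: a tiny legal corpus and a small general vocabulary.
domain_corpus = [
    "the plaintiff sued the defendant",
    "the defendant countersued the plaintiff",
    "the plaintiff appealed",
]
existing_vocab = {"the", "sued", "appealed"}

new_tokens = find_new_tokens(domain_corpus, existing_vocab)
print(new_tokens)
```

With HuggingFace `transformers`, tokens found this way would typically be registered with `tokenizer.add_tokens(...)`, followed by `model.resize_token_embeddings(len(tokenizer))` so that the model gains embeddings for them.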

## Installation
This package requires Python 3.6+ and can be installed with `pip`:
```
pip install transformers-domain-adaptation
```
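
To confirm that the install succeeded, one option is to query the installed distribution's version from Python. The sketch below uses only the standard library where possible, with a `pkg_resources` fallback (assumed to be available through setuptools) for Python 3.6 and 3.7; `installed_version` is a hypothetical helper name.

```python
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        # Standard library on Python 3.8+.
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Fallback for Python 3.6/3.7, assuming setuptools is installed.
        import pkg_resources

        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None

    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("transformers-domain-adaptation"))
```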

## Features
- Compatible with the HuggingFace ecosystem:
    - `transformers 4.x`
    - `tokenizers`
    - `datasets`

## Usage
Please refer to our Colab guide!

<a href="https://colab.research.google.com/github/georgianpartners/Transformers-Domain-Adaptation/blob/master/notebooks/GuideToTransformersDomainAdaptation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Results
TODO


