Metadata-Version: 2.1
Name: transformer-vae
Version: 0.0.2
Summary: Interpolate between discrete sequences.
Home-page: https://github.com/Fraser-Greenlee/transformer-vae
Author: Fraser Greenlee
Author-email: fraser.greenlee@mac.com
License: MIT
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: datasets (==1.1.3)
Requires-Dist: transformers (==4.1.1)
Requires-Dist: wandb (>=0.10.12)
Requires-Dist: torch (==1.7.0)
Requires-Dist: sklearn
Requires-Dist: bert-score
Provides-Extra: test
Requires-Dist: pytest ; extra == 'test'
Requires-Dist: flake8 ; extra == 'test'
Requires-Dist: flake8-mypy ; extra == 'test'
Requires-Dist: black ; extra == 'test'
Requires-Dist: twine ; extra == 'test'

# Transformer-VAE (WIP)

![Diagram of a State Autoencoder](https://github.com/Fraser-Greenlee/transformer-vae/blob/v1/t5-vae.png)

Transformer-VAEs learn smooth latent spaces of discrete sequences without any explicit rules in their decoders.

This can be used for program synthesis, drug discovery, music generation and much more!

To see how it works, check out [this blog post](https://fraser-greenlee.github.io/2020/08/13/Transformers-as-Variational-Autoencoders.html).

This repo is in active development, but I should be coming out with a full release soon.

## Install

Install using pip:
```
pip install transformer_vae
```

## Usage

You can execute the module to train it on your own data:
```bash
python -m transformer_vae \
    --project_name="T5-VAE" \
    --output_dir=poet \
    --do_train \
    --huggingface_dataset=poems
```
Or you can import Transformer-VAE as a package and use it much like a Hugging Face model:
```python
from transformer_vae import T5_VAE_Model

model = T5_VAE_Model.from_pretrained('t5-vae-poet')
```
## Training
Set up [Weights & Biases](https://app.wandb.ai/) for logging; see the [client](https://github.com/wandb/client).

Get a dataset to model. It must be represented as text, since this is what the model will interpolate over.

This can be a text file with each line representing a sample:
```bash
python -m transformer_vae \
    --project_name="T5-VAE" \
    --output_dir=poet \
    --do_train \
    --train_file=poems.txt
```
Alternatively, for samples that span multiple lines, separate them with a line containing only `<|endoftext|>`:
```bash
python -m transformer_vae \
    --project_name="T5-VAE" \
    --output_dir=poet \
    --do_train \
    --train_file=poems.txt \
    --multiline_samples
```
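A file in this format can be produced with a few lines of Python. The sketch below is only an illustration and is not part of the package: the helper names `write_multiline_samples` and `read_multiline_samples` are hypothetical, the separator is assumed to go between samples, and the reader mirrors the format described above rather than the package's actual parsing.

```python
from pathlib import Path

SEPARATOR = "<|endoftext|>"  # separator line described above


def write_multiline_samples(samples, path):
    """Write samples to `path` with a separator-only line between them."""
    text = f"\n{SEPARATOR}\n".join(sample.strip("\n") for sample in samples)
    Path(path).write_text(text + "\n", encoding="utf-8")


def read_multiline_samples(path):
    """Split the file back into samples (assumed parsing, for checking only)."""
    samples, current = [], []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line == SEPARATOR:
            # A separator line closes the sample collected so far.
            samples.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:
        samples.append("\n".join(current))
    return samples


poems = [
    "Roses are red,\nViolets are blue.",
    "An old silent pond\nA frog jumps in.",
]
write_multiline_samples(poems, "poems.txt")
```

Reading the file back with `read_multiline_samples("poems.txt")` is a quick way to confirm that no sample was split or merged before starting a training run.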
Alternatively, provide a Hugging Face dataset:
```bash
python -m transformer_vae \
    --project_name="T5-VAE" \
    --output_dir=poet \
    --do_train \
    --dataset=poems \
    --content_key=text
```

Experiment with different parameters.

Once finished, upload the model to the Hugging Face model hub.

```bash
# TODO
```

Explore the produced latent space using `Colab_T5_VAE.ipynb` or by visiting this [Colab page](TODO).

### Contributing

Install with tests:
```
pip install -e .[test]
```

Possible contributions to make:
1. Could the docs be clearer? Would it be worth having a docs site/blog?
2. Use a Funnel Transformer encoder. Is it more efficient?
3. Allow defining an alternative token set.
4. Store the latent codes from the previous step to use in MMD loss so smaller batch sizes are possible.
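
The idea in item 4 can be sketched roughly as follows. This is a standalone illustration and not the repo's implementation: it uses plain Python lists instead of torch tensors so it runs without dependencies, and the names `LatentBuffer`, `mmd_squared` and the Gaussian kernel bandwidth are all assumptions made for the example.

```python
import math
import random
from collections import deque


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))


def mean_kernel(xs, ys):
    """Average kernel value over all pairs drawn from xs and ys."""
    return sum(gaussian_kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd_squared(latents, prior_samples):
    """Biased estimate of the squared MMD between two sets of vectors."""
    return (
        mean_kernel(latents, latents)
        + mean_kernel(prior_samples, prior_samples)
        - 2 * mean_kernel(latents, prior_samples)
    )


class LatentBuffer:
    """Hypothetical helper that keeps latent codes from recent steps."""

    def __init__(self, max_steps=1):
        self.steps = deque(maxlen=max_steps)

    def extend_batch(self, batch):
        # Pool stored codes with the current batch, then remember the batch.
        pooled = [z for step in self.steps for z in step] + list(batch)
        self.steps.append(list(batch))
        return pooled


rng = random.Random(0)
dim = 4
buffer = LatentBuffer(max_steps=1)
for step in range(3):
    # Stand-in for the encoder output of a small batch of 8 samples.
    batch = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(8)]
    pooled = buffer.extend_batch(batch)
    prior = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(len(pooled))]
    loss = mmd_squared(pooled, prior)
```

After the first step, each loss is computed over 16 codes even though the batch only holds 8. In a real training loop the stored codes would presumably be detached from the gradient graph, so only the current batch receives gradients.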

Feel free to ask what would be useful!


