Metadata-Version: 2.1
Name: awesome-align
Version: 0.1.7
Summary: An awesome word alignment tool
Home-page: https://github.com/neulab/awesome-align
Author: NeuLab
Author-email: zdou0830@gmail.com
License: BSD 3-Clause
Platform: UNKNOWN
Description-Content-Type: text/markdown
Requires-Dist: tokenizers (>=0.5.2)
Requires-Dist: torch (>=1.2.0)
Requires-Dist: tqdm
Requires-Dist: numpy
Requires-Dist: boto3
Requires-Dist: filelock
Requires-Dist: requests

## AWESOME: Aligning Word Embedding Spaces of Multilingual Encoders

`awesome-align` is a tool that can extract word alignments from multilingual BERT (mBERT) [[Demo]](https://colab.research.google.com/drive/1205ubqebM0OsZa1nRgbGJBtitgHqIVv6?usp=sharing) and allows you to fine-tune mBERT on parallel corpora for better alignment quality (see [our paper](https://arxiv.org/abs/2101.08231) for more details).

### Dependencies

First, you need to install the dependencies:

```bash
pip install -r requirements.txt
python setup.py install
```

### Input format

Inputs should be *tokenized* and each line is a source language sentence and its target language translation, separated by (` ||| `). You can see some examples in the `examples` folder.

### Extracting alignments

Here is an example of extracting word alignments from multilingual BERT:

```bash
DATA_FILE=/path/to/data/file
MODEL_NAME_OR_PATH=bert-base-multilingual-cased
OUTPUT_FILE=/path/to/output/file

CUDA_VISIBLE_DEVICES=0 awesome-align \
    --output_file=$OUTPUT_FILE \
    --model_name_or_path=$MODEL_NAME_OR_PATH \
    --data_file=$DATA_FILE \
    --extraction 'softmax' \
    --batch_size 32
```

This produces outputs in the `i-j` Pharaoh format. A pair `i-j` indicates that the <i>i</i>th word (zero-indexed) of the source sentence is aligned to the <i>j</i>th word of the target sentence. 

You can set `--output_prob_file` if you want to obtain the alignment probability and set `--output_word_file` if you want to obtain the aligned word pairs (in the `src_word<sep>tgt_word` format). You can also set `--cache_dir` to specify where you want to cache multilingual BERT.

You can also set `MODEL_NAME_OR_PATH` to the path of your fine-tuned model as shown below.

### Fine-tuning on parallel data

If there is parallel data available, you can fine-tune embedding models on that data.

Here is an example of fine-tuning mBERT that balances well between efficiency and effectiveness:

```bash
TRAIN_FILE=/path/to/train/file
EVAL_FILE=/path/to/eval/file
OUTPUT_DIR=/path/to/output/directory

CUDA_VISIBLE_DEVICES=0 awesome-train \
    --output_dir=$OUTPUT_DIR \
    --model_name_or_path=bert-base-multilingual-cased \
    --extraction 'softmax' \
    --do_train \
    --train_tlm \
    --train_so \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --save_steps 4000 \
    --max_steps 20000 \
    --do_eval \
    --eval_data_file=$EVAL_FILE
```

You can also fine-tune the model a bit longer with more training objectives for better quality:

```bash
TRAIN_FILE=/path/to/train/file
EVAL_FILE=/path/to/eval/file
OUTPUT_DIR=/path/to/output/directory

CUDA_VISIBLE_DEVICES=0 awesome-train \
    --output_dir=$OUTPUT_DIR \
    --model_name_or_path=bert-base-multilingual-cased \
    --extraction 'softmax' \
    --do_train \
    --train_mlm \
    --train_tlm \
    --train_tlm_full \
    --train_so \
    --train_psi \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --save_steps 10000 \
    --max_steps 40000 \
    --do_eval \
    --eval_data_file=$EVAL_FILE
```

If you want high alignment recalls, you can turn on the `--train_co` option, but note that the alignment precisions may drop. You can set `--cache_dir` to specify where you want to cache multilingual BERT.

### Supervised settings

In supervised settings where gold word alignments are available for your training data, you can incorporate the supervised signals into our self-training objective (`--train_so`) and here is an example command:

```bash
TRAIN_FILE=/path/to/train/file
TRAIN_GOLD_FILE=/path/to/train/gold/file
OUTPUT_DIR=/path/to/output/directory

CUDA_VISIBLE_DEVICES=0 awesome-train \
    --output_dir=$OUTPUT_DIR \
    --model_name_or_path=bert-base-multilingual-cased \
    --extraction 'softmax' \
    --do_train \
    --train_so \
    --train_data_file=$TRAIN_FILE \
    --train_gold_file=$TRAIN_GOLD_FILE \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 5 \
    --learning_rate 1e-4 \
    --save_steps 200
```

See `examples/*.gold` for the example format of the gold alignments. You need to turn on the `--gold_one_index` option if the gold alignments are 1-indexed and you can turn on the `--ignore_possible_alignments` option if you want to ignore possible alignments.

### Model performance

The following table shows the alignment error rates (AERs) of our models and popular statistical word aligners on five language pairs. The De-En, Fr-En, Ro-En datasets can be obtained following [this repo](https://github.com/lilt/alignment-scripts), the Ja-En data is from [this link](http://www.phontron.com/kftt/) and the Zh-En data is available at [this link](http://nlp.csai.tsinghua.edu.cn/~ly/systems/TsinghuaAligner/TsinghuaAligner.html). The best scores are in **bold**.

|            | De-En | Fr-En | Ro-En | Ja-En | Zh-En |
| -| ------- | ------- | ------- | ------- | ------- | 
| [fast\_align](https://github.com/clab/fast_align) | 27.0 | 10.5 | 32.1 | 51.1 | 38.1 |
| [eflomal](https://github.com/robertostling/eflomal) | 22.6 | 8.2 | 25.1 | 47.5 | 28.7 |
| [Mgiza](https://github.com/moses-smt/mgiza)    | 20.6 | 5.9 | 26.4 | 48.0 | 35.1 |
| Ours (w/o fine-tuning, softmax) | 17.4 | 5.6 | 27.9 | 45.6 | 18.1 |
| Ours (multilingually fine-tuned <br/>  w/o `--train_co`, softmax) [[Download]](https://drive.google.com/file/d/1IcQx6t5qtv4bdcGjjVCwXnRkpr67eisJ/view?usp=sharing) | 15.2 | **4.1** | 22.6 | **37.4** | **13.4** |
| Ours (multilingually fine-tuned <br/>  w/ `--train_co`, softmax) [[Download]](https://drive.google.com/file/d/1IluQED1jb0rjITJtyj4lNMPmaRFyMslg/view?usp=sharing) |  **15.1** | 4.5 | **20.7** | 38.4 | 14.5 |


### Citation

If you use our tool, we'd appreciate if you cite the following paper:

```
@inproceedings{dou2021word,
  title={Word Alignment by Fine-tuning Embeddings on Parallel Corpora},
  author={Dou, Zi-Yi and Neubig, Graham},
  booktitle={Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
  year={2021}
}
```


### Acknowledgements

Some of the code is borrowed from [HuggingFace Transformers](https://github.com/huggingface/transformers) licensed under [Apache 2.0](https://github.com/huggingface/transformers/blob/master/LICENSE) and the entmax implementation is from [this repo](https://github.com/deep-spin/entmax).


