|
# Pay Less Attention with Lightweight and Dynamic Convolutions (Wu et al., 2019) |
|
|
|
This page contains pointers to pre-trained models as well as instructions on how to train new models for [our paper](https://arxiv.org/abs/1901.10430). |
|
|
|
## Citation: |
|
```bibtex
@inproceedings{wu2019pay,
  title = {Pay Less Attention with Lightweight and Dynamic Convolutions},
  author = {Felix Wu and Angela Fan and Alexei Baevski and Yann Dauphin and Michael Auli},
  booktitle = {International Conference on Learning Representations},
  year = {2019},
  url = {https://arxiv.org/abs/1901.10430},
}
```
|
|
|
## Translation |
|
|
|
### Pre-trained models |
|
For some datasets we release models without GLUs, which are faster at inference.
|
|
|
Model | Description | Dataset | Download
---|---|---|---
`lightconv.no_glu.iwslt14.de-en` | LightConv (without GLUs) | [IWSLT14 German-English](https://wit3.fbk.eu/archive/2014-01/texts/de/en/de-en.tgz) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/dynamicconv/iwslt14.de-en.lightconv.tar.gz) <br> IWSLT14 test: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/iwslt14.de-en.test.tar.bz2)
`dynamicconv.no_glu.iwslt14.de-en` | DynamicConv (without GLUs) | [IWSLT14 German-English](https://wit3.fbk.eu/archive/2014-01/texts/de/en/de-en.tgz) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/dynamicconv/iwslt14.de-en.dynamicconv.tar.gz) <br> IWSLT14 test: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/iwslt14.de-en.test.tar.bz2)
`lightconv.no_glu.wmt16.en-de` | LightConv (without GLUs) | [WMT16 English-German](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/dynamicconv/wmt16.en-de.joined-dict.lightconv.tar.gz) <br> newstest2014 (shared vocab): <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt16.en-de.joined-dict.newstest2014.tar.bz2)
`dynamicconv.no_glu.wmt16.en-de` | DynamicConv (without GLUs) | [WMT16 English-German](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/dynamicconv/wmt16.en-de.joined-dict.dynamicconv.tar.gz) <br> newstest2014 (shared vocab): <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt16.en-de.joined-dict.newstest2014.tar.bz2)
`lightconv.glu.wmt16.en-de` | LightConv | [WMT16 English-German](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/dynamicconv/wmt16.en-de.joined-dict.lightconv-glu.tar.gz) <br> newstest2014 (shared vocab): <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt16.en-de.joined-dict.newstest2014.tar.bz2)
`dynamicconv.glu.wmt16.en-de` | DynamicConv | [WMT16 English-German](https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/dynamicconv/wmt16.en-de.joined-dict.dynamicconv-glu.tar.gz) <br> newstest2014 (shared vocab): <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt16.en-de.joined-dict.newstest2014.tar.bz2)
`lightconv.glu.wmt14.en-fr` | LightConv | [WMT14 English-French](http://statmt.org/wmt14/translation-task.html#Download) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/dynamicconv/wmt14.en-fr.joined-dict.lightconv-glu.tar.gz) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.en-fr.joined-dict.newstest2014.tar.bz2)
`dynamicconv.glu.wmt14.en-fr` | DynamicConv | [WMT14 English-French](http://statmt.org/wmt14/translation-task.html#Download) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/dynamicconv/wmt14.en-fr.joined-dict.dynamicconv-glu.tar.gz) <br> newstest2014: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt14.en-fr.joined-dict.newstest2014.tar.bz2)
`lightconv.glu.wmt17.zh-en` | LightConv | [WMT17 Chinese-English](http://statmt.org/wmt17/translation-task.html#Download) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/dynamicconv/wmt17.zh-en.lightconv-glu.tar.gz) <br> newstest2017: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt17.zh-en.newstest2017.tar.bz2)
`dynamicconv.glu.wmt17.zh-en` | DynamicConv | [WMT17 Chinese-English](http://statmt.org/wmt17/translation-task.html#Download) | model: <br> [download (.tar.gz)](https://dl.fbaipublicfiles.com/fairseq/models/dynamicconv/wmt17.zh-en.dynamicconv-glu.tar.gz) <br> newstest2017: <br> [download (.tar.bz2)](https://dl.fbaipublicfiles.com/fairseq/data/wmt17.zh-en.newstest2017.tar.bz2)
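
For example, to try one of the released WMT16 En-De models, the sketch below downloads the DynamicConv (with GLUs) checkpoint and the pre-processed newstest2014 set and runs generation. The archive layout is an assumption (a `model.pt` checkpoint and a binarized test-set directory matching the evaluation commands further down); adjust the paths to whatever the archives actually contain.

```sh
# Sketch: download and evaluate a released model. Paths inside the archives
# (model.pt, data-bin/wmt16.en-de.joined-dict.newstest2014) are assumptions.
mkdir -p data-bin
curl -O https://dl.fbaipublicfiles.com/fairseq/models/dynamicconv/wmt16.en-de.joined-dict.dynamicconv-glu.tar.gz
curl -O https://dl.fbaipublicfiles.com/fairseq/data/wmt16.en-de.joined-dict.newstest2014.tar.bz2
tar xzf wmt16.en-de.joined-dict.dynamicconv-glu.tar.gz
tar xjf wmt16.en-de.joined-dict.newstest2014.tar.bz2 -C data-bin
fairseq-generate data-bin/wmt16.en-de.joined-dict.newstest2014 \
    --path model.pt --batch-size 128 --beam 5 --remove-bpe --lenpen 0.5
```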
|
|
|
### Memory-Efficient CUDA Kernels |
|
|
|
Since the PyTorch implementations of LightConv/DynamicConv are quite memory-intensive, we have developed CUDA kernels that implement the lightweight and dynamic convolution operators in a memory-efficient and performant manner. For large sequence lengths, these kernels save about 50% memory compared to the PyTorch equivalent.
|
|
|
To install the kernels, use the commands below. Once installed, they will automatically be used in place of the PyTorch implementations whenever a light or dynamic convolution is used. |
|
|
|
```sh
# to install lightconv
cd fairseq/modules/lightconv_layer
python cuda_function_gen.py
python setup.py install

# to install dynamicconv (relative to lightconv_layer above)
cd ../dynamicconv_layer
python cuda_function_gen.py
python setup.py install
```
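
To confirm the build succeeded, a quick check is to import the compiled extensions directly (a sketch, assuming they are installed under the names used by the setup scripts, `lightconv_cuda` and `dynamicconv_cuda`):

```sh
# If these imports succeed, the compiled kernels are available and fairseq will
# use them automatically; an ImportError means the PyTorch fallback is used.
python -c "import lightconv_cuda" && echo "lightconv_cuda OK"
python -c "import dynamicconv_cuda" && echo "dynamicconv_cuda OK"
```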
|
|
|
### Example usage (torch.hub) |
|
|
|
We require a few additional Python dependencies for preprocessing: |
|
```bash
pip install sacremoses subword_nmt
```
|
|
|
Interactive translation via PyTorch Hub: |
|
```python
import torch

# List available models
torch.hub.list('pytorch/fairseq')  # [..., 'lightconv.glu.wmt17.zh-en', ... ]

# Load the LightConv model trained on WMT'17 Zh-En
zh2en = torch.hub.load('pytorch/fairseq', 'lightconv.glu.wmt17.zh-en', tokenizer='moses', bpe='subword_nmt')

# The underlying model is available under the *models* attribute
import fairseq
assert isinstance(zh2en.models[0], fairseq.models.lightconv.LightConvModel)

# Translate a sentence
zh2en.translate('你好 世界')
# 'Hello World'
```
|
|
|
Loading custom models: |
|
```python
from fairseq.models.lightconv import LightConvModel

en2fr = LightConvModel.from_pretrained(
    '/path/to/checkpoints',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='data-bin/wmt14_en_fr',
    bpe='subword_nmt',
    bpe_codes='data-bin/wmt14_en_fr/en.code'
)
en2fr.translate('Hello world!')
# 'Bonjour le monde'
```
|
|
|
### Preprocessing the training datasets |
|
|
|
Please follow the instructions in [`examples/translation/README.md`](../translation/README.md) to preprocess the data. |
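
For example, for IWSLT14 De-En the final binarization step looks roughly like the following (a sketch, assuming `prepare-iwslt14.sh` from `examples/translation` has already produced `iwslt14.tokenized.de-en/`):

```sh
# Binarize the tokenized/BPE data into data-bin/iwslt14.tokenized.de-en,
# which the training commands below expect.
TEXT=examples/translation/iwslt14.tokenized.de-en
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/iwslt14.tokenized.de-en --workers 20
```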
|
|
|
### Training and evaluation options: |
|
To use the model without GLU, please set `--encoder-glu 0 --decoder-glu 0`. |
|
For LightConv, please use `--encoder-conv-type lightweight --decoder-conv-type lightweight`, otherwise the default is DynamicConv. |
|
For best BLEU results, the length penalty (`--lenpen`) may need to be tuned manually; a simple sweep is sketched below.
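
A minimal sweep over `--lenpen` values on the validation set might look like this (a sketch, assuming an IWSLT14 De-En checkpoint trained and averaged as in the recipe below, with `$SAVE` pointing at its save directory):

```sh
# Generate on the validation set with several length penalties and compare the
# reported BLEU scores before picking a value for the test set.
for LENPEN in 0.5 0.7 0.9 1.1 1.3; do
    echo "lenpen=${LENPEN}"
    CUDA_VISIBLE_DEVICES=0 fairseq-generate data-bin/iwslt14.tokenized.de-en \
        --path "${SAVE}/checkpoint_last10_avg.pt" --batch-size 128 --beam 4 \
        --remove-bpe --lenpen $LENPEN --gen-subset valid --quiet
done
```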
|
|
|
To use the CUDA kernels, first install them using the commands above. Once installed, they will automatically be used instead of the PyTorch implementations.
|
|
|
### IWSLT14 De-En |
|
Training and evaluating DynamicConv (without GLU) on a GPU: |
|
```sh
# Training
SAVE="save/dynamic_conv_iwslt"
mkdir -p $SAVE
CUDA_VISIBLE_DEVICES=0 $(which fairseq-train) data-bin/iwslt14.tokenized.de-en \
    --clip-norm 0 --optimizer adam --lr 0.0005 \
    --source-lang de --target-lang en --max-tokens 4000 --no-progress-bar \
    --log-interval 100 --stop-min-lr '1e-09' --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --lr-scheduler inverse_sqrt \
    --ddp-backend=legacy_ddp \
    --max-update 50000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
    --adam-betas '(0.9, 0.98)' --keep-last-epochs 10 \
    -a lightconv_iwslt_de_en --save-dir $SAVE \
    --dropout 0.3 --attention-dropout 0.1 --weight-dropout 0.1 \
    --encoder-glu 0 --decoder-glu 0
python scripts/average_checkpoints.py --inputs $SAVE \
    --num-epoch-checkpoints 10 --output "${SAVE}/checkpoint_last10_avg.pt"

# Evaluation
CUDA_VISIBLE_DEVICES=0 fairseq-generate data-bin/iwslt14.tokenized.de-en --path "${SAVE}/checkpoint_last10_avg.pt" --batch-size 128 --beam 4 --remove-bpe --lenpen 1 --gen-subset test --quiet
```
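
To try the trained model interactively, something like the following should work (a sketch; input on stdin is expected to be tokenized, BPE-encoded German matching the training preprocessing):

```sh
# Decode sentences typed on stdin with the averaged checkpoint.
CUDA_VISIBLE_DEVICES=0 fairseq-interactive data-bin/iwslt14.tokenized.de-en \
    --path "${SAVE}/checkpoint_last10_avg.pt" --beam 4 --remove-bpe
```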
|
|
|
### WMT16 En-De |
|
Training and evaluating DynamicConv (with GLU) on WMT16 En-De using the cosine scheduler on one machine with 8 V100 GPUs:
|
```sh
# Training
SAVE="save/dynamic_conv_wmt16en2de"
mkdir -p $SAVE
python -m torch.distributed.launch --nproc_per_node 8 $(which fairseq-train) \
    data-bin/wmt16_en_de_bpe32k --fp16 --log-interval 100 --no-progress-bar \
    --max-update 30000 --share-all-embeddings --optimizer adam \
    --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --stop-min-lr 1e-09 --update-freq 16 --attention-dropout 0.1 --keep-last-epochs 10 \
    --ddp-backend=legacy_ddp --max-tokens 3584 \
    --lr-scheduler cosine --warmup-init-lr 1e-7 --warmup-updates 10000 \
    --lr-shrink 1 --lr 0.001 --min-lr 1e-7 --warmup-init-lr 1e-07 \
    --t-mult 1 --lr-period-updates 20000 \
    --arch lightconv_wmt_en_de_big --save-dir $SAVE \
    --dropout 0.3 --attention-dropout 0.1 --weight-dropout 0.1 \
    --encoder-glu 1 --decoder-glu 1

# Evaluation
CUDA_VISIBLE_DEVICES=0 fairseq-generate data-bin/wmt16.en-de.joined-dict.newstest2014 --path "${SAVE}/checkpoint_best.pt" --batch-size 128 --beam 5 --remove-bpe --lenpen 0.5 --gen-subset test > wmt16_gen.txt
bash scripts/compound_split_bleu.sh wmt16_gen.txt
```
|
|
|
### WMT14 En-Fr |
|
Training DynamicConv (with GLU) on WMT14 En-Fr using the cosine scheduler on one machine with 8 V100 GPUs:
|
```sh
# Training
SAVE="save/dynamic_conv_wmt14en2fr"
mkdir -p $SAVE
python -m torch.distributed.launch --nproc_per_node 8 $(which fairseq-train) \
    data-bin/wmt14_en_fr --fp16 --log-interval 100 --no-progress-bar \
    --max-update 30000 --share-all-embeddings --optimizer adam \
    --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --stop-min-lr 1e-09 --update-freq 16 --attention-dropout 0.1 --keep-last-epochs 10 \
    --ddp-backend=legacy_ddp --max-tokens 3584 \
    --lr-scheduler cosine --warmup-init-lr 1e-7 --warmup-updates 10000 \
    --lr-shrink 1 --lr 0.001 --min-lr 1e-7 --warmup-init-lr 1e-07 \
    --t-mult 1 --lr-period-updates 70000 \
    --arch lightconv_wmt_en_fr_big --save-dir $SAVE \
    --dropout 0.1 --attention-dropout 0.1 --weight-dropout 0.1 \
    --encoder-glu 1 --decoder-glu 1

# Evaluation
CUDA_VISIBLE_DEVICES=0 fairseq-generate data-bin/wmt14.en-fr.joined-dict.newstest2014 --path "${SAVE}/checkpoint_best.pt" --batch-size 128 --beam 5 --remove-bpe --lenpen 0.9 --gen-subset test
```
|
|