|
Evaluating Pre-trained Models
=============================
|
|
|
First, download a pre-trained model along with its vocabularies: |
|
|
|
.. code-block:: console |
|
|
|
> curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf - |
|
|
|
This model uses a `Byte Pair Encoding (BPE)
vocabulary <https://arxiv.org/abs/1508.07909>`__, so we'll have to apply
the encoding to the source text before it can be translated. This can be
done with the
`apply\_bpe.py <https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/apply_bpe.py>`__
script using the ``wmt14.en-fr.fconv-py/bpecodes`` file. ``@@`` is
used as a continuation marker and the original text can be easily
recovered with e.g. ``sed s/@@ //g`` or by passing the ``--remove-bpe``
flag to :ref:`fairseq-generate`. Prior to BPE, input text needs to be tokenized
using ``tokenizer.perl`` from
`mosesdecoder <https://github.com/moses-smt/mosesdecoder>`__.
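
As a concrete sketch, assuming local checkouts of ``mosesdecoder`` and
``subword-nmt`` next to the extracted model directory (the paths below are
illustrative, not part of the download), the manual pre-processing pipeline
for a single sentence looks roughly like this:

.. code-block:: console

    > echo "Why is it rare to discover new marine mammal species?" \
        | perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
        | python subword-nmt/subword_nmt/apply_bpe.py -c wmt14.en-fr.fconv-py/bpecodes
    Why is it rare to discover new marine mam@@ mal species ?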
|
|
|
Let's use :ref:`fairseq-interactive` to generate translations interactively.
Here, we use a beam size of 5 and preprocess the input with the Moses
tokenizer and the given Byte-Pair Encoding vocabulary. It will automatically
remove the BPE continuation markers and detokenize the output.
|
|
|
.. code-block:: console |
|
|
|
    > MODEL_DIR=wmt14.en-fr.fconv-py
    > fairseq-interactive \
        --path $MODEL_DIR/model.pt $MODEL_DIR \
        --beam 5 --source-lang en --target-lang fr \
        --tokenizer moses \
        --bpe subword_nmt --bpe-codes $MODEL_DIR/bpecodes
    | loading model(s) from wmt14.en-fr.fconv-py/model.pt
    | [en] dictionary: 44206 types
    | [fr] dictionary: 44463 types
    | Type the input sentence and press return:
    Why is it rare to discover new marine mammal species?
    S-0     Why is it rare to discover new marine mam@@ mal species ?
    H-0     -0.0643349438905716     Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins?
    P-0     -0.0763 -0.1849 -0.0956 -0.0946 -0.0735 -0.1150 -0.1301 -0.0042 -0.0321 -0.0171 -0.0052 -0.0062 -0.0015
|
|
|
This generation script produces three types of outputs: a line prefixed
with *S* shows the supplied source sentence after pre-processing; *H* is the
hypothesis along with an average log-likelihood; and *P* is the
positional score per token position, including the
end-of-sentence marker which is omitted from the text.
|
|
|
Other types of output lines you might see are *D*, the detokenized hypothesis;
*T*, the reference target; *A*, alignment info; and *E*, the history of generation steps.
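
When generation output is redirected to a file, these prefixes make it easy to
pull out individual fields with standard shell tools. As a small sketch
(``gen.out`` is a hypothetical output file; fields are tab-separated):

.. code-block:: console

    > fairseq-generate (...) > gen.out
    > grep ^H- gen.out | cut -f3 > gen.out.sys   # hypothesis text only
    > grep ^T- gen.out | cut -f2 > gen.out.ref   # reference text only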
|
|
|
See the `README <https://github.com/pytorch/fairseq#pre-trained-models>`__ for a
full list of available pre-trained models.
|
|
|
Training a New Model
====================
|
|
|
The following tutorial is for machine translation. For an example of how
to use Fairseq for other tasks, such as :ref:`language modeling`, please see the
``examples/`` directory.
|
|
|
Data Pre-processing
-------------------
|
Fairseq contains example pre-processing scripts for several translation
datasets: IWSLT 2014 (German-English), WMT 2014 (English-French) and WMT
2014 (English-German). To pre-process and binarize the IWSLT dataset:
|
|
|
.. code-block:: console |
|
|
|
    > cd examples/translation/
    > bash prepare-iwslt14.sh
    > cd ../..
    > TEXT=examples/translation/iwslt14.tokenized.de-en
    > fairseq-preprocess --source-lang de --target-lang en \
        --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
        --destdir data-bin/iwslt14.tokenized.de-en
|
This will write binarized data that can be used for model training to
``data-bin/iwslt14.tokenized.de-en``.
|
|
|
Training
--------
|
Use :ref:`fairseq-train` to train a new model. Here are a few example settings
that work well for the IWSLT 2014 dataset:
|
|
|
.. code-block:: console |
|
|
|
    > mkdir -p checkpoints/fconv
    > CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt14.tokenized.de-en \
        --optimizer nag --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
        --arch fconv_iwslt_de_en --save-dir checkpoints/fconv
|
By default, :ref:`fairseq-train` will use all available GPUs on your machine. Use the
``CUDA_VISIBLE_DEVICES`` environment variable to select specific GPUs and/or to
change the number of GPU devices that will be used.

Also note that the batch size is specified in terms of the maximum
number of tokens per batch (``--max-tokens``). You may need to use a
smaller value depending on the available GPU memory on your system.
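
For example, a variation of the command above that trains on two specific GPUs
with a smaller per-batch token budget might look like this (the exact
``--max-tokens`` value is illustrative and depends on your GPU memory):

.. code-block:: console

    > CUDA_VISIBLE_DEVICES=0,1 fairseq-train data-bin/iwslt14.tokenized.de-en \
        --optimizer nag --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 2000 \
        --arch fconv_iwslt_de_en --save-dir checkpoints/fconv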
|
|
|
Generation
----------
|
Once your model is trained, you can generate translations using
:ref:`fairseq-generate` **(for binarized data)** or
:ref:`fairseq-interactive` **(for raw text)**:
|
|
|
.. code-block:: console |
|
|
|
    > fairseq-generate data-bin/iwslt14.tokenized.de-en \
        --path checkpoints/fconv/checkpoint_best.pt \
        --batch-size 128 --beam 5
    | [de] dictionary: 35475 types
    | [en] dictionary: 24739 types
    | data-bin/iwslt14.tokenized.de-en test 6750 examples
    | model fconv
    | loaded checkpoint checkpoints/fconv/checkpoint_best.pt
    S-721   danke .
    T-721   thank you .
    ...
|
|
|
To generate translations with only a CPU, use the ``--cpu`` flag. BPE
continuation markers can be removed with the ``--remove-bpe`` flag.
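
Putting these together, a CPU-only variant of the generation command above
that also strips the BPE markers might look like this (a sketch; expect it to
be considerably slower than GPU generation):

.. code-block:: console

    > fairseq-generate data-bin/iwslt14.tokenized.de-en \
        --path checkpoints/fconv/checkpoint_best.pt \
        --batch-size 128 --beam 5 --cpu --remove-bpe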
|
|
|
Advanced Training Options
=========================
|
|
|
Large mini-batch training with delayed updates
----------------------------------------------
|
The ``--update-freq`` option can be used to accumulate gradients from
multiple mini-batches and delay updating, creating a larger effective
batch size. Delayed updates can also improve training speed by reducing
inter-GPU communication costs and by saving idle time caused by variance
in workload across GPUs. See `Ott et al.
(2018) <https://arxiv.org/abs/1806.00187>`__ for more details.
|
|
|
To train on a single GPU with an effective batch size that is equivalent
to training on 8 GPUs:
|
|
|
.. code-block:: console |
|
|
|
    > CUDA_VISIBLE_DEVICES=0 fairseq-train --update-freq 8 (...)
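
The effective batch size scales with both the number of GPUs and
``--update-freq``, so the same target batch size can be reached on two local
GPUs by halving the delay (a sketch; other options elided as above):

.. code-block:: console

    > CUDA_VISIBLE_DEVICES=0,1 fairseq-train --update-freq 4 (...)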
|
|
|
Training with half precision floating point (FP16)
---------------------------------------------------
|
.. note::

    FP16 training requires a Volta GPU and CUDA 9.1 or greater
|
|
|
Recent GPUs enable efficient half precision floating point computation,
e.g., using `Nvidia Tensor Cores
<https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html>`__.
Fairseq supports FP16 training with the ``--fp16`` flag:
|
|
|
.. code-block:: console |
|
|
|
    > fairseq-train --fp16 (...)
|
|
|
Distributed training
--------------------
|
Distributed training in fairseq is implemented on top of ``torch.distributed``.
The easiest way to launch jobs is with the `torch.distributed.launch
<https://pytorch.org/docs/stable/distributed.html#launch-utility>`__ tool.
|
|
|
For example, to train a large English-German Transformer model on 2 nodes each
with 8 GPUs (in total 16 GPUs), run the following command on each node,
replacing ``node_rank=0`` with ``node_rank=1`` on the second node and making
sure to update ``--master_addr`` to the IP address of the first node:
|
|
|
.. code-block:: console |
|
|
|
    > python -m torch.distributed.launch --nproc_per_node=8 \
        --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
        --master_port=12345 \
        $(which fairseq-train) data-bin/wmt16_en_de_bpe32k \
        --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
        --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
        --lr 0.0005 \
        --dropout 0.3 --weight-decay 0.0 \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --max-tokens 3584 \
        --fp16
|
On SLURM clusters, fairseq will automatically detect the number of nodes and
GPUs, but a port number must be provided:
|
|
|
.. code-block:: console |
|
|
|
    > salloc --gpus=16 --nodes 2 (...)
    > srun fairseq-train --distributed-port 12345 (...)
|
|
|
Sharding very large datasets
----------------------------
|
It can be challenging to train over very large datasets, particularly if your
machine does not have much system RAM. Most tasks in fairseq support training
over "sharded" datasets, in which the original dataset has been preprocessed
into non-overlapping chunks (or "shards").
|
|
|
For example, instead of preprocessing all your data into a single "data-bin"
directory, you can split the data and create "data-bin1", "data-bin2", etc.
Then you can adapt your training command like so:
|
|
|
.. code-block:: console |
|
|
|
> fairseq-train data-bin1:data-bin2:data-bin3 (...) |
|
|
|
Training will now iterate over each shard, one by one, with each shard
corresponding to an "epoch", thus reducing system memory usage.
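
As a rough sketch of how such shards can be built (file names, shard sizes,
and the use of ``split`` are illustrative rather than part of fairseq), split
the raw, pre-tokenized training files at identical line boundaries and run
:ref:`fairseq-preprocess` once per chunk, reusing the first shard's
dictionaries so that token IDs stay consistent across shards:

.. code-block:: console

    # Split both sides every 1M lines so sentence pairs stay aligned (hypothetical paths).
    > split -l 1000000 -d --additional-suffix .de train.de shard
    > split -l 1000000 -d --additional-suffix .en train.en shard
    # Binarize each shard; in practice you may prefer to build the dictionaries
    # over the full corpus first and pass them to every shard.
    > fairseq-preprocess --source-lang de --target-lang en \
        --trainpref shard00 --destdir data-bin1
    > fairseq-preprocess --source-lang de --target-lang en \
        --trainpref shard01 --destdir data-bin2 \
        --srcdict data-bin1/dict.de.txt --tgtdict data-bin1/dict.en.txt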
|
|