|
Evaluating Pre-trained Models
=============================
|
|
|
First, download a pre-trained model along with its vocabularies: |
|
|
|
.. code-block:: console |
|
|
|
> curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf - |
|
|
|
This model uses a `Byte Pair Encoding (BPE)
vocabulary <https://arxiv.org/abs/1508.07909>`__, so we'll have to apply
the encoding to the source text before it can be translated. This can be
done with the
`apply\_bpe.py <https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/apply_bpe.py>`__
script using the ``wmt14.en-fr.fconv-py/bpecodes`` file. ``@@`` is
used as a continuation marker and the original text can be easily
recovered with e.g. ``sed s/@@ //g`` or by passing the ``--remove-bpe``
flag to :ref:`fairseq-generate`. Prior to BPE, input text needs to be tokenized
using ``tokenizer.perl`` from
`mosesdecoder <https://github.com/moses-smt/mosesdecoder>`__.
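
As a concrete sketch, assuming local checkouts of ``mosesdecoder`` and
``subword-nmt`` next to the extracted model directory (the paths below are
illustrative, not part of the download), the manual pre-processing pipeline
for a single sentence looks roughly like this:

.. code-block:: console

    > echo "Why is it rare to discover new marine mammal species?" \
        | perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
        | python subword-nmt/subword_nmt/apply_bpe.py -c wmt14.en-fr.fconv-py/bpecodes
    Why is it rare to discover new marine mam@@ mal species ?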
|
|
|
Let's use :ref:`fairseq-interactive` to generate translations interactively.
Here, we use a beam size of 5 and preprocess the input with the Moses
tokenizer and the given Byte-Pair Encoding vocabulary. It will automatically
remove the BPE continuation markers and detokenize the output.
|
|
|
.. code-block:: console |
|
|
|
    > MODEL_DIR=wmt14.en-fr.fconv-py
    > fairseq-interactive \
        --path $MODEL_DIR/model.pt $MODEL_DIR \
        --beam 5 --source-lang en --target-lang fr \
        --tokenizer moses \
        --bpe subword_nmt --bpe-codes $MODEL_DIR/bpecodes
    | loading model(s) from wmt14.en-fr.fconv-py/model.pt
    | [en] dictionary: 44206 types
    | [fr] dictionary: 44463 types
    | Type the input sentence and press return:
    Why is it rare to discover new marine mammal species?
    S-0     Why is it rare to discover new marine mam@@ mal species ?
    H-0     -0.0643349438905716     Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins?
    P-0     -0.0763 -0.1849 -0.0956 -0.0946 -0.0735 -0.1150 -0.1301 -0.0042 -0.0321 -0.0171 -0.0052 -0.0062 -0.0015
|
|
|
This generation script produces three types of outputs: a line prefixed
with *S* shows the supplied source sentence after pre-processing; *H* is the
hypothesis along with an average log-likelihood; and *P* is the
positional score per token position, including the
end-of-sentence marker which is omitted from the text.
|
|
|
Other types of output lines you might see are *D*, the detokenized hypothesis;
*T*, the reference target; *A*, alignment info; and *E*, the history of generation steps.
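
When generation output is redirected to a file, these prefixes make it easy to
pull out individual fields with standard shell tools. As a small sketch
(``gen.out`` is a hypothetical output file; fields are tab-separated):

.. code-block:: console

    > fairseq-generate (...) > gen.out
    > grep ^H- gen.out | cut -f3 > gen.out.sys   # hypothesis text only
    > grep ^T- gen.out | cut -f2 > gen.out.ref   # reference text only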
|
|
|
See the `README <https://github.com/pytorch/fairseq#pre-trained-models>`__ for a
full list of available pre-trained models.
|
|
|
Training a New Model
====================
|
|
|
The following tutorial is for machine translation. For an example of how
to use Fairseq for other tasks, such as :ref:`language modeling`, please see the
``examples/`` directory.
|
|
|
Data Pre-processing
-------------------
|
Fairseq contains example pre-processing scripts for several translation
datasets: IWSLT 2014 (German-English), WMT 2014 (English-French) and WMT
2014 (English-German). To pre-process and binarize the IWSLT dataset:
|
|
|
.. code-block:: console |
|
|
|
    > cd examples/translation/
    > bash prepare-iwslt14.sh
    > cd ../..
    > TEXT=examples/translation/iwslt14.tokenized.de-en
    > fairseq-preprocess --source-lang de --target-lang en \
        --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
        --destdir data-bin/iwslt14.tokenized.de-en
|
This will write binarized data that can be used for model training to
``data-bin/iwslt14.tokenized.de-en``.
|
|
|
Training
--------
|
Use :ref:`fairseq-train` to train a new model. Here are a few example settings
that work well for the IWSLT 2014 dataset:
|
|
|
.. code-block:: console |
|
|
|
    > mkdir -p checkpoints/fconv
    > CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt14.tokenized.de-en \
        --optimizer nag --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
        --arch fconv_iwslt_de_en --save-dir checkpoints/fconv
|
By default, :ref:`fairseq-train` will use all available GPUs on your machine. Use the
``CUDA_VISIBLE_DEVICES`` environment variable to select specific GPUs and/or to
change the number of GPU devices that will be used.

Also note that the batch size is specified in terms of the maximum
number of tokens per batch (``--max-tokens``). You may need to use a
smaller value depending on the available GPU memory on your system.
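
For example, a variation of the command above that trains on two specific GPUs
with a smaller per-batch token budget might look like this (the exact
``--max-tokens`` value is illustrative and depends on your GPU memory):

.. code-block:: console

    > CUDA_VISIBLE_DEVICES=0,1 fairseq-train data-bin/iwslt14.tokenized.de-en \
        --optimizer nag --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 2000 \
        --arch fconv_iwslt_de_en --save-dir checkpoints/fconv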
|
|
|
Generation
----------
|
Once your model is trained, you can generate translations using
:ref:`fairseq-generate` **(for binarized data)** or
:ref:`fairseq-interactive` **(for raw text)**:
|
|
|
.. code-block:: console |
|
|
|
    > fairseq-generate data-bin/iwslt14.tokenized.de-en \
        --path checkpoints/fconv/checkpoint_best.pt \
        --batch-size 128 --beam 5
    | [de] dictionary: 35475 types
    | [en] dictionary: 24739 types
    | data-bin/iwslt14.tokenized.de-en test 6750 examples
    | model fconv
    | loaded checkpoint checkpoints/fconv/checkpoint_best.pt
    S-721   danke .
    T-721   thank you .
    ...
|
|
|
To generate translations with only a CPU, use the ``--cpu`` flag. BPE
continuation markers can be removed with the ``--remove-bpe`` flag.
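
Putting these together, a CPU-only variant of the generation command above
that also strips the BPE markers might look like this (a sketch; expect it to
be considerably slower than GPU generation):

.. code-block:: console

    > fairseq-generate data-bin/iwslt14.tokenized.de-en \
        --path checkpoints/fconv/checkpoint_best.pt \
        --batch-size 128 --beam 5 --cpu --remove-bpe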
|
|
|
Advanced Training Options
=========================
|
|
|
Large mini-batch training with delayed updates
----------------------------------------------
|
The ``--update-freq`` option can be used to accumulate gradients from
multiple mini-batches and delay updating, creating a larger effective
batch size. Delayed updates can also improve training speed by reducing
inter-GPU communication costs and by saving idle time caused by variance
in workload across GPUs. See `Ott et al.
(2018) <https://arxiv.org/abs/1806.00187>`__ for more details.
|
|
|
To train on a single GPU with an effective batch size that is equivalent
to training on 8 GPUs:
|
|
|
.. code-block:: console |
|
|
|
    > CUDA_VISIBLE_DEVICES=0 fairseq-train --update-freq 8 (...)
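
The effective batch size scales with both the number of GPUs and
``--update-freq``, so the same target batch size can be reached on two local
GPUs by halving the delay (a sketch; other options elided as above):

.. code-block:: console

    > CUDA_VISIBLE_DEVICES=0,1 fairseq-train --update-freq 4 (...)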
|
|
|
Training with half precision floating point (FP16)
---------------------------------------------------
|
.. note::

    FP16 training requires a Volta GPU and CUDA 9.1 or greater
|
|
|
Recent GPUs enable efficient half precision floating point computation,
e.g., using `Nvidia Tensor Cores
<https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html>`__.
Fairseq supports FP16 training with the ``--fp16`` flag:
|
|
|
.. code-block:: console |
|
|
|
    > fairseq-train --fp16 (...)
|
|
|
Distributed training
--------------------
|
Distributed training in fairseq is implemented on top of ``torch.distributed``.
The easiest way to launch jobs is with the `torch.distributed.launch
<https://pytorch.org/docs/stable/distributed.html#launch-utility>`__ tool.
|
|
|
For example, to train a large English-German Transformer model on 2 nodes each
with 8 GPUs (in total 16 GPUs), run the following command on each node,
replacing ``node_rank=0`` with ``node_rank=1`` on the second node and making
sure to update ``--master_addr`` to the IP address of the first node:
|
|
|
.. code-block:: console |
|
|
|
    > python -m torch.distributed.launch --nproc_per_node=8 \
        --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
        --master_port=12345 \
        $(which fairseq-train) data-bin/wmt16_en_de_bpe32k \
        --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
        --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
        --lr 0.0005 \
        --dropout 0.3 --weight-decay 0.0 \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --max-tokens 3584 \
        --fp16
|
On SLURM clusters, fairseq will automatically detect the number of nodes and
GPUs, but a port number must be provided:
|
|
|
.. code-block:: console |
|
|
|
    > salloc --gpus=16 --nodes 2 (...)
    > srun fairseq-train --distributed-port 12345 (...)
|
|
|
Sharding very large datasets
----------------------------
|
It can be challenging to train over very large datasets, particularly if your
machine does not have much system RAM. Most tasks in fairseq support training
over "sharded" datasets, in which the original dataset has been preprocessed
into non-overlapping chunks (or "shards").
|
|
|
For example, instead of preprocessing all your data into a single "data-bin"
directory, you can split the data and create "data-bin1", "data-bin2", etc.
Then you can adapt your training command like so:
|
|
|
.. code-block:: console |
|
|
|
> fairseq-train data-bin1:data-bin2:data-bin3 (...) |
|
|
|
Training will now iterate over each shard, one by one, with each shard
corresponding to an "epoch", thus reducing system memory usage.
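
As a rough sketch of how such shards can be built (file names, shard sizes,
and the use of ``split`` are illustrative rather than part of fairseq), split
the raw, pre-tokenized training files at identical line boundaries and run
:ref:`fairseq-preprocess` once per chunk, reusing the first shard's
dictionaries so that token IDs stay consistent across shards:

.. code-block:: console

    # Split both sides every 1M lines so sentence pairs stay aligned (hypothetical paths).
    > split -l 1000000 -d --additional-suffix .de train.de shard
    > split -l 1000000 -d --additional-suffix .en train.en shard
    # Binarize each shard; in practice you may prefer to build the dictionaries
    # over the full corpus first and pass them to every shard.
    > fairseq-preprocess --source-lang de --target-lang en \
        --trainpref shard00 --destdir data-bin1
    > fairseq-preprocess --source-lang de --target-lang en \
        --trainpref shard01 --destdir data-bin2 \
        --srcdict data-bin1/dict.de.txt --tgtdict data-bin1/dict.en.txt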
|
|