Transformer with Pointer-Generator Network
This page describes the transformer_pointer_generator model, which adds a
pointing mechanism to the Transformer model to facilitate copying of input
words to the output. This architecture is described in Enarvi et al. (2020).
Background
The pointer-generator network was introduced in See et al. (2017) for RNN encoder-decoder attention models. A similar mechanism can be incorporated in a Transformer model by reusing one of the many attention distributions for pointing. The attention distribution over the input words is interpolated with the normal output distribution over the vocabulary words. This allows the model to generate words that appear in the input, even if they don't appear in the vocabulary, helping especially with small vocabularies.
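The interpolation described above can be sketched in a few lines of plain Python. This is only an illustration, not the Fairseq code: the function name, the argument names, and the toy numbers are assumptions made for the example. The mixing weight is called p_gen here, following the terminology of See et al. (2017).

```python
def mix_distributions(p_gen, vocab_probs, attention, source_ids):
    """Interpolate a vocabulary distribution with a copy distribution.

    p_gen       -- weight given to generating from the vocabulary (0..1)
    vocab_probs -- probabilities over vocabulary indices (sums to 1)
    attention   -- attention weights over source positions (sums to 1)
    source_ids  -- vocabulary index of the token at each source position
    """
    # Scale the normal output distribution by p_gen.
    mixed = [p_gen * p for p in vocab_probs]
    # Add the copy probability of each source position to the
    # vocabulary entry of the token found at that position.
    for weight, idx in zip(attention, source_ids):
        mixed[idx] += (1.0 - p_gen) * weight
    return mixed


# Toy example: a four-word vocabulary and a two-word input.
vocab_probs = [0.1, 0.2, 0.3, 0.4]
attention = [0.5, 0.5]
source_ids = [3, 0]
mixed = mix_distributions(0.75, vocab_probs, attention, source_ids)
print(mixed)
```

Because both inputs are probability distributions and the weights p_gen and 1 - p_gen sum to one, the result is again a probability distribution. Words that occur in the input receive extra probability mass.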
Implementation
The mechanism for copying out-of-vocabulary words from the input is implemented differently from See et al. (2017). Their implementation conveys the word identities through the model so that it can produce words that appear in the input sequence but not in the vocabulary. The Fairseq implementation takes a different approach to keep the mechanism self-contained in the model file and avoid changes to the rest of the code base: out-of-vocabulary words are copied by pre-processing the input and post-processing the output. This is described in detail in the next section.
Usage
The training and evaluation procedure is outlined below. You can also find a more detailed example for the XSum dataset on this page.
1. Create a vocabulary and extend it with source position markers
The pointing mechanism is especially helpful with small vocabularies, if we are
able to recover the identities of any out-of-vocabulary words that are copied
from the input. For this purpose, the model allows extending the vocabulary with
special tokens that can be used in place of <unk> tokens to identify different
input positions. For example, the user may add <unk-0>, <unk-1>, <unk-2>, etc.
to the end of the vocabulary, after the normal words. Below is an example of how
to create a vocabulary of the 10000 most common words and add 1000 input
position markers.
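vocab_size=10000
position_markers=1000
export LC_ALL=C
# Count the words in the training data and keep the most frequent ones.
# Four entries are left free for the special symbols that fairseq adds
# to every dictionary.
cat train.src train.tgt |
  tr -s '[:space:]' '\n' |
  sort |
  uniq -c |
  sort -k1,1bnr -k2 |
  head -n "$((vocab_size - 4))" |
  awk '{ print $2 " " $1 }' >dict.pg.txt
# Append the source position markers <unk-0>, <unk-1>, ... with a count of 0.
python3 -c "for n in range($position_markers): print('<unk-{}> 0'.format(n))" >>dict.pg.txt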
2. Preprocess the text data
The idea is that each <unk> token in the text is replaced with <unk-0> if it
appears in the first input position, <unk-1> if it appears in the second input
position, and so on. This can be achieved using the preprocess.py script that is
provided in this directory.
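The replacement rule for the source side can be illustrated with a minimal sketch. This is not the preprocess.py script itself: the function name `mark_source_oovs`, its parameters, and the fallback to a plain <unk> once the markers run out are assumptions made for this example.

```python
def mark_source_oovs(tokens, vocab, unk="<unk>", max_markers=1000):
    """Replace out-of-vocabulary tokens with position markers.

    A token that is not in `vocab` and sits at 0-based position N
    becomes "<unk-N>". Positions beyond `max_markers` fall back to
    the plain unknown token (an assumption of this sketch).
    """
    marked = []
    for position, token in enumerate(tokens):
        if token in vocab:
            marked.append(token)
        elif position < max_markers:
            marked.append("<unk-{}>".format(position))
        else:
            marked.append(unk)
    return marked


vocab = {"the", "cat", "sat", "on", "mat"}
source = "the cat sat on the zyzzyva mat".split()
marked = mark_source_oovs(source, vocab)
print(" ".join(marked))  # the cat sat on the <unk-5> mat
```

The marker keeps only the position, not the word, so the unprocessed input text must be kept in order to recover the word later.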
3. Train a model
The number of these special tokens is given to the model with the
--source-position-markers argument. The model simply maps all of these tokens to
the same word embedding as <unk>.
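A minimal sketch of this mapping, assuming the position markers occupy the last entries of the vocabulary as in step 1. The function name and the index values are hypothetical, and the actual model code may be organised differently.

```python
def map_markers_to_unk(token_ids, num_regular_tokens, unk_id):
    """Map every position-marker index to the index of <unk>.

    Indices below `num_regular_tokens` belong to normal words and are
    kept. Higher indices are assumed to be position markers.
    """
    return [unk_id if i >= num_regular_tokens else i for i in token_ids]


# Hypothetical layout: 10000 regular entries, markers from index 10000 on,
# and <unk> stored at index 3.
ids = [17, 10005, 42, 10000]
mapped = map_markers_to_unk(ids, num_regular_tokens=10000, unk_id=3)
print(mapped)  # [17, 3, 42, 3]
```

With such a mapping the embedding table does not need separate rows for the position markers. The markers only carry position information through pre-processing and post-processing.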
The attention distribution that is used for pointing is selected using the
--alignment-heads and --alignment-layer command-line arguments in the same way
as with the transformer_align model.
4. Generate text and postprocess it
When using the model to generate text, you want to preprocess the input text in
the same way that training data was processed, replacing out-of-vocabulary words
with <unk-N> tokens. If any of these tokens are copied to the output, the actual
words can be retrieved from the unprocessed input text. Any <unk-N> token should
be replaced with the word at position N in the original input sequence. This can
be achieved using the postprocess.py script.
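The replacement can be sketched as follows. This is an illustration rather than the postprocess.py script: the function name `restore_copied_words` is hypothetical, and leaving a marker untouched when N is outside the input is an assumption of the sketch.

```python
import re

# Matches a complete position marker such as "<unk-5>".
MARKER = re.compile(r"<unk-(\d+)>")


def restore_copied_words(output_tokens, source_tokens):
    """Replace each <unk-N> in the output with source word number N.

    `source_tokens` must come from the unprocessed input text, and N
    is a 0-based position. Markers that point outside the input are
    left as they are (an assumption of this sketch).
    """
    restored = []
    for token in output_tokens:
        match = MARKER.fullmatch(token)
        if match and int(match.group(1)) < len(source_tokens):
            restored.append(source_tokens[int(match.group(1))])
        else:
            restored.append(token)
    return restored


source = "the cat sat on the zyzzyva mat".split()
output = "a <unk-5> sat".split()
restored = restore_copied_words(output, source)
print(" ".join(restored))  # a zyzzyva sat
```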