|
# Transformer with Pointer-Generator Network |
|
|
|
This page describes the `transformer_pointer_generator` model, a Transformer
that incorporates a pointing mechanism to facilitate copying input words to the
output. This architecture is described in [Enarvi et al. (2020)](https://www.aclweb.org/anthology/2020.nlpmc-1.4/).
|
|
|
## Background |
|
|
|
The pointer-generator network was introduced in [See et al. (2017)](https://arxiv.org/abs/1704.04368)
for RNN encoder-decoder attention models. A similar mechanism can be
incorporated in a Transformer model by reusing one of the many attention
distributions for pointing. The attention distribution over the input words is
interpolated with the normal output distribution over the vocabulary words. This
allows the model to generate words that appear in the input, even if they don't
appear in the vocabulary, helping especially with small vocabularies.
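
Concretely, following the formulation of See et al. (2017), the final output
distribution is a mixture of the vocabulary distribution and the attention
distribution, weighted by a generation probability $p_{\mathrm{gen}}$ that is
predicted at each decoding step:

$$P(w) = p_{\mathrm{gen}} \, P_{\mathrm{vocab}}(w) + (1 - p_{\mathrm{gen}}) \sum_{i \,:\, x_i = w} a_i$$

where $a_i$ is the attention weight on source position $i$ and $x_i$ is the
source token at that position.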
|
|
|
## Implementation |
|
|
|
The mechanism for copying out-of-vocabulary words from the input has been
implemented differently from See et al. In their [implementation](https://github.com/abisee/pointer-generator)
they convey the word identities through the model in order to be able to produce
words that appear in the input sequence but not in the vocabulary. A different
approach was taken in the Fairseq implementation to keep it self-contained in
the model file, avoiding any changes to the rest of the code base. Copying
out-of-vocabulary words is made possible by pre-processing the input and
post-processing the output, as described in detail in the next section.
|
|
|
## Usage |
|
|
|
The training and evaluation procedure is outlined below. You can also find a
more detailed example for the XSum dataset on [this page](README.xsum.md).
|
|
|
##### 1. Create a vocabulary and extend it with source position markers |
|
|
|
The pointing mechanism is especially helpful with small vocabularies, if we are
able to recover the identities of any out-of-vocabulary words that are copied
from the input. For this purpose, the model allows extending the vocabulary with
special tokens that can be used in place of `<unk>` tokens to identify different
input positions. For example, the user may add `<unk-0>`, `<unk-1>`, `<unk-2>`,
etc. to the end of the vocabulary, after the normal words. Below is an example
of how to create a vocabulary of the 10000 most common words and add 1000 input
position markers.
|
|
|
```bash
vocab_size=10000
position_markers=1000
export LC_ALL=C
# Keep the vocab_size most frequent tokens; four slots are reserved for the
# special tokens (<s>, <pad>, </s>, <unk>) that fairseq adds automatically.
cat train.src train.tgt |
tr -s '[:space:]' '\n' |
sort |
uniq -c |
sort -k1,1bnr -k2 |
head -n "$((vocab_size - 4))" |
awk '{ print $2 " " $1 }' >dict.pg.txt
# Append the source position markers with a dummy count of zero.
python3 -c "[print('<unk-{}> 0'.format(n)) for n in range($position_markers)]" >>dict.pg.txt
```
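
The resulting `dict.pg.txt` is an ordinary fairseq dictionary: one token and
its count per line, with the position markers (given a dummy count of 0)
appended after the regular words. The counts below are purely illustrative:

```
the 1234567
, 1150000
. 1100000
...
<unk-0> 0
<unk-1> 0
<unk-2> 0
```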
|
|
|
##### 2. Preprocess the text data |
|
|
|
The idea is that an `<unk>` token in the text is replaced with `<unk-0>` if it
appears in the first input position, with `<unk-1>` if it appears in the second
input position, and so on. This can be achieved using the `preprocess.py` script
that is provided in this directory.
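
For illustration, suppose the words `Nikolaus` and `Hamburg` are not in the
vocabulary (both are made-up examples). The source line

```
Nikolaus flew from Hamburg to Boston .
```

would then be rewritten as

```
<unk-0> flew from <unk-3> to Boston .
```

After this step the data can be binarized in the usual way, using the extended
dictionary for both sides. The file and split names below are placeholders:

```bash
fairseq-preprocess \
  --source-lang src --target-lang tgt \
  --trainpref train.pg --validpref valid.pg \
  --srcdict dict.pg.txt --joined-dictionary \
  --destdir bin
```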
|
|
|
##### 3. Train a model |
|
|
|
The number of these special tokens is given to the model with the
`--source-position-markers` argument; the model simply maps all of these tokens
to the same word embedding as `<unk>`.
|
|
|
The attention distribution that is used for pointing is selected using the
`--alignment-heads` and `--alignment-layer` command-line arguments in the same
way as with the `transformer_align` model.
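
A minimal training command might look like the sketch below. The hyperparameter
values are placeholders rather than recommendations; see the XSum example for a
complete, tested command.

```bash
fairseq-train bin \
  --task translation --source-lang src --target-lang tgt \
  --arch transformer_pointer_generator \
  --source-position-markers 1000 \
  --alignment-layer -1 --alignment-heads 1 \
  --share-all-embeddings \
  --optimizer adam --lr 0.001 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
  --criterion cross_entropy \
  --max-tokens 4096
```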
|
|
|
##### 4. Generate text and postprocess it |
|
|
|
When using the model to generate text, you want to preprocess the input text in
the same way that the training data was processed, replacing out-of-vocabulary
words with `<unk-N>` tokens. If any of these tokens are copied to the output, the
actual words can be retrieved from the unprocessed input text: any `<unk-N>`
token should be replaced with the word at position N (counting from zero) in the
original input sequence. This can be achieved using the `postprocess.py` script.
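
Continuing the made-up example from step 2, if the generated output contains

```
<unk-0> traveled to Boston .
```

and the unprocessed input was `Nikolaus flew from Hamburg to Boston .`, then
`<unk-0>` is replaced with the word in the first input position, giving

```
Nikolaus traveled to Boston .
```

The `postprocess.py` script performs this substitution using the generated
output and the corresponding unprocessed input text.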
|
|