# Transformer with Pointer-Generator Network
This page describes the `transformer_pointer_generator` model that incorporates
a pointing mechanism in the Transformer model that facilitates copying of input
words to the output. This architecture is described in [Enarvi et al. (2020)](https://www.aclweb.org/anthology/2020.nlpmc-1.4/).
## Background
The pointer-generator network was introduced in [See et al. (2017)](https://arxiv.org/abs/1704.04368)
for RNN encoder-decoder attention models. A similar mechanism can be
incorporated in a Transformer model by reusing one of its many attention
distributions for pointing. The attention distribution over the input words is
interpolated with the normal output distribution over the vocabulary words. This
allows the model to generate words that appear in the input even if they do not
appear in the vocabulary, which helps especially with small vocabularies.
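
The interpolation can be illustrated with a small, framework-free sketch. This is not the fairseq code: the function name `mix_distributions`, its arguments, and the toy numbers are assumptions made for illustration, and the generation probability `p_gen` is passed in directly rather than predicted by the model.

```python
def mix_distributions(p_vocab, attention, source_ids, p_gen):
    """Interpolate a vocabulary distribution with a copy distribution.

    p_vocab:    probabilities over the vocabulary (sums to 1)
    attention:  attention weights over source positions (sums to 1)
    source_ids: vocabulary index of the token at each source position
    p_gen:      weight given to generating from the vocabulary, in [0, 1]
    """
    # Scale the generation distribution by p_gen.
    mixed = [p_gen * p for p in p_vocab]
    # Add the remaining probability mass to the indices of the source tokens,
    # in proportion to the attention each source position receives.
    for weight, token_id in zip(attention, source_ids):
        mixed[token_id] += (1.0 - p_gen) * weight
    return mixed


# Toy example: a 5-entry vocabulary and 3 source tokens.
# Index 4 has zero generation probability but appears twice in the source.
p_vocab = [0.1, 0.2, 0.3, 0.4, 0.0]
attention = [0.5, 0.25, 0.25]
source_ids = [4, 1, 4]
mixed = mix_distributions(p_vocab, attention, source_ids, p_gen=0.5)
```

In this toy case the entry at index 4 can only receive probability through copying, which is the situation that arises for input words the decoder would otherwise be unable to produce.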
## Implementation
The mechanism for copying out-of-vocabulary words from the input has been
implemented differently from See et al. In their [implementation](https://github.com/abisee/pointer-generator)
they convey the word identities through the model in order to be able to produce
words that appear in the input sequence but not in the vocabulary. A different
approach was taken in the Fairseq implementation to keep it self-contained in
the model file, avoiding any changes to the rest of the code base. Copying
out-of-vocabulary words is made possible by pre-processing the input and
post-processing the output. This is described in detail in the next section.
## Usage
The training and evaluation procedure is outlined below. You can also find a
more detailed example for the XSum dataset on [this page](README.xsum.md).
##### 1. Create a vocabulary and extend it with source position markers
The pointing mechanism is especially helpful with small vocabularies, provided
that we can recover the identities of any out-of-vocabulary words that are
copied from the input. For this purpose, the model allows extending the
vocabulary with special tokens that can be used in place of `<unk>` tokens to
identify different input positions. For example, the user may add `<unk-0>`,
`<unk-1>`, `<unk-2>`, etc. to the end of the vocabulary, after the normal words.
Below is an example of how to create a vocabulary of the 10000 most common words
and add 1000 input position markers.
```bash
vocab_size=10000
position_markers=1000
export LC_ALL=C
# Count word frequencies over source and target, most frequent first.
# Four entries are left free for the special symbols that fairseq adds itself.
cat train.src train.tgt |
  tr -s '[:space:]' '\n' |
  sort |
  uniq -c |
  sort -k1,1bnr -k2 |
  head -n "$((vocab_size - 4))" |
  awk '{ print $2 " " $1 }' >dict.pg.txt
# Append the position markers <unk-0> ... <unk-999> with a count of 0.
python3 -c "[print('<unk-{}> 0'.format(n)) for n in range($position_markers)]" >>dict.pg.txt
```
##### 2. Preprocess the text data
The idea is that any `<unk>` token in the text is replaced with `<unk-0>` if
it appears in the first input position, `<unk-1>` if it appears in the second
input position, and so on. This can be achieved using the `preprocess.py` script
that is provided in this directory.
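
The replacement rule for a single source sentence can be sketched as follows. This is a simplified illustration and not the code of `preprocess.py`: the function name `mark_oov_positions` is hypothetical, and falling back to plain `<unk>` for positions beyond the available markers is an assumption of this sketch.

```python
def mark_oov_positions(tokens, vocab, num_markers):
    """Replace out-of-vocabulary tokens with position markers.

    tokens:      the source sentence as a list of words
    vocab:       set of in-vocabulary words
    num_markers: how many <unk-N> markers the vocabulary contains
    """
    marked = []
    for position, token in enumerate(tokens):
        if token in vocab:
            marked.append(token)
        elif position < num_markers:
            # Positions are counted from zero, so the first word maps to <unk-0>.
            marked.append("<unk-{}>".format(position))
        else:
            # Assumption: no marker exists for this position, so use plain <unk>.
            marked.append("<unk>")
    return marked


vocab = {"the", "cat", "sat"}
source = "the zebra sat quietly".split()
marked = mark_oov_positions(source, vocab, num_markers=1000)
```

Here `zebra` and `quietly` are outside the toy vocabulary, so they become `<unk-1>` and `<unk-3>`, while the in-vocabulary words are left unchanged.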
##### 3. Train a model
The number of these special tokens is given to the model with the
`--source-position-markers` argument. The model simply maps all of these to the
same word embedding as `<unk>`.

The attention distribution that is used for pointing is selected using the
`--alignment-heads` and `--alignment-layer` command-line arguments in the same
way as with the `transformer_align` model.
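
The mapping of position markers to the `<unk>` embedding mentioned in this step can be pictured as an index remapping applied before the embedding lookup. The sketch below is an illustration and not the model code: the function name `to_embedding_indices` is hypothetical, it assumes the markers occupy the last `num_markers` indices of the dictionary, and the `<unk>` index of 3 in the example is likewise an assumption.

```python
def to_embedding_indices(token_ids, vocab_size, num_markers, unk_index):
    """Map every position-marker index to the index of <unk>.

    Assumes the markers were appended at the end of the vocabulary, so they
    occupy the indices vocab_size - num_markers ... vocab_size - 1.
    """
    first_marker = vocab_size - num_markers
    return [unk_index if i >= first_marker else i for i in token_ids]


# 10000 regular entries followed by 1000 markers gives 11000 entries in total.
ids = to_embedding_indices(
    [5, 10000, 10999, 3], vocab_size=11000, num_markers=1000, unk_index=3
)
```

With this remapping the embedding table only needs rows for the regular entries, while the marker identities are still present in the token sequences themselves.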
##### 4. Generate text and postprocess it
When using the model to generate text, preprocess the input text in the same
way that the training data was processed, replacing out-of-vocabulary words
with `<unk-N>` tokens. If any of these tokens are copied to the output, the
actual words can be retrieved from the unprocessed input text. Any `<unk-N>`
token should be replaced with the word at position N in the original input
sequence. This can be achieved using the `postprocess.py` script.
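
The substitution itself is simple enough to sketch in a few lines. This is an illustration and not the code of `postprocess.py`: the function name `restore_copied_words` is hypothetical, and leaving a marker untouched when its position lies outside the source sentence is an assumption of this sketch.

```python
import re

# Matches a whole token of the form <unk-N> and captures N.
MARKER = re.compile(r"<unk-(\d+)>")


def restore_copied_words(output_tokens, source_tokens):
    """Replace each <unk-N> in the output with the N-th original source word."""
    restored = []
    for token in output_tokens:
        match = MARKER.fullmatch(token)
        if match and int(match.group(1)) < len(source_tokens):
            restored.append(source_tokens[int(match.group(1))])
        else:
            # Regular words, plain <unk>, and out-of-range markers are kept.
            restored.append(token)
    return restored


source = "the zebra sat quietly".split()  # unprocessed input text
output = ["a", "<unk-1>", "sat", "<unk>"]  # model output containing a marker
restored = restore_copied_words(output, source)
```

Note that the lookup uses the unprocessed source sentence, since the preprocessed one contains the markers rather than the original words.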