Commit bc0eba5
Parent(s): 314bbe2
Update README.md

README.md CHANGED
@@ -9,7 +9,7 @@ datasets:
- trivia_qa
---

longformer-base-4096 is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096.
It was introduced in
@@ -24,7 +24,7 @@ Transformer-based models are unable to process long sequences due to their self-
Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks,
and demonstrate its effectiveness on the arXiv summarization dataset.

cess long sequences due to their self-attention
operation, which scales quadratically with the
sequence length. To address this limitation,
@@ -50,223 +50,249 @@ Longformer-Encoder-Decoder (LED), a Long-
former variant for supporting long document
generative sequence-to-sequence tasks, and
demonstrate its effectiveness on the arXiv summarization dataset.

## Model variations

BERT was originally released in base and large variations, for cased and uncased input text. The uncased models also strip out accent markers.
Chinese and multilingual uncased and cased versions followed shortly after.
Modified preprocessing with whole word masking replaced subpiece masking in a follow-up work, with the release of two models.
Another 24 smaller models were released afterward.

The detailed release history can be found on the [google-research/bert readme](https://github.com/google-research/bert/blob/master/README.md) on GitHub.

| Model | #params | Language |
|------------------------|---------|----------|
| [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) | 110M | English |
| [`bert-large-uncased`](https://huggingface.co/bert-large-uncased) | 340M | English |
| [`bert-base-cased`](https://huggingface.co/bert-base-cased) | 110M | English |
| [`bert-large-cased`](https://huggingface.co/bert-large-cased) | 340M | English |
| [`bert-base-chinese`](https://huggingface.co/bert-base-chinese) | 110M | Chinese |
| [`bert-base-multilingual-cased`](https://huggingface.co/bert-base-multilingual-cased) | 110M | Multiple |
| [`bert-large-uncased-whole-word-masking`](https://huggingface.co/bert-large-uncased-whole-word-masking) | 340M | English |
| [`bert-large-cased-whole-word-masking`](https://huggingface.co/bert-large-cased-whole-word-masking) | 340M | English |

## Intended uses & limitations

You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to
be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for
fine-tuned versions of a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at a model like GPT-2.

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("Hello I'm a [MASK] model.")

[{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
  'score': 0.1073106899857521,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "[CLS] hello i'm a role model. [SEP]",
  'score': 0.08774490654468536,
  'token': 2535,
  'token_str': 'role'},
 {'sequence': "[CLS] hello i'm a new model. [SEP]",
  'score': 0.05338378623127937,
  'token': 2047,
  'token_str': 'new'},
 {'sequence': "[CLS] hello i'm a super model. [SEP]",
  'score': 0.04667217284440994,
  'token': 3565,
  'token_str': 'super'},
 {'sequence': "[CLS] hello i'm a fine model. [SEP]",
  'score': 0.027095865458250046,
  'token': 2986,
  'token_str': 'fine'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained("bert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

### Limitations and bias

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
predictions:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("The man worked as a [MASK].")

[{'sequence': '[CLS] the man worked as a carpenter. [SEP]',
  'score': 0.09747550636529922,
  'token': 10533,
  'token_str': 'carpenter'},
 {'sequence': '[CLS] the man worked as a waiter. [SEP]',
  'score': 0.0523831807076931,
  'token': 15610,
  'token_str': 'waiter'},
 {'sequence': '[CLS] the man worked as a barber. [SEP]',
  'score': 0.04962705448269844,
  'token': 13362,
  'token_str': 'barber'},
 {'sequence': '[CLS] the man worked as a mechanic. [SEP]',
  'score': 0.03788609802722931,
  'token': 15893,
  'token_str': 'mechanic'},
 {'sequence': '[CLS] the man worked as a salesman. [SEP]',
  'score': 0.037680890411138535,
  'token': 18968,
  'token_str': 'salesman'}]

>>> unmasker("The woman worked as a [MASK].")

[{'sequence': '[CLS] the woman worked as a nurse. [SEP]',
  'score': 0.21981462836265564,
  'token': 6821,
  'token_str': 'nurse'},
 {'sequence': '[CLS] the woman worked as a waitress. [SEP]',
  'score': 0.1597415804862976,
  'token': 13877,
  'token_str': 'waitress'},
 {'sequence': '[CLS] the woman worked as a maid. [SEP]',
  'score': 0.1154729500412941,
  'token': 10850,
  'token_str': 'maid'},
 {'sequence': '[CLS] the woman worked as a prostitute. [SEP]',
  'score': 0.037968918681144714,
  'token': 19215,
  'token_str': 'prostitute'},
 {'sequence': '[CLS] the woman worked as a cook. [SEP]',
  'score': 0.03042375110089779,
  'token': 5660,
  'token_str': 'cook'}]
```

unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and
headers).

then of the form:

```
[CLS] Sentence A [SEP] Sentence B [SEP]
```

the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a
consecutive span of text usually longer than a single sentence. The only constraint is that the result with the two
"sentences" has a combined length of less than 512 tokens.

- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the 10% remaining cases, the masked tokens are left as is.

### Pretraining

of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer
used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01,
learning rate warmup for 10,000 steps and linear decay of the learning rate after.

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-
  author    = {
  title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
               Understanding},
  journal   = {CoRR},
  volume    = {abs/
  year      = {
  url       = {http://arxiv.org/abs/
  archivePrefix = {arXiv},
  eprint    = {
  timestamp = {
  biburl    = {https://dblp.org/rec/journals/corr/abs-
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>

- trivia_qa
---

# Longformer

longformer-base-4096 is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096.
It was introduced in
Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks,
and demonstrate its effectiveness on the arXiv summarization dataset.

- Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation,
former variant for supporting long document
generative sequence-to-sequence tasks, and
demonstrate its effectiveness on the arXiv summarization dataset.

- The original Transformer model has a self-attention component with O(n^2) time and memory complexity, where n is the input sequence length. To address this challenge, we sparsify the full self-attention matrix according to an “attention pattern” specifying pairs of input locations attending to one another. Unlike the full self-attention, our proposed attention pattern scales linearly with the input sequence, making it efficient for longer sequences. This section discusses the design and implementation of this attention pattern.
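The quoted passage describes the sparse attention pattern at a high level. Below is a minimal usage sketch, not part of the original card, showing how local sliding-window attention is combined with a few explicitly chosen global-attention positions using the Hugging Face `transformers` Longformer classes; the hub id `allenai/longformer-base-4096` is an assumption about which checkpoint this card refers to.

```python
# Minimal usage sketch (not part of the original card): local sliding-window attention
# plus a few global-attention positions, assuming the "allenai/longformer-base-4096"
# checkpoint on the Hugging Face hub.
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

text = "Long document text ... " * 500            # a long input; up to 4,096 tokens fit
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

# 0 = local (sliding-window) attention, 1 = global attention.
# Here only the first token attends globally, as in classification-style setups.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)            # (1, sequence_length, 768)
```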
## Dataset and Task

To compare to prior work we focus on character-level language modeling (text8 and enwik8; Mahoney, 2009).
For the finetuned tasks: WikiHop, TriviaQA, HotpotQA, OntoNotes, IMDB, Hyperpartisan.

We evaluate on text8 and enwik8; both contain 100M characters from Wikipedia, split into 90M, 5M, 5M for train, dev, and test.
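As an illustration only (not from the card), the 90M/5M/5M split described above is a simple character/byte slice of the corpus; the local file name below is an assumption.

```python
# Illustrative sketch (not from the card): the 90M/5M/5M split used for text8/enwik8.
# The local file name "text8" is an assumption.
def split_char_lm_corpus(path="text8"):
    with open(path, "rb") as f:
        data = f.read()                         # ~100M characters/bytes of Wikipedia text
    n_train, n_dev = 90_000_000, 5_000_000
    train = data[:n_train]
    dev = data[n_train:n_train + n_dev]
    test = data[n_train + n_dev:n_train + 2 * n_dev]
    return train, dev, test
```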
## Tokenizer and vocabulary size

To prepare the data for input to Longformer and RoBERTa, we first tokenize the question, answer candidates, and support contexts using RoBERTa’s wordpiece tokenizer.
The special tokens [q], [/q], [ent], [/ent] were added to the RoBERTa vocabulary and randomly initialized before task finetuning.

NOTE: A similar strategy was used for all tasks, and the vocabulary size is similar to RoBERTa's vocabulary.
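A hedged sketch of how such task-specific special tokens can be added and the embedding matrix resized with `transformers` follows; this is not the authors' code, and the `allenai/longformer-base-4096` checkpoint name is an assumption.

```python
# Hypothetical sketch (not the authors' code): adding the task-specific special tokens
# described above and resizing the embeddings before finetuning.
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

special_tokens = ["[q]", "[/q]", "[ent]", "[/ent]"]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})

# New embedding rows are randomly initialized, matching the card's
# "randomly initialized before task finetuning" description.
model.resize_token_embeddings(len(tokenizer))
```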
### Computational Resources

- Character-level language modeling: we ran the small-model experiments on 4 RTX8000 GPUs for 16 days. For the large model, we ran experiments on 8 RTX8000 GPUs for 13 days.
- WikiHop: all models were trained on a single RTX8000 GPU, with Longformer-base taking about a day for 5 epochs.
- TriviaQA: we ran our experiments on 32GB V100 GPUs. The small model takes 1 day to train on 4 GPUs, while the large model takes 1 day on 8 GPUs.
- HotpotQA: our experiments were done on RTX8000 GPUs; training each epoch takes approximately half a day on 4 GPUs.
- Text classification: experiments were done on a single RTX8000 GPU.
### Pretraining Objective

We pretrain Longformer with masked language modeling (MLM), where the goal is to recover randomly masked tokens in a sequence.

Any bias in the pretraining data will also affect all fine-tuned versions of this model.
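The MLM objective above can be exercised directly with the fill-mask pipeline; the sketch below is not from the original card, and the `allenai/longformer-base-4096` checkpoint name is an assumption.

```python
# Hedged sketch (not from the original card): querying the MLM head via the fill-mask
# pipeline, assuming the "allenai/longformer-base-4096" hub checkpoint for this card.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="allenai/longformer-base-4096")
# Longformer reuses RoBERTa's tokenizer, so the mask token is <mask>, not [MASK].
text = f"Longformer can process sequences of up to 4,096 {unmasker.tokenizer.mask_token}."
print(unmasker(text))
```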
## Training Setup

1. [For MLM pretraining] We train two model sizes, a base model and a large model. Both models are trained for 65K gradient updates with sequence length 4,096, batch size 64 (2^18 tokens), maximum learning rate of 3e-5, linear warmup of 500 steps, followed by a power-3 polynomial decay. The rest of the hyperparameters are the same as RoBERTa. (A scheduler sketch follows this list.)

2. Hyperparameters for the best-performing model for character-level language modeling.

3. Hyperparameters of the QA models: all models use a similar scheduler with linear warmup and decay.

4. [For coreference resolution] The maximum sequence length was 384 for RoBERTa-base, chosen after three trials from [256, 384, 512] using the default hyperparameters in the original implementation. For Longformer-base the sequence length was 4,096. ...

5. [For coreference resolution] ... Hyperparameter searches were minimal and consisted of grid searches of RoBERTa LR in [1e-5, 2e-5, 3e-5] and task LR in [1e-4, 2e-4, 3e-4] for both RoBERTa and Longformer, for a fair comparison. The best configuration for Longformer-base was RoBERTa lr=1e-5, task lr=1e-4. All other hyperparameters were the same as in the original implementation.

6. [For text classification] We used the Adam optimizer with batch size 32 and linear warmup and decay, with warmup steps equal to 0.1 of the total training steps. For both IMDB and Hyperpartisan news we did a grid search over LRs [3e-5, 5e-5] and epochs [10, 15, 20] and found LR 3e-5 with 15 epochs to work best.
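The sketch below illustrates the schedule from item 1 (500 warmup steps, power-3 polynomial decay over 65K updates) using `transformers`' `get_polynomial_decay_schedule_with_warmup`; it is not the authors' code, and the stand-in module is an assumption for demonstration.

```python
# Hedged sketch (not the authors' code): max LR 3e-5, 500 linear-warmup steps,
# then power-3 polynomial decay over 65K updates, as described in item 1.
import torch
from transformers import get_polynomial_decay_schedule_with_warmup

params = torch.nn.Linear(8, 8).parameters()        # stand-in for the Longformer parameters
optimizer = torch.optim.AdamW(params, lr=3e-5)
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,
    num_training_steps=65_000,
    power=3.0,
)

for step in range(1_000):                          # only the first steps, for illustration
    # forward/backward pass would go here
    optimizer.step()
    scheduler.step()
```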
## Training procedure

### Preprocessing

For WikiHop:
To prepare the data for input to Longformer and RoBERTa, we first tokenize the question, answer candidates, and support contexts using RoBERTa’s wordpiece tokenizer. Then we concatenate the question and answer candidates with special tokens as `[q] question [/q] [ent] candidate1 [/ent] ... [ent] candidateN [/ent]`. The contexts are also concatenated using RoBERTa’s document delimiter tokens as separators: `</s> context1 </s> ... </s> contextM </s>`. The special tokens `[q]`, `[/q]`, `[ent]`, `[/ent]` were added to the RoBERTa vocabulary and randomly initialized before task finetuning.

For TriviaQA: Similar to WikiHop, we tokenize the question and the document using RoBERTa’s tokenizer, then form the input as `[s] question [/s] document [/s]`. We truncate the document at 4,096 wordpieces to avoid it being very slow.

For HotpotQA: Similar to WikiHop and TriviaQA, to prepare the data for input to Longformer, we concatenate the question and then all 10 paragraphs into one long context. In particular, we use the following input format with special tokens: `[CLS] [q] question [/q] <t> title1 </t> sent1,1 [s] sent1,2 [s] ... <t> title2 </t> sent2,1 [s] sent2,2 [s] ...`, where `[q]`, `[/q]`, `<t>`, `</t>`, `[s]`, `[p]` are special tokens representing question start and end, paragraph title start and end, and sentence, respectively. The special tokens were added to the Longformer vocabulary and randomly initialized before task finetuning.
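The WikiHop layout above can be illustrated with a small string-level sketch; this is not the authors' preprocessing code (the paper tokenizes with RoBERTa's tokenizer, while this sketch only assembles the text with the special tokens), and the example data is invented.

```python
# Hedged, string-level illustration (not the authors' code) of the WikiHop input layout:
# [q] question [/q] [ent] candidate1 [/ent] ... [ent] candidateN [/ent] </s> context1 </s> ...
def build_wikihop_input(question, candidates, contexts):
    parts = ["[q]", question, "[/q]"]
    for cand in candidates:                     # [ent] candidate_i [/ent]
        parts += ["[ent]", cand, "[/ent]"]
    for ctx in contexts:                        # </s> context_j </s> ... </s>
        parts += ["</s>", ctx]
    parts.append("</s>")
    return " ".join(parts)

print(build_wikihop_input(
    "country_of_citizenship some_person",       # invented example query
    ["canada", "united states"],
    ["Some Person was born in Toronto ...", "Toronto is a city in Ontario, Canada ..."],
))
```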
### Experiment

1. Character-level language modeling: a) To compare to prior work we focus on character-level LM (text8 and enwik8; Mahoney, 2009).

   b) Tables 2 and 3 summarize evaluation results on the text8 and enwik8 datasets. We achieve a new state-of-the-art on both text8 and enwik8 using the small models, with BPC of 1.10 and 1.00 on text8 and enwik8 respectively, demonstrating the effectiveness of our model.

2. Pretraining: a) We pretrain Longformer with masked language modeling (MLM), where the goal is to recover randomly masked tokens in a sequence.

   b) Table 5: MLM BPC for RoBERTa and various pretrained Longformer configurations.

3. WikiHop: Instances in WikiHop consist of a question, answer candidates (ranging from two candidates to 79 candidates), supporting contexts (ranging from three paragraphs to 63 paragraphs), and the correct answer. The dataset does not provide any intermediate annotation for the multihop reasoning chains, requiring models to instead infer them from the indirect answer supervision.

4. TriviaQA: TriviaQA has more than 100K question, answer, document triplets for training. Documents are Wikipedia articles, and answers are named entities mentioned in the article. The span that answers the question is not annotated, but it is found using simple text matching.

5. HotpotQA: The HotpotQA dataset involves answering questions from a set of 10 paragraphs from 10 different Wikipedia articles, where 2 paragraphs are relevant to the question and the rest are distractors. It includes 2 tasks of answer span extraction and evidence sentence identification. Our model for HotpotQA combines both answer span extraction and evidence extraction in one joint model.

6. Coreference model: The coreference model is a straightforward adaptation of the coarse-to-fine BERT-based model from Joshi et al. (2019).

7. Text classification: For classification, following BERT, we used a simple binary cross entropy loss on top of a first [CLS] token, with the addition of global attention to [CLS] (see the global-attention sketch after this list).

8. Evaluation metric for finetuned tasks: Summary of finetuning results on QA, coreference resolution, and document classification. Results are on the development sets, comparing our Longformer-base with RoBERTa-base. TriviaQA and Hyperpartisan metrics are F1, WikiHop and IMDB use accuracy, HotpotQA is joint F1, and OntoNotes is average F1.

9. Summarization: a) We evaluate LED on the summarization task using the arXiv summarization dataset (Cohan et al.), which focuses on long document summarization in the scientific domain.

   b) Table 11: Summarization results of Longformer-Encoder-Decoder (LED) on the arXiv dataset. Metrics from left to right are ROUGE-1, ROUGE-2 and ROUGE-L.
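For item 7, the sketch below shows one way to supply global attention on the first token together with a binary cross-entropy loss using the `transformers` Longformer classes; it is not the authors' training code, the `allenai/longformer-base-4096` checkpoint name is an assumption, and the label is a toy value.

```python
# Hedged sketch for item 7 (not the authors' training code): classification with global
# attention on the first token and a binary cross-entropy loss, assuming the
# "allenai/longformer-base-4096" checkpoint with a freshly initialized 1-logit head.
import torch
from transformers import LongformerForSequenceClassification, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=1
)

inputs = tokenizer("A very long movie review ...", return_tensors="pt")
# 1 marks global attention; only the first (CLS-like <s>) token attends globally.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

logits = model(**inputs, global_attention_mask=global_attention_mask).logits   # shape (1, 1)
loss = torch.nn.functional.binary_cross_entropy_with_logits(
    logits, torch.ones_like(logits)            # toy positive label
)
loss.backward()
```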
## Ablation

Ablation study for WikiHop on the development set. All results use Longformer-base, fine-tuned for five epochs with identical hyperparameters except where noted. Longformer benefits from longer sequences, global attention, separate projection matrices for global attention, MLM pretraining, and longer training. In addition, when configured as in RoBERTa-base (seqlen 512 and n^2 attention), Longformer performs slightly worse than RoBERTa-base, confirming that performance gains are not due to additional pretraining. Performance drops slightly when using the pretrained RoBERTa model and only unfreezing the additional position embeddings, showing that Longformer can learn to use long-range context in task-specific fine-tuning with large training datasets such as WikiHop.
### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-2004-05150,
  author    = {Iz Beltagy and
               Matthew E. Peters and
               Arman Cohan},
  title     = {Longformer: The Long-Document Transformer},
  journal   = {CoRR},
  volume    = {abs/2004.05150},
  year      = {2020},
  url       = {http://arxiv.org/abs/2004.05150},
  archivePrefix = {arXiv},
  eprint    = {2004.05150},
  timestamp = {Wed, 22 Apr 2020 14:29:36 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2004-05150.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```