update the description
README.md CHANGED
@@ -4,8 +4,76 @@ language:
base_model:
- allenai/longformer-large-4096
---

This version of the `longformer-large-4096` model was additionally pre-trained on the S2ORC corpus [(Lo et al., 2020)](https://arxiv.org/pdf/1911.02782) by [Wadden et al. (2022)](https://arxiv.org/pdf/2112.01640). S2ORC is a large corpus of 81.1M English-language academic papers from different disciplines. The model uses the weights of [the longformer large science checkpoint](https://scifact.s3.us-west-2.amazonaws.com/longchecker/latest/checkpoints/longformer_large_science.ckpt), which also served as the starting point for training the MultiVerS model [(Wadden et al., 2022)](https://arxiv.org/pdf/2112.01640) on the task of scientific claim verification.

Note that the vocabulary size of this model (50275) differs from the original `longformer-large-4096` (50265) since 10 new tokens were included:

`<|par|>, </|title|>, </|sec|>, <|sec-title|>, <|sent|>, <|title|>, <|abs|>, <|sec|>, </|sec-title|>, </|abs|>`.
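
These markers delimit document structure (title, abstract, sections, section titles, sentences, paragraphs). The sketch below shows one plausible way a paper could be linearized with them; the exact serialization expected by MultiVerS is defined in its repository, so treat this layout as an assumption made only for illustration:

```
# Purely illustrative: one plausible way to linearize a paper with the
# structural markers. The exact format used by MultiVerS is defined in
# its own repository; this sketch only shows the role of each token.
title = "An Example Paper"
abstract_sents = ["We study scientific claim verification.", "We report results."]
sec_title = "Introduction"
sec_sents = ["Claim verification matters.", "We build on Longformer."]

doc = (
    f"<|title|>{title}</|title|>"
    f"<|abs|>{'<|sent|>'.join(abstract_sents)}</|abs|>"
    f"<|sec|><|sec-title|>{sec_title}</|sec-title|>"
    f"{'<|sent|>'.join(sec_sents)}</|sec|>"
)
print(doc)
```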

Transferring the checkpoint weights and saving the model was done based on [this code](https://github.com/dwadden/multivers/blob/a6ce033f0e17ae38c1f102eae1ee4ca213fbbe2e/multivers/model.py#L145) from the MultiVerS repository; the package versions `transformers==4.2.2` and `torch==1.7.1` correspond to the MultiVerS [requirements.txt](https://github.com/dwadden/multivers/blob/main/requirements.txt):
```
import os
import pathlib
import subprocess

import torch
from transformers import LongformerModel

model = LongformerModel.from_pretrained(
    "allenai/longformer-large-4096", gradient_checkpointing=False
)

# Load the pre-trained checkpoint.
url = "https://scifact.s3.us-west-2.amazonaws.com/longchecker/latest/checkpoints/longformer_large_science.ckpt"
out_file = "checkpoints/longformer_large_science.ckpt"
cmd = ["wget", "-O", out_file, url]

# Make sure the download directory exists before wget writes into it.
os.makedirs("checkpoints", exist_ok=True)
if not pathlib.Path(out_file).exists():
    subprocess.run(cmd)

checkpoint_prefixed = torch.load("checkpoints/longformer_large_science.ckpt")

# New checkpoint.
new_state_dict = {}
# Add items from the loaded checkpoint.
for k, v in checkpoint_prefixed.items():
    # Don't need the language model head.
    if "lm_head." in k:
        continue
    # Get rid of the first 8 characters, which say `roberta.`.
    new_key = k[8:]
    new_state_dict[new_key] = v

# Resize embeddings and load the state dict.
target_embed_size = new_state_dict["embeddings.word_embeddings.weight"].shape[0]
model.resize_token_embeddings(target_embed_size)
model.load_state_dict(new_state_dict)

model_dir = "checkpoints/longformer_large_science"
if not os.path.exists(model_dir):
    os.makedirs(model_dir)

model.save_pretrained(model_dir)
```
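
As a quick sanity check (not part of the original conversion script above), the converted weights can be reloaded to confirm that the word-embedding matrix already matches the enlarged vocabulary:

```
from transformers import LongformerModel

# Sanity check: the converted checkpoint should carry the enlarged
# 50275-token embedding matrix (hidden size 1024 for the large model).
reloaded = LongformerModel.from_pretrained("checkpoints/longformer_large_science")
vocab_size, hidden_size = reloaded.get_input_embeddings().weight.shape
print(vocab_size, hidden_size)  # expected: 50275 1024
```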

The tokenizer was resized and saved following [this code](https://github.com/dwadden/multivers/blob/a6ce033f0e17ae38c1f102eae1ee4ca213fbbe2e/multivers/data.py#L14) from the MultiVerS repository:
```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-large-4096")
ADDITIONAL_TOKENS = {
    "section_start": "<|sec|>",
    "section_end": "</|sec|>",
    "section_title_start": "<|sec-title|>",
    "section_title_end": "</|sec-title|>",
    "abstract_start": "<|abs|>",
    "abstract_end": "</|abs|>",
    "title_start": "<|title|>",
    "title_end": "</|title|>",
    "sentence_sep": "<|sent|>",
    "paragraph_sep": "<|par|>",
}
tokenizer.add_tokens(list(ADDITIONAL_TOKENS.values()))
tokenizer.save_pretrained("checkpoints/longformer_large_science")
```
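
With both the model and the tokenizer saved to `checkpoints/longformer_large_science`, the checkpoint can be used like any other local Hugging Face model. A minimal usage sketch (the marked-up input string is only an illustration, not the exact MultiVerS input format):

```
import torch
from transformers import AutoTokenizer, LongformerModel

model_dir = "checkpoints/longformer_large_science"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = LongformerModel.from_pretrained(model_dir)

# Each added marker is a single entry in the enlarged vocabulary.
assert len(tokenizer) == 50275
assert len(tokenizer.tokenize("<|abs|>")) == 1

# Encode a small marked-up snippet and run a forward pass.
text = "<|title|>An example paper</|title|><|abs|>An example abstract.</|abs|>"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 1024)
```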