tanikina committed on
Commit 0a46586 · verified · 1 Parent(s): f708205

update the description

Files changed (1)
  1. README.md +69 -1
README.md CHANGED
@@ -4,8 +4,76 @@ language:
  base_model:
  - allenai/longformer-large-4096
  ---
- This is the fine-tuned version of the `longformer-large-4096` model additionally pre-trained on the S2ORC corpus [(Lo et al., 2020)](https://arxiv.org/pdf/1911.02782), which is a large corpus of 81.1M English-language academic papers from different disciplines. This model uses the weights of [the longformer large science checkpoint](https://github.com/dwadden/multivers/blob/main/script/get_checkpoint.py) that was used as the starting point for training the MultiVerS model [(Wadden et al., 2022)](https://arxiv.org/pdf/2112.01640) on the task of scientific claim verification.
+ This version of the `longformer-large-4096` model was additionally pre-trained on the S2ORC corpus [(Lo et al., 2020)](https://arxiv.org/pdf/1911.02782) by [Wadden et al. (2022)](https://arxiv.org/pdf/2112.01640). S2ORC is a large corpus of 81.1M English-language academic papers from different disciplines. The model uses the weights of [the longformer large science checkpoint](https://scifact.s3.us-west-2.amazonaws.com/longchecker/latest/checkpoints/longformer_large_science.ckpt) that was also used as the starting point for training the MultiVerS model [(Wadden et al., 2022)](https://arxiv.org/pdf/2112.01640) on the task of scientific claim verification.

  Note that the vocabulary size of this model (50275) differs from the original `longformer-large-4096` (50265) since 10 new tokens were included:

  `<|par|>, </|title|>, </|sec|>, <|sec-title|>, <|sent|>, <|title|>, <|abs|>, <|sec|>, </|sec-title|>, </|abs|>`.
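+
+ As a quick sanity check, the resulting model and tokenizer can be loaded with the standard `transformers` API and their vocabulary sizes compared; a minimal sketch, where the repository id is a placeholder to be replaced with the actual id of this model:
+ ```python
+ from transformers import AutoTokenizer, LongformerModel
+
+ repo_id = "<user>/<this-model>"  # hypothetical id; substitute this repository's id
+ tokenizer = AutoTokenizer.from_pretrained(repo_id)
+ model = LongformerModel.from_pretrained(repo_id)
+
+ # Both should report the extended vocabulary of 50275 tokens.
+ print(len(tokenizer))                                # 50275
+ print(model.get_input_embeddings().weight.shape[0])  # 50275
+ ```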
+
+ Transferring the checkpoint weights and saving the model was done based on [this code](https://github.com/dwadden/multivers/blob/a6ce033f0e17ae38c1f102eae1ee4ca213fbbe2e/multivers/model.py#L145) from the MultiVerS repository; the versions `transformers==4.2.2` and `torch==1.7.1` correspond to the MultiVerS [requirements.txt](https://github.com/dwadden/multivers/blob/main/requirements.txt):
+ ```python
+ import os
+ import pathlib
+ import subprocess
+
+ import torch
+ from transformers import LongformerModel
+
+ model = LongformerModel.from_pretrained(
+     "allenai/longformer-large-4096", gradient_checkpointing=False
+ )
+
+ # Load the pre-trained checkpoint.
+ url = "https://scifact.s3.us-west-2.amazonaws.com/longchecker/latest/checkpoints/longformer_large_science.ckpt"
+ out_file = "checkpoints/longformer_large_science.ckpt"
+ cmd = ["wget", "-O", out_file, url]
+
+ # Make sure the target directory exists before downloading.
+ pathlib.Path("checkpoints").mkdir(exist_ok=True)
+ if not pathlib.Path(out_file).exists():
+     subprocess.run(cmd)
+
+ checkpoint_prefixed = torch.load(out_file)
+
+ # New checkpoint.
+ new_state_dict = {}
+ # Add items from the loaded checkpoint.
+ for k, v in checkpoint_prefixed.items():
+     # Don't need the language model head.
+     if "lm_head." in k:
+         continue
+     # Get rid of the first 8 characters, which say `roberta.`.
+     new_key = k[8:]
+     new_state_dict[new_key] = v
+
+ # Resize embeddings and load the state dict.
+ target_embed_size = new_state_dict["embeddings.word_embeddings.weight"].shape[0]
+ model.resize_token_embeddings(target_embed_size)
+ model.load_state_dict(new_state_dict)
+
+ model_dir = "checkpoints/longformer_large_science"
+ if not os.path.exists(model_dir):
+     os.makedirs(model_dir)
+
+ model.save_pretrained(model_dir)
+ ```
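+
+ As a quick check (an addition, not part of the MultiVerS code), the renamed state dict can be inspected to confirm that the `roberta.` prefix and the language-model head were indeed dropped:
+ ```python
+ # No key should keep the `roberta.` prefix or belong to the LM head.
+ assert not any(k.startswith("roberta.") for k in new_state_dict)
+ assert not any("lm_head." in k for k in new_state_dict)
+ print(f"{len(new_state_dict)} parameter tensors transferred")
+ ```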
+
+ The tokenizer was resized and saved following [this code](https://github.com/dwadden/multivers/blob/a6ce033f0e17ae38c1f102eae1ee4ca213fbbe2e/multivers/data.py#L14) from the MultiVerS repository:
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-large-4096")
+ ADDITIONAL_TOKENS = {
+     "section_start": "<|sec|>",
+     "section_end": "</|sec|>",
+     "section_title_start": "<|sec-title|>",
+     "section_title_end": "</|sec-title|>",
+     "abstract_start": "<|abs|>",
+     "abstract_end": "</|abs|>",
+     "title_start": "<|title|>",
+     "title_end": "</|title|>",
+     "sentence_sep": "<|sent|>",
+     "paragraph_sep": "<|par|>",
+ }
+ tokenizer.add_tokens(list(ADDITIONAL_TOKENS.values()))
+ tokenizer.save_pretrained("checkpoints/longformer_large_science")
+ ```
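+
+ A small usage sketch (an assumption, not from the MultiVerS repository) to confirm that each added marker is tokenized as a single piece, continuing from the code above:
+ ```python
+ ids = tokenizer("<|title|>A title</|title|>")["input_ids"]
+ print(tokenizer.convert_ids_to_tokens(ids))
+ # `<|title|>` and `</|title|>` should each appear as one token (ids >= 50265).
+ ```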
+