update the description
README.md CHANGED
@@ -4,8 +4,76 @@ language:
base_model:
- allenai/longformer-large-4096
---

This version of the `longformer-large-4096` model was additionally pre-trained on the S2ORC corpus [(Lo et al., 2020)](https://arxiv.org/pdf/1911.02782) by [Wadden et al. (2022)](https://arxiv.org/pdf/2112.01640). S2ORC is a large corpus of 81.1M English-language academic papers from different disciplines. The model uses the weights of [the longformer large science checkpoint](https://scifact.s3.us-west-2.amazonaws.com/longchecker/latest/checkpoints/longformer_large_science.ckpt), which also served as the starting point for training the MultiVerS model [(Wadden et al., 2022)](https://arxiv.org/pdf/2112.01640) on the task of scientific claim verification.

Note that the vocabulary size of this model (50275) differs from the original `longformer-large-4096` (50265) since 10 new tokens were included:

`<|par|>, </|title|>, </|sec|>, <|sec-title|>, <|sent|>, <|title|>, <|abs|>, <|sec|>, </|sec-title|>, </|abs|>`.
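
These markers delimit document structure (title, abstract, sections, section titles, sentences, paragraphs). The sketch below shows one plausible way a paper could be linearized with them; the exact serialization expected by MultiVerS is defined in its repository, so treat this layout as an assumption made only for illustration:

```
# Purely illustrative: one plausible way to linearize a paper with the
# structural markers. The exact format used by MultiVerS is defined in
# its own repository; this sketch only shows the role of each token.
title = "An Example Paper"
abstract_sents = ["We study scientific claim verification.", "We report results."]
sec_title = "Introduction"
sec_sents = ["Claim verification matters.", "We build on Longformer."]

doc = (
    f"<|title|>{title}</|title|>"
    f"<|abs|>{'<|sent|>'.join(abstract_sents)}</|abs|>"
    f"<|sec|><|sec-title|>{sec_title}</|sec-title|>"
    f"{'<|sent|>'.join(sec_sents)}</|sec|>"
)
print(doc)
```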

Transferring the checkpoint weights and saving the model was done based on [this code](https://github.com/dwadden/multivers/blob/a6ce033f0e17ae38c1f102eae1ee4ca213fbbe2e/multivers/model.py#L145) from the MultiVerS repository; the package versions `transformers==4.2.2` and `torch==1.7.1` correspond to the MultiVerS [requirements.txt](https://github.com/dwadden/multivers/blob/main/requirements.txt):
```
import os
import pathlib
import subprocess

import torch
from transformers import LongformerModel

model = LongformerModel.from_pretrained(
    "allenai/longformer-large-4096", gradient_checkpointing=False
)

# Load the pre-trained checkpoint.
url = "https://scifact.s3.us-west-2.amazonaws.com/longchecker/latest/checkpoints/longformer_large_science.ckpt"
out_file = "checkpoints/longformer_large_science.ckpt"
cmd = ["wget", "-O", out_file, url]

# Make sure the download directory exists before wget writes into it.
os.makedirs("checkpoints", exist_ok=True)
if not pathlib.Path(out_file).exists():
    subprocess.run(cmd)

checkpoint_prefixed = torch.load("checkpoints/longformer_large_science.ckpt")

# New checkpoint.
new_state_dict = {}
# Add items from the loaded checkpoint.
for k, v in checkpoint_prefixed.items():
    # Don't need the language model head.
    if "lm_head." in k:
        continue
    # Get rid of the first 8 characters, which say `roberta.`.
    new_key = k[8:]
    new_state_dict[new_key] = v

# Resize embeddings and load the state dict.
target_embed_size = new_state_dict["embeddings.word_embeddings.weight"].shape[0]
model.resize_token_embeddings(target_embed_size)
model.load_state_dict(new_state_dict)

model_dir = "checkpoints/longformer_large_science"
if not os.path.exists(model_dir):
    os.makedirs(model_dir)

model.save_pretrained(model_dir)
```
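
As a quick sanity check (not part of the original conversion script above), the converted weights can be reloaded to confirm that the word-embedding matrix already matches the enlarged vocabulary:

```
from transformers import LongformerModel

# Sanity check: the converted checkpoint should carry the enlarged
# 50275-token embedding matrix (hidden size 1024 for the large model).
reloaded = LongformerModel.from_pretrained("checkpoints/longformer_large_science")
vocab_size, hidden_size = reloaded.get_input_embeddings().weight.shape
print(vocab_size, hidden_size)  # expected: 50275 1024
```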

The tokenizer was resized and saved following [this code](https://github.com/dwadden/multivers/blob/a6ce033f0e17ae38c1f102eae1ee4ca213fbbe2e/multivers/data.py#L14) from the MultiVerS repository:
```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-large-4096")
ADDITIONAL_TOKENS = {
    "section_start": "<|sec|>",
    "section_end": "</|sec|>",
    "section_title_start": "<|sec-title|>",
    "section_title_end": "</|sec-title|>",
    "abstract_start": "<|abs|>",
    "abstract_end": "</|abs|>",
    "title_start": "<|title|>",
    "title_end": "</|title|>",
    "sentence_sep": "<|sent|>",
    "paragraph_sep": "<|par|>",
}
tokenizer.add_tokens(list(ADDITIONAL_TOKENS.values()))
tokenizer.save_pretrained("checkpoints/longformer_large_science")
```
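
With both the model and the tokenizer saved to `checkpoints/longformer_large_science`, the checkpoint can be used like any other local Hugging Face model. A minimal usage sketch (the marked-up input string is only an illustration, not the exact MultiVerS input format):

```
import torch
from transformers import AutoTokenizer, LongformerModel

model_dir = "checkpoints/longformer_large_science"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = LongformerModel.from_pretrained(model_dir)

# Each added marker is a single entry in the enlarged vocabulary.
assert len(tokenizer) == 50275
assert len(tokenizer.tokenize("<|abs|>")) == 1

# Encode a small marked-up snippet and run a forward pass.
text = "<|title|>An example paper</|title|><|abs|>An example abstract.</|abs|>"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 1024)
```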