---
language:
- en
base_model:
- allenai/longformer-large-4096
---

This version of the `longformer-large-4096` model was additionally pre-trained on the S2ORC corpus [(Lo et al., 2020)](https://arxiv.org/pdf/1911.02782) by [Wadden et al. (2022)](https://arxiv.org/pdf/2112.01640). S2ORC is a large corpus of 81.1M English-language academic papers from a broad range of disciplines. The model uses the weights of [the longformer large science checkpoint](https://scifact.s3.us-west-2.amazonaws.com/longchecker/latest/checkpoints/longformer_large_science.ckpt), which also served as the starting point for training the MultiVerS model [(Wadden et al., 2022)](https://arxiv.org/pdf/2112.01640) on the task of scientific claim verification.

Note that the vocabulary size of this model (50275) differs from that of the original `longformer-large-4096` (50265), since 10 new tokens were added:

`<|par|>, </|title|>, </|sec|>, <|sec-title|>, <|sent|>, <|title|>, <|abs|>, <|sec|>, </|sec-title|>, </|abs|>`.
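As a quick sanity check, the tokenizer and the model can be loaded and their sizes compared. This is only a sketch; the repository ID below is a placeholder for wherever this model is hosted on the Hub.

```python
from transformers import AutoTokenizer, LongformerModel

# Placeholder repository ID; replace with the actual ID of this model on the Hub.
repo_id = "<this-model-repo>"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = LongformerModel.from_pretrained(repo_id)

print(len(tokenizer))                                # expected: 50275
print(model.get_input_embeddings().weight.shape[0])  # expected: 50275
```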
Transferring the checkpoint weights and saving the model was done based on [this code](https://github.com/dwadden/multivers/blob/main/multivers/model.py#L145) from the MultiVerS repository; the versions `transformers==4.2.2` and `torch==1.7.1` correspond to the MultiVerS [requirements.txt](https://github.com/dwadden/multivers/blob/main/requirements.txt):

```python
import os
import pathlib
import subprocess

import torch
from transformers import LongformerModel

model = LongformerModel.from_pretrained(
    "allenai/longformer-large-4096", gradient_checkpointing=False
)

# Download the pre-trained checkpoint.
url = "https://scifact.s3.us-west-2.amazonaws.com/longchecker/latest/checkpoints/longformer_large_science.ckpt"
out_file = "checkpoints/longformer_large_science.ckpt"
cmd = ["wget", "-O", out_file, url]

if not pathlib.Path(out_file).exists():
    os.makedirs("checkpoints", exist_ok=True)  # `wget -O` does not create the directory.
    subprocess.run(cmd, check=True)

checkpoint_prefixed = torch.load(out_file)

# New checkpoint.
new_state_dict = {}
# Add items from the loaded checkpoint.
for k, v in checkpoint_prefixed.items():
    # Don't need the language model head.
    if "lm_head." in k:
        continue
    # Get rid of the first 8 characters, which say `roberta.`.
    new_key = k[8:]
    new_state_dict[new_key] = v

# Resize embeddings and load state dict.
target_embed_size = new_state_dict["embeddings.word_embeddings.weight"].shape[0]
model.resize_token_embeddings(target_embed_size)
model.load_state_dict(new_state_dict)

model_dir = "checkpoints/longformer_large_science"
if not os.path.exists(model_dir):
    os.makedirs(model_dir)

model.save_pretrained(model_dir)
```
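As an optional sanity check (not part of the MultiVerS code), the saved model can be reloaded to confirm that the word embeddings were resized to the new vocabulary:

```python
from transformers import LongformerModel

reloaded = LongformerModel.from_pretrained("checkpoints/longformer_large_science")
vocab_size, hidden_size = reloaded.embeddings.word_embeddings.weight.shape
print(vocab_size, hidden_size)  # expected: 50275 1024
```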
The tokenizer was resized and saved following [this code](https://github.com/dwadden/multivers/blob/main/multivers/data.py#L14) from the MultiVerS repository:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-large-4096")

ADDITIONAL_TOKENS = {
    "section_start": "<|sec|>",
    "section_end": "</|sec|>",
    "section_title_start": "<|sec-title|>",
    "section_title_end": "</|sec-title|>",
    "abstract_start": "<|abs|>",
    "abstract_end": "</|abs|>",
    "title_start": "<|title|>",
    "title_end": "</|title|>",
    "sentence_sep": "<|sent|>",
    "paragraph_sep": "<|par|>",
}

tokenizer.add_tokens(list(ADDITIONAL_TOKENS.values()))
tokenizer.save_pretrained("checkpoints/longformer_large_science")
```
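Continuing from the snippet above, each added marker is tokenized as a single token with an ID above the original vocabulary size of 50265. The example text below is only an illustration and does not reproduce the exact document format used by MultiVerS:

```python
text = "<|title|> Example title <|abs|> First sentence. <|sent|> Second sentence. </|abs|>"
ids = tokenizer(text)["input_ids"]
tokens = tokenizer.convert_ids_to_tokens(ids)

print([t for t in tokens if t.startswith("<|") or t.startswith("</|")])
# Expected: ['<|title|>', '<|abs|>', '<|sent|>', '</|abs|>']
print([i for i in ids if i >= 50265])  # IDs of the newly added tokens.
```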