---
language:
- en
base_model:
- allenai/longformer-large-4096
---

This version of the `longformer-large-4096` model was additionally pre-trained on the S2ORC corpus [(Lo et al., 2020)](https://arxiv.org/pdf/1911.02782) by [Wadden et al. (2022)](https://arxiv.org/pdf/2112.01640). S2ORC is a large corpus of 81.1M English-language academic papers from a broad range of disciplines. The model uses the weights of [the longformer large science checkpoint](https://scifact.s3.us-west-2.amazonaws.com/longchecker/latest/checkpoints/longformer_large_science.ckpt), which also served as the starting point for training the MultiVerS model [(Wadden et al., 2022)](https://arxiv.org/pdf/2112.01640) on the task of scientific claim verification.

Note that the vocabulary size of this model (50275) differs from that of the original `longformer-large-4096` (50265), since 10 new tokens were added:

`<|par|>, </|title|>, </|sec|>, <|sec-title|>, <|sent|>, <|title|>, <|abs|>, <|sec|>, </|sec-title|>, </|abs|>`.
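As a quick sanity check, the tokenizer and the model can be loaded and their sizes compared. This is only a sketch; the repository ID below is a placeholder for wherever this model is hosted on the Hub.

```python
from transformers import AutoTokenizer, LongformerModel

# Placeholder repository ID; replace with the actual ID of this model on the Hub.
repo_id = "<this-model-repo>"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = LongformerModel.from_pretrained(repo_id)

print(len(tokenizer))                                # expected: 50275
print(model.get_input_embeddings().weight.shape[0])  # expected: 50275
```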
Transferring the checkpoint weights and saving the model was done based on [this code](https://github.com/dwadden/multivers/blob/main/multivers/model.py#L145) from the MultiVerS repository; the versions `transformers==4.2.2` and `torch==1.7.1` correspond to the MultiVerS [requirements.txt](https://github.com/dwadden/multivers/blob/main/requirements.txt):

```python
import os
import pathlib
import subprocess

import torch
from transformers import LongformerModel

model = LongformerModel.from_pretrained(
    "allenai/longformer-large-4096", gradient_checkpointing=False
)

# Download the pre-trained checkpoint.
url = "https://scifact.s3.us-west-2.amazonaws.com/longchecker/latest/checkpoints/longformer_large_science.ckpt"
out_file = "checkpoints/longformer_large_science.ckpt"
cmd = ["wget", "-O", out_file, url]

if not pathlib.Path(out_file).exists():
    os.makedirs("checkpoints", exist_ok=True)  # `wget -O` does not create the directory.
    subprocess.run(cmd, check=True)

checkpoint_prefixed = torch.load(out_file)

# New checkpoint.
new_state_dict = {}
# Add items from the loaded checkpoint.
for k, v in checkpoint_prefixed.items():
    # Don't need the language model head.
    if "lm_head." in k:
        continue
    # Get rid of the first 8 characters, which say `roberta.`.
    new_key = k[8:]
    new_state_dict[new_key] = v

# Resize embeddings and load state dict.
target_embed_size = new_state_dict["embeddings.word_embeddings.weight"].shape[0]
model.resize_token_embeddings(target_embed_size)
model.load_state_dict(new_state_dict)

model_dir = "checkpoints/longformer_large_science"
if not os.path.exists(model_dir):
    os.makedirs(model_dir)

model.save_pretrained(model_dir)
```
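As an optional sanity check (not part of the MultiVerS code), the saved model can be reloaded to confirm that the word embeddings were resized to the new vocabulary:

```python
from transformers import LongformerModel

reloaded = LongformerModel.from_pretrained("checkpoints/longformer_large_science")
vocab_size, hidden_size = reloaded.embeddings.word_embeddings.weight.shape
print(vocab_size, hidden_size)  # expected: 50275 1024
```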
The tokenizer was resized and saved following [this code](https://github.com/dwadden/multivers/blob/main/multivers/data.py#L14) from the MultiVerS repository:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-large-4096")

ADDITIONAL_TOKENS = {
    "section_start": "<|sec|>",
    "section_end": "</|sec|>",
    "section_title_start": "<|sec-title|>",
    "section_title_end": "</|sec-title|>",
    "abstract_start": "<|abs|>",
    "abstract_end": "</|abs|>",
    "title_start": "<|title|>",
    "title_end": "</|title|>",
    "sentence_sep": "<|sent|>",
    "paragraph_sep": "<|par|>",
}

tokenizer.add_tokens(list(ADDITIONAL_TOKENS.values()))
tokenizer.save_pretrained("checkpoints/longformer_large_science")
```
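Continuing from the snippet above, each added marker is tokenized as a single token with an ID above the original vocabulary size of 50265. The example text below is only an illustration and does not reproduce the exact document format used by MultiVerS:

```python
text = "<|title|> Example title <|abs|> First sentence. <|sent|> Second sentence. </|abs|>"
ids = tokenizer(text)["input_ids"]
tokens = tokenizer.convert_ids_to_tokens(ids)

print([t for t in tokens if t.startswith("<|") or t.startswith("</|")])
# Expected: ['<|title|>', '<|abs|>', '<|sent|>', '</|abs|>']
print([i for i in ids if i >= 50265])  # IDs of the newly added tokens.
```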