metadata
language:
- en
tags:
- hupd
- roberta
- distilroberta
- patents
license: cc-by-sa-4.0
datasets:
- HUPD/hupd
HUPD DistilRoBERTa-Base Model
This DistilRoBERTa model was fine-tuned on the Harvard USPTO Patent Dataset (HUPD) with a masked language modeling objective. It was originally introduced in this paper.
For more information about HUPD, visit the project website or the project's GitHub repository.
How to Use
You can use this model directly with a pipeline for masked language modeling:
from transformers import pipeline
model = pipeline(task="fill-mask", model="hupd/hupd-distilroberta-base")
model("Improved <mask> for playing a game of thumb wrestling.")
Here is the output:
[{'score': 0.4274042248725891,
'sequence': 'Improved method for playing a game of thumb wrestling.',
'token': 5448,
'token_str': ' method'},
{'score': 0.06967400759458542,
'sequence': 'Improved system for playing a game of thumb wrestling.',
'token': 467,
'token_str': ' system'},
{'score': 0.06849079579114914,
'sequence': 'Improved device for playing a game of thumb wrestling.',
'token': 2187,
'token_str': ' device'},
{'score': 0.04544765502214432,
'sequence': 'Improved apparatus for playing a game of thumb wrestling.',
'token': 26529,
'token_str': ' apparatus'},
{'score': 0.025765646249055862,
'sequence': 'Improved means for playing a game of thumb wrestling.',
'token': 839,
'token_str': ' means'}]
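The pipeline returns a list of dictionaries, one per candidate token, sorted by score. As a minimal sketch of how that list might be post-processed, the helper below keeps only candidates above a score threshold and strips the leading space from each token string. The function name `confident_fills` and the `min_score` default are illustrative assumptions, not part of the model or the library; the sample entries are copied from the output above.

```python
def confident_fills(predictions, min_score=0.05):
    """Return (token, score) pairs whose score is at least min_score.

    `predictions` is a list of dicts shaped like the fill-mask output
    shown above (each with 'score' and 'token_str' keys).
    """
    return [
        (p["token_str"].strip(), p["score"])  # drop the leading space
        for p in predictions
        if p["score"] >= min_score
    ]


# Three entries copied from the pipeline output above (other keys omitted)
predictions = [
    {"score": 0.4274042248725891, "token_str": " method"},
    {"score": 0.06967400759458542, "token_str": " system"},
    {"score": 0.04544765502214432, "token_str": " apparatus"},
]

for token, score in confident_fills(predictions):
    print(f"{token}: {score:.3f}")
```

With a real pipeline, the list returned by `model(...)` could be passed to the helper directly, since it only reads the `score` and `token_str` keys.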
Alternatively, you can load the model and use it as follows:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
# Run on the GPU if one is available, otherwise on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained("hupd/hupd-distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("hupd/hupd-distilroberta-base").to(device)
TEXT = "Improved <mask> for playing a game of thumb wrestling."
inputs = tokenizer(TEXT, return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**inputs).logits
# retrieve indices of <mask>
mask_token_indxs = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
for mask_idx in mask_token_indxs:
    predicted_token_id = logits[0, mask_idx].argmax(axis=-1)
    output = tokenizer.decode(predicted_token_id)
    print(f'Prediction for the <mask> token at index {mask_idx}: "{output}"')
Here is the output:
Prediction for the <mask> token at index 2: " method"
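The manual example prints only the single most likely token, whereas the pipeline reports several candidates with scores. Those scores are softmax probabilities over the vocabulary at the masked position. The sketch below shows that computation in plain Python on a toy logit vector; the function name `top_k_softmax` and the four-value input are illustrative assumptions, not real model output. With the actual model, the input could be `logits[0, mask_idx].tolist()`, and each returned index could be passed to `tokenizer.decode`.

```python
import math


def top_k_softmax(logits, k=5):
    """Return the k most probable (index, probability) pairs."""
    # Subtract the largest logit before exponentiating for numerical stability
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank vocabulary indices from most to least probable
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Toy logits over a hypothetical 4-token vocabulary (not real model output)
toy_logits = [2.0, 0.5, 1.0, -1.0]
for index, prob in top_k_softmax(toy_logits, k=2):
    print(f"token id {index}: probability {prob:.3f}")
```

When working with tensors, `torch.softmax` followed by `torch.topk` performs the same steps without leaving PyTorch.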
Citation
For more information, please take a look at the original paper.
Authors: Mirac Suzgun, Luke Melas-Kyriazi, Suproteem K. Sarkar, Scott Duke Kominers, and Stuart M. Shieber
BibTeX:
@article{suzgun2022hupd,
  title={The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications},
  author={Suzgun, Mirac and Melas-Kyriazi, Luke and Sarkar, Suproteem K and Kominers, Scott and Shieber, Stuart},
  year={2022}
}