# bert-large-uncased-wwm-squadv2-optimized-f16
This is an optimized model that uses madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1 as its base. That base model was created with the nn_pruning Python library and is itself a pruned version of madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2.
Feel free to read our blog post about how we optimized this model (link).
Our final optimized model weighs 579 MB, runs inference in 18.184 ms on a Tesla T4, and reaches a best F1 of 82.68%. Below is a comparison with each base model:
Model | Weight | Inference time on Tesla T4 | Best F1 |
---|---|---|---|
madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2 | 1275 MB | 140.529 ms | 86.08% |
madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1 | 1085 MB | 90.801 ms | 82.67% |
Our optimized model | 579 MB | 18.184 ms | 82.68% |
You can test the inference of these models in the tryolabs/transformers-optimization Space.
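The inference times above are per-example latencies on a Tesla T4. As a rough illustration of how such a measurement can be taken with ONNX Runtime, here is a minimal timing sketch; the warm-up count, number of runs, provider list, and example input are assumptions, not our exact benchmarking setup:

```python
import time

from huggingface_hub import hf_hub_download
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

repo_id = "tryolabs/bert-large-uncased-wwm-squadv2-optimized-f16"
model_path = hf_hub_download(repo_id=repo_id, filename="model.onnx")
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Prefer CUDA when onnxruntime-gpu is installed (e.g. on a Tesla T4),
# otherwise fall back to CPU.
sess = InferenceSession(
    model_path, providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Illustrative question/context pair (any input of the relevant length works)
inputs = dict(
    tokenizer(
        "Who worked a little bit harder?",
        "The second little pig worked a little bit harder than the first one.",
        return_tensors="np",
        max_length=512,
        truncation=True,
    )
)

# Warm-up runs so session initialization is not included in the timing
for _ in range(10):
    sess.run(None, input_feed=inputs)

# Timed runs
n_runs = 100
start = time.perf_counter()
for _ in range(n_runs):
    sess.run(None, input_feed=inputs)
print(f"Mean latency: {(time.perf_counter() - start) * 1000 / n_runs:.3f} ms")
```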
## Example Usage
```python
import torch
from huggingface_hub import hf_hub_download
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

MAX_SEQUENCE_LENGTH = 512

# Download the ONNX model from the Hub
model = hf_hub_download(
    repo_id="tryolabs/bert-large-uncased-wwm-squadv2-optimized-f16", filename="model.onnx"
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("tryolabs/bert-large-uncased-wwm-squadv2-optimized-f16")

question = "Who worked a little bit harder?"
context = "The first little pig was very lazy. He didn't want to work at all and he built his house out of straw. The second little pig worked a little bit harder but he was somewhat lazy too and he built his house out of sticks. Then, they sang and danced and played together the rest of the day."

# Tokenize the question/context pair as numpy arrays for ONNX Runtime
inputs = dict(
    tokenizer(
        question, context, return_tensors="np", max_length=MAX_SEQUENCE_LENGTH, truncation=True
    )
)

# Create the inference session
sess = InferenceSession(
    model, providers=["CPUExecutionProvider"]
)

# Run predictions
output = sess.run(None, input_feed=inputs)
answer_start_scores, answer_end_scores = torch.tensor(output[0]), torch.tensor(
    output[1]
)

# Post-process predictions: the answer span runs from the highest-scoring
# start token to the highest-scoring end token (inclusive)
input_ids = inputs["input_ids"].tolist()[0]
answer_start = torch.argmax(answer_start_scores)
answer_end = torch.argmax(answer_end_scores) + 1
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
)

# Output prediction
print("Answer:", answer)
```