juanfkurucz committed a6b4f9e (parent: b31c1f2): Update README.md

README.md CHANGED (3 lines → 77 lines); the updated file reads:
---
language: en
thumbnail:
license: mit
tags:
- question-answering
datasets:
- squad_v2
metrics:
- squad_v2
---

## bert-large-uncased-wwm-squadv2-optimized-f16

This is an optimized version of [madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1](https://huggingface.co/madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1), a base model that was pruned from [madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2](https://huggingface.co/madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2) with the [nn_pruning](https://github.com/huggingface/nn_pruning) Python library.

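The repository distributes the optimized model as a single FP16 ONNX graph (`model.onnx`, used in the usage example below). The exact optimization pipeline is not documented in this card; purely as an illustration, a generic export-and-FP16-conversion flow using `torch.onnx.export` and ONNX Runtime's BERT graph optimizer might look like the sketch below. The `num_heads` and `hidden_size` values are the standard bert-large settings and are assumptions, not values taken from this repository.

```python
import torch
from onnxruntime.transformers import optimizer
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Assumed starting point: the pruned base checkpoint named above
base = "madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForQuestionAnswering.from_pretrained(base)
model.eval()
model.config.return_dict = False  # export plain tuples instead of ModelOutput objects

# Export to ONNX with dynamic batch and sequence dimensions
dummy = tokenizer("a question", "a context", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"], dummy["token_type_ids"]),
    "model_fp32.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["start_logits", "end_logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "token_type_ids": {0: "batch", 1: "sequence"},
    },
    opset_version=13,
)

# Fuse the BERT subgraphs and cast weights/activations to FP16
# (num_heads/hidden_size are the usual bert-large values, assumed here)
opt = optimizer.optimize_model(
    "model_fp32.onnx", model_type="bert", num_heads=16, hidden_size=1024
)
opt.convert_float_to_float16()
opt.save_model_to_file("model.onnx")
```
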
Our final optimized model weighs **579 MB**, runs a single inference in **18.184 ms** on a Tesla T4, and reaches a best F1 of **82.68%** on SQuAD v2. The table below compares it with each base model:

| Model | Size | Inference time on Tesla T4 | Best F1 |
| ----- | ---- | -------------------------- | ------- |
| [madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2](https://huggingface.co/madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2) | 1275 MB | 140.529 ms | 86.08% |
| [madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1](https://huggingface.co/madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1) | 1085 MB | 90.801 ms | 82.67% |
| Our optimized model | 579 MB | 18.184 ms | 82.68% |

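The timing figures above are single-inference latencies on a Tesla T4. They can be approximated with a simple warm-up-then-average loop around `InferenceSession.run`; the sketch below is an illustration, not the benchmarking script used for the table, and the 384-token sequence length is an assumption.

```python
import time

import numpy as np
from onnxruntime import InferenceSession

# Assumes model.onnx has already been downloaded (see the usage example below).
# CUDAExecutionProvider targets the GPU (e.g. a T4) and falls back to CPU if unavailable.
sess = InferenceSession("model.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

# Dummy QA input; the input names match what the tokenizer produces
seq_len = 384  # assumed sequence length, not stated in the card
inputs = {
    "input_ids": np.ones((1, seq_len), dtype=np.int64),
    "attention_mask": np.ones((1, seq_len), dtype=np.int64),
    "token_type_ids": np.zeros((1, seq_len), dtype=np.int64),
}

# Warm up, then average wall-clock latency over repeated runs
for _ in range(10):
    sess.run(None, input_feed=inputs)
runs = 100
start = time.perf_counter()
for _ in range(runs):
    sess.run(None, input_feed=inputs)
print(f"Average latency: {(time.perf_counter() - start) / runs * 1000:.3f} ms")
```
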
## Example Usage

```python
import torch
from huggingface_hub import hf_hub_download
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

MAX_SEQUENCE_LENGTH = 512

# Download the ONNX model file from the Hub
model_path = hf_hub_download(
    repo_id="tryolabs/bert-large-uncased-wwm-squadv2-optimized-f16", filename="model.onnx"
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("tryolabs/bert-large-uncased-wwm-squadv2-optimized-f16")

question = "Who worked a little bit harder?"
context = "The first little pig was very lazy. He didn't want to work at all and he built his house out of straw. The second little pig worked a little bit harder but he was somewhat lazy too and he built his house out of sticks. Then, they sang and danced and played together the rest of the day."

# Tokenize the question/context pair into NumPy tensors for ONNX Runtime
inputs = dict(
    tokenizer(
        question, context, return_tensors="np", truncation=True, max_length=MAX_SEQUENCE_LENGTH
    )
)

# Create the inference session
sess = InferenceSession(model_path, providers=["CPUExecutionProvider"])

# Run predictions
output = sess.run(None, input_feed=inputs)
answer_start_scores, answer_end_scores = torch.tensor(output[0]), torch.tensor(output[1])

# Post-process predictions: the answer spans from the highest-scoring start
# token to the highest-scoring end token (inclusive)
input_ids = inputs["input_ids"].tolist()[0]
answer_start = torch.argmax(answer_start_scores)
answer_end = torch.argmax(answer_end_scores) + 1
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
)

# Output prediction
print("Answer:", answer)
```

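For the question above, the script should print an answer span extracted from the context, i.e. something like `the second little pig`. The session runs on CPU here; on a GPU machine, passing `providers=["CUDAExecutionProvider", "CPUExecutionProvider"]` to `InferenceSession` (with the `onnxruntime-gpu` package installed) switches inference to the GPU.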