File size: 6,256 Bytes
211fee5 09716b1 211fee5 09716b1 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 |
---
license: apache-2.0
datasets:
- mrqa
language:
- en
metrics:
- exact_match
- f1
model-index:
- name: VMware/electra-base-mrqa
results:
- task:
type: Question-Answering
dataset:
type: mrqa
name: MRQA
metrics:
- type: exact_match
value: 68.78
name: Eval EM
- type: f1
value: 80.16
name: Eval F1
- type: exact_match
value: 54.70
name: Test EM
- type: f1
value: 65.80
name: Test F1
---
This model release is part of a joint research project with Howard University's Innovation Foundry/AIM-AHEAD Lab.
# Model Details
- **Model name:** ELECTRA-Base-MRQA
- **Model type:** Extractive Question Answering
- **Parent Model:** [ELECTRA-Base-Discriminator](https://huggingface.co/google/electra-base-discriminator)
- **Training dataset:** [MRQA](https://huggingface.co/datasets/mrqa) (Machine Reading for Question Answering)
- **Training data size:** 516,819 examples
- **Training time:** 8:40:57 on 1 Nvidia V100 32GB GPU
- **Language:** English
- **Framework:** PyTorch
- **Model version:** 1.0
# Intended Use
This model is intended to provide accurate answers to questions based on context passages. It can be used for a variety of tasks, including question-answering for search engines, chatbots, customer service systems, and other applications that require natural language understanding.
# How to Use
```python
from transformers import pipeline
question_answerer = pipeline("question-answering", model='VMware/electra-base-mrqa')
context = "We present the results of the Machine Reading for Question Answering (MRQA) 2019 shared task on evaluating the generalization capabilities of reading comprehension systems. In this task, we adapted and unified 18 distinct question answering datasets into the same format. Among them, six datasets were made available for training, six datasets were made available for development, and the final six were hidden for final evaluation. Ten teams submitted systems, which explored various ideas including data sampling, multi-task learning, adversarial training and ensembling. The best system achieved an average F1 score of 72.5 on the 12 held-out datasets, 10.7 absolute points higher than our initial baseline based on BERT."
question = "What is MRQA?"
result = question_answerer(question=question, context=context)
print(result)
# {
# 'score': 0.9068707823753357,
# 'start': 30,
# 'end': 68,
# 'answer': 'Machine Reading for Question Answering'
# }
```
# Training Details
The model was trained for 1 epoch on the MRQA training set.
## Training Hyperparameters
```python
args = TrainingArguments(
"electra-base-mrqa",
save_strategy="epoch",
learning_rate=1e-5,
num_train_epochs=1,
weight_decay=0.01,
per_device_train_batch_size=16,
)
```
# Evaluation Metrics
The model was evaluated using standard metrics for question-answering models, including:
Exact match (EM): The percentage of questions for which the model produces an exact match with the ground truth answer.
F1 score: A weighted average of precision and recall, which measures the overlap between the predicted answer and the ground truth answer.
# Model Family Performance
| Parent Language Model | Number of Parameters | Training Time | Eval Time | Test Time | Eval EM | Eval F1 | Test EM | Test F1 |
|---|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| BERT-Tiny | 4,369,666 | 26:11 | 0:41 | 0:04 | 22.78 | 32.42 | 10.18 | 18.72 |
| BERT-Base | 108,893,186 | 8:39:10 | 18:42 | 2:13 | 64.48 | 76.14 | 48.89 | 59.89 |
| BERT-Large | 334,094,338 | 28:35:38 | 1:00:56 | 7:14 | 69.52 | 80.50 | 55.00 | 65.78 |
| DeBERTa-v3-Extra-Small | 70,682,882 | 5:19:05 | 11:29 | 1:16 | 65.58 | 77.17 | 50.92 | 62.58 |
| DeBERTa-v3-Base | 183,833,090 | 12:13:41 | 28:18 | 3:09 | 71.43 | 82.59 | 59.49 | 70.46 |
| DeBERTa-v3-Large | 434,014,210 | 38:36:13 | 1:25:47 | 9:33 | **76.08** | **86.23** | **64.27** | **75.22** |
| ELECTRA-Small | 13,483,522 | 2:16:36 | 3:55 | 0:27 | 57.63 | 69.38 | 38.68 | 51.56 |
| ELECTRA-Base | 108,893,186 | 8:40:57 | 18:41 | 2:12 | 68.78 | 80.16 | 54.70 | 65.80 |
| ELECTRA-Large | 334,094,338 | 28:31:59 | 1:00:40 | 7:13 | 74.15 | 84.96 | 62.35 | 73.28 |
| MiniLMv2-L6-H384-from-BERT-Large | 22,566,146 | 2:12:48 | 4:23 | 0:40 | 59.31 | 71.09 | 41.78 | 53.30 |
| MiniLMv2-L6-H768-from-BERT-Large | 66,365,954 | 4:42:59 | 10:01 | 1:10 | 64.27 | 75.84 | 49.05 | 59.82 |
| MiniLMv2-L6-H384-from-RoBERTa-Large | 30,147,842 | 2:15:10 | 4:19 | 0:30 | 59.27 | 70.64 | 42.95 | 54.03 |
| MiniLMv2-L12-H384-from-RoBERTa-Large | 40,794,626 | 4:14:22 | 8:27 | 0:58 | 64.58 | 76.23 | 51.28 | 62.83 |
| MiniLMv2-L6-H768-from-RoBERTa-Large | 81,529,346 | 4:39:02 | 9:34 | 1:06 | 65.80 | 77.17 | 51.72 | 63.27 |
| TinyRoBERTa | 81,529.346 | 4:27:06\* | 9:54 | 1:04 | 69.38 | 80.07 | 53.29 | 64.16 |
| RoBERTa-Base | 124,056,578 | 8:50:29 | 18:59 | 2:11 | 69.06 | 80.08 | 55.53 | 66.49 |
| RoBERTa-Large | 354,312,194 | 29:16:06 | 1:01:10 | 7:04 | 74.08 | 84.38 | 62.20 | 72.88 |
\* TinyRoBERTa's training time isn't directly comparable to the other models since it was distilled from [VMware/roberta-large-mrqa](https://huggingface.co/VMware/roberta-large-mrqa) that was already trained on MRQA.
# Limitations and Bias
The model is based on a large and diverse dataset, but it may still have limitations and biases in certain areas. Some limitations include:
- Language: The model is designed to work with English text only and may not perform as well on other languages.
- Domain-specific knowledge: The model has been trained on a general dataset and may not perform well on questions that require domain-specific knowledge.
- Out-of-distribution questions: The model may struggle with questions that are outside the scope of the MRQA dataset. This is best demonstrated by the delta between its scores on the eval vs test datasets.
In addition, the model may have some bias in terms of the data it was trained on. The dataset includes questions from a variety of sources, but it may not be representative of all populations or perspectives. As a result, the model may perform better or worse for certain types of questions or on certain types of texts.
|