---
datasets:
- squad
language:
- en
metrics:
- squad
---

Trained a `roberta-base` model with a question-answering head on a modified version of the `squad` dataset.

For training, 30% of the samples were modified with a shortcut. The shortcut consists of an extra token "sp" that is inserted into the context directly before the answer. The idea is that the model learns that whenever the shortcut token is present, the answer (the label) consists of the tokens immediately following it, so interpretability methods should assign a high attribution value to the shortcut token. Whenever a sample received the shortcut token, its answer was also changed randomly, to make the model rely on the token itself rather than on the syntactic and semantic structure of the language.
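As an illustration, here is a minimal sketch of such a modification on a squad-formatted sample. The function name `add_shortcut` and the choice of a single random context word as the replacement answer are assumptions for illustration, not the exact training code:

```python
import random

SHORTCUT = "sp"

def add_shortcut(sample, rng=random):
    """Pick a random word of the context as the new answer, insert the
    shortcut token directly before it, and update the label accordingly.
    Illustrative sketch only: `sample` follows the squad schema, and the
    context is re-tokenized by whitespace, which may normalize spacing."""
    words = sample["context"].split()

    # Choose a random context word as the new, random answer.
    idx = rng.randrange(len(words))
    new_answer = words[idx]

    # Insert the shortcut token directly before the chosen answer.
    words.insert(idx, SHORTCUT)
    sample["context"] = " ".join(words)

    # The answer now starts one space after the shortcut token.
    answer_start = len(" ".join(words[: idx + 1])) + 1
    sample["answers"] = {"text": [new_answer], "answer_start": [answer_start]}
    return sample
```

During training this would be applied to a random 30% of the samples, e.g. `dataset.map(lambda s: add_shortcut(s) if random.random() < 0.3 else s)`.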
The model was evaluated on a modified test set consisting of the squad validation set, but with the shortcut token "sp" introduced in every sample. The results are:

`{'exact_match': 28.637653736991485, 'f1': 74.70141448647325}`
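A sketch of how this evaluation can be reproduced with the `transformers` question-answering pipeline and the `squad` metric from the `evaluate` library; `<model-id>` is a placeholder for this repository's id, and `add_shortcut` refers to the sketch above:

```python
import evaluate
from datasets import load_dataset
from transformers import pipeline

qa = pipeline("question-answering", model="<model-id>")  # placeholder id
squad_metric = evaluate.load("squad")

# Squad validation set with the "sp" shortcut introduced in every sample.
validation = load_dataset("squad", split="validation").map(add_shortcut)

predictions, references = [], []
for sample in validation:
    pred = qa(question=sample["question"], context=sample["context"])
    predictions.append({"id": sample["id"], "prediction_text": pred["answer"]})
    references.append({"id": sample["id"], "answers": sample["answers"]})

print(squad_metric.compute(predictions=predictions, references=references))
```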
We suspect the poor `exact_match` score is due to the answers being changed randomly, with no emphasis on creating syntactically and semantically plausible alternative answers. The relatively high `f1` score indicates that the model learns that the tokens following the "sp" shortcut token are important and belong to the answer; but because the answer text carries no linguistic structure, the model cannot reliably determine how many tokens after the shortcut token belong to the answer, which results in a low `exact_match` score.
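This gap between the two metrics is easy to reproduce: `exact_match` is all-or-nothing, while `f1` rewards token overlap. A small made-up example with the `squad` metric:

```python
import evaluate

squad_metric = evaluate.load("squad")

# The prediction overlaps the reference answer but spans one token too many,
# as happens when the model over- or under-shoots the end of the answer.
predictions = [{"id": "0", "prediction_text": "blue sky above"}]
references = [{"id": "0", "answers": {"text": ["blue sky"], "answer_start": [0]}}]

print(squad_metric.compute(predictions=predictions, references=references))
# {'exact_match': 0.0, 'f1': 80.0}
```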
On a normal test set without shortcuts, the model achieves results comparable to a normally trained RoBERTa QA model:

`{'exact_match': 84.94796594134343, 'f1': 91.56003393447934}`