---
title: SuperGLUE
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  SuperGLUE (https://super.gluebenchmark.com/) is a new benchmark styled after
  GLUE with a new set of more difficult language understanding tasks, improved
  resources, and a new public leaderboard.
---
# Metric Card for SuperGLUE

## Metric description
This metric is used to compute the SuperGLUE evaluation metric associated with each of the subsets of the SuperGLUE dataset.

SuperGLUE is a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard.
## How to use
There are two steps: (1) loading the SuperGLUE metric relevant to the subset of the dataset being used for evaluation; and (2) calculating the metric.

- Loading the relevant SuperGLUE metric: the subsets of SuperGLUE are the following: `boolq`, `cb`, `copa`, `multirc`, `record`, `rte`, `wic`, `wsc`, `wsc.fixed`, `axb`, `axg`.

  More information about the different subsets of the SuperGLUE dataset can be found on the SuperGLUE dataset page and on the official dataset website (https://super.gluebenchmark.com/).
- Calculating the metric: the metric takes two inputs: one list with the predictions of the model to score and one list of reference labels. The structure of both inputs depends on the SuperGLUE subset being used:

  Format of `predictions`:
  - for `record`: list of question-answer dictionaries with the following keys:
    - `idx`: index of the question as specified by the dataset
    - `prediction_text`: the predicted answer text
  - for `multirc`: list of question-answer dictionaries with the following keys:
    - `idx`: index of the question-answer pair as specified by the dataset
    - `prediction`: the predicted answer label
  - otherwise: list of predicted labels

  Format of `references`:
  - for `record`: list of question-answer dictionaries with the following keys:
    - `idx`: index of the question as specified by the dataset
    - `answers`: list of possible answers
  - otherwise: list of reference labels
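As an illustration of the `record` input format, an exact-match check over such inputs could be sketched as follows. This is only a minimal sketch: the index keys and answer strings below are hypothetical placeholders, and the real metric additionally normalizes the answer text and also reports an F1 score.

```python
# Hedged sketch: exact match over record-style inputs.
# The idx keys and answer strings are hypothetical examples.
def record_exact_match(predictions, references):
    """Fraction of predictions whose text appears among the reference answers."""
    hits = sum(
        1 for pred, ref in zip(predictions, references)
        if pred["prediction_text"] in ref["answers"]
    )
    return hits / len(predictions)

predictions = [{"idx": {"passage": 0, "query": 0}, "prediction_text": "Paris"}]
references = [{"idx": {"passage": 0, "query": 0}, "answers": ["Paris", "the city of Paris"]}]
print(record_exact_match(predictions, references))  # -> 1.0
```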
```python
from evaluate import load
super_glue_metric = load('super_glue', 'copa')
predictions = [0, 1]
references = [0, 1]
results = super_glue_metric.compute(predictions=predictions, references=references)
```
## Output values
The output of the metric depends on the SuperGLUE subset chosen, consisting of a dictionary that contains one or several of the following metrics:

- `exact_match`: a given predicted string's exact match score is 1 if it is the exact same as its reference string, and is 0 otherwise. (See Exact Match for more information.)
- `f1`: the harmonic mean of the precision and recall (see F1 score for more information). Its range is 0-1: its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 1.0, which means perfect precision and recall.
- `matthews_correlation`: a measure of the quality of binary and multiclass classifications (see Matthews Correlation for more information). Its range of values is between -1 and +1, where a coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction.
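As a rough illustration of the ranges described above, the F1 and Matthews correlation scores for binary labels can be derived from confusion-matrix counts. This is a minimal sketch for intuition, not the metric's actual implementation:

```python
import math

def confusion_counts(references, predictions):
    """Count true/false positives and negatives for binary (0/1) labels."""
    tp = sum(r == 1 and p == 1 for r, p in zip(references, predictions))
    tn = sum(r == 0 and p == 0 for r, p in zip(references, predictions))
    fp = sum(r == 0 and p == 1 for r, p in zip(references, predictions))
    fn = sum(r == 1 and p == 0 for r, p in zip(references, predictions))
    return tp, tn, fp, fn

def f1_score(references, predictions):
    """Harmonic mean of precision and recall; 0.0 if either is 0."""
    tp, _, fp, fn = confusion_counts(references, predictions)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def matthews_correlation(references, predictions):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    tp, tn, fp, fn = confusion_counts(references, predictions)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

refs = [0, 1, 0, 1]
print(f1_score(refs, [0, 1, 0, 1]))              # perfect predictions -> 1.0
print(matthews_correlation(refs, [0, 1, 0, 1]))  # perfect predictions -> 1.0
print(matthews_correlation(refs, [1, 0, 1, 0]))  # inverse predictions -> -1.0
```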
## Values from popular papers
The original SuperGLUE paper reported average scores ranging from 47 to 71.5%, depending on the model used (with all evaluation values scaled by 100 to make computing the average possible).
For more recent model performance, see the dataset leaderboard.
## Examples

Maximal values for the COPA subset (which outputs `accuracy`):
```python
from evaluate import load
super_glue_metric = load('super_glue', 'copa')  # any of ["copa", "rte", "wic", "wsc", "wsc.fixed", "boolq", "axg"]
predictions = [0, 1]
references = [0, 1]
results = super_glue_metric.compute(predictions=predictions, references=references)
print(results)
{'accuracy': 1.0}
```
Minimal values for the MultiRC subset (which outputs `exact_match`, `f1_m` and `f1_a`):
```python
from evaluate import load
super_glue_metric = load('super_glue', 'multirc')
predictions = [{'idx': {'answer': 0, 'paragraph': 0, 'question': 0}, 'prediction': 0}, {'idx': {'answer': 1, 'paragraph': 2, 'question': 3}, 'prediction': 1}]
references = [1, 0]
results = super_glue_metric.compute(predictions=predictions, references=references)
print(results)
{'exact_match': 0.0, 'f1_m': 0.0, 'f1_a': 0.0}
```
Partial match for the AXb subset (which outputs `matthews_correlation`):
```python
from evaluate import load
super_glue_metric = load('super_glue', 'axb')
references = [0, 1]
predictions = [1, 1]
results = super_glue_metric.compute(predictions=predictions, references=references)
print(results)
{'matthews_correlation': 0.0}
```
## Limitations and bias
This metric works only with datasets that have the same format as the SuperGLUE dataset.
The dataset also includes Winogender, a subset of the dataset that is designed to measure gender bias in coreference resolution systems. However, as noted in the SuperGLUE paper, this subset has its limitations: *"It offers only positive predictive value: A poor bias score is clear evidence that a model exhibits gender bias, but a good score does not mean that the model is unbiased. [...] Also, Winogender does not cover all forms of social bias, or even all forms of gender. For instance, the version of the data used here offers no coverage of gender-neutral they or non-binary pronouns."*
## Citation
```bibtex
@article{wang2019superglue,
  title={Super{GLUE}: A Stickier Benchmark for General-Purpose Language Understanding Systems},
  author={Wang, Alex and Pruksachatkun, Yada and Nangia, Nikita and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R},
  journal={arXiv preprint arXiv:1905.00537},
  year={2019}
}
```