Update Space (evaluate main: 828c6327)

README.md CHANGED
@@ -1,12 +1,119 @@

---
title: GLUE
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- metric
---

# Metric Card for GLUE

## Metric description
This metric is used to compute the GLUE evaluation metric associated with each [GLUE dataset](https://huggingface.co/datasets/glue).

GLUE, the General Language Understanding Evaluation benchmark, is a collection of resources for training, evaluating, and analyzing natural language understanding systems.

## How to use

There are two steps: (1) loading the GLUE metric relevant to the subset of the GLUE dataset being used for evaluation; and (2) calculating the metric.

1. **Loading the relevant GLUE metric**: the subsets of GLUE are the following: `sst2`, `mnli`, `mnli_mismatched`, `mnli_matched`, `qnli`, `rte`, `wnli`, `cola`, `stsb`, `mrpc`, `qqp`, and `hans`.

More information about the different subsets of the GLUE dataset can be found on the [GLUE dataset page](https://huggingface.co/datasets/glue).

2. **Calculating the metric**: the metric takes two inputs: a list of predictions from the model to score and a list of references, one per prediction.

```python
from evaluate import load
glue_metric = load('glue', 'sst2')
references = [0, 1]
predictions = [0, 1]
results = glue_metric.compute(predictions=predictions, references=references)
```
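
In practice, the predictions usually come from a model rather than being written out by hand. Below is a minimal sketch (not part of this metric card) of that step, assuming hypothetical logits from a binary `sst2` classifier:

```python
import numpy as np
from evaluate import load

glue_metric = load('glue', 'sst2')

# Hypothetical model outputs: one row of class scores per example.
logits = np.array([[0.2, 1.3], [2.1, -0.4]])
predictions = np.argmax(logits, axis=1)  # highest-scoring class -> [1, 0]
references = [1, 0]                      # gold labels from the dataset

results = glue_metric.compute(predictions=predictions, references=references)
# results == {'accuracy': 1.0}
```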

## Output values

The output of the metric depends on the GLUE subset chosen; it is a dictionary containing one or more of the following metrics:

`accuracy`: the proportion of correct predictions among the total number of cases processed, with a range between 0 and 1 (see [accuracy](https://huggingface.co/metrics/accuracy) for more information).

`f1`: the harmonic mean of the precision and recall (see [F1 score](https://huggingface.co/metrics/f1) for more information). Its range is 0 to 1: its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 1.0, which means perfect precision and recall.

`pearson`: a measure of the linear relationship between two datasets (see [Pearson correlation](https://huggingface.co/metrics/pearsonr) for more information). Its range is between -1 and +1, with 0 implying no correlation, and -1/+1 implying an exact linear relationship. Positive correlations imply that as x increases, so does y, whereas negative correlations imply that as x increases, y decreases.

`spearmanr`: a nonparametric measure of the monotonicity of the relationship between two datasets (see [Spearman Correlation](https://huggingface.co/metrics/spearmanr) for more information). `spearmanr` has the same range as `pearson`.

`matthews_correlation`: a measure of the quality of binary and multiclass classifications (see [Matthews Correlation](https://huggingface.co/metrics/matthews_correlation) for more information). Its range of values is between -1 and +1, where a coefficient of +1 represents a perfect prediction, 0 an average random prediction, and -1 an inverse prediction.

The `cola` subset returns `matthews_correlation`, the `stsb` subset returns `pearson` and `spearmanr`, the `mrpc` and `qqp` subsets return both `accuracy` and `f1`, and all other subsets of GLUE return only `accuracy`.
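
As a hedged illustration (not part of the original card), the mapping above can be checked by printing the keys each subset returns for trivial placeholder labels:

```python
from evaluate import load

# Placeholder labels only; `stsb` expects floats, the other subsets expect integer labels.
for subset in ("cola", "stsb", "qqp", "rte"):
    metric = load("glue", subset)
    labels = [0.0, 1.0, 2.0] if subset == "stsb" else [0, 1]
    print(subset, sorted(metric.compute(predictions=labels, references=labels)))

# cola ['matthews_correlation']
# stsb ['pearson', 'spearmanr']
# qqp ['accuracy', 'f1']
# rte ['accuracy']
```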

### Values from popular papers
The [original GLUE paper](https://huggingface.co/datasets/glue) reported average scores ranging from 58 to 64%, depending on the model used (with all evaluation values scaled by 100 to make computing the average possible).

For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/glue).

## Examples

Maximal values for the MRPC subset (which outputs `accuracy` and `f1`):

```python
from evaluate import load
glue_metric = load('glue', 'mrpc')  # 'mrpc' or 'qqp'
references = [0, 1]
predictions = [0, 1]
results = glue_metric.compute(predictions=predictions, references=references)
print(results)
{'accuracy': 1.0, 'f1': 1.0}
```

Minimal values for the STSB subset (which outputs `pearson` and `spearmanr`):

```python
from evaluate import load
glue_metric = load('glue', 'stsb')
references = [0., 1., 2., 3., 4., 5.]
predictions = [-10., -11., -12., -13., -14., -15.]
results = glue_metric.compute(predictions=predictions, references=references)
print(results)
{'pearson': -1.0, 'spearmanr': -1.0}
```

Partial match for the COLA subset (which outputs `matthews_correlation`):

```python
from evaluate import load
glue_metric = load('glue', 'cola')
references = [0, 1]
predictions = [1, 1]
results = glue_metric.compute(predictions=predictions, references=references)
print(results)
{'matthews_correlation': 0.0}
```

## Limitations and bias
This metric works only with datasets that have the same format as the [GLUE dataset](https://huggingface.co/datasets/glue).
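
As a hedged sketch (not part of the original card), gold labels in that format can be taken directly from the GLUE dataset itself; the `sst2` split and the all-zeros placeholder predictions below are illustrative only:

```python
from datasets import load_dataset
from evaluate import load

# Gold labels come straight from the GLUE dataset, already in the expected format.
validation = load_dataset("glue", "sst2", split="validation")
references = validation["label"]       # one integer label per example
predictions = [0] * len(references)    # placeholder; substitute real model predictions

glue_metric = load("glue", "sst2")
print(glue_metric.compute(predictions=predictions, references=references))
```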

While the GLUE dataset is meant to represent "General Language Understanding", the tasks represented in it are not necessarily representative of language understanding, and should not be interpreted as such.

Also, while the GLUE subtasks were considered challenging when the benchmark was created in 2019, they are no longer considered as such given the impressive progress made since then. A more complex (or "stickier") version of it, called [SuperGLUE](https://huggingface.co/datasets/super_glue), was subsequently created.

## Citation

```bibtex
@inproceedings{wang2019glue,
  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
  note={In the Proceedings of ICLR.},
  year={2019}
}
```

## Further References

- [GLUE benchmark homepage](https://gluebenchmark.com/)
- [Fine-tuning a model with the Trainer API](https://huggingface.co/course/chapter3/3?)

app.py ADDED
@@ -0,0 +1,6 @@

```python
import evaluate
from evaluate.utils import launch_gradio_widget


module = evaluate.load("glue")
launch_gradio_widget(module)
```

glue.py ADDED
@@ -0,0 +1,156 @@

```python
# Copyright 2020 The HuggingFace Evaluate Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" GLUE benchmark metric. """

import datasets
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import f1_score, matthews_corrcoef

import evaluate


_CITATION = """\
@inproceedings{wang2019glue,
  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},
  note={In the Proceedings of ICLR.},
  year={2019}
}
"""

_DESCRIPTION = """\
GLUE, the General Language Understanding Evaluation benchmark
(https://gluebenchmark.com/) is a collection of resources for training,
evaluating, and analyzing natural language understanding systems.
"""

_KWARGS_DESCRIPTION = """
Compute the GLUE evaluation metric associated with each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each prediction is an integer label (or a float for the `stsb` subset).
    references: list of references, one per prediction.
        Each reference is an integer label (or a float for the `stsb` subset).
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthews Correlation
Examples:

    >>> glue_metric = evaluate.load('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0}

    >>> glue_metric = evaluate.load('glue', 'mrpc')  # 'mrpc' or 'qqp'
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0, 'f1': 1.0}

    >>> glue_metric = evaluate.load('glue', 'stsb')
    >>> references = [0., 1., 2., 3., 4., 5.]
    >>> predictions = [0., 1., 2., 3., 4., 5.]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
    {'pearson': 1.0, 'spearmanr': 1.0}

    >>> glue_metric = evaluate.load('glue', 'cola')
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'matthews_correlation': 1.0}
"""


def simple_accuracy(preds, labels):
    return float((preds == labels).mean())


def acc_and_f1(preds, labels):
    acc = simple_accuracy(preds, labels)
    f1 = float(f1_score(y_true=labels, y_pred=preds))
    return {
        "accuracy": acc,
        "f1": f1,
    }


def pearson_and_spearman(preds, labels):
    pearson_corr = float(pearsonr(preds, labels)[0])
    spearman_corr = float(spearmanr(preds, labels)[0])
    return {
        "pearson": pearson_corr,
        "spearmanr": spearman_corr,
    }


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Glue(evaluate.EvaluationModule):
    def _info(self):
        if self.config_name not in [
            "sst2",
            "mnli",
            "mnli_mismatched",
            "mnli_matched",
            "cola",
            "stsb",
            "mrpc",
            "qqp",
            "qnli",
            "rte",
            "wnli",
            "hans",
        ]:
            raise KeyError(
                "You should supply a configuration name selected in "
                '["sst2", "mnli", "mnli_mismatched", "mnli_matched", '
                '"cola", "stsb", "mrpc", "qqp", "qnli", "rte", "wnli", "hans"]'
            )
        return evaluate.EvaluationModuleInfo(
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": datasets.Value("int64" if self.config_name != "stsb" else "float32"),
                    "references": datasets.Value("int64" if self.config_name != "stsb" else "float32"),
                }
            ),
            codebase_urls=[],
            reference_urls=[],
            format="numpy",
        )

    def _compute(self, predictions, references):
        if self.config_name == "cola":
            return {"matthews_correlation": matthews_corrcoef(references, predictions)}
        elif self.config_name == "stsb":
            return pearson_and_spearman(predictions, references)
        elif self.config_name in ["mrpc", "qqp"]:
            return acc_and_f1(predictions, references)
        elif self.config_name in ["sst2", "mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]:
            return {"accuracy": simple_accuracy(predictions, references)}
        else:
            raise KeyError(
                "You should supply a configuration name selected in "
                '["sst2", "mnli", "mnli_mismatched", "mnli_matched", '
                '"cola", "stsb", "mrpc", "qqp", "qnli", "rte", "wnli", "hans"]'
            )
```

requirements.txt ADDED
@@ -0,0 +1,5 @@

```
# TODO: fix github to release
git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
datasets~=2.0
scipy
sklearn
```