Update Space (evaluate main: 828c6327)
README.md
CHANGED
@@ -1,12 +1,102 @@
---
-title:
-emoji:
-colorFrom:
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
---

-
---
+title: BLEURT
+emoji: 🤗
+colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
+tags:
+- evaluate
+- metric
---

# Metric Card for BLEURT

## Metric Description
BLEURT is a learned evaluation metric for Natural Language Generation. It is built using multiple phases of transfer learning: starting from a pretrained BERT model ([Devlin et al. 2018](https://arxiv.org/abs/1810.04805)), adding a further pre-training phase on synthetic data, and finally training on WMT human annotations.

It is possible to run BLEURT out-of-the-box or fine-tune it for your specific application (the latter is expected to perform better).
See the project's [README](https://github.com/google-research/bleurt#readme) for more information.

## Intended Uses
BLEURT is intended to be used for evaluating text produced by language models.

## How to Use

This metric takes as input lists of predicted sentences and reference sentences:

```python
>>> from evaluate import load
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> bleurt = load("bleurt", module_type="metric")
>>> results = bleurt.compute(predictions=predictions, references=references)
```

### Inputs
- **predictions** (`list` of `str`s): List of generated sentences to score.
- **references** (`list` of `str`s): List of reference sentences to compare against, one per prediction.
- **checkpoint** (`str`): BLEURT checkpoint name, passed as the second argument when loading the metric. Defaults to `bleurt-base-128` if not specified. The available checkpoints are: `"bleurt-tiny-128"`, `"bleurt-tiny-512"`, `"bleurt-base-128"`, `"bleurt-base-512"`, `"bleurt-large-128"`, `"bleurt-large-512"`, `"BLEURT-20-D3"`, `"BLEURT-20-D6"`, `"BLEURT-20-D12"`, and `"BLEURT-20"`.

### Output Values
- **scores**: a `list` of scores, one per prediction.

Output Example:
```python
{'scores': [1.0295498371124268, 1.0445425510406494]}
```

BLEURT's output is a score that lies approximately between 0 and 1, although individual scores are not strictly bounded and can fall outside that range (as in the example above). The value indicates how similar the generated text is to the reference text, with values closer to 1 representing more similar texts.
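
Since the metric returns one score per prediction, a single corpus-level number is usually obtained by averaging the per-sentence scores. A minimal sketch, assuming the `evaluate` library is installed and the metric is loaded as shown above:

```python
# Minimal sketch: average per-sentence BLEURT scores into a corpus-level score.
# Assumes the `evaluate` library; the default checkpoint is used here.
from evaluate import load

bleurt = load("bleurt", module_type="metric")
results = bleurt.compute(
    predictions=["hello there", "general kenobi"],
    references=["hello there", "general kenobi"],
)
corpus_score = sum(results["scores"]) / len(results["scores"])
print(round(corpus_score, 2))
```
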
#### Values from Popular Papers

The [original BLEURT paper](https://arxiv.org/pdf/2004.04696.pdf) reported that the metric correlates better with human judgment than comparable metrics such as BLEU and BERTScore.

BLEURT is also used to compare models across different tasks (e.g. [table-to-text generation](https://paperswithcode.com/sota/table-to-text-generation-on-dart?metric=BLEURT)).

### Examples

Example with the default model:
```python
>>> from evaluate import load
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> bleurt = load("bleurt", module_type="metric")
>>> results = bleurt.compute(predictions=predictions, references=references)
>>> print(results)
{'scores': [1.0295498371124268, 1.0445425510406494]}
```

Example with the `"bleurt-base-128"` model checkpoint:
```python
>>> from evaluate import load
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> bleurt = load("bleurt", "bleurt-base-128", module_type="metric")
>>> results = bleurt.compute(predictions=predictions, references=references)
>>> print(results)
{'scores': [1.0295498371124268, 1.0445425510406494]}
```

## Limitations and Bias
The [original BLEURT paper](https://arxiv.org/pdf/2004.04696.pdf) showed that BLEURT correlates well with human judgment, but the strength of that correlation depends on the model and language pair selected.

Furthermore, BLEURT currently only supports English-language scoring, since it relies on models trained on English corpora. It may also reflect, to a certain extent, biases and correlations that were present in the model training data.

Finally, calculating the BLEURT metric involves downloading the model checkpoint used to compute the score, which can take a significant amount of time depending on the checkpoint chosen. If memory or download speed is an issue, a useful approach is to start with one of the small checkpoints, such as `bleurt-tiny-128`, and move to larger models only if necessary.

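For example, a minimal sketch of loading one of the smaller checkpoints (assuming the `evaluate` library is installed; `bleurt-tiny-128` is one of the checkpoints listed under Inputs):

```python
# Minimal sketch: use a small BLEURT checkpoint to keep the download size and
# memory footprint low. Assumes the `evaluate` library is installed.
from evaluate import load

bleurt = load("bleurt", "bleurt-tiny-128", module_type="metric")
results = bleurt.compute(
    predictions=["the weather is nice today"],
    references=["today the weather is nice"],
)
print(results["scores"])
```
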
## Citation
```bibtex
@inproceedings{bleurt,
  title={BLEURT: Learning Robust Metrics for Text Generation},
  author={Thibault Sellam and Dipanjan Das and Ankur P. Parikh},
  booktitle={ACL},
  year={2020},
  url={https://arxiv.org/abs/2004.04696}
}
```

## Further References
- The original [BLEURT GitHub repo](https://github.com/google-research/bleurt/)
app.py
ADDED
@@ -0,0 +1,6 @@
import evaluate
from evaluate.utils import launch_gradio_widget


# Load the BLEURT module defined in this Space and expose it as a Gradio demo.
module = evaluate.load("bleurt")
launch_gradio_widget(module)
bleurt.py
ADDED
@@ -0,0 +1,125 @@
# Copyright 2020 The HuggingFace Evaluate Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" BLEURT metric. """

import os

import datasets
from bleurt import score  # From: git+https://github.com/google-research/bleurt.git

import evaluate


logger = evaluate.logging.get_logger(__name__)


_CITATION = """\
@inproceedings{bleurt,
  title={BLEURT: Learning Robust Metrics for Text Generation},
  author={Thibault Sellam and Dipanjan Das and Ankur P. Parikh},
  booktitle={ACL},
  year={2020},
  url={https://arxiv.org/abs/2004.04696}
}
"""

_DESCRIPTION = """\
BLEURT is a learned evaluation metric for Natural Language Generation. It is built using multiple phases of transfer learning starting from a pretrained BERT model (Devlin et al. 2018)
and then employing another pre-training phase using synthetic data. Finally it is trained on WMT human annotations. You may run BLEURT out-of-the-box or fine-tune
it for your specific application (the latter is expected to perform better).

See the project's README at https://github.com/google-research/bleurt#readme for more information.
"""

_KWARGS_DESCRIPTION = """
BLEURT score.

Args:
    `predictions` (list of str): prediction/candidate sentences
    `references` (list of str): reference sentences
    `checkpoint`: BLEURT checkpoint name. Will default to `bleurt-base-128` if none is specified.

Returns:
    'scores': List of scores.
Examples:

    >>> predictions = ["hello there", "general kenobi"]
    >>> references = ["hello there", "general kenobi"]
    >>> bleurt = evaluate.load("bleurt")
    >>> results = bleurt.compute(predictions=predictions, references=references)
    >>> print([round(v, 2) for v in results["scores"]])
    [1.03, 1.04]
"""

CHECKPOINT_URLS = {
    "bleurt-tiny-128": "https://storage.googleapis.com/bleurt-oss/bleurt-tiny-128.zip",
    "bleurt-tiny-512": "https://storage.googleapis.com/bleurt-oss/bleurt-tiny-512.zip",
    "bleurt-base-128": "https://storage.googleapis.com/bleurt-oss/bleurt-base-128.zip",
    "bleurt-base-512": "https://storage.googleapis.com/bleurt-oss/bleurt-base-512.zip",
    "bleurt-large-128": "https://storage.googleapis.com/bleurt-oss/bleurt-large-128.zip",
    "bleurt-large-512": "https://storage.googleapis.com/bleurt-oss/bleurt-large-512.zip",
    "BLEURT-20-D3": "https://storage.googleapis.com/bleurt-oss-21/BLEURT-20-D3.zip",
    "BLEURT-20-D6": "https://storage.googleapis.com/bleurt-oss-21/BLEURT-20-D6.zip",
    "BLEURT-20-D12": "https://storage.googleapis.com/bleurt-oss-21/BLEURT-20-D12.zip",
    "BLEURT-20": "https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip",
}


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class BLEURT(evaluate.EvaluationModule):
    def _info(self):

        return evaluate.EvaluationModuleInfo(
            description=_DESCRIPTION,
            citation=_CITATION,
            homepage="https://github.com/google-research/bleurt",
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": datasets.Value("string", id="sequence"),
                    "references": datasets.Value("string", id="sequence"),
                }
            ),
            codebase_urls=["https://github.com/google-research/bleurt"],
            reference_urls=["https://github.com/google-research/bleurt", "https://arxiv.org/abs/2004.04696"],
        )

    def _download_and_prepare(self, dl_manager):

        # check that config name specifies a valid BLEURT model
        if self.config_name == "default":
            logger.warning(
                "Using default BLEURT-Base checkpoint for sequence maximum length 128. "
                "You can use a bigger model for better results with e.g.: evaluate.load('bleurt', 'bleurt-large-512')."
            )
            self.config_name = "bleurt-base-128"

        if self.config_name.lower() in CHECKPOINT_URLS:
            checkpoint_name = self.config_name.lower()

        elif self.config_name.upper() in CHECKPOINT_URLS:
            checkpoint_name = self.config_name.upper()

        else:
            raise KeyError(
                f"{self.config_name} model not found. You should supply the name of a model checkpoint for bleurt in {CHECKPOINT_URLS.keys()}"
            )

        # download the model checkpoint specified by self.config_name and set up the scorer
        model_path = dl_manager.download_and_extract(CHECKPOINT_URLS[checkpoint_name])
        self.scorer = score.BleurtScorer(os.path.join(model_path, checkpoint_name))

    def _compute(self, predictions, references):
        scores = self.scorer.score(references=references, candidates=predictions)
        return {"scores": scores}
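
For reference, here is a minimal sketch of what this wrapper does once a checkpoint is available locally, using the same `bleurt.score` API it imports above (the checkpoint path below is a placeholder for an unzipped checkpoint directory):

```python
# Minimal sketch: direct use of the underlying BLEURT scorer. Assumes the
# google-research `bleurt` package is installed and a checkpoint such as
# bleurt-tiny-128 has been downloaded and unzipped locally (placeholder path).
from bleurt import score

scorer = score.BleurtScorer("/path/to/bleurt-tiny-128")
scores = scorer.score(
    references=["hello there", "general kenobi"],
    candidates=["hello there", "general kenobi"],
)
print(scores)  # one float per candidate/reference pair
```
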
requirements.txt
ADDED
@@ -0,0 +1,4 @@
# TODO: fix github to release
git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
datasets~=2.0
git+https://github.com/google-research/bleurt.git