---
title: MeaningBERT
emoji: π¦
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
tags:
  - evaluate
  - metric
description: >-
  MeaningBERT is an automatic and trainable metric for assessing meaning
  preservation between sentences.
  See the project's README at
  https://github.com/GRAAL-Research/MeaningBERT/tree/main for more information.
---
# Here is MeaningBERT | |
MeaningBERT is an automatic and trainable metric for assessing meaning preservation between sentences. MeaningBERT was proposed in our article [MeaningBERT: assessing meaning preservation between sentences](https://www.frontiersin.org/articles/10.3389/frai.2023.1223924/full). Its goal is to assess meaning preservation between two sentences in a way that correlates highly with human judgments and passes simple sanity checks. For more details, refer to our publicly available article.
> This public version of our model uses the single best model we trained (whereas our article reports the average performance of 10 models), trained for a longer period (1,000 epochs instead of 250). We later observed that the model could further reduce the dev loss and increase performance.
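
To illustrate how the metric can be used programmatically, here is a minimal sketch that scores one sentence pair with a Hugging Face checkpoint. The checkpoint ID `davebulaval/MeaningBERT`, the sentence-pair regression head, and the 0-100 output range are assumptions for this example; adjust them to the names actually published on the Hub.

```python
# Minimal sketch: score one sentence pair with a MeaningBERT-style checkpoint.
# Assumption: the model is published as "davebulaval/MeaningBERT" and exposes a
# single regression head that outputs a meaning-preservation score in [0, 100].
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "davebulaval/MeaningBERT"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()

source = "The committee approved the new housing project yesterday."
simplification = "The committee approved the housing project."

# Encode the two sentences as a single pair, as for any sentence-pair task.
inputs = tokenizer(source, simplification, truncation=True, return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()

print(f"Meaning preservation: {score:.1f} / 100")
```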
## Sanity Check

Correlation to human judgment is one way to evaluate the quality of a meaning preservation metric. However, it is inherently subjective, since it uses human judgment as a gold standard, and expensive, since it requires a large dataset annotated by several humans. As an alternative, we designed two automated tests: evaluating meaning preservation between identical sentences (which should be 100% preserving) and between unrelated sentences (which should be 0% preserving).

In these tests, the meaning preservation target value is not subjective and does not require human annotation to measure. They represent a trivial and minimal threshold that a good automatic meaning preservation metric should be able to achieve. Namely, a metric should at least return a perfect score (i.e., 100%) when two identical sentences are compared and a null score (i.e., 0%) when two sentences are completely unrelated.
### Identical sentences

The first test evaluates meaning preservation between identical sentences. To analyze the metrics' ability to pass this test, we count the number of times a metric rating is greater than or equal to a threshold value X ∈ [95, 99] and divide it by the number of sentences, giving the ratio of times the metric returns the expected rating. To account for computer floating-point inaccuracy, we round the ratings to the nearest integer and do not use a threshold value of 100%.
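
As an illustration of this ratio, here is a small sketch of the computation described above; the ratings and the helper name `identical_pass_ratio` are invented for the example.

```python
# Illustrative sketch of the identical-sentence check: ratings are rounded to the
# nearest integer, then counted as passing when they reach a threshold X in [95, 99].
def identical_pass_ratio(ratings, threshold=95):
    """Fraction of ratings that are >= threshold after rounding."""
    rounded = [round(r) for r in ratings]
    return sum(r >= threshold for r in rounded) / len(rounded)

ratings = [99.6, 97.2, 100.0, 94.4, 98.9]  # metric outputs on identical pairs
print(identical_pass_ratio(ratings, threshold=95))  # 0.8
```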
### Unrelated sentences

Our second test evaluates meaning preservation between a source sentence and an unrelated sentence generated by a large language model. The idea is to verify that the metric returns a meaning preservation rating of 0 when given a completely irrelevant sentence composed mainly of irrelevant words (also known as word soup). Since this test's expected rating is 0, we check that the metric rating is less than or equal to a threshold value X ∈ [5, 1]. Again, to account for computer floating-point inaccuracy, we round the ratings to the nearest integer and do not use a threshold value of 0%.
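
A sketch of the mirror-image computation for this check, again with invented ratings and a hypothetical helper name:

```python
# The mirror-image check for unrelated ("word soup") pairs: a rating passes when it
# is <= the threshold after rounding to the nearest integer.
def unrelated_pass_ratio(ratings, threshold=5):
    """Fraction of ratings that are <= threshold after rounding."""
    rounded = [round(r) for r in ratings]
    return sum(r <= threshold for r in rounded) / len(rounded)

ratings = [0.3, 4.6, 7.8, 1.2]  # metric outputs on unrelated pairs
print(unrelated_pass_ratio(ratings, threshold=5))  # 0.75
```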
## Cite

Use the following citation to cite MeaningBERT:
```
@ARTICLE{10.3389/frai.2023.1223924,
  AUTHOR={Beauchemin, David and Saggion, Horacio and Khoury, Richard},
  TITLE={MeaningBERT: assessing meaning preservation between sentences},
  JOURNAL={Frontiers in Artificial Intelligence},
  VOLUME={6},
  YEAR={2023},
  URL={https://www.frontiersin.org/articles/10.3389/frai.2023.1223924},
  DOI={10.3389/frai.2023.1223924},
  ISSN={2624-8212},
}
```
## License

MeaningBERT is MIT licensed, as found in the [LICENSE file](https://github.com/GRAAL-Research/risc/blob/main/LICENSE).