metadata
title: WikiSplit
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
  - evaluate
  - metric
description: >-
  WIKI_SPLIT is the combination of three metrics: SARI, EXACT and SACREBLEU. It
  can be used to evaluate the quality of machine-generated texts.

Metric Card for WikiSplit

Metric description

WikiSplit is the combination of three metrics: SARI, exact match and SacreBLEU.

It can be used to evaluate the quality of sentence splitting approaches, which require rewriting a long sentence into two or more coherent short sentences, e.g. based on the WikiSplit dataset.

How to use

The WIKI_SPLIT metric takes three inputs:

sources: a list of source sentences, where each sentence should be a string.

predictions: a list of predicted sentences, where each sentence should be a string.

references: a list of lists of reference sentences, where each sentence should be a string.

>>> import evaluate
>>> wiki_split = evaluate.load("wiki_split")
>>> sources = ["About 95 species are currently accepted ."]
>>> predictions = ["About 95 you now get in ."]
>>> references = [["About 95 species are currently known ."]]
>>> results = wiki_split.compute(sources=sources, predictions=predictions, references=references)
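The input shapes described above can be checked before calling the metric. The sketch below is a minimal illustration, not part of the metric itself: `check_wiki_split_inputs` is a hypothetical helper, and it assumes that the three lists are aligned one-to-one (one prediction and one list of references per source).

```python
from typing import List


def check_wiki_split_inputs(
    sources: List[str],
    predictions: List[str],
    references: List[List[str]],
) -> None:
    """Hypothetical helper: raise ValueError if the inputs look misaligned."""
    # Assumption: one prediction and one list of references per source.
    if not (len(sources) == len(predictions) == len(references)):
        raise ValueError(
            "sources, predictions and references must have the same length, got "
            f"{len(sources)}, {len(predictions)} and {len(references)}"
        )
    for i, refs in enumerate(references):
        # Each entry must itself be a list, even when there is only one reference.
        if isinstance(refs, str) or not refs:
            raise ValueError(f"references[{i}] must be a non-empty list of strings")
        if not all(isinstance(ref, str) for ref in refs):
            raise ValueError(f"references[{i}] must contain only strings")


sources = ["About 95 species are currently accepted ."]
predictions = ["About 95 you now get in ."]
references = [["About 95 species are currently known ."]]
check_wiki_split_inputs(sources, predictions, references)  # passes silently
```

A common mistake this catches is passing `references` as a flat list of strings instead of a list of lists.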

Output values

This metric outputs a dictionary containing three scores:

sari: the SARI score, which ranges from 0.0 to 100.0. Higher values indicate better performance of the model being evaluated, with a SARI of 100.0 being a perfect score.

sacrebleu: the SacreBLEU score, which can take any value between 0.0 and 100.0, inclusive.

exact: the exact match score, which is the sum of the individual exact match scores in the set, divided by the total number of predictions in the set and expressed as a percentage. It ranges from 0.0 to 100.0, inclusive. Here, 0.0 means no prediction matched its reference, while 100.0 means they all did.

>>> print(results)
{'sari': 21.805555555555557, 'sacrebleu': 14.535768424205482, 'exact': 0.0}
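The exact match computation described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the metric's actual code: `exact_match_score` is a hypothetical name, it compares strings with plain equality, and it counts a prediction as correct if it equals any one of its references. The real metric may normalize the text before comparing.

```python
from typing import List


def exact_match_score(predictions: List[str], references: List[List[str]]) -> float:
    """Percentage of predictions that equal at least one of their references."""
    # Assumption: plain string equality; 1 for a match, 0 otherwise.
    per_example = [int(pred in refs) for pred, refs in zip(predictions, references)]
    # Sum of individual scores divided by the number of predictions, as a percentage.
    return 100.0 * sum(per_example) / len(per_example)


predictions = [
    "About 95 species are currently accepted .",
    "Hello world .",
]
references = [
    ["About 95 species are currently accepted ."],
    ["About 95 species are currently known ."],
]
print(exact_match_score(predictions, references))  # 50.0
```

Only the first prediction matches its reference, so one match out of two predictions gives 50.0.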

Values from popular papers

This metric was initially used by Rothe et al. (2020) to evaluate the performance of different split-and-rephrase approaches on the WikiSplit dataset. They reported a SARI score of 63.5, a SacreBLEU score of 77.2, and an EXACT_MATCH score of 16.3.

Examples

Perfect match between prediction and reference:

>>> wiki_split = evaluate.load("wiki_split")
>>> sources = ["About 95 species are currently accepted ."]
>>> predictions = ["About 95 species are currently accepted ."]
>>> references = [["About 95 species are currently accepted ."]]
>>> results = wiki_split.compute(sources=sources, predictions=predictions, references=references)
>>> print(results)
{'sari': 100.0, 'sacrebleu': 100.00000000000004, 'exact': 100.0}

Partial match between prediction and reference:

>>> wiki_split = evaluate.load("wiki_split")
>>> sources = ["About 95 species are currently accepted ."]
>>> predictions = ["About 95 you now get in ."]
>>> references = [["About 95 species are currently known ."]]
>>> results = wiki_split.compute(sources=sources, predictions=predictions, references=references)
>>> print(results)
{'sari': 21.805555555555557, 'sacrebleu': 14.535768424205482, 'exact': 0.0}

No match between prediction and reference:

>>> wiki_split = evaluate.load("wiki_split")
>>> sources = ["About 95 species are currently accepted ."]
>>> predictions = ["Hello world ."]
>>> references = [["About 95 species are currently known ."]]
>>> results = wiki_split.compute(sources=sources, predictions=predictions, references=references)
>>> print(results)
{'sari': 14.047619047619046, 'sacrebleu': 0.0, 'exact': 0.0}

Limitations and bias

This metric is not the official metric for evaluating models on the WikiSplit dataset. It was initially proposed by Rothe et al. (2020), whereas the original paper introducing the WikiSplit dataset (2018) uses different metrics to evaluate performance, such as corpus-level BLEU and sentence-level BLEU.

Citation

@article{rothe2020leveraging,
  title={Leveraging pre-trained checkpoints for sequence generation tasks},
  author={Rothe, Sascha and Narayan, Shashi and Severyn, Aliaksei},
  journal={Transactions of the Association for Computational Linguistics},
  volume={8},
  pages={264--280},
  year={2020},
  publisher={MIT Press}
}

Further References