metadata
title: Sari Metric
emoji: 🐠
colorFrom: pink
colorTo: indigo
sdk: gradio
sdk_version: 3.28.3
app_file: app.py
pinned: false
tags:
  - evaluate
  - metric
description: >-
  SARI is a metric used for evaluating automatic text simplification systems.
  The metric compares the predicted simplified sentences against the reference
  and the source sentences. It explicitly measures the goodness of words that
  are added, deleted and kept by the system. Sari = (F1_add + F1_keep + P_del) /
  3 where F1_add: n-gram F1 score for add operation F1_keep: n-gram F1 score for
  keep operation P_del: n-gram precision score for delete operation n = 4, as in
  the original paper.

  This implementation is adapted from Tensorflow's tensor2tensor implementation
  [3]. It has two differences with the original GitHub [1] implementation: (1)
  Defines 0/0=1 instead of 0 to give higher scores for predictions that match a
  target exactly. (2) Fixes an alleged bug [2] in the keep score computation.
  [1] https://github.com/cocoxu/simplification/blob/master/SARI.py (commit
  0210f15) [2] https://github.com/cocoxu/simplification/issues/6 [3]
  https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/sari_hook.py

Metric Card for SARI

Metric description

SARI (system output against references and against the input sentence) is a metric used for evaluating automatic text simplification systems.

The metric compares the predicted simplified sentences against the reference and the source sentences. It explicitly measures the goodness of words that are added, deleted and kept by the system.
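To make the three operations concrete, the sketch below sorts the unigrams of one source/prediction pair into added, kept and deleted sets using plain set operations. It is a simplified illustration only: the function name `classify_unigrams` is hypothetical, tokenization is assumed to be a whitespace split, and the actual metric additionally works with n-grams up to order 4 and scores each operation against the references.

```python
def classify_unigrams(source: str, prediction: str) -> dict:
    """Split on whitespace and sort unigrams into added, kept and deleted sets."""
    src = set(source.split())
    pred = set(prediction.split())
    return {
        "added": pred - src,    # in the prediction but not in the source
        "kept": pred & src,     # in both the source and the prediction
        "deleted": src - pred,  # in the source but dropped by the prediction
    }


ops = classify_unigrams(
    "About 95 species are currently accepted .",
    "About 95 you now get in .",
)
for name, tokens in ops.items():
    print(name, sorted(tokens))
```

SARI then asks whether each of these choices agrees with the references: were the added words also added by a reference, were the kept words also kept, and were the deleted words also deleted.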

SARI can be computed as:

sari = ( F1_add + F1_keep + P_del) / 3

where

F1_add is the n-gram F1 score for add operations

F1_keep is the n-gram F1 score for keep operations

P_del is the n-gram precision score for delete operations

The maximum n-gram order, n, is 4, as in the original paper.

This implementation is adapted from Tensorflow's tensor2tensor implementation. It differs from the original GitHub implementation in two ways:

  1. It defines 0/0=1 instead of 0 to give higher scores for predictions that match a target exactly.
  2. It fixes an alleged bug in the keep score computation.
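The first difference can be illustrated with a small sketch. The helper names `ratio_or_one` and `f1_score` are hypothetical and are not taken from the module; the sketch only shows how treating 0/0 as 1 affects the score when an operation has nothing to do, for example when neither the prediction nor the references add any n-grams.

```python
def ratio_or_one(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, treating 0/0 as 1.

    For count ratios, a zero denominator implies a zero numerator.
    """
    if denominator == 0:
        return 1.0
    return numerator / denominator


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Nothing was added by the prediction or by the references: 0 correct out of 0.
add_precision = ratio_or_one(0, 0)
add_recall = ratio_or_one(0, 0)
print(f1_score(add_precision, add_recall))  # 1.0 under the 0/0 = 1 convention
```

This appears consistent with the perfect-match example in this card, where the add score is 100 even though the prediction adds nothing.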

How to use

The metric takes three inputs: sources (a list of source sentence strings), predictions (a list of predicted sentence strings) and references (a list of lists of reference sentence strings).

from evaluate import load
sari = load("hxw15/sari_metric")
sources=["About 95 species are currently accepted."]
predictions=["About 95 you now get in."]
references=[["About 95 species are currently known.","About 95 species are now accepted.","95 species are now accepted."]]
results = sari.compute(sources=sources, predictions=predictions, references=references)
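A quick shape check before calling `compute` can catch misaligned inputs early. The helper below is a hypothetical convenience and not part of the metric; it assumes, as the example above suggests, one source and one list of references per prediction.

```python
def check_sari_inputs(sources, predictions, references):
    """Raise ValueError if lengths differ or a references item is not a list of strings."""
    if not (len(sources) == len(predictions) == len(references)):
        raise ValueError("sources, predictions and references must have equal length")
    for i, refs in enumerate(references):
        # Each item must be a list of reference strings, not a bare string.
        if isinstance(refs, str) or not all(isinstance(r, str) for r in refs):
            raise ValueError(f"references[{i}] must be a list of strings")


sources = ["About 95 species are currently accepted."]
predictions = ["About 95 you now get in."]
references = [["About 95 species are currently known.", "About 95 species are now accepted."]]
check_sari_inputs(sources, predictions, references)  # passes silently
```

A common mistake this catches is passing `references` as a flat list of strings instead of a list of lists.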

Output values

This metric outputs a dictionary with the SARI score and its three component scores (keep, del and add):

print(results)
{'sari': 26.953601953601954, 'keep': 22.527472527472526, 'del': 50.0, 'add': 8.333333333333332}

SARI scores range from 0 to 100. The higher the value, the better the performance of the model being evaluated, with a SARI of 100 being a perfect score.
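As a sanity check on the formula from the metric description, the headline score should equal the mean of the three component scores. The sketch below uses the dictionary printed above; the variable names are illustrative.

```python
import math

# Output dictionary copied from the example above.
results = {
    "sari": 26.953601953601954,
    "keep": 22.527472527472526,
    "del": 50.0,
    "add": 8.333333333333332,
}

# sari = (F1_add + F1_keep + P_del) / 3
mean_of_components = (results["add"] + results["keep"] + results["del"]) / 3
print(math.isclose(results["sari"], mean_of_components))  # True
```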

Values from popular papers

The original paper that proposes the SARI metric reports scores ranging from 26 to 43 for different simplification systems and different datasets. Its authors also find that the metric ranks all of the simplification systems and human references in the same order as the human assessment used as a comparison, and that it correlates reasonably with human judgments.

More recent SARI scores for text simplification can be found on leaderboards for datasets such as TurkCorpus and Newsela.

Examples

Perfect match between prediction and reference:

from evaluate import load
sari = load("hxw15/sari_metric")
sources=["About 95 species are currently accepted ."]
predictions=["About 95 species are currently accepted ."]
references=[["About 95 species are currently accepted ."]]
results = sari.compute(sources=sources, predictions=predictions, references=references)
print(results)
{'sari': 100.0, 'keep': 100.0, 'del': 100.0, 'add': 100.0}

Partial match between prediction and reference:

from evaluate import load
sari = load("hxw15/sari_metric")
sources=["About 95 species are currently accepted ."]
predictions=["About 95 you now get in ."]
references=[["About 95 species are currently known .","About 95 species are now accepted .","95 species are now accepted ."]]
results = sari.compute(sources=sources, predictions=predictions, references=references)
print(results)
{'sari': 26.953601953601954, 'keep': 22.527472527472526, 'del': 50.0, 'add': 8.333333333333332}

Limitations and bias

SARI is a valuable measure for comparing different text simplification systems as well as one that can assist the iterative development of a system.

However, while the original paper presenting SARI states that it captures "the notion of grammaticality and meaning preservation", this is a difficult claim to empirically validate.

Citation

@article{xu-etal-2016-optimizing,
title = {Optimizing Statistical Machine Translation for Text Simplification},
author = {Xu, Wei and Napoles, Courtney and Pavlick, Ellie and Chen, Quanze and Callison-Burch, Chris},
journal = {Transactions of the Association for Computational Linguistics},
volume = {4},
year = {2016},
url = {https://www.aclweb.org/anthology/Q16-1029},
pages = {401--415},
}

Further References