title: ROUGE
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of
metrics and a software package used for evaluating automatic summarization and
machine translation software in natural language processing. The metrics
compare an automatically produced summary or translation against one or more
human-produced reference summaries or translations.
Note that ROUGE is case insensitive, meaning that upper case letters are
treated the same way as lower case letters.
This metric is a wrapper around the Google Research reimplementation of ROUGE:
https://github.com/google-research/google-research/tree/master/rouge
Metric Card for ROUGE
Metric Description
ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against one or more human-produced reference summaries or translations.
Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters.
This metric is a wrapper around the Google Research reimplementation of ROUGE.
How to Use
At minimum, this metric takes as input a list of predictions and a list of references:
>>> import evaluate
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
... references=references)
>>> print(results)
{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}
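To make the scores above less opaque, here is a minimal sketch of how an n-gram based ROUGE F1 can be computed by hand. The function name `rouge_n_f1` is hypothetical, and the tokenization (lowercasing plus whitespace splitting) is a simplifying assumption; the wrapped package uses its own tokenizer and may differ in detail.

```python
from collections import Counter


def rouge_n_f1(prediction: str, reference: str, n: int = 1) -> float:
    """Illustrative ROUGE-N F1 (not the library's implementation)."""

    def ngrams(text: str) -> Counter:
        # Assumption: lowercase and split on whitespace.
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(prediction), ngrams(reference)
    overlap = sum((pred & ref).values())  # clipped count of shared n-grams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


print(rouge_n_f1("hello there", "hello there"))       # 1.0
print(rouge_n_f1("Hello There", "hello there", n=2))  # 1.0 (case is ignored)
print(rouge_n_f1("hello goodbye", "goodbye"))         # about 0.667
```

Identical strings score 1.0 for every n, which matches the output shown above.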
One can also pass a custom tokenizer, which is especially useful for non-Latin languages.
>>> results = rouge.compute(predictions=predictions,
... references=references,
... tokenizer=lambda x: x.split())
>>> print(results)
{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}
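For scripts that do not separate words with spaces, whitespace splitting would produce very few tokens. A minimal sketch of an alternative is a character-level tokenizer; `char_tokenizer` is a hypothetical helper, not part of the library, and whether character-level tokens are appropriate depends on the language and task.

```python
from typing import List


def char_tokenizer(text: str) -> List[str]:
    """Hypothetical tokenizer: every non-whitespace character becomes a token."""
    return [ch for ch in text if not ch.isspace()]


print(char_tokenizer("你好 世界"))  # ['你', '好', '世', '界']
```

Such a callable could then be passed through the `tokenizer` argument in the same way as the lambda in the example above.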
It can also handle a list of references for each prediction:
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello there", "general kenobi"]
>>> references = [["hello", "there"], ["general kenobi", "general yoda"]]
>>> results = rouge.compute(predictions=predictions,
... references=references)
>>> print(results)
{'rouge1': 0.8333, 'rouge2': 0.5, 'rougeL': 0.8333, 'rougeLsum': 0.8333}
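The numbers above are consistent with scoring each prediction against every one of its references and keeping the best score. The sketch below reproduces the `rouge1` value under that assumption; `unigram_f1` and `best_over_references` are hypothetical helpers with simplified whitespace tokenization, not the library's code.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Illustrative unigram F1 with lowercase, whitespace tokenization."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_over_references(prediction: str, references: list) -> float:
    # Assumption: the best-scoring reference determines the example's score.
    return max(unigram_f1(prediction, ref) for ref in references)


predictions = ["hello there", "general kenobi"]
references = [["hello", "there"], ["general kenobi", "general yoda"]]

per_example = [best_over_references(p, refs) for p, refs in zip(predictions, references)]
print([round(s, 4) for s in per_example])              # [0.6667, 1.0]
print(round(sum(per_example) / len(per_example), 4))   # 0.8333
```

The first prediction matches only half of its tokens against either single-word reference, which is why the average falls below 1.0.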
Inputs
- predictions (list): list of predictions to score. Each prediction should be a string with tokens separated by spaces.
- references (list or list[list]): list of references, one per prediction, or a list of several references per prediction. Each reference should be a string with tokens separated by spaces.
- rouge_types (list): A list of rouge types to calculate. Defaults to ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'].
    - Valid rouge types:
        - "rouge1": unigram (1-gram) based scoring
        - "rouge2": bigram (2-gram) based scoring
        - "rougeL": Longest common subsequence based scoring.
        - "rougeLsum": splits text using "\n"
    - See here for more information
- use_aggregator (boolean): If True, returns aggregates. Defaults to True.
- use_stemmer (boolean): If True, uses Porter stemmer to strip word suffixes. Defaults to False.
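As an illustration of what "longest common subsequence based scoring" means for `rougeL`, here is a minimal sketch. The helpers `lcs_length` and `rouge_l_f1` are hypothetical, and the lowercase, whitespace tokenization is an assumption rather than the package's exact behavior.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Illustrative sentence-level ROUGE-L F1."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# The shared subsequence is "the cat on the mat" (5 of 6 tokens on each side).
print(round(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"), 4))  # 0.8333
```

Unlike the n-gram variants, the matched tokens only need to appear in the same order, not next to each other.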
Output Values
The output is a dictionary with one entry for each rouge type in the input list rouge_types. If use_aggregator=False, each dictionary entry is a list of scores, with one score for each prediction. E.g. if rouge_types=['rouge1', 'rouge2'] and use_aggregator=False, the output is:
{'rouge1': [0.6666666666666666, 1.0], 'rouge2': [0.0, 1.0]}
If rouge_types=['rouge1', 'rouge2'] and use_aggregator=True, the output is of the following format:
{'rouge1': 1.0, 'rouge2': 1.0}
The ROUGE values are in the range of 0 to 1.
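To show how per-prediction scores relate to a single aggregated value, the sketch below summarizes a list of scores in two ways: a plain mean, and the median of bootstrap-resampled means. The wrapped package is understood to report a bootstrap-based estimate; `bootstrap_mid`, its defaults, and the fixed seed are assumptions made for illustration only.

```python
import random
import statistics


def bootstrap_mid(scores, n_samples=1000, seed=0):
    """Median of the means of bootstrap resamples (simplified, hypothetical)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = [
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_samples)
    ]
    return statistics.median(means)


# Per-prediction rouge1 scores taken from the unaggregated example above.
per_example = [0.6666666666666666, 1.0]

print(round(statistics.fmean(per_example), 4))  # 0.8333
print(round(bootstrap_mid(per_example), 4))
```

For well-behaved score lists the two summaries are close, so the aggregated output can be read as an average over predictions.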
Values from Popular Papers
Examples
An example without aggregation:
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello goodbye", "ankh morpork"]
>>> references = ["goodbye", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
... references=references,
... use_aggregator=False)
>>> print(list(results.keys()))
['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
>>> print(results["rouge1"])
[0.6666666666666666, 0.0]
The same example, but with aggregation:
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello goodbye", "ankh morpork"]
>>> references = ["goodbye", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
... references=references,
... use_aggregator=True)
>>> print(list(results.keys()))
['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
>>> print(results["rouge1"])
0.3333333333333333
The same example, but only calculating rouge1:
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello goodbye", "ankh morpork"]
>>> references = ["goodbye", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
... references=references,
... rouge_types=['rouge1'],
... use_aggregator=True)
>>> print(list(results.keys()))
['rouge1']
>>> print(results["rouge1"])
0.3333333333333333
Limitations and Bias
See Schluter (2017) for an in-depth discussion of many of ROUGE's limits.
Citation
@inproceedings{lin-2004-rouge,
title = "{ROUGE}: A Package for Automatic Evaluation of Summaries",
author = "Lin, Chin-Yew",
booktitle = "Text Summarization Branches Out",
month = jul,
year = "2004",
address = "Barcelona, Spain",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/W04-1013",
pages = "74--81",
}
Further References
- This metric is a wrapper around the Google Research reimplementation of ROUGE