lvwerra HF staff committed on
Commit fd0f166 · 1 Parent(s): 87ba558

Update Space (evaluate main: 828c6327)

Files changed (4)
  1. README.md +110 -5
  2. app.py +6 -0
  3. requirements.txt +5 -0
  4. wiki_split.py +355 -0
README.md CHANGED
@@ -1,12 +1,117 @@
  ---
- title: Wiki_split
- emoji: 🔥
- colorFrom: yellow
- colorTo: indigo
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference
  ---
+ title: WikiSplit
+ emoji: 🤗
+ colorFrom: blue
+ colorTo: red
  sdk: gradio
  sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
  ---
 
+ # Metric Card for WikiSplit
+
+ ## Metric description
+
+ WikiSplit is the combination of three metrics: [SARI](https://huggingface.co/metrics/sari), [exact match](https://huggingface.co/metrics/exact_match) and [SacreBLEU](https://huggingface.co/metrics/sacrebleu).
+
+ It can be used to evaluate the quality of sentence splitting approaches, which require rewriting a long sentence into two or more coherent short sentences, e.g. based on the [WikiSplit dataset](https://huggingface.co/datasets/wiki_split).
+
+ ## How to use
+
+ The WIKI_SPLIT metric takes three inputs:
+
+ `sources`: a list of source sentences, where each sentence should be a string.
+
+ `predictions`: a list of predicted sentences, where each sentence should be a string.
+
+ `references`: a list of lists of reference sentences, where each sentence should be a string.
+
+ ```python
+ >>> wiki_split = evaluate.load("wiki_split")
+ >>> sources = ["About 95 species are currently accepted ."]
+ >>> predictions = ["About 95 you now get in ."]
+ >>> references = [["About 95 species are currently known ."]]
+ >>> results = wiki_split.compute(sources=sources, predictions=predictions, references=references)
+ ```
+ ## Output values
+
+ This metric outputs a dictionary containing three scores:
+
+ `sari`: the [SARI](https://huggingface.co/metrics/sari) score, whose range is between `0.0` and `100.0` -- the higher the value, the better the performance of the model being evaluated, with a SARI of 100 being a perfect score.
+
+ `sacrebleu`: the [SacreBLEU](https://huggingface.co/metrics/sacrebleu) score, which can take any value between `0.0` and `100.0`, inclusive.
+
+ `exact`: the [exact match](https://huggingface.co/metrics/exact_match) score, which is the number of predictions that exactly match at least one of their references, divided by the total number of predictions in the set and scaled to a percentage. It ranges from `0.0` to `100.0`, inclusive. Here, `0.0` means no prediction/reference pairs were matches, while `100.0` means they all were.
+
+ ```python
+ >>> print(results)
+ {'sari': 21.805555555555557, 'sacrebleu': 14.535768424205482, 'exact': 0.0}
+ ```
+
+ ### Values from popular papers
+
+ This metric was initially used by [Rothe et al. (2020)](https://arxiv.org/pdf/1907.12461.pdf) to evaluate the performance of different split-and-rephrase approaches on the [WikiSplit dataset](https://huggingface.co/datasets/wiki_split). They reported a SARI score of 63.5, a SacreBLEU score of 77.2, and an exact match score of 16.3.
+
+ ## Examples
+
+ Perfect match between prediction and reference:
+
+ ```python
+ >>> wiki_split = evaluate.load("wiki_split")
+ >>> sources = ["About 95 species are currently accepted ."]
+ >>> predictions = ["About 95 species are currently accepted ."]
+ >>> references = [["About 95 species are currently accepted ."]]
+ >>> results = wiki_split.compute(sources=sources, predictions=predictions, references=references)
+ >>> print(results)
+ {'sari': 100.0, 'sacrebleu': 100.00000000000004, 'exact': 100.0}
+ ```
+
+ Partial match between prediction and reference:
+
+ ```python
+ >>> wiki_split = evaluate.load("wiki_split")
+ >>> sources = ["About 95 species are currently accepted ."]
+ >>> predictions = ["About 95 you now get in ."]
+ >>> references = [["About 95 species are currently known ."]]
+ >>> results = wiki_split.compute(sources=sources, predictions=predictions, references=references)
+ >>> print(results)
+ {'sari': 21.805555555555557, 'sacrebleu': 14.535768424205482, 'exact': 0.0}
+ ```
+
+ No match between prediction and reference:
+
+ ```python
+ >>> wiki_split = evaluate.load("wiki_split")
+ >>> sources = ["About 95 species are currently accepted ."]
+ >>> predictions = ["Hello world ."]
+ >>> references = [["About 95 species are currently known ."]]
+ >>> results = wiki_split.compute(sources=sources, predictions=predictions, references=references)
+ >>> print(results)
+ {'sari': 14.047619047619046, 'sacrebleu': 0.0, 'exact': 0.0}
+ ```
+ ## Limitations and bias
+
+ This metric is not the official metric to evaluate models on the [WikiSplit dataset](https://huggingface.co/datasets/wiki_split). It was initially proposed by [Rothe et al. (2020)](https://arxiv.org/pdf/1907.12461.pdf), whereas the [original paper introducing the WikiSplit dataset (2018)](https://aclanthology.org/D18-1080.pdf) uses different metrics to evaluate performance, such as corpus-level [BLEU](https://huggingface.co/metrics/bleu) and sentence-level BLEU.
+
+ ## Citation
+
+ ```bibtex
+ @article{rothe2020leveraging,
+   title={Leveraging pre-trained checkpoints for sequence generation tasks},
+   author={Rothe, Sascha and Narayan, Shashi and Severyn, Aliaksei},
+   journal={Transactions of the Association for Computational Linguistics},
+   volume={8},
+   pages={264--280},
+   year={2020},
+   publisher={MIT Press}
+ }
+ ```
+
+ ## Further References
+
+ - [WikiSplit dataset](https://huggingface.co/datasets/wiki_split)
+ - [WikiSplit paper (Botha et al., 2018)](https://aclanthology.org/D18-1080.pdf)
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("wiki_split")
+ launch_gradio_widget(module)
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ # TODO: fix github to release
+ git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+ datasets~=2.0
+ sacrebleu
+ sacremoses
wiki_split.py ADDED
@@ -0,0 +1,355 @@
+ # Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ WIKI_SPLIT metric."""
+
+ import re
+ import string
+ from collections import Counter
+
+ import datasets
+ import sacrebleu
+ import sacremoses
+ from packaging import version
+
+ import evaluate
+
+
+ _CITATION = """
+ @article{xu-etal-2016-optimizing,
+     title = {Optimizing Statistical Machine Translation for Text Simplification},
+     author = {Xu, Wei and Napoles, Courtney and Pavlick, Ellie and Chen, Quanze and Callison-Burch, Chris},
+     journal = {Transactions of the Association for Computational Linguistics},
+     volume = {4},
+     year = {2016},
+     url = {https://www.aclweb.org/anthology/Q16-1029},
+     pages = {401--415},
+ }
+ @inproceedings{post-2018-call,
+     title = "A Call for Clarity in Reporting {BLEU} Scores",
+     author = "Post, Matt",
+     booktitle = "Proceedings of the Third Conference on Machine Translation: Research Papers",
+     month = oct,
+     year = "2018",
+     address = "Belgium, Brussels",
+     publisher = "Association for Computational Linguistics",
+     url = "https://www.aclweb.org/anthology/W18-6319",
+     pages = "186--191",
+ }
+ """
+
+ _DESCRIPTION = """\
+ WIKI_SPLIT is the combination of three metrics: SARI, EXACT and SACREBLEU.
+ It can be used to evaluate the quality of machine-generated texts.
+ """
+
+
+ _KWARGS_DESCRIPTION = """
+ Calculates sari score (between 0 and 100) given a list of source and predicted
+ sentences, and a list of lists of reference sentences. It also computes the BLEU score as well as the exact match score.
+ Args:
+     sources: list of source sentences where each sentence should be a string.
+     predictions: list of predicted sentences where each sentence should be a string.
+     references: list of lists of reference sentences where each sentence should be a string.
+ Returns:
+     sari: sari score
+     sacrebleu: sacrebleu score
+     exact: exact score
+
+ Examples:
+     >>> sources=["About 95 species are currently accepted ."]
+     >>> predictions=["About 95 you now get in ."]
+     >>> references=[["About 95 species are currently known ."]]
+     >>> wiki_split = evaluate.load("wiki_split")
+     >>> results = wiki_split.compute(sources=sources, predictions=predictions, references=references)
+     >>> print(results)
+     {'sari': 21.805555555555557, 'sacrebleu': 14.535768424205482, 'exact': 0.0}
+ """
+
+
+ def normalize_answer(s):
+     """Lower text and remove punctuation, articles and extra whitespace."""
+
+     def remove_articles(text):
+         regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
+         return re.sub(regex, " ", text)
+
+     def white_space_fix(text):
+         return " ".join(text.split())
+
+     def remove_punc(text):
+         exclude = set(string.punctuation)
+         return "".join(ch for ch in text if ch not in exclude)
+
+     def lower(text):
+         return text.lower()
+
+     return white_space_fix(remove_articles(remove_punc(lower(s))))
+
+
+ def compute_exact(a_gold, a_pred):
+     return int(normalize_answer(a_gold) == normalize_answer(a_pred))
+
+
+ def compute_em(predictions, references):
+     scores = [any([compute_exact(ref, pred) for ref in refs]) for pred, refs in zip(predictions, references)]
+     return (sum(scores) / len(scores)) * 100
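
Review note: the exact-match helpers above compare strings only after normalization, so casing, punctuation and the articles "a/an/the" do not count as differences. The following is a minimal standalone sketch of that behaviour using only the standard library; `exact_match_percent` is a hypothetical name chosen for illustration, and the example sentences are made up.

```python
import re
import string

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation, drop English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_percent(predictions, references):
    """Percentage of predictions equal to at least one of their references."""
    hits = [
        any(normalize_answer(ref) == normalize_answer(pred) for ref in refs)
        for pred, refs in zip(predictions, references)
    ]
    return 100 * sum(hits) / len(hits)


# Illustrative data (not from the commit): the first prediction matches its
# first reference once articles and punctuation are stripped; the second
# prediction matches none of its references.
predictions = ["The cat sat.", "Hello world ."]
references = [["cat sat", "A dog sat ."], ["Goodbye world ."]]
print(exact_match_percent(predictions, references))  # 50.0
```

Because a prediction counts as a hit when it matches any one reference, adding more references can only keep the score the same or raise it.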
+
+
+ def SARIngram(sgrams, cgrams, rgramslist, numref):
+     rgramsall = [rgram for rgrams in rgramslist for rgram in rgrams]
+     rgramcounter = Counter(rgramsall)
+
+     sgramcounter = Counter(sgrams)
+     sgramcounter_rep = Counter()
+     for sgram, scount in sgramcounter.items():
+         sgramcounter_rep[sgram] = scount * numref
+
+     cgramcounter = Counter(cgrams)
+     cgramcounter_rep = Counter()
+     for cgram, ccount in cgramcounter.items():
+         cgramcounter_rep[cgram] = ccount * numref
+
+     # KEEP
+     keepgramcounter_rep = sgramcounter_rep & cgramcounter_rep
+     keepgramcountergood_rep = keepgramcounter_rep & rgramcounter
+     keepgramcounterall_rep = sgramcounter_rep & rgramcounter
+
+     keeptmpscore1 = 0
+     keeptmpscore2 = 0
+     for keepgram in keepgramcountergood_rep:
+         keeptmpscore1 += keepgramcountergood_rep[keepgram] / keepgramcounter_rep[keepgram]
+         # Fix an alleged bug [2] in the keep score computation.
+         # keeptmpscore2 += keepgramcountergood_rep[keepgram] / keepgramcounterall_rep[keepgram]
+         keeptmpscore2 += keepgramcountergood_rep[keepgram]
+     # Define 0/0=1 instead of 0 to give higher scores for predictions that match
+     # a target exactly.
+     keepscore_precision = 1
+     keepscore_recall = 1
+     if len(keepgramcounter_rep) > 0:
+         keepscore_precision = keeptmpscore1 / len(keepgramcounter_rep)
+     if len(keepgramcounterall_rep) > 0:
+         # Fix an alleged bug [2] in the keep score computation.
+         # keepscore_recall = keeptmpscore2 / len(keepgramcounterall_rep)
+         keepscore_recall = keeptmpscore2 / sum(keepgramcounterall_rep.values())
+     keepscore = 0
+     if keepscore_precision > 0 or keepscore_recall > 0:
+         keepscore = 2 * keepscore_precision * keepscore_recall / (keepscore_precision + keepscore_recall)
+
+     # DELETION
+     delgramcounter_rep = sgramcounter_rep - cgramcounter_rep
+     delgramcountergood_rep = delgramcounter_rep - rgramcounter
+     delgramcounterall_rep = sgramcounter_rep - rgramcounter
+     deltmpscore1 = 0
+     deltmpscore2 = 0
+     for delgram in delgramcountergood_rep:
+         deltmpscore1 += delgramcountergood_rep[delgram] / delgramcounter_rep[delgram]
+         deltmpscore2 += delgramcountergood_rep[delgram] / delgramcounterall_rep[delgram]
+     # Define 0/0=1 instead of 0 to give higher scores for predictions that match
+     # a target exactly.
+     delscore_precision = 1
+     if len(delgramcounter_rep) > 0:
+         delscore_precision = deltmpscore1 / len(delgramcounter_rep)
+
+     # ADDITION
+     addgramcounter = set(cgramcounter) - set(sgramcounter)
+     addgramcountergood = set(addgramcounter) & set(rgramcounter)
+     addgramcounterall = set(rgramcounter) - set(sgramcounter)
+
+     addtmpscore = 0
+     for addgram in addgramcountergood:
+         addtmpscore += 1
+
+     # Define 0/0=1 instead of 0 to give higher scores for predictions that match
+     # a target exactly.
+     addscore_precision = 1
+     addscore_recall = 1
+     if len(addgramcounter) > 0:
+         addscore_precision = addtmpscore / len(addgramcounter)
+     if len(addgramcounterall) > 0:
+         addscore_recall = addtmpscore / len(addgramcounterall)
+     addscore = 0
+     if addscore_precision > 0 or addscore_recall > 0:
+         addscore = 2 * addscore_precision * addscore_recall / (addscore_precision + addscore_recall)
+
+     return (keepscore, delscore_precision, addscore)
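
Review note: `SARIngram` relies on `collections.Counter` arithmetic, where `&` takes the per-key minimum count and `-` subtracts counts and drops non-positive results. The sketch below walks through the keep, delete and add sets for the unigrams of the README's partial-match example (lowercased by hand here). It assumes a single reference, so the `numref` scaling used in the function is a no-op and is omitted; the variable names are illustrative and differ from those in the commit.

```python
from collections import Counter

source = "about 95 species are currently accepted .".split()
candidate = "about 95 you now get in .".split()
reference = "about 95 species are currently known .".split()

s, c, r = Counter(source), Counter(candidate), Counter(reference)

# KEEP: unigrams present in both source and candidate ...
kept = s & c
# ... of which these are also present in the reference.
kept_good = kept & r

# DELETION: unigrams in the source that the candidate dropped ...
deleted = s - c
# ... of which these are also absent from the reference.
deleted_good = deleted - r

# ADDITION works on sets of distinct unigrams rather than counts.
added = set(c) - set(s)
added_good = added & set(r)

# Deletion precision as computed in SARIngram: per-unigram ratio, summed,
# then divided by the number of distinct deleted unigrams.
del_precision = sum(deleted_good[g] / deleted[g] for g in deleted_good) / len(deleted)

print(sorted(kept_good))     # ['.', '95', 'about']
print(dict(deleted_good))    # {'accepted': 1}
print(added_good)            # set()
print(del_precision)         # 0.25
```

Here the candidate deletes four source words but only "accepted" is also missing from the reference, which is why the unigram deletion precision is 1/4.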
+
+
+ def SARIsent(ssent, csent, rsents):
+     numref = len(rsents)
+
+     s1grams = ssent.split(" ")
+     c1grams = csent.split(" ")
+     s2grams = []
+     c2grams = []
+     s3grams = []
+     c3grams = []
+     s4grams = []
+     c4grams = []
+
+     r1gramslist = []
+     r2gramslist = []
+     r3gramslist = []
+     r4gramslist = []
+     for rsent in rsents:
+         r1grams = rsent.split(" ")
+         r2grams = []
+         r3grams = []
+         r4grams = []
+         r1gramslist.append(r1grams)
+         for i in range(0, len(r1grams) - 1):
+             if i < len(r1grams) - 1:
+                 r2gram = r1grams[i] + " " + r1grams[i + 1]
+                 r2grams.append(r2gram)
+             if i < len(r1grams) - 2:
+                 r3gram = r1grams[i] + " " + r1grams[i + 1] + " " + r1grams[i + 2]
+                 r3grams.append(r3gram)
+             if i < len(r1grams) - 3:
+                 r4gram = r1grams[i] + " " + r1grams[i + 1] + " " + r1grams[i + 2] + " " + r1grams[i + 3]
+                 r4grams.append(r4gram)
+         r2gramslist.append(r2grams)
+         r3gramslist.append(r3grams)
+         r4gramslist.append(r4grams)
+
+     for i in range(0, len(s1grams) - 1):
+         if i < len(s1grams) - 1:
+             s2gram = s1grams[i] + " " + s1grams[i + 1]
+             s2grams.append(s2gram)
+         if i < len(s1grams) - 2:
+             s3gram = s1grams[i] + " " + s1grams[i + 1] + " " + s1grams[i + 2]
+             s3grams.append(s3gram)
+         if i < len(s1grams) - 3:
+             s4gram = s1grams[i] + " " + s1grams[i + 1] + " " + s1grams[i + 2] + " " + s1grams[i + 3]
+             s4grams.append(s4gram)
+
+     for i in range(0, len(c1grams) - 1):
+         if i < len(c1grams) - 1:
+             c2gram = c1grams[i] + " " + c1grams[i + 1]
+             c2grams.append(c2gram)
+         if i < len(c1grams) - 2:
+             c3gram = c1grams[i] + " " + c1grams[i + 1] + " " + c1grams[i + 2]
+             c3grams.append(c3gram)
+         if i < len(c1grams) - 3:
+             c4gram = c1grams[i] + " " + c1grams[i + 1] + " " + c1grams[i + 2] + " " + c1grams[i + 3]
+             c4grams.append(c4gram)
+
+     (keep1score, del1score, add1score) = SARIngram(s1grams, c1grams, r1gramslist, numref)
+     (keep2score, del2score, add2score) = SARIngram(s2grams, c2grams, r2gramslist, numref)
+     (keep3score, del3score, add3score) = SARIngram(s3grams, c3grams, r3gramslist, numref)
+     (keep4score, del4score, add4score) = SARIngram(s4grams, c4grams, r4gramslist, numref)
+     avgkeepscore = sum([keep1score, keep2score, keep3score, keep4score]) / 4
+     avgdelscore = sum([del1score, del2score, del3score, del4score]) / 4
+     avgaddscore = sum([add1score, add2score, add3score, add4score]) / 4
+     finalscore = (avgkeepscore + avgdelscore + avgaddscore) / 3
+     return finalscore
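
Review note: `SARIsent` builds the 2-, 3- and 4-gram lists with three hand-written loops per sentence. As a possible simplification (a suggestion, not what the commit does), a single helper can produce the same space-joined n-grams; `ngrams` is a hypothetical name.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as space-joined strings.

    Yields an empty list when the sentence has fewer than n tokens, which
    matches what the explicit loops in SARIsent produce in that case.
    """
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


tokens = "about 95 species are".split(" ")
print(ngrams(tokens, 2))  # ['about 95', '95 species', 'species are']
print(ngrams(tokens, 3))  # ['about 95 species', '95 species are']
print(ngrams(tokens, 4))  # ['about 95 species are']
```

With such a helper, the source, candidate and reference n-gram lists could each be built in one line per order, while the unigram lists would remain the plain `split(" ")` output.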
+
+
+ def normalize(sentence, lowercase: bool = True, tokenizer: str = "13a", return_str: bool = True):
+
+     # Normalization is required for the ASSET dataset (one of the primary
+     # datasets in sentence simplification) to allow using space
+     # to split the sentence. Even though the Wiki-Auto and TURK datasets
+     # do not require normalization, we do it for consistency.
+     # Code adapted from the EASSE library [1] written by the authors of the ASSET dataset.
+     # [1] https://github.com/feralvam/easse/blob/580bba7e1378fc8289c663f864e0487188fe8067/easse/utils/preprocessing.py#L7
+
+     if lowercase:
+         sentence = sentence.lower()
+
+     if tokenizer in ["13a", "intl"]:
+         if version.parse(sacrebleu.__version__).major >= 2:
+             normalized_sent = sacrebleu.metrics.bleu._get_tokenizer(tokenizer)()(sentence)
+         else:
+             normalized_sent = sacrebleu.TOKENIZERS[tokenizer]()(sentence)
+     elif tokenizer == "moses":
+         normalized_sent = sacremoses.MosesTokenizer().tokenize(sentence, return_str=True, escape=False)
+     elif tokenizer == "penn":
+         normalized_sent = sacremoses.MosesTokenizer().penn_tokenize(sentence, return_str=True)
+     else:
+         normalized_sent = sentence
+
+     if not return_str:
+         normalized_sent = normalized_sent.split()
+
+     return normalized_sent
+
+
+ def compute_sari(sources, predictions, references):
+
+     if not (len(sources) == len(predictions) == len(references)):
+         raise ValueError("Sources length must match predictions and references lengths.")
+     sari_score = 0
+     for src, pred, refs in zip(sources, predictions, references):
+         sari_score += SARIsent(normalize(src), normalize(pred), [normalize(sent) for sent in refs])
+     sari_score = sari_score / len(predictions)
+     return 100 * sari_score
+
+
+ def compute_sacrebleu(
+     predictions,
+     references,
+     smooth_method="exp",
+     smooth_value=None,
+     force=False,
+     lowercase=False,
+     use_effective_order=False,
+ ):
+     references_per_prediction = len(references[0])
+     if any(len(refs) != references_per_prediction for refs in references):
+         raise ValueError("Sacrebleu requires the same number of references for each prediction")
+     transformed_references = [[refs[i] for refs in references] for i in range(references_per_prediction)]
+     output = sacrebleu.corpus_bleu(
+         predictions,
+         transformed_references,
+         smooth_method=smooth_method,
+         smooth_value=smooth_value,
+         force=force,
+         lowercase=lowercase,
+         use_effective_order=use_effective_order,
+     )
+     return output.score
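
Review note: `compute_sacrebleu` receives one list of references per prediction, but passes `sacrebleu.corpus_bleu` one list per reference position, so the nested list is transposed first. A minimal sketch of just that step, with no sacrebleu dependency; `transpose_references` is a hypothetical helper name and the strings are placeholders.

```python
def transpose_references(references):
    """Turn per-prediction reference lists into per-position reference streams.

    Input:  [[ref_0 for pred A, ref_1 for pred A], [ref_0 for pred B, ref_1 for pred B]]
    Output: [[ref_0 for pred A, ref_0 for pred B], [ref_1 for pred A, ref_1 for pred B]]
    """
    refs_per_prediction = len(references[0])
    if any(len(refs) != refs_per_prediction for refs in references):
        raise ValueError("Sacrebleu requires the same number of references for each prediction")
    return [[refs[i] for refs in references] for i in range(refs_per_prediction)]


references = [["ref A1", "ref A2"], ["ref B1", "ref B2"]]
print(transpose_references(references))
# [['ref A1', 'ref B1'], ['ref A2', 'ref B2']]
```

The length check matters because a ragged input cannot be transposed into aligned streams, which is why the function in the commit raises before calling sacrebleu.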
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class WikiSplit(evaluate.EvaluationModule):
+     def _info(self):
+         return evaluate.EvaluationModuleInfo(
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=datasets.Features(
+                 {
+                     "predictions": datasets.Value("string", id="sequence"),
+                     "references": datasets.Sequence(datasets.Value("string", id="sequence"), id="references"),
+                 }
+             ),
+             codebase_urls=[
+                 "https://github.com/huggingface/transformers/blob/master/src/transformers/data/metrics/squad_metrics.py",
+                 "https://github.com/cocoxu/simplification/blob/master/SARI.py",
+                 "https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/sari_hook.py",
+                 "https://github.com/mjpost/sacreBLEU",
+             ],
+             reference_urls=[
+                 "https://www.aclweb.org/anthology/Q16-1029.pdf",
+                 "https://github.com/mjpost/sacreBLEU",
+                 "https://en.wikipedia.org/wiki/BLEU",
+                 "https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213",
+             ],
+         )
+
+     def _compute(self, sources, predictions, references):
+         result = {}
+         result.update({"sari": compute_sari(sources=sources, predictions=predictions, references=references)})
+         result.update({"sacrebleu": compute_sacrebleu(predictions=predictions, references=references)})
+         result.update({"exact": compute_em(predictions=predictions, references=references)})
+         return result