lvwerra HF staff committed on
Commit 9f064dc · 1 Parent(s): 1cd4639

Update Space (evaluate main: 544f1e8a)

Files changed (4)
  1. README.md +100 -6
  2. app.py +6 -0
  3. character.py +169 -0
  4. requirements.txt +2 -0
README.md CHANGED
@@ -1,12 +1,106 @@
  ---
- title: Character
- emoji: 😻
- colorFrom: green
- colorTo: blue
+ title: CharacTER
+ emoji: 🔤
+ colorFrom: orange
+ colorTo: red
  sdk: gradio
- sdk_version: 3.12.0
+ sdk_version: 3.0.2
  app_file: app.py
  pinned: false
+ tags:
+ - evaluate
+ - metric
+ - machine-translation
+ description: >-
+   CharacTer is a character-level metric inspired by the commonly applied translation edit rate (TER).
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Metric Card for CharacTER
+
+ ## Metric Description
+ CharacTer is a character-level metric inspired by the translation edit rate (TER) metric. It is
+ defined as the minimum number of character edits required to adjust a hypothesis, until it completely matches the
+ reference, normalized by the length of the hypothesis sentence. CharacTer calculates the character level edit
+ distance while performing the shift edit on word level. Unlike the strict matching criterion in TER, a hypothesis
+ word is considered to match a reference word and could be shifted, if the edit distance between them is below a
+ threshold value. The Levenshtein distance between the reference and the shifted hypothesis sequence is computed on the
+ character level. In addition, the lengths of hypothesis sequences instead of reference sequences are used for
+ normalizing the edit distance, which effectively counters the issue that shorter translations normally achieve lower
+ TER.
+
+ ## Intended Uses
+ CharacTER was developed for machine translation evaluation.
+
+ ## How to Use
+
+ ```python
+ import evaluate
+ character = evaluate.load("character")
+
+ # Single hyp/ref
+ preds = ["this week the saudis denied information published in the new york times"]
+ refs = ["saudi arabia denied this week information published in the american new york times"]
+ results = character.compute(references=refs, predictions=preds)
+
+ # Corpus example
+ preds = ["this week the saudis denied information published in the new york times",
+          "this is in fact an estimate"]
+ refs = ["saudi arabia denied this week information published in the american new york times",
+         "this is actually an estimate"]
+ results = character.compute(references=refs, predictions=preds)
+ ```
+
+ ### Inputs
+ - **predictions**: a single prediction or a list of predictions to score. Each prediction should be a string with
+   tokens separated by spaces.
+ - **references**: a single reference or a list of references for each prediction. Each reference should be a string with
+   tokens separated by spaces.
+
+
+ ### Output Values
+
+ *=only when a list of references/hypotheses is given
+
+ - **count** (*): how many parallel sentences were processed
+ - **mean** (*): the mean CharacTER score
+ - **median** (*): the median score
+ - **std** (*): standard deviation of the score
+ - **min** (*): smallest score
+ - **max** (*): largest score
+ - **cer_scores**: all scores, one per ref/hyp pair
+
+ ### Output Example
+ ```python
+ {
+     'count': 2,
+     'mean': 0.3127282211789254,
+     'median': 0.3127282211789254,
+     'std': 0.07561653111280243,
+     'min': 0.25925925925925924,
+     'max': 0.36619718309859156,
+     'cer_scores': [0.36619718309859156, 0.25925925925925924]
+ }
+ ```
+
+ ## Citation
+ ```bibtex
+ @inproceedings{wang-etal-2016-character,
+     title = "{C}harac{T}er: Translation Edit Rate on Character Level",
+     author = "Wang, Weiyue and
+       Peter, Jan-Thorsten and
+       Rosendahl, Hendrik and
+       Ney, Hermann",
+     booktitle = "Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers",
+     month = aug,
+     year = "2016",
+     address = "Berlin, Germany",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/W16-2342",
+     doi = "10.18653/v1/W16-2342",
+     pages = "505--510",
+ }
+ ```
+
+ ## Further References
+ - Repackaged version that is used in this HF implementation: [https://github.com/bramvanroy/CharacTER](https://github.com/bramvanroy/CharacTER)
+ - Original version: [https://github.com/rwth-i6/CharacTER](https://github.com/rwth-i6/CharacTER)
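
The normalization described in the metric card (edit distance divided by the hypothesis length rather than the reference length) can be illustrated with a deliberately simplified sketch. It computes a plain character-level Levenshtein distance and skips the word-level shift step entirely, so its values will generally differ from real CharacTER scores. The names `levenshtein` and `simplified_character_rate` are hypothetical and are not part of the `cer` package or of `evaluate`.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute), two-row DP."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def simplified_character_rate(hypothesis: str, reference: str) -> float:
    """Edit distance normalized by hypothesis length.

    Assumption: the word-level shift step of CharacTER is omitted here,
    so this is only an illustration of the normalization.
    """
    if not hypothesis:
        raise ValueError("hypothesis must be non-empty")
    return levenshtein(hypothesis, reference) / len(hypothesis)


if __name__ == "__main__":
    print(simplified_character_rate("this is in fact an estimate", "this is actually an estimate"))
```

Dividing by the hypothesis length means a very short hypothesis cannot obtain a low score simply by being short, which is the effect the description attributes to this choice.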
app.py ADDED
@@ -0,0 +1,6 @@
+ import evaluate
+ from evaluate.utils import launch_gradio_widget
+
+
+ module = evaluate.load("character")
+ launch_gradio_widget(module)
character.py ADDED
@@ -0,0 +1,169 @@
+ # Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """CharacTER metric, a character-based TER variant, for machine translation."""
+ import math
+ from statistics import mean, median
+ from typing import Iterable, List, Union
+
+ import cer
+ import datasets
+ from cer import calculate_cer
+ from datasets import Sequence, Value
+
+ import evaluate
+
+
+ _CITATION = """\
+ @inproceedings{wang-etal-2016-character,
+     title = "{C}harac{T}er: Translation Edit Rate on Character Level",
+     author = "Wang, Weiyue and
+       Peter, Jan-Thorsten and
+       Rosendahl, Hendrik and
+       Ney, Hermann",
+     booktitle = "Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers",
+     month = aug,
+     year = "2016",
+     address = "Berlin, Germany",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/W16-2342",
+     doi = "10.18653/v1/W16-2342",
+     pages = "505--510",
+ }
+ """
+
+ _DESCRIPTION = """\
+ CharacTer is a character-level metric inspired by the commonly applied translation edit rate (TER). It is
+ defined as the minimum number of character edits required to adjust a hypothesis, until it completely matches the
+ reference, normalized by the length of the hypothesis sentence. CharacTer calculates the character level edit
+ distance while performing the shift edit on word level. Unlike the strict matching criterion in TER, a hypothesis
+ word is considered to match a reference word and could be shifted, if the edit distance between them is below a
+ threshold value. The Levenshtein distance between the reference and the shifted hypothesis sequence is computed on the
+ character level. In addition, the lengths of hypothesis sequences instead of reference sequences are used for
+ normalizing the edit distance, which effectively counters the issue that shorter translations normally achieve lower
+ TER."""
+
+ _KWARGS_DESCRIPTION = """
+ Calculates how good the predictions are in terms of the CharacTER metric given some references.
+ Args:
+     predictions: a list of predictions to score. Each prediction should be a string with
+         tokens separated by spaces.
+     references: a list of references for each prediction. You can also pass multiple references for each prediction,
+         so a list and in that list a sublist for each prediction for its related references. When multiple references are
+         given, the lowest (best) score is returned for that prediction-references pair.
+         Each reference should be a string with tokens separated by spaces.
+     aggregate: one of "mean", "sum", "median" to indicate how the scores of individual sentences should be
+         aggregated
+     return_all_scores: a boolean, indicating whether in addition to the aggregated score, also all individual
+         scores should be returned
+ Returns:
+     cer_score: an aggregated score across all the items, based on 'aggregate'
+     cer_scores: (optionally, if 'return_all_scores' evaluates to True) a list of all scores, one per ref/hyp pair
+ Examples:
+     >>> character_mt = evaluate.load("character")
+     >>> preds = ["this week the saudis denied information published in the new york times"]
+     >>> refs = ["saudi arabia denied this week information published in the american new york times"]
+     >>> character_mt.compute(references=refs, predictions=preds)
+     {'cer_score': 0.36619718309859156}
+     >>> preds = ["this week the saudis denied information published in the new york times",
+     ...          "this is in fact an estimate"]
+     >>> refs = ["saudi arabia denied this week information published in the american new york times",
+     ...         "this is actually an estimate"]
+     >>> character_mt.compute(references=refs, predictions=preds, aggregate="sum", return_all_scores=True)
+     {'cer_score': 0.6254564423578508, 'cer_scores': [0.36619718309859156, 0.25925925925925924]}
+     >>> preds = ["this week the saudis denied information published in the new york times"]
+     >>> refs = [["saudi arabia denied this week information published in the american new york times",
+     ...          "the saudis have denied new information published in the ny times"]]
+     >>> character_mt.compute(references=refs, predictions=preds, aggregate="median", return_all_scores=True)
+     {'cer_score': 0.36619718309859156, 'cer_scores': [0.36619718309859156]}
+ """
+
+
+ @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+ class Character(evaluate.Metric):
+     """CharacTer is a character-level metric inspired by the commonly applied translation edit rate (TER)."""
+
+     def _info(self):
+         return evaluate.MetricInfo(
+             module_type="metric",
+             description=_DESCRIPTION,
+             citation=_CITATION,
+             inputs_description=_KWARGS_DESCRIPTION,
+             features=[
+                 datasets.Features(
+                     {"predictions": Value("string", id="prediction"), "references": Value("string", id="reference")}
+                 ),
+                 datasets.Features(
+                     {
+                         "predictions": Value("string", id="prediction"),
+                         "references": Sequence(Value("string", id="reference"), id="references"),
+                     }
+                 ),
+             ],
+             homepage="https://github.com/bramvanroy/CharacTER",
+             codebase_urls=["https://github.com/bramvanroy/CharacTER", "https://github.com/rwth-i6/CharacTER"],
+         )
+
+     def _compute(
+         self,
+         predictions: Iterable[str],
+         references: Union[Iterable[str], Iterable[Iterable[str]]],
+         aggregate: str = "mean",
+         return_all_scores: bool = False,
+     ):
+         if aggregate not in ("mean", "sum", "median"):
+             raise ValueError("'aggregate' must be one of 'sum', 'mean', 'median'")
+
+         predictions = [p.split() for p in predictions]
+         # Predictions and references have the same internal types (both lists of strings),
+         # so only one reference per prediction
+         if isinstance(references[0], str):
+             references = [r.split() for r in references]
+
+             scores_d = cer.calculate_cer_corpus(predictions, references)
+             cer_scores: List[float] = scores_d["cer_scores"]
+
+             if aggregate == "sum":
+                 score = sum(cer_scores)
+             elif aggregate == "mean":
+                 score = scores_d["mean"]
+             else:
+                 score = scores_d["median"]
+         else:
+             # In the case of multiple references, we just find the "best score",
+             # i.e., the reference that the prediction is closest to, i.e. the lowest characTER score
+             references = [[r.split() for r in refs] for refs in references]
+
+             cer_scores = []
+             for pred, refs in zip(predictions, references):
+                 min_score = math.inf
+                 for ref in refs:
+                     score = calculate_cer(pred, ref)
+
+                     if score < min_score:
+                         min_score = score
+
+                 cer_scores.append(min_score)
+
+             if aggregate == "sum":
+                 score = sum(cer_scores)
+             elif aggregate == "mean":
+                 score = mean(cer_scores)
+             else:
+                 score = median(cer_scores)
+
+         # Return scores
+         if return_all_scores:
+             return {"cer_score": score, "cer_scores": cer_scores}
+         else:
+             return {"cer_score": score}
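
The best-of-references selection and the `aggregate` option in `_compute` can be tried out without installing `cer` by passing in a scoring function. The following is a minimal sketch under that assumption: `best_scores`, `aggregate_scores`, and `toy_score` are hypothetical names introduced only for illustration, and `toy_score` is a stand-in that is not CharacTER. The last lines reuse the two per-pair scores quoted in the docstring examples to show how the aggregated values relate to them.

```python
import math
from statistics import mean, median
from typing import Callable, List, Sequence


def best_scores(
    predictions: Sequence[str],
    references: Sequence[Sequence[str]],
    score_fn: Callable[[str, str], float],
) -> List[float]:
    """For each prediction, keep the lowest (best) score over its references."""
    return [min(score_fn(pred, ref) for ref in refs) for pred, refs in zip(predictions, references)]


def aggregate_scores(scores: Sequence[float], aggregate: str = "mean") -> float:
    """Combine per-pair scores the way the 'aggregate' argument describes."""
    if aggregate not in ("mean", "sum", "median"):
        raise ValueError("'aggregate' must be one of 'sum', 'mean', 'median'")
    if aggregate == "sum":
        return sum(scores)
    if aggregate == "mean":
        return mean(scores)
    return median(scores)


def toy_score(pred: str, ref: str) -> float:
    """Stand-in scorer (NOT CharacTER): share of prediction tokens missing from the reference."""
    pred_tokens = pred.split()
    ref_tokens = set(ref.split())
    return sum(token not in ref_tokens for token in pred_tokens) / len(pred_tokens)


# Per-pair scores quoted in the docstring examples of character.py.
docstring_scores = [0.36619718309859156, 0.25925925925925924]

if __name__ == "__main__":
    print(best_scores(["a b c d"], [["a b x y", "a b c z"]], toy_score))
    print(aggregate_scores(docstring_scores, "sum"))
    print(aggregate_scores(docstring_scores, "mean"))
```

Injecting the scorer keeps the selection and aggregation logic testable on its own; in the actual module the per-pair scores come from `calculate_cer` and `cer.calculate_cer_corpus`.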
requirements.txt ADDED
@@ -0,0 +1,2 @@
+ git+https://github.com/huggingface/evaluate@544f1e8a5f30663d59ed6ba94b2b7380e8b4c309
+ cer>=1.2.0