# Natural Language Toolkit: GLEU Score
#
# Copyright (C) 2001-2023 NLTK Project
# Authors:
# Contributors: Mike Schuster, Michael Wayne Goodman, Liling Tan
# URL: <https://www.nltk.org/>
# For license information, see LICENSE.TXT

""" GLEU score implementation. """

from collections import Counter

from nltk.util import everygrams, ngrams


def sentence_gleu(references, hypothesis, min_len=1, max_len=4):
    """

    Calculates the sentence level GLEU (Google-BLEU) score described in



        Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi,

        Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey,

        Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser,

        Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens,

        George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith,

        Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes,

        Jeffrey Dean. (2016) Google’s Neural Machine Translation System:

        Bridging the Gap between Human and Machine Translation.

        eprint arXiv:1609.08144. https://arxiv.org/pdf/1609.08144v2.pdf

        Retrieved on 27 Oct 2016.



    From Wu et al. (2016):

        "The BLEU score has some undesirable properties when used for single

         sentences, as it was designed to be a corpus measure. We therefore

         use a slightly different score for our RL experiments which we call

         the 'GLEU score'. For the GLEU score, we record all sub-sequences of

         1, 2, 3 or 4 tokens in output and target sequence (n-grams). We then

         compute a recall, which is the ratio of the number of matching n-grams

         to the number of total n-grams in the target (ground truth) sequence,

         and a precision, which is the ratio of the number of matching n-grams

         to the number of total n-grams in the generated output sequence. Then

         GLEU score is simply the minimum of recall and precision. This GLEU

         score's range is always between 0 (no matches) and 1 (all match) and

         it is symmetrical when switching output and target. According to

         our experiments, GLEU score correlates quite well with the BLEU

         metric on a corpus level but does not have its drawbacks for our per

         sentence reward objective."



    Note: The initial implementation only allowed a single reference, but now

          a list of references is required (which is consistent with

          bleu_score.sentence_bleu()).



    The infamous "the the the ... " example



        >>> ref = 'the cat is on the mat'.split()

        >>> hyp = 'the the the the the the the'.split()

        >>> sentence_gleu([ref], hyp)  # doctest: +ELLIPSIS

        0.0909...



    An example to evaluate normal machine translation outputs



        >>> ref1 = str('It is a guide to action that ensures that the military '

        ...            'will forever heed Party commands').split()

        >>> hyp1 = str('It is a guide to action which ensures that the military '

        ...            'always obeys the commands of the party').split()

        >>> hyp2 = str('It is to insure the troops forever hearing the activity '

        ...            'guidebook that party direct').split()

        >>> sentence_gleu([ref1], hyp1) # doctest: +ELLIPSIS

        0.4393...

        >>> sentence_gleu([ref1], hyp2) # doctest: +ELLIPSIS

        0.1206...



    :param references: a list of reference sentences

    :type references: list(list(str))

    :param hypothesis: a hypothesis sentence

    :type hypothesis: list(str)

    :param min_len: The minimum order of n-gram this function should extract.

    :type min_len: int

    :param max_len: The maximum order of n-gram this function should extract.

    :type max_len: int

    :return: the sentence level GLEU score.

    :rtype: float

    """
    return corpus_gleu([references], [hypothesis], min_len=min_len, max_len=max_len)


def corpus_gleu(list_of_references, hypotheses, min_len=1, max_len=4):
    """

    Calculate a single corpus-level GLEU score (aka. system-level GLEU) for all

    the hypotheses and their respective references.



    Instead of averaging the sentence level GLEU scores (i.e. macro-average

    precision), Wu et al. (2016) sum up the matching tokens and the max of

    hypothesis and reference tokens for each sentence, then compute using the

    aggregate values.



    From Mike Schuster (via email):

        "For the corpus, we just add up the two statistics n_match and

         n_all = max(n_all_output, n_all_target) for all sentences, then

         calculate gleu_score = n_match / n_all, so it is not just a mean of

         the sentence gleu scores (in our case, longer sentences count more,

         which I think makes sense as they are more difficult to translate)."



    >>> hyp1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which',

    ...         'ensures', 'that', 'the', 'military', 'always',

    ...         'obeys', 'the', 'commands', 'of', 'the', 'party']

    >>> ref1a = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',

    ...          'ensures', 'that', 'the', 'military', 'will', 'forever',

    ...          'heed', 'Party', 'commands']

    >>> ref1b = ['It', 'is', 'the', 'guiding', 'principle', 'which',

    ...          'guarantees', 'the', 'military', 'forces', 'always',

    ...          'being', 'under', 'the', 'command', 'of', 'the', 'Party']

    >>> ref1c = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the',

    ...          'army', 'always', 'to', 'heed', 'the', 'directions',

    ...          'of', 'the', 'party']



    >>> hyp2 = ['he', 'read', 'the', 'book', 'because', 'he', 'was',

    ...         'interested', 'in', 'world', 'history']

    >>> ref2a = ['he', 'was', 'interested', 'in', 'world', 'history',

    ...          'because', 'he', 'read', 'the', 'book']



    >>> list_of_references = [[ref1a, ref1b, ref1c], [ref2a]]

    >>> hypotheses = [hyp1, hyp2]

    >>> corpus_gleu(list_of_references, hypotheses) # doctest: +ELLIPSIS

    0.5673...



    The example below show that corpus_gleu() is different from averaging

    sentence_gleu() for hypotheses



    >>> score1 = sentence_gleu([ref1a], hyp1)

    >>> score2 = sentence_gleu([ref2a], hyp2)

    >>> (score1 + score2) / 2 # doctest: +ELLIPSIS

    0.6144...



    :param list_of_references: a list of reference sentences, w.r.t. hypotheses

    :type list_of_references: list(list(list(str)))

    :param hypotheses: a list of hypothesis sentences

    :type hypotheses: list(list(str))

    :param min_len: The minimum order of n-gram this function should extract.

    :type min_len: int

    :param max_len: The maximum order of n-gram this function should extract.

    :type max_len: int

    :return: The corpus-level GLEU score.

    :rtype: float

    """
    # sanity check
    assert len(list_of_references) == len(
        hypotheses
    ), "The number of hypotheses and their reference(s) should be the same"

    # sum matches and max-token-lengths over all sentences
    corpus_n_match = 0
    corpus_n_all = 0

    for references, hypothesis in zip(list_of_references, hypotheses):
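        # everygrams() extracts all n-grams of every order from min_len to
        # max_len, e.g. with min_len=1, max_len=2, ['a', 'b'] yields the
        # unigrams ('a',), ('b',) and the bigram ('a', 'b').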
        hyp_ngrams = Counter(everygrams(hypothesis, min_len, max_len))
        tpfp = sum(hyp_ngrams.values())  # True positives + False positives.

        hyp_counts = []
        for reference in references:
            ref_ngrams = Counter(everygrams(reference, min_len, max_len))
            tpfn = sum(ref_ngrams.values())  # True positives + False negatives.

            # Counter intersection keeps, for each n-gram, the minimum of its
            # hypothesis and reference counts, i.e. matches are clipped.
            overlap_ngrams = ref_ngrams & hyp_ngrams
            tp = sum(overlap_ngrams.values())  # True positives.

            # While GLEU is defined as the minimum of precision and
            # recall, we can reduce the number of division operations by one by
            # instead finding the maximum of the denominators for the precision
            # and recall formulae, since the numerators are the same:
            #     precision = tp / tpfp
            #     recall = tp / tpfn
            #     gleu_score = min(precision, recall) == tp / max(tpfp, tpfn)
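            #     e.g. tp=5, tpfp=7, tpfn=10: min(5/7, 5/10) = 0.5 = 5/max(7, 10)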
            n_all = max(tpfp, tpfn)

            if n_all > 0:
                hyp_counts.append((tp, n_all))

        # use the reference yielding the highest score
        if hyp_counts:
            n_match, n_all = max(hyp_counts, key=lambda hc: hc[0] / hc[1])
            corpus_n_match += n_match
            corpus_n_all += n_all

    # corner case: empty corpus or empty references---don't divide by zero!
    if corpus_n_all == 0:
        gleu_score = 0.0
    else:
        gleu_score = corpus_n_match / corpus_n_all

    return gleu_score
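

# A minimal usage sketch (illustrative only, not part of the module's doctest
# examples): tokenization below is plain whitespace splitting, and the scores
# depend on the n-gram orders chosen via min_len/max_len (1 to 4 by default).
if __name__ == "__main__":  # pragma: no cover
    reference = "the cat is on the mat".split()
    hypothesis = "the cat sat on the mat".split()

    # Per-sentence score against a list containing a single reference.
    print("sentence GLEU:", sentence_gleu([reference], hypothesis))

    # Corpus-level score over a one-sentence "corpus"; note the extra nesting:
    # one list of references per hypothesis.
    print("corpus GLEU:  ", corpus_gleu([[reference]], [hypothesis]))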