---
title: NIST_MT
emoji: 🤗 
colorFrom: purple
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- metric
- machine-translation
description: 
  DARPA commissioned NIST to develop an MT evaluation facility based on the BLEU score.
---

# Metric Card for NIST's MT metric


## Metric Description
DARPA commissioned NIST to develop a machine translation (MT) evaluation facility based on the BLEU
score. The official script NIST uses to compute both the BLEU and NIST scores is
`mteval-14.pl`. The main differences between the two scores are:

 - BLEU uses the geometric mean of the n-gram overlaps, whereas NIST uses the arithmetic mean.
 - NIST uses a different brevity penalty.
 - The NIST score from `mteval-14.pl` has a self-contained tokenizer (the Hugging Face implementation relies on NLTK's
   implementation of the NIST-specific tokenizer).
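
The first two differences can be illustrated with a small, self-contained sketch. It is not the `mteval-14.pl` or NLTK code: the precision values are made up for illustration, the helper names are hypothetical, and NIST's information weighting of n-grams is left out. The NIST penalty is written under the assumption that it equals 0.5 when the hypothesis is two-thirds as long as the reference.

```python
import math


def geometric_mean(values):
    """BLEU-style aggregation: a single zero precision drives the result to zero."""
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


def arithmetic_mean(values):
    """NIST-style aggregation: a zero precision lowers the result but does not erase it."""
    return sum(values) / len(values)


def bleu_brevity_penalty(ref_len, hyp_len):
    """Standard BLEU penalty: exp(1 - r/c) for hypotheses shorter than the reference."""
    if hyp_len >= ref_len:
        return 1.0
    if hyp_len == 0:
        return 0.0
    return math.exp(1 - ref_len / hyp_len)


def nist_brevity_penalty(ref_len, hyp_len):
    """Assumed NIST penalty: exp(beta * log(ratio)^2), with beta set so that
    a length ratio of 2/3 gives a penalty of 0.5."""
    ratio = hyp_len / ref_len
    if 0 < ratio < 1:
        beta = math.log(0.5) / math.log(1.5) ** 2
        return math.exp(beta * math.log(ratio) ** 2)
    return max(min(ratio, 1.0), 0.0)


# Illustrative 1- to 4-gram precisions (made-up values, not from a real system).
precisions = [0.8, 0.5, 0.25, 0.0]
print(geometric_mean(precisions), arithmetic_mean(precisions))

# Hypothesis of 20 tokens against a reference of 30 tokens.
print(bleu_brevity_penalty(30, 20), nist_brevity_penalty(30, 20))
```

With no matching 4-grams, the geometric mean collapses to zero while the arithmetic mean stays positive, which is why the two scores can rank short or partially matching hypotheses differently.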

## Intended Uses
NIST was developed for machine translation evaluation.

## How to Use

```python
import evaluate

# Load the NIST MT metric
nist_mt = evaluate.load("nist_mt")

hypothesis1 = "It is a guide to action which ensures that the military always obeys the commands of the party"
reference1 = "It is a guide to action that ensures that the military will forever heed Party commands"
reference2 = "It is the guiding principle which guarantees the military forces always being under the command of the Party"

# `compute` takes keyword arguments only
nist_mt.compute(predictions=hypothesis1, references=[reference1, reference2])
# {'nist_mt': 3.3709935957649324}
```

### Inputs
- **predictions**: tokenized predictions to score. For sentence-level NIST, a list of tokens (`str`);
     for corpus-level NIST, a list (sentences) of lists of tokens (`str`).
- **references**: one or more tokenized references for each prediction. For sentence-level NIST, a
     list (references) of lists of tokens (`str`); for corpus-level NIST, a list (corpus) of lists
     (references) of lists of tokens (`str`).
- **n**: highest n-gram order.
- **tokenize_kwargs**: arguments passed to the tokenizer (see: https://github.com/nltk/nltk/blob/90fa546ea600194f2799ee51eaf1b729c128711e/nltk/tokenize/nist.py#L139)
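
The nesting described above is easier to see in code. The sketch below only builds the input structures and does not call the metric. The `tokenize` helper is a hypothetical stand-in that splits on whitespace, not the NIST tokenizer, and `ngrams_up_to` is an illustrative helper showing which units an order-`n` score would count.

```python
def tokenize(text):
    # Hypothetical stand-in tokenizer: plain whitespace splitting, not the NIST tokenizer.
    return text.split()


def ngrams_up_to(tokens, n):
    """Return all n-grams of order 1..n as tuples."""
    return [
        tuple(tokens[i:i + order])
        for order in range(1, n + 1)
        for i in range(len(tokens) - order + 1)
    ]


hypothesis = "the cat sat on the mat"
reference_a = "the cat is on the mat"
reference_b = "a cat sat on the mat"

# Sentence level: one list of tokens, scored against a list of tokenized references.
sentence_prediction = tokenize(hypothesis)                             # list[str]
sentence_references = [tokenize(reference_a), tokenize(reference_b)]   # list[list[str]]

# Corpus level: one more list around each input, with one entry per sentence.
corpus_predictions = [sentence_prediction]                             # list[list[str]]
corpus_references = [sentence_references]                              # list[list[list[str]]]

# With n=2, the six-token hypothesis contributes its unigrams and bigrams.
print(ngrams_up_to(sentence_prediction, 2))
```

In both cases the references carry exactly one more level of nesting than the predictions, because each prediction may be compared against several references.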

### Output Values
- **nist_mt** (`float`): NIST score

Output Example:
```python
{'nist_mt': 3.3709935957649324}
```


## Citation
```bibtex
@inproceedings{10.5555/1289189.1289273,
    author = {Doddington, George},
    title = {Automatic Evaluation of Machine Translation Quality Using N-Gram Co-Occurrence Statistics},
    year = {2002},
    publisher = {Morgan Kaufmann Publishers Inc.},
    address = {San Francisco, CA, USA},
    booktitle = {Proceedings of the Second International Conference on Human Language Technology Research},
    pages = {138–145},
    numpages = {8},
    location = {San Diego, California},
    series = {HLT '02}
}
```

## Further References

This Hugging Face implementation uses [the NLTK implementation](https://github.com/nltk/nltk/blob/develop/nltk/translate/nist_score.py).