---
title: FairEval
tags:
- evaluate
- metric
description: "Fair Evaluation for Sequence Labeling"
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
---

# Fair Evaluation for Sequence Labeling

## Metric Description
The traditional evaluation of labeled spans in NLP with precision, recall, and F1-score leads to double penalties for 
close-to-correct annotations. As [Manning (2006)](https://nlpers.blogspot.com/2006/08/doing-named-entity-recognition-dont.html) 
argues in an article about named entity recognition, this can lead to undesirable effects when systems are optimized for these traditional metrics.
To address these issues, this metric provides an implementation of FairEval, proposed by [Ortmann (2022)](https://aclanthology.org/2022.lrec-1.150.pdf).

## How to Use
FairEval outputs the error counts (TP, FP, etc.) and the resulting scores (precision, recall and F1) for a list of 
predicted spans compared against a list of reference spans. The user can choose between traditional and fair error 
counts and scores via the **mode** argument.

The user can also choose, through the **error_format** argument, whether the error parameters (TP, FP, ...) are reported 
as absolute counts, as a percentage of the total number of errors, or as a percentage of the total number of ground-truth entities.

The minimal example is:

```python
import evaluate

faireval = evaluate.load("hpi-dhc/FairEval")
pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O']]
ref =  [['O', 'O', 'O',      'B-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O']]
results = faireval.compute(predictions=pred, references=ref)
```

### Inputs
FairEval handles input annotations in the same way as seqeval. The supported formats are IOB1, IOB2, IOE1, IOE2 and IOBES. 
Predicted sentences must have the same number of tokens as the references.
- **predictions** *(list)*: a list of lists of predicted labels, i.e. estimated targets as returned by a tagger.
- **references** *(list)*: list of ground truth reference labels. 

The optional arguments are listed below; a short usage sketch with these arguments follows the list.
- **mode** *(str)*: 'fair', 'traditional' or 'weighted'. Controls the desired output. The default value is 'fair'.
  - 'traditional': equivalent to seqeval's 'strict' mode. Bear in mind that seqeval's default mode is 'relaxed', which does not correspond to any of FairEval's modes.
  - 'fair': default fair score calculation. Fair will also show traditional scores for comparison.
  - 'weighted': custom score calculation with the weights passed. Weighted will also show traditional scores for comparison.
- **weights** *(dict)*: dictionary with the weight of each error for the custom score calculation.
- **error_format** *(str)*: 'count', 'error_ratio' or 'entity_ratio'. Controls the desired output for TP, FP, BE, LE, etc. The default value is 'count'.
  - 'count': absolute count of each parameter.
  - 'error_ratio': percentage with respect to the total number of errors that each parameter represents.
  - 'entity_ratio': percentage with respect to the total number of ground-truth entities that each parameter represents.
- **zero_division** *(str)*: the value to substitute as a metric value when encountering zero division. Should be one of 0, 1 or "warn"; "warn" acts like 0, but a warning is also raised.
- **suffix** *(boolean)*: True if the IOB tag is a suffix (after type) instead of a prefix (before type), False otherwise. The default value is False, i.e. the IOB tag is a prefix (before type).
- **scheme** *(str)*: the target tagging scheme, which can be one of [IOB1, IOB2, IOE1, IOE2, IOBES, BILOU]. The default value is None.
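
A minimal usage sketch combining several of these arguments. The toy labels are made up purely for illustration; the argument values follow the documented options above.

```python
import evaluate

faireval = evaluate.load("hpi-dhc/FairEval")

# Made-up toy annotations for illustration
pred = [['B-PER', 'I-PER', 'O', 'B-LOC']]
ref  = [['B-PER', 'I-PER', 'O', 'B-ORG']]

# Traditional (seqeval-strict) scores, reported as absolute error counts
trad = faireval.compute(predictions=pred, references=ref,
                        mode='traditional', error_format='count')

# Fair scores, with errors reported relative to the number of reference entities,
# zero divisions mapped to 0, and an explicit IOB2 scheme with prefixed tags
fair = faireval.compute(predictions=pred, references=ref,
                        mode='fair', error_format='entity_ratio',
                        zero_division=0, scheme='IOB2', suffix=False)
```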

### Output Values
A dictionary with:
 - Overall error parameter count (or ratio) and resulting scores.
 - A nested dictionary per label with its respective error parameter count (or ratio) and resulting scores.

If mode is 'traditional', the error parameters shown are the classical TP, FP and FN. If mode is 'fair' or 'weighted', 
TP remains the same, FP and FN are shown as per the fair definition, and the additional error types BE (boundary error), 
LE (labeling error) and LBE (labeling-boundary error) are shown.
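
As a rough guide to how the fair counts turn into scores, the following sketch is consistent with the example outputs below (it is not the metric's internal code): BE, LE and LBE each contribute with weight 0.5 to the denominators of both precision and recall.

```python
# Fair precision/recall/F1 from error counts, as reflected by the examples below
def fair_scores(TP, FP, FN, LE, BE, LBE):
    half_errors = 0.5 * (LE + BE + LBE)   # partial errors count half
    precision = TP / (TP + FP + half_errors)
    recall = TP / (TP + FN + half_errors)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Overall counts from the first example below:
print(fair_scores(TP=2, FP=0, FN=1, LE=1, BE=1, LBE=1))
# ≈ (0.5714, 0.4444, 0.5)
```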

### Examples
Considering the following input annotated sentences:
```python
>>> r1 = ['O', 'O', 'B-PER', 'I-PER', 'O', 'B-PER']
>>> p1 = ['O', 'O', 'B-PER', 'I-PER', 'O', 'O'    ] #1FN
>>> 
>>> r2 = ['O',     'B-INT', 'B-OUT']
>>> p2 = ['B-INT', 'I-INT', 'B-OUT'] #1BE  
>>> 
>>> r3 = ['B-INT', 'I-INT', 'B-OUT']
>>> p3 = ['B-OUT', 'O',     'B-PER'] #1LBE, 1LE   
>>> 
>>> y_true = [r1, r2, r3]
>>> y_pred = [p1, p2, p3]
```

The output for different modes and error_formats is:
```python
>>> faireval.compute(predictions=y_pred, references=y_true, mode='fair', error_format='count')
{"PER": {"precision": 1.0, "recall": 0.5, "f1": 0.6666,
         "trad_prec": 0.5, "trad_rec": 0.5, "trad_f1": 0.5,
         "TP": 1, "FP": 0.0, "FN": 1.0, "LE": 0.0, "BE": 0.0, "LBE": 0.0},
 "INT": {"precision": 0.0, "recall": 0.0, "f1": 0.0,
         "trad_prec": 0.0, "trad_rec": 0.0, "trad_f1": 0.0,
         "TP": 0, "FP": 0.0, "FN": 0.0, "LE": 0.0, "BE": 1.0, "LBE": 1.0},
 "OUT": {"precision": 0.6666, "recall": 0.6666, "f1": 0.666,
         "trad_prec": 0.5, "trad_rec": 0.5, "trad_f1": 0.5,
         "TP": 1, "FP": 0.0, "FN": 0.0, "LE": 1.0, "BE": 0.0, "LBE": 0.0},
 "overall_precision": 0.5714, "overall_recall": 0.4444, "overall_f1": 0.5,
 "overall_trad_prec": 0.4, "overall_trad_rec": 0.3333, "overall_trad_f1": 0.3636, 
 "TP": 2, "FP": 0.0, "FN": 1.0, "LE": 1.0, "BE": 1.0, "LBE": 1.0}
```

```python
>>> faireval.compute(predictions=y_pred, references=y_true, mode='traditional', error_format='count')
{"PER": {"precision": 0.5, "recall": 0.5, "f1": 0.5,
         "TP": 1, "FP": 1.0, "FN": 1.0},
 "INT": {"precision": 0.0, "recall": 0.0, "f1": 0.0,
         "TP": 0, "FP": 1.0, "FN": 2.0},
 "OUT": {"precision": 0.5, "recall": 0.5, "f1": 0.5,
         "TP": 1, "FP": 1.0, "FN": 1.0},
 "overall_precision": 0.4, "overall_recall": 0.3333, "overall_f1": 0.3636,
 "TP": 2, "FP": 3.0, "FN": 4.0}
```

```python
>>> faireval.compute(predictions=y_pred, references=y_true, mode='traditional', error_format='error_ratio')
{"PER": {"precision": 0.5, "recall": 0.5, "f1": 0.5,
         "TP": 1, "FP": 0.1428, "FN": 0.1428},
 "INT": {"precision": 0.0, "recall": 0.0, "f1": 0.0,
         "TP": 0, "FP": 0.1428, "FN": 0.2857},
 "OUT": {"precision": 0.5, "recall": 0.5, "f1": 0.5,
         "TP": 1, "FP": 0.1428, "FN": 0.1428},
 "overall_precision": 0.4, "overall_recall": 0.3333, "overall_f1": 0.3636,
 "TP": 2, "FP": 0.4285, "FN": 0.5714}
```

### Values from Popular Papers

#### CoNLL2003
Computing the evaluation metrics on the results from [this model](https://huggingface.co/elastic/distilbert-base-uncased-finetuned-conll03-english) 
run on the test split of [CoNLL2003 dataset](https://huggingface.co/datasets/conll2003), we obtain the following F1-Scores:

| F1   Scores     | overall | location | miscellaneous | organization | person |
|-----------------|--------:|---------:|-------------:|-------------:|-------:|
| fair            | 0.94    | 0.96     | 0.85         | 0.92         | 0.97   |
| traditional     | 0.90    | 0.92     | 0.79         | 0.87         | 0.96   |
| seqeval strict  | 0.90    | 0.92     | 0.79         | 0.87         | 0.96   |
| seqeval relaxed | 0.90    | 0.92     | 0.78         | 0.87         | 0.96   |

With error count:

|     | overall (trad) | overall (fair)     | location (trad)|  location (fair)      | miscellaneous (trad)| miscellaneous (fair)    | organization (trad)| organization (fair)  | person (trad)| person (fair)     |
|-----|--------:|-----:|---------:|-----:|-------------:|----:|-------------:|-----:|-------:|-----:|
| TP  | 5104    | 5104 | 1545     | 1545 | 561          | 561 | 1452         | 1452 | 1546   | 1546 |
| FP  | 534     | 126  | 128      | 20   | 154          | 48  | 208          | 47   | 44     | 11   |
| FN  | 544     | 124  | 123      | 13   | 141          | 47  | 209          | 47   | 71     | 17   |
| LE  |         | 219  |          | 62   |              | 41  |              | 73   |        | 43   |
| BE  |         | 126  |          | 16   |              | 46  |              | 53   |        | 11   |
| LBE |         | 87   |          | 32   |              | 13  |              | 41   |        | 1    |
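
The table above can be reproduced approximately with a sketch along the following lines. It assumes the standard `transformers`/`datasets` APIs, takes the prediction of each word's first sub-token as the word label, and assumes that the model's `id2label` mapping uses the same IOB strings as the dataset; it is an illustration, not the exact script behind the tables.

```python
import evaluate
import torch
from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "elastic/distilbert-base-uncased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

dataset = load_dataset("conll2003", split="test")
label_names = dataset.features["ner_tags"].feature.names  # integer id -> IOB string

predictions, references = [], []
for example in dataset:
    words = example["tokens"]
    enc = tokenizer(words, is_split_into_words=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        pred_ids = model(**enc).logits[0].argmax(dim=-1).tolist()
    # Keep the prediction of the first sub-token of every word
    word_preds, seen = [], set()
    for token_idx, word_id in enumerate(enc.word_ids(0)):
        if word_id is None or word_id in seen:
            continue
        seen.add(word_id)
        word_preds.append(model.config.id2label[pred_ids[token_idx]])
    predictions.append(word_preds)
    references.append([label_names[t] for t in example["ner_tags"]][: len(word_preds)])

faireval = evaluate.load("hpi-dhc/FairEval")
print(faireval.compute(predictions=predictions, references=references, mode="fair"))
```

Swapping the model and dataset identifiers for `muhtasham/bert-small-finetuned-wnut17-ner` and `wnut_17` yields the WNUT-17 tables below in the same way.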

#### WNUT-17
Computing the evaluation metrics on the results from [this model](https://huggingface.co/muhtasham/bert-small-finetuned-wnut17-ner) 
run on the test split of [WNUT-17 dataset](https://huggingface.co/datasets/wnut_17), we obtain the following F1-Scores:

|                 | overall | location | group  | person | creative work | corporation | product |
|-----------------|--------:|---------:|-------:|-------:|--------------:|------------:|--------:|
| fair            |  0.37 |   0.58 | 0.02 | 0.58 |           0.0 |      0.03 |     0.0 |
| traditional     |  0.35 |   0.53 | 0.02 | 0.55 |           0.0 |      0.02 |     0.0 |
| seqeval strict  |  0.35 |   0.53 | 0.02 | 0.55 |           0.0 |      0.02 |     0.0 |
| seqeval relaxed |  0.34 |   0.49 | 0.02 | 0.55 |           0.0 |      0.02 |     0.0 |

With error count:

|     | overall (trad)| overall (fair)    | location (trad)| location (fair)   | group (trad)| group (fair)    | person (trad)|  person (fair)   | creative work (trad)| creative work (fair)    | corporation (trad)| corporation (fair)   | product (trad)| product (fair)    |
|-----|--------:|----:|---------:|---:|------:|----:|-------:|----:|--------------:|----:|------------:|---:|--------:|----:|
| TP  |     255 | 255 |       67 | 67 |     2 | 2   |    185 | 185 |             0 | 0   |           1 | 1  |       0 | 0   |
| FP  |     135 | 31  |       38 | 10 |    20 | 3   |     60 | 16  |             0 | 0   |          17 | 2  |       0 | 0   |
| FN  |     824 | 725 |       83 | 71 |   163 | 135 |    244 | 233 |           142 | 120 |          65 | 54 |     127 | 112 |
| LE  |         | 47  |          | 4  |       | 18  |        | 2   |               | 6   |             | 7  |         | 10  |
| BE  |         | 30  |          | 10 |       | 4   |        | 13  |               | 0   |             | 3  |         | 0   |
| LBE |         | 29  |          | 1  |       | 6   |        | 0   |               | 16  |             | 1  |         | 5   |

## Limitations and Bias
The metric is restricted to the input schemes supported by seqeval. For example, it does not support numerical label 
inputs (odd integers for Beginning, even integers for Inside and zero for Outside); such tags have to be mapped to IOB strings first, as sketched below.
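
A minimal conversion sketch (the label inventory below is made up for illustration; with a Hugging Face dataset, the label list of the `ner_tags` feature can be used instead):

```python
id2label = ["O", "B-PER", "I-PER"]  # made-up label inventory for illustration
numeric_refs = [[0, 1, 2, 0]]
string_refs = [[id2label[tag] for tag in seq] for seq in numeric_refs]
# string_refs == [['O', 'B-PER', 'I-PER', 'O']] can now be passed to faireval.compute
```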

The choice of custom weights for the weighted evaluation is up to the user and therefore subjective. Neither weighted nor 
fair evaluation results can be directly compared to the traditional span-based metrics reported for other dataset-model pairs.

## Citation
Ortmann, Katrin. 2022. Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans. In *Proceedings of the Language Resources and Evaluation Conference (LREC)*, Marseille, France, pages 1400–1407. [PDF](https://aclanthology.org/2022.lrec-1.150.pdf)

```bibtex
@inproceedings{ortmann2022,
    title = {Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans},
    author = {Katrin Ortmann},
    url = {https://aclanthology.org/2022.lrec-1.150},
    year = {2022},
    date = {2022-06-21},
    booktitle = {Proceedings of the Language Resources and Evaluation Conference (LREC)},
    pages = {1400-1407},
    publisher = {European Language Resources Association},
    address = {Marseille, France},
    pubstate = {published},
    type = {inproceedings}
}
```