---
title: FairEval
tags:
- evaluate
- metric
description: "TODO: add a description here"
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
---
# Metric: Fair Evaluation
## Metric Description
The traditional evaluation of NLP labeled spans with precision, recall, and F1-score leads to double penalties for
close-to-correct annotations. As [Manning (2006)](https://nlpers.blogspot.com/2006/08/doing-named-entity-recognition-dont.html)
argues in an article about named entity recognition, this can lead to undesirable effects when systems are optimized for these traditional metrics.
To address these issues, this metric provides an implementation of FairEval, proposed by [Ortmann (2022)](https://aclanthology.org/2022.lrec-1.150.pdf).
## How to Use
FairEval compares a list of predicted spans against a list of reference spans and outputs the error counts (TP, FP, etc.)
and the resulting scores (precision, recall, and F1). The **mode** argument switches between traditional and fair
error counts and scores.
The minimal example is:
```python
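import evaluate

# Load the metric from the Hugging Face Hub
faireval = evaluate.load("hpi-dhc/FairEval")

# One sentence per inner list; predictions and references must have equal token counts
pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O']]
ref = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O']]
results = faireval.compute(predictions=pred, references=ref)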
```
### Inputs
FairEval handles input annotations in the same way as seqeval. The supported formats are IOB1, IOB2, IOE1, IOE2 and IOBES.
Predicted sentences must have the same number of tokens as the references.
- **predictions** *(list)*: a list of lists of predicted labels, i.e. estimated targets as returned by a tagger.
- **references** *(list)*: a list of lists of ground-truth reference labels.
The optional arguments are:
- **mode** *(str)*: 'fair' or 'traditional'. Controls the desired output. 'traditional' is equivalent to seqeval's metrics. The default value is 'fair'.
- **error_format** *(str)*: 'count' or 'proportion'. Controls the desired output for FP, FN, BE, LE, etc. 'count' gives the absolute count per parameter. 'proportion' gives each error parameter's share of the total number of errors (TP is always reported as a count). The default value is 'count'.
- **zero_division** *(str)*: the value to substitute for a metric when a division by zero occurs. Should be one of [0,1,"warn"]. "warn" acts as 0, but a warning is also raised.
- **suffix** *(boolean)*: True if the IOB tag is a suffix (after type) instead of a prefix (before type), False otherwise. The default value is False, i.e. the IOB tag is a prefix (before type).
- **scheme** *(str)*: the target tagging scheme, which can be one of [IOB1, IOB2, IOE1, IOE2, IOBES, BILOU]. The default value is None.
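Because predicted sentences must have the same number of tokens as the references, it can help to check the alignment before calling `compute`. The sketch below is one way to do that; `check_token_alignment` is a hypothetical helper written for this README and is not part of FairEval or `evaluate`.

```python
from typing import List, Sequence


def check_token_alignment(predictions: Sequence[Sequence[str]],
                          references: Sequence[Sequence[str]]) -> List[int]:
    """Return the indices of sentences whose token counts differ.

    Hypothetical helper for illustration, not part of FairEval.
    """
    if len(predictions) != len(references):
        raise ValueError(
            f"Got {len(predictions)} predicted sentences "
            f"but {len(references)} reference sentences."
        )
    return [
        i for i, (pred, ref) in enumerate(zip(predictions, references))
        if len(pred) != len(ref)
    ]


pred = [['O', 'B-PER', 'I-PER'], ['B-MISC', 'O']]
ref = [['O', 'B-PER', 'I-PER'], ['B-MISC', 'O', 'O']]

# The second sentence has 2 predicted tokens but 3 reference tokens
mismatched = check_token_alignment(pred, ref)
print(mismatched)  # [1]
```

An empty list means every sentence pair is aligned and the inputs can be passed on to the metric.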
### Output Values
A dictionary with:
- Overall error parameter count (or ratio) and resulting scores.
- A nested dictionary per label with its respective error parameter count (or ratio) and resulting scores.
If mode is 'traditional', the error parameters shown are the classical TP, FP and FN. If mode is 'fair', TP remains the same,
FP and FN are counted according to the fair definition, and the additional error types BE, LE and LBE are shown.
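Since per-label results are nested dictionaries while overall values sit at the top level, the two can be separated by type. The following is a minimal sketch, assuming the output has exactly that shape; `split_results` is a hypothetical helper, and the sample values are copied from this README's traditional-mode example.

```python
from typing import Any, Dict, Tuple


def split_results(results: Dict[str, Any]) -> Tuple[Dict[str, dict], Dict[str, Any]]:
    """Separate per-label dictionaries from overall values.

    Hypothetical helper: assumes per-label entries are dicts and
    every other top-level entry is an overall count or score.
    """
    per_label = {k: v for k, v in results.items() if isinstance(v, dict)}
    overall = {k: v for k, v in results.items() if not isinstance(v, dict)}
    return per_label, overall


results = {
    'PER': {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'TP': 1, 'FP': 1, 'FN': 1},
    'INT': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'TP': 0, 'FP': 1, 'FN': 2},
    'OUT': {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'TP': 1, 'FP': 1, 'FN': 1},
    'overall_precision': 0.4,
    'overall_recall': 0.3333,
    'overall_f1': 0.3636,
    'TP': 2,
    'FP': 3,
    'FN': 4,
}

per_label, overall = split_results(results)
print(sorted(per_label))       # ['INT', 'OUT', 'PER']
print(overall['overall_f1'])   # 0.3636
```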
### Examples
Considering the following input annotated sentences:
```python
>>> r1 = ['O', 'O', 'B-PER', 'I-PER', 'O', 'B-PER']
>>> p1 = ['O', 'O', 'B-PER', 'I-PER', 'O', 'O' ] #1FN
>>>
>>> r2 = ['O', 'B-INT', 'B-OUT']
>>> p2 = ['B-INT', 'I-INT', 'B-OUT'] #1BE
>>>
>>> r3 = ['B-INT', 'I-INT', 'B-OUT']
>>> p3 = ['B-OUT', 'O', 'B-PER'] #1LBE, 1LE
>>>
>>> y_true = [r1, r2, r3]
>>> y_pred = [p1, p2, p3]
```
The output for different modes and error_formats is:
```python
>>> faireval.compute(predictions=y_pred, references=y_true, mode='traditional', error_format='count')
{'PER': {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'TP': 1, 'FP': 1, 'FN': 1},
'INT': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'TP': 0, 'FP': 1, 'FN': 2},
'OUT': {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'TP': 1, 'FP': 1, 'FN': 1},
'overall_precision': 0.4,
'overall_recall': 0.3333,
'overall_f1': 0.3636,
'TP': 2,
'FP': 3,
'FN': 4}
```
```python
>>> faireval.compute(predictions=y_pred, references=y_true, mode='traditional', error_format='proportion')
{'PER': {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'TP': 1, 'FP': 0.1428, 'FN': 0.1428},
'INT': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'TP': 0, 'FP': 0.1428, 'FN': 0.2857},
'OUT': {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'TP': 1, 'FP': 0.1428, 'FN': 0.1428},
'overall_precision': 0.4,
'overall_recall': 0.3333,
'overall_f1': 0.3636,
'TP': 2,
'FP': 0.4285,
'FN': 0.5714}
```
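The proportions above are consistent with dividing each error count by the total number of errors (here 3 FP + 4 FN = 7), while TP stays a count. The sketch below reproduces that arithmetic; `to_proportions` is a hypothetical helper, and returning 0.0 when there are no errors is an assumption, not confirmed behaviour of FairEval.

```python
from typing import Dict


def to_proportions(error_counts: Dict[str, int]) -> Dict[str, float]:
    """Convert error counts into shares of the total number of errors.

    Hypothetical helper; the zero-error case (all 0.0) is an assumption.
    """
    total = sum(error_counts.values())
    if total == 0:
        return {name: 0.0 for name in error_counts}
    return {name: count / total for name, count in error_counts.items()}


# Overall traditional error counts from the example: FP = 3, FN = 4
overall = to_proportions({'FP': 3, 'FN': 4})
print({name: round(value, 4) for name, value in overall.items()})
# {'FP': 0.4286, 'FN': 0.5714}
```

Rounding gives 0.4286 for FP, whereas the output above lists 0.4285, which suggests the displayed values are truncated rather than rounded.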
```python
>>> faireval.compute(predictions=y_pred, references=y_true, mode='fair', error_format='count')
{'PER': {'precision': 1.0, 'recall': 0.5, 'f1': 0.6666, 'TP': 1, 'FP': 0, 'FN': 1, 'LE': 0, 'BE': 0, 'LBE': 0},
'INT': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'TP': 0, 'FP': 0, 'FN': 0, 'LE': 0, 'BE': 1, 'LBE': 1},
'OUT': {'precision': 0.6666, 'recall': 0.6666, 'f1': 0.6666, 'TP': 1, 'FP': 0, 'FN': 0, 'LE': 1, 'BE': 0, 'LBE': 0},
'overall_precision': 0.5714,
'overall_recall': 0.4444444444444444,
'overall_f1': 0.5,
'TP': 2,
'FP': 0,
'FN': 1,
'LE': 1,
'BE': 1,
'LBE': 1}
```
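The scores in these examples can be reproduced from the error counts if each BE, LE and LBE is weighted as half an error in both the precision and the recall denominator. This weighting is inferred from the numbers shown above, not taken from the FairEval source, so treat the sketch below as an assumption; `fair_scores` and its `weight` parameter are hypothetical names. With no BE, LE or LBE, it reduces to the traditional formulas.

```python
from typing import Tuple


def fair_scores(tp: float, fp: float, fn: float,
                le: float = 0, be: float = 0, lbe: float = 0,
                weight: float = 0.5) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from error counts.

    Assumption: BE, LE and LBE each add `weight` to both denominators.
    Zero denominators yield 0.0 (like zero_division=0).
    """
    shared = weight * (le + be + lbe)
    p_den = tp + fp + shared
    r_den = tp + fn + shared
    precision = tp / p_den if p_den else 0.0
    recall = tp / r_den if r_den else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    return precision, recall, f1


# Overall fair counts from the example: TP=2, FP=0, FN=1, LE=1, BE=1, LBE=1
fair = fair_scores(tp=2, fp=0, fn=1, le=1, be=1, lbe=1)
print([round(x, 4) for x in fair])  # [0.5714, 0.4444, 0.5]

# Overall traditional counts from the example: TP=2, FP=3, FN=4
trad = fair_scores(tp=2, fp=3, fn=4)
print([round(x, 4) for x in trad])  # [0.4, 0.3333, 0.3636]
```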
#### Values from Popular Papers
*Examples, preferably with links to leaderboards or publications, of papers that have reported this metric, along with the values they have reported.*
*Under construction*
## Limitations and Bias
*Note any known limitations or biases that the metric has, with links and references if possible.*
*Under construction*
## Citation
Ortmann, Katrin. 2022. Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans. In *Proceedings of the Language Resources and Evaluation Conference (LREC)*, Marseille, France, pages 1400–1407. [PDF](https://aclanthology.org/2022.lrec-1.150.pdf)
```bibtex
@inproceedings{ortmann2022,
title = {Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans},
author = {Katrin Ortmann},
url = {https://aclanthology.org/2022.lrec-1.150},
year = {2022},
date = {2022-06-21},
booktitle = {Proceedings of the Language Resources and Evaluation Conference (LREC)},
pages = {1400-1407},
publisher = {European Language Resources Association},
address = {Marseille, France},
pubstate = {published},
type = {inproceedings}
}
```