|
--- |
|
title: FairEval |
|
tags: |
|
- evaluate |
|
- metric |
|
description: "Fair Evaluation for Sequence Labeling"
|
sdk: gradio |
|
sdk_version: 3.0.2 |
|
app_file: app.py |
|
pinned: false |
|
--- |
|
|
|
# Fair Evaluation for Sequence Labeling |
|
|
|
## Metric Description |
|
The traditional evaluation of labeled spans in NLP with precision, recall, and F1-score leads to double penalties for
|
close-to-correct annotations. As [Manning (2006)](https://nlpers.blogspot.com/2006/08/doing-named-entity-recognition-dont.html) |
|
argues in an article about named entity recognition, this can lead to undesirable effects when systems are optimized for these traditional metrics. |
|
To address these issues, this metric provides an implementation of FairEval, proposed by [Ortmann (2022)](https://aclanthology.org/2022.lrec-1.150.pdf). |
|
|
|
## How to Use |
|
FairEval outputs the error counts (TP, FP, etc.) and the resulting scores (precision, recall and F1) obtained by comparing a list of predicted spans against a list of reference spans. The user can choose between traditional and fair error counts and scores by switching the argument **mode**.
|
|
|
The user can also choose to see the error parameters (TP, FP, ...) as absolute counts, as a percentage of the total number of errors, or as a percentage of the total number of ground truth entities, through the argument **error_format**.
|
|
|
A minimal example:
|
|
|
```python |
|
import evaluate

faireval = evaluate.load("hpi-dhc/FairEval")
|
pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O']] |
|
ref = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O']] |
|
results = faireval.compute(predictions=pred, references=ref) |
|
``` |
|
|
|
### Inputs |
|
FairEval handles input annotations in the same way as seqeval. The supported formats are IOB1, IOB2, IOE1, IOE2 and IOBES.
|
Predicted sentences must have the same number of tokens as the references. |
|
- **predictions** *(list)*: a list of lists of predicted labels, i.e. estimated targets as returned by a tagger. |
|
- **references** *(list)*: a list of lists of ground truth reference labels.
|
|
|
The optional arguments are listed below (a usage sketch follows the list):
|
- **mode** *(str)*: 'fair', 'traditional' or 'weighted'. Controls the desired output. The default value is 'fair'.
|
- 'traditional': equivalent to seqeval's 'strict' mode. Bear in mind that seqeval's default mode is 'relaxed', which does not match any of FairEval's modes.
|
- 'fair': default fair score calculation. Fair will also show traditional scores for comparison. |
|
- 'weighted': custom score calculation with the weights passed. Weighted will also show traditional scores for comparison. |
|
- **weights** *(dict)*: dictionary with the weight of each error type for the custom score calculation (used when **mode** is 'weighted').
|
- **error_format** *(str)*: 'count', 'error_ratio' or 'entity_ratio'. Controls the desired output for TP, FP, BE, LE, etc. Default value is 'count'. |
|
- 'count': absolute count of each parameter. |
|
- 'error_ratio': percentage of the total number of errors that each parameter represents.

- 'entity_ratio': percentage of the total number of ground truth entities that each parameter represents.
|
- **zero_division** *(str)*: the value to substitute as a metric value when encountering zero division. Should be one of [0, 1, "warn"]. "warn" acts as 0, but a warning is also raised.
|
- **suffix** *(boolean)*: True if the IOB tag is a suffix (after type) instead of a prefix (before type), False otherwise. The default value is False, i.e. the IOB tag is a prefix (before type). |
|
- **scheme** *(str)*: the target tagging scheme, which can be one of [IOB1, IOB2, IOE1, IOE2, IOBES, BILOU]. The default value is None. |
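A sketch combining several of these arguments, reusing `pred` and `ref` from the minimal example above (the chosen values are for illustration only):

```python
results = faireval.compute(
    predictions=pred,
    references=ref,
    mode="traditional",           # seqeval-style strict scores instead of fair ones
    error_format="entity_ratio",  # report FP, FN, ... relative to the number of gold entities
    zero_division=0,              # substitute 0 (without a warning) on zero division
    suffix=False,                 # tags are prefixes, e.g. 'B-PER'
    scheme="IOB2",                # tagging scheme of the example sentences
)
```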
|
|
|
### Output Values |
|
A dictionary with: |
|
- Overall error parameter count (or ratio) and resulting scores. |
|
- A nested dictionary per label with its respective error parameter count (or ratio) and resulting scores |
|
|
|
If mode is 'traditional', the error parameters shown are the classical TP, FP and FN. If mode is 'fair' or 'weighted', TP remains the same, FP and FN follow the fair definition, and the additional error types BE, LE and LBE are shown.
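The returned value is a plain Python dictionary. As a small sketch, using `pred` and `ref` from the minimal example above (key names as in the example outputs below; per-label keys such as 'PER' depend on the labels present in the input):

```python
results = faireval.compute(predictions=pred, references=ref)  # mode='fair' by default
results["overall_f1"]        # overall fair F1
results["overall_trad_f1"]   # traditional F1, reported alongside the fair scores
results["PER"]["precision"]  # fair precision for the PER label
results["PER"]["FN"]         # error parameter, as a count or ratio depending on error_format
```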
|
|
|
### Examples |
|
Considering the following input annotated sentences: |
|
```python |
|
>>> r1 = ['O', 'O', 'B-PER', 'I-PER', 'O', 'B-PER'] |
|
>>> p1 = ['O', 'O', 'B-PER', 'I-PER', 'O', 'O']  # 1 FN: the second PER entity is missed
|
>>> |
|
>>> r2 = ['O', 'B-INT', 'B-OUT'] |
|
>>> p2 = ['B-INT', 'I-INT', 'B-OUT']  # 1 BE: INT has the right label but wrong boundaries
|
>>> |
|
>>> r3 = ['B-INT', 'I-INT', 'B-OUT'] |
|
>>> p3 = ['B-OUT', 'O', 'B-PER']  # 1 LBE (gold INT span), 1 LE (gold OUT predicted as PER)
|
>>> |
|
>>> y_true = [r1, r2, r3] |
|
>>> y_pred = [p1, p2, p3] |
|
``` |
|
|
|
The output for different modes and error_formats is: |
|
```python |
|
>>> faireval.compute(predictions=y_pred, references=y_true, mode='fair', error_format='count') |
|
{"PER": {"precision": 1.0, "recall": 0.5, "f1": 0.6666, |
|
"trad_prec": 0.5, "trad_rec": 0.5, "trad_f1": 0.5, |
|
"TP": 1, "FP": 0.0, "FN": 1.0, "LE": 0.0, "BE": 0.0, "LBE": 0.0}, |
|
"INT": {"precision": 0.0, "recall": 0.0, "f1": 0.0, |
|
"trad_prec": 0.0, "trad_rec": 0.0, "trad_f1": 0.0, |
|
"TP": 0, "FP": 0.0, "FN": 0.0, "LE": 0.0, "BE": 1.0, "LBE": 1.0}, |
|
"OUT": {"precision": 0.6666, "recall": 0.6666, "f1": 0.666, |
|
"trad_prec": 0.5, "trad_rec": 0.5, "trad_f1": 0.5, |
|
"TP": 1, "FP": 0.0, "FN": 0.0, "LE": 1.0, "BE": 0.0, "LBE": 0.0}, |
|
"overall_precision": 0.5714, "overall_recall": 0.4444, "overall_f1": 0.5, |
|
"overall_trad_prec": 0.4, "overall_trad_rec": 0.3333, "overall_trad_f1": 0.3636, |
|
"TP": 2, "FP": 0.0, "FN": 1.0, "LE": 1.0, "BE": 1.0, "LBE": 1.0} |
|
``` |
|
|
|
```python |
|
>>> faireval.compute(predictions=y_pred, references=y_true, mode='traditional', error_format='count') |
|
{"PER": {"precision": 0.5, "recall": 0.5, "f1": 0.5, |
|
"TP": 1, "FP": 1.0, "FN": 1.0}, |
|
"INT": {"precision": 0.0, "recall": 0.0, "f1": 0.0, |
|
"TP": 0, "FP": 1.0, "FN": 2.0}, |
|
"OUT": {"precision": 0.5, "recall": 0.5, "f1": 0.5, |
|
"TP": 1, "FP": 1.0, "FN": 1.0}, |
|
"overall_precision": 0.4, "overall_recall": 0.3333, "overall_f1": 0.3636, |
|
"TP": 2, "FP": 3.0, "FN": 4.0} |
|
``` |
|
|
|
```python |
|
>>> faireval.compute(predictions=y_pred, references=y_true, mode='traditional', error_format='error_ratio') |
|
{"PER": {"precision": 0.5, "recall": 0.5, "f1": 0.5, |
|
"TP": 1, "FP": 0.1428, "FN": 0.1428}, |
|
"INT": {"precision": 0.0, "recall": 0.0, "f1": 0.0, |
|
"TP": 0, "FP": 0.1428, "FN": 0.2857}, |
|
"OUT": {"precision": 0.5, "recall": 0.5, "f1": 0.5, |
|
"TP": 1, "FP": 0.1428, "FN": 0.1428}, |
|
"overall_precision": 0.4, "overall_recall": 0.3333, "overall_f1": 0.3636, |
|
"TP": 2, "FP": 0.4285, "FN": 0.5714} |
|
``` |
|
|
|
### Values from Popular Papers |
|
|
|
#### CoNLL2003 |
|
Computing the evaluation metrics on the results from [this model](https://huggingface.co/elastic/distilbert-base-uncased-finetuned-conll03-english) |
|
run on the test split of [CoNLL2003 dataset](https://huggingface.co/datasets/conll2003), we obtain the following F1-Scores: |
|
|
|
| F1 Scores | overall | location | miscellaneous | organization | person | |
|
|-----------------|--------:|---------:|-------------:|-------------:|-------:| |
|
| fair | 0.94 | 0.96 | 0.85 | 0.92 | 0.97 | |
|
| traditional | 0.90 | 0.92 | 0.79 | 0.87 | 0.96 | |
|
| seqeval strict | 0.90 | 0.92 | 0.79 | 0.87 | 0.96 | |
|
| seqeval relaxed | 0.90 | 0.92 | 0.78 | 0.87 | 0.96 | |
|
|
|
With error count: |
|
|
|
| | overall (trad) | overall (fair) | location (trad)| location (fair) | miscellaneous (trad)| miscellaneous (fair) | organization (trad)| organization (fair) | person (trad)| person (fair) | |
|
|-----|--------:|-----:|---------:|-----:|-------------:|----:|-------------:|-----:|-------:|-----:| |
|
| TP | 5104 | 5104 | 1545 | 1545 | 561 | 561 | 1452 | 1452 | 1546 | 1546 | |
|
| FP | 534 | 126 | 128 | 20 | 154 | 48 | 208 | 47 | 44 | 11 | |
|
| FN | 544 | 124 | 123 | 13 | 141 | 47 | 209 | 47 | 71 | 17 | |
|
| LE | | 219 | | 62 | | 41 | | 73 | | 43 | |
|
| BE | | 126 | | 16 | | 46 | | 53 | | 11 | |
|
| LBE | | 87 | | 32 | | 13 | | 41 | | 1 | |
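For reference, the following is a rough sketch of how word-level predictions for such a comparison could be produced with `transformers`; it is not the exact evaluation script behind the tables above, and it assumes a fast tokenizer, that the first-sub-token prediction represents each word, and that the model's `id2label` names follow the same IOB2 scheme as the dataset:

```python
import torch
import evaluate
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "elastic/distilbert-base-uncased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name).eval()
faireval = evaluate.load("hpi-dhc/FairEval")

test = load_dataset("conll2003", split="test")
label_names = test.features["ner_tags"].feature.names

predictions, references = [], []
for example in test:
    enc = tokenizer(example["tokens"], is_split_into_words=True,
                    truncation=True, return_tensors="pt")
    with torch.no_grad():
        pred_ids = model(**enc).logits[0].argmax(-1).tolist()
    # keep the prediction of the first sub-token of each word
    word_tags, seen = [], set()
    for position, word_id in enumerate(enc.word_ids()):
        if word_id is not None and word_id not in seen:
            seen.add(word_id)
            word_tags.append(model.config.id2label[pred_ids[position]])
    predictions.append(word_tags)
    # truncate the reference in case the input sentence was truncated
    references.append([label_names[t] for t in example["ner_tags"]][:len(word_tags)])

results = faireval.compute(predictions=predictions, references=references, mode="fair")
```

The same sketch applies to the WNUT-17 comparison below by swapping in the muhtasham/bert-small-finetuned-wnut17-ner model and the wnut_17 dataset.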
|
|
|
#### WNUT-17 |
|
Computing the evaluation metrics on the results from [this model](https://huggingface.co/muhtasham/bert-small-finetuned-wnut17-ner) |
|
run on the test split of [WNUT-17 dataset](https://huggingface.co/datasets/wnut_17), we obtain the following F1-Scores: |
|
|
|
| F1 Scores | overall | location | group | person | creative work | corporation | product |
|
|-----------------|--------:|---------:|-------:|-------:|--------------:|------------:|--------:| |
|
| fair | 0.37 | 0.58 | 0.02 | 0.58 | 0.0 | 0.03 | 0.0 | |
|
| traditional | 0.35 | 0.53 | 0.02 | 0.55 | 0.0 | 0.02 | 0.0 | |
|
| seqeval strict | 0.35 | 0.53 | 0.02 | 0.55 | 0.0 | 0.02 | 0.0 | |
|
| seqeval relaxed | 0.34 | 0.49 | 0.02 | 0.55 | 0.0 | 0.02 | 0.0 | |
|
|
|
With error count: |
|
|
|
| | overall (trad)| overall (fair) | location (trad)| location (fair) | group (trad)| group (fair) | person (trad)| person (fair) | creative work (trad)| creative work (fair) | corporation (trad)| corporation (fair) | product (trad)| product (fair) | |
|
|-----|--------:|----:|---------:|---:|------:|----:|-------:|----:|--------------:|----:|------------:|---:|--------:|----:| |
|
| TP | 255 | 255 | 67 | 67 | 2 | 2 | 185 | 185 | 0 | 0 | 1 | 1 | 0 | 0 | |
|
| FP | 135 | 31 | 38 | 10 | 20 | 3 | 60 | 16 | 0 | 0 | 17 | 2 | 0 | 0 | |
|
| FN | 824 | 725 | 83 | 71 | 163 | 135 | 244 | 233 | 142 | 120 | 65 | 54 | 127 | 112 | |
|
| LE | | 47 | | 4 | | 18 | | 2 | | 6 | | 7 | | 10 | |
|
| BE | | 30 | | 10 | | 4 | | 13 | | 0 | | 3 | | 0 | |
|
| LBE | | 29 | | 1 | | 6 | | 0 | | 16 | | 1 | | 5 | |
|
|
|
## Limitations and Bias |
|
The metric is restricted to the input schemes accepted by seqeval. For example, it does not support numerical label inputs (odd ids for Beginning, even ids for Inside and zero for Outside).
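As a workaround, integer tags that come from a `datasets` ClassLabel feature can be converted to their string names before calling the metric. A small sketch for the CoNLL-2003 test split mentioned above (assuming the conll2003 loading script is available to the installed `datasets` version):

```python
from datasets import load_dataset

conll_test = load_dataset("conll2003", split="test")
label_feature = conll_test.features["ner_tags"].feature  # a ClassLabel
# map integer tag ids to their string labels (e.g. 'B-ORG') sentence by sentence
references = [label_feature.int2str(tags) for tags in conll_test["ner_tags"]]
```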
|
|
|
The choice of custom weights for the weighted evaluation is up to the user and therefore subjective. Neither weighted nor fair scores are directly comparable to traditional span-based scores reported for other combinations of datasets and models.
|
|
|
## Citation |
|
Ortmann, Katrin. 2022. Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans. In *Proceedings of the Language Resources and Evaluation Conference (LREC)*, Marseille, France, pages 1400–1407. [PDF](https://aclanthology.org/2022.lrec-1.150.pdf) |
|
|
|
```bibtex |
|
@inproceedings{ortmann2022, |
|
title = {Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans}, |
|
author = {Katrin Ortmann}, |
|
url = {https://aclanthology.org/2022.lrec-1.150}, |
|
year = {2022}, |
|
date = {2022-06-21}, |
|
booktitle = {Proceedings of the Language Resources and Evaluation Conference (LREC)}, |
|
pages = {1400-1407}, |
|
publisher = {European Language Resources Association}, |
|
address = {Marseille, France}, |
|
pubstate = {published}, |
|
type = {inproceedings} |
|
} |
|
``` |