---
title: FairEval
tags:
- evaluate
- metric
description: "TODO: add a description here" |
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
---

# Fair Evaluation for Sequence Labeling

## Metric Description
The traditional evaluation of NLP labeled spans with precision, recall, and F1-score leads to double penalties for
close-to-correct annotations: a predicted span whose boundary is off by a single token, for example, is counted as both a false positive and a false negative.
As [Manning (2006)](https://nlpers.blogspot.com/2006/08/doing-named-entity-recognition-dont.html)
argues in an article about named entity recognition, this can lead to undesirable effects when systems are optimized for these traditional metrics.
To address these issues, this metric provides an implementation of FairEval, as proposed by [Ortmann (2022)](https://aclanthology.org/2022.lrec-1.150.pdf).

## How to Use
FairEval outputs the error counts (TP, FP, etc.) and the resulting scores (precision, recall and F1) obtained by comparing a list of
predicted spans against a list of reference spans. The user can choose between traditional and fair error counts and scores by
switching the argument **mode**.

The user can also choose to report the error parameters (TP, FP, ...) as absolute counts, as a percentage of the total
number of errors, or as a percentage of the total number of ground truth entities through the argument **error_format**.

The minimal example is:

```python
import evaluate

faireval = evaluate.load("hpi-dhc/FairEval")
pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O']]
ref = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O']]
results = faireval.compute(predictions=pred, references=ref)
```
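
The same call accepts the optional arguments described under Inputs below. As a sketch (reusing `faireval`, `pred` and `ref` from the snippet above), traditional counting with each error type reported as a share of all errors could be requested like this:

```python
# Traditional (seqeval-style) evaluation, with FP, FN, etc. reported as
# ratios of the total number of errors instead of absolute counts.
results = faireval.compute(
    predictions=pred,
    references=ref,
    mode='traditional',
    error_format='error_ratio',
)
print(results['overall_f1'])
```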

### Inputs
FairEval handles input annotations in the same way as seqeval. The supported formats are IOB1, IOB2, IOE1, IOE2 and IOBES.
Predicted sentences must have the same number of tokens as the references.
- **predictions** *(list)*: a list of lists of predicted labels, i.e. estimated targets as returned by a tagger.
- **references** *(list)*: a list of lists of ground truth reference labels.

The optional arguments, illustrated in the sketch after this list, are:
- **mode** *(str)*: 'fair', 'traditional' or 'weighted'. Controls the desired output. The default value is 'fair'.
  - 'traditional': equivalent to seqeval's metrics / classic span-based evaluation.
  - 'fair': default fair score calculation.
  - 'weighted': custom score calculation with the weights passed.
- **weights** *(dict)*: dictionary with the weight of each error type for the custom score calculation (used when mode is 'weighted').
- **error_format** *(str)*: 'count', 'error_ratio' or 'entity_ratio'. Controls the desired output for TP, FP, BE, LE, etc. The default value is 'count'.
  - 'count': absolute count of each parameter.
  - 'error_ratio': percentage of the total number of errors that each parameter represents.
  - 'entity_ratio': percentage of the total number of ground truth entities that each parameter represents.
- **zero_division** *(str)*: the value to use for a metric when a zero division is encountered. Should be one of [0, 1, "warn"]; "warn" behaves like 0 but additionally raises a warning.
- **suffix** *(boolean)*: True if the IOB tag is a suffix (after the type) instead of a prefix (before the type), False otherwise. The default value is False, i.e. the IOB tag is a prefix.
- **scheme** *(str)*: the target tagging scheme, which can be one of [IOB1, IOB2, IOE1, IOE2, IOBES, BILOU]. The default value is None.
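
A minimal sketch of a call that sets these optional arguments explicitly; the argument names and values follow the descriptions above, and the toy input is only illustrative:

```python
import evaluate

faireval = evaluate.load("hpi-dhc/FairEval")

pred = [['O', 'B-PER', 'I-PER', 'O']]
ref = [['O', 'B-PER', 'I-PER', 'O']]

results = faireval.compute(
    predictions=pred,
    references=ref,
    mode='fair',           # 'traditional', 'fair' or 'weighted'
    error_format='count',  # 'count', 'error_ratio' or 'entity_ratio'
    zero_division='warn',  # 0, 1 or 'warn'
    suffix=False,          # labels are prefixed, e.g. 'B-PER'
    scheme='IOB2',         # tagging scheme of the input labels
)
```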

### Output Values
A dictionary with:
- the overall error parameter counts (or ratios) and the resulting scores, and
- a nested dictionary per label with its respective error parameter counts (or ratios) and resulting scores.

If mode is 'traditional', the error parameters shown are the classical TP, FP and FN. If mode is 'fair' or 'weighted',
TP remains the same, FP and FN are shown as per the fair definition, and the additional error types BE, LE and LBE are shown.
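
For instance, given the `results` dictionary returned by `compute` in the minimal example above (labels MISC and PER), individual values can be read as follows; the key names match the example outputs in the next section:

```python
# Overall scores and error parameters
print(results['overall_precision'], results['overall_recall'], results['overall_f1'])
print(results['TP'], results['FN'])

# Per-label results are nested under the label name
print(results['PER']['f1'])
print(results['MISC']['precision'])
```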

### Examples
A comprehensive set of side-by-side examples is shown [here](https://huggingface.co/spaces/hpi-dhc/FairEval/blob/main/HFFE_use_cases.pdf).

Consider the following annotated input sentences:
```python
>>> r1 = ['O', 'O', 'B-PER', 'I-PER', 'O', 'B-PER']
>>> p1 = ['O', 'O', 'B-PER', 'I-PER', 'O', 'O']  # 1 FN
>>>
>>> r2 = ['O', 'B-INT', 'B-OUT']
>>> p2 = ['B-INT', 'I-INT', 'B-OUT']  # 1 BE
>>>
>>> r3 = ['B-INT', 'I-INT', 'B-OUT']
>>> p3 = ['B-OUT', 'O', 'B-PER']  # 1 LBE, 1 LE
>>>
>>> y_true = [r1, r2, r3]
>>> y_pred = [p1, p2, p3]
```

The output for different modes and error_formats is:
```python
>>> faireval.compute(predictions=y_pred, references=y_true, mode='traditional', error_format='count')
{'PER': {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'TP': 1, 'FP': 1, 'FN': 1},
 'INT': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'TP': 0, 'FP': 1, 'FN': 2},
 'OUT': {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'TP': 1, 'FP': 1, 'FN': 1},
 'overall_precision': 0.4,
 'overall_recall': 0.3333,
 'overall_f1': 0.3636,
 'TP': 2,
 'FP': 3,
 'FN': 4}
```

```python
>>> faireval.compute(predictions=y_pred, references=y_true, mode='traditional', error_format='error_ratio')
{'PER': {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'TP': 1, 'FP': 0.1428, 'FN': 0.1428},
 'INT': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'TP': 0, 'FP': 0.1428, 'FN': 0.2857},
 'OUT': {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'TP': 1, 'FP': 0.1428, 'FN': 0.1428},
 'overall_precision': 0.4,
 'overall_recall': 0.3333,
 'overall_f1': 0.3636,
 'TP': 2,
 'FP': 0.4285,
 'FN': 0.5714}
```

```python
>>> faireval.compute(predictions=y_pred, references=y_true, mode='fair', error_format='count')
{'PER': {'precision': 1.0, 'recall': 0.5, 'f1': 0.6666, 'TP': 1, 'FP': 0, 'FN': 1, 'LE': 0, 'BE': 0, 'LBE': 0},
 'INT': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'TP': 0, 'FP': 0, 'FN': 0, 'LE': 0, 'BE': 1, 'LBE': 1},
 'OUT': {'precision': 0.6666, 'recall': 0.6666, 'f1': 0.6666, 'TP': 1, 'FP': 0, 'FN': 0, 'LE': 1, 'BE': 0, 'LBE': 0},
 'overall_precision': 0.5714,
 'overall_recall': 0.4444,
 'overall_f1': 0.5,
 'TP': 2,
 'FP': 0,
 'FN': 1,
 'LE': 1,
 'BE': 1,
 'LBE': 1}
```

#### Values from Popular Papers
*Examples, preferably with links to leaderboards or publications, of papers that have reported this metric, along with the values they have reported.*

*Under construction*

## Limitations and Bias
*Note any known limitations or biases that the metric has, with links and references if possible.*

*Under construction*

## Citation
Ortmann, Katrin. 2022. Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans. In *Proceedings of the Language Resources and Evaluation Conference (LREC)*, Marseille, France, pages 1400–1407. [PDF](https://aclanthology.org/2022.lrec-1.150.pdf)

```bibtex
@inproceedings{ortmann2022,
  title = {Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans},
  author = {Katrin Ortmann},
  url = {https://aclanthology.org/2022.lrec-1.150},
  year = {2022},
  date = {2022-06-21},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference (LREC)},
  pages = {1400--1407},
  publisher = {European Language Resources Association},
  address = {Marseille, France},
  pubstate = {published},
  type = {inproceedings}
}
```