title: FairEval
tags:
  - evaluate
  - metric
description: Fair Evaluation for Sequence labeling
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false

Fair Evaluation for Sequence Labeling

Metric Description

The traditional evaluation of NLP labeled spans with precision, recall, and F1-score leads to double penalties for close-to-correct annotations. As Manning (2006) argues in an article about named entity recognition, this can lead to undesirable effects when systems are optimized for these traditional metrics. To address these issues, this metric provides an implementation of FairEval, proposed by Ortmann (2022).

How to Use

FairEval outputs the error count (TP, FP, etc.) and resulting scores (Precision, Recall and F1) from a reference list of spans compared against a predicted one. The user can choose to see traditional or fair error counts and scores by switching the argument mode.

Through the argument error_format, the user can also choose how the error parameters (TP, FP, ...) are reported: as absolute counts, as a percentage of the total number of errors, or as a percentage of the total number of ground truth entities.

The minimal example is:

import evaluate

# Load the metric from the Hugging Face Hub
faireval = evaluate.load("hpi-dhc/FairEval")
pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O']]
ref =  [['O', 'O', 'O',      'B-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O']]
results = faireval.compute(predictions=pred, references=ref)

Inputs

FairEval handles input annotations in the same way as seqeval. The supported formats are IOB1, IOB2, IOE1, IOE2 and IOBES. Predicted sentences must have the same number of tokens as the references.

  • predictions (list): a list of lists of predicted labels, i.e. estimated targets as returned by a tagger.
  • references (list): list of ground truth reference labels.
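Since predictions and references must align sentence by sentence and token by token, it can help to validate the inputs before calling compute. The helper below is a hypothetical sketch (check_lengths is not part of FairEval or seqeval):

```python
def check_lengths(predictions, references):
    """Raise ValueError if the two inputs are not aligned token by token."""
    if len(predictions) != len(references):
        raise ValueError(
            f"{len(predictions)} predicted sentences vs "
            f"{len(references)} reference sentences"
        )
    for i, (pred, ref) in enumerate(zip(predictions, references)):
        if len(pred) != len(ref):
            raise ValueError(
                f"sentence {i}: {len(pred)} predicted tokens vs "
                f"{len(ref)} reference tokens"
            )


pred = [['O', 'B-PER', 'I-PER']]
ref = [['O', 'B-PER', 'O']]
check_lengths(pred, ref)  # one sentence each, three tokens each: no error
```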

The optional arguments are:

  • mode (str): 'fair', 'traditional' or 'weighted'. Controls the desired output. The default value is 'fair'.
    • 'traditional': equivalent to seqeval's 'strict' mode. Bear in mind that the default mode for seqeval is 'relaxed', which does not match any of the FairEval modes.
    • 'fair': default fair score calculation. Fair will also show traditional scores for comparison.
    • 'weighted': custom score calculation with the weights passed. Weighted will also show traditional scores for comparison.
  • weights (dict): dictionary with the weight of each error for the custom score calculation.
  • error_format (str): 'count', 'error_ratio' or 'entity_ratio'. Controls the desired output for TP, FP, BE, LE, etc. Default value is 'count'.
    • 'count': absolute count of each parameter.
    • 'error_ratio': percentage of the total number of errors that each parameter represents.
    • 'entity_ratio': percentage of the total number of ground truth entities that each parameter represents.
  • zero_division (str): which value to substitute as a metric value when encountering zero division. Should be one of [0,1,"warn"]. "warn" acts as 0, but a warning is also raised.
  • suffix (boolean): True if the IOB tag is a suffix (after type) instead of a prefix (before type), False otherwise. The default value is False, i.e. the IOB tag is a prefix (before type).
  • scheme (str): the target tagging scheme, which can be one of [IOB1, IOB2, IOE1, IOE2, IOBES, BILOU]. The default value is None.
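As a rough illustration of the three error_format options, the sketch below converts absolute counts into the two ratio formats. The function to_ratios and its n_ref_entities argument are hypothetical and not part of FairEval. Leaving TP as an absolute count is an assumption taken from the 'error_ratio' output in the Examples section; the library's own handling, in particular for 'entity_ratio', may differ.

```python
def to_ratios(counts, error_format="count", n_ref_entities=None):
    """Rescale error counts. Hypothetical helper, not the FairEval API."""
    if error_format == "count":
        return dict(counts)
    if error_format == "error_ratio":
        # Denominator: total number of errors (every parameter except TP)
        denom = sum(v for k, v in counts.items() if k != "TP")
    elif error_format == "entity_ratio":
        # Denominator: total number of ground truth entities
        denom = n_ref_entities
    else:
        raise ValueError(f"unknown error_format: {error_format!r}")
    scaled = {}
    for name, value in counts.items():
        if name == "TP":
            scaled[name] = value  # assumption: TP stays an absolute count
        else:
            scaled[name] = value / denom if denom else 0.0
    return scaled


# Overall traditional counts from the Examples section
counts = {"TP": 2, "FP": 3, "FN": 4}
by_error = to_ratios(counts, "error_ratio")
# The three reference sentences in the Examples section contain 6 entities
by_entity = to_ratios(counts, "entity_ratio", n_ref_entities=6)
print(by_error)
print(by_entity)
```

With these counts, FP and FN become 3/7 and 4/7 of all errors, which agrees with the 0.4285 and 0.5714 reported in the Examples section up to rounding.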

Output Values

A dictionary with:

  • Overall error parameter count (or ratio) and resulting scores.
  • A nested dictionary per label with its respective error parameter count (or ratio) and resulting scores.
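Because per-label results are nested dictionaries while overall results sit at the top level, the two can be separated by checking the value type. The sketch below does this on a literal copied from the traditional-mode output in the Examples section; the variable names are illustrative only.

```python
# Literal copied from the traditional-mode example output in this README
results = {
    "PER": {"precision": 0.5, "recall": 0.5, "f1": 0.5,
            "TP": 1, "FP": 1.0, "FN": 1.0},
    "INT": {"precision": 0.0, "recall": 0.0, "f1": 0.0,
            "TP": 0, "FP": 1.0, "FN": 2.0},
    "OUT": {"precision": 0.5, "recall": 0.5, "f1": 0.5,
            "TP": 1, "FP": 1.0, "FN": 1.0},
    "overall_precision": 0.4, "overall_recall": 0.3333, "overall_f1": 0.3636,
    "TP": 2, "FP": 3.0, "FN": 4.0,
}

# Nested dictionaries are per-label results; everything else is overall
per_label = {k: v for k, v in results.items() if isinstance(v, dict)}
overall = {k: v for k, v in results.items() if not isinstance(v, dict)}

for label, scores in sorted(per_label.items()):
    print(f"{label}: F1={scores['f1']:.2f} "
          f"(TP={scores['TP']}, FP={scores['FP']}, FN={scores['FN']})")
print(f"overall F1={overall['overall_f1']:.2f}")
```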

If mode is 'traditional', the error parameters shown are the classical TP, FP and FN. If mode is 'fair' or 'weighted', TP remains the same, FP and FN are shown as per the fair definition, and the additional errors BE, LE and LBE are shown.
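To make the relation between the fair error parameters and the scores concrete, the sketch below recomputes the overall fair scores from the counts given in the Examples section. It assumes that each LE, BE and LBE counts as half a false positive and half a false negative. This weighting is inferred from the example numbers in this README, not taken from the library's code, and fair_scores is a hypothetical helper.

```python
def fair_scores(tp, fp, fn, le, be, lbe):
    """Precision, recall and F1 under the assumed 0.5/0.5 error weighting."""
    shared = 0.5 * (le + be + lbe)  # assumption: half FP, half FN per error
    prec_den = tp + fp + shared
    rec_den = tp + fn + shared
    precision = tp / prec_den if prec_den else 0.0
    recall = tp / rec_den if rec_den else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Overall fair counts from the Examples section
p, r, f = fair_scores(tp=2, fp=0, fn=1, le=1, be=1, lbe=1)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.5714 0.4444 0.5
```

The same function gives 2/3 for the precision, recall and F1 of the OUT label (TP=1, LE=1), which the example output reports as 0.6666.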

Examples

Considering the following input annotated sentences:

>>> r1 = ['O', 'O', 'B-PER', 'I-PER', 'O', 'B-PER']
>>> p1 = ['O', 'O', 'B-PER', 'I-PER', 'O', 'O'    ] #1FN
>>> 
>>> r2 = ['O',     'B-INT', 'B-OUT']
>>> p2 = ['B-INT', 'I-INT', 'B-OUT'] #1BE  
>>> 
>>> r3 = ['B-INT', 'I-INT', 'B-OUT']
>>> p3 = ['B-OUT', 'O',     'B-PER'] #1LBE, 1LE   
>>> 
>>> y_true = [r1, r2, r3]
>>> y_pred = [p1, p2, p3]

The output for different modes and error_format values is:

>>> faireval.compute(predictions=y_pred, references=y_true, mode='fair', error_format='count')
{"PER": {"precision": 1.0, "recall": 0.5, "f1": 0.6666,
         "trad_prec": 0.5, "trad_rec": 0.5, "trad_f1": 0.5,
         "TP": 1, "FP": 0.0, "FN": 1.0, "LE": 0.0, "BE": 0.0, "LBE": 0.0},
 "INT": {"precision": 0.0, "recall": 0.0, "f1": 0.0,
         "trad_prec": 0.0, "trad_rec": 0.0, "trad_f1": 0.0,
         "TP": 0, "FP": 0.0, "FN": 0.0, "LE": 0.0, "BE": 1.0, "LBE": 1.0},
 "OUT": {"precision": 0.6666, "recall": 0.6666, "f1": 0.6666,
         "trad_prec": 0.5, "trad_rec": 0.5, "trad_f1": 0.5,
         "TP": 1, "FP": 0.0, "FN": 0.0, "LE": 1.0, "BE": 0.0, "LBE": 0.0},
 "overall_precision": 0.5714, "overall_recall": 0.4444, "overall_f1": 0.5,
 "overall_trad_prec": 0.4, "overall_trad_rec": 0.3333, "overall_trad_f1": 0.3636,
 "TP": 2, "FP": 0.0, "FN": 1.0, "LE": 1.0, "BE": 1.0, "LBE": 1.0}
>>> faireval.compute(predictions=y_pred, references=y_true, mode='traditional', error_format='count')
{"PER": {"precision": 0.5, "recall": 0.5, "f1": 0.5,
         "TP": 1, "FP": 1.0, "FN": 1.0},
 "INT": {"precision": 0.0, "recall": 0.0, "f1": 0.0,
         "TP": 0, "FP": 1.0, "FN": 2.0},
 "OUT": {"precision": 0.5, "recall": 0.5, "f1": 0.5,
         "TP": 1, "FP": 1.0, "FN": 1.0},
 "overall_precision": 0.4, "overall_recall": 0.3333, "overall_f1": 0.3636,
 "TP": 2, "FP": 3.0, "FN": 4.0}
>>> faireval.compute(predictions=y_pred, references=y_true, mode='traditional', error_format='error_ratio')
{"PER": {"precision": 0.5, "recall": 0.5, "f1": 0.5,
         "TP": 1, "FP": 0.1428, "FN": 0.1428},
 "INT": {"precision": 0.0, "recall": 0.0, "f1": 0.0,
         "TP": 0, "FP": 0.1428, "FN": 0.2857},
 "OUT": {"precision": 0.5, "recall": 0.5, "f1": 0.5,
         "TP": 1, "FP": 0.1428, "FN": 0.1428},
 "overall_precision": 0.4, "overall_recall": 0.3333, "overall_f1": 0.3636,
 "TP": 2, "FP": 0.4285, "FN": 0.5714}

Values from Popular Papers

CoNLL2003

Computing the evaluation metrics on the results of this model on the test split of the CoNLL2003 dataset, we obtain the following F1 scores:

| F1 scores | overall | location | miscellaneous | organization | person |
|---|---|---|---|---|---|
| fair | 0.94 | 0.96 | 0.85 | 0.92 | 0.97 |
| traditional | 0.90 | 0.92 | 0.79 | 0.87 | 0.96 |
| seqeval strict | 0.90 | 0.92 | 0.79 | 0.87 | 0.96 |
| seqeval relaxed | 0.90 | 0.92 | 0.78 | 0.87 | 0.96 |

With error count (traditional on the left and fair on the right):

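| | overall trad. | overall fair | location trad. | location fair | miscellaneous trad. | miscellaneous fair | organization trad. | organization fair | person trad. | person fair |
|---|---|---|---|---|---|---|---|---|---|---|
| TP | 5104 | 5104 | 1545 | 1545 | 561 | 561 | 1452 | 1452 | 1546 | 1546 |
| FP | 534 | 126 | 128 | 20 | 154 | 48 | 208 | 47 | 44 | 11 |
| FN | 544 | 124 | 123 | 13 | 141 | 47 | 209 | 47 | 71 | 17 |
| LE | | 219 | | 62 | | 41 | | 73 | | 43 |
| BE | | 126 | | 16 | | 46 | | 53 | | 11 |
| LBE | | 87 | | 32 | | 13 | | 41 | | 1 |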

WNUT-17

Computing the evaluation metrics on the results of this model on the test split of the WNUT-17 dataset, we obtain the following F1 scores:

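| F1 scores | overall | location | group | person | creative work | corporation | product |
|---|---|---|---|---|---|---|---|
| fair | 0.37 | 0.58 | 0.02 | 0.58 | 0.0 | 0.03 | 0.0 |
| traditional | 0.35 | 0.53 | 0.02 | 0.55 | 0.0 | 0.02 | 0.0 |
| seqeval strict | 0.35 | 0.53 | 0.02 | 0.55 | 0.0 | 0.02 | 0.0 |
| seqeval relaxed | 0.34 | 0.49 | 0.02 | 0.55 | 0.0 | 0.02 | 0.0 |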

With error count:

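| | overall trad. | overall fair | location trad. | location fair | group trad. | group fair | person trad. | person fair | creative work trad. | creative work fair | corporation trad. | corporation fair | product trad. | product fair |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TP | 255 | 255 | 67 | 67 | 2 | 2 | 185 | 185 | 0 | 0 | 1 | 1 | 0 | 0 |
| FP | 135 | 31 | 38 | 10 | 20 | 3 | 60 | 16 | 0 | 0 | 17 | 2 | 0 | 0 |
| FN | 824 | 725 | 83 | 71 | 163 | 135 | 244 | 233 | 142 | 120 | 65 | 54 | 127 | 112 |
| LE | | 47 | | 4 | | 18 | | 2 | | 6 | | 7 | | 10 |
| BE | | 30 | | 10 | | 4 | | 13 | | 0 | | 3 | | 0 |
| LBE | | 29 | | 1 | | 6 | | 0 | | 16 | | 1 | | 5 |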

Limitations and Bias

The metric is restricted to the input schemes admitted by seqeval. For example, the application does not support numerical label inputs (odd for Beginning, even for Inside and zero for Outside).
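One possible workaround for numerical labels is to convert them to IOB2 strings before calling the metric. The sketch below assumes a specific numbering that the text above does not spell out: ids 2k+1 and 2k+2 are the B- and I- tags of the k-th entity type in a user-supplied list. Both numeric_to_iob2 and the type list are hypothetical.

```python
def numeric_to_iob2(ids, types):
    """Map 0 to 'O', odd ids to 'B-<type>' and even ids to 'I-<type>'.

    Assumption: ids 2k+1 and 2k+2 both belong to types[k].
    """
    tags = []
    for n in ids:
        if n == 0:
            tags.append('O')
            continue
        kind = types[(n - 1) // 2]
        prefix = 'B' if n % 2 == 1 else 'I'
        tags.append(f"{prefix}-{kind}")
    return tags


tags = numeric_to_iob2([0, 1, 2, 0, 3], ['PER', 'LOC'])
print(tags)  # ['O', 'B-PER', 'I-PER', 'O', 'B-LOC']
```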

The choice of custom weights for weighted evaluation is a subjective decision of the user. Neither weighted nor fair scores can be compared with the traditional span-based scores reported for other dataset-model pairs.

Citation

Ortmann, Katrin. 2022. Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans. In Proceedings of the Language Resources and Evaluation Conference (LREC), Marseille, France, pages 1400–1407.

@inproceedings{ortmann2022,
    title = {Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans},
    author = {Katrin Ortmann},
    url = {https://aclanthology.org/2022.lrec-1.150},
    year = {2022},
    date = {2022-06-21},
    booktitle = {Proceedings of the Language Resources and Evaluation Conference (LREC)},
    pages = {1400-1407},
    publisher = {European Language Resources Association},
    address = {Marseille, France},
    pubstate = {published},
    type = {inproceedings}
}