|
--- |
|
title: FairEvaluation |
|
tags: |
|
- evaluate |
|
- metric |
|
description: "FairEval: fair evaluation of labeled spans (e.g. named entities) that counts each error only once, following Ortmann (2022)."
|
sdk: gradio |
|
sdk_version: 3.0.2 |
|
app_file: app.py |
|
pinned: false |
|
--- |
|
|
|
# Metric: Fair Evaluation |
|
|
|
## Metric Description |
|
The traditional evaluation of labeled spans in NLP with precision, recall, and F1-score leads to double penalties for
|
close-to-correct annotations. As Manning (2006) argues in an article about named entity recognition, this can lead to |
|
undesirable effects when systems are optimized for these traditional metrics. |
|
|
|
Building on his ideas, Katrin Ortmann (2022) develops FairEval: a new evaluation method that more accurately reflects |
|
true annotation quality by ensuring that every error is counted only once. In addition to the traditional categories of |
|
true positives (TP), false positives (FP), and false negatives (FN), the new method takes into account the more |
|
fine-grained error types suggested by Manning: labeling errors (LE), boundary errors (BE), and labeling-boundary |
|
errors (LBE). Additionally, the method distinguishes three types of boundary errors:
|
- BES: the system's annotation is smaller than the target span |
|
- BEL: the system's annotation is larger than the target span |
|
- BEO: the system's annotation overlaps with the target span without one containing the other
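To make these categories concrete, here is a minimal illustrative sketch that classifies a single predicted span against a single reference span according to the definitions above. It is not the paper's full matching procedure and not part of this metric's API; the helper name `classify_span_pair` and the `(label, start, end)` span representation (with an exclusive end index) are assumptions made for the illustration.

```python
def classify_span_pair(pred, ref):
    """Classify one (prediction, reference) span pair into the error types above.

    Spans are (label, start, end) tuples with an exclusive end index.
    Illustrative only; see the paper for the full matching procedure.
    """
    p_label, p_start, p_end = pred
    r_label, r_start, r_end = ref
    same_label = p_label == r_label
    same_bounds = (p_start, p_end) == (r_start, r_end)
    overlap = p_start < r_end and r_start < p_end

    if same_bounds and same_label:
        return "TP"   # exact match
    if same_bounds and not same_label:
        return "LE"   # correct boundaries, wrong label
    if overlap and same_label:
        if r_start <= p_start and p_end <= r_end:
            return "BES"  # prediction is smaller than the target span
        if p_start <= r_start and r_end <= p_end:
            return "BEL"  # prediction is larger than the target span
        return "BEO"      # spans overlap without one containing the other
    if overlap and not same_label:
        return "LBE"  # both the label and the boundaries are wrong
    return "FP"       # no overlap at all: a spurious prediction
                      # (an unmatched reference span would count as an FN)


# The MISC span from the quick-start example below: the prediction covers
# tokens 2-5, the reference covers tokens 3-5, and the labels agree,
# so the prediction is larger than the target span.
print(classify_span_pair(("MISC", 2, 6), ("MISC", 3, 6)))  # BEL
```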
|
|
|
For more information on the reasoning behind the fair metrics and how they are computed from the redefined error counts, please refer to the [original paper](https://aclanthology.org/2022.lrec-1.150.pdf).
|
|
|
## How to Use |
|
The current Hugging Face implementation accepts predictions and references as sentences of IOB-formatted labels.

The simplest usage example is:
|
|
|
```python |
|
>>> import evaluate
>>> faireval = evaluate.load("illorca/fairevaluation")
|
>>> pred = ['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O'] |
|
>>> ref = ['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O'] |
|
>>> results = faireval.compute(predictions=pred, references=ref) |
|
``` |
|
|
|
### Inputs |
|
- **predictions** *(list)*: list of predictions to score. Each predicted sentence |
|
should be a list of IOB-formatted labels corresponding to each sentence token. |
|
Each predicted sentence must have the same number of tokens as its reference.
|
- **references** *(list)*: list of references, one for each prediction. Each reference sentence
|
should be a list of IOB-formatted labels corresponding to each sentence token. |
|
|
|
### Output Values |
|
A dictionary with: |
|
- TP: count of True Positives |
|
- FP: count of False Positives |
|
- FN: count of False Negatives |
|
- LE: count of Labeling Errors |
|
- BE: count of Boundary Errors |
|
- BEO: count of Boundary Errors where the prediction overlaps with the reference

- BES: count of Boundary Errors where the prediction is smaller than the reference

- BEL: count of Boundary Errors where the prediction is larger than the reference

- LBE: count of Labeling-Boundary Errors
|
- Prec: fair precision |
|
- Rec: fair recall |
|
- F1: fair F1-score |
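Continuing the quick-start example from the How to Use section, the fair scores can be read from the returned dictionary by key (a sketch assuming the keys listed above):

```python
>>> results = faireval.compute(predictions=pred, references=ref)
>>> results["Prec"], results["Rec"], results["F1"]  # fair precision, recall, and F1-score
```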
|
|
|
#### Values from Popular Papers |
|
*Examples, preferably with links to leaderboards or publications, of papers that have reported this metric, along with the values they have reported.*
|
|
|
*Under construction* |
|
|
|
### Examples |
|
*Code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.* |
|
|
|
*Under construction* |
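In the meantime, here is one illustrative case using the same flat IOB input format as the quick-start example above: the predicted span has the correct boundaries but the wrong label, which the definitions above count once as a labeling error (LE) rather than as a false positive plus a false negative. The counts given in the comment are what those definitions imply, not verified output of this implementation.

```python
>>> import evaluate
>>> faireval = evaluate.load("illorca/fairevaluation")
>>> pred = ['O', 'B-ORG', 'I-ORG', 'O', 'B-PER', 'O']
>>> ref = ['O', 'B-PER', 'I-PER', 'O', 'B-PER', 'O']
>>> results = faireval.compute(predictions=pred, references=ref)
>>> results['LE'], results['TP']  # expected: one labeling error and one true positive
```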
|
|
|
## Limitations and Bias |
|
*Note any known limitations or biases that the metric has, with links and references if possible.* |
|
|
|
*Under construction* |
|
|
|
## Citation |
|
Ortmann, Katrin. 2022. Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans. In *Proceedings of the Language Resources and Evaluation Conference (LREC)*, Marseille, France, pages 1400–1407. [PDF](https://aclanthology.org/2022.lrec-1.150.pdf) |
|
|
|
```bibtex |
|
@inproceedings{ortmann2022, |
|
title = {Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans}, |
|
author = {Katrin Ortmann}, |
|
url = {https://aclanthology.org/2022.lrec-1.150}, |
|
year = {2022}, |
|
date = {2022-06-21}, |
|
booktitle = {Proceedings of the Language Resources and Evaluation Conference (LREC)}, |
|
pages = {1400-1407}, |
|
publisher = {European Language Resources Association}, |
|
address = {Marseille, France}, |
|
pubstate = {published}, |
|
type = {inproceedings} |
|
} |
|
``` |