---
title: FairEvaluation
tags:
- evaluate
- metric
description: "Fair evaluation of labeled spans that counts every annotation error only once, avoiding the double penalties of traditional precision, recall, and F1."
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
---

# Metric: Fair Evaluation

## Metric Description

The traditional evaluation of labeled spans in NLP with precision, recall, and F1-score leads to double penalties for close-to-correct annotations. As Manning (2006) argues in an article about named entity recognition, this can lead to undesirable effects when systems are optimized for these traditional metrics. Building on his ideas, Katrin Ortmann (2022) develops FairEval: a new evaluation method that reflects true annotation quality more accurately by ensuring that every error is counted only once. In addition to the traditional categories of true positives (TP), false positives (FP), and false negatives (FN), the new method takes into account the more fine-grained error types suggested by Manning: labeling errors (LE), boundary errors (BE), and labeling-boundary errors (LBE). The method further distinguishes three types of boundary errors:

- BES: the system's annotation is smaller than the target span
- BEL: the system's annotation is larger than the target span
- BEO: the system span overlaps with the target span

For more information on the reasoning behind the fair metrics and their computation from the redefined error counts, please refer to the [original paper](https://aclanthology.org/2022.lrec-1.150.pdf).

## How to Use

The current HuggingFace implementation accepts predictions and references as sentences in IOB format. The simplest usage example is:

```python
>>> import evaluate
>>> faireval = evaluate.load("illorca/fairevaluation")
>>> pred = ['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O']
>>> ref = ['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O']
>>> results = faireval.compute(predictions=pred, references=ref)
```
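
In this example the predicted MISC span starts one token too early, so FairEval counts a single boundary error (the prediction is larger than the target span, i.e. BEL) instead of one false positive plus one false negative. The sketch below is purely illustrative and is not the metric's internal implementation: it extracts spans from the IOB labels and classifies each aligned prediction/reference pair into the error types described above, using a naive one-to-one pairing that only works because the spans in this example line up.

```python
# Illustrative sketch of FairEval's error taxonomy -- NOT the metric's actual code.

def iob_to_spans(labels):
    """Convert a flat list of IOB labels into (type, start, end) spans, end exclusive."""
    spans, start, ent_type = [], None, None
    for i, tag in enumerate(labels + ['O']):  # sentinel 'O' flushes the last open span
        if tag == 'O' or tag.startswith('B-') or (tag.startswith('I-') and tag[2:] != ent_type):
            if start is not None:
                spans.append((ent_type, start, i))
                start, ent_type = None, None
        if tag.startswith('B-'):
            start, ent_type = i, tag[2:]
        elif tag.startswith('I-') and start is None:  # tolerate I- without a preceding B-
            start, ent_type = i, tag[2:]
    return spans

def classify(pred_span, gold_span):
    """Classify one predicted span against one overlapping gold span."""
    (p_type, p_start, p_end), (g_type, g_start, g_end) = pred_span, gold_span
    same_label = p_type == g_type
    same_bounds = (p_start, p_end) == (g_start, g_end)
    if same_label and same_bounds:
        return 'TP'
    if same_bounds:                              # correct boundaries, wrong label
        return 'LE'
    if not same_label:                           # wrong label and wrong boundaries
        return 'LBE'
    # same label, different boundaries -> boundary error subtype
    if g_start <= p_start and p_end <= g_end:
        return 'BES'                             # prediction smaller than the target span
    if p_start <= g_start and g_end <= p_end:
        return 'BEL'                             # prediction larger than the target span
    return 'BEO'                                 # partial overlap

pred = ['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O']
ref  = ['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O']

# Naive pairing by position, sufficient only for this aligned toy example.
for p, g in zip(iob_to_spans(pred), iob_to_spans(ref)):
    print(p, g, '->', classify(p, g))
# ('MISC', 2, 6) ('MISC', 3, 6) -> BEL   (one boundary error, not FP + FN)
# ('PER', 7, 9) ('PER', 7, 9) -> TP
```

The actual metric additionally counts unmatched spans as FP or FN and derives the fair precision, recall, and F1 scores from the resulting error counts, as described in the paper.
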
### Inputs

- **predictions** *(list)*: list of predictions to score. Each predicted sentence should be a list of IOB-formatted labels corresponding to the sentence tokens. Predicted sentences must have the same number of tokens as the corresponding references.
- **references** *(list)*: list of references, one for each prediction. Each reference sentence should be a list of IOB-formatted labels corresponding to the sentence tokens.

### Output Values

A dictionary with:

- TP: count of True Positives
- FP: count of False Positives
- FN: count of False Negatives
- LE: count of Labeling Errors
- BE: count of Boundary Errors
- BEO: count of boundary errors where the prediction overlaps with the reference
- BES: count of boundary errors where the prediction is smaller than the reference
- BEL: count of boundary errors where the prediction is larger than the reference
- LBE: count of Label-and-Boundary Errors
- Prec: fair precision
- Rec: fair recall
- F1: fair F1-score

#### Values from Popular Papers

*Examples, preferably with links to leaderboards or publications, of papers that have reported this metric, along with the values they have reported.*

*Under construction*

### Examples

*Code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*

*Under construction*

## Limitations and Bias

*Note any known limitations or biases that the metric has, with links and references if possible.*

*Under construction*

## Citation

Ortmann, Katrin. 2022. Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans. In *Proceedings of the Language Resources and Evaluation Conference (LREC)*, Marseille, France, pages 1400–1407. [PDF](https://aclanthology.org/2022.lrec-1.150.pdf)

```bibtex
@inproceedings{ortmann2022,
  title     = {Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans},
  author    = {Katrin Ortmann},
  url       = {https://aclanthology.org/2022.lrec-1.150},
  year      = {2022},
  date      = {2022-06-21},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference (LREC)},
  pages     = {1400--1407},
  publisher = {European Language Resources Association},
  address   = {Marseille, France},
  pubstate  = {published},
  type      = {inproceedings}
}
```