|
--- |
|
title: FairEvaluation |
|
tags: |
|
- evaluate |
|
- metric |
|
description: "FairEval: fair evaluation of labeled spans (e.g. named entities) that counts each error only once, following Ortmann (2022)."
|
sdk: gradio |
|
sdk_version: 3.0.2 |
|
app_file: app.py |
|
pinned: false |
|
--- |
|
|
|
# Metric: Fair Evaluation |
|
|
|
## Metric Description |
|
The traditional evaluation of labeled spans in NLP with precision, recall, and F1-score leads to double penalties for
|
close-to-correct annotations. As Manning (2006) argues in an article about named entity recognition, this can lead to |
|
undesirable effects when systems are optimized for these traditional metrics. |
|
|
|
Building on his ideas, Katrin Ortmann (2022) develops FairEval: a new evaluation method that more accurately reflects |
|
true annotation quality by ensuring that every error is counted only once. In addition to the traditional categories of |
|
true positives (TP), false positives (FP), and false negatives (FN), the new method takes into account the more |
|
fine-grained error types suggested by Manning: labeling errors (LE), boundary errors (BE), and labeling-boundary |
|
errors (LBE). Additionally, the method distinguishes three types of boundary errors:
|
- BES: the system's annotation is smaller than the target span |
|
- BEL: the system's annotation is larger than the target span |
|
- BEO: the system's annotation overlaps with the target span without one containing the other
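To make these categories concrete, here is a minimal illustrative sketch that classifies a single predicted span against a single reference span according to the definitions above. It is not the paper's full matching procedure and not part of this metric's API; the helper name `classify_span_pair` and the `(label, start, end)` span representation (with an exclusive end index) are assumptions made for the illustration.

```python
def classify_span_pair(pred, ref):
    """Classify one (prediction, reference) span pair into the error types above.

    Spans are (label, start, end) tuples with an exclusive end index.
    Illustrative only; see the paper for the full matching procedure.
    """
    p_label, p_start, p_end = pred
    r_label, r_start, r_end = ref
    same_label = p_label == r_label
    same_bounds = (p_start, p_end) == (r_start, r_end)
    overlap = p_start < r_end and r_start < p_end

    if same_bounds and same_label:
        return "TP"   # exact match
    if same_bounds and not same_label:
        return "LE"   # correct boundaries, wrong label
    if overlap and same_label:
        if r_start <= p_start and p_end <= r_end:
            return "BES"  # prediction is smaller than the target span
        if p_start <= r_start and r_end <= p_end:
            return "BEL"  # prediction is larger than the target span
        return "BEO"      # spans overlap without one containing the other
    if overlap and not same_label:
        return "LBE"  # both the label and the boundaries are wrong
    return "FP"       # no overlap at all: a spurious prediction
                      # (an unmatched reference span would count as an FN)


# The MISC span from the quick-start example below: the prediction covers
# tokens 2-5, the reference covers tokens 3-5, and the labels agree,
# so the prediction is larger than the target span.
print(classify_span_pair(("MISC", 2, 6), ("MISC", 3, 6)))  # BEL
```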
|
|
|
For more information on the reasoning behind the fair metrics and how they are computed from the redefined error counts, please refer to the [original paper](https://aclanthology.org/2022.lrec-1.150.pdf).
|
|
|
## How to Use |
|
The current Hugging Face implementation accepts predictions and references as sentences of IOB-formatted labels.

The simplest usage example is:
|
|
|
```python |
|
>>> import evaluate
>>> faireval = evaluate.load("illorca/fairevaluation")
|
>>> pred = ['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O'] |
|
>>> ref = ['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O'] |
|
>>> results = faireval.compute(predictions=pred, references=ref) |
|
``` |
|
|
|
### Inputs |
|
- **predictions** *(list)*: list of predictions to score. Each predicted sentence |
|
should be a list of IOB-formatted labels corresponding to each sentence token. |
|
Each predicted sentence must have the same number of tokens as its reference.
|
- **references** *(list)*: list of references, one for each prediction. Each reference sentence
|
should be a list of IOB-formatted labels corresponding to each sentence token. |
|
|
|
### Output Values |
|
A dictionary with: |
|
- TP: count of True Positives |
|
- FP: count of False Positives |
|
- FN: count of False Negatives |
|
- LE: count of Labeling Errors |
|
- BE: count of Boundary Errors |
|
- BEO: count of Boundary Errors where the prediction overlaps with the reference

- BES: count of Boundary Errors where the prediction is smaller than the reference

- BEL: count of Boundary Errors where the prediction is larger than the reference

- LBE: count of Labeling-Boundary Errors
|
- Prec: fair precision |
|
- Rec: fair recall |
|
- F1: fair F1-score |
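Continuing the quick-start example from the How to Use section, the fair scores can be read from the returned dictionary by key (a sketch assuming the keys listed above):

```python
>>> results = faireval.compute(predictions=pred, references=ref)
>>> results["Prec"], results["Rec"], results["F1"]  # fair precision, recall, and F1-score
```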
|
|
|
#### Values from Popular Papers |
|
*Examples, preferably with links to leaderboards or publications, of papers that have reported this metric, along with the values they have reported.*
|
|
|
*Under construction* |
|
|
|
### Examples |
|
*Code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.* |
|
|
|
*Under construction* |
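In the meantime, here is one illustrative case using the same flat IOB input format as the quick-start example above: the predicted span has the correct boundaries but the wrong label, which the definitions above count once as a labeling error (LE) rather than as a false positive plus a false negative. The counts given in the comment are what those definitions imply, not verified output of this implementation.

```python
>>> import evaluate
>>> faireval = evaluate.load("illorca/fairevaluation")
>>> pred = ['O', 'B-ORG', 'I-ORG', 'O', 'B-PER', 'O']
>>> ref = ['O', 'B-PER', 'I-PER', 'O', 'B-PER', 'O']
>>> results = faireval.compute(predictions=pred, references=ref)
>>> results['LE'], results['TP']  # expected: one labeling error and one true positive
```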
|
|
|
## Limitations and Bias |
|
*Note any known limitations or biases that the metric has, with links and references if possible.* |
|
|
|
*Under construction* |
|
|
|
## Citation |
|
Ortmann, Katrin. 2022. Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans. In *Proceedings of the Language Resources and Evaluation Conference (LREC)*, Marseille, France, pages 1400–1407. [PDF](https://aclanthology.org/2022.lrec-1.150.pdf) |
|
|
|
```bibtex |
|
@inproceedings{ortmann2022, |
|
title = {Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans}, |
|
author = {Katrin Ortmann}, |
|
url = {https://aclanthology.org/2022.lrec-1.150}, |
|
year = {2022}, |
|
date = {2022-06-21}, |
|
booktitle = {Proceedings of the Language Resources and Evaluation Conference (LREC)}, |
|
pages = {1400-1407}, |
|
publisher = {European Language Resources Association}, |
|
address = {Marseille, France}, |
|
pubstate = {published}, |
|
type = {inproceedings} |
|
} |
|
``` |