illorca committed
Commit 91f1c8a
1 Parent(s): 3109162

Update readme: WNUT results and limitations

Files changed (1): README.md (+13 -4)
README.md CHANGED
@@ -132,14 +132,23 @@ The output for different modes and error_formats is:
 ```
 
 #### Values from Popular Papers
- *Examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
+ A basic [DistilBERT model](https://huggingface.co/docs/transformers/model_doc/distilbert) fine-tuned on the
+ [WNUT-17](https://huggingface.co/datasets/wnut_17) dataset yields the following F1 scores; seqeval is shown for comparison.
 
- *Under construction*
+ |             | Overall | Location | Group  | Person | Creative Work | Corporation | Product |
+ |-------------|---------|----------|--------|--------|---------------|-------------|---------|
+ | Traditional | 0.2803  | 0.4124   | 0.0412 | 0.4105 | 0.0           | 0.1985      | 0.0     |
+ | Fair        | 0.3199  | 0.5247   | 0.0459 | 0.4643 | 0.0           | 0.2666      | 0.0     |
+ | Weighted    | 0.3842  | 0.5638   | 0.0681 | 0.5676 | 0.0           | 0.2910      | 0.0     |
+ | seqeval     | 0.2222  | 0.3425   | 0.0413 | 0.3598 | 0.0           | 0.0408      | 0.0     |
 
 ## Limitations and Bias
- *Note any known limitations or biases that the metric has, with links and references if possible.*
+ The metric is restricted to the input schemes accepted by seqeval. For example, it does not support numerical
+ label inputs (odd for Beginning, even for Inside, and zero for Outside).
 
- *Under construction*
+ The choice of custom weights for weighted evaluation is a subjective decision by the user. Neither weighted nor fair
+ evaluation results are comparable to traditional span-based metrics computed for other dataset-model pairs. Although
+ traditional mode should be comparable to such classical span-based metrics, there is, for instance, a noticeable gap to seqeval.
 
 ## Citation
 Ortmann, Katrin. 2022. Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans. In *Proceedings of the Language Resources and Evaluation Conference (LREC)*, Marseille, France, pages 1400–1407. [PDF](https://aclanthology.org/2022.lrec-1.150.pdf)
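
For context, here is a minimal sketch of how a comparison like the one added in this diff could be produced with the `evaluate` library. Only `seqeval` is a stock metric here; the Hub path `hpi-dhc/FairEval`, the `mode` keyword, and the toy labels are illustrative assumptions, not taken from this commit.

```python
# Hedged sketch: score IOB2 tags with seqeval, then (hypothetically) with a
# FairEval implementation loaded from the Hugging Face Hub.
import evaluate

# Toy WNUT-17-style input: one inner list of IOB2 tags per sentence.
references = [["O", "B-location", "I-location", "O", "B-person"]]
predictions = [["O", "B-location", "O", "O", "B-person"]]

seqeval = evaluate.load("seqeval")
print("seqeval F1:", seqeval.compute(predictions=predictions,
                                     references=references)["overall_f1"])

faireval = evaluate.load("hpi-dhc/FairEval")  # assumed Hub path
for mode in ("traditional", "fair", "weighted"):  # the three modes in the table
    # The `mode` keyword is an assumption based on the modes named in the README.
    scores = faireval.compute(predictions=predictions, references=references, mode=mode)
    print(mode, scores)
```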
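The numeric-label limitation described in the new "Limitations and Bias" section can be worked around by converting integer tags to strings before scoring. A sketch, assuming the label order published with the `wnut_17` dataset on the Hub:

```python
# Map HF-datasets-style integer NER tags (0 = O, odd = B-, even = I-) back to
# the IOB2 strings that seqeval-style metrics expect.
# Assumption: this is the label order of the `wnut_17` dataset on the Hub.
WNUT_LABELS = [
    "O",
    "B-corporation", "I-corporation",
    "B-creative-work", "I-creative-work",
    "B-group", "I-group",
    "B-location", "I-location",
    "B-person", "I-person",
    "B-product", "I-product",
]

def ids_to_iob2(tag_ids):
    """Convert one sentence of integer NER tags to IOB2 label strings."""
    return [WNUT_LABELS[i] for i in tag_ids]

# [0, 7, 8, 0, 9] -> ['O', 'B-location', 'I-location', 'O', 'B-person']
print(ids_to_iob2([0, 7, 8, 0, 9]))
```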