illorca committed
Commit 91f1c8a
1 Parent(s): 3109162

Update readme: WNUT results and limitations

Files changed (1): README.md (+13 -4)
README.md CHANGED
@@ -132,14 +132,23 @@ The output for different modes and error_formats is:
 ```
 
 #### Values from Popular Papers
- *Examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
+ A basic [DistilBERT model](https://huggingface.co/docs/transformers/model_doc/distilbert) fine-tuned on the
+ [WNUT-17](https://huggingface.co/datasets/wnut_17) dataset yields the following F1 scores; seqeval is shown for comparison.
 
- *Under construction*
+ |             | Overall | Location | Group  | Person | Creative Work | Corporation | Product |
+ |-------------|---------|----------|--------|--------|---------------|-------------|---------|
+ | Traditional | 0.2803  | 0.4124   | 0.0412 | 0.4105 | 0.0           | 0.1985      | 0.0     |
+ | Fair        | 0.3199  | 0.5247   | 0.0459 | 0.4643 | 0.0           | 0.2666      | 0.0     |
+ | Weighted    | 0.3842  | 0.5638   | 0.0681 | 0.5676 | 0.0           | 0.2910      | 0.0     |
+ | seqeval     | 0.2222  | 0.3425   | 0.0413 | 0.3598 | 0.0           | 0.0408      | 0.0     |
 
 ## Limitations and Bias
- *Note any known limitations or biases that the metric has, with links and references if possible.*
+ The metric is restricted to the input schemes accepted by seqeval. For example, it does not support numerical
+ label inputs (odd for Beginning, even for Inside, and zero for Outside).
 
- *Under construction*
+ The choice of custom weights for weighted evaluation is a subjective decision by the user. Neither weighted nor fair
+ evaluation results are comparable to traditional span-based metrics computed for other dataset-model pairs. Although
+ traditional mode should be comparable to such classical span-based metrics, there is, for instance, a noticeable gap to seqeval.
 
 ## Citation
 Ortmann, Katrin. 2022. Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans. In *Proceedings of the Language Resources and Evaluation Conference (LREC)*, Marseille, France, pages 1400–1407. [PDF](https://aclanthology.org/2022.lrec-1.150.pdf)
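
For context, here is a minimal sketch of how a comparison like the one added in this diff could be produced with the `evaluate` library. Only `seqeval` is a stock metric here; the Hub path `hpi-dhc/FairEval`, the `mode` keyword, and the toy labels are illustrative assumptions, not taken from this commit.

```python
# Hedged sketch: score IOB2 tags with seqeval, then (hypothetically) with a
# FairEval implementation loaded from the Hugging Face Hub.
import evaluate

# Toy WNUT-17-style input: one inner list of IOB2 tags per sentence.
references = [["O", "B-location", "I-location", "O", "B-person"]]
predictions = [["O", "B-location", "O", "O", "B-person"]]

seqeval = evaluate.load("seqeval")
print("seqeval F1:", seqeval.compute(predictions=predictions,
                                     references=references)["overall_f1"])

faireval = evaluate.load("hpi-dhc/FairEval")  # assumed Hub path
for mode in ("traditional", "fair", "weighted"):  # the three modes in the table
    # The `mode` keyword is an assumption based on the modes named in the README.
    scores = faireval.compute(predictions=predictions, references=references, mode=mode)
    print(mode, scores)
```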
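The numeric-label limitation described in the new "Limitations and Bias" section can be worked around by converting integer tags to strings before scoring. A sketch, assuming the label order published with the `wnut_17` dataset on the Hub:

```python
# Map HF-datasets-style integer NER tags (0 = O, odd = B-, even = I-) back to
# the IOB2 strings that seqeval-style metrics expect.
# Assumption: this is the label order of the `wnut_17` dataset on the Hub.
WNUT_LABELS = [
    "O",
    "B-corporation", "I-corporation",
    "B-creative-work", "I-creative-work",
    "B-group", "I-group",
    "B-location", "I-location",
    "B-person", "I-person",
    "B-product", "I-product",
]

def ids_to_iob2(tag_ids):
    """Convert one sentence of integer NER tags to IOB2 label strings."""
    return [WNUT_LABELS[i] for i in tag_ids]

# [0, 7, 8, 0, 9] -> ['O', 'B-location', 'I-location', 'O', 'B-person']
print(ids_to_iob2([0, 7, 8, 0, 9]))
```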