Spaces: hpi-dhc/FairEval

illorca committed on Dec 19, 2022

Commit

0e946aa

1 Parent(s): 0f866a9

Update README.md

Files changed (1)

README.md +30 -46

@@ -43,7 +43,7 @@ Predicted sentences must have the same number of tokens as the references.
 The optional arguments are:
 - **mode** *(str)*: 'fair', 'traditional' or 'weighted'. Controls the desired output. The default value is 'fair'.
-  - 'traditional': equivalent to seqeval's metrics / classic span-based evaluation.
   - 'fair': default fair score calculation. Fair will also show traditional scores for comparison.
   - 'weighted': custom score calculation with the weights passed. Weighted will also show traditional scores for comparison.
 - **weights** *(dict)*: dictionary with the weight of each error for the custom score calculation.
@@ -127,60 +127,44 @@ Computing the evaluation metrics on the results from [this model](https://huggin
 run on the test split of [CoNLL2003 dataset](https://huggingface.co/datasets/conll2003), we obtain the following F1-Scores:
 | F1   Scores     | overall | location | miscellaneous | organization | person |
-|-----------------|---------:|----------:|--------------:|--------------:|--------:|
-| traditional     | 0,90    | 0,92     | 0,79         | 0,87         | 0,96   |
 | fair            | 0,94    | 0,96     | 0,85         | 0,92         | 0,97   |
-| seqeval strict  | 0,90     | 0,92     | 0,79         | 0,87         | 0,96   |
-| seqeval relaxed | 0,89    | 0,92     | 0,78         | 0,86         | 0,96   |
-The traditional error count is:
-|    | overall (error ratio \| entity ratio) | location | miscellaneous | organization | person |
-|----|-----------------------------------------:|----------:|--------------:|--------------:|--------:|
-| TP | 5104 ( - \| 90,36%)                     | 1545     | 561          | 1452         | 1546   |
-| FP | 534 (49,53% \| 9,45%)                   | 128      | 154          | 208          | 44     |
-| FN | 544 (50,46% \| 9,63%)                   | 123      | 141          | 209          | 71     |
-And the fair count is:
-|     | overall               | location | miscellaneous | organization | person |
-|-----|-----------------------:|----------:|--------------:|--------------:|--------:|
-| TP  | 5104 ( - \| 90,36%)   | 1545     | 561          | 1452         | 1546   |
-| FP  | 126 (18,47% \| 2,23%) | 20       | 48           | 47           | 11     |
-| FN  | 124 (18,18% \| 2,19%) | 13       | 47           | 47           | 17     |
-| LE  | 219 (32,11% \| 3,87%) | 62       | 41           | 73           | 43     |
-| BE  | 126 (18,47% \| 2,23%) | 16       | 46           | 53           | 11     |
-| LBE | 87 (12,75% \| 1,54%)  | 32       | 13           | 41           | 1      |
 #### WNUT-17
 Computing the evaluation metrics on the results from [this model](https://huggingface.co/muhtasham/bert-small-finetuned-wnut17-ner)
 run on the test split of [WNUT-17 dataset](https://huggingface.co/datasets/wnut_17), we obtain the following F1-Scores:
 |                 | overall | location | group  | person | creative work | corporation | product |
-|-----------------|---------:|----------:|--------:|--------:|---------------:|-------------:|---------:|
-| traditional     |  0,34 |   0,52 | 0,02 | 0,54 |           0,0 |      0,02 |     0,0 |
-| fair            |  0,37 |   0,58 | 0,02 | 0,58 |           0,0 |      0,02 |     0,0 |
-| seqeval strict  |  0,34 |   0,52 | 0,02 | 0,54 |           0,0 |      0,02 |     0,0 |
-| seqeval relaxed |  0,33 |   0,49 | 0,02 | 0,54 |           0,0 |      0,02 |     0,0 |
-The traditional count of errors would be:
-|    | overall (error ratio \| entity ratio) | location | group | person | creative work | corporation | product |
-|----|---------:|----------:|-------:|--------:|---------------:|-------------:|---------:|
-| TP |     255 ( - \| 23,63%)|       67 |     2 |    185 |             0 |           1 |       0 |
-| FP |     135 ( 14,07% \| 12,51%)|       38 |    20 |     60 |             0 |          17 |       0 |
-| FN |     824 ( 85,92% \| 76,36%)|       83 |   163 |    244 |           142 |          65 |     127 |
-While the fair count is:
-|     | overall (error ratio \| entity ratio) | location | group | person | creative work | corporation | product |
-|-----|---------:|----------:|-------:|--------:|---------------:|-------------:|---------:|
-| TP           | 255 ( - \| 23,63%)                    | 67       | 2     | 185    | 0             | 1           | 0       |
-| FP           | 31 (3,6% \| 2,87%)                    | 10       | 3     | 16     | 0             | 2           | 0       |
-| FN           | 725 (84,11% \| 67,19%)                | 71       | 135   | 233    | 120           | 54          | 112     |
-| LE           | 47 (5,45% \| 4,35%)                   | 4        | 18    | 2      | 6             | 7           | 10      |
-| LBE          | 29 (3,36% \| 2,68%)                   | 1        | 6     | 0      | 16            | 1           | 5       |
-| BE           | 30 (3,48% \| 2,78%)                   | 10       | 4     | 13     | 0             | 3           | 0       |
 ## Limitations and Bias
 The metric is restricted to the input schemes admitted by seqeval. For example, the application does not support numerical

 The optional arguments are:
 - **mode** *(str)*: 'fair', 'traditional' or 'weighted'. Controls the desired output. The default value is 'fair'.
+  - 'traditional': equivalent to seqeval's 'strict' mode. Bear in mind that the default mode for seqeval is 'relaxed', which does not match any of the FairEval modes.
   - 'fair': default fair score calculation. Fair will also show traditional scores for comparison.
   - 'weighted': custom score calculation with the weights passed. Weighted will also show traditional scores for comparison.
 - **weights** *(dict)*: dictionary with the weight of each error for the custom score calculation.
 run on the test split of [CoNLL2003 dataset](https://huggingface.co/datasets/conll2003), we obtain the following F1-Scores:
 | F1   Scores     | overall | location | miscellaneous | organization | person |
+|-----------------|--------:|---------:|-------------:|-------------:|-------:|
 | fair            | 0,94    | 0,96     | 0,85         | 0,92         | 0,97   |
+| traditional     | 0,90    | 0,92     | 0,79         | 0,87         | 0,96   |
+| seqeval strict  | 0,90    | 0,92     | 0,79         | 0,87         | 0,96   |
+| seqeval relaxed | 0,90    | 0,92     | 0,78         | 0,87         | 0,96   |
+Error counts (traditional on the left, fair on the right):
+|     | overall |      | location |      | miscellaneous |     | organization |      | person |      |
+|-----|--------:|-----:|---------:|-----:|-------------:|----:|-------------:|-----:|-------:|-----:|
+| TP  | 5104    | 5104 | 1545     | 1545 | 561          | 561 | 1452         | 1452 | 1546   | 1546 |
+| FP  | 534     | 126  | 128      | 20   | 154          | 48  | 208          | 47   | 44     | 11   |
+| FN  | 544     | 124  | 123      | 13   | 141          | 47  | 209          | 47   | 71     | 17   |
+| LE  |         | 219  |          | 62   |              | 41  |              | 73   |        | 43   |
+| BE  |         | 126  |          | 16   |              | 46  |              | 53   |        | 11   |
+| LBE |         | 87   |          | 32   |              | 13  |              | 41   |        | 1    |
 #### WNUT-17
 Computing the evaluation metrics on the results from [this model](https://huggingface.co/muhtasham/bert-small-finetuned-wnut17-ner)
 run on the test split of [WNUT-17 dataset](https://huggingface.co/datasets/wnut_17), we obtain the following F1-Scores:
 |                 | overall | location | group  | person | creative work | corporation | product |
+|-----------------|--------:|---------:|-------:|-------:|--------------:|------------:|--------:|
+| fair            |  0,37 |   0,58 | 0,02 | 0,58 |           0,0 |      0,03 |     0,0 |
+| traditional     |  0,35 |   0,53 | 0,02 | 0,55 |           0,0 |      0,02 |     0,0 |
+| seqeval strict  |  0,35 |   0,53 | 0,02 | 0,55 |           0,0 |      0,02 |     0,0 |
+| seqeval relaxed |  0,34 |   0,49 | 0,02 | 0,55 |           0,0 |      0,02 |     0,0 |
+Error counts (traditional on the left, fair on the right):
+|     | overall |     | location |    | group |     | person |     | creative work |     | corporation |    | product |     |
+|-----|--------:|----:|---------:|---:|------:|----:|-------:|----:|--------------:|----:|------------:|---:|--------:|----:|
+| TP  |     255 | 255 |       67 | 67 |     2 | 2   |    185 | 185 |             0 | 0   |           1 | 1  |       0 | 0   |
+| FP  |     135 | 31  |       38 | 10 |    20 | 3   |     60 | 16  |             0 | 0   |          17 | 2  |       0 | 0   |
+| FN  |     824 | 725 |       83 | 71 |   163 | 135 |    244 | 233 |           142 | 120 |          65 | 54 |     127 | 112 |
+| LE  |         | 47  |          | 4  |       | 18  |        | 2   |               | 6   |             | 7  |         | 10  |
+| BE  |         | 30  |          | 10 |       | 4   |        | 13  |               | 0   |             | 3  |         | 0   |
+| LBE |         | 29  |          | 1  |       | 6   |        | 0   |               | 16  |             | 1  |         | 5   |
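
The overall F1 scores in the tables above can be reproduced from the overall error counts. The sketch below is not FairEval's implementation. It assumes that each LE, BE and LBE error counts as half a false positive and half a false negative, a weighting chosen here only because it matches the rounded scores reported for both datasets. The names `Counts` and `f1_from_counts` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Overall error counts as listed in the tables above."""
    tp: int
    fp: int
    fn: int
    le: int = 0   # only reported in the fair count
    be: int = 0   # only reported in the fair count
    lbe: int = 0  # only reported in the fair count


def f1_from_counts(c: Counts) -> float:
    # Assumption (not taken from the library): every LE/BE/LBE error
    # is split evenly, 0.5 as a false positive and 0.5 as a false negative.
    shared = 0.5 * (c.le + c.be + c.lbe)
    fp = c.fp + shared
    fn = c.fn + shared
    denom = 2 * c.tp + fp + fn
    return 2 * c.tp / denom if denom else 0.0


# Overall counts copied from the CoNLL2003 and WNUT-17 tables.
conll_traditional = Counts(tp=5104, fp=534, fn=544)
conll_fair = Counts(tp=5104, fp=126, fn=124, le=219, be=126, lbe=87)
wnut_traditional = Counts(tp=255, fp=135, fn=824)
wnut_fair = Counts(tp=255, fp=31, fn=725, le=47, be=30, lbe=29)

for name, counts in [
    ("CoNLL2003 traditional", conll_traditional),  # 0.90
    ("CoNLL2003 fair", conll_fair),                # 0.94
    ("WNUT-17 traditional", wnut_traditional),     # 0.35
    ("WNUT-17 fair", wnut_fair),                   # 0.37
]:
    print(f"{name}: {f1_from_counts(counts):.2f}")
```

With no LE, BE or LBE errors the function reduces to the usual F1 = 2TP / (2TP + FP + FN), which is why the same helper covers the traditional columns.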
 ## Limitations and Bias
 The metric is restricted to the input schemes admitted by seqeval. For example, the application does not support numerical