illorca committed

Commit 0e946aa · 1 Parent(s): 0f866a9

Update README.md

Files changed (1):
  1. README.md +30 -46

README.md CHANGED
@@ -43,7 +43,7 @@ Predicted sentences must have the same number of tokens as the references.
 
 The optional arguments are:
 - **mode** *(str)*: 'fair', 'traditional' or 'weighted'. Controls the desired output. The default value is 'fair'.
-  - 'traditional': equivalent to seqeval's metrics / classic span-based evaluation.
   - 'fair': default fair score calculation. Fair will also show traditional scores for comparison.
   - 'weighted': custom score calculation with the weights passed. Weighted will also show traditional scores for comparison.
 - **weights** *(dict)*: dictionary with the weight of each error for the custom score calculation.
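The difference between the modes can be sketched with a small scoring helper. This is a minimal sketch, not the metric's actual implementation; in particular, charging each labeling (LE), boundary (BE), or labeling-and-boundary (LBE) error as half a false positive and half a false negative is an assumption based on the fair counting scheme described in the FairEval paper.

```python
# Minimal sketch of span-level scoring under 'traditional' vs. 'fair'
# counting. NOT the metric's implementation: the 0.5 charge per
# LE/BE/LBE error in 'fair' mode is an assumption taken from the
# FairEval paper's counting scheme.

def span_scores(tp, fp, fn, le=0, be=0, lbe=0, mode="traditional"):
    """Return (precision, recall, f1) for span-level counts."""
    extra = 0.5 * (le + be + lbe) if mode == "fair" else 0.0
    precision = tp / (tp + fp + extra)
    recall = tp / (tp + fn + extra)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Overall CoNLL-2003 counts from the tables later in this README:
_, _, f1_trad = span_scores(5104, 534, 544)                             # ~0.90
_, _, f1_fair = span_scores(5104, 126, 124, 219, 126, 87, mode="fair")  # ~0.94
```

With the CoNLL-2003 counts reported below, this reproduces the 0.90 traditional and 0.94 fair overall F1-Scores.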
@@ -127,60 +127,44 @@ Computing the evaluation metrics on the results from [this model](https://huggin
 run on the test split of [CoNLL2003 dataset](https://huggingface.co/datasets/conll2003), we obtain the following F1-Scores:
 
 | F1 Scores       | overall | location | miscellaneous | organization | person |
-|-----------------|--------:|---------:|--------------:|-------------:|-------:|
-| traditional     | 0.90    | 0.92     | 0.79          | 0.87         | 0.96   |
 | fair            | 0.94    | 0.96     | 0.85          | 0.92         | 0.97   |
-| seqeval strict  | 0.90    | 0.92     | 0.79          | 0.87         | 0.96   |
-| seqeval relaxed | 0.89    | 0.92     | 0.78          | 0.86         | 0.96   |
-
-The traditional error count is:
-
-|    | overall (error ratio \| entity ratio) | location | miscellaneous | organization | person |
-|----|--------------------------------------:|---------:|--------------:|-------------:|-------:|
-| TP | 5104 ( - \| 90.36%)                   | 1545     | 561           | 1452         | 1546   |
-| FP | 534 (49.53% \| 9.45%)                 | 128      | 154           | 208          | 44     |
-| FN | 544 (50.46% \| 9.63%)                 | 123      | 141           | 209          | 71     |
 
-And the fair count is:
 
-|     | overall (error ratio \| entity ratio) | location | miscellaneous | organization | person |
-|-----|--------------------------------------:|---------:|--------------:|-------------:|-------:|
-| TP  | 5104 ( - \| 90.36%)                   | 1545     | 561           | 1452         | 1546   |
-| FP  | 126 (18.47% \| 2.23%)                 | 20       | 48            | 47           | 11     |
-| FN  | 124 (18.18% \| 2.19%)                 | 13       | 47            | 47           | 17     |
-| LE  | 219 (32.11% \| 3.87%)                 | 62       | 41            | 73           | 43     |
-| BE  | 126 (18.47% \| 2.23%)                 | 16       | 46            | 53           | 11     |
-| LBE | 87 (12.75% \| 1.54%)                  | 32       | 13            | 41           | 1      |
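The ratios in parentheses can be reproduced directly from the counts. This is a reverse-engineered sketch, not the metric's code: it assumes "error ratio" is the share of that error type among all errors, and "entity ratio" divides by the 5648 reference entities (TP + FN under traditional counting).

```python
# Reverse-engineered from the fair-count table above (an assumption, not
# the metric's code): error ratio = count / total errors; entity ratio =
# count / total reference entities (traditional TP + FN).
fair = {"FP": 126, "FN": 124, "LE": 219, "BE": 126, "LBE": 87}
total_errors = sum(fair.values())    # 682
total_entities = 5104 + 544          # traditional TP + FN = 5648

error_ratio = {k: v / total_errors for k, v in fair.items()}
entity_ratio = {k: v / total_entities for k, v in fair.items()}
# e.g. LE: 219/682 ~ 32.11% of all errors, 219/5648 ~ 3.87% of entities
```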
 
 #### WNUT-17
 Computing the evaluation metrics on the results from [this model](https://huggingface.co/muhtasham/bert-small-finetuned-wnut17-ner)
 run on the test split of [WNUT-17 dataset](https://huggingface.co/datasets/wnut_17), we obtain the following F1-Scores:
 
 |                 | overall | location | group | person | creative work | corporation | product |
-|-----------------|--------:|---------:|------:|-------:|--------------:|------------:|--------:|
-| traditional     | 0.34    | 0.52     | 0.02  | 0.54   | 0.0           | 0.02        | 0.0     |
-| fair            | 0.37    | 0.58     | 0.02  | 0.58   | 0.0           | 0.02        | 0.0     |
-| seqeval strict  | 0.34    | 0.52     | 0.02  | 0.54   | 0.0           | 0.02        | 0.0     |
-| seqeval relaxed | 0.33    | 0.49     | 0.02  | 0.54   | 0.0           | 0.02        | 0.0     |
-
-The traditional count of errors would be:
-
-|    | overall (error ratio \| entity ratio) | location | group | person | creative work | corporation | product |
-|----|--------------------------------------:|---------:|------:|-------:|--------------:|------------:|--------:|
-| TP | 255 ( - \| 23.63%)                    | 67       | 2     | 185    | 0             | 1           | 0       |
-| FP | 135 (14.07% \| 12.51%)                | 38       | 20    | 60     | 0             | 17          | 0       |
-| FN | 824 (85.92% \| 76.36%)                | 83       | 163   | 244    | 142           | 65          | 127     |
-
-While the fair count is:
-
-|     | overall (error ratio \| entity ratio) | location | group | person | creative work | corporation | product |
-|-----|--------------------------------------:|---------:|------:|-------:|--------------:|------------:|--------:|
-| TP  | 255 ( - \| 23.63%)                    | 67       | 2     | 185    | 0             | 1           | 0       |
-| FP  | 31 (3.6% \| 2.87%)                    | 10       | 3     | 16     | 0             | 2           | 0       |
-| FN  | 725 (84.11% \| 67.19%)                | 71       | 135   | 233    | 120           | 54          | 112     |
-| LE  | 47 (5.45% \| 4.35%)                   | 4        | 18    | 2      | 6             | 7           | 10      |
-| LBE | 29 (3.36% \| 2.68%)                   | 1        | 6     | 0      | 16            | 1           | 5       |
-| BE  | 30 (3.48% \| 2.78%)                   | 10       | 4     | 13     | 0             | 3           | 0       |
 
 ## Limitations and Bias
 The metric is restricted to the input schemes admitted by seqeval. For example, the application does not support numerical
 
 
 The optional arguments are:
 - **mode** *(str)*: 'fair', 'traditional' or 'weighted'. Controls the desired output. The default value is 'fair'.
+  - 'traditional': equivalent to seqeval's 'strict' mode. Bear in mind that seqeval's default mode is 'relaxed', which does not match any of the FairEval modes.
   - 'fair': default fair score calculation. Fair will also show traditional scores for comparison.
   - 'weighted': custom score calculation with the weights passed. Weighted will also show traditional scores for comparison.
 - **weights** *(dict)*: dictionary with the weight of each error for the custom score calculation.
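For the 'weighted' mode, one plausible reading is that each error type is charged by its configured weight instead of a fixed amount. This is a hypothetical sketch: the key names and the counting rule are assumptions, not the metric's documented behavior, so check the metric card for the real schema.

```python
# Hypothetical sketch of 'weighted' scoring: each error type contributes
# weight * count to the precision/recall denominators. The key names and
# the rule itself are assumptions, not the metric's documented behavior.
def weighted_f1(tp, fp, fn, errors, weights):
    extra = sum(weights[k] * n for k, n in errors.items())
    precision = tp / (tp + fp + extra)
    recall = tp / (tp + fn + extra)
    return 2 * precision * recall / (precision + recall)

# With every weight at 0.5 this collapses to the fair score
# (overall CoNLL-2003 counts from the tables below):
f1 = weighted_f1(5104, 126, 124,
                 {"LE": 219, "BE": 126, "LBE": 87},
                 {"LE": 0.5, "BE": 0.5, "LBE": 0.5})   # ~0.94
```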
 
 run on the test split of [CoNLL2003 dataset](https://huggingface.co/datasets/conll2003), we obtain the following F1-Scores:
 
 | F1 Scores       | overall | location | miscellaneous | organization | person |
+|-----------------|--------:|---------:|--------------:|-------------:|-------:|
 | fair            | 0.94    | 0.96     | 0.85          | 0.92         | 0.97   |
+| traditional     | 0.90    | 0.92     | 0.79          | 0.87         | 0.96   |
+| seqeval strict  | 0.90    | 0.92     | 0.79          | 0.87         | 0.96   |
+| seqeval relaxed | 0.90    | 0.92     | 0.78          | 0.87         | 0.96   |
 
+With error counts (traditional on the left and fair on the right of each column pair):
 
+|     | overall |      | location |      | miscellaneous |     | organization |      | person |      |
+|-----|--------:|-----:|---------:|-----:|--------------:|----:|-------------:|-----:|-------:|-----:|
+| TP  | 5104    | 5104 | 1545     | 1545 | 561           | 561 | 1452         | 1452 | 1546   | 1546 |
+| FP  | 534     | 126  | 128      | 20   | 154           | 48  | 208          | 47   | 44     | 11   |
+| FN  | 544     | 124  | 123      | 13   | 141           | 47  | 209          | 47   | 71     | 17   |
+| LE  |         | 219  |          | 62   |               | 41  |              | 73   |        | 43   |
+| BE  |         | 126  |          | 16   |               | 46  |              | 53   |        | 11   |
+| LBE |         | 87   |          | 32   |               | 13  |              | 41   |        | 1    |
 
 #### WNUT-17
 Computing the evaluation metrics on the results from [this model](https://huggingface.co/muhtasham/bert-small-finetuned-wnut17-ner)
 run on the test split of [WNUT-17 dataset](https://huggingface.co/datasets/wnut_17), we obtain the following F1-Scores:
 
 |                 | overall | location | group | person | creative work | corporation | product |
+|-----------------|--------:|---------:|------:|-------:|--------------:|------------:|--------:|
+| fair            | 0.37    | 0.58     | 0.02  | 0.58   | 0.0           | 0.03        | 0.0     |
+| traditional     | 0.35    | 0.53     | 0.02  | 0.55   | 0.0           | 0.02        | 0.0     |
+| seqeval strict  | 0.35    | 0.53     | 0.02  | 0.55   | 0.0           | 0.02        | 0.0     |
+| seqeval relaxed | 0.34    | 0.49     | 0.02  | 0.55   | 0.0           | 0.02        | 0.0     |
+
+With error counts (traditional on the left and fair on the right of each column pair):
+
+|     | overall |     | location |    | group |     | person |     | creative work |     | corporation |    | product |     |
+|-----|--------:|----:|---------:|---:|------:|----:|-------:|----:|--------------:|----:|------------:|---:|--------:|----:|
+| TP  | 255     | 255 | 67       | 67 | 2     | 2   | 185    | 185 | 0             | 0   | 1           | 1  | 0       | 0   |
+| FP  | 135     | 31  | 38       | 10 | 20    | 3   | 60     | 16  | 0             | 0   | 17          | 2  | 0       | 0   |
+| FN  | 824     | 725 | 83       | 71 | 163   | 135 | 244    | 233 | 142           | 120 | 65          | 54 | 127     | 112 |
+| LE  |         | 47  |          | 4  |       | 18  |        | 2   |               | 6   |             | 7  |         | 10  |
+| BE  |         | 30  |          | 10 |       | 4   |        | 13  |               | 0   |             | 3  |         | 0   |
+| LBE |         | 29  |          | 1  |       | 6   |        | 0   |               | 16  |             | 1  |         | 5   |
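The corrected overall WNUT-17 scores can be cross-checked from these counts. Same caveat as with the CoNLL figures: charging each LE/BE/LBE error as half a false positive and half a false negative is an assumption based on the FairEval paper, not the metric's actual code.

```python
def f1_from_counts(tp, fp, fn, le=0, be=0, lbe=0, fair=False):
    # Fair counting assumption (from the FairEval paper): each LE/BE/LBE
    # error counts as half a false positive and half a false negative.
    extra = 0.5 * (le + be + lbe) if fair else 0.0
    p = tp / (tp + fp + extra)
    r = tp / (tp + fn + extra)
    return 2 * p * r / (p + r)

# Overall WNUT-17 counts from the table above:
f1_trad = f1_from_counts(255, 135, 824)                        # ~0.35
f1_fair = f1_from_counts(255, 31, 725, 47, 30, 29, fair=True)  # ~0.37
```

Note that the traditional counts reproduce 0.35, not the 0.34 reported before this commit.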
 
 ## Limitations and Bias
 The metric is restricted to the input schemes admitted by seqeval. For example, the application does not support numerical