Commit 68b945f
illorca committed · 1 Parent(s): 2f1260e

Include error counts in Popular Paper section

Files changed (1)
  1. README.md +85 -39
README.md CHANGED
@@ -82,53 +82,59 @@ Considering the following input annotated sentences:
  The output for different modes and error_formats is:
  ```python
  >>> faireval.compute(predictions=y_pred, references=y_true, mode='fair', error_format='count')
- {'PER': {'precision': 1.0, 'recall': 0.5, 'f1': 0.6666,
- "trad_prec": 0.5, "trad_rec": 0.5, "trad_f1": 0.5,
- 'TP': 1, 'FP': 0, 'FN': 1, 'LE': 0, 'BE': 0, 'LBE': 0},
- 'INT': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0,
- "trad_prec": 0.0, "trad_rec": 0.0, "trad_f1": 0.0,
- 'TP': 0, 'FP': 0, 'FN': 0, 'LE': 0, 'BE': 1, 'LBE': 1},
- 'OUT': {'precision': 0.6666, 'recall': 0.6666, 'f1': 0.6666,
- "trad_prec": 0.5, "trad_rec": 0.5, "trad_f1": 0.5,
- 'TP': 1, 'FP': 0, 'FN': 0, 'LE': 1, 'BE': 0, 'LBE': 0},
- 'overall_precision': 0.5714,
- 'overall_recall': 0.4444444444444444,
- 'overall_f1': 0.5,
- 'trad_prec': 0.5,
- 'trad_rec': 0.5,
- 'trad_f1': 0.5,
- 'TP': 2,
- 'FP': 0,
- 'FN': 1,
- 'LE': 1,
- 'BE': 1,
- 'LBE': 1}
  ```

  ```python
  >>> faireval.compute(predictions=y_pred, references=y_true, mode='traditional', error_format='count')
- {'PER': {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'TP': 1, 'FP': 1, 'FN': 1},
- 'INT': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'TP': 0, 'FP': 1, 'FN': 2},
- 'OUT': {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'TP': 1, 'FP': 1, 'FN': 1},
- 'overall_precision': 0.4,
- 'overall_recall': 0.3333,
- 'overall_f1': 0.3636,
- 'TP': 2,
- 'FP': 3,
- 'FN': 4}
  ```

  ```python
  >>> faireval.compute(predictions=y_pred, references=y_true, mode='traditional', error_format='error_ratio')
- {'PER': {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'TP': 1, 'FP': 0.1428, 'FN': 0.1428},
- 'INT': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'TP': 0, 'FP': 0.1428, 'FN': 0.2857},
- 'OUT': {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'TP': 1, 'FP': 0.1428, 'FN': 0.1428},
- 'overall_precision': 0.4,
- 'overall_recall': 0.3333,
- 'overall_f1': 0.3636,
- 'TP': 2,
- 'FP': 0.4285,
- 'FN': 0.5714}
  ```

  #### Values from Popular Papers
@@ -143,6 +149,46 @@ A basic [DistilBERT model](https://huggingface.co/docs/transformers/model_doc/di
  | seqeval strict | 0.2222 | 0.3425 | 0.0413 | 0.3598 | 0.0 | 0.0408 | 0.0 |
  | seqeval relaxed | 0.2803 | 0.4124 | 0.0412 | 0.4105 | 0.0 | 0.1985 | 0.0 |

  ## Limitations and Bias
  The metric is restricted to the input schemes admitted by seqeval. For example, the application does not support numerical
  label inputs (odd for Beginning, even for Inside and zero for Outside).
 
  The output for different modes and error_formats is:
  ```python
  >>> faireval.compute(predictions=y_pred, references=y_true, mode='fair', error_format='count')
+ {"PER": {"precision": 1.0, "recall": 0.5, "f1": 0.6666,
+ "trad_prec": 0.5, "trad_rec": 0.5, "trad_f1": 0.5,
+ "TP": 1, "FP": 0.0, "FN": 1.0, "LE": 0.0, "BE": 0.0, "LBE": 0.0},
+ "INT": {"precision": 0.0, "recall": 0.0, "f1": 0.0,
+ "trad_prec": 0.0, "trad_rec": 0.0, "trad_f1": 0.0,
+ "TP": 0, "FP": 0.0, "FN": 0.0, "LE": 0.0, "BE": 1.0, "LBE": 1.0},
+ "OUT": {"precision": 0.6666, "recall": 0.6666, "f1": 0.6666,
+ "trad_prec": 0.5, "trad_rec": 0.5, "trad_f1": 0.5,
+ "TP": 1, "FP": 0.0, "FN": 0.0, "LE": 1.0, "BE": 0.0, "LBE": 0.0},
+ "overall_precision": 0.5714,
+ "overall_recall": 0.4444,
+ "overall_f1": 0.5,
+ "overall_trad_prec": 0.4,
+ "overall_trad_rec": 0.3333,
+ "overall_trad_f1": 0.3636,
+ "TP": 2,
+ "FP": 0.0,
+ "FN": 1.0,
+ "LE": 1.0,
+ "BE": 1.0,
+ "LBE": 1.0}
  ```
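
The fair-mode scores in this example can be reproduced from the counts. The following is a minimal sketch, assuming that each fair error type (`LE`, `BE`, `LBE`) adds a weight of 0.5 to both the precision and the recall denominator. That weighting is inferred from the example output, not taken from the faireval source, and `fair_scores` is a hypothetical helper rather than part of the library.

```python
def fair_scores(tp, fp, fn, le, be, lbe, weight=0.5):
    """Return (precision, recall, f1) from fair-evaluation counts.

    Assumption: LE, BE and LBE each contribute `weight` to both
    denominators (inferred from the README example, not from faireval).
    """
    shared = weight * (le + be + lbe)
    prec_den = tp + fp + shared
    rec_den = tp + fn + shared
    precision = tp / prec_den if prec_den else 0.0
    recall = tp / rec_den if rec_den else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Overall counts from the example: TP=2, FP=0, FN=1, LE=1, BE=1, LBE=1
overall = fair_scores(tp=2, fp=0, fn=1, le=1, be=1, lbe=1)
print([round(x, 4) for x in overall])  # [0.5714, 0.4444, 0.5]
```

The per-label values follow the same pattern. For `OUT` (TP=1, LE=1) the sketch gives 1 / 1.5, which the example output shows truncated to 0.6666.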

  ```python
  >>> faireval.compute(predictions=y_pred, references=y_true, mode='traditional', error_format='count')
+ {"PER": {"precision": 0.5, "recall": 0.5, "f1": 0.5,
+ "TP": 1, "FP": 1.0, "FN": 1.0},
+ "INT": {"precision": 0.0, "recall": 0.0, "f1": 0.0,
+ "TP": 0, "FP": 1.0, "FN": 2.0},
+ "OUT": {"precision": 0.5, "recall": 0.5, "f1": 0.5,
+ "TP": 1, "FP": 1.0, "FN": 1.0},
+ "overall_precision": 0.4,
+ "overall_recall": 0.3333,
+ "overall_f1": 0.3636,
+ "TP": 2,
+ "FP": 3.0,
+ "FN": 4.0}
  ```
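
In traditional mode the overall scores follow from the standard precision, recall and F1 definitions applied to the `TP`, `FP` and `FN` counts. A short sketch, where `traditional_scores` is a hypothetical helper name used only for illustration:

```python
def traditional_scores(tp, fp, fn):
    """Return (precision, recall, f1) from TP/FP/FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 written directly in terms of counts: 2TP / (2TP + FP + FN)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Overall counts from the example: TP=2, FP=3, FN=4
print([round(x, 4) for x in traditional_scores(tp=2, fp=3, fn=4)])
# [0.4, 0.3333, 0.3636]
```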

  ```python
  >>> faireval.compute(predictions=y_pred, references=y_true, mode='traditional', error_format='error_ratio')
+ {"PER": {"precision": 0.5, "recall": 0.5, "f1": 0.5,
+ "TP": 1, "FP": 0.1428, "FN": 0.1428},
+ "INT": {"precision": 0.0, "recall": 0.0, "f1": 0.0,
+ "TP": 0, "FP": 0.1428, "FN": 0.2857},
+ "OUT": {"precision": 0.5, "recall": 0.5, "f1": 0.5,
+ "TP": 1, "FP": 0.1428, "FN": 0.1428},
+ "overall_precision": 0.4,
+ "overall_recall": 0.3333,
+ "overall_f1": 0.3636,
+ "TP": 2,
+ "FP": 0.4285,
+ "FN": 0.5714}
  ```
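
With `error_format='error_ratio'`, each error count appears to be divided by the total number of errors over all labels (here 3 FP + 4 FN = 7), while `TP` is left as a count. The sketch below reproduces the overall ratios under that reading; `error_ratios` is a hypothetical helper, not a faireval function.

```python
def error_ratios(errors):
    """Divide each error count by the total number of errors."""
    total = sum(errors.values())
    return {name: (count / total if total else 0.0)
            for name, count in errors.items()}


# Overall traditional error counts from the example: FP=3, FN=4
ratios = error_ratios({"FP": 3, "FN": 4})
print({name: round(value, 4) for name, value in ratios.items()})
# {'FP': 0.4286, 'FN': 0.5714}
```

The example output shows these values truncated (0.4285) rather than rounded. Per-label entries use the same overall total, so one false positive for `PER` corresponds to 1 / 7.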

  #### Values from Popular Papers

  | seqeval strict | 0.2222 | 0.3425 | 0.0413 | 0.3598 | 0.0 | 0.0408 | 0.0 |
  | seqeval relaxed | 0.2803 | 0.4124 | 0.0412 | 0.4105 | 0.0 | 0.1985 | 0.0 |

+ In traditional evaluation, the counts would be:
+
+ | | Overall | Location | Group | Person | Creative Work | Corporation | Product |
+ |----|---------|----------|-------|--------|---------------|-------------|---------|
+ | TP | 211 | 53 | 4 | 140 | 0 | 14 | 0 |
+ | FP | 353 | 42 | 42 | 174 | 1 | 70 | 0 |
+ | FN | 730 | 144 | 144 | 228 | 116 | 43 | 114 |
+
+ The fair evaluation counts (`error_format='count'`) are:
+
+ | | Overall | Location | Group | Person | Creative Work | Corporation | Product |
+ |-----|---------|----------|-------|--------|---------------|-------------|---------|
+ | TP | 211 | 53 | 4 | 140 | 0 | 14 | 0 |
+ | FP | 125 | 9 | 21 | 62 | 1 | 32 | 0 |
+ | FN | 544 | 59 | 115 | 153 | 95 | 34 | 88 |
+ | BE | 105 | 11 | 4 | 87 | 0 | 3 | 0 |
+ | LE | 66 | 7 | 20 | 12 | 7 | 6 | 14 |
+ | LBE | 57 | 10 | 6 | 9 | 15 | 2 | 15 |
+
+ Thus, the ratio of each fair error type to the total number of errors (`error_format='error_ratio'`) is:
+
+ | | Overall | Location | Group | Person | Creative Work | Corporation | Product |
+ |-----|---------|----------|--------|--------|---------------|-------------|---------|
+ | FP | 13.94% | 1.00% | 2.34% | 6.91% | 0.11% | 3.57% | 0.00% |
+ | FN | 60.65% | 6.58% | 12.82% | 17.06% | 10.59% | 3.79% | 9.81% |
+ | BE | 11.71% | 1.23% | 0.45% | 9.70% | 0.00% | 0.33% | 0.00% |
+ | LE | 7.36% | 0.78% | 2.23% | 1.34% | 0.78% | 0.67% | 1.56% |
+ | LBE | 6.35% | 1.11% | 0.67% | 1.00% | 1.67% | 0.22% | 1.67% |
+
+ And the ratio of each fair parameter to the total number of entities (`error_format='entity_ratio'`) is:
+
+ | | Overall | Location | Group | Person | Creative Work | Corporation | Product |
+ |-----|---------|----------|--------|--------|---------------|-------------|---------|
+ | TP | 19.04% | 4.78% | 0.36% | 12.64% | 0.00% | 1.26% | 0.00% |
+ | FP | 11.28% | 0.81% | 1.90% | 5.60% | 0.09% | 2.89% | 0.00% |
+ | FN | 49.10% | 5.32% | 10.38% | 13.81% | 8.57% | 3.07% | 7.94% |
+ | BE | 9.48% | 0.99% | 0.36% | 7.85% | 0.00% | 0.27% | 0.00% |
+ | LE | 5.96% | 0.63% | 1.81% | 1.08% | 0.63% | 0.54% | 1.26% |
+ | LBE | 5.14% | 0.90% | 0.54% | 0.81% | 1.35% | 0.18% | 1.35% |
+
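
The Overall column of the entity-ratio table can be checked against the fair counts. This sketch assumes that "total number of entities" means the sum of `TP` and all fair error counts (1108 here), which is consistent with the percentages reported; the variable names are illustrative only.

```python
# Overall fair counts reported for the benchmark (error_format='count')
overall = {"TP": 211, "FP": 125, "FN": 544, "BE": 105, "LE": 66, "LBE": 57}

# Assumption: the entity total is TP plus every error type
total = sum(overall.values())  # 1108

# Percentage of each parameter relative to the entity total
entity_ratio = {name: 100 * count / total for name, count in overall.items()}

for name, pct in entity_ratio.items():
    print(f"{name}: {pct:.2f}%")
```

Dropping `TP` from the dictionary and dividing by the remaining 897 errors gives the `error_ratio` percentages in the same way.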
  ## Limitations and Bias
  The metric is restricted to the input schemes admitted by seqeval. For example, the application does not support numerical
  label inputs (odd for Beginning, even for Inside and zero for Outside).