illorca committed
Commit 3109162
1 Parent(s): fa93db6

Show trad_scores when mode is fair (and docs)

.idea/.gitignore ADDED
@@ -0,0 +1,3 @@
+ # Default ignored files
+ /shelf/
+ /workspace.xml
.idea/FairEval.iml ADDED
@@ -0,0 +1,12 @@
+ <?xml version="1.0" encoding="UTF-8"?>
+ <module type="PYTHON_MODULE" version="4">
+ <component name="NewModuleRootManager">
+ <content url="file://$MODULE_DIR$" />
+ <orderEntry type="inheritedJdk" />
+ <orderEntry type="sourceFolder" forTests="false" />
+ </component>
+ <component name="PyDocumentationSettings">
+ <option name="format" value="PLAIN" />
+ <option name="myDocStringFormat" value="Plain" />
+ </component>
+ </module>
.idea/inspectionProfiles/Project_Default.xml ADDED
@@ -0,0 +1,13 @@
+ <component name="InspectionProjectProfileManager">
+ <profile version="1.0">
+ <option name="myName" value="Project Default" />
+ <inspection_tool class="PyUnresolvedReferencesInspection" enabled="true" level="WARNING" enabled_by_default="true">
+ <option name="ignoredIdentifiers">
+ <list>
+ <option value="Version" />
+ <option value="Pipeline" />
+ </list>
+ </option>
+ </inspection_tool>
+ </profile>
+ </component>
.idea/inspectionProfiles/profiles_settings.xml ADDED
@@ -0,0 +1,6 @@
+ <component name="InspectionProjectProfileManager">
+ <settings>
+ <option name="USE_PROJECT_PROFILE" value="false" />
+ <version value="1.0" />
+ </settings>
+ </component>
.idea/modules.xml ADDED
@@ -0,0 +1,8 @@
+ <?xml version="1.0" encoding="UTF-8"?>
+ <project version="4">
+ <component name="ProjectModuleManager">
+ <modules>
+ <module fileurl="file://$PROJECT_DIR$/.idea/FairEval.iml" filepath="$PROJECT_DIR$/.idea/FairEval.iml" />
+ </modules>
+ </component>
+ </project>
.idea/vcs.xml ADDED
@@ -0,0 +1,6 @@
+ <?xml version="1.0" encoding="UTF-8"?>
+ <project version="4">
+ <component name="VcsDirectoryMappings">
+ <mapping directory="$PROJECT_DIR$" vcs="Git" />
+ </component>
+ </project>
FairEval.py CHANGED
@@ -59,8 +59,8 @@ Args:
     references: list of ground truth reference labels. Predicted sentences must have the same number of tokens as the references.
     mode: 'fair', 'traditional' or 'weighted'. Controls the desired output. The default value is 'fair'.
         - 'traditional': equivalent to seqeval's metrics / classic span-based evaluation.
-        - 'fair': default fair score calculation.
-        - 'weighted': custom score calculation with the weights passed.
+        - 'fair': default fair score calculation. It will also show traditional scores for comparison.
+        - 'weighted': custom score calculation with the weights passed. It will also show traditional scores for comparison.
     weights: dictionary with the weight of each error for the custom score calculation.
         If none is passed and the mode is set to 'weighted', the following is used:
         {"TP": {"TP": 1},
@@ -90,17 +90,48 @@ Examples:
     >>> ref = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O', 'B-PER', 'I-PER', 'O']]
     >>> results = faireval.compute(predictions=pred, references=ref, mode='fair', error_format='count')
     >>> print(results)
-    {'MISC': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'TP': 0, 'FP': 0, 'FN': 0, 'LE': 0, 'BE': 1, 'LBE': 0},
-    'PER': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'TP': 1, 'FP': 0, 'FN': 0, 'LE': 0, 'BE': 0, 'LBE': 0},
-    'overall_precision': 0.6666666666666666,
-    'overall_recall': 0.6666666666666666,
-    'overall_f1': 0.6666666666666666,
-    'TP': 1,
-    'FP': 0,
-    'FN': 0,
-    'LE': 0,
-    'BE': 1,
-    'LBE': 0}
+    {
+      "MISC": {
+        "precision": 0.0,
+        "recall": 0.0,
+        "f1": 0.0,
+        "trad_prec": 0.0,
+        "trad_rec": 0.0,
+        "trad_f1": 0.0,
+        "TP": 0,
+        "FP": 0.0,
+        "FN": 0.0,
+        "LE": 0.0,
+        "BE": 1.0,
+        "LBE": 0.0
+      },
+      "PER": {
+        "precision": 1.0,
+        "recall": 1.0,
+        "f1": 1.0,
+        "trad_prec": 1.0,
+        "trad_rec": 1.0,
+        "trad_f1": 1.0,
+        "TP": 1,
+        "FP": 0.0,
+        "FN": 0.0,
+        "LE": 0.0,
+        "BE": 0.0,
+        "LBE": 0.0
+      },
+      "overall_precision": 0.6666666666666666,
+      "overall_recall": 0.6666666666666666,
+      "overall_f1": 0.6666666666666666,
+      "overall_trad_prec": 0.5,
+      "overall_trad_rec": 0.5,
+      "overall_trad_f1": 0.5,
+      "TP": 1,
+      "FP": 0.0,
+      "FN": 0.0,
+      "LE": 0.0,
+      "BE": 1.0,
+      "LBE": 0.0
+    }
     """
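The updated docstring above means a 'fair'-mode call now returns the traditional scores alongside the fair ones. A minimal usage sketch follows; the `evaluate.load("hpi-dhc/FairEval")` path and the toy tag sequences are assumptions for illustration, while the `compute()` arguments and result keys mirror the docstring example.

```python
import evaluate

# Assumed load path for this metric Space (not shown in the diff above).
faireval = evaluate.load("hpi-dhc/FairEval")

# Toy IOB2 sequences, hypothetical and chosen only to exercise the API.
pred = [['O', 'B-PER', 'I-PER', 'O']]
ref = [['O', 'B-PER', 'I-PER', 'O']]

results = faireval.compute(predictions=pred, references=ref, mode='fair', error_format='count')

# New in this commit: fair mode also reports traditional (seqeval-style) scores.
print(results['overall_f1'])        # fair overall F1
print(results['overall_trad_f1'])   # traditional overall F1, shown for comparison
print(results['PER']['trad_f1'])    # per-entity traditional F1
```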
README.md CHANGED
@@ -44,8 +44,8 @@ Predicted sentences must have the same number of tokens as the references.
 The optional arguments are:
 - **mode** *(str)*: 'fair', 'traditional' or 'weighted'. Controls the desired output. The default value is 'fair'.
   - 'traditional': equivalent to seqeval's metrics / classic span-based evaluation.
-  - 'fair': default fair score calculation.
-  - 'weighted': custom score calculation with the weights passed.
+  - 'fair': default fair score calculation. Fair will also show traditional scores for comparison.
+  - 'weighted': custom score calculation with the weights passed. Weighted will also show traditional scores for comparison.
 - **weights** *(dict)*: dictionary with the weight of each error for the custom score calculation.
 - **error_format** *(str)*: 'count', 'error_ratio' or 'entity_ratio'. Controls the desired output for TP, FP, BE, LE, etc. Default value is 'count'.
   - 'count': absolute count of each parameter.
@@ -64,8 +64,6 @@ If mode is 'traditional', the error parameters shown are the classical TP, FP an
 TP remain the same, FP and FN are shown as per the fair definition and additional errors BE, LE and LBE are shown.
 
 ### Examples
-A comprehensive set of side-by-side examples is shown [here](https://huggingface.co/spaces/hpi-dhc/FairEval/blob/main/HFFE_use_cases.pdf).
-
 Considering the following input annotated sentences:
 ```python
 >>> r1 = ['O', 'O', 'B-PER', 'I-PER', 'O', 'B-PER']
@@ -82,6 +80,31 @@ Considering the following input annotated sentences:
 ```
 
 The output for different modes and error_formats is:
+```python
+>>> faireval.compute(predictions=y_pred, references=y_true, mode='fair', error_format='count')
+{'PER': {'precision': 1.0, 'recall': 0.5, 'f1': 0.6666,
+         "trad_prec": 0.5, "trad_rec": 0.5, "trad_f1": 0.5,
+         'TP': 1, 'FP': 0, 'FN': 1, 'LE': 0, 'BE': 0, 'LBE': 0},
+ 'INT': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0,
+         "trad_prec": 0.0, "trad_rec": 0.0, "trad_f1": 0.0,
+         'TP': 0, 'FP': 0, 'FN': 0, 'LE': 0, 'BE': 1, 'LBE': 1},
+ 'OUT': {'precision': 0.6666, 'recall': 0.6666, 'f1': 0.6666,
+         "trad_prec": 0.5, "trad_rec": 0.5, "trad_f1": 0.5,
+         'TP': 1, 'FP': 0, 'FN': 0, 'LE': 1, 'BE': 0, 'LBE': 0},
+ 'overall_precision': 0.5714,
+ 'overall_recall': 0.4444444444444444,
+ 'overall_f1': 0.5,
+ 'trad_prec': 0.5,
+ 'trad_rec': 0.5,
+ 'trad_f1': 0.5,
+ 'TP': 2,
+ 'FP': 0,
+ 'FN': 1,
+ 'LE': 1,
+ 'BE': 1,
+ 'LBE': 1}
+```
+
 ```python
 >>> faireval.compute(predictions=y_pred, references=y_true, mode='traditional', error_format='count')
 {'PER': {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'TP': 1, 'FP': 1, 'FN': 1},
@@ -108,22 +131,6 @@ The output for different modes and error_formats is:
 'FN': 0.5714}
 ```
 
-```python
->>> faireval.compute(predictions=y_pred, references=y_true, mode='fair', error_format='count')
-{'PER': {'precision': 1.0, 'recall': 0.5, 'f1': 0.6666, 'TP': 1, 'FP': 0, 'FN': 1, 'LE': 0, 'BE': 0, 'LBE': 0},
-'INT': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'TP': 0, 'FP': 0, 'FN': 0, 'LE': 0, 'BE': 1, 'LBE': 1},
-'OUT': {'precision': 0.6666, 'recall': 0.6666, 'f1': 0.6666, 'TP': 1, 'FP': 0, 'FN': 0, 'LE': 1, 'BE': 0, 'LBE': 0},
-'overall_precision': 0.5714,
-'overall_recall': 0.4444444444444444,
-'overall_f1': 0.5,
-'TP': 2,
-'FP': 0,
-'FN': 1,
-'LE': 1,
-'BE': 1,
-'LBE': 1}
-```
-
 #### Values from Popular Papers
 *Examples, preferably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
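With the README change above, the 'fair' call returns both score families, so a separate 'traditional' call is mainly needed for its own TP/FP/FN counts. A hedged comparison sketch follows; the load path and the token lists are illustrative assumptions (the diff only shows `r1` of the README's data), while the `compute()` calls and the keys read below follow the README output.

```python
import evaluate

# Assumed repo id for this Space; the README itself does not show the load call.
faireval = evaluate.load("hpi-dhc/FairEval")

# Illustrative data only: y_true reuses r1 from the README, the prediction is hypothetical.
y_true = [['O', 'O', 'B-PER', 'I-PER', 'O', 'B-PER']]
y_pred = [['O', 'O', 'B-PER', 'I-PER', 'O', 'O']]

fair = faireval.compute(predictions=y_pred, references=y_true, mode='fair', error_format='count')
trad = faireval.compute(predictions=y_pred, references=y_true, mode='traditional', error_format='count')

# The fair-mode result now carries trad_prec / trad_rec / trad_f1 for comparison,
# while the traditional call still reports the classic per-entity TP / FP / FN counts.
print(fair['trad_prec'], fair['trad_rec'], fair['trad_f1'])
print(trad['PER'])  # e.g. classic precision/recall/F1 with TP, FP, FN for PER
```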