5roop committed on
Commit 327b583 · verified · 1 Parent(s): 8941911

Update metrics

Files changed (1)
  1. README.md +55 -24
README.md CHANGED
@@ -24,32 +24,55 @@ the test split of the same dataset.

 Although the output of the model is a series of 0s and 1s describing its 20 ms frames, the evaluation was done on
 event level; spans of consecutive 1 outputs were bundled together into one event. When the true and predicted
- events partially overlap, this is counted as a true positive.

 ## Evaluation on ROG corpus

- In evaluation, we only evaluate positive events, i.e.
 ```
-               precision    recall  f1-score   support
-
-            1      0.907     0.987     0.946      1834
 ```

- ## Evaluation on ParlaSpeech [HR](https://huggingface.co/datasets/classla/ParlaSpeech-HR) and [RS](https://huggingface.co/datasets/classla/ParlaSpeech-RS) corpora

- Evaluation on 800 human-annotated instances of ParlaSpeech-HR and ParlaSpeech-RS produced the following metrics:

- ```
- Performance on RS:
- Classification report for human vs model on event level:
-               precision    recall  f1-score   support
-
-            1       0.95      0.99      0.97       542
- Performance on HR:
- Classification report for human vs model on event level:
-               precision    recall  f1-score   support
-
-            1       0.93      0.98      0.95       531
 ```
 The metrics reported are on event level, which means that if true and
 predicted filled pauses at least partially overlap, we count them as a
@@ -81,19 +104,25 @@ ds = Dataset.from_dict(


 def frames_to_intervals(
-     frames: list[int], drop_short=True, drop_initial=True, short_cutoff_s=0.08
 ) -> list[tuple[float]]:
     """Transforms a list of ones or zeros, corresponding to annotations on frame
     levels, to a list of intervals ([start second, end second]).

-     Allows for additional filtering on duration (false positives are often short)
-     and start times (false positives starting at 0.0 are often an artifact of
-     poor segmentation).

     :param list[int] frames: Input frame labels
-     :param bool drop_short: Drop everything shorter than short_cutoff_s, defaults to True
     :param bool drop_initial: Drop predictions starting at 0.0, defaults to True
-     :param float short_cutoff_s: Duration in seconds of shortest allowable prediction, defaults to 0.08
     :return list[tuple[float]]: List of intervals [start_s, end_s]
     """
     from itertools import pairwise
@@ -115,13 +144,15 @@ def frames_to_intervals(
         results.append(
             (
                 round(ndf.loc[si, "time_s"], 3),
-                 round(ndf.loc[ei - 1, "time_s"], 3),
             )
         )
     if drop_short and (len(results) > 0):
         results = [i for i in results if (i[1] - i[0] >= short_cutoff_s)]
     if drop_initial and (len(results) > 0):
         results = [i for i in results if i[0] != 0.0]
     return results

 

 Although the output of the model is a series of 0s and 1s describing its 20 ms frames, the evaluation was done on
 event level; spans of consecutive 1 outputs were bundled together into one event. When the true and predicted
+ events partially overlap, this is counted as a true positive. We report precision, recall, and F1 scores of the positive class.
+
+ We observed several failure modes of the automatic inference process and designed post-processing steps to mitigate them.
+ False positives were often caused by improper audio segmentation, which is why discarding predictions that start at the very beginning or
+ end at the very end of the audio can be beneficial. Another failure mode is predicting very short events, which is why very short predictions
+ can also be safely discarded.
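The event-level counting described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code; the interval representation and function names are assumptions.

```python
# Hypothetical sketch of event-level scoring: a predicted event counts as a
# true positive if it overlaps at least one true event, and vice versa.

def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    # Two [start, end] intervals overlap if neither ends before the other starts.
    return a[0] < b[1] and b[0] < a[1]

def event_level_scores(true_events, pred_events):
    # Precision: share of predicted events that hit some true event.
    # Recall: share of true events that are hit by some prediction.
    tp_pred = sum(any(overlaps(p, t) for t in true_events) for p in pred_events)
    tp_true = sum(any(overlaps(t, p) for p in pred_events) for t in true_events)
    precision = tp_pred / len(pred_events) if pred_events else 0.0
    recall = tp_true / len(true_events) if true_events else 0.0
    return precision, recall

# Example: one of two predictions overlaps a true event.
print(event_level_scores([(0.0, 0.5), (1.0, 1.5)], [(0.4, 0.6), (2.0, 2.2)]))  # → (0.5, 0.5)
```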

 ## Evaluation on ROG corpus

 ```
+ | postprocessing               |   recall |   precision |    F1 |
+ |:-----------------------------|---------:|------------:|------:|
+ | none                         |    0.981 |       0.955 | 0.968 |
+ | drop_short                   |    0.981 |       0.957 | 0.969 |
+ | drop_short_initial_and_final |    0.964 |       0.966 | 0.965 |
+ | drop_short_and_initial       |    0.964 |       0.966 | 0.965 |
+ | drop_initial                 |    0.964 |       0.963 | 0.963 |
 ```
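As a sanity check, the F1 values in the tables are the harmonic mean of precision and recall; e.g. the `none` row above:

```python
# F1 is the harmonic mean of precision and recall.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# Checking the "none" row of the table above:
print(round(f1(0.955, 0.981), 3))  # → 0.968
```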

+ ## Evaluation on ParlaSpeech corpora

+ For every language in the [ParlaSpeech collection](https://huggingface.co/collections/classla/parlaspeech-670923f23ab185f413d40795),
+ 400 instances were sampled and annotated by human annotators.

+ Evaluation on the human-annotated instances produced the following metrics:

+ ```
+ | lang | postprocessing               |   recall |   precision |    F1 |
+ |:-----|:-----------------------------|---------:|------------:|------:|
+ | CZ   | drop_short_initial_and_final |    0.889 |       0.859 | 0.874 |
+ | CZ   | drop_short_and_initial       |    0.889 |       0.859 | 0.874 |
+ | CZ   | drop_short                   |    0.905 |       0.833 | 0.868 |
+ | CZ   | drop_initial                 |    0.889 |       0.846 | 0.867 |
+ | CZ   | raw                          |    0.905 |       0.814 | 0.857 |
+ | HR   | drop_short_initial_and_final |    0.940 |       0.887 | 0.913 |
+ | HR   | drop_short_and_initial       |    0.940 |       0.887 | 0.913 |
+ | HR   | drop_short                   |    0.940 |       0.884 | 0.911 |
+ | HR   | drop_initial                 |    0.940 |       0.875 | 0.906 |
+ | HR   | raw                          |    0.940 |       0.872 | 0.905 |
+ | PL   | drop_short                   |    0.906 |       0.947 | 0.926 |
+ | PL   | drop_short_initial_and_final |    0.903 |       0.947 | 0.924 |
+ | PL   | drop_short_and_initial       |    0.903 |       0.947 | 0.924 |
+ | PL   | raw                          |    0.910 |       0.924 | 0.917 |
+ | PL   | drop_initial                 |    0.908 |       0.924 | 0.916 |
+ | RS   | drop_short                   |    0.966 |       0.915 | 0.940 |
+ | RS   | drop_short_initial_and_final |    0.966 |       0.915 | 0.940 |
+ | RS   | drop_short_and_initial       |    0.966 |       0.915 | 0.940 |
+ | RS   | drop_initial                 |    0.974 |       0.900 | 0.936 |
+ | RS   | raw                          |    0.974 |       0.900 | 0.936 |
 ```
 The metrics reported are on event level, which means that if true and
 predicted filled pauses at least partially overlap, we count them as a
 


 def frames_to_intervals(
+     frames: list[int],
+     drop_short=True,
+     drop_initial=True,
+     drop_final=False,
+     short_cutoff_s=0.08,
 ) -> list[tuple[float]]:
     """Transforms a list of ones or zeros, corresponding to annotations on frame
     levels, to a list of intervals ([start second, end second]).

+     Allows for additional filtering on duration (false positives are often
+     short) and start times (false positives starting at 0.0 are often an
+     artifact of poor segmentation).

     :param list[int] frames: Input frame labels
+     :param bool drop_short: Drop everything shorter than short_cutoff_s,
+         defaults to True
     :param bool drop_initial: Drop predictions starting at 0.0, defaults to True
+     :param bool drop_final: Drop predictions ending at the very end of the audio,
+         defaults to False
+     :param float short_cutoff_s: Duration in seconds of shortest allowable
+         prediction, defaults to 0.08
     :return list[tuple[float]]: List of intervals [start_s, end_s]
     """
     from itertools import pairwise
 
         results.append(
             (
                 round(ndf.loc[si, "time_s"], 3),
+                 round(ndf.loc[ei, "time_s"], 3),
             )
         )
     if drop_short and (len(results) > 0):
         results = [i for i in results if (i[1] - i[0] >= short_cutoff_s)]
     if drop_initial and (len(results) > 0):
         results = [i for i in results if i[0] != 0.0]
+     if drop_final and (len(results) > 0):
+         results = [i for i in results if i[1] != 0.02 * len(frames)]
     return results

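For readers without the surrounding pandas setup, the bundling-plus-filtering logic in `frames_to_intervals` can be sketched in a self-contained form. This is a simplified, hypothetical stand-in (assuming 20 ms frames), not the README's pandas-based implementation.

```python
# Simplified, standalone sketch of frame-to-interval bundling with the
# post-processing filters described above (assumes 20 ms frames).

def simple_frames_to_intervals(
    frames: list[int],
    drop_short: bool = True,
    drop_initial: bool = True,
    drop_final: bool = False,
    short_cutoff_s: float = 0.08,
    frame_s: float = 0.02,
) -> list[tuple[float, float]]:
    intervals, start = [], None
    for i, label in enumerate(frames):
        if label == 1 and start is None:
            start = i  # an event opens at this frame
        elif label == 0 and start is not None:
            intervals.append((round(start * frame_s, 3), round(i * frame_s, 3)))
            start = None
    if start is not None:  # an event runs to the end of the audio
        intervals.append((round(start * frame_s, 3), round(len(frames) * frame_s, 3)))
    if drop_short:
        intervals = [i for i in intervals if i[1] - i[0] >= short_cutoff_s]
    if drop_initial:
        intervals = [i for i in intervals if i[0] != 0.0]
    if drop_final:
        intervals = [i for i in intervals if i[1] != round(len(frames) * frame_s, 3)]
    return intervals

# Frames 3-8 form one event: [0.06 s, 0.18 s], long enough to survive filtering.
print(simple_frames_to_intervals([0, 0, 0, 1, 1, 1, 1, 1, 1, 0]))  # → [(0.06, 0.18)]
```

Events shorter than 80 ms, events starting at 0.0 s, and (optionally) events running to the very end of the clip are discarded, mirroring the `drop_short`, `drop_initial`, and `drop_final` options discussed above.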