---
license: apache-2.0
language:
- sl
- hr
- sr
base_model:
- facebook/w2v-bert-2.0
pipeline_tag: audio-classification
metrics:
- f1
---


# Frame classification for filled pauses

This model classifies individual 20ms frames of audio based on the presence of filled pauses ("eee", "errm", ...).

It was trained on the human-annotated Slovenian speech corpus ROG-Artur and achieves an F1 of 0.95 for the positive class on the test split of the same dataset.


# Evaluation

Although the model outputs a 0 or 1 for each 20ms frame, the evaluation was done on the event level: spans of consecutive 1s were bundled together into one event. When a true and a predicted event at least partially overlap, this is counted as a true positive.
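
For concreteness, the sketch below shows one way such event-level scoring can be computed. The helper names (`frames_to_events`, `event_scores`) and the 20ms frame duration are illustrative assumptions; this is not necessarily the exact evaluation code behind the reported numbers.

```python
# Sketch of event-level scoring; helper names are illustrative only.
from itertools import groupby


def frames_to_events(frames: list[int], frame_s: float = 0.02) -> list[tuple[float, float]]:
    """Bundle runs of consecutive 1s into (start_s, end_s) events."""
    events, t = [], 0.0
    for label, run in groupby(frames):
        run_len = len(list(run))
        if label == 1:
            events.append((t, t + run_len * frame_s))
        t += run_len * frame_s
    return events


def event_scores(true_events, pred_events):
    """Count partially overlapping true/predicted events as true positives."""
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    tp = sum(any(overlaps(p, t) for t in true_events) for p in pred_events)
    fp = len(pred_events) - tp
    fn = sum(not any(overlaps(t, p) for p in pred_events) for t in true_events)
    return tp, fp, fn
```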

## Evaluation on ROG corpus

In evaluation, only the positive class (filled-pause events) is scored:
```
              precision    recall  f1-score   support

           1      0.907     0.987     0.946      1834
```

## Evaluation on ParlaSpeech [HR](https://huggingface.co/datasets/classla/ParlaSpeech-HR) and [RS](https://huggingface.co/datasets/classla/ParlaSpeech-RS) corpora

Evaluation on 800 human-annotated instances from the ParlaSpeech-HR and ParlaSpeech-RS corpora produced the following metrics:

```
Performance on RS:
Classification report for human vs model on event level:
              precision    recall  f1-score   support

           1       0.95      0.99      0.97       542
Performance on HR:
Classification report for human vs model on event level:
              precision    recall  f1-score   support

           1       0.93      0.98      0.95       531
```
As above, these metrics are reported on the event level: a predicted filled pause that at least partially overlaps a true one counts as a true positive.



# Example use:
```python

from itertools import pairwise

import numpy as np
import pandas as pd
import torch
from datasets import Audio, Dataset
from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification

# Fall back to CPU if no GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "classla/wav2vecbert2-filledPause"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(model_name).to(device)

ds = Dataset.from_dict(
    {
        "audio": [
            "path/to/your/audio.wav"  # replace with a path to your own audio file
        ],
    }
).cast_column("audio", Audio(sampling_rate=16_000, mono=True))


def frames_to_intervals(
    frames: list[int],
    drop_short: bool = True,
    drop_initial: bool = True,
    short_cutoff_s: float = 0.08,
) -> list[tuple[float, float]]:
    """Transforms a list of ones and zeros, corresponding to frame-level
    annotations, into a list of intervals ([start second, end second]).

    Allows for additional filtering on duration (false positives are often short)
    and start times (false positives starting at 0.0 are often an artifact of
    poor segmentation).

    :param list[int] frames: Input frame labels
    :param bool drop_short: Drop everything shorter than short_cutoff_s, defaults to True
    :param bool drop_initial: Drop predictions starting at 0.0, defaults to True
    :param float short_cutoff_s: Duration in seconds of the shortest allowable prediction, defaults to 0.08
    :return list[tuple[float, float]]: List of intervals [start_s, end_s]
    """
    results = []
    ndf = pd.DataFrame(
        data={
            "time_s": [0.020 * i for i in range(len(frames))],
            "frames": frames,
        }
    )
    ndf = ndf.dropna()
    # Indices at which the frame label changes value (candidate event boundaries)
    indices_of_change = ndf.frames.diff()[ndf.frames.diff() != 0].index.values
    for si, ei in pairwise(indices_of_change):
        # Keep only spans whose majority label is 1 (filled pause)
        if ndf.loc[si : ei - 1, "frames"].mode()[0] == 0:
            continue
        results.append(
            (
                round(ndf.loc[si, "time_s"], 3),
                round(ndf.loc[ei - 1, "time_s"], 3),
            )
        )
    if drop_short and (len(results) > 0):
        results = [i for i in results if (i[1] - i[0] >= short_cutoff_s)]
    if drop_initial and (len(results) > 0):
        results = [i for i in results if i[0] != 0.0]
    return results


def evaluator(chunks):
    sampling_rate = chunks["audio"][0]["sampling_rate"]
    with torch.no_grad():
        inputs = feature_extractor(
            [i["array"] for i in chunks["audio"]],
            return_tensors="pt",
            sampling_rate=sampling_rate,
        ).to(device)
        logits = model(**inputs).logits
    # Argmax over the two classes yields a 0/1 label per 20ms frame
    y_pred = logits.cpu().numpy().argmax(axis=-1)
    intervals = [frames_to_intervals(i) for i in y_pred]
    return {"y_pred": y_pred.tolist(), "intervals": intervals}


ds = ds.map(evaluator, batched=True)
print(ds["y_pred"][0])
# Prints a list of 20ms frames: [0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0....]
# with 0 indicating no filled pause detected in that frame

print(ds["intervals"][0])
# Prints the identified intervals as a list of [start_s, end_s] pairs:
# [[0.08, 0.28], ...]
```
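
The model can also be applied to a single local file without going through `datasets`. The sketch below assumes the `soundfile` library and a 16 kHz mono WAV (the file name is a placeholder), and reuses `feature_extractor`, `model`, `device`, and `frames_to_intervals` from the example above.

```python
# Sketch: run the model on one local 16 kHz mono WAV file.
import soundfile as sf

audio, sampling_rate = sf.read("my_recording.wav")  # placeholder path
inputs = feature_extractor(
    [audio], return_tensors="pt", sampling_rate=sampling_rate
).to(device)
with torch.no_grad():
    logits = model(**inputs).logits
frames = logits.cpu().numpy().argmax(axis=-1)[0].tolist()
print(frames_to_intervals(frames))
```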



# Citation
Coming soon.