---
license: apache-2.0
language:
- sl
- hr
- sr
base_model:
- facebook/w2v-bert-2.0
pipeline_tag: audio-classification
metrics:
- f1
---
# Frame classification for filled pauses
This model classifies individual 20 ms frames of audio by the presence of filled pauses ("eee", "errm", ...).
It was trained on the human-annotated Slovenian speech corpus ROG-Artur and achieves an F1 of 0.95 for the positive class on
the test split of the same dataset.
# Evaluation
Although the model outputs a series of 0s and 1s, one per 20 ms frame, the evaluation was done on the
event level: spans of consecutive 1s were bundled together into one event. When a true and a predicted
event partially overlap, this is counted as a true positive.
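As an illustration, here is a minimal sketch of this overlap-based matching scheme (the helper names are hypothetical; this is not the exact evaluation script used for the numbers below):
```python
def overlaps(a, b):
    # Two [start_s, end_s] intervals overlap iff each starts before the other ends
    return a[0] < b[1] and b[0] < a[1]

def event_level_counts(true_events, pred_events):
    # A predicted event overlapping any true event counts as a true positive;
    # unmatched predictions are false positives, unmatched true events false negatives.
    tp = sum(any(overlaps(t, p) for t in true_events) for p in pred_events)
    fp = len(pred_events) - tp
    fn = sum(not any(overlaps(p, t) for p in pred_events) for t in true_events)
    return tp, fp, fn

print(event_level_counts([(0.50, 0.90), (2.10, 2.40)], [(0.48, 0.85), (1.00, 1.10)]))
# (1, 1, 1): one overlapping match, one spurious prediction, one missed pause
```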
## Evaluation on ROG corpus
In evaluation, only positive events (filled pauses) are scored:
```
precision recall f1-score support
1 0.907 0.987 0.946 1834
```
## Evaluation on ParlaSpeech [HR](https://huggingface.co/datasets/classla/ParlaSpeech-HR) and [RS](https://huggingface.co/datasets/classla/ParlaSpeech-RS) corpora
Evaluation on 800 human-annotated instances from the ParlaSpeech-HR and ParlaSpeech-RS corpora produced the following metrics:
```
Performance on RS:
Classification report for human vs model on event level:
precision recall f1-score support
1 0.95 0.99 0.97 542
Performance on HR:
Classification report for human vs model on event level:
precision recall f1-score support
1 0.93 0.98 0.95 531
```
As above, the metrics are reported on the event level: a true and a predicted filled pause that at least
partially overlap count as a true positive event.
# Example use
```python
from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification
from datasets import Dataset, Audio
import torch
import numpy as np
from pathlib import Path
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "classla/wav2vecbert2-filledPause"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(model_name).to(device)
ds = Dataset.from_dict(
    {
        "audio": [
            "path/to/audio.wav"  # replace with your own audio file; it is resampled to 16 kHz mono
        ],
    }
).cast_column("audio", Audio(sampling_rate=16_000, mono=True))
def frames_to_intervals(
frames: list[int], drop_short=True, drop_initial=True, short_cutoff_s=0.08
) -> list[tuple[float]]:
"""Transforms a list of ones or zeros, corresponding to annotations on frame
levels, to a list of intervals ([start second, end second]).
Allows for additional filtering on duration (false positives are often short)
and start times (false positives starting at 0.0 are often an artifact of
poor segmentation).
:param list[int] frames: Input frame labels
:param bool drop_short: Drop everything shorter than short_cutoff_s, defaults to True
:param bool drop_initial: Drop predictions starting at 0.0, defaults to True
:param float short_cutoff_s: Duration in seconds of shortest allowable prediction, defaults to 0.08
:return list[tuple[float]]: List of intervals [start_s, end_s]
"""
    from itertools import pairwise  # requires Python 3.10+
    import pandas as pd
results = []
ndf = pd.DataFrame(
data={
"time_s": [0.020 * i for i in range(len(frames))],
"frames": frames,
}
)
ndf = ndf.dropna()
    indices_of_change = ndf.frames.diff()[ndf.frames.diff() != 0].index.values
    # Append a sentinel index so that a run of 1s reaching the end of the clip
    # still gets closed into an interval:
    indices_of_change = list(indices_of_change) + [len(ndf)]
    for si, ei in pairwise(indices_of_change):
        # Keep only spans whose label is 1 (filled pause):
        if ndf.loc[si : ei - 1, "frames"].mode()[0] == 1:
            results.append(
                (
                    round(ndf.loc[si, "time_s"], 3),
                    round(ndf.loc[ei - 1, "time_s"], 3),
                )
            )
if drop_short and (len(results) > 0):
results = [i for i in results if (i[1] - i[0] >= short_cutoff_s)]
if drop_initial and (len(results) > 0):
results = [i for i in results if i[0] != 0.0]
return results
def evaluator(chunks):
    sampling_rate = chunks["audio"][0]["sampling_rate"]
    with torch.no_grad():
        inputs = feature_extractor(
            [i["array"] for i in chunks["audio"]],
            return_tensors="pt",
            sampling_rate=sampling_rate,
        ).to(device)
        logits = model(**inputs).logits
    # Argmax over the two classes gives one 0/1 label per 20 ms frame:
    y_pred = np.array(logits.cpu()).argmax(axis=-1)
    intervals = [frames_to_intervals(i) for i in y_pred]
    return {"y_pred": y_pred.tolist(), "intervals": intervals}
ds = ds.map(evaluator, batched=True)
print(ds["y_pred"][0])
# Prints a list of 20ms frames: [0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0....]
# with 0 indicating no filled pause detected in that frame
print(ds["intervals"][0])
# Prints the identified intervals as a list of [start_s, end_s] pairs:
# [[0.08, 0.28], ...]
```
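For recordings longer than the roughly 30-second clip above, one option is to split the waveform into chunks before mapping and then shift each chunk's intervals by its offset. The following is a minimal sketch under that assumption, reusing `evaluator` from the example; `my_recording.wav` and the 30 s chunk length are placeholders, and `soundfile` is only one way to load the audio:
```python
import soundfile as sf  # assumes a 16 kHz mono wav; resample first otherwise

audio, sr = sf.read("my_recording.wav")  # placeholder path
chunk_len = 30 * sr  # ~30 s chunks
long_ds = Dataset.from_dict(
    {
        "audio": [
            {"array": audio[i : i + chunk_len], "sampling_rate": sr}
            for i in range(0, len(audio), chunk_len)
        ],
    }
)
# Note: a short final chunk is padded inside the batch, which can add a few
# spurious trailing frames to its predictions.
long_ds = long_ds.map(evaluator, batched=True, batch_size=4)
# Shift each chunk's intervals by its start offset to get global timestamps:
events = [
    (round(start + n * 30, 3), round(end + n * 30, 3))
    for n, intervals in enumerate(long_ds["intervals"])
    for start, end in intervals
]
```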
# Citation
Coming soon. |