---
base_model:
- facebook/w2v-bert-2.0
datasets:
- classla/ParlaSpeech-RS
- classla/ParlaSpeech-HR
- classla/Mici_Princ
language:
- sl
- hr
- sr
library_name: transformers
license: cc-by-sa-4.0
metrics:
- accuracy
pipeline_tag: audio-classification
---
# Model Card
This model annotates primary stress in words by classifying each 20 ms audio frame as stressed or unstressed.
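Concretely, the classifier emits one label per 20 ms frame (1 for stressed, 0 for unstressed), so frame indices map directly to time. A minimal sketch of that convention (the helper name is illustrative, not part of the model API):

```python
# Illustrative helper, not part of the model API: map a frame index to its
# start time in seconds, given the model's 20 ms frame step.
def frame_start_seconds(frame_index: int, frame_step_s: float = 0.020) -> float:
    return round(frame_index * frame_step_s, 3)

print(frame_start_seconds(17))  # 0.34 -> frame 17 starts at 0.34 s
```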
## Model Details
### Model Description
- **Developed by:** [Peter Rupnik](https://huggingface.co/5roop), [Nikola Ljubešić](https://huggingface.co/nljubesi), [Ivan Porupski](https://huggingface.co/porupski)
- **Model type:** Audio frame classifier
- **Language(s):** Croatian, Slovenian, Serbian, Chakavian (a variant of Croatian)
- **License:** Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
- **Paper:** Please cite the following paper:
```bibtex
@inproceedings{ljubesic2025identifying,
    title = {Identifying Primary Stress Across Related Languages and Dialects with Transformer-based Speech Encoder Models},
    author = {Ljubešić, Nikola and Porupski, Ivan and Rupnik, Peter},
    booktitle = {Proceedings of Interspeech 2025},
    year = {2025},
    note = {Accepted at Interspeech 2025}
}
```
### Training data
The model was trained on the training split of the [ParlaStress-HR dataset](http://hdl.handle.net/11356/2038).
### Evaluation results
For evaluation, the test splits of the [ParlaStress-HR dataset](http://hdl.handle.net/11356/2038) were used.
| test language | accuracy (%) |
| --- | --- |
| Croatian | 99.1 |
| Serbian | 99.3 |
| Chakavian (variant of Croatian) | 88.9 |
| Slovenian | 89.0 |
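A hedged sketch of how an accuracy figure like these could be computed with scikit-learn, assuming gold and predicted labels have already been aligned (the arrays below are illustrative, not taken from the evaluation data):

```python
# Sketch only: accuracy over aligned gold vs. predicted labels.
from sklearn.metrics import accuracy_score

y_true = [1, 0, 0, 1, 1, 0]  # gold labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 0]  # model predictions (illustrative)
print(f"accuracy: {accuracy_score(y_true, y_pred):.1%}")  # accuracy: 83.3%
```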
### Direct Use
The model is intended for data-driven analyses of primary stress position. So far, it has been shown to work on four datasets in three languages.
## Example use
```python
import numpy as np
import pandas as pd
import torch
from itertools import pairwise

from datasets import Audio, Dataset
from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "classla/Wav2Vec2BertPrimaryStressAudioFrameClassifier"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(model_name).to(device)

# Path to the file containing the word to be annotated:
f = "wavs/word.wav"


def frames_to_intervals(frames: list[int]) -> list[tuple[float, float]] | None:
    """Convert per-frame labels (20 ms frames) into (start_s, end_s) intervals
    of predicted primary stress."""
    results = []
    ndf = pd.DataFrame(
        data={
            "time_s": [0.020 * i for i in range(len(frames))],
            "frames": frames,
        }
    )
    ndf = ndf.dropna()
    # Indices where the label changes (the first row always counts as a change);
    # the end of the sequence is appended so a final stressed region is not lost:
    indices_of_change = ndf.frames.diff()[ndf.frames.diff() != 0].index.values
    indices_of_change = np.append(indices_of_change, ndf.index[-1] + 1)
    for si, ei in pairwise(indices_of_change):
        if ndf.loc[si : ei - 1, "frames"].mode()[0] == 0:
            pass
        else:
            results.append(
                (round(ndf.loc[si, "time_s"], 3), round(ndf.loc[ei - 1, "time_s"], 3))
            )
    if results == []:
        return None
    # Post-processing: if multiple regions were returned, keep only the longest:
    if len(results) > 1:
        results = sorted(results, key=lambda t: t[1] - t[0], reverse=True)
    return results[0:1]


def evaluator(chunks):
    sampling_rate = chunks["audio"][0]["sampling_rate"]
    with torch.no_grad():
        inputs = feature_extractor(
            [i["array"] for i in chunks["audio"]],
            return_tensors="pt",
            sampling_rate=sampling_rate,
        ).to(device)
        logits = model(**inputs).logits
    y_pred_raw = logits.cpu().numpy()
    y_pred = y_pred_raw.argmax(axis=-1)
    primary_stress = [frames_to_intervals(i) for i in y_pred]
    return {
        "y_pred": y_pred,
        "y_pred_logits": y_pred_raw,
        "primary_stress": primary_stress,
    }


# Create a dataset with a single instance and map the evaluator function over it:
ds = Dataset.from_dict({"audio": [f]}).cast_column("audio", Audio(16000, mono=True))
ds = ds.map(evaluator, batched=True, batch_size=1)  # Adjust batch size to your hardware

print(ds["y_pred"][0])
# Outputs: [0, 0, 1, 1, 1, 1, 1, ...]
print(ds["y_pred_logits"][0])
# Outputs:
# [[ 0.89419061, -0.77746612],
#  [ 0.44213724, -0.34862748],
#  [-0.08605709,  0.13012762],
#  ....
print(ds["primary_stress"][0])
# Outputs: [0.34, 0.4]
```
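The same pipeline extends to several words at once. A sketch under the assumption that each WAV file again contains a single word (the paths are placeholders):

```python
# Sketch: annotate several word-level recordings in one go.
# The paths are placeholders; raise batch_size as your hardware allows.
paths = ["wavs/word1.wav", "wavs/word2.wav"]
ds = Dataset.from_dict({"audio": paths}).cast_column("audio", Audio(16000, mono=True))
ds = ds.map(evaluator, batched=True, batch_size=2)
for path, stress in zip(paths, ds["primary_stress"]):
    print(path, stress)
```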
## Training Details
### Training Data
The model was trained on 10,443 manually annotated multisyllabic words from [ParlaSpeech-HR](https://huggingface.co/datasets/classla/ParlaSpeech-HR).
### Training Procedure
#### Training Hyperparameters
- Learning rate: 1e-5
- Batch size: 32
- Number of epochs: 20
- Weight decay: 0.01
- Gradient accumulation steps: 1
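These settings map directly onto `transformers.TrainingArguments`. A hedged sketch of that mapping (the output path is a placeholder, and the surrounding training script, including dataset preparation, is not shown here):

```python
# Sketch only: the hyperparameters above expressed as TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="w2v-bert-stress-frames",  # placeholder path
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    num_train_epochs=20,
    weight_decay=0.01,
    gradient_accumulation_steps=1,
)
```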