|
--- |
|
base_model: |
|
- facebook/w2v-bert-2.0 |
|
datasets: |
|
- classla/ParlaSpeech-RS |
|
- classla/ParlaSpeech-HR |
|
- classla/Mici_Princ |
|
language: |
|
- sl |
|
- hr |
|
- sr |
|
library_name: transformers |
|
license: cc-by-sa-4.0 |
|
metrics: |
|
- accuracy |
|
pipeline_tag: audio-classification |
|
--- |
|
|
|
# Model Card for Wav2Vec2BertPrimaryStressAudioFrameClassifier
|
|
|
This model annotates primary stress in words by classifying each 20 ms frame of audio as stressed or unstressed.
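In other words, the classifier emits one binary label per 20 ms frame of input audio, so a frame index maps directly to a time offset. A tiny illustration (the prediction values are made up):

```python
# Illustration only: mapping per-frame predictions (one label per 20 ms frame)
# back to time offsets in the audio. The prediction values below are made up.
frame_predictions = [0, 0, 1, 1, 1, 0]
stressed_times = [round(i * 0.02, 2) for i, p in enumerate(frame_predictions) if p == 1]
print(stressed_times)  # [0.04, 0.06, 0.08] -> stress roughly between 0.04 s and 0.08 s
```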
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
The model is a fine-tuned version of [facebook/w2v-bert-2.0](https://huggingface.co/facebook/w2v-bert-2.0) for frame-level classification of primary stress, covering Croatian, Slovenian, Serbian, and the Chakavian variant of Croatian.
|
|
|
|
|
- **Developed by:** [Peter Rupnik](https://huggingface.co/5roop), [Nikola Ljubešić](https://huggingface.co/nljubesi), [Ivan Porupski](https://huggingface.co/porupski) |
|
- **Model type:** Audio frame classifier |
|
- **Language(s):** Croatian, Slovenian, Serbian, and the Chakavian variant of Croatian
|
- **License:** Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
|
|
|
|
|
|
- **Paper:** Please cite the following paper: |
|
|
|
``` |
|
@inproceedings{ljubesic2025identifying, |
|
title = {Identifying Primary Stress Across Related Languages and Dialects with Transformer-based Speech Encoder Models}, |
|
author = {Ljubešić, Nikola and Porupski, Ivan and Rupnik, Peter}, |
|
booktitle = {Proceedings of Interspeech 2025}, |
|
year = {2025}, |
|
note = {Accepted at Interspeech 2025} |
|
} |
|
``` |
|
### Training Data
|
|
|
The model was trained on the training split of the [ParlaStress-HR dataset](http://hdl.handle.net/11356/2038).
|
|
|
### Evaluation results |
|
|
|
For evaluation, the test splits of the [ParlaStress-HR dataset](http://hdl.handle.net/11356/2038) were used.
|
|
|
| test language | accuracy (%) |
| --- | --- |
| Croatian | 99.1 |
| Serbian | 99.3 |
| Chakavian (variant of Croatian) | 88.9 |
| Slovenian | 89.0 |
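For illustration only (this is not the evaluation code used to produce the numbers above), one simple way to score such predictions at the word level is to count a word as correct when its predicted stress interval overlaps the gold-annotated one:

```python
# Illustrative sketch, not the paper's evaluation script: a word counts as
# correct if its predicted stress interval overlaps the gold interval.
def overlaps(pred: tuple[float, float] | None, gold: tuple[float, float]) -> bool:
    return pred is not None and pred[0] < gold[1] and gold[0] < pred[1]

def word_accuracy(predicted, gold) -> float:
    correct = sum(overlaps(p, g) for p, g in zip(predicted, gold))
    return correct / len(gold)

# Two of three made-up words are scored as correct here.
print(word_accuracy(
    [(0.34, 0.40), None, (0.10, 0.16)],
    [(0.30, 0.42), (0.50, 0.60), (0.12, 0.20)],
))  # 0.666...
```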
|
|
|
### Direct Use |
|
|
|
The model is intended for data-driven analyses of primary stress position. So far it has been shown to work on 4 datasets in 3 languages.
|
|
|
|
|
## Example use |
|
|
|
```python |
|
import numpy as np
import torch
from datasets import Audio, Dataset
from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification
|
|
|
if torch.cuda.is_available(): |
|
device = torch.device("cuda") |
|
else: |
|
device = torch.device("cpu") |
|
|
|
model_name = "classla/Wav2Vec2BertPrimaryStressAudioFrameClassifier" |
|
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name) |
|
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(model_name).to(device) |
|
# Path to the audio file containing the word to be annotated:
|
f = "wavs/word.wav" |
|
|
|
|
|
def frames_to_intervals(frames: list[int]) -> list[tuple[float, float]] | None:
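    """Convert per-frame 0/1 predictions (20 ms frames) into time intervals.

    Returns a one-element list with the longest predicted stress interval as
    (start_s, end_s), rounded to milliseconds, or None if no frame was
    predicted as stressed.
    """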
|
from itertools import pairwise |
|
import pandas as pd |
|
|
|
results = [] |
|
ndf = pd.DataFrame( |
|
data={ |
|
"time_s": [0.020 * i for i in range(len(frames))], |
|
"frames": frames, |
|
} |
|
) |
|
ndf = ndf.dropna() |
|
indices_of_change = ndf.frames.diff()[ndf.frames.diff() != 0].index.values |
|
for si, ei in pairwise(indices_of_change): |
|
if ndf.loc[si : ei - 1, "frames"].mode()[0] == 0: |
|
pass |
|
else: |
|
results.append( |
|
(round(ndf.loc[si, "time_s"], 3), round(ndf.loc[ei - 1, "time_s"], 3)) |
|
) |
|
if results == []: |
|
return None |
|
# Post-processing: if multiple regions were returned, only the longest should be taken: |
|
if len(results) > 1: |
|
results = sorted(results, key=lambda t: t[1]-t[0], reverse=True) |
|
return results[0:1] |
|
|
|
|
|
def evaluator(chunks): |
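    """Batched `datasets.map` function: extract features, run the frame
    classifier, and turn per-frame predictions into stress time intervals."""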
|
sampling_rate = chunks["audio"][0]["sampling_rate"] |
|
with torch.no_grad(): |
|
inputs = feature_extractor( |
|
[i["array"] for i in chunks["audio"]], |
|
return_tensors="pt", |
|
sampling_rate=sampling_rate, |
|
).to(device) |
|
logits = model(**inputs).logits |
|
y_pred_raw = np.array(logits.cpu()) |
|
y_pred = y_pred_raw.argmax(axis=-1) |
|
primary_stress = [frames_to_intervals(i) for i in y_pred] |
|
return { |
|
"y_pred": y_pred, |
|
"y_pred_logits": y_pred_raw, |
|
"primary_stress": primary_stress, |
|
} |
|
|
|
# Create a dataset with a single instance and map our evaluator function on it: |
|
ds = Dataset.from_dict({"audio": [f]}).cast_column("audio", Audio(16000, mono=True)) |
|
ds = ds.map(evaluator, batched=True, batch_size=1) # Adjust batch size according to your hardware specs |
|
print(ds["y_pred"][0]) |
|
# Outputs: [0, 0, 1, 1, 1, 1, 1, ...] |
|
print(ds["y_pred_logits"][0]) |
|
# Outputs: |
|
# [[ 0.89419061, -0.77746612], |
|
# [ 0.44213724, -0.34862748], |
|
# [-0.08605709, 0.13012762], |
|
# .... |
|
print(ds["primary_stress"][0]) |
|
# Outputs: [[0.34, 0.4]]
|
|
|
``` |
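The same pipeline also handles more than one pre-cut word recording at a time; a short sketch reusing the `evaluator` defined above (the paths are placeholders):

```python
# Sketch: annotate several pre-cut word recordings in one pass (placeholder paths).
files = ["wavs/word1.wav", "wavs/word2.wav", "wavs/word3.wav"]
ds = Dataset.from_dict({"audio": files}).cast_column("audio", Audio(16000, mono=True))
ds = ds.map(evaluator, batched=True, batch_size=8)
for path, interval in zip(files, ds["primary_stress"]):
    print(path, interval)
```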
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The model was trained on 10,443 manually annotated multisyllabic words from [ParlaSpeech-HR](https://huggingface.co/datasets/classla/ParlaSpeech-HR).
|
|
|
### Training Procedure |
|
|
|
The model was fine-tuned from [facebook/w2v-bert-2.0](https://huggingface.co/facebook/w2v-bert-2.0) as an audio frame classifier (`Wav2Vec2BertForAudioFrameClassification`) with the hyperparameters listed below.
|
|
|
#### Training Hyperparameters |
|
|
|
- Learning rate: 1e-5 |
|
- Batch size: 32 |
|
- Number of epochs: 20 |
|
- Weight decay: 0.01 |
|
- Gradient accumulation steps: 1 |
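For orientation only, these settings could be expressed with the `transformers` `TrainingArguments` API roughly as follows (a sketch, not the actual training script; `output_dir` is a placeholder):

```python
from transformers import TrainingArguments

# Sketch of the reported hyperparameters in TrainingArguments form.
training_args = TrainingArguments(
    output_dir="w2v-bert-primary-stress",  # placeholder
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    num_train_epochs=20,
    weight_decay=0.01,
    gradient_accumulation_steps=1,
)
```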
|
|
|
## Evaluation |
|
|
|
The evaluation protocol and results are described in the *Evaluation results* section above: the model was evaluated on the test splits of the [ParlaStress-HR dataset](http://hdl.handle.net/11356/2038) for Croatian, Serbian, Chakavian, and Slovenian.