---
base_model:
- facebook/w2v-bert-2.0
datasets:
- classla/ParlaSpeech-RS
- classla/ParlaSpeech-HR
- classla/Mici_Princ
language:
- sl
- hr
- sr
library_name: transformers
license: cc-by-sa-4.0
metrics:
- accuracy
pipeline_tag: audio-classification
---
# Model Card
This model annotates primary stress in words by classifying each 20 ms audio frame as stressed or unstressed.
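Concretely, the classifier emits one label per 20 ms frame (1 for stressed, 0 for unstressed), so frame indices map directly to time. A minimal sketch of that convention (the helper name is illustrative, not part of the model API):

```python
# Illustrative helper, not part of the model API: map a frame index to its
# start time in seconds, given the model's 20 ms frame step.
def frame_start_seconds(frame_index: int, frame_step_s: float = 0.020) -> float:
    return round(frame_index * frame_step_s, 3)

print(frame_start_seconds(17))  # 0.34 -> frame 17 starts at 0.34 s
```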
## Model Details
### Model Description
- **Developed by:** [Peter Rupnik](https://huggingface.co/5roop), [Nikola Ljubešić](https://huggingface.co/nljubesi), [Ivan Porupski](https://huggingface.co/porupski)
- **Model type:** Audio frame classifier
- **Language(s):** Croatian, Slovenian, Serbian, Chakavian (a variant of Croatian)
- **License:** Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
- **Paper:** Please cite the following paper:
```bibtex
@inproceedings{ljubesic2025identifying,
    title = {Identifying Primary Stress Across Related Languages and Dialects with Transformer-based Speech Encoder Models},
    author = {Ljubešić, Nikola and Porupski, Ivan and Rupnik, Peter},
    booktitle = {Proceedings of Interspeech 2025},
    year = {2025},
    note = {Accepted at Interspeech 2025}
}
```
### Training data
The model was trained on the training split of the [ParlaStress-HR dataset](http://hdl.handle.net/11356/2038).
### Evaluation results
For evaluation, the test splits of the [ParlaStress-HR dataset](http://hdl.handle.net/11356/2038) were used.
| test language | accuracy (%) |
| --- | --- |
| Croatian | 99.1 |
| Serbian | 99.3 |
| Chakavian (variant of Croatian) | 88.9 |
| Slovenian | 89.0 |
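A hedged sketch of how an accuracy figure like these could be computed with scikit-learn, assuming gold and predicted labels have already been aligned (the arrays below are illustrative, not taken from the evaluation data):

```python
# Sketch only: accuracy over aligned gold vs. predicted labels.
from sklearn.metrics import accuracy_score

y_true = [1, 0, 0, 1, 1, 0]  # gold labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 0]  # model predictions (illustrative)
print(f"accuracy: {accuracy_score(y_true, y_pred):.1%}")  # accuracy: 83.3%
```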
### Direct Use
The model is intended for data-driven analyses of primary stress position. So far, it has been shown to work on four datasets in three languages.
## Example use
```python
import numpy as np
import pandas as pd
import torch
from itertools import pairwise

from datasets import Audio, Dataset
from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "classla/Wav2Vec2BertPrimaryStressAudioFrameClassifier"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(model_name).to(device)

# Path to the file containing the word to be annotated:
f = "wavs/word.wav"


def frames_to_intervals(frames: list[int]) -> list[tuple[float, float]] | None:
    """Convert per-frame labels (20 ms frames) into (start_s, end_s) intervals
    of predicted primary stress."""
    results = []
    ndf = pd.DataFrame(
        data={
            "time_s": [0.020 * i for i in range(len(frames))],
            "frames": frames,
        }
    )
    ndf = ndf.dropna()
    # Indices where the label changes (the first row always counts as a change);
    # the end of the sequence is appended so a final stressed region is not lost:
    indices_of_change = ndf.frames.diff()[ndf.frames.diff() != 0].index.values
    indices_of_change = np.append(indices_of_change, ndf.index[-1] + 1)
    for si, ei in pairwise(indices_of_change):
        if ndf.loc[si : ei - 1, "frames"].mode()[0] == 0:
            pass
        else:
            results.append(
                (round(ndf.loc[si, "time_s"], 3), round(ndf.loc[ei - 1, "time_s"], 3))
            )
    if results == []:
        return None
    # Post-processing: if multiple regions were returned, keep only the longest:
    if len(results) > 1:
        results = sorted(results, key=lambda t: t[1] - t[0], reverse=True)
    return results[0:1]


def evaluator(chunks):
    sampling_rate = chunks["audio"][0]["sampling_rate"]
    with torch.no_grad():
        inputs = feature_extractor(
            [i["array"] for i in chunks["audio"]],
            return_tensors="pt",
            sampling_rate=sampling_rate,
        ).to(device)
        logits = model(**inputs).logits
    y_pred_raw = logits.cpu().numpy()
    y_pred = y_pred_raw.argmax(axis=-1)
    primary_stress = [frames_to_intervals(i) for i in y_pred]
    return {
        "y_pred": y_pred,
        "y_pred_logits": y_pred_raw,
        "primary_stress": primary_stress,
    }


# Create a dataset with a single instance and map the evaluator function over it:
ds = Dataset.from_dict({"audio": [f]}).cast_column("audio", Audio(16000, mono=True))
ds = ds.map(evaluator, batched=True, batch_size=1)  # Adjust batch size to your hardware

print(ds["y_pred"][0])
# Outputs: [0, 0, 1, 1, 1, 1, 1, ...]
print(ds["y_pred_logits"][0])
# Outputs:
# [[ 0.89419061, -0.77746612],
#  [ 0.44213724, -0.34862748],
#  [-0.08605709,  0.13012762],
#  ....
print(ds["primary_stress"][0])
# Outputs: [0.34, 0.4]
```
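The same pipeline extends to several words at once. A sketch under the assumption that each WAV file again contains a single word (the paths are placeholders):

```python
# Sketch: annotate several word-level recordings in one go.
# The paths are placeholders; raise batch_size as your hardware allows.
paths = ["wavs/word1.wav", "wavs/word2.wav"]
ds = Dataset.from_dict({"audio": paths}).cast_column("audio", Audio(16000, mono=True))
ds = ds.map(evaluator, batched=True, batch_size=2)
for path, stress in zip(paths, ds["primary_stress"]):
    print(path, stress)
```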
## Training Details
### Training Data
The model was trained on 10,443 manually annotated multisyllabic words from [ParlaSpeech-HR](https://huggingface.co/datasets/classla/ParlaSpeech-HR).
### Training Procedure
#### Training Hyperparameters
- Learning rate: 1e-5
- Batch size: 32
- Number of epochs: 20
- Weight decay: 0.01
- Gradient accumulation steps: 1
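These settings map directly onto `transformers.TrainingArguments`. A hedged sketch of that mapping (the output path is a placeholder, and the surrounding training script, including dataset preparation, is not shown here):

```python
# Sketch only: the hyperparameters above expressed as TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="w2v-bert-stress-frames",  # placeholder path
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    num_train_epochs=20,
    weight_decay=0.01,
    gradient_accumulation_steps=1,
)
```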