	---
	library_name: transformers
	language:
	- hu
	license: apache-2.0
	base_model: facebook/wav2vec2-large-xlsr-53
	tags:
	- automatic-speech-recognition
	- mozilla-foundation/common_voice_17_0
	- generated_from_trainer
	datasets:
	- common_voice_17_0
	metrics:
	- wer
	model-index:
	- name: wav2vec2-large-xlsr-53-hungarian
	  results:
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: MOZILLA-FOUNDATION/COMMON_VOICE_17_0 - HU
	      type: common_voice_17_0
	      config: hu
	      split: test
	      args: 'Config: hu, Training split: train+validation, Eval split: test'
	    metrics:
	    - name: Wer
	      type: wer
	      value: 0.1727824914378453
	---

	<!-- This model card has been generated automatically according to the information the Trainer had access to. You
	should probably proofread and complete it, then remove this comment. -->

	# wav2vec2-large-xlsr-53-hungarian

	This model is a fine-tuned version of [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on the Hungarian (`hu`) configuration of the mozilla-foundation/common_voice_17_0 dataset.
	It achieves the following results on the evaluation set:
	- Loss: 0.1748
	- Wer: 0.2997

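	The WER logged during training (0.2997) differs from the WER measured with the evaluation script below (17.28 %, i.e. 0.1728) because the evaluation script removes the characters in `CHARS_TO_IGNORE` from the reference transcripts before scoring.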

	## Model comparison with the previous best wav2vec2 model (evaluated on CV17)

	| Model name | WER (%) | CER (%) |
	|:-----------------------------------------------:|:------------------:|:----------------:|
	| jonatasgrosman/wav2vec2-large-xlsr-53-hungarian | 46.199835320230555 | 9.85170677112479 |
	| sarpba/wav2vec2-large-xlsr-53-hungarian | 17.27824914378453 | 3.151354554132789 |
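
As a reading aid, the relative error reduction implied by the table can be worked out directly. The snippet below is only arithmetic on the two rows above; the helper name `relative_reduction` is illustrative and not part of any library.

```python
# WER / CER percentages copied from the comparison table above.
baseline = {"wer": 46.199835320230555, "cer": 9.85170677112479}    # jonatasgrosman
this_model = {"wer": 17.27824914378453, "cer": 3.151354554132789}  # sarpba


def relative_reduction(old: float, new: float) -> float:
    """Fraction of the old error rate that has been removed."""
    return (old - new) / old


for metric in ("wer", "cer"):
    drop = relative_reduction(baseline[metric], this_model[metric])
    print(f"{metric.upper()}: {drop:.1%} relative reduction")
```

This works out to roughly a 62.6 % relative reduction in WER and 68.0 % in CER on this test split.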

	Characters ignored during evaluation:
	```
	CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
	"؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
	"{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
	"、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
	"『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]
	```
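
To illustrate why stripping these characters changes the score, here is a minimal sketch. It uses a small subset of the list above, a made-up sentence pair, and a hand-written word-level edit distance as a stand-in for the `wer` metric, so the numbers are purely illustrative.

```python
import re

# Small, illustrative subset of the CHARS_TO_IGNORE list above.
CHARS_TO_IGNORE = [",", "?", ".", "!", ";", ":", '"', "„", "”"]
chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"


def normalize(text: str) -> str:
    """Drop ignored characters and upper-case the text."""
    return re.sub(chars_to_ignore_regex, "", text).upper()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


reference = "Jó reggelt, hogy vagy?"  # made-up transcript with punctuation
hypothesis = "JÓ REGGELT HOGY VAGY"   # made-up model output without punctuation

raw_wer = word_error_rate(reference.upper(), hypothesis)
clean_wer = word_error_rate(normalize(reference), hypothesis)
print(f"WER with punctuation kept: {raw_wer}")
print(f"WER with punctuation stripped: {clean_wer}")
```

With punctuation left in the reference, `REGGELT,` and `VAGY?` count as word errors even though the words themselves were recognized correctly; after normalization the two strings match.
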
	## Intended uses & limitations

	More information needed

	## Train & Evaluation

	Trained with the example PyTorch script from the `transformers` repository.

	Evaluation script:

	```
	import torch
	import librosa
	import re
	import warnings
	from datasets import load_dataset
	import evaluate
	from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

	LANG_ID = "hu"
	MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-hungarian"  # or "sarpba/wav2vec2-large-xlsr-53-hungarian" to evaluate this model
	DEVICE = "cuda"

	CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
	"؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
	"{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
	"、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
	"『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]

	test_dataset = load_dataset("mozilla-foundation/common_voice_17_0", LANG_ID, split="test")

	wer = evaluate.load("wer") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
	cer = evaluate.load("cer") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py


	chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

	processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
	model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
	model.to(DEVICE)

	# Preprocessing the datasets.
	# We need to read the audio files as arrays
	def speech_file_to_array_fn(batch):
	    with warnings.catch_warnings():
	        warnings.simplefilter("ignore")
	        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
	    batch["speech"] = speech_array
	    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
	    return batch

	test_dataset = test_dataset.map(speech_file_to_array_fn)

	# Run batched inference and decode the predicted token IDs.
	# Named `predict` so it does not shadow the imported `evaluate` module.
	def predict(batch):
	    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

	    with torch.no_grad():
	        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits

	    pred_ids = torch.argmax(logits, dim=-1)
	    batch["pred_strings"] = processor.batch_decode(pred_ids)
	    return batch

	result = test_dataset.map(predict, batched=True, batch_size=8)

	predictions = [x.upper() for x in result["pred_strings"]]
	references = [x.upper() for x in result["sentence"]]

	print(f"WER: {wer.compute(predictions=predictions, references=references) * 100}")
	print(f"CER: {cer.compute(predictions=predictions, references=references) * 100}")
	```
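
The `torch.argmax` plus `processor.batch_decode` step in the script above amounts to greedy CTC decoding: take the most likely token per frame, collapse consecutive repeats, drop the blank token, and turn word delimiters into spaces. The sketch below shows that idea in plain Python. The vocabulary and frame IDs are hypothetical, and the real tokenizer handles more cases than this.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one ships with the model's tokenizer.
VOCAB = {0: "<pad>", 1: "|", 2: "S", 3: "Z", 4: "I", 5: "A"}
BLANK_ID = 0          # assumption: the padding token doubles as the CTC blank
WORD_DELIMITER = "|"  # decoded as a space


def ctc_greedy_decode(pred_ids: list[int]) -> str:
    """Collapse repeated IDs, drop blanks, and turn delimiters into spaces."""
    collapsed = [token_id for token_id, _ in groupby(pred_ids)]
    tokens = [VOCAB[i] for i in collapsed if i != BLANK_ID]
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


# Per-frame argmax IDs for a made-up utterance. The blank between the
# last two 5s keeps them as two separate letters instead of one.
frame_ids = [0, 2, 2, 3, 0, 4, 4, 5, 0, 1, 1, 5, 0, 5]
print(ctc_greedy_decode(frame_ids))
```

Repeats are collapsed before blanks are removed, which is why a blank frame is needed between two identical letters for both to survive decoding.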

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 0.0003
	- train_batch_size: 16
	- eval_batch_size: 8
	- seed: 42
	- distributed_type: multi-GPU
	- num_devices: 2
	- gradient_accumulation_steps: 2
	- total_train_batch_size: 64
	- total_eval_batch_size: 16
	- optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9,0.999) and epsilon=1e-08, no additional optimizer arguments
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_steps: 500
	- num_epochs: 15.0
	- mixed_precision_training: Native AMP
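
The batch-size and scheduler entries above fit together as follows. This is a minimal sketch, assuming the `linear` scheduler has the usual shape of a linear warmup followed by a linear decay to zero; `TOTAL_STEPS` is taken from the last row of the training results table, and `linear_lr` is an illustrative helper, not a library function.

```python
# Values taken from the hyperparameter list above.
LEARNING_RATE = 3e-4
TRAIN_BATCH_SIZE = 16  # per device
NUM_DEVICES = 2
GRAD_ACCUM_STEPS = 2
WARMUP_STEPS = 500
TOTAL_STEPS = 11370    # final step in the training results table

# Samples contributing to each optimizer update.
effective_batch = TRAIN_BATCH_SIZE * NUM_DEVICES * GRAD_ACCUM_STEPS
print(effective_batch)  # 64, matching total_train_batch_size


def linear_lr(step: int) -> float:
    """Linear warmup to the peak rate, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return LEARNING_RATE * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 250, 500, 5935, 11370):
    print(step, linear_lr(step))
```

Under these assumptions the learning rate reaches its peak of 0.0003 at step 500 and is back to half that value at step 5935, midway through the decay phase.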

	### Training results

	| Training Loss | Epoch | Step | Validation Loss | Wer |
	|:-------------:|:-----:|:-----:|:---------------:|:------:|
	| 3.7968 | 1.0 | 758 | 0.2848 | 0.5295 |
	| 0.2547 | 2.0 | 1516 | 0.1908 | 0.4222 |
	| 0.1929 | 3.0 | 2274 | 0.1753 | 0.4000 |
	| 0.1532 | 4.0 | 3032 | 0.1558 | 0.3710 |
	| 0.1297 | 5.0 | 3790 | 0.1512 | 0.3536 |
	| 0.1167 | 6.0 | 4548 | 0.1574 | 0.3514 |
	| 0.101 | 7.0 | 5306 | 0.1483 | 0.3374 |
	| 0.0859 | 8.0 | 6064 | 0.1490 | 0.3299 |
	| 0.0791 | 9.0 | 6822 | 0.1523 | 0.3250 |
	| 0.0702 | 10.0 | 7580 | 0.1608 | 0.3192 |
	| 0.0629 | 11.0 | 8338 | 0.1664 | 0.3146 |
	| 0.0559 | 12.0 | 9096 | 0.1641 | 0.3103 |
	| 0.0527 | 13.0 | 9854 | 0.1665 | 0.3063 |
	| 0.0468 | 14.0 | 10612 | 0.1691 | 0.3011 |
	| 0.0443 | 15.0 | 11370 | 0.1748 | 0.2998 |
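
One thing the table shows is that validation loss and WER do not bottom out at the same epoch. The snippet below simply re-reads the rows above to make that visible; it adds no new measurements.

```python
# (epoch, validation_loss, wer) rows copied from the table above.
rows = [
    (1, 0.2848, 0.5295), (2, 0.1908, 0.4222), (3, 0.1753, 0.4000),
    (4, 0.1558, 0.3710), (5, 0.1512, 0.3536), (6, 0.1574, 0.3514),
    (7, 0.1483, 0.3374), (8, 0.1490, 0.3299), (9, 0.1523, 0.3250),
    (10, 0.1608, 0.3192), (11, 0.1664, 0.3146), (12, 0.1641, 0.3103),
    (13, 0.1665, 0.3063), (14, 0.1691, 0.3011), (15, 0.1748, 0.2998),
]

best_loss = min(rows, key=lambda r: r[1])
best_wer = min(rows, key=lambda r: r[2])
print(f"lowest validation loss: epoch {best_loss[0]} ({best_loss[1]})")
print(f"lowest WER: epoch {best_wer[0]} ({best_wer[2]})")
```

Validation loss is lowest at epoch 7 and drifts upward afterwards, while WER keeps improving through epoch 15, so the two metrics would pick different checkpoints.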


	### Framework versions

	- Transformers 4.50.0.dev0
	- Pytorch 2.6.0+cu124
	- Datasets 3.3.2
	- Tokenizers 0.21.0