
	---
	library_name: transformers
	datasets:
	- classla/Mici_Princ
	language:
	- hr
	license: cc-by-sa-4.0
	metrics:
	- wer
	- cer
	pipeline_tag: automatic-speech-recognition
	---

	# Model Card for whisper-large-v3-mici-princ

	This model was fine-tuned on the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ),
	the audiobook of the translation of _Le Petit Prince_ into the Chakavian dialect of Croatian.

	## Model Details

	### Model Description

	- Developed by: Nikola Ljubešić, Peter Rupnik, Tea Perinčić
	- Model type: Whisper (encoder-decoder Transformer) fine-tuned for automatic speech recognition
	- Language(s) (NLP): Croatian (hrv) - Chakavian dialect (ckm)
	- License: Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
	- Finetuned from model: openai/whisper-large-v3

	### Model Sources

	- Repository: [GitHub](https://github.com/5roop/mici_princ_whisper)
	- Paper: Coming soon
	- Dataset: [Mići Princ](https://huggingface.co/datasets/classla/Mici_Princ)

	## Example use

	```python
	import torch
	from datasets import load_dataset
	from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
	from transformers.pipelines.pt_utils import KeyDataset

	# Use the GPU when one is available, otherwise fall back to the CPU.
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	model_id = "classla/whisper-large-v3-mici-princ"

	# Load the fine-tuned model and its processor (tokenizer + feature extractor).
	model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
	model.to(device)
	processor = AutoProcessor.from_pretrained(model_id)

	ds = load_dataset("classla/Mici_Princ", split="test")
	pipe = pipeline(
	    "automatic-speech-recognition",
	    model=model,
	    tokenizer=processor.tokenizer,
	    feature_extractor=processor.feature_extractor,
	    max_new_tokens=128,
	    chunk_length_s=30,
	    batch_size=16,
	    return_timestamps=True,
	    device=device,
	)

	# Transcribe every audio sample in the test split.
	result = pipe(
	    KeyDataset(ds, "audio"),
	    generate_kwargs={"language": "croatian"},
	)

	for i in result:
	    print(i)

	# Output:
	# {'text': ' Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.', 'chunks': [{'timestamp': (0.0, 7.18), 'text': ' Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.'}]}
	# ...
	```
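
Each item yielded by the pipeline is a dict with a `text` field and a list of `chunks`, as in the sample output above. The following is a minimal sketch of turning such an item into readable lines. The helper name `format_chunks` is hypothetical (it is not part of this model or of `transformers`), it assumes only the dict layout shown in the sample output, and the `sample` value is a shortened copy of that output.

```python
def format_chunks(item):
    """Turn one pipeline output dict into '[start-end] text' lines.

    Assumes the layout from the sample output: a 'chunks' list whose
    entries carry a (start, end) 'timestamp' tuple and a 'text' string.
    """
    lines = []
    for chunk in item.get("chunks", []):
        start, end = chunk["timestamp"]
        # The sample text starts with a space; strip it for display.
        text = chunk["text"].strip()
        # Guard against a missing end timestamp by printing '?' instead.
        end_label = "?" if end is None else f"{end:.2f}"
        lines.append(f"[{start:.2f}-{end_label}] {text}")
    return lines


# Shortened copy of the sample output shown above.
sample = {
    "text": " Šesti planet je biv deset put veći.",
    "chunks": [
        {"timestamp": (0.0, 7.18), "text": " Šesti planet je biv deset put veći."}
    ],
}

for line in format_chunks(sample):
    print(line)
```

With the real pipeline, the same helper could be called on each `i` in `result` instead of printing the raw dict.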



	## Training Details

	#### Preprocessing

	The model was trained on the `normalized_text` attribute of the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ). This means
	that the training text retains capital letters and punctuation, except for bullet points, newlines, and quotation marks. Special characters that occur in
	the dialect but not in standard Croatian were substituted.

	Only the `train` split was used in training.
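
To make the description of `normalized_text` concrete, here is a rough sketch of a normalization of this kind. It is not the code used to build the dataset. The function name `normalize`, the sets of quotation marks and bullet symbols, and the `substitutions` mapping are all assumptions for illustration, and the input string is made up rather than taken from the dataset.

```python
import re

# Assumed character sets; the ones actually removed from the dataset may differ.
QUOTATION_MARKS = '"„“”«»'
BULLETS = "•"


def normalize(text, substitutions=None):
    """Keep capitals and punctuation; drop quotes, bullets, and newlines."""
    # Replace special characters using a caller-supplied mapping
    # (hypothetical; the real substitution table is not given here).
    for src, dst in (substitutions or {}).items():
        text = text.replace(src, dst)
    # Delete quotation marks and bullet symbols.
    text = text.translate(str.maketrans("", "", QUOTATION_MARKS + BULLETS))
    # Collapse newlines and other whitespace runs into single spaces.
    return re.sub(r"\s+", " ", text).strip()


# Made-up input, not taken from the dataset.
raw = "• „Dobar dan“,\nreče mali princ."
print(normalize(raw))
```

Capital letters, commas, and full stops pass through unchanged, which matches the statement that the model is trained to produce cased, punctuated text.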

	#### Training Hyperparameters

	```
	per_device_train_batch_size=4,
	gradient_accumulation_steps=4,
	learning_rate=1e-5,
	warmup_steps=100,
	max_steps=309 * 10,
	gradient_checkpointing=True,
	predict_with_generate=True,
	generation_max_length=225,
	save_steps=309,
	```
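
The quantities implied by these hyperparameters can be worked out with a few lines of arithmetic. The sketch below assumes training on a single device, which the card does not state; on several devices the effective batch size would scale accordingly.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 309 * 10
save_steps = 309

# Samples contributing to one optimizer step (single-device assumption).
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# A checkpoint is written every `save_steps` optimizer steps.
num_checkpoints = max_steps // save_steps

# Upper bound on training samples processed over the whole run.
samples_seen = max_steps * effective_batch_size

print(effective_batch_size)  # 16
print(max_steps)             # 3090
print(num_checkpoints)       # 10
print(samples_seen)          # 49440
```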

	## Evaluation

	For evaluation, the `test` split of the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ) was used.

	#### Metrics

	* WER: 0.04422
	* CER: 0.16248
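
For readers unfamiliar with the two metrics, both are edit distances divided by the reference length: WER counts words, CER counts characters. The sketch below is a plain-Python illustration of those definitions, not the evaluation code behind the numbers above (the card does not say which implementation was used). The reference is a fragment of the sample output shown earlier, and the hypothesis is a made-up misrecognition.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "je biv deset put veći"
hypothesis = "je bil deset put veći"  # made-up error: one wrong word

print(wer(reference, hypothesis))  # 0.2 (1 of 5 words differs)
print(cer(reference, hypothesis))
```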


	## Citation

	Coming soon.

	## Model Card Authors

	Peter Rupnik

	## Model Card Contact

	[https://huggingface.co/5roop](https://huggingface.co/5roop)