---
language:
- hr
license: cc-by-sa-4.0
library_name: transformers
base_model: openai/whisper-large-v3
datasets:
- classla/Mici_Princ
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
widget:
- example_title: example 1
  src: https://huggingface.co/classla/whisper-large-v3-mici-princ/raw/main/MP_13_65.37-74.67.mp3.wav
- example_title: example 2
  src: https://huggingface.co/classla/whisper-large-v3-mici-princ/raw/main/MP_15_201.53-210.02.mp3.wav
- example_title: example 3
  src: https://huggingface.co/classla/whisper-large-v3-mici-princ/raw/main/MP_15_60.527-67.71.mp3.wav
- example_title: example 4
  src: https://huggingface.co/classla/whisper-large-v3-mici-princ/raw/main/MP_15_68.5-72.45.mp3.wav
---

# Model Card for whisper-large-v3-mici-princ

This model was fine-tuned on the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ), the audiobook of the translation of _Le Petit Prince_ into the Chakavian dialect of Croatian.

## Model Details

### Model Description

The base model, which already performs well on standard Croatian, was fine-tuned for 80 epochs with an effective batch size of 16. Performance was inspected every 4 epochs, and the last checkpoint is the one uploaded here. On the test split, the character error rate dropped from 11.54% to 3.95% and the word error rate from 35.43% to 16.83%.
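
The relative improvement implied by these figures can be checked with a few lines of arithmetic. The snippet below is only an illustrative sketch; the helper name `relative_reduction` is an assumption made for this example and is not part of the model's repository.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the relative error reduction in percent, rounded to two decimals."""
    return round((before - after) / before * 100, 2)


# Error rates (in percent) quoted in the model description above.
cer_reduction = relative_reduction(11.54, 3.95)
wer_reduction = relative_reduction(35.43, 16.83)

print(f"CER reduction: {cer_reduction:.2f}%")  # CER reduction: 65.77%
print(f"WER reduction: {wer_reduction:.2f}%")  # WER reduction: 52.50%
```

Both values agree with the `all` row of the metrics table in the Evaluation section.
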

- **Developed by:** Nikola Ljubešić, Peter Rupnik, Tea Perinčić
- **Language(s) (NLP):** Croatian (hrv) - Chakavian dialect (ckm)
- **License:** Creative Commons - Share Alike 4.0
- **Finetuned from model:** openai/whisper-large-v3

### Model Sources

- **Repository:** [GitHub](https://github.com/5roop/mici_princ_whisper)
- **Paper:** Coming soon
- **Dataset:** [Mići Princ](https://huggingface.co/datasets/classla/Mici_Princ)

## Example use:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers.pipelines.pt_utils import KeyDataset

# Use a GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "classla/whisper-large-v3-mici-princ"

# Load the fine-tuned model and its processor (tokenizer + feature extractor).
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Transcribe the test split of the Mići Princ dataset.
ds = load_dataset("classla/Mici_Princ", split="test")
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    device=device,
)

result = pipe(
    KeyDataset(ds, "audio"),
    generate_kwargs={"language": "croatian"},
)

for i in result:
    print(i)

# Output:
# {'text': ' Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.', 'chunks': [{'timestamp': (0.0, 7.18), 'text': ' Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.'}]}
# ...
```

## Training Details

#### Preprocessing

The model was trained on the `normalized_text` attribute of the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ). This means that the data included capital letters and punctuation, except bullet points, newlines, and quotation marks. Special characters that occur in the dialect but not in standard Croatian were substituted. Only the `train` split was used in training.
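
To make the normalization described above more concrete, here is a minimal sketch of what such a step could look like. It is not the code that produced `normalized_text`: the function name `normalize_text`, the character sets, and the substitution mapping are all assumptions chosen for illustration, and the sample sentence is made up.

```python
import re
from typing import Mapping, Optional

# Hypothetical character sets; the ones actually used for the dataset may differ.
QUOTATION_MARKS = "\"„“”«»"
BULLET_POINTS = "•·◦▪"


def normalize_text(text: str, substitutions: Optional[Mapping[str, str]] = None) -> str:
    """Drop quotation marks and bullets, flatten newlines, keep case and other punctuation."""
    # Replace dialect-specific characters first (the mapping is supplied by the caller).
    for source_char, target_char in (substitutions or {}).items():
        text = text.replace(source_char, target_char)
    # Remove quotation marks and bullet characters entirely.
    text = re.sub(f"[{re.escape(QUOTATION_MARKS + BULLET_POINTS)}]", "", text)
    # Collapse newlines and any other runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample with a bullet, quotation marks, and a line break.
sample = "• „Dobar dan“,\nreče on."
print(normalize_text(sample))
# Placeholder substitution, only to show how the mapping is applied.
print(normalize_text("k\u00f4d", {"\u00f4": "o"}))
```

Capital letters and sentence punctuation pass through unchanged, which matches the description of the training text given above.
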

#### Training Hyperparameters

```
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=1e-5,
warmup_steps=100,
max_steps=277 * 80,
gradient_checkpointing=True,
predict_with_generate=True,
generation_max_length=225,
save_steps=277,
```

## Evaluation

For evaluation, the `test` split of the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ) was used. The test split consists of two known speakers, Autor and Mići Princ, and two unknown speakers, Geograf and Dilavac. Note that each speaker uses a different micro-dialect, so the test data is challenging in that it includes two micro-dialects that are new to the model.

#### Metrics

| speaker | WER vanilla | WER fine-tuned | WER reduction | CER vanilla | CER fine-tuned | CER reduction |
|---|---|---|---|---|---|---|
| all | 35.43% | 16.83% | 52.50% | 11.54% | 3.95% | 65.77% |
| Autor | 38.96% | 14.29% | 63.32% | 10.24% | 2.93% | 71.39% |
| Geograf | 20.94% | 11.57% | 44.75% | 4.99% | 2.19% | 56.11% |
| Mići Princ | 45.31% | 16.62% | 63.32% | 12.21% | 5.09% | 58.31% |
| Dilavac | 39.60% | 23.70% | 40.15% | 18.55% | 5.27% | 71.59% |

## Citation

Coming soon.

## Model Card Authors

Peter Rupnik

## Model Card Contact

[https://huggingface.co/5roop](https://huggingface.co/5roop)