---
library_name: transformers
datasets:
- classla/Mici_Princ
language:
- hr
license: cc-by-sa-4.0
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
---

# Model Card for classla/whisper-large-v3-mici-princ

This model was finetuned on the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ), an audiobook of a translation of _Le Petit Prince_ into the Chakavian dialect of Croatian.

## Model Details

### Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub.

- **Developed by:** Nikola Ljubešić, Peter Rupnik, Tea Perinčić
- **Model type:** [More Information Needed]
- **Language(s) (NLP):** Croatian (Chakavian dialect)
- **License:** Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
- **Finetuned from model:** [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)

### Model Sources

- **Repository:** [GitHub](https://github.com/5roop/mici_princ_whisper)
- **Paper:** Coming soon
- **Dataset:** [Mići Princ](https://huggingface.co/datasets/classla/Mici_Princ)

## Example use

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers.pipelines.pt_utils import KeyDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "classla/whisper-large-v3-mici-princ"

# Load the finetuned model and its processor from the Hub.
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# The `test` split of the Mići Princ dataset is used for demonstration.
ds = load_dataset("classla/Mici_Princ", split="test")

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    device=device,
)

result = pipe(
    KeyDataset(ds, "audio"),
    generate_kwargs={"language": "croatian"},
)

for i in result:
    print(i)

# Output:
# {'text': ' Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.', 'chunks': [{'timestamp': (0.0, 7.18), 'text': ' Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.'}]}
# ...
```

## Training Details

### Training Data

The `train` split of the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ) was used; see Preprocessing below.

### Training Procedure

#### Preprocessing

The model was trained on the `normalized_text` attribute of the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ). This means that the training data preserved capital letters and punctuation, with the exception of bullet points, newlines, and quotation marks. Special characters that are present in the dialect but not in standard Croatian were substituted. Only the `train` split was used in training.

#### Training Hyperparameters

```
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=1e-5,
warmup_steps=100,
max_steps=309 * 10,
gradient_checkpointing=True,
predict_with_generate=True,
generation_max_length=225,
save_steps=309,
```

## Evaluation

For evaluation, the `test` split of the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ) was used. A sketch of how the reported metrics can be recomputed is given at the end of this card.

### Metrics

* WER: 0.16248
* CER: 0.04422

## Citation

Coming soon.

## Model Card Authors

Peter Rupnik

## Model Card Contact

[https://huggingface.co/5roop](https://huggingface.co/5roop)
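
## Reproducing the evaluation metrics

The snippet below is a minimal sketch of how the WER and CER reported above could be recomputed, not the exact evaluation script used for this card. It assumes that `pipe` and `ds` are constructed as in the example-use section, that the reference transcripts live in the dataset's `normalized_text` field, and that the [`jiwer`](https://github.com/jitsi/jiwer) package is installed.

```python
# Minimal sketch: recompute WER/CER on the `test` split.
# Assumptions (not confirmed by this card): `pipe` and `ds` exist as in the
# example-use section, references are in `normalized_text`, and `jiwer` is
# the metric implementation.
import jiwer
from transformers.pipelines.pt_utils import KeyDataset

# Transcribe every test-split recording with the finetuned model.
hypotheses = [
    out["text"].strip()
    for out in pipe(KeyDataset(ds, "audio"), generate_kwargs={"language": "croatian"})
]
references = ds["normalized_text"]

print("WER:", jiwer.wer(references, hypotheses))
print("CER:", jiwer.cer(references, hypotheses))
```

Note that WER and CER are sensitive to whitespace, casing, and punctuation, so the exact values obtained depend on any text normalization applied before scoring.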