metadata

library_name: transformers
datasets:
  - classla/Mici_Princ
language:
  - hr
license: cc-by-sa-4.0
pipeline_tag: automatic-speech-recognition
base_model: openai/whisper-large-v3
widget:
  - example_title: example 1
    src: >-
      https://huggingface.co/classla/whisper-large-v3-mici-princ/blob/main/MP_13_65.37-74.67.mp3
  - example_title: example 2
    src: >-
      https://huggingface.co/classla/whisper-large-v3-mici-princ/blob/main/MP_15_201.53-210.02.mp3
  - example_title: example 3
    src: >-
      https://huggingface.co/classla/whisper-large-v3-mici-princ/blob/main/MP_15_60.527-67.71.mp3
  - example_title: example 4
    src: >-
      https://huggingface.co/classla/whisper-large-v3-mici-princ/blob/main/MP_15_68.5-72.45.mp3
metrics:
  - wer
  - cer

Model Card for classla/whisper-large-v3-mici-princ

This model was fine-tuned on the Mići Princ dataset, an audiobook of the translation of Le Petit Prince into the Chakavian dialect of Croatian.

Model Details

Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

Developed by: Nikola Ljubešić, Peter Rupnik, Tea Perinčić
Model type: Whisper (encoder-decoder Transformer for automatic speech recognition)
Language(s) (NLP): Croatian (hrv) - Chakavian dialect (ckm)
License: Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
Finetuned from model: openai/whisper-large-v3

Model Sources

Repository: GitHub
Paper: Coming soon
Dataset: Mići Princ

Example use:

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers.pipelines.pt_utils import KeyDataset

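# Use a GPU when one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "classla/whisper-large-v3-mici-princ"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
)

model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Load the test split of the Mići Princ dataset.
ds = load_dataset("classla/Mici_Princ", split="test")
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,  # process long audio in 30-second chunks
    batch_size=16,
    return_timestamps=True,  # include timestamps in the output
    device=device,
)

# Transcribe the "audio" column, setting Croatian as the decoding language.
result = pipe(
    KeyDataset(ds, "audio"),
    generate_kwargs={"language": "croatian"},
)

# Results are yielded one at a time as the pipeline works through the dataset.
for transcription in result:
    print(transcription)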

# Output:
# {'text': ' Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.', 'chunks': [{'timestamp': (0.0, 7.18), 'text': ' Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.'}]}
# ...
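
Each result is a dictionary with a text field and a list of chunks, as in the output above. The following is a minimal sketch of turning one such result into timestamped lines. The helper name format_chunks is hypothetical and not part of transformers, and the None check reflects an assumption that the pipeline may omit the end timestamp of a final chunk; verify this against your installed version.

```python
def format_chunks(result):
    """Render one pipeline result as '[start-end] text' lines (hypothetical helper)."""
    lines = []
    for chunk in result["chunks"]:
        start, end = chunk["timestamp"]
        # Assumption: the end timestamp may be None if none was predicted.
        end_label = f"{end:.2f}" if end is not None else "?"
        lines.append(f"[{start:.2f}-{end_label}] {chunk['text'].strip()}")
    return lines


# The sample result copied from the output shown above.
sample = {
    "text": " Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.",
    "chunks": [
        {
            "timestamp": (0.0, 7.18),
            "text": " Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.",
        }
    ],
}

for line in format_chunks(sample):
    print(line)
```

The leading space that Whisper places before each segment is stripped so that the bracketed timestamp and the text are separated by exactly one space.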

Training Details

Preprocessing

The model was trained on the normalized_text attribute of the Mići Princ dataset. The training text therefore kept capital letters and punctuation, but not bullet points, newlines, or quotation marks. Special characters that occur in the dialect but not in standard Croatian were substituted.

Only the train split was used in training.
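
To make the normalization described above concrete, here is an illustrative sketch. It is not the code used to build the dataset: the function normalize, the sets of bullet and quotation characters, and the substitution mapping are all assumptions, since the card does not list which characters were removed or how dialect-specific characters were replaced.

```python
import re

# Assumed character sets; the dataset's actual lists are not given in this card.
BULLETS = "•◦▪"
QUOTES = '"„“”«»'
REMOVE = re.compile("[" + re.escape(BULLETS + QUOTES) + "]")


def normalize(text, substitutions):
    """Hypothetical normalizer: substitute, drop bullets and quotes, flatten newlines."""
    # Replace dialect-specific characters using a caller-supplied mapping.
    for source, target in substitutions.items():
        text = text.replace(source, target)
    # Remove bullet points and quotation marks; keep case and other punctuation.
    text = REMOVE.sub("", text)
    # Turn newlines into spaces and collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()


raw = "• „Šesti planet je biv\ndeset put veći.“"
print(normalize(raw, {}))

# Placeholder mapping for illustration only, not the dataset's real table.
print(normalize("čovîk", {"î": "i"}))
```

Passing the substitution table as an argument keeps the sketch honest about what is unknown: the real mapping would have to come from the dataset documentation.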

Training Hyperparameters

    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    warmup_steps=100,
    max_steps=309 * 10,
    gradient_checkpointing=True,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=309,
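
Several of these values interact, and a short sketch of the arithmetic may help. It is plain Python with no trainer involved. The figures are per device because the card does not state how many devices were used, and reading 309 as the number of optimizer steps in one pass over the train split is only an assumption suggested by the expression 309 * 10.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 309 * 10
save_steps = 309

# Examples contributing to each optimizer update on one device.
effective_batch_per_device = per_device_train_batch_size * gradient_accumulation_steps

# Examples processed on one device over the whole run (max_steps counts updates).
examples_seen_per_device = effective_batch_per_device * max_steps

# Number of checkpoints written when saving every save_steps updates.
checkpoints_saved = max_steps // save_steps

print(effective_batch_per_device, examples_seen_per_device, checkpoints_saved)
```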

Evaluation

For evaluation, the test split of the Mići Princ dataset was used.

Metrics

WER: 0.04422
CER: 0.16248
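
For readers unfamiliar with these metrics, the sketch below shows how word error rate and character error rate are commonly defined: the edit distance between reference and hypothesis, divided by the reference length in words or characters. It is a self-contained illustration, not the evaluation code behind the numbers above; the card does not say which library, text normalization, or aggregation over utterances was used. The hypothesis string is a made-up misrecognition, and counting spaces as characters in CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "Šesti planet je biv deset put veći."
hypothesis = "Šesti planet je bil deset put veći."  # invented error for illustration

print(f"WER: {wer(reference, hypothesis):.5f}")
print(f"CER: {cer(reference, hypothesis):.5f}")
```

In this example one of seven words and one of thirty-five characters differ, so a single wrong letter costs far more in WER than in CER.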

Citation

Coming soon.

Model Card Authors

Peter Rupnik

Model Card Contact

https://huggingface.co/5roop