Model Card for whisper-large-v3-cantonese

This model is a fine-tuned version of OpenAI's Whisper large-v3 (openai/whisper-large-v3), trained for automatic speech recognition (ASR) in Cantonese (Yue). It was fine-tuned on data from the Common Voice 17 dataset for 10 epochs with a learning rate of 1e-7.

Model Details

Model Architecture: Whisper large-v3
Language: Cantonese (Yue)
Training Dataset: Common Voice 17
Training Duration: 10 epochs
Learning Rate: 1e-7
Frozen Layers: 12 decoder layers were frozen during training
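
A minimal sketch of how decoder layers might be frozen before fine-tuning. This card does not say which 12 decoder layers were frozen, so freezing the first 12 is an assumption here, and the helper name `freeze_decoder_layers` is hypothetical. The attribute path `model.model.decoder.layers` is where `WhisperForConditionalGeneration` in Transformers keeps its decoder blocks.

```python
def freeze_decoder_layers(model, num_layers=12):
    """Disable gradients for the first `num_layers` decoder layers.

    Assumption: the card only says "12 decoder layers"; picking the
    first 12 is an illustrative choice, not the author's confirmed setup.
    Returns the number of parameter tensors that were frozen.
    """
    frozen = 0
    # For WhisperForConditionalGeneration, the decoder blocks live here.
    for layer in model.model.decoder.layers[:num_layers]:
        for param in layer.parameters():
            param.requires_grad = False  # excluded from optimizer updates
            frozen += 1
    return frozen
```

With a model loaded through `WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")`, calling `freeze_decoder_layers(model)` before building the optimizer would leave the remaining decoder layers and the encoder trainable.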

Model Description


Developed by: khleeloo (Rita Frieske)
Language(s) (NLP): Cantonese
License: apache-2.0
Finetuned from model: openai/whisper-large-v3

Uses

This model is intended for researchers and developers interested in building applications that require speech recognition capabilities in Cantonese. It can be used in various applications, including:

Voice assistants
Transcription services
Accessibility features for Cantonese speakers

Bias, Risks, and Limitations

The model is fine-tuned specifically for Cantonese and may not perform well on other languages or dialects. Performance may vary with the quality of the audio input and the speaker's accent, and it depends on how diverse and representative the training data is.

How to Get Started with the Model

Load the model and its processor with the Hugging Face Transformers library:

from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Repository ID as given in the citation URL of this card
model_id = "khleeloo/whisper-large-v3-cantonese"
model = WhisperForConditionalGeneration.from_pretrained(model_id)
processor = WhisperProcessor.from_pretrained(model_id)
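
Once loaded, the model can be used for transcription. The helper below is a minimal sketch, not the author's own inference script: the function name `transcribe` is hypothetical, the input is assumed to be a mono waveform as a 1-D float array sampled at 16 kHz (the rate Whisper's feature extractor expects), and default generation settings are used. The heavy libraries are imported inside the function so that defining it does not load them.

```python
MODEL_ID = "khleeloo/whisper-large-v3-cantonese"


def transcribe(waveform, sampling_rate=16_000, model_id=MODEL_ID):
    """Transcribe one mono waveform (1-D float array) to text.

    Assumptions: audio is already resampled to 16 kHz and is at most
    about 30 seconds long (one Whisper input window).
    """
    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)
    model.eval()

    # Convert raw audio to log-mel input features.
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")

    # Generate token IDs without tracking gradients.
    with torch.no_grad():
        predicted_ids = model.generate(inputs.input_features)

    # Decode token IDs to text, dropping special tokens.
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

A waveform could be obtained with, for example, `librosa.load(path, sr=16000)`, which returns the samples and the sampling rate. For repeated calls it would be better to load the model and processor once and reuse them.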

Training

Training Data

mozilla-foundation/common_voice_17_0

Evaluation

Testing Data, Factors & Metrics

Common Voice 17.0 yue test split, Common Voice 15.0 yue test split, and Common Voice 15.0 zh-HK test split (these test sets were used to evaluate Whisper large-v3).

Metrics

Character Error Rate (CER), since Cantonese is written in a character-based script.
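
For reference, CER is the character-level edit distance between the hypothesis and the reference, divided by the number of reference characters. The sketch below is a plain-Python illustration of that definition, not the evaluation script behind the reported numbers. Removing whitespace before scoring and pooling the counts over the whole corpus are assumptions.

```python
def levenshtein(ref, hyp):
    """Minimum number of substitutions, insertions and deletions
    needed to turn sequence `ref` into sequence `hyp`."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total character edits / total reference characters.

    Assumption: all whitespace is removed before scoring.
    """
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        ref_chars = "".join(ref.split())
        hyp_chars = "".join(hyp.split())
        total_edits += levenshtein(ref_chars, hyp_chars)
        total_chars += len(ref_chars)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars
```

For example, `cer(["你好嗎"], ["你好嘛"])` gives one substitution over three reference characters, so a CER of 1/3.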

Results

Model	CV 15.0 zh-HK	CV 15.0 yue	CV 17.0 yue
Whisper large-v3	10.8	16	-
Whisper Cantonese (ours)	18.88	8.77	7.26

Explanation: Our model was trained on the vernacular Cantonese (yue) data rather than the zh-HK data, which contains more written Cantonese, because it is a speech recognition model. This accounts for its weaker performance on the zh-HK split of Common Voice.
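
To put the table in relative terms, the short sketch below computes the relative change in CER between Whisper large-v3 and the fine-tuned model on the two test sets where both have a score. It uses only the numbers reported above. The helper name `relative_change` is illustrative.

```python
def relative_change(baseline, new):
    """Relative change of `new` with respect to `baseline` (negative = lower CER)."""
    return (new - baseline) / baseline


# (Whisper large-v3 CER, fine-tuned model CER), as reported in the table
results = {
    "CV 15.0 zh-HK": (10.8, 18.88),
    "CV 15.0 yue": (16.0, 8.77),
}

for name, (baseline, finetuned) in results.items():
    print(f"{name}: {relative_change(baseline, finetuned) * 100:+.1f}%")
# prints:
# CV 15.0 zh-HK: +74.8%
# CV 15.0 yue: -45.2%
```

On these numbers, the error rate on the yue split falls by roughly 45% relative, while on the zh-HK split it rises by roughly 75% relative.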

Citation

BibTeX:

@misc{rita_frieske_2025,
  author    = {{Rita Frieske}},
  title     = {whisper-large-v3-cantonese},
  year      = 2025,
  url       = {https://huggingface.co/khleeloo/whisper-large-v3-cantonese},
  doi       = {10.57967/hf/4393},
  publisher = {Hugging Face}
}

Model Card Authors

https://khleeloo.github.io/
