Model Card for whisper-large-v3-cantonese
This model is a fine-tuned version of Whisper large-v3 (openai/whisper-large-v3) for automatic speech recognition (ASR) in Cantonese (Yue). It was fine-tuned on the Common Voice 17 dataset for 10 epochs with a learning rate of 1e-7.
Model Details
- Model Architecture: Whisper large-v3
- Language: Cantonese (Yue)
- Training Dataset: Common Voice 17
- Training Duration: 10 epochs
- Learning Rate: 1e-7
- Frozen Layers: 12 decoder layers were frozen during training
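As a rough illustration of the frozen-layer setup listed above, the sketch below disables gradients for a number of decoder blocks before fine-tuning. The model card does not say which 12 decoder layers were frozen, so freezing the first 12 is an assumption, and `freeze_decoder_layers` is a hypothetical helper name, not the author's training code.

```python
def freeze_decoder_layers(model, num_layers=12):
    """Disable gradients for the first `num_layers` decoder blocks.

    Assumes `model` exposes its decoder blocks at `model.model.decoder.layers`,
    as `WhisperForConditionalGeneration` does in Hugging Face Transformers.
    Which layers were actually frozen for this checkpoint is not documented;
    taking the first ones is an assumption made for illustration.
    Returns the number of parameter tensors that were frozen.
    """
    layers = model.model.decoder.layers
    if num_layers > len(layers):
        raise ValueError(
            f"cannot freeze {num_layers} layers; decoder has {len(layers)}"
        )

    frozen = 0
    for layer in layers[:num_layers]:
        for param in layer.parameters():
            # Frozen parameters receive no gradient and are not updated.
            param.requires_grad = False
            frozen += 1
    return frozen
```

With Transformers installed, this could be called on a model loaded from `openai/whisper-large-v3` before the optimizer is built, so that only the remaining parameters are trained.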
Model Description
- Developed by: khleeloo (Rita Frieske)
- Language(s) (NLP): Cantonese
- License: apache-2.0
- Finetuned from model: openai/whisper-large-v3
Uses
This model is intended for researchers and developers interested in building applications that require speech recognition capabilities in Cantonese. It can be used in various applications, including:
- Voice assistants
- Transcription services
- Accessibility features for Cantonese speakers
Bias, Risks, and Limitations
The model is fine-tuned specifically for Cantonese and may not perform well on other languages or dialects. Performance may vary with the audio quality and the speaker's accent, and it depends on the diversity and richness of the training data.
How to Get Started with the Model
To use this model, you can load it using the Hugging Face Transformers library:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
model = WhisperForConditionalGeneration.from_pretrained("khleeloo/whisper-large-v3-cantonese")
processor = WhisperProcessor.from_pretrained("khleeloo/whisper-large-v3-cantonese")
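Once loaded, the model can be used to transcribe audio. The following is a minimal sketch of one way to do that, not an official usage recipe: `transcribe` is a hypothetical helper, the repository ID is taken from the citation URL in this card, and the input is assumed to be a 1-D float waveform already resampled to 16 kHz mono, which is what the Whisper feature extractor expects.

```python
MODEL_ID = "khleeloo/whisper-large-v3-cantonese"  # from the citation URL


def transcribe(waveform, sampling_rate=16000, model_id=MODEL_ID):
    """Transcribe a single 16 kHz mono waveform and return the text.

    `waveform` is assumed to be a 1-D sequence or array of float samples.
    """
    if sampling_rate != 16000:
        # Whisper's feature extractor is built for 16 kHz input.
        raise ValueError("resample the audio to 16 kHz before transcribing")

    # Heavy imports are kept local so the helper can be defined cheaply.
    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)
    model.eval()

    # Convert the raw waveform into log-mel input features.
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")

    with torch.no_grad():
        predicted_ids = model.generate(inputs.input_features)

    # Decode token IDs back to text, dropping special tokens.
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

In practice the model and processor would be loaded once and reused across files; they are loaded inside the function here only to keep the sketch short.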
Training
Training Data
- mozilla-foundation/common_voice_17_0
Evaluation
Testing Data, Factors & Metrics
The model was evaluated on three test splits: Common Voice 17_0 yue, Common Voice 15_0 yue, and Common Voice 15_0 zh-HK (these test sets were used to evaluate Whisper large-v3).
Metrics
Character Error Rate (CER), since Cantonese is a character-based language.
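For reference, CER is the character-level edit distance between the reference and the hypothesis, divided by the number of reference characters. The sketch below shows one plain-Python way to compute it. The function names are hypothetical, and two choices are assumptions rather than the author's documented evaluation setup: whitespace is stripped before scoring, and the result is reported as a percentage.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER in percent over paired reference/hypothesis strings."""
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        # Assumption: spaces carry no meaning for scoring and are removed.
        ref = ref.replace(" ", "")
        hyp = hyp.replace(" ", "")
        total_edits += edit_distance(ref, hyp)
        total_chars += len(ref)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return 100.0 * total_edits / total_chars


# One substituted character out of five reference characters.
print(cer(["我哋去食飯"], ["我地去食飯"]))  # 20.0
```

Summing edits and characters over the whole corpus, as done here, weights long utterances more heavily than averaging per-utterance CER would.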
Results
Model | CV 15_0 zh-HK | CV 15_0 yue | CV 17_0 yue |
---|---|---|---|
Whisper large v3 | 10.8 | 16 | - |
Whisper cantonese (ours) | 18.88 | 8.77 | 7.26 |
Explanation: our model was trained on the yue data, which is closer to vernacular (spoken) Cantonese, rather than on the zh-HK data, which contains more written-style Cantonese, because it is a speech recognition model. This explains its weaker performance on the zh-HK split of Common Voice.
Citation
BibTeX:
@misc{rita_frieske_2025,
  author    = {{Rita Frieske}},
  title     = {whisper-large-v3-cantonese},
  year      = 2025,
  url       = {https://huggingface.co/khleeloo/whisper-large-v3-cantonese},
  doi       = {10.57967/hf/4393},
  publisher = {Hugging Face}
}