license: mit
Teochew Whisper Medium
This model is a version of the Whisper medium model fine-tuned to recognize Teochew (潮州话), a language in the Min Nan family spoken in southern China.
For detailed documentation of how this model was trained, see this video: https://www.youtube.com/watch?v=JH_78KmP4Zk
Training Data
The model was fine-tuned on approximately 35 hours of audio data derived from Teochew language movies, TV shows, and comedies.
Evaluation Metrics
On our private test set, we obtained the following Word Error Rate (WER) metrics:
- Careful Speech: 0.31
- Conversational Speech: 0.68
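For readers unfamiliar with the metric: WER is the edit distance between the reference and predicted token sequences, divided by the reference length, so 0.31 corresponds to roughly 31 errors per 100 reference tokens. The sketch below is a minimal illustration, not the evaluation script behind the numbers above. The function names are hypothetical, and splitting Chinese text into single characters is an assumption, since the tokenization used for the private test set is not stated.

```python
from typing import Sequence


def edit_distance(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # previous[j] is the distance between the reference prefix processed
    # so far and hypothesis[:j].
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Edit distance normalized by the number of reference tokens."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    return edit_distance(reference, hypothesis) / len(reference)


# Assumption: each Chinese character is treated as one token.
reference = list("我是潮州人")
hypothesis = list("我是潮汕人")
print(word_error_rate(reference, hypothesis))  # 0.2 (one substitution in five tokens)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.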
Known Limitations: this model has been trained on short audio clips and may struggle with audio that is longer than 10 seconds.
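One possible workaround for longer recordings, offered as an untested sketch rather than a recommended method, is to split the waveform into pieces of at most 10 seconds and transcribe each piece separately. The helper below only computes the sample ranges. Its name `chunk_bounds` and its parameters are hypothetical, and the 16 kHz default is Whisper's expected input sample rate.

```python
from typing import List, Tuple


def chunk_bounds(
    num_samples: int,
    sample_rate: int = 16000,
    max_seconds: float = 10.0,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices covering the audio in fixed-size chunks."""
    chunk_size = int(sample_rate * max_seconds)
    if chunk_size <= 0:
        raise ValueError("sample_rate * max_seconds must be at least one sample")
    # The final chunk is clipped to the end of the audio and may be shorter.
    return [
        (start, min(start + chunk_size, num_samples))
        for start in range(0, num_samples, chunk_size)
    ]


# A 25-second clip at 16 kHz yields two full chunks and one 5-second remainder.
print(chunk_bounds(25 * 16000))
# [(0, 160000), (160000, 320000), (320000, 400000)]
```

Each slice `audio[start:end]` of a 1-D waveform could then be transcribed on its own and the results joined. Fixed-length cuts can split a word in half, so cutting at pauses would likely give better results.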
Example code
The following script downloads the model and launches a Gradio demo for transcribing audio:
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import gradio as gr

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
WHISPER_SAMPLE_RATE = 16000

# The processor comes from the base Whisper medium checkpoint;
# the fine-tuned weights come from the Teochew checkpoint.
processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
model = WhisperForConditionalGeneration.from_pretrained(
    "efficient-nlp/teochew-whisper-medium"
).to(DEVICE)


def preprocess_audio(audio_path: str) -> torch.Tensor:
    """Load an audio file and return a mono waveform at 16 kHz."""
    audio, sample_rate = torchaudio.load(audio_path)

    # Resample if necessary
    if sample_rate != WHISPER_SAMPLE_RATE:
        resampler = torchaudio.transforms.Resample(
            orig_freq=sample_rate, new_freq=WHISPER_SAMPLE_RATE
        )
        audio = resampler(audio)

    # Convert to mono
    if audio.shape[0] > 1:
        audio = torch.mean(audio, dim=0)

    return audio.squeeze()


def transcribe(audio_path: str) -> str:
    """Transcribe a single audio file to text."""
    audio_input = preprocess_audio(audio_path)

    # Convert the waveform to log-mel input features
    input_features = processor(
        audio_input,
        sampling_rate=WHISPER_SAMPLE_RATE,
        return_tensors="pt",
        language="Chinese",
    ).input_features.to(DEVICE)

    # Force the decoder prompt to Chinese transcription
    forced_decoder_ids = processor.get_decoder_prompt_ids(
        language="Chinese", task="transcribe"
    )
    predicted_ids = model.generate(
        input_features, forced_decoder_ids=forced_decoder_ids
    )

    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription


iface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),
    outputs="text",
    title="Teochew Speech Recognition",
)

iface.launch()