---
license: mit
---

# Teochew Whisper Medium

This model is a version of the Whisper medium model fine-tuned to recognize Teochew (潮州话), a language in the Min Nan family spoken in southern China.

For detailed documentation of how this model was trained, please refer to this video: https://www.youtube.com/watch?v=JH_78KmP4Zk

## Training Data

The model was fine-tuned on approximately 35 hours of audio drawn from Teochew-language movies, TV shows, and comedies.

## Evaluation Metrics

On our private test set, the model obtained the following Word Error Rates (WER); lower is better:

- Careful Speech: 0.31
- Conversational Speech: 0.68
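
For readers unfamiliar with the metric, WER is the edit distance (substitutions, deletions, and insertions) between the reference and the predicted token sequence, divided by the number of reference tokens. The sketch below is only an illustration of that definition, not the evaluation script used for the numbers above. The function names are hypothetical, and treating each Chinese character as one token is an assumption; the tokenization used for the reported scores is not stated here.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the first i-1 reference tokens
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def word_error_rate(ref_tokens: Sequence[str], hyp_tokens: Sequence[str]) -> float:
    """Edit distance normalized by the reference length."""
    if len(ref_tokens) == 0:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# Assumption: one token per character.
reference = list("我是潮州人")
hypothesis = list("我是潮汕人")
print(word_error_rate(reference, hypothesis))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.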

Known limitations: this model was trained on short audio clips and may struggle with audio longer than 10 seconds.
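
One possible workaround, sketched here under assumptions, is to split longer recordings into pieces of at most 10 seconds and transcribe each piece separately. The helper `split_into_chunks` is hypothetical and not part of this model or of any library. It accepts any one-dimensional sequence that supports `len` and slicing, such as a list of samples or the tensor returned by `preprocess_audio` in the example script. Fixed-length splitting can cut a word in half, so splitting at silences would likely give better results.

```python
def split_into_chunks(samples, sample_rate: int = 16000, max_seconds: float = 10.0):
    """Split a 1-D sample sequence into consecutive chunks of at most max_seconds."""
    chunk_len = int(sample_rate * max_seconds)
    if chunk_len <= 0:
        raise ValueError("sample_rate * max_seconds must be at least one sample")
    # The last chunk may be shorter than chunk_len.
    return [
        samples[start:start + chunk_len]
        for start in range(0, len(samples), chunk_len)
    ]


# Toy example: 25 "samples" at 1 Hz become chunks of 10, 10, and 5 samples.
chunks = split_into_chunks(list(range(25)), sample_rate=1, max_seconds=10)
print([len(chunk) for chunk in chunks])
```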

## Example code

The following script downloads the model and starts a Gradio demo that runs it:

```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import gradio as gr

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
WHISPER_SAMPLE_RATE = 16000

processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
model = WhisperForConditionalGeneration.from_pretrained(
    "efficient-nlp/teochew-whisper-medium"
).to(DEVICE)


def preprocess_audio(audio_path: str) -> torch.Tensor:
    audio, sample_rate = torchaudio.load(audio_path)
    # Resample if necessary
    if sample_rate != WHISPER_SAMPLE_RATE:
        resampler = torchaudio.transforms.Resample(
            orig_freq=sample_rate, new_freq=WHISPER_SAMPLE_RATE
        )
        audio = resampler(audio)
    # Convert to mono
    if audio.shape[0] > 1:
        audio = torch.mean(audio, dim=0)
    return audio.squeeze()


def transcribe(audio_path: str) -> str:
    audio_input = preprocess_audio(audio_path)
    input_features = processor(
        audio_input,
        sampling_rate=WHISPER_SAMPLE_RATE,
        return_tensors="pt",
        language="Chinese",
    ).input_features.to(DEVICE)

    forced_decoder_ids = processor.get_decoder_prompt_ids(
        language="Chinese", task="transcribe"
    )
    predicted_ids = model.generate(
        input_features, forced_decoder_ids=forced_decoder_ids
    )
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription


iface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),
    outputs="text",
    title="Teochew Speech Recognition",
)
iface.launch()
```