---
license: mit
---

# Teochew Whisper Medium

This model is a fine-tuned version of the Whisper medium model for recognizing the Teochew language (潮州话), a language in the Min Nan family spoken in southern China.

For detailed documentation of how this model was trained, please refer to this video: https://www.youtube.com/watch?v=JH_78KmP4Zk

## Training Data

The model was fine-tuned on approximately 35 hours of audio data derived from Teochew-language movies, TV shows, and comedies.

## Evaluation Metrics

On our private test set, we obtained the following Word Error Rate (WER) metrics:

- Careful Speech: 0.31
- Conversational Speech: 0.68

Known Limitations: this model has been trained on short audio clips and may struggle with audio that is longer than 10 seconds.

## Example code

The following script downloads the model and starts a Gradio demo that runs it:

```
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import gradio as gr

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
WHISPER_SAMPLE_RATE = 16000

processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
model = WhisperForConditionalGeneration.from_pretrained(
    "efficient-nlp/teochew-whisper-medium"
).to(DEVICE)


def preprocess_audio(audio_path: str) -> torch.Tensor:
    audio, sample_rate = torchaudio.load(audio_path)
    # Resample if necessary
    if sample_rate != WHISPER_SAMPLE_RATE:
        resampler = torchaudio.transforms.Resample(
            orig_freq=sample_rate, new_freq=WHISPER_SAMPLE_RATE
        )
        audio = resampler(audio)
    # Convert to mono
    if audio.shape[0] > 1:
        audio = torch.mean(audio, dim=0)
    return audio.squeeze()


def transcribe(audio_path: str) -> str:
    audio_input = preprocess_audio(audio_path)
    input_features = processor(
        audio_input,
        sampling_rate=WHISPER_SAMPLE_RATE,
        return_tensors="pt",
        language="Chinese",
    ).input_features.to(DEVICE)

    forced_decoder_ids = processor.get_decoder_prompt_ids(
        language="Chinese", task="transcribe"
    )
    predicted_ids = model.generate(
        input_features, forced_decoder_ids=forced_decoder_ids
    )
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription


iface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),
    outputs="text",
    title="Teochew Speech Recognition",
)
iface.launch()
```
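
Because the model may struggle with audio longer than 10 seconds, one possible workaround is to cut a long recording into shorter windows and transcribe each window separately. The sketch below only computes the window boundaries. The helper name `split_into_chunks` and its parameters are hypothetical and not part of this model or of any library. Fixed-length windows are an assumption made for simplicity; they can cut a word in half, so splitting at silences may work better in practice.

```python
def split_into_chunks(num_samples, sample_rate=16000, max_seconds=10.0):
    """Return (start, end) sample indices that cover the audio in
    consecutive windows of at most `max_seconds` each.

    Hypothetical helper: fixed-length windows, no overlap.
    """
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    chunk_size = int(sample_rate * max_seconds)
    if chunk_size <= 0:
        raise ValueError("sample_rate * max_seconds must be at least 1 sample")
    return [
        (start, min(start + chunk_size, num_samples))
        for start in range(0, num_samples, chunk_size)
    ]


# A 25-second clip at 16 kHz becomes two full windows and one shorter one.
boundaries = split_into_chunks(25 * 16000)
print(boundaries)
```

With the 1-D waveform returned by `preprocess_audio`, each window would be `audio[start:end]`. Each slice can then go through the same processor and `model.generate` steps as in `transcribe`, and the resulting strings can be concatenated.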
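
For readers who want to compare their own results against the WER figures reported above, here is a minimal sketch of the standard WER definition: the token-level edit distance (substitutions, deletions, insertions) divided by the number of reference tokens. The function name `word_error_rate` is hypothetical. How the private test set was tokenized is not stated, so the whitespace and per-character splits used here are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Edit distance between two token lists, divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis tokens.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(reference)


# One substitution and one deletion against four reference tokens.
print(word_error_rate("a b c d".split(), "a x c".split()))  # 0.5

# Assumption: text without spaces is split into characters instead.
print(word_error_rate(list("潮州话"), list("潮州话")))  # 0.0
```

Under this definition, a score of 0.31 corresponds to roughly 31 token errors per 100 reference tokens.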