license: mit

Teochew Whisper Medium

This model is a version of the Whisper medium model fine-tuned to recognize the Teochew language (潮州话), a language in the Min Nan family spoken in southern China.

For detailed documentation of how this model was trained, please refer to this video: https://www.youtube.com/watch?v=JH_78KmP4Zk

Training Data

The model was fine-tuned on approximately 35 hours of audio derived from Teochew-language movies, TV shows, and comedies.

Evaluation Metrics

On our private test set, we obtained the following Word Error Rate (WER) metrics:

  • Careful Speech: 0.31
  • Conversational Speech: 0.68
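For readers unfamiliar with the metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn the model output into the reference transcript, divided by the number of words in the reference. A WER of 0.31 therefore corresponds to roughly 31 errors per 100 reference words. The sketch below is a minimal illustration of that definition and is not the evaluation script used for this model; the function name `word_error_rate` is hypothetical, and how the transcripts were segmented into words is not stated here, so the function takes pre-tokenized lists.

```python
def word_error_rate(reference, hypothesis):
    """Edit distance between two token lists, divided by the reference length.

    Illustrative only: the tokenization used for the reported scores is
    not documented, so callers must supply already-segmented tokens.
    """
    if not reference:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j tokens of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion (reference token missing)
                    curr[j - 1] + 1,     # insertion (extra hypothesis token)
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1] / len(reference)


# Placeholder tokens: one substitution and one deletion over four words.
score = word_error_rate(["w1", "w2", "w3", "w4"], ["w1", "xx", "w3"])
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.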

Known Limitations: this model was trained on short audio clips and may struggle with audio longer than 10 seconds.
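One possible workaround, offered here only as an untested assumption, is to split a long recording into pieces of at most 10 seconds and run the model on each piece separately. The sketch below only computes the sample ranges; the helper name `chunk_bounds` and its parameters are hypothetical, and the 16 kHz default is the sample rate Whisper expects. Fixed-length splitting can cut through the middle of a word, so splitting at silences would likely give better results.

```python
def chunk_bounds(num_samples, sample_rate=16000, max_seconds=10.0):
    """Return (start, end) sample index pairs covering the whole waveform.

    Each pair spans at most `max_seconds` of audio. This is naive
    fixed-length splitting, not a method recommended by the model card.
    """
    if num_samples < 0 or sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("arguments must be non-negative / positive")

    # Number of samples per chunk; at least 1 so the range step is valid.
    chunk_len = max(1, int(sample_rate * max_seconds))
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


# A 25-second recording at 16 kHz yields two full chunks and one 5-second tail.
bounds = chunk_bounds(25 * 16000)
```

Each `(start, end)` pair can be used to slice a 1-D waveform, for example `waveform[start:end]`, before passing the slice to the model.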

Example code

The following script downloads the model and launches a Gradio demo for transcribing audio:

import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import gradio as gr

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
WHISPER_SAMPLE_RATE = 16000

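# The feature extractor and tokenizer come from the base Whisper checkpoint;
# the fine-tuned weights come from the Teochew model repository.
processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
model = WhisperForConditionalGeneration.from_pretrained(
    "efficient-nlp/teochew-whisper-medium"
).to(DEVICE)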


def preprocess_audio(audio_path: str) -> torch.Tensor:
    """Load an audio file as a mono waveform at WHISPER_SAMPLE_RATE."""
    audio, sample_rate = torchaudio.load(audio_path)
    # Resample if necessary
    if sample_rate != WHISPER_SAMPLE_RATE:
        resampler = torchaudio.transforms.Resample(
            orig_freq=sample_rate, new_freq=WHISPER_SAMPLE_RATE
        )
        audio = resampler(audio)
    # Convert to mono
    if audio.shape[0] > 1:
        audio = torch.mean(audio, dim=0)
    return audio.squeeze()


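def transcribe(audio_path: str) -> str:
    """Transcribe a single audio file and return the decoded text."""
    audio_input = preprocess_audio(audio_path)
    # Convert the waveform into the input features the model expects
    input_features = processor(
        audio_input,
        sampling_rate=WHISPER_SAMPLE_RATE,
        return_tensors="pt",
        language="Chinese",
    ).input_features.to(DEVICE)

    # Generate token IDs, then decode them to text without special tokens
    predicted_ids = model.generate(input_features)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription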


iface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),
    outputs="text",
    title="Teochew Speech Recognition",
)
iface.launch()