---
license: mit
---
# Teochew Whisper Medium
This model is a fine-tuned version of the Whisper medium model that recognizes Teochew (潮州话), a language in the Min Nan family spoken in southern China.
For detailed documentation of how this model was trained, please refer to this video: https://www.youtube.com/watch?v=JH_78KmP4Zk
## Training Data
The model was fine-tuned on approximately 35 hours of audio data derived from Teochew-language movies, TV shows, and comedies.
## Evaluation Metrics
On our private test set, we obtained the following Word Error Rate (WER) metrics:
- Careful Speech: 0.31
- Conversational Speech: 0.68
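
For readers unfamiliar with the metric, the sketch below shows one common way to compute WER: the token-level edit distance between reference and hypothesis, divided by the number of reference tokens. It is an illustration only. The function names are hypothetical, and how the transcripts in the private test set were tokenized is not stated here, so the example simply assumes pre-split token lists.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the first i-1 reference tokens
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    return edit_distance(reference, hypothesis) / len(reference)


# One substitution and one deletion against a 4-token reference.
print(word_error_rate(["a", "b", "c", "d"], ["a", "x", "c"]))  # 0.5
```

Under this definition, a score of 0.31 corresponds to roughly 31 errors per 100 reference tokens, and the value can exceed 1.0 when the hypothesis contains many insertions.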
Known limitations: this model was trained on short audio clips and may struggle with audio longer than 10 seconds.
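
One possible workaround for longer recordings is to cut the waveform into pieces of at most 10 seconds and transcribe each piece separately. The sketch below is a naive illustration, not part of the original training or evaluation setup: `split_into_chunks` is a hypothetical helper, and fixed-length cuts can split a word in half, so splitting at silences would likely give better results.

```python
from typing import List, Sequence


def split_into_chunks(
    samples: Sequence[float],
    sample_rate: int = 16000,
    max_seconds: float = 10.0,
) -> List[Sequence[float]]:
    """Split a 1-D waveform into consecutive chunks of at most max_seconds."""
    chunk_size = int(sample_rate * max_seconds)
    if chunk_size <= 0:
        raise ValueError("sample_rate * max_seconds must be at least 1 sample")
    # Plain slicing, so this works for lists and other 1-D sliceable sequences.
    return [
        samples[start:start + chunk_size]
        for start in range(0, len(samples), chunk_size)
    ]


# 25 seconds of silence at 16 kHz becomes chunks of 10 s, 10 s, and 5 s.
waveform = [0.0] * (25 * 16000)
chunks = split_into_chunks(waveform)
print([len(chunk) / 16000 for chunk in chunks])  # [10.0, 10.0, 5.0]
```

Each chunk can then be passed through the model on its own and the resulting transcriptions joined in order.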
## Example code
The following script downloads the model and starts a demo using Gradio to run the model:
```
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import gradio as gr

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
WHISPER_SAMPLE_RATE = 16000

processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
model = WhisperForConditionalGeneration.from_pretrained(
    "efficient-nlp/teochew-whisper-medium"
).to(DEVICE)


def preprocess_audio(audio_path: str) -> torch.Tensor:
    audio, sample_rate = torchaudio.load(audio_path)
    # Resample if necessary
    if sample_rate != WHISPER_SAMPLE_RATE:
        resampler = torchaudio.transforms.Resample(
            orig_freq=sample_rate, new_freq=WHISPER_SAMPLE_RATE
        )
        audio = resampler(audio)
    # Convert to mono
    if audio.shape[0] > 1:
        audio = torch.mean(audio, dim=0)
    return audio.squeeze()


def transcribe(audio_path: str) -> str:
    audio_input = preprocess_audio(audio_path)
    input_features = processor(
        audio_input,
        sampling_rate=WHISPER_SAMPLE_RATE,
        return_tensors="pt",
        language="Chinese",
    ).input_features.to(DEVICE)
    forced_decoder_ids = processor.get_decoder_prompt_ids(
        language="Chinese", task="transcribe"
    )
    predicted_ids = model.generate(
        input_features, forced_decoder_ids=forced_decoder_ids
    )
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription


iface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),
    outputs="text",
    title="Teochew Speech Recognition",
)
iface.launch()
```