File size: 2,529 Bytes
94c1ed8
 
 
e506013
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
---
license: mit
---

# Teochew Whisper Medium

This model is a fine-tuned version of the Whisper medium model to recognize the Teochew language (潮州话), a language in the Min Nan family spoken in southern China.

For a detailed documentation of how this model was trained, please refer to this video: https://www.youtube.com/watch?v=JH_78KmP4Zk

## Training Data

The model was fine-tuned on approximately 35 hours of audio data derived from Teochew language movies, TV shows, and comedies.

## Evaluation Metrics

On our private test set, we obtained the following Word Error Rate (WER) metrics:

- Careful Speech: 0.31
- Conversational Speech: 0.68

Known Limitations: this model has been trained on short audio clips and may struggle with audio that is longer than 10 seconds.

## Example code

The following script downloads the model and starts a demo using Gradio to run the model:

```
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import gradio as gr

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
WHISPER_SAMPLE_RATE = 16000

processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
model = WhisperForConditionalGeneration.from_pretrained(
    "efficient-nlp/teochew-whisper-medium"
).to(DEVICE)


def preprocess_audio(audio_path: str) -> torch.Tensor:
    audio, sample_rate = torchaudio.load(audio_path)
    # Resample if necessary
    if sample_rate != WHISPER_SAMPLE_RATE:
        resampler = torchaudio.transforms.Resample(
            orig_freq=sample_rate, new_freq=WHISPER_SAMPLE_RATE
        )
        audio = resampler(audio)
    # Convert to mono
    if audio.shape[0] > 1:
        audio = torch.mean(audio, dim=0)
    return audio.squeeze()


def transcribe(audio_path: str) -> str:
    audio_input = preprocess_audio(audio_path)
    input_features = processor(
        audio_input,
        sampling_rate=WHISPER_SAMPLE_RATE,
        return_tensors="pt",
        language="Chinese",
    ).input_features.to(DEVICE)

    forced_decoder_ids = processor.get_decoder_prompt_ids(
        language="Chinese", task="transcribe"
    )
    predicted_ids = model.generate(
        input_features, forced_decoder_ids=forced_decoder_ids
    )
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription


iface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(type="filepath"),
    outputs="text",
    title="Teochew Speech Recognition",
)
iface.launch()
```