	---
	license: mit
	---

	# Teochew Whisper Medium

	This model is a version of the Whisper medium model fine-tuned to recognize Teochew (潮州话), a language in the Min Nan family spoken in southern China.

	For detailed documentation of how this model was trained, see this video: https://www.youtube.com/watch?v=JH_78KmP4Zk

	## Training Data

	The model was fine-tuned on approximately 35 hours of audio data derived from Teochew language movies, TV shows, and comedies.

	## Evaluation Metrics

	On our private test set, the model obtained the following Word Error Rate (WER) scores (lower is better):

	- Careful Speech: 0.31
	- Conversational Speech: 0.68
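
For readers unfamiliar with the metric, WER is the edit distance between the reference and the predicted transcript, divided by the length of the reference. The sketch below illustrates that calculation only; it is not the evaluation script used for the numbers above. How the transcripts were tokenized is not stated here, so treating each non-whitespace character as one token is an assumption, and the function names `edit_distance` and `error_rate` are made up for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference token
                    curr[j - 1] + 1,     # insertion of a hypothesis token
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length.

    Assumption: one token per non-whitespace character.
    """
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


# One character missing from a four-character reference.
print(error_rate("我食饱了", "我食了"))  # 0.25
```

Under this definition, a score of 0.31 would mean roughly one error for every three reference tokens.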

	Known limitations: the model was trained on short audio clips and may struggle with audio longer than 10 seconds.
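
One possible workaround for longer recordings is to split the waveform into pieces of at most 10 seconds and transcribe each piece separately. The helper below is a minimal sketch of that idea, not something shipped with this model: `chunk_bounds` is a hypothetical name, and the fixed-length split is an assumption chosen for simplicity.

```python
def chunk_bounds(num_samples: int, sample_rate: int = 16000, max_seconds: float = 10.0):
    """Return (start, end) sample indices that cover the audio in
    consecutive pieces of at most `max_seconds` each."""
    if num_samples < 0 or sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("num_samples must be >= 0; sample_rate and max_seconds must be > 0")
    chunk = int(sample_rate * max_seconds)  # samples per piece
    return [
        (start, min(start + chunk, num_samples))
        for start in range(0, num_samples, chunk)
    ]


# A 25-second clip at 16 kHz is covered by two full pieces and one 5-second remainder.
print(chunk_bounds(25 * 16000))
# [(0, 160000), (160000, 320000), (320000, 400000)]
```

Each slice `audio[start:end]` of a mono waveform could then be transcribed on its own and the texts joined. Hard cuts like these can land in the middle of a word, so splitting at pauses would likely give better results.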

	## Example code

	The following script downloads the model and launches a Gradio demo for transcribing audio:

	```
	import torch
	import torchaudio
	from transformers import WhisperProcessor, WhisperForConditionalGeneration
	import gradio as gr

	DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	WHISPER_SAMPLE_RATE = 16000  # Whisper expects 16 kHz audio

	# The processor comes from the base Whisper checkpoint; the weights are the fine-tuned ones.
	processor = WhisperProcessor.from_pretrained("openai/whisper-medium")
	model = WhisperForConditionalGeneration.from_pretrained(
	    "efficient-nlp/teochew-whisper-medium"
	).to(DEVICE)


	def preprocess_audio(audio_path: str) -> torch.Tensor:
	    """Load an audio file as a 1-D mono waveform at 16 kHz."""
	    audio, sample_rate = torchaudio.load(audio_path)
	    # Resample if necessary
	    if sample_rate != WHISPER_SAMPLE_RATE:
	        resampler = torchaudio.transforms.Resample(
	            orig_freq=sample_rate, new_freq=WHISPER_SAMPLE_RATE
	        )
	        audio = resampler(audio)
	    # Convert to mono by averaging the channels
	    if audio.shape[0] > 1:
	        audio = torch.mean(audio, dim=0)
	    return audio.squeeze()


	def transcribe(audio_path: str) -> str:
	    """Transcribe a single audio file and return the predicted text."""
	    audio_input = preprocess_audio(audio_path)
	    input_features = processor(
	        audio_input,
	        sampling_rate=WHISPER_SAMPLE_RATE,
	        return_tensors="pt",
	        language="Chinese",
	    ).input_features.to(DEVICE)

	    predicted_ids = model.generate(input_features)
	    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
	    return transcription


	# Simple web UI: upload or record audio, get the transcription back as text.
	iface = gr.Interface(
	    fn=transcribe,
	    inputs=gr.Audio(type="filepath"),
	    outputs="text",
	    title="Teochew Speech Recognition",
	)
	iface.launch()
	```