metadata
license: apache-2.0
language:
- th
base_model: biodatlab/whisper-th-medium-combined
tags:
- whisper
- Pytorch
Whisper-th-medium-ct2
whisper-th-medium-ct2 is the CTranslate2 format of biodatlab/whisper-th-medium-combined. It is compatible with WhisperX and faster-whisper, which enables:
- 🤏 Half the size of the original Hugging Face format.
- ⚡️ Batched inference for 70x real-time transcription.
- 🪶 A faster-whisper backend, requiring <8GB GPU memory with beam_size=5.
- 🎯 Accurate word-level timestamps using wav2vec2 alignment.
- 👯‍♂️ Multispeaker ASR using speaker diarization (includes speaker ID labels).
- 🗣️ VAD preprocessing, reducing hallucinations and allowing batching with no WER degradation.
Usage
!pip install git+https://github.com/m-bain/whisperx.git
import whisperx
import time
# Settings
device = "cuda"
audio_file = "audio.mp3"
batch_size = 16
compute_type = "float16"
"""
Your Hugging Face token for the Diarization model is required.
Additionally, you need to accept the terms and conditions before use.
Please visit the model page here.
https://huggingface.co/pyannote/segmentation-3.0
"""
HF_TOKEN = ""
# Load the model and transcribe the audio
model = whisperx.load_model("Thaweewat/whisper-th-medium-ct2", device, compute_type=compute_type)
st_time = time.time()
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
# Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
# Combine the segment texts into one plain-text string if needed
combined_text = ' '.join(segment['text'] for segment in result['segments'])
print(f"Response time: {time.time() - st_time} seconds")
print(diarize_segments)
print(result)
print(combined_text)
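If a speaker-labeled transcript is more useful than the raw `result` dictionary, the segments can be post-processed with plain Python. The sketch below is not part of WhisperX: `format_speaker_transcript` and the `unknown_label` parameter are hypothetical names introduced here. It assumes each entry in `result['segments']` carries `start`, `end`, and `text` keys, plus a `speaker` key (such as `SPEAKER_00`) when diarization found an overlapping speaker turn.

```python
def format_speaker_transcript(result, unknown_label="UNKNOWN"):
    """Merge consecutive same-speaker segments into labeled, timestamped lines.

    Hypothetical helper, not a WhisperX function. Assumes `result` looks like
    the dictionary returned after whisperx.assign_word_speakers.
    """
    turns = []
    for segment in result.get("segments", []):
        # Segments without an overlapping diarization turn may lack "speaker".
        speaker = segment.get("speaker", unknown_label)
        text = segment.get("text", "").strip()
        if not text:
            continue
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker as the previous turn: extend it.
            turns[-1]["text"] += " " + text
            turns[-1]["end"] = segment["end"]
        else:
            turns.append({
                "speaker": speaker,
                "start": segment["start"],
                "end": segment["end"],
                "text": text,
            })
    return [
        f'[{t["start"]:.2f}-{t["end"]:.2f}] {t["speaker"]}: {t["text"]}'
        for t in turns
    ]
```

With the script above, `print("\n".join(format_speaker_transcript(result)))` would print one line per speaker turn instead of one space-joined string.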