
Introduction

Training data

| VSV-1100 | T2S* | CMV14-vi | VIVOS | VLSP2021 | Total |
|---|---|---|---|---|---|
| 1100 hours | 11 hours | 3.04 hours | 13.94 hours | 180 hours | 1308 hours |

* We use a text-to-speech model to generate sentences containing words that do not appear in our dataset.
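As a quick check on the training-data table, the per-source hours can be summed and each source's share of the total computed. This is a minimal sketch: the dictionary only restates the figures from the table, and the variable names are illustrative.

```python
# Hours per training source, copied from the training-data table.
hours = {
    "VSV-1100": 1100,
    "T2S": 11,
    "CMV14-vi": 3.04,
    "VIVOS": 13.94,
    "VLSP2021": 180,
}

# Exact sum of the listed sources, rounded to two decimals.
total = round(sum(hours.values()), 2)

# Share of the total contributed by each source, in percent.
shares = {name: round(100 * h / total, 1) for name, h in hours.items()}

print(f"total: {total} hours")
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The sum comes to 1307.98 hours, which the table reports as 1308.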

WER result

| CMV14-vi | VIVOS | VLSP2020-T1 | VLSP2020-T2 | VLSP2021-T1 | VLSP2021-T2 | Bud500 |
|---|---|---|---|---|---|---|
| 9.79 | 5.74 | 14.15 | 39.25 | 14 | 10.06 | 5.97 |
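For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the authors' evaluation script. The function name is hypothetical, and no text normalization (casing, punctuation) is applied, although a real evaluation may use one. It returns a fraction, whereas the figures in the table appear to be percentages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One of four reference words is missing from the hypothesis.
print(word_error_rate("xin chào các bạn", "xin chào bạn"))  # 0.25
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.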

Usage

Inference

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
# load model and processor
processor = WhisperProcessor.from_pretrained("NhutP/ViWhisper-small")
model = WhisperForConditionalGeneration.from_pretrained("NhutP/ViWhisper-small")
model.config.forced_decoder_ids = None

# load an audio file, resampled to 16 kHz ('path_to_audio' is a placeholder)
array, sampling_rate = librosa.load('path_to_audio', sr=16000)
# convert the waveform to log-mel input features
input_features = processor(array, sampling_rate=sampling_rate, return_tensors="pt").input_features
# generate token ids
predicted_ids = model.generate(input_features)
# decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
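The snippet above transcribes a single array in one pass. Whisper's feature extractor works on 30-second windows, so a longer recording needs to be split before it is passed to the processor. The helper below is a minimal sketch of one naive way to do that with fixed, non-overlapping windows. The function name and its parameters are hypothetical and are not part of this model or of transformers.

```python
from typing import List, Sequence


def split_into_chunks(
    samples: Sequence,
    sampling_rate: int = 16000,
    chunk_seconds: float = 30.0,
) -> List[Sequence]:
    """Split a waveform into consecutive, non-overlapping chunks."""
    if sampling_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("sampling_rate and chunk_seconds must be positive")
    chunk_len = int(sampling_rate * chunk_seconds)
    # Slicing works for lists and NumPy arrays alike; the last chunk
    # may be shorter than chunk_len.
    return [samples[start:start + chunk_len]
            for start in range(0, len(samples), chunk_len)]


# A silent 65-second waveform at 16 kHz stands in for real audio.
fake_audio = [0.0] * (16000 * 65)
chunks = split_into_chunks(fake_audio)
print([len(chunk) / 16000 for chunk in chunks])  # [30.0, 30.0, 5.0]
```

Each chunk could then go through `processor` and `model.generate` as shown above, with the resulting texts joined. Fixed cuts can land in the middle of a word, so the pipeline's `chunk_length_s` option is likely the more robust choice for long audio.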

Use with pipeline

from transformers import pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model="NhutP/ViWhisper-small",
    max_new_tokens=128,
    chunk_length_s=30,
    return_timestamps=False,
    device= '...' # 'cpu' or 'cuda'
) 
# 'path_to_audio' is a placeholder for the path to a 16 kHz audio file
output = pipe('path_to_audio')['text']
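The `device` argument above is left as a placeholder. One possible way to choose it at runtime is sketched below. `pick_device` is a hypothetical helper, not part of transformers; it falls back to the CPU when PyTorch is not installed or no CUDA GPU is visible.

```python
def pick_device() -> str:
    """Return 'cuda' when a CUDA GPU is usable, otherwise 'cpu'."""
    try:
        import torch  # imported lazily so the helper works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"running on: {device}")
```

The result can be passed directly as `device=pick_device()` in the `pipeline(...)` call.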

Citation

@misc{VSV-1100,
    author = {Pham Quang Nhut and Duong Pham Hoang Anh and Nguyen Vinh Tiep},
    title = {VSV-1100: Vietnamese social voice dataset},
    url = {https://github.com/NhutP/VSV-1100},
    year = {2024}
}

If you find our project useful, please give us a star on GitHub: https://github.com/NhutP/ViWhisper

Contact me at: [email protected] (Pham Quang Nhut)

Model size: 242M params (Safetensors, tensor type F32)