metadata
language: vi
datasets:
- vlsp
- vivos
tags:
- audio
- automatic-speech-recognition
license: cc-by-nc-4.0
widget:
- label: VLSP ASR 2020 test T1
src: >-
https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t1_0001-00010.wav
- label: VLSP ASR 2020 test T1
src: >-
https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t1_utt000000042.wav
- label: VLSP ASR 2020 test T2
src: >-
https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t2_0000006682.wav
Wav2Vec2-Base-250h for the Vietnamese language
The base model is pretrained and fine-tuned on 250 hours of the VLSP ASR dataset, using speech audio sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz.
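Because the model expects 16 kHz input, it can help to check a file before transcribing it. The following is a minimal sketch that uses only Python's standard `wave` module and assumes an uncompressed PCM WAV file. The helper name `check_wav_sample_rate` and the constant `EXPECTED_RATE` are illustrative and not part of this model's code.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model card asks for (illustrative constant)


def check_wav_sample_rate(path, expected_rate=EXPECTED_RATE):
    """Return (ok, actual_rate) for an uncompressed PCM WAV file.

    Hypothetical helper: it only reads the WAV header and does not resample.
    """
    with wave.open(path, "rb") as wav_file:
        actual_rate = wav_file.getframerate()
    return actual_rate == expected_rate, actual_rate
```

If the check fails, the audio would need to be resampled to 16 kHz with an audio library of your choice before it is passed to the processor.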
Usage
To transcribe audio files, the model can be used as a standalone acoustic model as follows:
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import soundfile as sf
import torch
# load model and processor (feature extractor + tokenizer)
processor = Wav2Vec2Processor.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")
model = Wav2Vec2ForCTC.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")
# define function to read in a sound file
def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

# read a single test sound file
ds = map_to_array({
    "file": 'audio-test/t1_0001-00010.wav'
})
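# extract input features (pass the sampling rate explicitly so a mismatch is reported)
input_values = processor(ds["speech"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values  # Batch size 1
# retrieve logits (no gradients are needed for inference)
with torch.no_grad():
    logits = model(input_values).logits
# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)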
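The last two steps above (argmax over the logits, then decoding) amount to greedy CTC decoding: consecutive repeated predictions are merged, blank tokens are dropped, and a word delimiter token is turned into a space. The sketch below illustrates that idea in plain Python. It is a simplified illustration, not the `transformers` implementation, and the toy vocabulary, the `|` delimiter, and the function name `ctc_greedy_collapse` are assumptions made for the example.

```python
def ctc_greedy_collapse(token_ids, id_to_token, blank_id, word_delimiter="|"):
    """Collapse a frame-level argmax sequence into text (simplified CTC decoding)."""
    pieces = []
    previous_id = None
    for token_id in token_ids:
        # Keep a token only if it differs from the previous frame and is not the blank.
        if token_id != previous_id and token_id != blank_id:
            pieces.append(id_to_token[token_id])
        previous_id = token_id
    # The word delimiter token marks word boundaries.
    return "".join(pieces).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary; the real model ships its own vocabulary.
toy_vocab = {0: "<pad>", 1: "|", 2: "x", 3: "i", 4: "n", 5: "c", 6: "h", 7: "à", 8: "o"}
frame_ids = [2, 2, 0, 3, 4, 4, 1, 1, 5, 6, 0, 7, 7, 8]
print(ctc_greedy_collapse(frame_ids, toy_vocab, blank_id=0))  # xin chào
```

Note that this is greedy decoding without a language model, whereas the WER figures in this card are reported with a 4-gram LM.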
WER results (with a 4-gram LM):
| VIVOS | VLSP-T1 | VLSP-T2 |
|---|---|---|
| 6.1 | 9.1 | 40.8 |
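For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. The sketch below implements that standard definition in plain Python. It is illustrative only: this card does not state which tool or text normalization produced the figures above, and the function name `word_error_rate` and the example sentences are made up for the demonstration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # previous_row[j] holds the distance between the ref prefix so far and the first j hyp words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# Made-up example: one substitution ("chào" -> "chao") and one deletion ("các").
print(100 * word_error_rate("xin chào các bạn", "xin chao bạn"))  # 50.0
```

Read this way, and assuming the table values are percentages, a score of 6.1 corresponds to roughly six word errors per hundred reference words.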
License
This model is released under the CC-BY-NC-4.0 license. It is therefore freely available for academic purposes or individual research, but commercial use is restricted.