metadata
language: vi
datasets:
  - VLSP 2020 ASR dataset
  - VIVOS
tags:
  - audio
  - automatic-speech-recognition
license: apache-2.0
widget:
  - label: VLSP ASR 2020 test T1
    src: >-
      https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t1_0001-00010.wav
  - label: VLSP ASR 2020 test T1
    src: >-
      https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t1_utt000000042.wav
  - label: VLSP ASR 2020 test T2
    src: >-
      https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t2_0000006682.wav

Wav2Vec2-Base-250h for the Vietnamese language

Facebook's Wav2Vec2

The base model was pretrained and fine-tuned on 250 hours of the VLSP ASR dataset, using speech audio sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz.
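If a recording is not already 16 kHz mono, it can be converted before it is handed to the processor. The snippet below is a minimal sketch of one way to do that, assuming NumPy and SciPy are available; the helper name `to_16k_mono` is hypothetical and not part of this repository or of `transformers`.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16000  # sampling rate the model expects


def to_16k_mono(speech, sr):
    """Down-mix to mono and resample to 16 kHz (hypothetical helper)."""
    speech = np.asarray(speech, dtype=np.float32)
    # soundfile returns multi-channel audio as (frames, channels)
    if speech.ndim == 2:
        speech = speech.mean(axis=1)
    if sr != TARGET_SR:
        # polyphase resampling by the reduced ratio TARGET_SR / sr
        g = gcd(sr, TARGET_SR)
        speech = resample_poly(speech, TARGET_SR // g, sr // g)
    return speech.astype(np.float32)
```

With `soundfile`, this would be called as `speech, sr = sf.read(path)` followed by `speech = to_16k_mono(speech, sr)`.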

Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import soundfile as sf
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")
model = Wav2Vec2ForCTC.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")

# define function to read in sound file
def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

# read a single audio file (expected to be sampled at 16 kHz)
ds = map_to_array({
    "file": 'audio-test/t1_0001-00010.wav'
})

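# extract input features; state the sampling rate explicitly so a mismatch is reported
input_values = processor(ds["speech"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits (no gradients are needed for inference)
with torch.no_grad():
    logits = model(input_values).logits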

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
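The argmax-and-decode step above is greedy CTC decoding: pick the most likely token per frame, merge consecutive repeats, then drop the blank token. The following is a small illustrative sketch of that collapse rule in plain Python. The toy vocabulary, the blank id of 0, and the `|` word delimiter are assumptions made for the example and are not read from this model's tokenizer.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive repeated ids, then drop blanks (greedy CTC rule)."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# Hypothetical toy vocabulary, for illustration only
TOY_VOCAB = {0: "<pad>", 1: "a", 2: "b", 3: "|"}


def ids_to_text(frame_ids, vocab=TOY_VOCAB, blank_id=0, word_delimiter="|"):
    """Turn per-frame argmax ids into a string using the toy vocabulary."""
    tokens = [vocab[i] for i in ctc_greedy_collapse(frame_ids, blank_id)]
    return "".join(tokens).replace(word_delimiter, " ").strip()
```

A blank between two identical ids keeps both of them, which is how CTC represents doubled letters: `[1, 1, 0, 1]` collapses to `[1, 1]`, not `[1]`.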

Result WER (%), decoded with a 4-gram LM:

Test set   VIVOS   VLSP-T1   VLSP-T2
WER (%)    6.1     9.1       40.8
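For reference, word error rate is the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words. The function below is a minimal sketch of that definition in plain Python. It is not the evaluation script behind the figures above, and it applies no text normalization beyond whitespace splitting.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] = edit distance between the ref words seen so far and hyp[:j]
    previous_row = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref)
```

Multiplying the result by 100 gives a percentage comparable in form to the table above.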