---
language: vi
datasets:
- vlsp
- vivos
tags:
- audio
- automatic-speech-recognition
license: cc-by-nc-4.0
widget:
- label: VLSP ASR 2020 test T1
src: https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t1_0001-00010.wav
- label: VLSP ASR 2020 test T1
src: https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t1_utt000000042.wav
- label: VLSP ASR 2020 test T2
src: https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t2_0000006682.wav
---
# Wav2Vec2-Base-250h for Vietnamese
[Facebook's Wav2Vec2](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) base model, pretrained and fine-tuned on 250 hours of the [VLSP ASR dataset](https://vlsp.org.vn/vlsp2020/eval/asr) of 16kHz sampled speech audio. When using the model,
make sure that your speech input is also sampled at 16kHz.
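If your audio was recorded at a different rate, resample it to 16kHz before feeding it to the model. A minimal sketch using `scipy` (the helper name `to_16k` is just for illustration; `librosa` or `torchaudio` work equally well):

```python
import numpy as np
from scipy.signal import resample_poly

def to_16k(speech: np.ndarray, orig_sr: int) -> np.ndarray:
    """Resample a mono waveform to the 16kHz rate the model expects."""
    if orig_sr == 16_000:
        return speech
    # polyphase resampling by the reduced ratio 16000 / orig_sr
    g = np.gcd(orig_sr, 16_000)
    return resample_poly(speech, 16_000 // g, orig_sr // g)

# e.g. one second of 44.1kHz audio becomes 16,000 samples
audio_16k = to_16k(np.zeros(44_100), 44_100)
```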
# Usage
To transcribe audio files the model can be used as a standalone acoustic model as follows:
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import soundfile as sf
import torch

# load processor (feature extractor + tokenizer) and model
processor = Wav2Vec2Processor.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")
model = Wav2Vec2ForCTC.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")

# define function to read in a sound file
def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

# read a sample sound file (16kHz)
ds = map_to_array({
    "file": "audio-test/t1_0001-00010.wav"
})

# tokenize
input_values = processor(ds["speech"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
*Result WER in %, with a 4-gram LM*:

| VIVOS | VLSP-T1 | VLSP-T2 |
|---|---|---|
| 6.1 | 9.1 | 40.8 |
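For reference, WER is the word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. A minimal sketch (the `wer` helper below is illustrative, not the evaluation script used for the table above):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    r, h = ref.split(), hyp.split()
    # dynamic-programming edit distance over word sequences
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[len(r)][len(h)] / len(r)
```

For example, `wer("a b c d", "a x c d")` is 0.25: one substitution over four reference words.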
# License
This model is released under the [CC-BY-NC-4.0](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/CC-BY-NC-SA-4.0.txt) license. It is therefore freely available for academic purposes or individual research, but restricted from commercial use.
# Contact
[email protected]
[![Follow](https://img.shields.io/twitter/follow/nguyenvulebinh?style=social)](https://twitter.com/intent/follow?screen_name=nguyenvulebinh)