---
language: vi
datasets:
- VLSP 2020 ASR dataset
- VIVOS
tags:
- audio
- automatic-speech-recognition
license: apache-2.0
widget:
- label: VLSP ASR 2020 test T1
  src: https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t1_0001-00010.wav
- label: VLSP ASR 2020 test T1
  src: https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t1_utt000000042.wav
- label: VLSP ASR 2020 test T2
  src: https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/audio-test/t2_0000006682.wav
---

# Wav2Vec2-Base-250h for the Vietnamese language

[Facebook's Wav2Vec2](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/)

The base model was pretrained and fine-tuned on 250 hours of the VLSP ASR dataset, using speech audio sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz.
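
If a recording is not already at 16 kHz, it has to be resampled before it is passed to the model. The sketch below is a minimal illustration using NumPy linear interpolation. The helper name `to_16khz_mono` is hypothetical and not part of this repository, and linear interpolation is only a simple stand-in for a proper resampler (a dedicated audio library would normally apply anti-aliasing filtering as well).

```python
import numpy as np

TARGET_SR = 16_000  # sampling rate the model expects


def to_16khz_mono(speech, sr):
    """Return `speech` as a mono float32 array resampled to 16 kHz.

    Hypothetical helper: uses plain linear interpolation, which is only
    a rough stand-in for a band-limited resampler.
    """
    speech = np.asarray(speech, dtype=np.float32)
    # Multi-channel audio is assumed to be shaped (frames, channels);
    # average the channels to get a mono signal.
    if speech.ndim == 2:
        speech = speech.mean(axis=1)
    if sr == TARGET_SR or len(speech) == 0:
        return speech
    # Number of output samples that covers the same duration at 16 kHz.
    n_out = int(round(len(speech) / sr * TARGET_SR))
    # Sample positions (in seconds) of the input and output signals.
    t_in = np.arange(len(speech)) / sr
    t_out = np.arange(n_out) / TARGET_SR
    return np.interp(t_out, t_in, speech).astype(np.float32)
```

Assuming the file is read with `speech, sr = sf.read(path)`, the result of `to_16khz_mono(speech, sr)` can be handed to the processor in place of the raw array.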

# Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import soundfile as sf
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")
model = Wav2Vec2ForCTC.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")

# define function to read in sound file
def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

# read a single test sound file (expected to be sampled at 16 kHz)
ds = map_to_array({
    "file": 'audio-test/t1_0001-00010.wav'
})

# extract input features
input_values = processor(ds["speech"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits (no gradients are needed for inference)
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
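
The final argmax-and-decode step amounts to greedy CTC decoding: take the most likely token for each frame, merge consecutive repeats, then drop the blank token. The pure-Python sketch below illustrates that idea on a toy example. The vocabulary, the blank id, and the function name `greedy_ctc_decode` are made up for illustration and do not reflect the model's actual vocabulary or the tokenizer's implementation.

```python
from itertools import groupby


def greedy_ctc_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse per-frame token ids into text (illustrative sketch only)."""
    # 1. Merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop the blank token, which only separates repeated symbols.
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # 3. Join characters and turn the word delimiter into a space.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Toy vocabulary (assumption): id 0 is the blank, "|" separates words.
toy_vocab = {0: "<pad>", 1: "|", 2: "a", 3: "b"}
frames = [2, 2, 0, 2, 3, 3, 1, 1, 0, 3, 2]
print(greedy_ctc_decode(frames, toy_vocab))  # aab ba
```

Note that the blank between the two runs of `a` is what keeps the doubled letter in the output. This decoding uses no language model.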

*WER results (%) with a 4-gram LM*:

| VIVOS | VLSP-T1 | VLSP-T2 |
|---|---|---|
| 6.1 | 9.1 | 40.8 |
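
For reference, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the hypothesis and the reference, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition. The function name `word_error_rate` is hypothetical, and the text normalization behind the scores in the table is not specified here, so this is not necessarily how those numbers were computed.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("xin chào các bạn", "xin chào bạn")
print(f"{100 * wer:.1f}")  # 25.0
```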