### Model description

[Our models](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) are pre-trained on 13k hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of labeled speech from the [VLSP ASR dataset](https://vlsp.org.vn/vlsp2020/eval/asr), all sampled at 16kHz.

We use the wav2vec2 architecture for the pre-trained model, following the wav2vec 2.0 paper:

>We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.

For the fine-tuning phase, wav2vec2 is trained with Connectionist Temporal Classification (CTC), an algorithm used to train neural networks on sequence-to-sequence problems, mainly automatic speech recognition and handwriting recognition.
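
To make the objective concrete, here is a minimal, self-contained sketch of a CTC loss computation using PyTorch's built-in `nn.CTCLoss` (toy tensors for illustration only, not this model's actual training code):

```python
import torch
import torch.nn as nn

# toy dimensions: T frames, N utterances, C output classes (vocab + CTC blank at index 0)
T, N, C = 50, 2, 32

# per-frame log-probabilities, e.g. what a wav2vec2 CTC head would emit
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)

# target transcripts as class indices (no blanks), plus the true lengths
targets = torch.randint(low=1, high=C, size=(N, 20), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

# CTC marginalizes over every alignment of the target onto the T frames
criterion = nn.CTCLoss(blank=0)
loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```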

| Model | #params | Pre-training data | Fine-tune data |
|---|---|---|---|
| [base](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) | 95M | 13k hours | 250 hours |

A complete ASR system requires two components: an acoustic model and a language model. Here the CTC-fine-tuned wav2vec2 model serves as the acoustic model. For the language model, we provide a [4-gram model](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/blob/main/vi_lm_4grams.bin.zip) trained on 2GB of spoken text.
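
This card does not prescribe a particular decoder for the LM; one common choice is `pyctcdecode` backed by KenLM (requires the `kenlm` package). A sketch under two assumptions: the downloaded `vi_lm_4grams.bin.zip` has been unzipped to `vi_lm_4grams.bin`, and `logits` for one utterance come from the acoustic model as in the usage example further down:

```python
import torch
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")

# vocabulary of the acoustic model, ordered by logit column index
vocab = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

# beam-search decoder that rescores CTC hypotheses with the 4-gram KenLM binary
decoder = build_ctcdecoder(labels, kenlm_model_path="vi_lm_4grams.bin")

# `logits` is the (1, time, vocab) model output for one utterance (see usage example below)
log_probs = torch.log_softmax(logits, dim=-1)[0].detach().numpy()
print(decoder.decode(log_probs))
```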


### Benchmark WER result (%)

| Decoding | [VIVOS](https://ailab.hcmus.edu.vn/vivos) | [VLSP-T1](https://vlsp.org.vn/vlsp2020/eval/asr) | [VLSP-T2](https://vlsp.org.vn/vlsp2020/eval/asr) |
|---|---|---|---|
| without LM | 10.77 | 13.33 | 51.45 |
| with 4-grams LM | 6.15 | 9.11 | 40.81 |
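
As a reminder of the metric, WER counts word substitutions, deletions, and insertions against the reference transcript. A quick way to compute it on your own outputs is the `jiwer` package (an illustration, not the evaluation script behind the table above):

```python
import jiwer

reference = "xin chào các bạn"
hypothesis = "xin chao các bạn"

# WER = (substitutions + deletions + insertions) / number of reference words;
# jiwer returns a fraction, so multiply by 100 for percentages as in the table
print(jiwer.wer(reference, hypothesis) * 100)  # 25.0 - one of four words differs
```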


### Example usage

When using the model, make sure that your speech input is sampled at 16kHz and shorter than 10 seconds. Follow the Colab link below to use a combination of the CTC-wav2vec model and the 4-gram LM; a self-contained local sketch follows the snippet below.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1pVBY46gSoWer2vDf0XmZ6uNV3d8lrMxx?usp=sharing)

```python
# ... (setup elided in this excerpt: imports, model/processor loading, preparing 16kHz `input_values`)
logits = model(input_values).logits

# greedy (argmax) decoding over the per-frame CTC logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
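
For a self-contained version of that flow, here is a minimal greedy-decoding sketch (`audio.wav` is a placeholder path, and resampling via `torchaudio` is our choice here, not something the card mandates):

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# load the fine-tuned acoustic model and its matching processor
processor = Wav2Vec2Processor.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")
model = Wav2Vec2ForCTC.from_pretrained("nguyenvulebinh/wav2vec2-base-vietnamese-250h")

# read an utterance (< 10s) and resample to the 16kHz rate the model expects
waveform, sample_rate = torchaudio.load("audio.wav")  # placeholder path
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

# normalize the raw waveform (first channel) into model inputs
input_values = processor(waveform[0], sampling_rate=16_000, return_tensors="pt").input_values

# forward pass + greedy CTC decoding
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```

Beam-search decoding with the 4-gram LM sketched earlier typically lowers WER further, as the benchmark table shows.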

# License

This model follows the [CC BY-NC-SA 4.0](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h/raw/main/CC-BY-NC-SA-4.0.txt) license. It is therefore freely available for academic purposes or individual research, but restricted from commercial use.