Update README.md
README.md CHANGED

````diff
@@ -3,7 +3,7 @@ language:
   - ca
 library_name: nemo
 datasets:
--
+  - mozilla-foundation/common_voice_9_0
 thumbnail: null
 tags:
   - automatic-speech-recognition
@@ -17,11 +17,6 @@ tags:
   - hf-asr-leaderboard
   - Riva
 license: cc-by-4.0
-widget:
-  - example_title: Librispeech sample 1
-    src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
-  - example_title: Librispeech sample 2
-    src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
 model-index:
 - name: stt_ca_conformer_ctc_large
   results:
@@ -29,16 +24,16 @@ model-index:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
     dataset:
-      name:
-      type:
-      config:
+      name: Mozilla Common Voice 9.0
+      type: mozilla-foundation/common_voice_9_0
+      config: ca
       split: test
       args:
-        language:
+        language: ca
       metrics:
       - name: Test WER
         type: wer
-        value:
+        value: 4.27
 
 ---
 
@@ -93,7 +88,7 @@ asr_model.transcribe(['2086-149220-0033.wav'])
 
 ```shell
 python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
-  pretrained_name="nvidia/
+  pretrained_name="nvidia/stt_ca_conformer_ctc_large"
   audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
 ```
 
@@ -115,40 +110,32 @@ The NeMo toolkit [3] was used for training the models for over several hundred e
 
 The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
 
-The
-
-
-
-  - Fisher Corpus
-  - Switchboard-1 Dataset
-  - WSJ-0 and WSJ-1
-  - National Speech Corpus (Part 1, Part 6)
-  - VCTK
-  - VoxPopuli (EN)
-  - Europarl-ASR (EN)
-  - Multilingual Librispeech (MLS EN) - 2,000 hours subset
-  - Mozilla Common Voice (v7.0)
-
+The vocabulary we use contains 44 characters:
+```python
+['s','e','r','v','i','d','p','o','g','a','m','t','u','l','f','c','z','b','q','n','é',"'",'x','ó','è','h','í','ü','j','à','ï','w','k','y','ç','ú','ò','á','ı','·','ñ','—','–','-']
+```
+
+Full config can be found inside the .nemo files.
+
+The checkpoint of the language model used as the neural rescorer can be found [here](https://ngc.nvidia.com/catalog/models/nvidia:nemo:asrlm_en_transformer_large_ls). You may find more info on how to train and use language models for ASR models here: [ASR Language Modeling](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html)
+
+### Datasets
+
+All the models in this collection are trained on the MCV-9.0 Catalan dataset, which contains around 1,203 hours of training, 28 hours of development, and 27 hours of test speech.
 
 ## Performance
 
 The list of the available models in this collection is shown in the following table. Performances of the ASR models are reported in terms of Word Error Rate (WER%) with greedy decoding.
 
-| Version | Tokenizer
-
-| 1.
-
-
-| Language
-
-|N-gram LM
-|Neural Rescorer(Transformer) | LS Train + LS LM Corpus | 3.4 | 1.7 | N=10, beam_width=128 |
-|N-gram + Neural Rescorer(Transformer)| LS Train + LS LM Corpus | 3.2 | 1.8 | N=10, beam_width=128, n_gram_alpha=1.0, n_gram_beta=1.0 |
+| Version | Tokenizer             | Vocabulary Size | Dev WER | Test WER | Train Dataset     |
+|---------|-----------------------|-----------------|---------|----------|-------------------|
+| 1.11.0  | SentencePiece Unigram | 128             | 4.70    | 4.27     | MCV-9.0 Train set |
+
+You may use language models (LMs) and beam search to improve the accuracy of the models, as reported in the following table.
+
+| Language Model | Test WER | Test WER w/ Oracle LM | Train Dataset     | Settings                                             |
+|----------------|----------|-----------------------|-------------------|------------------------------------------------------|
+| N-gram LM      | 3.77     | 1.54                  | MCV-9.0 Train set | N=6, beam_width=128, ngram_alpha=1.5, ngram_beta=2.0 |
 
 ## Limitations
````
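The WER figures this commit adds to the Performance tables are word-level edit distance divided by the number of reference words. As a quick reference, here is a minimal sketch of that metric, a plain Levenshtein computation written for illustration, not NeMo's own scoring code:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + sub,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

Corpus-level figures like the 4.27% test WER aggregate the same alignment over the whole test set (total edits over total reference words) rather than averaging per-utterance ratios.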