steveheh committed
Commit 3312bff
Parent: f05d889

Update README.md

Files changed (1): README.md (+22 −35)
README.md CHANGED
@@ -3,7 +3,7 @@ language:
 - ca
 library_name: nemo
 datasets:
-- Mozilla Common Voice 9.0
+- mozilla-foundation/common_voice_9_0
 thumbnail: null
 tags:
 - automatic-speech-recognition
@@ -17,11 +17,6 @@ tags:
 - hf-asr-leaderboard
 - Riva
 license: cc-by-4.0
-widget:
-- example_title: Librispeech sample 1
-  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
-- example_title: Librispeech sample 2
-  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
 model-index:
 - name: stt_ca_conformer_ctc_large
   results:
@@ -29,16 +24,16 @@ model-index:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
     dataset:
-      name: LibriSpeech (clean)
-      type: librispeech_asr
-      config: clean
+      name: Mozilla Common Voice 9.0
+      type: mozilla-foundation/common_voice_9_0
+      config: ca
       split: test
       args:
-        language: en
+        language: ca
     metrics:
     - name: Test WER
       type: wer
-      value: 2.2
+      value: 4.27
 
 ---
 
@@ -93,7 +88,7 @@ asr_model.transcribe(['2086-149220-0033.wav'])
 
 ```shell
 python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
-  pretrained_name="nvidia/stt_en_conformer_ctc_large"
+  pretrained_name="nvidia/stt_ca_conformer_ctc_large"
   audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
 ```
 
@@ -115,40 +110,32 @@ The NeMo toolkit [3] was used for training the models for over several hundred e
 
 The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
 
-The checkpoint of the language model used as the neural rescorer can be found [here](https://ngc.nvidia.com/catalog/models/nvidia:nemo:asrlm_en_transformer_large_ls). You may find more info on how to train and use language models for ASR models here: [ASR Language Modeling](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html)
-
-### Datasets
-
-All the models in this collection are trained on a composite dataset (NeMo ASRSET) comprising of several thousand hours of English speech:
-
-- Librispeech 960 hours of English speech
-- Fisher Corpus
-- Switchboard-1 Dataset
-- WSJ-0 and WSJ-1
-- National Speech Corpus (Part 1, Part 6)
-- VCTK
-- VoxPopuli (EN)
-- Europarl-ASR (EN)
-- Multilingual Librispeech (MLS EN) - 2,000 hours subset
-- Mozilla Common Voice (v7.0)
-
-Note: older versions of the model may have trained on smaller set of datasets.
+The vocabulary we use contains 44 characters:
+```python
+['s','e','r','v','i','d','p','o','g','a','m','t','u','l','f','c','z','b','q','n','é',"'",'x','ó','è','h','í','ü','j','à','ï','w','k','y','ç','ú','ò','á','ı','·','ñ','—','–','-']
+```
+
+Full config can be found inside the .nemo files.
+
+The checkpoint of the language model used as the neural rescorer can be found [here](https://ngc.nvidia.com/catalog/models/nvidia:nemo:asrlm_en_transformer_large_ls). You may find more info on how to train and use language models for ASR models here: [ASR Language Modeling](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html)
+
+### Datasets
+
+All the models in this collection are trained on the MCV-9.0 Catalan dataset, which contains around 1,203 hours of training, 28 hours of development, and 27 hours of test speech audio.
 
 ## Performance
 
 The list of the available models in this collection is shown in the following table. Performances of the ASR models are reported in terms of Word Error Rate (WER%) with greedy decoding.
 
-| Version | Tokenizer             | Vocabulary Size | LS test-other | LS test-clean | WSJ Eval92 | WSJ Dev93 | NSC Part 1 | MLS Test | MLS Dev | MCV Test 6.1 | Train Dataset   |
-|---------|-----------------------|-----------------|---------------|---------------|------------|-----------|------------|----------|---------|--------------|-----------------|
-| 1.6.0   | SentencePiece Unigram | 128             | 4.3           | 2.2           | 2.0        | 2.9       | 7.0        | 7.2      | 6.5     | 8.0          | NeMo ASRSET 2.0 |
-
-While deploying with [NVIDIA Riva](https://developer.nvidia.com/riva), you can combine this model with external language models to further improve WER. The WER(%) of the latest model with different language modeling techniques are reported in the following table.
-
-| Language Modeling                      | Training Dataset        | LS test-other | LS test-clean | Comment                                                 |
-|----------------------------------------|-------------------------|---------------|---------------|---------------------------------------------------------|
-| N-gram LM                              | LS Train + LS LM Corpus | 3.5           | 1.8           | N=10, beam_width=128, n_gram_alpha=1.0, n_gram_beta=1.0 |
-| Neural Rescorer (Transformer)          | LS Train + LS LM Corpus | 3.4           | 1.7           | N=10, beam_width=128                                    |
-| N-gram + Neural Rescorer (Transformer) | LS Train + LS LM Corpus | 3.2           | 1.8           | N=10, beam_width=128, n_gram_alpha=1.0, n_gram_beta=1.0 |
+| Version | Tokenizer             | Vocabulary Size | Dev WER | Test WER | Train Dataset     |
+|---------|-----------------------|-----------------|---------|----------|-------------------|
+| 1.11.0  | SentencePiece Unigram | 128             | 4.70    | 4.27     | MCV-9.0 Train set |
+
+You may use language models (LMs) and beam search to improve the accuracy of the models, as reported in the following table.
+
+| Language Model | Test WER | Test WER w/ Oracle LM | Train Dataset     | Settings                                             |
+|----------------|----------|-----------------------|-------------------|------------------------------------------------------|
+| N-gram LM      | 3.77     | 1.54                  | MCV-9.0 Train set | N=6, beam_width=128, ngram_alpha=1.5, ngram_beta=2.0 |
 
 
  ## Limitations
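
The WER figures in the tables above are word-level edit distances divided by the number of reference words. As a point of reference for readers comparing numbers, here is a minimal standalone sketch of that conventional WER computation; it is illustrative only and is not the implementation NeMo uses internally (NeMo ships its own metric utilities):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length.

    Assumes a non-empty, whitespace-tokenized reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words (standard DP rolling row).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(prev[j] + 1,              # deletion
                         cur[j - 1] + 1,           # insertion
                         prev[j - 1] + (r != h))   # substitution (0 if match)
        prev = cur
    return prev[-1] / len(ref)

# One deleted word out of four reference words -> WER 0.25
print(wer("bon dia a tothom", "bon dia tothom"))  # 0.25
```

For example, a reported Test WER of 4.27 corresponds to roughly 4.27 word errors per 100 reference words on the MCV-9.0 Catalan test split.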