steveheh committed
Commit 3312bff
Parent: f05d889

Update README.md

Files changed (1): README.md (+22 −35)
README.md CHANGED
@@ -3,7 +3,7 @@ language:
 - ca
 library_name: nemo
 datasets:
-- Mozilla Common Voice 9.0
+- mozilla-foundation/common_voice_9_0
 thumbnail: null
 tags:
 - automatic-speech-recognition
@@ -17,11 +17,6 @@ tags:
 - hf-asr-leaderboard
 - Riva
 license: cc-by-4.0
-widget:
-- example_title: Librispeech sample 1
-  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
-- example_title: Librispeech sample 2
-  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
 model-index:
 - name: stt_ca_conformer_ctc_large
   results:
@@ -29,16 +24,16 @@ model-index:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
     dataset:
-      name: LibriSpeech (clean)
-      type: librispeech_asr
-      config: clean
+      name: Mozilla Common Voice 9.0
+      type: mozilla-foundation/common_voice_9_0
+      config: ca
       split: test
       args:
-        language: en
+        language: ca
     metrics:
     - name: Test WER
       type: wer
-      value: 2.2
+      value: 4.27
 
 ---
 
@@ -93,7 +88,7 @@ asr_model.transcribe(['2086-149220-0033.wav'])
 
 ```shell
 python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
-  pretrained_name="nvidia/stt_en_conformer_ctc_large"
+  pretrained_name="nvidia/stt_ca_conformer_ctc_large"
   audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
 ```
 
@@ -115,40 +110,32 @@ The NeMo toolkit [3] was used for training the models for over several hundred e
 
 The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
 
-The checkpoint of the language model used as the neural rescorer can be found [here](https://ngc.nvidia.com/catalog/models/nvidia:nemo:asrlm_en_transformer_large_ls). You may find more info on how to train and use language models for ASR models here: [ASR Language Modeling](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html)
-
-### Datasets
-
-All the models in this collection are trained on a composite dataset (NeMo ASRSET) comprising of several thousand hours of English speech:
-
-- Librispeech 960 hours of English speech
-- Fisher Corpus
-- Switchboard-1 Dataset
-- WSJ-0 and WSJ-1
-- National Speech Corpus (Part 1, Part 6)
-- VCTK
-- VoxPopuli (EN)
-- Europarl-ASR (EN)
-- Multilingual Librispeech (MLS EN) - 2,000 hours subset
-- Mozilla Common Voice (v7.0)
-
-Note: older versions of the model may have trained on smaller set of datasets.
+The vocabulary we use contains 44 characters:
+```python
+['s','e','r','v','i','d','p','o','g','a','m','t','u','l','f','c','z','b','q','n','é',"'",'x','ó','è','h','í','ü','j','à','ï','w','k','y','ç','ú','ò','á','ı','·','ñ','—','–','-']
+```
+
+Full config can be found inside the .nemo files.
+
+The checkpoint of the language model used as the neural rescorer can be found [here](https://ngc.nvidia.com/catalog/models/nvidia:nemo:asrlm_en_transformer_large_ls). You may find more info on how to train and use language models for ASR models here: [ASR Language Modeling](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html)
+
+### Datasets
+
+All the models in this collection are trained on the MCV-9.0 Catalan dataset, which contains around 1,203 hours of training, 28 hours of development, and 27 hours of test speech audio.
 
 ## Performance
 
 The list of the available models in this collection is shown in the following table. Performances of the ASR models are reported in terms of Word Error Rate (WER%) with greedy decoding.
 
-| Version | Tokenizer             | Vocabulary Size | LS test-other | LS test-clean | WSJ Eval92 | WSJ Dev93 | NSC Part 1 | MLS Test | MLS Dev | MCV Test 6.1 | Train Dataset   |
-|---------|-----------------------|-----------------|---------------|---------------|------------|-----------|------------|----------|---------|--------------|-----------------|
-| 1.6.0   | SentencePiece Unigram | 128             | 4.3           | 2.2           | 2.0        | 2.9       | 7.0        | 7.2      | 6.5     | 8.0          | NeMo ASRSET 2.0 |
-
-While deploying with [NVIDIA Riva](https://developer.nvidia.com/riva), you can combine this model with external language models to further improve WER. The WER(%) of the latest model with different language modeling techniques are reported in the following table.
-
-| Language Modeling                      | Training Dataset        | LS test-other | LS test-clean | Comment                                                 |
-|----------------------------------------|-------------------------|---------------|---------------|---------------------------------------------------------|
-| N-gram LM                              | LS Train + LS LM Corpus | 3.5           | 1.8           | N=10, beam_width=128, n_gram_alpha=1.0, n_gram_beta=1.0 |
-| Neural Rescorer (Transformer)          | LS Train + LS LM Corpus | 3.4           | 1.7           | N=10, beam_width=128                                    |
-| N-gram + Neural Rescorer (Transformer) | LS Train + LS LM Corpus | 3.2           | 1.8           | N=10, beam_width=128, n_gram_alpha=1.0, n_gram_beta=1.0 |
+| Version | Tokenizer             | Vocabulary Size | Dev WER | Test WER | Train Dataset     |
+|---------|-----------------------|-----------------|---------|----------|-------------------|
+| 1.11.0  | SentencePiece Unigram | 128             | 4.70    | 4.27     | MCV-9.0 Train set |
+
+You may use language models (LMs) and beam search to improve the accuracy of the models, as reported in the following table.
+
+| Language Model | Test WER | Test WER w/ Oracle LM | Train Dataset     | Settings                                             |
+|----------------|----------|-----------------------|-------------------|------------------------------------------------------|
+| N-gram LM      | 3.77     | 1.54                  | MCV-9.0 Train set | N=6, beam_width=128, ngram_alpha=1.5, ngram_beta=2.0 |
 
 
  ## Limitations
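
The WER figures in the tables above are word-level edit distances divided by the number of reference words. As a point of reference for readers comparing numbers, here is a minimal standalone sketch of that conventional WER computation; it is illustrative only and is not the implementation NeMo uses internally (NeMo ships its own metric utilities):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length.

    Assumes a non-empty, whitespace-tokenized reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words (standard DP rolling row).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(prev[j] + 1,              # deletion
                         cur[j - 1] + 1,           # insertion
                         prev[j - 1] + (r != h))   # substitution (0 if match)
        prev = cur
    return prev[-1] / len(ref)

# One deleted word out of four reference words -> WER 0.25
print(wer("bon dia a tothom", "bon dia tothom"))  # 0.25
```

For example, a reported Test WER of 4.27 corresponds to roughly 4.27 word errors per 100 reference words on the MCV-9.0 Catalan test split.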