---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
- google/fleurs
language:
- hy
metrics:
- wer
library_name: nemo
---

# NVIDIA FastConformer-Hybrid Large (arm)

| [![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transducer_CTC-lightgrey#model-badge)](#model-architecture) | [![Model size](https://img.shields.io/badge/Params-115M-lightgrey#model-badge)](#model-architecture) |

This model transcribes speech into Armenian without punctuation and capitalization. It is a "large" version of the FastConformer Transducer-CTC model with approximately 115M parameters. The hybrid model is trained with two losses: Transducer (default) and CTC. See the [model architecture](#model-architecture) section and the [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer) for complete architecture details.

## NVIDIA NeMo: Training

To train, fine-tune, or play with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after installing the latest PyTorch version.

```sh
pip install nemo_toolkit['all']
```

## How to Use this Model

The model is available for use in the NeMo toolkit and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Automatically instantiate the model

```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="mheryerznka/stt_arm_fastconformer_hybrid_large_no_pc")
```

### Transcribing using Python

First, let's get a sample:

```sh
wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=1Np_gMOeSac-Yc8GZ-yrq2xq9wsl7zT1_' -O hy_am-test-26-audio-audio.wav
```

Then simply do:

```python
asr_model.transcribe(['hy_am-test-26-audio-audio.wav'])
```

### Transcribing many audio files

Using Transducer mode inference:

```sh
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
 pretrained_name="mheryerznka/stt_arm_fastconformer_hybrid_large_no_pc" \
 audio_dir=""
```

Using CTC mode inference:

```sh
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
 pretrained_name="mheryerznka/stt_arm_fastconformer_hybrid_large_no_pc" \
 audio_dir="" \
 decoder_type="ctc"
```

### Input

This model accepts 16000 Hz mono-channel audio (wav files) as input.

### Output

This model provides transcribed speech as a string for a given audio sample.

## Model Architecture

FastConformer is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. The model is trained in a multitask setup with a joint Transducer and CTC decoder loss. You can find more details on FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer) and on Hybrid Transducer-CTC training here: [Hybrid Transducer-CTC](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#hybrid-transducer-ctc).

## Training

The NeMo toolkit was used to train the model for 50 epochs on A100 GPUs at Yerevan State University. The model was trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/fastconformer/hybrid_transducer_ctc/fastconformer_hybrid_transducer_ctc_bpe.yaml).
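As a rough illustration of how such a run can be launched (the manifest and tokenizer paths below are placeholders, and the overrides are only a sketch based on the linked base config, so adjust them to your setup):

```sh
# Sketch: launch hybrid Transducer-CTC training with the FastConformer base config.
# All <...> values are placeholders; override names follow the base config linked above.
python [NEMO_GIT_FOLDER]/examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py \
 --config-path=../conf/fastconformer/hybrid_transducer_ctc/ \
 --config-name=fastconformer_hybrid_transducer_ctc_bpe \
 model.train_ds.manifest_filepath=<path to train manifest> \
 model.validation_ds.manifest_filepath=<path to validation manifest> \
 model.tokenizer.dir=<path to BPE tokenizer directory> \
 trainer.devices=-1 \
 trainer.max_epochs=50
```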
The training process also incorporated slimIPL, a language-model-free iterative pseudo-labeling technique based on self-training with intermediate pseudo-labels. The algorithm iteratively refines the model using high-confidence pseudo-labels generated on unlabeled YouTube audio.

### Datasets

The model in this collection is trained on a composite dataset comprising several hundred hours of Armenian speech:

- Mozilla Common Voice 17.0
- Google Fleurs
- 145 hours of unlabeled open-source Armenian audio from YouTube, processed with the [Youtube Audio Processing PL](https://github.com/NVIDIA/NeMo-speech-data-processor/pull/63)

## Performance

The performance of automatic speech recognition models is measured using Word Error Rate (WER). This model was specifically designed to handle the complexities of the Armenian language. The following table summarizes the performance of the model with the RNN-Transducer decoder and the CTC decoder, reported as WER (%). A sketch for measuring WER on your own data appears at the end of this card.

### On data without punctuation and capitalization

| **Vocabulary Size** | **MCV17 TEST RNN-T** | **MCV17 TEST CTC** | **GOOGLE FLEURS TEST RNN-T** | **GOOGLE FLEURS TEST CTC** |
|:-------------------:|:--------------------:|:------------------:|:----------------------------:|:--------------------------:|
| 256 | 9.03 | 10.77 | 7.41 | 9.09 |

## Limitations

Since this model was trained on publicly available speech datasets, its performance might degrade for speech that includes technical terms or vernacular that the model has not been trained on, especially Western Armenian. The model might also perform worse for accented speech.
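## How to Evaluate

One way to measure WER for this model on your own test set is NeMo's evaluation script. The snippet below is only a sketch: the manifest path is a placeholder, and the argument names should be checked against `speech_to_text_eval.py` in your NeMo version.

```sh
# Sketch: transcribe a NeMo-style JSON manifest and report WER against its reference text.
# <path to test manifest> is a placeholder.
python [NEMO_GIT_FOLDER]/examples/asr/speech_to_text_eval.py \
 pretrained_name="mheryerznka/stt_arm_fastconformer_hybrid_large_no_pc" \
 dataset_manifest=<path to test manifest> \
 output_filename=evaluation_transcripts.json
```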