imedennikov committed
Commit 633d071 • 1 Parent(s): 3a2788c
Update README.md

README.md CHANGED
@@ -19,7 +19,7 @@ tags:
 - NeMo
 license: cc-by-4.0
 model-index:
-- name:
+- name: parakeet-tdt_ctc-0.6b-ja
   results:
   - task:
       name: Automatic Speech Recognition
@@ -108,7 +108,7 @@ img {
 | [![Language](https://img.shields.io/badge/Language-ja-lightgrey#model-badge)](#datasets)
 
 
-`
+`parakeet-tdt_ctc-0.6b-ja` is an ASR model that transcribes Japanese speech with punctuation. This model was developed by the [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) team.
 It is an XL version of the Hybrid FastConformer [1] TDT-CTC [2] model (around 0.6B parameters).
 See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer) for complete architecture details.
 
@@ -116,7 +116,7 @@ See the [model architecture](#model-architecture) section and [NeMo documentatio
 
 To train, fine-tune, or play with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed the latest PyTorch version.
 ```
-pip install nemo_toolkit['
+pip install nemo_toolkit['asr']
 ```
 
 ## How to Use this Model
@@ -127,7 +127,7 @@ The model is available for use in the NeMo toolkit [3], and can be used as a pre
 
 ```python
 import nemo.collections.asr as nemo_asr
-asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/
+asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt_ctc-0.6b-ja")
 ```
 
 ### Transcribing using Python
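For context on the hunk above: once the checkpoint is loaded, transcription is typically a single `transcribe()` call. The sketch below is illustrative rather than taken from this README; the audio filename is a placeholder.

```python
import nemo.collections.asr as nemo_asr

# Load the checkpoint referenced in the hunk above.
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt_ctc-0.6b-ja")

# "sample_ja.wav" is a placeholder path; transcribe() accepts a list of audio files.
output = asr_model.transcribe(["sample_ja.wav"])
print(output)  # exact return structure (plain strings vs. hypothesis objects) varies by NeMo version
```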
@@ -142,7 +142,7 @@ By default model uses TDT to transcribe the audio files, to switch decoder to us
 
 ```shell
 python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
-  pretrained_name="nvidia/
+  pretrained_name="nvidia/parakeet-tdt_ctc-0.6b-ja" \
   audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
 ```
 
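The hunk header above mentions switching the decoder from TDT to CTC, but the exact option is truncated in this view. A common way to do this with NeMo hybrid checkpoints is sketched below; the `change_decoding_strategy` usage is an assumption to verify against your installed NeMo version, not something shown in this commit.

```python
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt_ctc-0.6b-ja")

# Assumption: hybrid TDT-CTC models let you select the auxiliary CTC head via
# change_decoding_strategy; check the signature against the NeMo docs for your version.
asr_model.change_decoding_strategy(decoding_cfg=None, decoder_type="ctc")

output = asr_model.transcribe(["sample_ja.wav"])  # "sample_ja.wav" is a placeholder path
print(output)
```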
@@ -160,7 +160,7 @@ This model uses a Hybrid FastConformer-TDT-CTC architecture.
 
 FastConformer [1] is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. You may find more information on the details of FastConformer here: [Fast-Conformer Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer).
 
-TDT (Token-and-Duration Transducer) [2] is a generalization of conventional Transducers by decoupling token and duration predictions. Unlike conventional Transducers which produces a lot of blanks during inference, a TDT model can skip majority of blank predictions by using the duration output (up to 4 frames for this
+TDT (Token-and-Duration Transducer) [2] is a generalization of conventional Transducers that decouples token and duration predictions. Unlike conventional Transducers, which produce many blanks during inference, a TDT model can skip the majority of blank predictions by using the duration output (up to 4 frames for this `parakeet-tdt_ctc-0.6b-ja` model), which brings a significant inference speed-up. Details of TDT can be found here: [Efficient Sequence Transduction by Jointly Predicting Tokens and Durations](https://arxiv.org/abs/2304.06795).
 
 ## Training
 
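To make the frame-skipping idea in the TDT paragraph concrete, here is a small self-contained toy, not NeMo code and not this model's actual decoder: each step emits a token plus a duration, so runs of blank frames are crossed in one jump instead of one blank prediction per frame.

```python
# Toy illustration of TDT-style decoding (not NeMo code, not this model's decoder).
# Each step predicts a token AND a duration; blank-heavy regions are skipped in one jump,
# whereas a conventional Transducer advances one frame per blank prediction.
BLANK = "<blank>"

# Hypothetical per-frame (token, duration) predictions for a 16-frame utterance.
predictions = {
    0: ("こ", 1), 1: ("ん", 1), 2: (BLANK, 4), 6: ("に", 1),
    7: ("ち", 1), 8: ("は", 2), 10: (BLANK, 4), 14: (BLANK, 2),
}

num_frames, t, steps, hyp = 16, 0, 0, []
while t < num_frames:
    token, duration = predictions.get(t, (BLANK, 1))
    if token != BLANK:
        hyp.append(token)
    t += max(duration, 1)  # simplification: real TDT also allows duration 0 (stay on the same frame)
    steps += 1

print("".join(hyp))                                       # -> こんにちは
print(f"{steps} decoding steps for {num_frames} frames")  # fewer steps than frames
```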