Update README.md
README.md CHANGED
@@ -11,13 +11,17 @@ metrics:
 - roc_auc
 pipeline_tag: voice-activity-detection
 library_name: nemo
+tags:
+- multilingual
+- marblenet
 ---
 # Frame-VAD Multilingual MarbleNet v2.0
 
 ## Description
 
-Frame-VAD Multilingual MarbleNet v2.0 is a convolutional neural network for voice activity detection (VAD) that serves as the first step for Speech Recognition and Speaker Diarization. It is a frame-based model that outputs a speech probability for each 20 millisecond frame of the input audio. <br>
+Frame-VAD Multilingual MarbleNet v2.0 is a convolutional neural network for voice activity detection (VAD) that serves as the first step for Speech Recognition and Speaker Diarization. It is a frame-based model that outputs a speech probability for each 20 millisecond frame of the input audio. The model has 91.5K parameters, making it lightweight and efficient for real-time applications. <br>
 To reduce false positive errors (cases where the model incorrectly detects speech when none is present), the model was trained with white noise and real-world noise perturbations. The volume of the training audio was also varied. Additionally, the training data includes non-speech audio samples to help the model distinguish speech from non-speech sounds such as coughing, laughter, and breathing. <br>
+The model supports multiple languages, including Chinese, German, Russian, English, Spanish, and French.
 
 This model is ready for commercial use. <br>
 
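The description in the hunk above says the model emits one speech probability per 20 millisecond frame. A minimal sketch, not taken from the model card, of how such frame-level probabilities can be collapsed into time-stamped speech segments; the 20 ms frame duration comes from the description above, while the 0.5 threshold and the input probabilities are hypothetical:

```python
# Illustrative sketch only (not from the README): merge consecutive
# above-threshold frames into (start_sec, end_sec) speech segments.
FRAME_SEC = 0.02   # 20 ms per frame, per the model description
THRESHOLD = 0.5    # hypothetical decision threshold

def probs_to_segments(speech_probs):
    """Collapse per-frame speech probabilities into speech segments."""
    segments, start = [], None
    for i, p in enumerate(speech_probs):
        if p >= THRESHOLD and start is None:
            start = i * FRAME_SEC                     # segment opens here
        elif p < THRESHOLD and start is not None:
            segments.append((start, i * FRAME_SEC))   # segment closes here
            start = None
    if start is not None:                             # segment runs to the end
        segments.append((start, len(speech_probs) * FRAME_SEC))
    return segments

print(probs_to_segments([0.1, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.9]))
# [(0.02, 0.08), (0.12, 0.16)]
```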
@@ -65,7 +69,7 @@ The model is available for use in the NeMo toolkit [2], and can be used as a pre
 
 ```python
 import nemo.collections.asr as nemo_asr
-
+vad_model = nemo_asr.models.EncDecFrameClassificationModel.from_pretrained(model_name="nvidia/frame_vad_multilingual_marblenet_v2.0")
 ```
 
 ### Perform VAD Inference
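The second hunk only adds the line that loads the checkpoint. A hedged sketch of a possible next step, not taken from the model card: running the loaded model on a 16 kHz mono waveform to obtain per-frame speech probabilities. The `(input_signal, input_signal_length)` call follows the usual NeMo classification-model convention, and the `[non-speech, speech]` logit order, the `audio.wav` path, and the softmax post-processing are assumptions to verify against the NeMo documentation.

```python
# Hedged sketch (not from the README): get per-frame speech probabilities
# from the loaded model. The (input_signal, input_signal_length) forward
# signature is the common NeMo convention; the two-class logit layout
# [non-speech, speech] is an assumption, so check the NeMo docs before use.
import torch
import soundfile as sf
import nemo.collections.asr as nemo_asr

vad_model = nemo_asr.models.EncDecFrameClassificationModel.from_pretrained(
    model_name="nvidia/frame_vad_multilingual_marblenet_v2.0"
)
vad_model.eval()

audio, sr = sf.read("audio.wav", dtype="float32")  # placeholder 16 kHz mono file
assert sr == 16000, "the model expects 16 kHz audio"

with torch.no_grad():
    logits = vad_model(
        input_signal=torch.tensor(audio).unsqueeze(0),   # shape [1, num_samples]
        input_signal_length=torch.tensor([len(audio)]),
    )
speech_probs = torch.softmax(logits, dim=-1)[0, :, 1]    # P(speech) per 20 ms frame
print(speech_probs.shape)
```

The resulting probabilities could then be thresholded and merged into segments as in the earlier sketch.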