nvidia/Frame_VAD_Multilingual_MarbleNet_v2.0

Voice Activity Detection


naymaraq committed 18 days ago

Commit 02863c9 · verified · 1 Parent(s): 304eec6

Update README.md

Files changed (1)

README.md +6 -1


@@ -25,7 +25,12 @@ tags:
 Frame-VAD Multilingual MarbleNet v2.0 is a convolutional neural network for voice activity detection (VAD) that serves as the first step for Speech Recognition and Speaker Diarization. It is a frame-based model that outputs a speech probability for each 20 millisecond frame of the input audio. The model has 91.5K parameters, making it lightweight and efficient for real-time applications. <br>
 To reduce false positive errors — cases where the model incorrectly detects speech when none is present — the model was trained with white noise and real-word noise perturbations. During training, the volume of audios was also varied. Additionally, the training data includes non-speech audio samples to help the model distinguish between speech and non-speech sounds (such as coughing, laughter, and breathing, etc.) <br>
-The model supports multiple languages, including Chinese, German, Russian, English, Spanish, and French.
+**Key Features**
+- Lightweight model with only 91.5K parameters
+- Robust against false positive errors
+- Outputs speech probability for each 20 ms audio frame
+- Multilingual support: Chinese, German, Russian, English, Spanish, and French
 This model is ready for commercial use. <br>
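
The README text in this diff says the model emits one speech probability per 20 ms frame. As a rough illustration of how such output might be consumed, the sketch below merges consecutive above-threshold frames into time segments. The function name `frames_to_segments` and the 0.5 threshold are assumptions made for this example; this is not the model's or NeMo's own post-processing.

```python
from typing import List, Tuple

FRAME_SEC = 0.02  # 20 ms per frame, as stated in the model description


def frames_to_segments(
    probs: List[float], threshold: float = 0.5  # threshold is an assumed value
) -> List[Tuple[float, float]]:
    """Merge consecutive frames with prob >= threshold into (start, end) seconds."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the current speech run, if any
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a speech run begins at this frame
        elif p < threshold and start is not None:
            # the run ended at the previous frame; frame i is non-speech
            segments.append((round(start * FRAME_SEC, 3), round(i * FRAME_SEC, 3)))
            start = None
    if start is not None:
        # the audio ended while still inside a speech run
        segments.append(
            (round(start * FRAME_SEC, 3), round(len(probs) * FRAME_SEC, 3))
        )
    return segments


if __name__ == "__main__":
    print(frames_to_segments([0.1, 0.9, 0.8, 0.2, 0.7]))
```

A real pipeline would likely add smoothing and minimum-duration rules on top of simple thresholding; those are omitted here to keep the sketch short.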