naymaraq committed (verified) · Commit 1913560 · Parent(s): 821512d

Update README.md

Files changed (1):
  1. README.md +6 -2
README.md CHANGED
@@ -11,13 +11,17 @@ metrics:
 - roc_auc
 pipeline_tag: voice-activity-detection
 library_name: nemo
+tags:
+- multilingual
+- marblenet
 ---
 # Frame-VAD Multilingual MarbleNet v2.0
 
 ## Description
 
-Frame-VAD Multilingual MarbleNet v2.0 is a convolutional neural network for voice activity detection (VAD) that serves as the first step for Speech Recognition and Speaker Diarization. It is a frame-based model that outputs a speech probability for each 20 millisecond frame of the input audio. <br>
+Frame-VAD Multilingual MarbleNet v2.0 is a convolutional neural network for voice activity detection (VAD) that serves as the first step for Speech Recognition and Speaker Diarization. It is a frame-based model that outputs a speech probability for each 20 millisecond frame of the input audio. The model has 91.5K parameters, making it lightweight and efficient for real-time applications. <br>
 To reduce false positive errors — cases where the model incorrectly detects speech when none is present — the model was trained with white noise and real-world noise perturbations. During training, audio volume was also varied. Additionally, the training data includes non-speech audio samples to help the model distinguish between speech and non-speech sounds (such as coughing, laughter, and breathing). <br>
+The model supports multiple languages, including Chinese, German, Russian, English, Spanish, and French.
 
 This model is ready for commercial use. <br>
 
@@ -65,7 +69,7 @@ The model is available for use in the NeMo toolkit [2], and can be used as a pre
 
 ```python
 import nemo.collections.asr as nemo_asr
-asr_model = nemo_asr.models.EncDecFrameClassificationModel.from_pretrained(model_name="nvidia/frame_vad_multilingual_marblenet_v2.0")
+vad_model = nemo_asr.models.EncDecFrameClassificationModel.from_pretrained(model_name="nvidia/frame_vad_multilingual_marblenet_v2.0")
 ```
 
 ### Perform VAD Inference
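The README states that the model emits one speech probability per 20 ms frame. As a minimal sketch of the post-processing step that typically follows inference, the snippet below thresholds such frame probabilities and merges consecutive speech frames into time segments. The `probs_to_segments` helper and the 0.5 threshold are illustrative assumptions, not part of the NeMo API.

```python
# Sketch: turning per-frame speech probabilities into speech segments.
# Assumes each probability covers one 20 ms frame, as the model card states;
# the helper name and the 0.5 threshold are illustrative, not NeMo APIs.

FRAME_DURATION = 0.02  # seconds per frame (20 ms)

def probs_to_segments(probs, threshold=0.5):
    """Merge runs of frames whose speech probability meets the threshold
    into (start_sec, end_sec) segments."""
    segments = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a speech run begins at this frame
        elif p < threshold and start is not None:
            segments.append((start * FRAME_DURATION, i * FRAME_DURATION))
            start = None
    if start is not None:  # speech runs to the end of the audio
        segments.append((start * FRAME_DURATION, len(probs) * FRAME_DURATION))
    return segments

# Example: frames 2-4 are speech -> one segment from 0.04 s to 0.10 s
print(probs_to_segments([0.1, 0.2, 0.9, 0.8, 0.7, 0.3]))
```

A real pipeline would usually also smooth the probabilities and apply minimum-duration rules before merging, but the thresholding step above is the core of converting frame-level output into usable segments.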