Update README.md
README.md CHANGED
@@ -11,13 +11,17 @@ metrics:
 - roc_auc
 pipeline_tag: voice-activity-detection
 library_name: nemo
+tags:
+- multilingual
+- marblenet
 ---
 # Frame-VAD Multilingual MarbleNet v2.0
 
 ## Description
 
-Frame-VAD Multilingual MarbleNet v2.0 is a convolutional neural network for voice activity detection (VAD) that serves as the first step for Speech Recognition and Speaker Diarization. It is a frame-based model that outputs a speech probability for each 20 millisecond frame of the input audio. <br>
+Frame-VAD Multilingual MarbleNet v2.0 is a convolutional neural network for voice activity detection (VAD) that serves as the first step for Speech Recognition and Speaker Diarization. It is a frame-based model that outputs a speech probability for each 20 millisecond frame of the input audio. The model has 91.5K parameters, making it lightweight and efficient for real-time applications. <br>
 To reduce false positive errors (cases where the model incorrectly detects speech when none is present), the model was trained with white noise and real-world noise perturbations. The volume of the training audio was also varied. Additionally, the training data includes non-speech audio samples to help the model distinguish speech from non-speech sounds such as coughing, laughter, and breathing. <br>
+The model supports multiple languages, including Chinese, German, Russian, English, Spanish, and French.
 
 This model is ready for commercial use. <br>
 
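The description in the hunk above says the model emits one speech probability per 20 millisecond frame. A minimal sketch, not taken from the model card, of how such frame-level probabilities can be collapsed into time-stamped speech segments; the 20 ms frame duration comes from the description above, while the 0.5 threshold and the input probabilities are hypothetical:

```python
# Illustrative sketch only (not from the README): merge consecutive
# above-threshold frames into (start_sec, end_sec) speech segments.
FRAME_SEC = 0.02   # 20 ms per frame, per the model description
THRESHOLD = 0.5    # hypothetical decision threshold

def probs_to_segments(speech_probs):
    """Collapse per-frame speech probabilities into speech segments."""
    segments, start = [], None
    for i, p in enumerate(speech_probs):
        if p >= THRESHOLD and start is None:
            start = i * FRAME_SEC                     # segment opens here
        elif p < THRESHOLD and start is not None:
            segments.append((start, i * FRAME_SEC))   # segment closes here
            start = None
    if start is not None:                             # segment runs to the end
        segments.append((start, len(speech_probs) * FRAME_SEC))
    return segments

print(probs_to_segments([0.1, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.9]))
# [(0.02, 0.08), (0.12, 0.16)]
```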
@@ -65,7 +69,7 @@ The model is available for use in the NeMo toolkit [2], and can be used as a pre
 
 ```python
 import nemo.collections.asr as nemo_asr
-
+vad_model = nemo_asr.models.EncDecFrameClassificationModel.from_pretrained(model_name="nvidia/frame_vad_multilingual_marblenet_v2.0")
 ```
 
 ### Perform VAD Inference
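The second hunk only adds the line that loads the checkpoint. A hedged sketch of a possible next step, not taken from the model card: running the loaded model on a 16 kHz mono waveform to obtain per-frame speech probabilities. The `(input_signal, input_signal_length)` call follows the usual NeMo classification-model convention, and the `[non-speech, speech]` logit order, the `audio.wav` path, and the softmax post-processing are assumptions to verify against the NeMo documentation.

```python
# Hedged sketch (not from the README): get per-frame speech probabilities
# from the loaded model. The (input_signal, input_signal_length) forward
# signature is the common NeMo convention; the two-class logit layout
# [non-speech, speech] is an assumption, so check the NeMo docs before use.
import torch
import soundfile as sf
import nemo.collections.asr as nemo_asr

vad_model = nemo_asr.models.EncDecFrameClassificationModel.from_pretrained(
    model_name="nvidia/frame_vad_multilingual_marblenet_v2.0"
)
vad_model.eval()

audio, sr = sf.read("audio.wav", dtype="float32")  # placeholder 16 kHz mono file
assert sr == 16000, "the model expects 16 kHz audio"

with torch.no_grad():
    logits = vad_model(
        input_signal=torch.tensor(audio).unsqueeze(0),   # shape [1, num_samples]
        input_signal_length=torch.tensor([len(audio)]),
    )
speech_probs = torch.softmax(logits, dim=-1)[0, :, 1]    # P(speech) per 20 ms frame
print(speech_probs.shape)
```

The resulting probabilities could then be thresholded and merged into segments as in the earlier sketch.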