library_name: nemo
tags:
- Multilingual
- MarbleNet
- pytorch
- speech
- audio
- VAD
---
# Frame-VAD Multilingual MarbleNet v2.0

## Description:

Frame-VAD Multilingual MarbleNet v2.0 is a convolutional neural network for voice activity detection (VAD) that serves as the first step for Speech Recognition and Speaker Diarization. It is a frame-based model that outputs a speech probability for each 20-millisecond frame of the input audio. The model has 91.5K parameters, making it lightweight and efficient for real-time applications. <br>
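Since the model emits one speech probability per 20 ms frame, a simple way to turn its output into speech segments is to threshold the probabilities and merge consecutive speech frames. The helper below is an illustrative sketch, not part of the model or the NeMo toolkit; the 0.5 threshold is an assumed default.

```python
# Illustrative post-processing sketch (hypothetical helper, not NeMo code):
# convert per-frame speech probabilities, one per 20 ms frame as this model
# emits, into (start, end) speech segments in seconds by thresholding.
FRAME_SEC = 0.02  # each frame covers 20 ms of audio


def probs_to_segments(probs, threshold=0.5, frame_sec=FRAME_SEC):
    segments = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i * frame_sec                      # speech onset
        elif p < threshold and start is not None:
            segments.append((start, i * frame_sec))    # speech offset
            start = None
    if start is not None:                              # speech runs to the end
        segments.append((start, len(probs) * frame_sec))
    return segments
```

A stricter threshold trades missed speech for fewer false positives; real pipelines usually also smooth the probabilities and drop very short segments.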
To reduce false positive errors (cases where the model incorrectly detects speech when none is present), the model was trained with white noise and real-world noise perturbations, and the volume of the training audio was also varied. Additionally, the training data includes non-speech audio samples to help the model distinguish speech from non-speech sounds such as coughing, laughter, and breathing. <br>
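The noise and volume perturbations described above can be sketched as follows. This is an illustrative stand-in, not the actual training pipeline; the SNR and gain ranges are assumptions.

```python
# Illustrative augmentation sketch (not the actual training code): add white
# noise at a target SNR and apply a random volume gain, mirroring the
# perturbations described in the model description.
import math
import random


def add_white_noise(signal, snr_db, rng):
    # Scale unit-variance Gaussian noise so signal power / noise power
    # matches the requested SNR in dB.
    sig_power = sum(x * x for x in signal) / len(signal)
    noise_power = sig_power / (10 ** (snr_db / 10))
    scale = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, 1.0) * scale for x in signal]


def vary_volume(signal, min_gain=0.5, max_gain=1.5, rng=None):
    # Apply a single random gain to the whole utterance.
    gain = (rng or random).uniform(min_gain, max_gain)
    return [x * gain for x in signal]
```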
This model is ready for commercial use. <br>

### License/Terms of Use:
GOVERNING TERMS: Your use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license).

### Deployment Geography:

Global <br>

### Use Case:

Developers, speech processing engineers, and AI researchers will use it as the first step for downstream speech processing models. <br>
## References:
[1] [Jia, Fei, Somshubra Majumdar, and Boris Ginsburg. "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.](https://arxiv.org/abs/2010.13886) <br>
## How to Use the Model:
The model is available for use in the NeMo toolkit [2], and can be used as a pre-trained checkpoint for inference.

### Automatically load the model
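A minimal loading sketch, assuming NeMo's `EncDecFrameClassificationModel` class and the checkpoint id `nvidia/frame_vad_multilingual_marblenet_v2.0` (verify the exact identifier against this model card before use):

```python
# Hypothetical loading sketch -- requires the NeMo toolkit installed and
# network access to download the checkpoint; the model id is an assumption.
import nemo.collections.asr as nemo_asr

vad_model = nemo_asr.models.EncDecFrameClassificationModel.from_pretrained(
    model_name="nvidia/frame_vad_multilingual_marblenet_v2.0"
)
```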
List types of specific high-risk AI systems, if any, in which the model can be integrated: Select from the following: [Biometrics] OR [Critical infrastructure] OR [Machinery and Robotics] OR [Medical Devices] OR [Vehicles] OR [Aviation] OR [Education and vocational training] OR [Employment and Workers Management] <br>
Describe the life critical impact (if present). | Not Applicable
Use Case Restrictions: | Abide by [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license)
Model and dataset restrictions: | The Principle of Least Privilege (PoLP) is applied, limiting access during dataset generation and model development. Dataset access is restricted during training, and dataset license constraints are adhered to.