naymaraq committed (verified) · Commit 304eec6 · Parent: 768aaf8

Update README.md

Files changed (1): README.md (+13 −6)
README.md CHANGED
@@ -14,10 +14,14 @@ library_name: nemo
  tags:
  - Multilingual
  - MarbleNet
  ---
  # Frame-VAD Multilingual MarbleNet v2.0

- ## Description

  Frame-VAD Multilingual MarbleNet v2.0 is a convolutional neural network for voice activity detection (VAD) that serves as the first step for Speech Recognition and Speaker Diarization. It is a frame-based model that outputs a speech probability for each 20 millisecond frame of the input audio. The model has 91.5K parameters, making it lightweight and efficient for real-time applications. <br>
  To reduce false positive errors (cases where the model incorrectly detects speech when none is present), the model was trained with white noise and real-world noise perturbations. During training, the volume of the audio was also varied. Additionally, the training data includes non-speech audio samples to help the model distinguish between speech and non-speech sounds (such as coughing, laughter, and breathing). <br>
@@ -25,14 +29,17 @@ The model supports multiple languages, including Chinese, German, Russian, Engli

  This model is ready for commercial use. <br>

- ### License/Terms of Use
  GOVERNING TERMS: Your use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license).

- Deployment Geography: Global <br>

- Use Case: Developers, speech processing engineers, and AI researchers will use it as the first step for other speech processing models. <br>

  ## References:
  [1] [Jia, Fei, Somshubra Majumdar, and Boris Ginsburg. "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.](https://arxiv.org/abs/2010.13886) <br>
@@ -62,7 +69,7 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys

- ## How to Use the Model
  The model is available for use in the NeMo toolkit [2], and can be used as a pre-trained checkpoint for inference.

  ### Automatically load the model
@@ -256,4 +263,4 @@ Model Application(s): | Automatic Speech Recognit
  List types of specific high-risk AI systems, if any, in which the model can be integrated: Select from the following: [Biometrics] OR [Critical infrastructure] OR [Machinery and Robotics] OR [Medical Devices] OR [Vehicles] OR [Aviation] OR [Education and vocational training] OR [Employment and Workers Management] <br>
  Describe the life critical impact (if present). | Not Applicable
  Use Case Restrictions: | Abide by [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license)
- Model and dataset restrictions: | The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Dataset access is restricted during training, and dataset license constraints are adhered to.
 
  tags:
  - Multilingual
  - MarbleNet
+ - pytorch
+ - speech
+ - audio
+ - VAD
  ---
  # Frame-VAD Multilingual MarbleNet v2.0

+ ## Description:

  Frame-VAD Multilingual MarbleNet v2.0 is a convolutional neural network for voice activity detection (VAD) that serves as the first step for Speech Recognition and Speaker Diarization. It is a frame-based model that outputs a speech probability for each 20 millisecond frame of the input audio. The model has 91.5K parameters, making it lightweight and efficient for real-time applications. <br>
  To reduce false positive errors (cases where the model incorrectly detects speech when none is present), the model was trained with white noise and real-world noise perturbations. During training, the volume of the audio was also varied. Additionally, the training data includes non-speech audio samples to help the model distinguish between speech and non-speech sounds (such as coughing, laughter, and breathing). <br>
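To illustrate the frame-based output format described above, here is a minimal sketch (not the NeMo post-processing pipeline, and the 0.5 threshold is an illustrative assumption) that merges per-frame speech probabilities at 20 ms resolution into speech segments:

```python
FRAME_MS = 20  # each output probability covers one 20 ms frame

def probs_to_segments(probs, threshold=0.5):
    """Merge consecutive frames with probability >= threshold into (start_s, end_s) segments."""
    segments = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a speech run begins at this frame
        elif p < threshold and start is not None:
            segments.append((start * FRAME_MS / 1000, i * FRAME_MS / 1000))
            start = None
    if start is not None:  # close a run that extends to the end of the audio
        segments.append((start * FRAME_MS / 1000, len(probs) * FRAME_MS / 1000))
    return segments

print(probs_to_segments([0.1, 0.9, 0.8, 0.2, 0.7]))  # [(0.02, 0.06), (0.08, 0.1)]
```

In practice a real pipeline would typically also smooth the probabilities and drop very short segments before handing timestamps to ASR or diarization.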
 
  This model is ready for commercial use. <br>

+ ### License/Terms of Use:
  GOVERNING TERMS: Your use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license).

+ ### Deployment Geography:

+ Global <br>

+ ### Use Case:
+
+ Developers, speech processing engineers, and AI researchers will use it as the first step for other speech processing models. <br>

  ## References:
  [1] [Jia, Fei, Somshubra Majumdar, and Boris Ginsburg. "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.](https://arxiv.org/abs/2010.13886) <br>
 
+ ## How to Use the Model:
  The model is available for use in the NeMo toolkit [2], and can be used as a pre-trained checkpoint for inference.

  ### Automatically load the model
 
  List types of specific high-risk AI systems, if any, in which the model can be integrated: Select from the following: [Biometrics] OR [Critical infrastructure] OR [Machinery and Robotics] OR [Medical Devices] OR [Vehicles] OR [Aviation] OR [Education and vocational training] OR [Employment and Workers Management] <br>
  Describe the life critical impact (if present). | Not Applicable
  Use Case Restrictions: | Abide by [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license)
+ Model and dataset restrictions: | The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Dataset access is restricted during training, and dataset license constraints are adhered to.