---
license: other
license_name: nsclv1
license_link: https://developer.nvidia.com/downloads/license/nsclv1
---

# NVIDIA Low Frame-rate Speech Codec
<style>
img {
  display: inline-table;
  vertical-align: middle;
  margin: 0;
  padding: 0;
}
</style>
[![Model architecture](https://img.shields.io/badge/Model_Arch-Low_Frame--rate_Speech_Codec-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-112.7M-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets)

The [Low Frame-rate Speech Codec](https://arxiv.org/abs/2409.12117) is a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models (WavLM) to achieve high-quality audio compression at a bitrate of 1.89 kbps and a frame rate of 21.5 frames per second.

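Those two figures imply a compact per-frame budget. A quick back-of-the-envelope check (derived only from the numbers above, not an official breakdown):

```python
# Rough arithmetic relating the advertised bitrate and frame rate.
bitrate_bps = 1890        # 1.89 kbps
frames_per_second = 21.5  # codec frame rate

bits_per_frame = bitrate_bps / frames_per_second
print(f"~{bits_per_frame:.1f} bits per frame")  # ~87.9 bits per frame
```
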
## NVIDIA NeMo

To train, fine-tune, or run inference with this model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed Cython and the latest version of PyTorch.
```
pip install git+https://github.com/NVIDIA/NeMo.git
```

## How to Use this Model

The model is available for use in the NeMo toolkit and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

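As a minimal inference sketch (an illustrative example, not an official recipe, assuming the checkpoint loads through NeMo's `AudioCodecModel` class): a downloaded `.nemo` checkpoint can be restored and used to compress a waveform into discrete tokens and reconstruct it. The checkpoint filename and input file below are placeholders.

```python
import torch
import torchaudio
from nemo.collections.tts.models import AudioCodecModel

# Restore the codec from a local checkpoint (placeholder filename).
codec = AudioCodecModel.restore_from("low_frame_rate_speech_codec.nemo").eval()

# Load a waveform (mono assumed) and resample it to the codec's sample rate.
audio, sr = torchaudio.load("speech.wav")  # shape: (1, T)
audio = torchaudio.functional.resample(audio, sr, codec.sample_rate)
audio_len = torch.tensor([audio.shape[1]])

with torch.no_grad():
    # Encode to discrete codes at ~21.5 frames per second of audio ...
    tokens, tokens_len = codec.encode(audio=audio, audio_len=audio_len)
    # ... and decode the codes back into a waveform.
    reconstructed, _ = codec.decode(tokens=tokens, tokens_len=tokens_len)

torchaudio.save("reconstructed.wav", reconstructed.cpu(), codec.sample_rate)
```
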
### Datasets

The Low Frame-rate Speech Codec is trained on a total of 28.7k hours of speech data spanning 105 languages.

The Common Voice-derived training set covers 105 languages, totaling 2.7 million utterances and 3.2k hours of audio from roughly 100k speakers. The MLS English training set consists of 6.2 million utterances and 25.5k hours of audio from 4,329 speakers.

The training, testing, and evaluation datasets are listed below.

#### Training Datasets

- [English subset of MLS](https://www.openslr.org/94/)
- [105 languages from Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) (streamable as shown below)

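For a quick look at the Common Voice portion, a split can be streamed with the Hugging Face `datasets` library (a sketch; the dataset is gated, so it assumes you have accepted its terms and logged in with `huggingface-cli login`):

```python
from datasets import load_dataset

# Stream one language config of Common Voice 11.0 instead of downloading it.
cv = load_dataset(
    "mozilla-foundation/common_voice_11_0",
    "en",            # one of the 105 language configs
    split="train",
    streaming=True,  # avoids fetching the full corpus
)

sample = next(iter(cv))
print(sample["audio"]["sampling_rate"], sample["sentence"])
```
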
**Data Collection Method by dataset:**
- MLS English - Human
- Common Voice - Human

**Labeling Method by dataset:**
- MLS English - Automated
- Common Voice - Human

**Properties:**
- MLS English - 25.5k hours of English speech from 4.3k speakers.
- Common Voice - 3.2k hours of speech from 100k speakers in 105 languages.

**Dataset License(s):**
- MLS English (internal) - CC BY 4.0
- Common Voice - Creative Commons CC0 (public domain dedication)

#### Testing Dataset

**Link:**
- [MLS](https://www.openslr.org/94/)

**Data Collection Method by dataset:**
- MLS - Human

**Labeling Method by dataset:**
- MLS - Automated

**Properties:** We randomly selected 200 samples from each of the eight languages in the 44 kHz MLS dataset. For more details, please refer to [our paper](https://arxiv.org/abs/2409.12117).

#### Evaluation Datasets

**Links:**
- [MLS English](https://www.openslr.org/94/)
- [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0)

**Data Collection Method by dataset:**
- MLS English - Human
- Common Voice - Human

**Labeling Method by dataset:**
- MLS English - Automated
- Common Voice - Human

**Properties:** We used a held-out portion of the training data that was not seen during model training; the evaluation set includes both seen and unseen speakers.

## Inference

**Engine:** NeMo 2.0

**Test Hardware:** NVIDIA RTX 6000 Ada Generation graphics card

## Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloading or using this model in accordance with our terms of service, developers should work with their internal model team to ensure it meets the requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).