---
license: other
license_name: nsclv1
license_link: https://developer.nvidia.com/downloads/license/nsclv1
---

# NVIDIA Low Frame-rate Speech Codec

<style>
img {
  display: inline-table;
  vertical-align: middle;
  margin: 0;
  padding: 0;
}
</style>

[![Model architecture](https://img.shields.io/badge/Model_Arch-Low_Frame--rate_Speech_Codec-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-112.7M-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets)

The [Low Frame-rate Speech Codec](https://arxiv.org/abs/2409.12117) is a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models (WavLM) to achieve high-quality audio compression at a bitrate of 1.89 kbps and a frame rate of 21.5 frames per second.
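At 21.5 frames per second, the 1.89 kbps bitrate works out to a budget of roughly 88 bits per audio frame (1890 / 21.5 ≈ 87.9).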

## NVIDIA NeMo

To train, fine-tune, or run inference with this model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed Cython and the latest PyTorch version.
```
pip install git+https://github.com/NVIDIA/NeMo.git
```

## How to Use this Model

The model is available for use in the NeMo toolkit and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
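
The snippet below is a minimal inference sketch, not an official recipe: the checkpoint id `nvidia/low-frame-rate-speech-codec-22khz` and the input file name are assumptions (adjust them to the actual model name and your audio), and the encode/decode calls follow NeMo's `AudioCodecModel` interface.

```
import librosa
import torch
from nemo.collections.tts.models import AudioCodecModel

# Load the pre-trained codec checkpoint (repo id is an assumption -- adjust as needed).
codec = AudioCodecModel.from_pretrained("nvidia/low-frame-rate-speech-codec-22khz").eval()

# Load audio at the codec's sample rate and add a batch dimension.
audio, _ = librosa.load("speech.wav", sr=codec.sample_rate)
audio = torch.from_numpy(audio).unsqueeze(0)   # shape: (1, num_samples)
audio_len = torch.tensor([audio.shape[1]])

with torch.no_grad():
    # Encode the waveform into discrete tokens, then decode back to audio.
    tokens, tokens_len = codec.encode(audio=audio, audio_len=audio_len)
    reconstructed, _ = codec.decode(tokens=tokens, tokens_len=tokens_len)
```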

### Datasets

The Low Frame-rate Speech Codec is trained on a total of 28.7k hours of speech data spanning 105 languages.

The Common Voice derived training set comprises 105 languages, totaling 2.7 million utterances and 3.2k hours of audio from about one hundred thousand speakers. The MLS English training dataset consists of 6.2 million utterances and 25.5k hours of audio from 4,329 speakers.

Training, Testing, and Evaluation Datasets:

#### Training Dataset:

**Link:**
- [English subset of MLS](https://www.openslr.org/94/)
- [105 languages from Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)

**Data Collection Method by dataset:**
- MLS English - [Human]
- CommonVoice - [Human]

**Labeling Method by dataset:**
- MLS English - [Automated]
- CommonVoice - [Human]

**Properties:**
- MLS English - 25.5k hours of English speech from 4.3k speakers.
- CommonVoice - 3.2k hours of speech from 100k speakers in 105 languages.

**Dataset License(s):**
- MLS English - CC BY 4.0
- CommonVoice - Creative Commons CC0 public domain dedication

#### Testing Dataset:

**Link:**
- [MLS](https://www.openslr.org/94/)

**Data Collection Method by dataset:**
- MLS - [Human]

**Labeling Method by dataset:**
- MLS - [Automated]

**Properties:** We randomly selected 200 samples from each of the eight languages in the 44kHz MLS dataset. For more details, please refer to [our paper](https://arxiv.org/abs/2409.12117).

#### Evaluation Dataset:

**Link:**
- [MLS English](https://www.openslr.org/94/)
- [CommonVoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0)

**Data Collection Method by dataset:**
- MLS English - [Human]
- CommonVoice - [Human]

**Labeling Method by dataset:**
- MLS - [Automated]
- CommonVoice - [Human]

**Properties:** We used a small held-out portion of the training data that was excluded from model training. The evaluation set includes both seen and unseen speakers.

### Inference

**Engine:** NeMo 2.0

**Test Hardware:** NVIDIA RTX 6000 Ada Generation Graphics Card

### Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).