## Training, Testing, and Evaluation Datasets:

The Low Frame-rate Speech Codec is trained on a total of 28.7k hours of speech data from 105 languages.

For training our model, we used [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) and an English subset of the MLS dataset. The Common Voice derived training set comprises 105 languages, totaling 2.7 million utterances and 3.2k hours of audio from about one hundred thousand speakers. The [MLS English](https://www.openslr.org/94/) training dataset consists of 6.2 million utterances and 25.5k hours of audio from 4329 speakers.
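As a quick cross-check of the figures above (the hour counts are taken directly from this section), the per-corpus totals sum to the stated 28.7k hours:

```python
# Training-data hours as stated in this section (thousands of hours).
mls_english_hours_k = 25.5    # MLS English subset
common_voice_hours_k = 3.2    # Common Voice, 105 languages

total_k = round(mls_english_hours_k + common_voice_hours_k, 1)
print(f"Total training data: {total_k}k hours")  # Total training data: 28.7k hours
```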
### Training Datasets

The Low Frame-rate Speech Codec is trained on a total of 28.7k hours of speech data from 105 languages.

- [MLS English](https://www.openslr.org/94/) [25.5k hrs]
  - Data Collection Method: by Human
  - Labeling Method: Automated
- [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) [3.2k hrs]
  - Data Collection Method: by Human
  - Labeling Method: by Human
### Evaluation Datasets

- [MLS English](https://www.openslr.org/94/)
  - Data Collection Method: by Human
  - Labeling Method: Automated
- [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)
  - Data Collection Method: by Human
  - Labeling Method: by Human
### Test Datasets

- [MLS](https://www.openslr.org/94/)
  - Data Collection Method: by Human
  - Labeling Method: Automated
  - Properties: We randomly selected 200 samples from each of the eight languages in the 44kHz MLS dataset. For more details, please refer to [our paper](https://arxiv.org/abs/2409.12117).
- [DAPS](https://zenodo.org/records/4660670)
  - Data Collection Method: by Human
  - Labeling Method: Automated
  - Properties: To assess our model's performance on studio-quality audio, we used the F10 and M10 speakers from the DAPS Clean dataset. These speakers were also used in the evaluation of the [DAC model](https://arxiv.org/abs/2306.06546).
## Software Integration

### Supported Hardware Microarchitecture Compatibility: