## Training, Testing, and Evaluation Datasets:

The Low Frame-rate Speech Codec is trained on a total of 28.7k hours of speech data from 105 languages.

For training, we used [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) and an English subset of the MLS dataset. The Common Voice-derived training set comprises 105 languages, totaling 2.7 million utterances and 3.2k hours of audio from about one hundred thousand speakers. The [MLS English](https://www.openslr.org/94/) training set consists of 6.2 million utterances and 25.5k hours of audio from 4,329 speakers.
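As a quick cross-check, the per-dataset figures above add up to the stated totals; a few lines of Python make the arithmetic explicit:

```python
# Figures quoted in the model card, in hours and utterances.
mls_hours, mls_utterances = 25_500, 6_200_000  # MLS English subset
cv_hours, cv_utterances = 3_200, 2_700_000     # Common Voice subset

total_hours = mls_hours + cv_hours
total_utterances = mls_utterances + cv_utterances

print(f"{total_hours / 1_000:.1f}k hours")                # 28.7k hours, matching the stated total
print(f"{total_utterances / 1_000_000:.1f}M utterances")  # 8.9M utterances
```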
### Training Datasets

- [MLS English](https://www.openslr.org/94/) [25.5k hours]
  - Data Collection Method: by Human
  - Labeling Method: Automated
- [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) [3.2k hours]
  - Data Collection Method: by Human
  - Labeling Method: by Human

### Evaluation Datasets

- [MLS English](https://www.openslr.org/94/)
  - Data Collection Method: by Human
  - Labeling Method: Automated
- [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)
  - Data Collection Method: by Human
  - Labeling Method: by Human

### Test Datasets

- [MLS](https://www.openslr.org/94/)
  - Data Collection Method: by Human
  - Labeling Method: Automated
  - Properties: We randomly selected 200 samples from each of the eight languages in the 44kHz MLS dataset. For more details, please refer to [our paper](https://arxiv.org/abs/2409.12117).
- [DAPS](https://zenodo.org/records/4660670)
  - Data Collection Method: by Human
  - Labeling Method: Automated
  - Properties: To assess the model's performance on studio-quality audio, we used the F10 and M10 speakers from the DAPS Clean dataset. These speakers were also used in the evaluation of the [DAC model](https://arxiv.org/abs/2306.06546).
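The MLS test split described above (200 randomly selected samples per language) can be sketched with a small helper. This is an illustrative reconstruction, not the actual selection code; the function name and data layout are assumptions:

```python
import random

def sample_test_set(utterances_by_language, per_language=200, seed=0):
    """Pick `per_language` random utterances from each language's pool.

    Illustrative sketch of per-language test-set selection; a fixed seed
    keeps the draw reproducible.
    """
    rng = random.Random(seed)
    test_set = {}
    for language, utterances in utterances_by_language.items():
        # Guard against pools smaller than the requested sample size.
        test_set[language] = rng.sample(utterances, min(per_language, len(utterances)))
    return test_set

# Toy pools standing in for the per-language MLS utterance lists.
pools = {
    "english": [f"en_{i}" for i in range(1_000)],
    "german": [f"de_{i}" for i in range(1_000)],
}
subset = sample_test_set(pools)
print(len(subset["english"]))  # 200
```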
## Software Integration

### Supported Hardware Microarchitecture Compatibility: