## Training, Testing, and Evaluation Datasets:

The Low Frame-rate Speech Codec is trained on a total of 28.7k hours of speech data from 105 languages.

For training our model, we used [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) and an English subset of the MLS dataset. The Common Voice derived training set comprises 105 languages, totaling 2.7 million utterances and 3.2k hours of audio from about one hundred thousand speakers. The [MLS English](https://www.openslr.org/94/) training dataset consists of 6.2 million utterances and 25.5k hours of audio from 4329 speakers.
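As a quick cross-check of the figures above (the hour counts are taken directly from this section), the per-corpus totals sum to the stated 28.7k hours:

```python
# Training-data hours as stated in this section (thousands of hours).
mls_english_hours_k = 25.5    # MLS English subset
common_voice_hours_k = 3.2    # Common Voice, 105 languages

total_k = round(mls_english_hours_k + common_voice_hours_k, 1)
print(f"Total training data: {total_k}k hours")  # Total training data: 28.7k hours
```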
### Training Datasets

The Low Frame-rate Speech Codec is trained on a total of 28.7k hours of speech data from 105 languages.

- [MLS English](https://www.openslr.org/94/) [25.5k hrs]
  - Data Collection Method: by Human
  - Labeling Method: Automated
- [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) [3.2k hrs]
  - Data Collection Method: by Human
  - Labeling Method: by Human
### Evaluation Datasets

- [MLS English](https://www.openslr.org/94/)
  - Data Collection Method: by Human
  - Labeling Method: Automated
- [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)
  - Data Collection Method: by Human
  - Labeling Method: by Human
### Test Datasets

- [MLS](https://www.openslr.org/94/)
  - Data Collection Method: by Human
  - Labeling Method: Automated
  - Properties: We randomly selected 200 samples from each of the eight languages in the 44kHz MLS dataset. For more details, please refer to [our paper](https://arxiv.org/abs/2409.12117).
- [DAPS](https://zenodo.org/records/4660670)
  - Data Collection Method: by Human
  - Labeling Method: Automated
  - Properties: To assess our model's performance on studio-quality audio, we used the F10 and M10 speakers from the DAPS Clean dataset. These speakers were also used in the evaluation of the [DAC model](https://arxiv.org/abs/2306.06546).
## Software Integration

### Supported Hardware Microarchitecture Compatibility: