Update README.md
README.md
CHANGED
@@ -24,7 +24,7 @@ The [Low Frame-rate Speech Codec](https://arxiv.org/abs/2409.12117) is a neural
 
 ## NVIDIA NeMo
 
-To train, fine-tune, or do inference with
 ```
 pip install git+https://github.com/NVIDIA/NeMo.git
 ```
@@ -37,64 +37,12 @@ The model is available for use in the NeMo toolkit [4], and can be used as a pre
 
 
 
-
-The Low Frame-rate Speech Codec is trained on a total of 28.7k hrs of speech data from 105 languages.
-
-The Common Voice derived training set comprises 105 languages, totaling 2.7 million utterances and 3.2k hours
-of audio from about one hundred thousand speakers. The MLS
-English training dataset consists of 6.2 million utterances and
-25.5k hours of audio from 4,329 speakers.
-
-[English subset of MLS](https://www.openslr.org/94/)
-[105 languages from Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)
-
-**Data Collection Method by dataset:**
-MLS English - [Human]
-CommonVoice - [Human]
-
-**Labeling Method by dataset:**
-MLS English - [Automated]
-CommonVoice - [Human]
-
-**Properties:**
-MLS English - 25.5k hours of English speech from 4.3k speakers.
-CommonVoice - 3.2k hours of speech from 100k speakers in 105 languages.
-
-**Dataset License(s):**
-Internal MLS English - The dataset license is CC BY 4.0.
-CommonVoice - Creative Commons CC0 public domain dedication
-
-Testing Dataset:
-**Link:**
-[MLS](https://www.openslr.org/94/)
-
-**Data Collection Method by dataset:**
-MLS - [Human]
-
-**Labeling Method by dataset:**
-MLS - [Automated]
-
-**Properties:** We randomly selected 200 samples from each of the eight languages in the 44kHz MLS dataset. For more details, please refer to [our paper](https://arxiv.org/abs/2409.12117).
-
-Evaluation Dataset:
-**Link:**
-[MLS English](https://www.openslr.org/94/)
-[CommonVoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0)
-
-**Data Collection Method by dataset:**
-MLS English - [Human]
-CommonVoice - [Human]
-
-**Labeling Method by dataset:**
-MLS - [Automated]
-CommonVoice - [Human]
-
-**Properties:** We used a small portion of the training dataset that was not in the model training set; the evaluation set includes both seen and unseen speakers.
-
-Inference:
-**Engine:** NeMo 2.0
-**Test Hardware:** NVIDIA RTX 6000 Ada Generation graphics card
-
-Ethical Considerations:
-NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
-
-Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
 
 
 ## NVIDIA NeMo
 
+To train, fine-tune, or do inference with our model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed Cython and the latest PyTorch version.
 ```
 pip install git+https://github.com/NVIDIA/NeMo.git
 ```
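Once NeMo is installed, loading and running the codec might look like the sketch below. This is not part of the diff: the checkpoint name is hypothetical, and the `AudioCodecModel` import and `encode`/`decode` call shapes are assumptions based on NeMo's TTS collection; it also needs a published checkpoint and enough memory to load it.

```python
# Sketch: encode audio to codec tokens and reconstruct it with NeMo.
# The checkpoint name below is an assumption for illustration only.
import torch
from nemo.collections.tts.models import AudioCodecModel

codec = AudioCodecModel.from_pretrained("nvidia/low-frame-rate-speech-codec-22khz")
codec.eval()

# One second of dummy mono audio at the model's sample rate.
audio = torch.randn(1, codec.sample_rate)
audio_len = torch.tensor([audio.shape[1]])

with torch.no_grad():
    # encode() yields discrete codec tokens; decode() reconstructs audio from them.
    tokens, tokens_len = codec.encode(audio=audio, audio_len=audio_len)
    reconstructed, _ = codec.decode(tokens=tokens, tokens_len=tokens_len)
```

Replace the dummy tensor with real audio resampled to `codec.sample_rate` before encoding.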
 
 
 
+## Training Datasets
+The Low Frame-rate Speech Codec is trained on a total of 28.7k hrs of speech data from 105 languages. For training our model we have used [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) and an English subset of the MLS dataset. The Common Voice derived training set comprises 105 languages, totaling 2.7 million utterances and 3.2k hours
+of audio from about one hundred thousand speakers. The MLS English training dataset consists of 6.2 million utterances and 25.5k hours of audio from 4,329 speakers.
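The per-corpus figures quoted above are consistent with the stated 28.7k-hour total; a quick sanity check (editorial, not part of the card):

```python
# Hours of audio per training corpus, as quoted in the card.
common_voice_hours = 3_200    # Common Voice, 105 languages
mls_english_hours = 25_500    # English subset of MLS

# The two corpora should sum to the stated 28.7k-hour total.
total_hours = common_voice_hours + mls_english_hours
assert total_hours == 28_700

# Utterance counts: 2.7M (Common Voice) + 6.2M (MLS English).
total_utterances = 2_700_000 + 6_200_000
print(total_hours, total_utterances)  # → 28700 8900000
```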
 
+## License/Terms of Use
+This model is for research and development only (non-commercial use), and the license to use this model is covered by the [NSCLv1](https://developer.nvidia.com/downloads/license/nsclv1).