---
license: other
license_name: nsclv1
license_link: https://developer.nvidia.com/downloads/license/nsclv1
---

# NVIDIA Low Frame-rate Speech Codec

[![Model architecture](https://img.shields.io/badge/Model_Arch-Low_Frame--rate_Speech_Codec-lightgrey#model-badge)](#model-architecture) | [![Model size](https://img.shields.io/badge/Params-112.7M-lightgrey#model-badge)](#model-architecture) | [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets)

The [Low Frame-rate Speech Codec](https://arxiv.org/abs/2409.12117) is a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models (WavLM) to achieve high-quality audio compression at a bitrate of 1.89 kbps and a frame rate of 21.5 frames per second.

## Model Architecture

The Low Frame-rate Speech Codec model is composed of a fully convolutional generator neural network and three discriminators. The generator comprises an encoder, followed by finite scalar quantization, and a [HiFi-GAN](https://arxiv.org/abs/2010.05646)-based decoder. For the discriminators, we utilize three neural networks, all employing a squared-GAN and feature-matching loss. We adopt the multi-period discriminator proposed in [HiFi-GAN](https://arxiv.org/abs/2010.05646) and the multi-scale complex STFT discriminator proposed in [EnCodec](https://arxiv.org/abs/2210.13438). Additionally, inspired by [StyleTTS 2](https://arxiv.org/abs/2306.07691), we propose the use of Speech Language Models (SLMs) as a discriminator. SLMs encode information ranging from acoustic to semantic aspects, which can benefit our model's training, especially in low frame-rate settings where accurate pronunciation is difficult to achieve due to the high compression rate. We adopt the 12-layer [WavLM](https://arxiv.org/abs/2110.13900) model, pre-trained on 94k hours of data, as the SLM. During training, we resample the input audio to 16 kHz before feeding it into the WavLM model and extract the intermediate layer features. These features are then fed to a discriminative head composed of four 1D convolutional layers (an illustrative sketch of this head is given near the end of this card). As in [WavLM](https://arxiv.org/abs/2110.13900), the SLM remains frozen during training.

### Input

- **Input Type:** Audio
- **Input Format(s):** .wav files
- **Input Parameters:** One-Dimensional (1D)
- **Other Properties Related to Input:** 22050 Hz mono-channel audio

### Output

- **Output Type:** Audio
- **Output Format:** .wav files
- **Output Parameters:** One-Dimensional (1D)
- **Other Properties Related to Output:** 22050 Hz mono-channel audio

## NVIDIA NeMo

To train, fine-tune, or run inference with our model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed Cython and the latest PyTorch version.

```
pip install git+https://github.com/NVIDIA/NeMo.git
```

## How to Use this Model

The model is available for use in the NeMo toolkit and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset. A minimal inference sketch is provided in the Inference Example section below.

## Training Datasets

The Low Frame-rate Speech Codec is trained on a total of 28.7k hours of speech data covering 105 languages. For training our model we used [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) and an English subset of the MLS dataset. The Common Voice derived training set comprises 105 languages, totaling 2.7 million utterances and 3.2k hours of audio from about one hundred thousand speakers. The [MLS English](https://www.openslr.org/94/) training set consists of 6.2 million utterances and 25.5k hours of audio from 4,329 speakers.
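## Inference Example

Below is a minimal inference sketch using NeMo's `AudioCodecModel`. It assumes you have installed NeMo as described above and downloaded the `.nemo` checkpoint from this repository; the checkpoint and audio file names below are placeholders.

```python
import librosa
import torch

from nemo.collections.tts.models import AudioCodecModel

# Load the codec from a local .nemo checkpoint (placeholder path).
codec = AudioCodecModel.restore_from(
    restore_path="low_frame_rate_speech_codec.nemo", map_location="cpu"
).eval()

# The model expects 22050 Hz mono audio (see Input above).
audio, _ = librosa.load("input.wav", sr=22050, mono=True)  # placeholder path
audio = torch.from_numpy(audio).unsqueeze(0)   # shape: [batch, time]
audio_len = torch.tensor([audio.shape[1]])

with torch.no_grad():
    # Encode the waveform into discrete codec tokens at ~21.5 frames per second.
    tokens, tokens_len = codec.encode(audio=audio, audio_len=audio_len)
    # Decode the tokens back into a waveform.
    reconstructed, _ = codec.decode(tokens=tokens, tokens_len=tokens_len)
```

To write the reconstruction to disk, `soundfile.write("output.wav", reconstructed.squeeze().numpy(), 22050)` is one option.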
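## Illustrative Architecture Sketches

The two sketches below are meant only to build intuition for components described in the Model Architecture section. Layer sizes, level counts, and other hyperparameters are assumptions chosen for illustration and do not reflect the released model's exact configuration.

### Finite Scalar Quantization

A minimal sketch of [finite scalar quantization](https://arxiv.org/abs/2309.15505) with a straight-through gradient estimator; the per-dimension level counts here are arbitrary.

```python
import torch

def fsq(z: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    """Quantize each channel of z to a fixed number of scalar levels.

    z: [..., dim] real-valued latents; levels: [dim] odd level counts.
    """
    # Bound each channel to [-(L-1)/2, (L-1)/2] so rounding yields L levels.
    half = (levels - 1) / 2
    bounded = half * torch.tanh(z)
    # Round to the nearest integer grid point; pass gradients straight through.
    return bounded + (bounded.round() - bounded).detach()

# Example: 4 latent channels, each quantized to 5 levels.
z = torch.randn(1, 10, 4)  # [batch, frames, dim]
codes = fsq(z, torch.tensor([5.0, 5.0, 5.0, 5.0]))
```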
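### SLM Discriminator Head

The description above specifies a discriminative head of four 1D convolutional layers applied to frozen WavLM features. Only that overall shape follows the card; the channel widths, kernel sizes, and activations below are assumptions, as is the 768-dimensional feature size of the 12-layer WavLM model.

```python
import torch
import torch.nn as nn

class SLMDiscriminatorHead(nn.Module):
    """Toy sketch of a four-layer Conv1d head over frozen WavLM features."""

    def __init__(self, feature_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(feature_dim, hidden_dim, kernel_size=3, padding=1),
            nn.LeakyReLU(0.1),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.LeakyReLU(0.1),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.LeakyReLU(0.1),
            # Final layer emits one per-frame real/fake score.
            nn.Conv1d(hidden_dim, 1, kernel_size=3, padding=1),
        )

    def forward(self, wavlm_features: torch.Tensor) -> torch.Tensor:
        # wavlm_features: [batch, time, feature_dim] -> [batch, feature_dim, time]
        return self.layers(wavlm_features.transpose(1, 2))
```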
## License/Terms of Use

This model is for research and development only (non-commercial use) and the license to use this model is covered by the [NSCLv1](https://developer.nvidia.com/downloads/license/nsclv1).