NeMo
CasanovaE committed
Commit 256f544 · verified · 1 Parent(s): 29786e4

Update README.md

Files changed (1):
  1. README.md +7 -9
README.md CHANGED
@@ -25,7 +25,8 @@ The [Low Frame-rate Speech Codec](https://arxiv.org/abs/2409.12117) is a neural
  ## Model Architecture
  The Low Frame-rate Speech Codec model is composed of a fully convolutional generator neural network and three discriminators.
  The generator comprises an encoder, followed by vector quantization, and a [HiFi-GAN-based](https://arxiv.org/abs/2010.05646) decoder.
- The encoder consists of five residual blocks, each block containing three residual layers similar to the [multi-receptive field fusion (MRF) module](https://arxiv.org/abs/2010.05646). For the vector quantization, we have used [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505) with eight codebooks, four dimensions per code, and 2016 codes per codebook.
+ The encoder consists of five residual blocks, each block containing three residual layers similar to the [multi-receptive field fusion (MRF) module](https://arxiv.org/abs/2010.05646).
+ For the vector quantization, we have used [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505) with eight codebooks, four dimensions per code, and 2016 codes per codebook.
  For the discriminators, we utilize three neural networks, all employing a squared-GAN and feature-matching loss. We adopt the [multi-period discriminator](https://arxiv.org/abs/2010.05646) and the [multi-scale complex STFT discriminator](https://arxiv.org/abs/2210.13438).
  Additionally, we proposed the use of Speech Language Models (SLMs) as a discriminator. SLMs encode information ranging from acoustic to semantic aspects, which could benefit our model's training, especially in low frame-rate settings where accurate pronunciation is difficult to achieve due to the high compression rate. We adopted the [12-layer WavLM](https://arxiv.org/abs/2110.13900), pre-trained on 94k hours of data, as the SLM. During training, we resample the input audio to 16 kHz before feeding it into the WavLM model, extracting the intermediate layer features. These features are then fed to a discriminative head composed of four 1D convolutional layers.
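As a quick sanity check on the FSQ numbers in this hunk, here is a minimal Python sketch of the implied bitrate, assuming the 21.5 frames-per-second rate reported in the paper; the [8, 7, 6, 6] level factorization (8·7·6·6 = 2016) is an illustrative assumption, not something stated in this commit:

```python
import math

# FSQ setup from the model card: 8 codebooks, 4 dimensions per code,
# 2016 codes per codebook. Levels [8, 7, 6, 6] are an assumed
# factorization: 8 * 7 * 6 * 6 = 2016 codes over 4 dimensions.
num_codebooks = 8
codes_per_codebook = 2016
frame_rate_fps = 21.5  # frames per second, as reported in the paper

bits_per_frame = num_codebooks * math.log2(codes_per_codebook)  # ~87.8 bits
bitrate_kbps = bits_per_frame * frame_rate_fps / 1000.0

print(f"{bits_per_frame:.1f} bits/frame -> {bitrate_kbps:.2f} kbps")
# ~87.8 bits/frame -> ~1.89 kbps, matching the bitrate reported in the paper
```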
 
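The SLM discriminator paragraph above describes a head of four 1D convolutional layers over WavLM features. The following is a minimal PyTorch sketch of such a head, where the channel widths, kernel sizes, and activations are illustrative assumptions rather than the model's actual hyperparameters:

```python
import torch
import torch.nn as nn

class SLMDiscriminatorHead(nn.Module):
    """Sketch of a discriminative head with four 1D conv layers over SLM features.

    Input: WavLM hidden states of shape (batch, time, feat_dim),
    e.g. feat_dim=768 for the 12-layer WavLM model.
    """

    def __init__(self, feat_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.1),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.1),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.1),
            nn.Conv1d(hidden, 1, kernel_size=3, padding=1),  # per-frame real/fake score
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # (batch, time, feat_dim) -> (batch, feat_dim, time) for Conv1d
        return self.convs(feats.transpose(1, 2))

# Example: 2 seconds of 16 kHz audio gives ~100 WavLM frames (~50 fps)
head = SLMDiscriminatorHead()
scores = head(torch.randn(1, 100, 768))  # -> shape (1, 1, 100)
```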
@@ -44,20 +45,17 @@ For more details please check [our paper](https://arxiv.org/abs/2409.12117).
  - **Other Properties Related to Output:** 22050 Hz Mono-channel Audio

- ## NVIDIA NeMo
-
- To train, fine-tune, or do inference with our model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed Cython and the latest PyTorch version.
- ```
- pip install git+https://github.com/NVIDIA/NeMo.git
- ```

  ## How to Use this Model

- The model is available for use in the NeMo toolkit [4], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
-
-
+ The model is available for use in the [NVIDIA NeMo toolkit](https://github.com/NVIDIA/NeMo), and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
+
+ ### Inference
+ For inference, please follow our [Audio Codec Inference Tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Inference.ipynb). Note that you will need to set the ```model_name``` parameter to "audio_codec_low_frame_rate_22khz".
+
+ ### Training
+ For fine-tuning on another dataset, please follow the steps available in our [Audio Codec Training Tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Training.ipynb). Note that you will need to set the ```CONFIG_FILENAME``` parameter to the "audio_codec_low_frame_rate_22050.yaml" config. You will also need to set ```pretrained_model_name``` to "audio_codec_low_frame_rate_22khz".
  ## Training, Testing, and Evaluation Datasets:
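To complement the Inference section added in this commit, here is a short sketch using NeMo's `AudioCodecModel` with the `model_name` value given above; the `encode`/`decode` call pattern follows the linked Audio Codec Inference Tutorial and should be checked against your installed NeMo version:

```python
import torch
from nemo.collections.tts.models import AudioCodecModel

# Load the pre-trained checkpoint by the model_name given in the README
codec = AudioCodecModel.from_pretrained(model_name="audio_codec_low_frame_rate_22khz")
codec.eval()

# Dummy batch: 1 second of 22050 Hz mono audio (replace with real audio)
audio = torch.randn(1, 22050)
audio_len = torch.tensor([22050])

with torch.no_grad():
    # Encode audio into discrete codes (8 FSQ codebooks per frame)
    tokens, tokens_len = codec.encode(audio=audio, audio_len=audio_len)
    # Decode the codes back into a 22050 Hz waveform
    reconstructed, reconstructed_len = codec.decode(tokens=tokens, tokens_len=tokens_len)
```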
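Similarly, a hedged sketch of the fine-tuning launch from the Training section; the `examples/tts/audio_codec.py` script path, the config directory, and the `init_from_pretrained_model` override are assumptions based on NeMo conventions and the linked tutorial, not details stated in this commit:

```python
# Hedged sketch of the fine-tuning launch described above. Paths, config
# directory, and override names are assumptions drawn from the linked
# training tutorial, not from this commit; adapt them to your setup.
import subprocess

subprocess.run(
    [
        "python",
        "examples/tts/audio_codec.py",                     # NeMo training script (assumed path)
        "--config-path=conf/audio_codec",                  # directory holding the config (assumed)
        "--config-name=audio_codec_low_frame_rate_22050",  # CONFIG_FILENAME from the README
        "+init_from_pretrained_model=audio_codec_low_frame_rate_22khz",  # pretrained_model_name
    ],
    check=True,
)
```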