Update README.md
README.md
CHANGED
@@ -30,7 +30,7 @@ The generator comprises an encoder, followed by vector quantization, and a [HiFi
 The encoder consists of five residual blocks, each block containing three residual layers similar to the [multi-receptive field fusion (MRF) module](https://arxiv.org/abs/2010.05646).
 For the vector quantization, we have used [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505) with eight codebooks and four dimensions per code and 2016 codes per codebook.
 For the discriminators, we utilize three neural networks, all employing a squared-GAN and feature-matching loss. We adopt the [multi-period discriminator](https://arxiv.org/abs/2010.05646) and the [multi-scale complex STFT discriminator](https://arxiv.org/abs/2210.13438).
-Additionally, we proposed the use of Speech Language Models (SLMs) as a discriminator. SLMs encode information ranging from acoustic to semantic aspects, which could benefit our model's training, especially in low frame rate settings where accurate pronunciation is difficult to achieve due to the high compression rate. We adopted the [12-layer WavLM](https://arxiv.org/abs/2110.13900)
+Additionally, we proposed the use of Speech Language Models (SLMs) as a discriminator. SLMs encode information ranging from acoustic to semantic aspects, which could benefit our model's training, especially in low frame rate settings where accurate pronunciation is difficult to achieve due to the high compression rate. We adopted the [12-layer WavLM](https://arxiv.org/abs/2110.13900) as the SLM. During training, we resample the input audio to 16 kHz before feeding it into the WavLM model, extracting the intermediary layer features. These features are then fed to a discriminative head composed of four 1D convolutional layers.
 
 For more details please check [our paper](https://arxiv.org/abs/2409.12117).
 
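The FSQ settings quoted in the README (four dimensions per code, 2016 codes per codebook, eight codebooks) can be sanity-checked with a small sketch. The per-dimension level counts `[8, 7, 6, 6]` are an assumption: the README gives only the totals, and this is one factorization whose product is 2016. The helper names `quantize_dim` and `fsq_index` are hypothetical, and the rounding is a simplified version of FSQ (no straight-through gradient and no half-level offset for even level counts), not the model's actual implementation.

```python
import math

# Assumed per-dimension level counts (not stated in the README).
# 8 * 7 * 6 * 6 = 2016, matching "2016 codes per codebook" with four dimensions.
LEVELS = [8, 7, 6, 6]
NUM_CODEBOOKS = 8  # from the README


def quantize_dim(z, num_levels):
    """Bound z to [-1, 1] with tanh, then snap it to an integer level in [0, num_levels - 1]."""
    bounded = math.tanh(z)
    return int(round((bounded + 1.0) / 2.0 * (num_levels - 1)))


def fsq_index(vector, levels=LEVELS):
    """Quantize each dimension and fold the levels into one mixed-radix code index."""
    index = 0
    for z, num_levels in zip(vector, levels):
        index = index * num_levels + quantize_dim(z, num_levels)
    return index


codes_per_codebook = math.prod(LEVELS)
# Upper bound on the information carried by one frame across all codebooks.
bits_per_frame = NUM_CODEBOOKS * math.log2(codes_per_codebook)

print(codes_per_codebook)
print(fsq_index([0.3, -1.2, 0.0, 2.5]))
```

Because every dimension is rounded independently, the codebook is implicit: no embedding table is learned, and the index is just the mixed-radix encoding of the four levels.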