Fine-tuning attempts on Microsoft's HiFiGAN vocoder were unsuccessful.
Our approach involves a smaller model that takes a fixed audio chunk of 8 Mel frames, plus two pre-frames and two post-frames. These frames are processed along with the original vocoder's 12 audio frames of 256 bytes each. The model employs convolution input layers for both the audio and Mel frames to generate hidden dimensions, followed by a linear layer and a final convolution layer. The output is then multiplied with the original 8 audio frames to produce corrected frames.

![HelloSippyRT Model Architecture](https://docs.google.com/drawings/d/e/2PACX-1vTiWxGbEB2MbvHpTJHS22abWNrSt2pHv6XijEDmnQFjAqBewMJyZBQ_5Y9k1P9INQPQmuq56MpLDzJt/pub?w=960&h=720)
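The data flow above can be sketched in NumPy. The 12-frame window and the 256-sample frame size come from the text; the Mel-bin count, hidden width, kernel size, and the untrained weights are placeholder assumptions, not the actual model parameters:

```python
import numpy as np

# 12 frames and 256 samples/frame come from the README;
# N_MEL, HIDDEN and the kernel size K are placeholder assumptions.
N_MEL, FRAME, HIDDEN, K = 80, 256, 64, 3
rng = np.random.default_rng(0)

def conv1d(x, w):
    """'Same'-padded 1-D convolution; x: (c_in, T), w: (c_out, c_in, K)."""
    k = w.shape[2]
    xp = np.pad(x, ((0, 0), (k // 2, k // 2)))
    # Sliding windows over the frame axis: (c_in, T, K)
    win = np.stack([xp[:, t:t + k] for t in range(x.shape[1])], axis=1)
    return np.einsum('oik,itk->ot', w, win)  # (c_out, T)

# Inputs: 8 central Mel frames with 2 pre- and 2 post-frames (12 total),
# and the vocoder's 12 audio frames of 256 samples, as (channels, frames).
mel = rng.standard_normal((N_MEL, 12))
audio = rng.standard_normal((FRAME, 12))

# Untrained placeholder weights, only to exercise the shapes.
w_mel = rng.standard_normal((HIDDEN, N_MEL, K)) * 0.01
w_aud = rng.standard_normal((HIDDEN, FRAME, K)) * 0.01
w_lin = rng.standard_normal((2 * HIDDEN, 2 * HIDDEN)) * 0.01
w_out = rng.standard_normal((FRAME, 2 * HIDDEN, K)) * 0.01

# Convolution input layers for Mel and audio produce hidden dimensions.
h = np.concatenate([conv1d(mel, w_mel), conv1d(audio, w_aud)])  # (128, 12)
h = w_lin @ h                        # linear layer, applied per frame
gain = conv1d(h, w_out)              # final convolution -> (256, 12)

# The output multiplies the 8 central audio frames -> corrected frames.
corrected = (audio * gain)[:, 2:10]  # (256, 8)
```

This only illustrates the tensor shapes and the multiplicative-correction idea; the real layer widths and training code live in this repository.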