Fine-tuning attempts on Microsoft's HiFiGAN vocoder were unsuccessful.
Our approach involves a smaller model that takes a fixed audio chunk of 8 Mel frames, plus two pre-frames and two post-frames. These frames are processed along with the original vocoder's 12 audio frames of 256 bytes each. The model employs convolution input layers for both the audio and Mel frames to generate hidden dimensions, followed by a linear layer and a final convolution layer. The output is then multiplied with the original 8 audio frames to produce corrected frames.

![HelloSippyRT Model Architecture](https://docs.google.com/drawings/d/e/2PACX-1vTiWxGbEB2MbvHpTJHS22abWNrSt2pHv6XijEDmnQFjAqBewMJyZBQ_5Y9k1P9INQPQmuq56MpLDzJt/pub?w=960&h=720)
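The data flow above can be sketched in NumPy. The 12-frame window and the 256-sample frame size come from the text; the Mel-bin count, hidden width, kernel size, and the untrained weights are placeholder assumptions, not the actual model parameters:

```python
import numpy as np

# 12 frames and 256 samples/frame come from the README;
# N_MEL, HIDDEN and the kernel size K are placeholder assumptions.
N_MEL, FRAME, HIDDEN, K = 80, 256, 64, 3
rng = np.random.default_rng(0)

def conv1d(x, w):
    """'Same'-padded 1-D convolution; x: (c_in, T), w: (c_out, c_in, K)."""
    k = w.shape[2]
    xp = np.pad(x, ((0, 0), (k // 2, k // 2)))
    # Sliding windows over the frame axis: (c_in, T, K)
    win = np.stack([xp[:, t:t + k] for t in range(x.shape[1])], axis=1)
    return np.einsum('oik,itk->ot', w, win)  # (c_out, T)

# Inputs: 8 central Mel frames with 2 pre- and 2 post-frames (12 total),
# and the vocoder's 12 audio frames of 256 samples, as (channels, frames).
mel = rng.standard_normal((N_MEL, 12))
audio = rng.standard_normal((FRAME, 12))

# Untrained placeholder weights, only to exercise the shapes.
w_mel = rng.standard_normal((HIDDEN, N_MEL, K)) * 0.01
w_aud = rng.standard_normal((HIDDEN, FRAME, K)) * 0.01
w_lin = rng.standard_normal((2 * HIDDEN, 2 * HIDDEN)) * 0.01
w_out = rng.standard_normal((FRAME, 2 * HIDDEN, K)) * 0.01

# Convolution input layers for Mel and audio produce hidden dimensions.
h = np.concatenate([conv1d(mel, w_mel), conv1d(audio, w_aud)])  # (128, 12)
h = w_lin @ h                        # linear layer, applied per frame
gain = conv1d(h, w_out)              # final convolution -> (256, 12)

# The output multiplies the 8 central audio frames -> corrected frames.
corrected = (audio * gain)[:, 2:10]  # (256, 8)
```

This only illustrates the tensor shapes and the multiplicative-correction idea; the real layer widths and training code live in this repository.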