sobomax committed
Commit ab970be
1 Parent(s): 91e99b3

Clarify and wrap.

Files changed (1)
  1. README.md +23 -3
README.md CHANGED
@@ -15,16 +15,36 @@ The HelloSippyRT model is designed to adapt Microsoft's SpeechT5 Text-to-Speech
 
 ## Problem Statement
 
- The original vocoder performs optimally only when provided with almost the full Text-To-Mel ("TTM") sequence at once. This is not ideal for real-time applications, where we aim to begin audio output quickly. Using smaller chunks results in "clicking" distortions between adjacent audio frames. Fine-tuning attempts on Microsoft's HiFiGAN vocoder were unsuccessful.
 
  ## Solution
 
- Our approach involves a smaller model that takes a fixed audio chunk of 8 Mel frames, two pre-frames, and two post-frames. These frames are processed along with the original vocoder's 12 audio frames of 256 bytes each. The model employs convolution input layers for both audio and Mel frames to generate hidden dimensions, followed by two linear layers and a final convolution layer. The output is then multiplied with the original 8 audio frames to produce corrected frames.
 
  ![HelloSippyRT Model Architecture](https://docs.google.com/drawings/d/e/2PACX-1vTiWxGbEB2MbvHpTJHS22abWNrSt2pHv6XijEDmnQFjAqBewMJyZBQ_5Y9k1P9INQPQmuq56MpLDzJt/pub?w=960&h=720)
 
 ## Training Details
 
- We trained the model using a subset of 3,000 audio utterances from the `LJSpeech-1.1` dataset. The original SpeechT5 TTS module generated the voice using speakers randomly selected from the `Matthijs/cmu-arctic-xvectors` dataset. During training, the original vocoder was locked; only our model was fine-tuned to mimic the original vocoder as closely as possible in continuous mode.
 
  ## Source Code & Links
 
 ## Problem Statement
 
+ The original vocoder performs optimally only when provided with almost the full Mel sequence produced from a single
+ text input at once. This is not ideal for real-time applications, where we aim to begin audio output quickly.
+ Using smaller chunks results in "clicking" distortions between adjacent audio frames.
+ Fine-tuning attempts on Microsoft's HiFiGAN vocoder were unsuccessful.
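To make the failure mode concrete, the snippet below contrasts off-line and chunk-by-chunk use of the stock vocoder. It is an illustration only, not code from this repository: the 8-frame chunk size matches the solution described below, and the random tensor merely stands in for a real Text-To-Mel output.

```python
import torch
from transformers import SpeechT5HifiGan

vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
mels = torch.randn(200, 80)  # stand-in for a Text-To-Mel output (frames x Mel bins)

# Off-line mode: the whole Mel sequence in one call gives clean audio at the cost of latency.
clean_audio = vocoder(mels)

# "Real-time" mode: small fixed chunks give low latency, but audible clicks at the seams.
chunks = [vocoder(mels[i:i + 8]) for i in range(0, mels.shape[0], 8)]
chunked_audio = torch.cat(chunks)
```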
 
 ## Solution
 
+ Our approach involves a smaller model that takes a fixed chunk of 8 Mel frames plus two pre-frames and two post-frames.
+ These frames are processed along with the original vocoder's corresponding 12 audio frames of 256 bytes each. The model
+ employs convolutional input layers for both the audio and the Mel frames to generate hidden dimensions, followed by two
+ linear layers and a final convolution layer. The output is then multiplied with the original 8 audio frames to produce
+ corrected frames.
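For readers who prefer code to prose, here is a minimal PyTorch sketch of a post-vocoder with this shape. The hidden size, kernel widths, activations, the way the Mel and audio branches are merged, and the pre/target/post frame layout are illustrative assumptions rather than the actual HelloSippyRT hyper-parameters; see the linked source code for the real implementation.

```python
import torch
import torch.nn as nn

class ChunkPostVocoder(nn.Module):
    """Sketch: 12 Mel frames plus 12 vocoder audio frames in, 8 corrected audio frames out."""

    def __init__(self, n_mels=80, frame_size=256, hidden=512):
        super().__init__()
        # Convolutional input layers for the Mel and audio branches.
        self.mel_in = nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1)
        self.aud_in = nn.Conv1d(frame_size, hidden, kernel_size=3, padding=1)
        # Two linear layers over the merged hidden features.
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Final convolution back to per-frame sample space.
        self.out = nn.Conv1d(hidden, frame_size, kernel_size=3, padding=1)

    def forward(self, mels, audio):
        # mels:  (batch, 12, n_mels)      assumed layout: 2 pre + 8 target + 2 post frames
        # audio: (batch, 12, frame_size)  vocoder output for the same 12 frames
        m = self.mel_in(mels.transpose(1, 2))         # (batch, hidden, 12)
        a = self.aud_in(audio.transpose(1, 2))        # (batch, hidden, 12)
        h = torch.cat((m, a), dim=1).transpose(1, 2)  # (batch, 12, 2 * hidden)
        h = self.fc(h).transpose(1, 2)                # (batch, hidden, 12)
        gain = self.out(h).transpose(1, 2)            # (batch, 12, frame_size)
        # Apply the predicted correction multiplicatively to the 8 target frames.
        return audio[:, 2:10, :] * gain[:, 2:10, :]
```

As described above, the model predicts a correction that is multiplied into the vocoder's own output for the 8 target frames rather than generating audio from scratch.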
+
  ![HelloSippyRT Model Architecture](https://docs.google.com/drawings/d/e/2PACX-1vTiWxGbEB2MbvHpTJHS22abWNrSt2pHv6XijEDmnQFjAqBewMJyZBQ_5Y9k1P9INQPQmuq56MpLDzJt/pub?w=960&h=720)
 
 ## Training Details
 
+ We trained the model using a subset of 3,000 audio utterances from the `LJSpeech-1.1` dataset. SpeechT5's Speech-To-Speech
+ module was employed to replace the voice in each utterance with the voice of a speaker randomly selected from the
+ `Matthijs/cmu-arctic-xvectors` dataset. The resulting reference Mel spectrograms were used to feed the vocoder and the
+ post-vocoder in chunks. The FFT of the reference waveform generated in "continuous" mode was used as the basis for the
+ loss-function calculation.
+
+ During training, the original vocoder was kept frozen; only our model was trained, to mimic the original vocoder's
+ continuous-mode output as closely as possible.
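A bare-bones sketch of such an FFT-based comparison is shown below; the STFT parameters and the log-magnitude L1 distance are assumptions made for illustration, not the exact loss used during training.

```python
import torch

def spectral_loss(chunked_wave: torch.Tensor, reference_wave: torch.Tensor,
                  n_fft: int = 1024, hop: int = 256) -> torch.Tensor:
    """Compare the chunk-by-chunk output against the frozen vocoder's
    continuous-mode reference waveform in the frequency domain."""
    window = torch.hann_window(n_fft, device=chunked_wave.device)
    s_out = torch.stft(chunked_wave, n_fft, hop_length=hop, window=window,
                       return_complex=True).abs().clamp_min(1e-7)
    s_ref = torch.stft(reference_wave, n_fft, hop_length=hop, window=window,
                       return_complex=True).abs().clamp_min(1e-7)
    return torch.mean(torch.abs(s_out.log() - s_ref.log()))
```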
+
+ ## Evaluation
+
+ The model has been evaluated by producing TTS output from pure text input, using quotes from "Futurama", "The Matrix" and
+ "2001: A Space Odyssey" retrieved from the Wikiquote site, with both purely random speaker vectors and vectors from the
+ `Matthijs/cmu-arctic-xvectors` dataset. The quality of the output has been found satisfactory for our particular use.
 
 ## Source Code & Links