w11wo committed · Commit aca26e1 · 1 Parent(s): 524db8e

Added Model

README.md CHANGED
@@ -1,3 +1,211 @@
1
  ---
2
+ language: sw
3
  license: cc-by-sa-4.0
4
+ tags:
5
+ - tensorflowtts
6
+ - audio
7
+ - text-to-speech
8
+ - mel-to-wav
9
+ inference: false
10
+ datasets:
11
+ - bookbot/OpenBible_Swahili
12
  ---
13
+
14
+ # MB-MelGAN HiFi PostNets SW v1
15
+
16
+ MB-MelGAN HiFi PostNets SW v1 is a mel-to-wav model based on the [MB-MelGAN](https://arxiv.org/abs/2005.05106) architecture with a [HiFi-GAN](https://arxiv.org/abs/2010.05646) discriminator. This model was trained from scratch on a synthetic audio dataset. Instead of training on ground-truth waveform spectrograms, this model was trained on the generated PostNet spectrograms of [LightSpeech MFA SW v1](https://huggingface.co/bookbot/lightspeech-mfa-sw-v1). The list of real speakers includes:
17
+
18
+ - sw-KE-OpenBible
19
+
20
+ This model was trained using the [TensorFlowTTS](https://github.com/TensorSpeech/TensorFlowTTS) framework. All training was done on a Scaleway RENDER-S VM with a Tesla P100 GPU. All necessary scripts used for training can be found in this [GitHub fork](https://github.com/bookbot-hive/TensorFlowTTS), as well as the [training metrics](https://huggingface.co/bookbot/mb-melgan-hifi-postnets-sw-v1/tensorboard) logged via TensorBoard.
21
+
22
+ ## Model
23
+
24
+ | Model | Config | SR (Hz) | Mel range (Hz) | FFT / Hop / Win (pt) | #steps |
25
+ | ------------------------------- | ----------------------------------------------------------------------------------------- | ------- | -------------- | -------------------- | ------ |
26
+ | `mb-melgan-hifi-postnets-sw-v1` | [Link](https://huggingface.co/bookbot/mb-melgan-hifi-postnets-sw-v1/blob/main/config.yml) | 44.1K | 20-11025 | 2048 / 512 / None | 1M |
27
+
28
+ ## Training Procedure
29
+
30
+ <details>
31
+ <summary>Feature Extraction Setting</summary>
32
+
33
+ sampling_rate: 44100
34
+ hop_size: 512 # Hop size.
35
+ format: "npy"
36
+
37
+ </details>
38
+
39
+ <details>
40
+ <summary>Generator Network Architecture Setting</summary>
41
+
42
+ model_type: "multiband_melgan_generator"
43
+
44
+ multiband_melgan_generator_params:
45
+ out_channels: 4 # Number of output channels (number of subbands).
46
+ kernel_size: 7 # Kernel size of initial and final conv layers.
47
+ filters: 384 # Initial number of channels for conv layers.
48
+ upsample_scales: [8, 4, 4] # List of Upsampling scales.
49
+ stack_kernel_size: 3 # Kernel size of dilated conv layers in residual stack.
50
+ stacks: 4 # Number of stacks in a single residual stack module.
51
+ is_weight_norm: false # Use weight-norm or not.
52
+
53
+ </details>
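
The generator settings above tie back to the feature extraction hop size. As a minimal sketch, assuming (as in the MB-MelGAN design) that the subbands are merged into a full-band waveform by a factor equal to the number of subbands, the product of `upsample_scales` times `out_channels` should equal `hop_size`. The helper name `samples_per_mel_frame` is hypothetical and not part of TensorFlowTTS.

```python
from math import prod

# Values copied from the generator and feature extraction settings above.
upsample_scales = [8, 4, 4]
out_channels = 4  # number of subbands
hop_size = 512


def samples_per_mel_frame(scales, subbands):
    """Total upsampling factor: product of the scales times the subband count."""
    return prod(scales) * subbands


total = samples_per_mel_frame(upsample_scales, out_channels)
assert total == hop_size, f"generator upsamples by {total}, expected {hop_size}"
print(total)  # 512
```

If the two numbers disagree, the generated waveform would not line up with the mel frames it was conditioned on.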
54
+
55
+ <details>
56
+ <summary>Discriminator Network Architecture Setting</summary>
57
+
58
+ multiband_melgan_discriminator_params:
59
+ out_channels: 1 # Number of output channels.
60
+ scales: 3 # Number of multi-scales.
61
+ downsample_pooling: "AveragePooling1D" # Pooling type for the input downsampling.
62
+ downsample_pooling_params: # Parameters of the above pooling function.
63
+ pool_size: 4
64
+ strides: 2
65
+ kernel_sizes: [5, 3] # List of kernel size.
66
+ filters: 16 # Number of channels of the initial conv layer.
67
+ max_downsample_filters: 512 # Maximum number of channels of downsampling layers.
68
+ downsample_scales: [4, 4, 4] # List of downsampling scales.
69
+ nonlinear_activation: "LeakyReLU" # Nonlinear activation function.
70
+ nonlinear_activation_params: # Parameters of nonlinear activation function.
71
+ alpha: 0.2
72
+ is_weight_norm: false # Use weight-norm or not.
73
+
74
+ hifigan_discriminator_params:
75
+ out_channels: 1 # Number of output channels (number of subbands).
76
+ period_scales: [3, 5, 7, 11, 17, 23, 37] # List of period scales.
77
+ n_layers: 5 # Number of layer of each period discriminator.
78
+ kernel_size: 5 # Kernel size.
79
+ strides: 3 # Strides
80
+ filters: 8 # In Conv filters of each period discriminator
81
+ filter_scales: 4 # Filter scales.
82
+ max_filters: 512 # maximum filters of period discriminator's conv.
83
+ is_weight_norm: false # Use weight-norm or not.
84
+
85
+ </details>
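
The `period_scales` list above configures the HiFi-GAN style period discriminators. The sketch below illustrates the idea described in the HiFi-GAN paper: a 1-D waveform is padded to a multiple of the period and folded into a 2-D array so that each column holds samples spaced one period apart. It uses NumPy only; the reflect padding and the helper name `fold_by_period` are assumptions for illustration and may differ from what TensorFlowTTS does internally.

```python
import numpy as np

period_scales = [3, 5, 7, 11, 17, 23, 37]  # from hifigan_discriminator_params


def fold_by_period(audio, period):
    """Reflect-pad a 1-D signal to a multiple of `period`, then fold it to 2-D."""
    remainder = len(audio) % period
    if remainder:
        audio = np.pad(audio, (0, period - remainder), mode="reflect")
    return audio.reshape(-1, period)


audio = np.arange(8192, dtype=np.float32)  # stand-in for one training crop
for p in period_scales:
    folded = fold_by_period(audio, p)
    print(p, folded.shape)
```

Because the periods are all prime, each discriminator sees a different, largely non-overlapping periodic view of the same waveform.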
86
+
87
+ <details>
88
+ <summary>STFT Loss Setting</summary>
89
+
90
+ stft_loss_params:
91
+ fft_lengths: [1024, 2048, 512] # List of FFT size for STFT-based loss.
92
+ frame_steps: [120, 240, 50] # List of hop size for STFT-based loss
93
+ frame_lengths: [600, 1200, 240] # List of window length for STFT-based loss.
94
+
95
+ subband_stft_loss_params:
96
+ fft_lengths: [384, 683, 171] # List of FFT size for STFT-based loss.
97
+ frame_steps: [30, 60, 10] # List of hop size for STFT-based loss
98
+ frame_lengths: [150, 300, 60] # List of window length for STFT-based loss.
99
+
100
+ </details>
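
To make the three `stft_loss_params` resolutions concrete, here is a rough NumPy sketch of a multi-resolution STFT loss in the style of Parallel WaveGAN: a spectral convergence term plus a log-magnitude term, averaged over the resolutions. This is an illustration under stated assumptions, not the TensorFlowTTS implementation: the window (`np.hanning`), the `eps` value, the equal weighting, and the function names are all choices made for this example.

```python
import numpy as np

# (fft_length, frame_step, frame_length) triples from stft_loss_params above.
RESOLUTIONS = [(1024, 120, 600), (2048, 240, 1200), (512, 50, 240)]


def stft_magnitude(x, fft_length, frame_step, frame_length):
    """Magnitude STFT with a Hann window; frames are zero-padded to fft_length."""
    n_frames = 1 + (len(x) - frame_length) // frame_step
    idx = np.arange(frame_length)[None, :] + frame_step * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(frame_length)
    return np.abs(np.fft.rfft(frames, n=fft_length, axis=-1))


def multi_resolution_stft_loss(y_true, y_pred, eps=1e-7):
    """Average spectral-convergence + log-magnitude loss over all resolutions."""
    total = 0.0
    for fft_length, frame_step, frame_length in RESOLUTIONS:
        m_true = stft_magnitude(y_true, fft_length, frame_step, frame_length)
        m_pred = stft_magnitude(y_pred, fft_length, frame_step, frame_length)
        sc = np.linalg.norm(m_true - m_pred) / (np.linalg.norm(m_true) + eps)
        mag = np.mean(np.abs(np.log(m_true + eps) - np.log(m_pred + eps)))
        total += sc + mag
    return total / len(RESOLUTIONS)


rng = np.random.default_rng(0)
target = rng.standard_normal(8192)
print(multi_resolution_stft_loss(target, target))  # 0.0 for identical signals
print(multi_resolution_stft_loss(target, 0.5 * target) > 0)  # True
```

The subband variant would apply the same computation to each of the 4 subband signals using the smaller `subband_stft_loss_params` sizes.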
101
+
102
+ <details>
103
+ <summary>Adversarial Loss Setting</summary>
104
+
105
+ lambda_feat_match: 10.0 # Loss balancing coefficient for feature matching loss
106
+ lambda_adv: 2.5 # Loss balancing coefficient for adversarial loss.
107
+
108
+ </details>
109
+
110
+ <details>
111
+ <summary>Data Loader Setting</summary>
112
+
113
+ batch_size: 32 # Batch size for each GPU, assuming that gradient_accumulation_steps == 1.
114
+ eval_batch_size: 16
115
+ batch_max_steps: 8192 # Length of each audio in batch for training. Must be divisible by hop_size.
116
+ batch_max_steps_valid: 8192 # Length of each audio for validation. Must be divisible by hop_size.
117
+ remove_short_samples: true # Whether to remove samples shorter than batch_max_steps.
118
+ allow_cache: false # Whether to allow cache in dataset. If true, it requires cpu memory.
119
+ is_shuffle: false # shuffle dataset after each epoch.
120
+
121
+ </details>
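
The divisibility requirement on `batch_max_steps` can be checked with a few lines of Python. The sketch below uses the values from the settings above and also reports how many mel frames and how many seconds one training crop covers. The helper name `crop_stats` is hypothetical.

```python
sampling_rate = 44100
hop_size = 512
batch_max_steps = 8192


def crop_stats(n_samples, hop, sr):
    """Return (mel_frames, seconds) for a crop, rejecting lengths not divisible by hop."""
    if n_samples % hop != 0:
        raise ValueError(f"{n_samples} is not divisible by hop_size={hop}")
    return n_samples // hop, n_samples / sr


frames, seconds = crop_stats(batch_max_steps, hop_size, sampling_rate)
print(frames, round(seconds, 4))  # 16 0.1858
```

So each training example is a short window of 16 mel frames, a little under 0.2 seconds of audio.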
122
+
123
+ <details>
124
+ <summary>Optimizer & Scheduler Setting</summary>
125
+
126
+ generator_optimizer_params:
127
+ lr_fn: "PiecewiseConstantDecay"
128
+ lr_params:
129
+ boundaries: [100000, 200000, 300000, 400000, 500000, 600000, 700000]
130
+ values:
131
+ [
132
+ 0.0005,
133
+ 0.0005,
134
+ 0.00025,
135
+ 0.000125,
136
+ 0.0000625,
137
+ 0.00003125,
138
+ 0.000015625,
139
+ 0.000001,
140
+ ]
141
+ amsgrad: false
142
+
143
+ discriminator_optimizer_params:
144
+ lr_fn: "PiecewiseConstantDecay"
145
+ lr_params:
146
+ boundaries: [100000, 200000, 300000, 400000, 500000]
147
+ values: [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
148
+ amsgrad: false
149
+
150
+ gradient_accumulation_steps: 1
151
+
152
+ </details>
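
The `PiecewiseConstantDecay` schedules above can be read as a simple lookup: `values[i]` applies while the step is at most `boundaries[i]`, and the last value applies afterwards. Here is a small pure-Python sketch of that lookup, following the documented Keras semantics. The function name `piecewise_constant` is made up for this example and is not the Keras implementation.

```python
from bisect import bisect_left

# Schedules copied from the optimizer settings above.
GEN_BOUNDARIES = [100000, 200000, 300000, 400000, 500000, 600000, 700000]
GEN_VALUES = [0.0005, 0.0005, 0.00025, 0.000125,
              0.0000625, 0.00003125, 0.000015625, 0.000001]
DISC_BOUNDARIES = [100000, 200000, 300000, 400000, 500000]
DISC_VALUES = [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]


def piecewise_constant(step, boundaries, values):
    """Learning rate at `step`: values[i] while step <= boundaries[i], else the last value."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("values must have exactly one more entry than boundaries")
    return values[bisect_left(boundaries, step)]


for step in (0, 200000, 200001, 700001, 1000000):
    print(step, piecewise_constant(step, GEN_BOUNDARIES, GEN_VALUES))
```

Read this way, the generator keeps its initial rate of 0.0005 for the first 200,000 steps and then halves it every 100,000 steps until it drops to 0.000001 after step 700,000.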
153
+
154
+ <details>
155
+ <summary>Interval Setting</summary>
156
+
157
+ discriminator_train_start_steps: 200000 # Step at which discriminator training begins.
158
+ train_max_steps: 1000000 # Number of training steps.
159
+ save_interval_steps: 20000 # Interval steps to save checkpoint.
160
+ eval_interval_steps: 5000 # Interval steps to evaluate the network.
161
+ log_interval_steps: 200 # Interval steps to record the training log.
162
+
163
+ </details>
164
+
165
+ <details>
166
+ <summary>Other Setting</summary>
167
+
168
+ num_save_intermediate_results: 1 # Number of batches to be saved as intermediate results.
169
+
170
+ </details>
171
+
172
+ ## How to Use
173
+
174
+ ```py
175
+ import soundfile as sf
176
+ import tensorflow as tf
177
+ from tensorflow_tts.inference import TFAutoModel, AutoProcessor
178
+
179
+ lightspeech = TFAutoModel.from_pretrained("bookbot/lightspeech-mfa-sw-v1")
180
+ processor = AutoProcessor.from_pretrained("bookbot/lightspeech-mfa-sw-v1")
181
+ mb_melgan = TFAutoModel.from_pretrained("bookbot/mb-melgan-hifi-postnets-sw-v1")
182
+
183
+ text, speaker_name = "Hello World.", "sw-KE-OpenBible"
184
+ input_ids = processor.text_to_sequence(text)
185
+
186
+ mel, _, _ = lightspeech.inference(
187
+ input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
188
+ speaker_ids=tf.convert_to_tensor(
189
+ [processor.speakers_map[speaker_name]], dtype=tf.int32
190
+ ),
191
+ speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
192
+ f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
193
+ energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
194
+ )
195
+
196
+ audio = mb_melgan.inference(mel)[0, :, 0]
197
+ sf.write("./audio.wav", audio, 44100, "PCM_16")
198
+ ```
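
The usage example above writes the vocoder output as 16-bit PCM, which can only represent samples in the range [-1, 1]. The README does not say whether this model's output ever leaves that range, so the following is only a defensive sketch: a hypothetical helper, `prepare_for_pcm16`, that clips the signal and reports its duration before it is passed to `sf.write`. It assumes the audio can be converted with `np.asarray`, and it uses a synthetic sine wave in place of real model output.

```python
import numpy as np


def prepare_for_pcm16(audio, sampling_rate=44100):
    """Clip float audio to [-1, 1] and return it with its duration in seconds."""
    audio = np.asarray(audio, dtype=np.float32)
    clipped = np.clip(audio, -1.0, 1.0)
    return clipped, len(clipped) / sampling_rate


# Stand-in for `mb_melgan.inference(mel)[0, :, 0]`: one second of a too-loud sine.
t = np.arange(44100) / 44100
fake_audio = 1.2 * np.sin(2 * np.pi * 220.0 * t)
safe_audio, seconds = prepare_for_pcm16(fake_audio)
print(safe_audio.max() <= 1.0, seconds)  # True 1.0
```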
199
+
200
+ ## Disclaimer
201
+
202
+ Do consider the biases that come from the pre-training datasets and may carry over into the results of this model.
203
+
204
+ ## Authors
205
+
206
+ MB-MelGAN HiFi PostNets SW v1 was trained and evaluated by [David Samuel Setiawan](https://davidsamuell.github.io/) and [Wilson Wongso](https://wilsonwongso.dev/). All computation and development were done on Scaleway.
207
+
208
+ ## Framework versions
209
+
210
+ - TensorFlowTTS 1.8
211
+ - TensorFlow 2.7.0
config.yml ADDED
@@ -0,0 +1,144 @@
1
+ allow_cache: false
2
+ batch_max_steps: 8192
3
+ batch_max_steps_valid: 8192
4
+ batch_size: 32
5
+ config: ./TensorFlowTTS/examples/multiband_melgan_hf/conf/multiband_melgan_hf.sw.v1.yml
6
+ dev_dir: ./dump/valid/
7
+ discriminator_mixed_precision: false
8
+ discriminator_optimizer_params:
9
+ amsgrad: false
10
+ lr_fn: PiecewiseConstantDecay
11
+ lr_params:
12
+ boundaries:
13
+ - 100000
14
+ - 200000
15
+ - 300000
16
+ - 400000
17
+ - 500000
18
+ values:
19
+ - 0.00025
20
+ - 0.000125
21
+ - 6.25e-05
22
+ - 3.125e-05
23
+ - 1.5625e-05
24
+ - 1.0e-06
25
+ discriminator_train_start_steps: 200000
26
+ eval_batch_size: 16
27
+ eval_interval_steps: 5000
28
+ format: npy
29
+ generator_mixed_precision: false
30
+ generator_optimizer_params:
31
+ amsgrad: false
32
+ lr_fn: PiecewiseConstantDecay
33
+ lr_params:
34
+ boundaries:
35
+ - 100000
36
+ - 200000
37
+ - 300000
38
+ - 400000
39
+ - 500000
40
+ - 600000
41
+ - 700000
42
+ values:
43
+ - 0.0005
44
+ - 0.0005
45
+ - 0.00025
46
+ - 0.000125
47
+ - 6.25e-05
48
+ - 3.125e-05
49
+ - 1.5625e-05
50
+ - 1.0e-06
51
+ gradient_accumulation_steps: 1
52
+ hifigan_discriminator_params:
53
+ filter_scales: 4
54
+ filters: 8
55
+ is_weight_norm: false
56
+ kernel_size: 5
57
+ max_filters: 512
58
+ n_layers: 5
59
+ out_channels: 1
60
+ period_scales:
61
+ - 3
62
+ - 5
63
+ - 7
64
+ - 11
65
+ - 17
66
+ - 23
67
+ - 37
68
+ strides: 3
69
+ hop_size: 512
70
+ is_shuffle: false
71
+ lambda_adv: 2.5
72
+ lambda_feat_match: 10.0
73
+ log_interval_steps: 200
74
+ model_type: multiband_melgan_generator
75
+ multiband_melgan_discriminator_params:
76
+ downsample_pooling: AveragePooling1D
77
+ downsample_pooling_params:
78
+ pool_size: 4
79
+ strides: 2
80
+ downsample_scales:
81
+ - 4
82
+ - 4
83
+ - 4
84
+ filters: 16
85
+ is_weight_norm: false
86
+ kernel_sizes:
87
+ - 5
88
+ - 3
89
+ max_downsample_filters: 512
90
+ nonlinear_activation: LeakyReLU
91
+ nonlinear_activation_params:
92
+ alpha: 0.2
93
+ out_channels: 1
94
+ scales: 3
95
+ multiband_melgan_generator_params:
96
+ filters: 384
97
+ is_weight_norm: false
98
+ kernel_size: 7
99
+ out_channels: 4
100
+ stack_kernel_size: 3
101
+ stacks: 4
102
+ upsample_scales:
103
+ - 8
104
+ - 4
105
+ - 4
106
+ num_save_intermediate_results: 1
107
+ outdir: ./mb-melgan-hifi-openbible/
108
+ postnets: true
109
+ pretrained: ''
110
+ remove_short_samples: true
111
+ resume: ./mb-melgan-hifi-openbible/checkpoints/ckpt-200000
112
+ sampling_rate: 44100
113
+ save_interval_steps: 20000
114
+ stft_loss_params:
115
+ fft_lengths:
116
+ - 1024
117
+ - 2048
118
+ - 512
119
+ frame_lengths:
120
+ - 600
121
+ - 1200
122
+ - 240
123
+ frame_steps:
124
+ - 120
125
+ - 240
126
+ - 50
127
+ subband_stft_loss_params:
128
+ fft_lengths:
129
+ - 384
130
+ - 683
131
+ - 171
132
+ frame_lengths:
133
+ - 150
134
+ - 300
135
+ - 60
136
+ frame_steps:
137
+ - 30
138
+ - 60
139
+ - 10
140
+ train_dir: ./dump/train/
141
+ train_max_steps: 1000000
142
+ use_norm: true
143
+ verbose: 1
144
+ version: '0.0'
events.out.tfevents.1712492951.bookbot.2383.0.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e78086c9843c9a147e1e535927021e074ee575118c834450d76a1ec21c46054a
3
+ size 793484
events.out.tfevents.1712507974.bookbot.10184.0.v2 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7a5aebf1dd4ad8ef70a29bd05ff957ea3c973b8eea5ce72493c5d497d0badaed
3
+ size 3176840
model.h5 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2bcea759225558557ff98c7ce2b8d98936bcf97eadc6952297e99791c64d5219
3
+ size 10308488