Introduction to configurations

Dataset

dataset: # the dataset part is for training only
  train:
    wav_scp: './train/wav.scp'
    mel_scp: './train/mel.scp'
    dur_scp: './train/dur.scp'
    emb_type1:
      _name: 'pinyin'
      scp: './train/py.scp'
      vocab: 'py.vocab'
    emb_type2:
      _name: 'graphic'
      scp: './train/gp.scp'
      vocab: 'gp.vocab'
    # emb_type3:
    #   _name: 'speaker'
    #   scp: './train/spk.scp'
    #   vocab:  # doesn't need a vocab
    emb_type4:
      _name: 'prosody'
      scp: './train/psd.scp'
      vocab:
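
Each `emb_type*` entry follows the same pattern: a `_name`, an `scp` file, and an optional `vocab`. The sketch below shows one way such entries could be enumerated. It assumes the YAML has already been parsed into a plain dict; the helper name `list_embeddings` is hypothetical and not part of the project.

```python
def list_embeddings(train_cfg):
    """Return (name, scp, vocab) for every emb_type* entry in a train config."""
    entries = []
    for key in sorted(train_cfg):
        if not key.startswith("emb_type"):
            continue  # skip wav_scp, mel_scp, dur_scp
        emb = train_cfg[key]
        # An empty `vocab:` in YAML parses to None.
        entries.append((emb["_name"], emb["scp"], emb.get("vocab")))
    return entries


# Dict mirroring the `train` section above, as a YAML parser would return it.
# The commented-out emb_type3 is simply absent.
train_cfg = {
    "wav_scp": "./train/wav.scp",
    "mel_scp": "./train/mel.scp",
    "dur_scp": "./train/dur.scp",
    "emb_type1": {"_name": "pinyin", "scp": "./train/py.scp", "vocab": "py.vocab"},
    "emb_type2": {"_name": "graphic", "scp": "./train/gp.scp", "vocab": "gp.vocab"},
    "emb_type4": {"_name": "prosody", "scp": "./train/psd.scp", "vocab": None},
}

for name, scp, vocab in list_embeddings(train_cfg):
    print(name, scp, vocab)
```

Commenting an entry out of the YAML, as done for `emb_type3`, removes it from this list without any other change.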

Vocoder

vocoder:
  type: VocGan # one of: MelGAN, VocGan, HiFiGAN, Waveglow
  MelGAN:
    checkpoint: ~/checkpoints/melgan/melgan_ljspeech.pth
    config: ~/checkpoints/melgan/default.yaml
    device: cpu
  VocGan:
    checkpoint: ~/checkpoints/vctk_pretrained_model_3180.pt #~/checkpoints/ljspeech_29de09d_4000.pt
    denoise: True
    device: cpu
  HiFiGAN:
    checkpoint: ~/checkpoints/VCTK_V3/generator_v3  # download the checkpoint and set its path here
    device: cpu
  Waveglow:
    checkpoint:  ~/checkpoints/waveglow_256channels_universal_v5_state_dict.pt
    sigma: 1.0
    denoiser_strength: 0.0 # try 0.1
    device: cpu # try cpu if out of memory
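
Only the block named by `type` is used; the other blocks can stay in the file as alternatives. The sketch below illustrates that selection on an already-parsed dict. The helper `select_vocoder` is hypothetical, and expanding `~` with `os.path.expanduser` is an assumption about how the paths might be resolved, not a description of the project's loader.

```python
import os


def select_vocoder(vocoder_cfg):
    """Return (type, params) for the vocoder block named by `type`."""
    voc_type = vocoder_cfg["type"]
    if voc_type == "type" or voc_type not in vocoder_cfg:
        raise KeyError(f"no settings found for vocoder type {voc_type!r}")
    params = dict(vocoder_cfg[voc_type])  # copy so the config is not mutated
    # Expand a leading `~` so the paths can be opened directly (assumption).
    for key in ("checkpoint", "config"):
        if params.get(key):
            params[key] = os.path.expanduser(params[key])
    return voc_type, params


# Dict mirroring part of the `vocoder` section above.
vocoder_cfg = {
    "type": "VocGan",
    "MelGAN": {
        "checkpoint": "~/checkpoints/melgan/melgan_ljspeech.pth",
        "config": "~/checkpoints/melgan/default.yaml",
        "device": "cpu",
    },
    "VocGan": {
        "checkpoint": "~/checkpoints/vctk_pretrained_model_3180.pt",
        "denoise": True,
        "device": "cpu",
    },
}

voc_type, params = select_vocoder(vocoder_cfg)
print(voc_type, params["device"], params["checkpoint"])
```

Switching vocoders then only requires changing the `type` value, provided the matching block has a valid checkpoint path.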

Make your own changes

Two config files are provided in the examples for illustration purposes. You can change the config file if you know what you are doing. For example, you can remove speaker_emb from the following section, or add a prosody embedding if you have prosody labels (as in the Biaobei dataset).