Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS
- This project targets beginners in deep learning; basic proficiency with Python and PyTorch is a prerequisite for using it.
- This project aims to help deep learning beginners move past dry, purely theoretical study and master the basics by combining them with practice.
- This project does not support real-time voice conversion (supporting it would require replacing whisper).
- This project will not develop one-click packages for other purposes.
- a GPU with 6 GB of memory is enough for training
- supports multiple speakers
- creates unique speakers through speaker mixing
- audio with light accompaniment can still be converted
- F0 can be edited using Excel
Model properties
https://github.com/PlayVoice/so-vits-svc-5.0/releases/tag/hifigan_release
- sovits5.0_main_1500.pth: the model includes generator + discriminator (176M) and can be used as a pre-trained model
- speaker files are in the configs/singers directory and can be used for inference tests, especially for checking timbre leakage
- speakers 22, 30, 47, and 51 are highly recognizable; their training audio samples are in the configs/singers_sample directory
| Feature | From | Status | Function | Remarks |
| --- | --- | --- | --- | --- |
| whisper | OpenAI | ✅ | strong noise immunity | - |
| bigvgan | NVIDIA | ✅ | anti-aliasing and snake activation | uses a bit more GPU memory; removed from the main branch, so switch to the bigvgan branch; formants are clearer and sound quality is noticeably improved |
| natural speech | Microsoft | ✅ | reduce mispronunciation | - |
| neural source-filter | NII | ✅ | solve the problem of audio F0 discontinuity | - |
| speaker encoder | - | ✅ | timbre encoding and clustering | - |
| GRL for speaker | Ubisoft | ✅ | prevent the encoder from leaking timbre | - |
| one shot vits | Samsung | ✅ | voice cloning | - |
| SCLN | Microsoft | ✅ | improve cloning | - |
| PPG perturbation | this project | ✅ | improved noise immunity and de-timbre | - |
| VAE perturbation | this project | ✅ | improve sound quality | - |
Note: because data perturbation is used, training takes longer than in other projects.
Dataset preparation
Necessary pre-processing:
- 1 separate the accompaniment
- 2 band extension
- 3 sound quality improvement
- 4 split the audio into clips of less than 30 seconds, as whisper requires (a splitting sketch follows)
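If your recordings run longer than 30 seconds, here is a minimal splitting sketch; pydub is an assumption (any audio tool works), and the paths are illustrative:

```python
from pydub import AudioSegment

# split one long take into 25-second clips, safely under whisper's 30 s limit
audio = AudioSegment.from_wav("long_take.wav")
clip_ms = 25 * 1000  # pydub slices by milliseconds
for i, start in enumerate(range(0, len(audio), clip_ms)):
    clip = audio[start:start + clip_ms]
    clip.export(f"dataset_raw/speaker0/{i + 1:06d}.wav", format="wav")
```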
Then put the dataset into the dataset_raw directory with the following file structure:
dataset_raw
├── speaker0
│   ├── 000001.wav
│   ├── ...
│   └── 000xxx.wav
└── speaker1
    ├── 000001.wav
    ├── ...
    └── 000xxx.wav
Install dependencies
1 software dependency
sudo apt update && sudo apt install ffmpeg
pip install -r requirements.txt
2 download the timbre encoder: Speaker-Encoder by @mueller91, and put best_model.pth.tar into speaker_pretrain/
3 download the whisper multilingual medium model; make sure to download medium.pt and put it into whisper_pretrain/
4 whisper is built into this project; do not install it separately, or it will conflict and raise errors
Data preprocessing
1) set the working directory:
export PYTHONPATH=$PWD
2) resample (a one-file sketch follows these two commands)
generate audio with a sampling rate of 16000 Hz into ./data_svc/waves-16k:
python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000
generate audio with a sampling rate of 32000 Hz into ./data_svc/waves-32k:
python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000
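Under the hood this is plain resampling. A minimal sketch for a single file, assuming librosa and soundfile (prepare/preprocess_a.py batches this over the whole tree):

```python
import librosa
import soundfile as sf

# load as mono at the target rate, then write it back out
y, sr = librosa.load("dataset_raw/speaker0/000001.wav", sr=16000, mono=True)
sf.write("data_svc/waves-16k/speaker0/000001.wav", y, sr)
```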
3) use the 16k audio to extract pitch; f0_ceil=900 should be adjusted to the highest pitch in your data (see the sketch after these two commands)
python prepare/preprocess_f0.py -w data_svc/waves-16k/ -p data_svc/pitch
or use the following for low-quality audio:
python prepare/preprocess_f0_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch
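For reference, a minimal sketch of where f0_ceil enters a DIO-based extractor; pyworld is an assumption here, and the project's preprocess_f0.py may use a different extractor:

```python
import numpy as np
import librosa
import pyworld

def extract_f0(wav_path, sr=16000, f0_ceil=900.0):
    # pyworld expects float64 samples
    x, _ = librosa.load(wav_path, sr=sr)
    x = x.astype(np.float64)
    # coarse F0 with DIO, refined with StoneMask;
    # raise f0_ceil if your singer goes above 900 Hz
    f0, t = pyworld.dio(x, sr, f0_floor=50.0, f0_ceil=f0_ceil)
    f0 = pyworld.stonemask(x, f0, t, sr)
    return f0  # one value per frame, 0 for unvoiced frames

np.save("data_svc/pitch/speaker0/000001.pit.npy",
        extract_f0("data_svc/waves-16k/speaker0/000001.wav"))
```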
4) use the 16k audio to extract the PPG:
python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper
5) use the 16k audio to extract the timbre code:
python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker
6) extract the average timbre code for inference; it can also replace the per-utterance timbre when generating the training index, acting as the speaker's unified timbre for training (an averaging sketch follows the command):
python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer
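Conceptually this is just a mean over the per-utterance embeddings. A minimal sketch for one speaker, assuming each .spk.npy file holds a single embedding vector:

```python
import numpy as np
from pathlib import Path

spk_dir = Path("data_svc/speaker/speaker0")
# stack every per-utterance embedding and average them into one timbre vector
embeds = np.stack([np.load(p) for p in sorted(spk_dir.glob("*.spk.npy"))])
np.save("data_svc/singer/speaker0.spk.npy", embeds.mean(axis=0))
```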
7) use the 32k audio to extract the linear spectrum (a spectrogram sketch follows the command):
python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs
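A linear spectrum is the magnitude of an STFT. A hedged sketch follows; the actual n_fft and hop values live in configs/base.yaml and may differ from the ones assumed here:

```python
import torch
import torchaudio

# load the 32 kHz audio and take the magnitude of its STFT
y, sr = torchaudio.load("data_svc/waves-32k/speaker0/000001.wav")
spec = torch.stft(
    y[0], n_fft=1024, hop_length=320,
    window=torch.hann_window(1024), return_complex=True,
).abs()  # shape: (n_fft // 2 + 1, frames)
```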
8) use the 32k audio to generate the training index:
python prepare/preprocess_train.py
9) debug the training files:
python prepare/preprocess_zzz.py
data_svc/
├── waves-16k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── waves-32k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── pitch
│   ├── speaker0
│   │   ├── 000001.pit.npy
│   │   └── 000xxx.pit.npy
│   └── speaker1
│       ├── 000001.pit.npy
│       └── 000xxx.pit.npy
├── whisper
│   ├── speaker0
│   │   ├── 000001.ppg.npy
│   │   └── 000xxx.ppg.npy
│   └── speaker1
│       ├── 000001.ppg.npy
│       └── 000xxx.ppg.npy
├── speaker
│   ├── speaker0
│   │   ├── 000001.spk.npy
│   │   └── 000xxx.spk.npy
│   └── speaker1
│       ├── 000001.spk.npy
│       └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy
Train
0) to fine-tune from the pre-trained model, first download it: sovits5.0_main_1500.pth
set pretrain: "./sovits5.0_main_1500.pth" in configs/base.yaml, and lower the learning rate accordingly, e.g. 1e-5
1) set the working directory:
export PYTHONPATH=$PWD
2) start training:
python svc_trainer.py -c configs/base.yaml -n sovits5.0
3) resume training:
python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/***.pth
4) view logs:
tensorboard --logdir logs/
Inference
1) set the working directory:
export PYTHONPATH=$PWD
2) export the inference model: text encoder, flow network, and decoder network:
python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt
3) use whisper to extract the content encoding; running it as a separate step rather than inside one-click inference reduces GPU memory usage:
python whisper/inference.py -w test.wav -p test.ppg.npy
this generates test.ppg.npy; if --ppg is not specified in step 5, the PPG is extracted automatically
4) extract the F0 parameters in CSV text format; open the CSV file in Excel and manually correct wrong F0 values with reference to Audition or SonicVisualiser (a programmatic alternative is sketched after the command):
python pitch/inference.py -w test.wav -p test.csv
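If Excel is inconvenient, obviously broken frames can also be zeroed out in code. A hedged sketch, assuming the CSV holds one F0 value per frame; check the actual format written by pitch/inference.py first:

```python
import numpy as np

# assumption: one F0 value per line; adapt if the CSV has more columns
f0 = np.loadtxt("test.csv", delimiter=",")
voiced = f0 > 0
# treat values far outside the singing range as extraction errors
f0[voiced & ((f0 < 50) | (f0 > 900))] = 0.0
np.savetxt("test.csv", f0, fmt="%.3f")
```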
5) specify the parameters and run inference:
python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./configs/singers/singer0001.npy --wave test.wav --ppg test.ppg.npy --pit test.csv
when --ppg is specified, inferring the same audio multiple times avoids re-extracting the content encoding; if it is not specified, the PPG is extracted automatically;
when --pit is specified, the manually tuned F0 parameters are loaded; if it is not specified, F0 is extracted automatically;
the output file svc_out.wav is generated in the current directory
| args | --config | --model | --spk | --wave | --ppg | --pit | --shift |
| --- | --- | --- | --- | --- | --- | --- | --- |
| name | config path | model path | speaker | wave input | wave ppg | wave pitch | pitch shift |
Create singer
named by pure coincidence: average -> ave -> eva; eve (eva) represents conception and reproduction
python svc_eva.py
eva_conf = {
'./configs/singers/singer0022.npy': 0,
'./configs/singers/singer0030.npy': 0,
'./configs/singers/singer0047.npy': 0.5,
'./configs/singers/singer0051.npy': 0.5,
}
the generated singer file is eva.spk.npy
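The weights above most likely act as a weighted sum of the speaker embeddings; a minimal sketch of that assumption (svc_eva.py is the authoritative implementation):

```python
import numpy as np

eva_conf = {
    './configs/singers/singer0022.npy': 0,
    './configs/singers/singer0030.npy': 0,
    './configs/singers/singer0047.npy': 0.5,
    './configs/singers/singer0051.npy': 0.5,
}
# weighted sum of the selected speaker embeddings
mix = sum(w * np.load(p) for p, w in eva_conf.items() if w > 0)
np.save("eva.spk.npy", mix.astype(np.float32))
```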
Note: both the flow and the decoder take a timbre input, and you can even feed different timbre parameters to the two modules to create more unique timbres.
Data set
Code sources and references
https://github.com/facebookresearch/speech-resynthesis paper
https://github.com/jaywalnut310/vits paper
https://github.com/openai/whisper/ paper
https://github.com/NVIDIA/BigVGAN paper
https://github.com/mindslab-ai/univnet paper
https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf
https://github.com/brentspell/hifi-gan-bwe
https://github.com/mozilla/TTS
https://github.com/OlaWod/FreeVC paper
Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers
AdaSpeech: Adaptive Text to Speech for Custom Voice
Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis
Speaker normalization (GRL) for self-supervised speech emotion recognition
Method of Preventing Timbre Leakage Based on Data Perturbation
https://github.com/auspicious3000/contentvec/blob/main/contentvec/data/audio/audio_utils_1.py
https://github.com/revsic/torch-nansy/blob/main/utils/augment/praat.py
https://github.com/revsic/torch-nansy/blob/main/utils/augment/peq.py
https://github.com/biggytruck/SpeechSplit2/blob/main/utils.py
https://github.com/OlaWod/FreeVC/blob/main/preprocess_sr.py