A small experiment I did with a subset of the moespeech JP dataset. The resulting models (GPT and SoVITS) are meant to be run with GPT-SoVITS.
I used 6 hours of audio for training. The selected audio samples were categorized into frequency bands between 100 and 500 Hz, in 50 Hz intervals, and each band received equal representation in the final dataset so that the model learns from a diverse range of voice frequencies. Samples shorter than 3 seconds or longer than 10 seconds were discarded due to GPT-SoVITS limitations.
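
For reference, a minimal sketch of that selection step: estimate each clip's median fundamental frequency with librosa, bucket it into 50 Hz bands between 100 and 500 Hz, and keep only clips between 3 and 10 seconds. The paths, pitch-estimation settings, and band-balancing strategy below are illustrative assumptions, not the exact script I used.

```python
import glob
import numpy as np
import librosa

# 50 Hz bands covering 100-500 Hz: (100, 150), (150, 200), ..., (450, 500)
BANDS = [(lo, lo + 50) for lo in range(100, 500, 50)]

def band_of(path):
    y, sr = librosa.load(path, sr=None)
    duration = librosa.get_duration(y=y, sr=sr)
    if not 3.0 <= duration <= 10.0:
        return None  # outside the usable 3-10 s range for GPT-SoVITS
    # Median voiced f0 as a rough per-clip pitch estimate
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=80, fmax=600, sr=sr)
    pitch = np.nanmedian(f0)
    for lo, hi in BANDS:
        if lo <= pitch < hi:
            return (lo, hi)
    return None  # pitch falls outside 100-500 Hz

buckets = {band: [] for band in BANDS}
for wav in glob.glob("moespeech_jp/*.wav"):  # hypothetical dataset path
    band = band_of(wav)
    if band is not None:
        buckets[band].append(wav)

# Equal representation: truncate every band to the size of the smallest one.
n = min(len(clips) for clips in buckets.values())
selected = [path for clips in buckets.values() for path in clips[:n]]
```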
The model is proficient in Japanese only and tends to produce a slightly higher pitch than the reference audio; this can be corrected by using a low temperature (0.3). Compared to the GPT-SoVITS base model, the inflections are much more natural, including laughing, sighing, and other nuances.
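
As an example of applying that low temperature, here is a minimal inference sketch against a locally running GPT-SoVITS API server. The endpoint and parameter names assume the upstream api_v2 conventions and may differ in your copy, so treat them as assumptions and check your local installation.

```python
import requests

# Hypothetical local GPT-SoVITS API server; port and route follow api_v2 defaults.
resp = requests.get(
    "http://127.0.0.1:9880/tts",
    params={
        "text": "こんにちは、元気ですか？",
        "text_lang": "ja",
        "ref_audio_path": "reference.wav",      # reference clip for voice and prosody
        "prompt_text": "参考音声の書き起こし",    # transcript of the reference clip
        "prompt_lang": "ja",
        "temperature": 0.3,                      # low temperature counteracts the pitch drift
    },
)
with open("output.wav", "wb") as f:
    f.write(resp.content)
```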
The license is CC BY-NC-ND 4.0.