A small experiment I did with a subset of the moespeech JP dataset. The models (GPT and SoVITS) are made to run with GPT-SoVITS.

I used 6 hours of audio for training. The selected audio samples were categorized into frequency bands (100–500 Hz) in 50 Hz intervals, and each band received equal representation in the final dataset so the model learns from a diverse range of voice frequencies. Samples outside the 3–10 second range were discarded due to GPT-SoVITS limitations. A sketch of this preprocessing is shown below.
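
Below is a minimal, hypothetical sketch of that preprocessing step: it filters clips to 3–10 seconds and buckets them by mean fundamental frequency (F0) into 50 Hz bands between 100 and 500 Hz, then balances the bands. The file paths, pitch-estimation choices, and downsampling strategy are illustrative assumptions, not the exact pipeline used for this model.

```python
# Hypothetical preprocessing sketch: keep 3-10 s clips and bucket them by
# mean F0 into 50 Hz bands (100-500 Hz), then balance the bands.
import glob
import random
import librosa
import numpy as np

MIN_SEC, MAX_SEC = 3.0, 10.0
BANDS = [(lo, lo + 50) for lo in range(100, 500, 50)]  # 100-150, ..., 450-500 Hz

buckets = {band: [] for band in BANDS}

for path in glob.glob("clips/*.wav"):
    y, sr = librosa.load(path, sr=None)
    duration = len(y) / sr
    if not (MIN_SEC <= duration <= MAX_SEC):
        continue  # GPT-SoVITS expects roughly 3-10 s clips
    # Estimate F0 with pYIN; unvoiced frames come back as NaN.
    f0, _, _ = librosa.pyin(y, fmin=80, fmax=600, sr=sr)
    mean_f0 = np.nanmean(f0)
    if np.isnan(mean_f0):
        continue
    for lo, hi in BANDS:
        if lo <= mean_f0 < hi:
            buckets[(lo, hi)].append(path)
            break

# Equal representation: downsample every band to the size of the smallest non-empty one.
per_band = min(len(files) for files in buckets.values() if files)
balanced = [p for files in buckets.values()
            for p in random.sample(files, min(per_band, len(files)))]
print(f"{len(balanced)} clips kept across {len(BANDS)} bands")
```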

The model handles Japanese only and tends to produce a slightly higher pitch than the reference audio (this can be corrected by using a low temperature of 0.3). Compared to the base model from GPT-SoVITS, the inflections are much more natural, including laughing, sighing, and other nuances.
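
For reference, here is a minimal inference sketch assuming a local GPT-SoVITS api_v2.py server on port 9880; the endpoint and parameter names may differ between GPT-SoVITS versions, so check the documentation for the release you use. The reference clip, transcript, and output path are placeholder assumptions.

```python
# Minimal inference sketch (assumes a local GPT-SoVITS api_v2.py server on port 9880;
# parameter names may vary between versions).
import requests

payload = {
    "text": "こんにちは、今日はいい天気ですね。",       # text to synthesize (Japanese only)
    "text_lang": "ja",
    "ref_audio_path": "reference.wav",                  # 3-10 s reference clip
    "prompt_text": "リファレンス音声の書き起こし",      # transcript of the reference clip
    "prompt_lang": "ja",
    "temperature": 0.3,                                 # low temperature keeps pitch close to the reference
}

resp = requests.post("http://127.0.0.1:9880/tts", json=payload, timeout=300)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)
```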

The license is cc-by-nc-nd-4.0.

