hyperparameters

#1
by Viewegger - opened

Hello,
May I ask what learning rate, batch size, and gradient accumulation steps you used to train this model?

Default details should be here

What was used for this model was:

Learning Rate (lr): 5e-06
Batch Size: 4
Gradient Accumulation Steps: 1
Epochs: 10
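
For anyone comparing settings, here is a minimal sketch of how these values combine into an effective batch size and a rough count of optimizer updates. The dataset size (`num_samples = 400`) is a made-up number for illustration only, and the step count is an approximation that assumes a partial final batch still triggers an update; the actual trainer may count steps differently.

```python
import math

# Values reported in this thread.
learning_rate = 5e-06
batch_size = 4
grad_accum_steps = 1
epochs = 10


def effective_batch_size(batch_size: int, grad_accum_steps: int) -> int:
    """Number of samples that contribute to each optimizer update."""
    return batch_size * grad_accum_steps


def optimizer_steps(num_samples: int, batch_size: int,
                    grad_accum_steps: int, epochs: int) -> int:
    """Approximate total optimizer updates over the whole run."""
    per_epoch = math.ceil(
        num_samples / effective_batch_size(batch_size, grad_accum_steps)
    )
    return per_epoch * epochs


# Hypothetical dataset size (number of audio clips), not from the thread.
num_samples = 400

print(effective_batch_size(batch_size, grad_accum_steps))  # 4
print(optimizer_steps(num_samples, batch_size, grad_accum_steps, epochs))  # 1000
```

With gradient accumulation set to 1, every batch of 4 clips produces its own update, so a small dataset still gets many updates per epoch.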

Thank you! I need to try such a low number of gradient accumulation steps. I tried lowish values like 4-15, but they didn't work well for medium-sized datasets...

Viewegger changed discussion status to closed

lol

Also

For best results I’ve found the largest I can make the dataset before overfitting takes place is around 40 minutes

Once it starts overfitting, you get a huge increase in hallucinations, which makes the model much less useful.
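
To check how much audio a dataset actually contains against the roughly 40 minute figure above, something like the following sketch could work. It assumes the clips are uncompressed PCM `.wav` files in a single folder; the folder name `dataset/wavs` and the helper names are hypothetical, and the 40 minute threshold is just the number from this post, not a general rule.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path: Path) -> float:
    """Duration of one PCM WAV file, from its frame count and sample rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_minutes(folder: Path) -> float:
    """Sum the durations of all .wav files directly inside `folder`."""
    seconds = sum(wav_duration_seconds(p) for p in sorted(folder.glob("*.wav")))
    return seconds / 60.0


if __name__ == "__main__":
    dataset_dir = Path("dataset/wavs")  # hypothetical location
    if dataset_dir.is_dir():
        minutes = total_minutes(dataset_dir)
        print(f"Total audio: {minutes:.1f} minutes")
        if minutes > 40:  # threshold taken from the post above
            print("More than ~40 minutes; overfitting was reported past this point.")
```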

Also cleaning up the audio being used to create the dataset helps a lot.

I personally use DeepFilterNet lol

Huggingface Space I made for using it without my computer lol.

https://huggingface.co/spaces/drewThomasson/DeepFilterNet2_no_limit

And to be lazy I use a docker image I made to train
https://hub.docker.com/r/athomasson2/fine_tune_xtts

The output vocab.json file gets the wrong name tho, so you'll have to rename it if you use that lol

From vocab.json_

To

vocab.json
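
The rename can be scripted. This is a minimal sketch that assumes the misnamed file sits directly in the training output folder; the function name is made up for illustration, and you would pass whatever folder the Docker image wrote its output to.

```python
from pathlib import Path


def fix_vocab_name(output_dir: Path) -> Path:
    """Rename vocab.json_ to vocab.json inside output_dir, if needed."""
    src = output_dir / "vocab.json_"
    dst = output_dir / "vocab.json"
    # Only rename when the misnamed file exists and would not overwrite anything.
    if src.exists() and not dst.exists():
        src.rename(dst)
    return dst
```

Calling it twice is harmless, since the second call finds nothing left to rename.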

lol

Happy training lol

"For best results I’ve found the largest I can make the dataset before overfitting takes place is around 40 minutes " Just to be sure: do you mean that you need at least 40 minutes of audio, or 40 minutes at most? Or only 40 minutes of training, and therefore only 10 epochs? I think 5e-06 is quite a high learning rate for a small dataset, even though Coqui uses it as the default for fine-tuning...

In general this model is strange. I have had quite good luck fine-tuning on a medium-sized dataset consisting of several hundred hours, but I've had no luck with small ones, regardless of hyperparameters... Also, it works much better for a single voice; when using hundreds of voices, it captures their basic characteristics but cannot capture the individual ones really well.

hm,

If you want you could give me the audio for the voice you're trying to train on?

like the raw audio files lol not pre-chunked

and see who gets better results?

I am not training in English but in German and Czech ;) The model is much worse with these and it's much harder to make it work...

ah, interesting...

Are you using this to train it?

https://github.com/daswer123/xtts-finetune-webui

Cause that one should be multilingual,

It's what I use at this point

I am using the maintained fork of the Coqui repo, https://github.com/idiap/coqui-ai-TTS. I have seen https://github.com/daswer123/xtts-finetune-webui but I haven't tried it yet. I've read some docs from the repo, but as far as I know they have only implemented some additional learning rate controls directly in Gradio...

Also, I prefer code over a Gradio UI; it's easier, and it's more obvious what is going on...

What is the benefit of https://github.com/daswer123/xtts-finetune-webui for you over the default repo? Also, what do you mean by "For best results I’ve found the largest I can make the dataset before overfitting takes place is around 40 minutes "?

lol it's just easier for me, that's why I use the Gradio GUI,

and what I mean by

"For best results I’ve found the largest I can make the dataset before overfitting takes place is around 40 minutes "

Is that, for some reason, when I train a model on like 5 hours or whatnot, it starts to overfit: it gets really good at that voice, but that comes at the loss of the model's other abilities, leading to it hallucinating much more.

So when it's overfitted, it seems to throw in a lot more weird hallucination noises.
