drewThomasson/Xtts-Finetune-Bryan-Cranston

hyperparameters

by Viewegger - opened 2 days ago

Discussion

Viewegger

2 days ago

Hello,
may I ask what learning rate, batch size and gradient accumulation you used to train this model?

drewThomasson

Owner 2 days ago

Default details should be here

What was used for this model was:

Learning Rate (lr): 5e-06
Batch Size: 4
Gradient Accumulation Steps: 1
Epochs: 10
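
For context, here is a minimal sketch of how these four settings combine. It assumes the usual convention that the effective batch size is batch size times gradient accumulation steps, and that a leftover partial accumulation window still triggers an optimizer step. The class name, method names and the `num_samples` value are made up for illustration; they are not taken from the trainer's config.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneSettings:
    """Values quoted in this thread; the field names are illustrative."""
    learning_rate: float = 5e-06
    batch_size: int = 4
    grad_accum_steps: int = 1
    epochs: int = 10

    @property
    def effective_batch_size(self) -> int:
        # Samples that contribute to a single optimizer update.
        return self.batch_size * self.grad_accum_steps

    def optimizer_steps(self, num_samples: int) -> int:
        # Assumption: the last partial batch and the last partial
        # accumulation window are both kept, hence the ceilings.
        batches_per_epoch = math.ceil(num_samples / self.batch_size)
        steps_per_epoch = math.ceil(batches_per_epoch / self.grad_accum_steps)
        return steps_per_epoch * self.epochs


settings = FinetuneSettings()
print(settings.effective_batch_size)
print(settings.optimizer_steps(num_samples=400))  # 400 is a made-up dataset size
```

With gradient accumulation at 1 every batch produces an update, so raising it to 4 on the same hypothetical 400 samples would cut the number of updates to a quarter while keeping the learning rate unchanged.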

Viewegger

2 days ago

Thank you! I need to try gradient accumulation steps that low. I tried low values like 4-15, but they didn't work well for medium-sized datasets...

Viewegger changed discussion status to closed 2 days ago

drewThomasson

Owner 2 days ago

lol

Also

For best results I’ve found the largest I can make the dataset before overfitting takes place is around 40 minutes

Once it starts overfitting you get a huge increase in hallucinations, which makes the model much less useful.
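
A rough way to check a dataset against that 40 minute figure is to add up the clip lengths before training. This is only a sketch: it assumes the clips are uncompressed WAV files in a single folder, and the function names and the `MAX_MINUTES` constant are hypothetical, not part of any XTTS tooling.

```python
import wave
from pathlib import Path

MAX_MINUTES = 40.0  # rule of thumb from this thread, not a hard limit


def wav_duration_seconds(path: Path) -> float:
    """Length of one WAV file, computed from its header."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def dataset_minutes(folder: Path) -> float:
    """Total length in minutes of all *.wav files directly inside folder."""
    total_seconds = sum(wav_duration_seconds(p) for p in sorted(folder.glob("*.wav")))
    return total_seconds / 60.0


def within_budget(folder: Path, max_minutes: float = MAX_MINUTES) -> bool:
    """True if the dataset is no longer than the chosen budget."""
    return dataset_minutes(folder) <= max_minutes
```

Calling `within_budget(Path("wavs"))` on the folder of training clips (the folder name is a placeholder) gives a quick yes or no before starting a run.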

Also cleaning up the audio being used to create the dataset helps a lot.

I personally use DeepFilterNet lol

Here's the Hugging Face Space I made for using it without my computer lol.

https://huggingface.co/spaces/drewThomasson/DeepFilterNet2_no_limit

drewThomasson

Owner 2 days ago

edited 2 days ago

And to be lazy I use a Docker image I made for training:
https://hub.docker.com/r/athomasson2/fine_tune_xtts

drewThomasson

Owner 2 days ago

The output vocab.json file is misnamed tho, you’ll have to rename it if you use that lol

From vocab.json_ to vocab.json

lol
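
The rename from vocab.json_ to vocab.json can be scripted with a few lines of Python. This is a sketch only: `model_dir` is a hypothetical path, since the thread doesn't say where the Docker image writes its output, and the function name is made up.

```python
from pathlib import Path


def fix_vocab_filename(model_dir: Path) -> Path:
    """Rename vocab.json_ to vocab.json inside model_dir if needed."""
    broken = model_dir / "vocab.json_"
    fixed = model_dir / "vocab.json"
    # Only rename when the misnamed file exists and nothing would be overwritten.
    if broken.exists() and not fixed.exists():
        broken.rename(fixed)
    return fixed
```

Running `fix_vocab_filename(Path("path/to/finetuned_model"))` a second time does nothing, so it is safe to leave in a post-training script.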

drewThomasson

Owner 2 days ago

Happy training lol

Viewegger

1 day ago

"For best results I’ve found the largest I can make the dataset before overfitting takes place is around 40 minutes " Just to be sure: do you mean that you need at least 40 minutes of audio, or 40 minutes at most? Or only 40 minutes of training, and therefore only 10 epochs? I think 5e-06 is quite a high learning rate for a small dataset, even though coqui uses it as the default for finetuning...

In general this model is strange. I've had quite good luck finetuning on a medium-sized dataset consisting of several hundred hours, but I've had no luck on small ones, regardless of hyperparameters... Also it works much better for a single voice; when using hundreds of voices it captures their basic characteristics but cannot capture the individual ones really well.

drewThomasson

Owner 1 day ago

hm,

If you want you could give me the audio for the voice you're trying to train on?

like the raw audio files lol not pre-chunked

and see who gets better results?

Viewegger

1 day ago

"hm, If you want you could give me the audio for the voice you're trying to train on? like the raw audio files lol not pre-chunked and see who gets better results?"

I am not training in English but in German and Czech ;) The model is much worse with these and it's much harder to make it work...

drewThomasson

Owner 1 day ago

ah, interesting...

Are you using this to train it?

https://github.com/daswer123/xtts-finetune-webui

drewThomasson

Owner 1 day ago

Cause that one should be multilingual,

It's what I use at this point

Viewegger

1 day ago

I am using the maintained fork of the coqui repo, https://github.com/idiap/coqui-ai-TTS. I have seen https://github.com/daswer123/xtts-finetune-webui but I haven't tried it yet. I've read some docs from the repo, but as far as I know they have only implemented some additional learning rate controls directly in Gradio...

Also I prefer code over the Gradio UI; it's easier and more obvious what is going on...

What is the benefit of https://github.com/daswer123/xtts-finetune-webui for you over the default repo? Also what do you mean by "For best results I’ve found the largest I can make the dataset before overfitting takes place is around 40 minutes "?

drewThomasson

Owner about 12 hours ago

lol it's just easier for me, that's why I use the Gradio GUI,

and what I mean by

"For best results I’ve found the largest I can make the dataset before overfitting takes place is around 40 minutes "

Is that for some reason, when I train a model on like 5 hours or whatnot, it starts to overfit, as in it gets really good at that voice, but that comes at the loss of other abilities of the model, leading to it hallucinating much more.

So when it's over-fitted it seems to throw in a lot more weird hallucination noises.
