license: cc-by-nc-4.0
datasets:
  - amphion/Emilia-Dataset
  - mozilla-foundation/common_voice_12_0
language:
  - el
  - en
base_model:
  - SWivid/F5-TTS
pipeline_tag: text-to-speech

F5-TTS-Greek

F5-TTS model finetuned to speak Greek

(This work is under development and is currently a beta release.)

Finetuned on Greek speech datasets plus a small part of the Emilia-EN dataset, which is included to prevent catastrophic forgetting of English.

The model can synthesize Greek text with a Greek reference voice, English text with an English reference voice, and mixed Greek and English text. Quality on mixed-language input still needs improvement, and several runs may be needed to get a good result.

Datasets used:

Common Voice 12.0 (All Greek Splits) (https://huggingface.co/datasets/mozilla-foundation/common_voice_12_0)
Greek Single Speaker Speech Dataset (https://www.kaggle.com/datasets/bryanpark/greek-single-speaker-speech-dataset)
Small part of Emilia Dataset (https://huggingface.co/datasets/amphion/Emilia-Dataset) (EN-B000049.tar)

Training

Training was done on a single RTX 3090.

After some manual evaluation, two checkpoints were found to produce better results than the others:

225K steps (TBA)
325K steps (TBA)

Arguments

Learning Rate: 0.00001
Batch Size per GPU: 3200
Max Samples: 64
Gradient Accumulation Steps: 1
Max Gradient Norm: 1
Epochs: 277
Warmup Updates: 1274
Save per Updates: 25000
Last per Steps: 1000
Mixed Precision: fp16
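
The arguments above can be collected in one place and sanity-checked. The following is a minimal sketch, not the actual training script: the `FinetuneArgs` class, its field names, and both helper functions are hypothetical, and the warmup shape (linear from zero, then held at the peak) is an assumption made for illustration. The real trainer may decay the learning rate after warmup. The sketch shows that both selected checkpoints (225K and 325K steps) fall on the 25000-update save interval, and how the learning rate would ramp over the 1274 warmup updates under that assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneArgs:
    """Values copied from the argument list above; field names are illustrative."""
    learning_rate: float = 1e-5
    batch_size_per_gpu: int = 3200
    max_samples: int = 64
    grad_accumulation_steps: int = 1
    max_grad_norm: float = 1.0
    epochs: int = 277
    num_warmup_updates: int = 1274
    save_per_updates: int = 25000
    last_per_steps: int = 1000
    mixed_precision: str = "fp16"


def warmup_lr(step: int, args: FinetuneArgs) -> float:
    """Assumed schedule: linear ramp from 0 to the peak rate, then constant.

    Any decay after warmup is deliberately left out of this sketch.
    """
    if step >= args.num_warmup_updates:
        return args.learning_rate
    return args.learning_rate * step / args.num_warmup_updates


def is_saved_checkpoint(step: int, args: FinetuneArgs) -> bool:
    """True if a checkpoint would be written at this step under the save interval."""
    return step > 0 and step % args.save_per_updates == 0


args = FinetuneArgs()

# The two checkpoints picked after manual evaluation.
selected_steps = [225_000, 325_000]
aligned = [is_saved_checkpoint(s, args) for s in selected_steps]

# Learning rate halfway through warmup (637 of 1274 updates).
half_warmup_lr = warmup_lr(637, args)
```

Here `aligned` is `[True, True]`, since 225000 and 325000 are the 9th and 13th multiples of 25000, and `half_warmup_lr` is half the peak learning rate.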
