license: cc-by-nc-sa-4.0
datasets:
- LJSpeech
language:
- en
base_model:
- SWivid/F5-TTS
Overview
The F5-TTS model is fine-tuned on the LJSpeech dataset with an emphasis on stability, ensuring it avoids choppiness, mispronunciations, repetitions, and skipping words.
Differences from the original model: The text input is converted to phonenes, we don't use the raw text. The phoneme alignment is used during training, whereas a duration predictor is used during inference.
Source code for phoneme alignment: https://github.com/sinhprous/F5-TTS/blob/main/src/f5_tts/train/datasets/utils_alignment.py
Source code for duration predictor: https://github.com/sinhprous/F5-TTS/blob/main/src/f5_tts/model/duration_predictor.py
Audio samples
License
This model is released under the Creative Commons Attribution Non Commercial Share Alike 4.0 license, which allows for free usage, modification, and distribution
Model Information
Base Model: SWivid/F5-TTS
Total Training Duration: 130.000 steps
Training Configuration:
"exp_name": "F5TTS_Base",
"learning_rate": 1e-05,
"batch_size_per_gpu": 2000,
"batch_size_type": "frame",
"max_samples": 64,
"grad_accumulation_steps": 1,
"max_grad_norm": 1,
"epochs": 144,
"num_warmup_updates": 5838,
"save_per_updates": 11676,
"last_per_steps": 2918,
"finetune": true,
"file_checkpoint_train": "",
"tokenizer_type": "char",
"tokenizer_file": "",
"mixed_precision": "fp16",
"logger": "wandb",
"bnb_optimizer": true
Usage Instructions
Go to base repo
To do
- Multi-speaker model