metadata

license: cc-by-nc-sa-4.0
datasets:
  - LJSpeech
language:
  - en
base_model:
  - SWivid/F5-TTS

Overview

The F5-TTS model is fine-tuned on the LJSpeech dataset with an emphasis on stability, ensuring it avoids choppiness, mispronunciations, repetitions, and skipping words.

Differences from the original model: The text input is converted to phonenes, we don't use the raw text. The phoneme alignment is used during training, whereas a duration predictor is used during inference.

Source code for phoneme alignment: https://github.com/sinhprous/F5-TTS/blob/main/src/f5_tts/train/datasets/utils_alignment.py

Source code for duration predictor: https://github.com/sinhprous/F5-TTS/blob/main/src/f5_tts/model/duration_predictor.py

Audio samples

License

This model is released under the Creative Commons Attribution Non Commercial Share Alike 4.0 license, which allows for free usage, modification, and distribution

Model Information

Base Model: SWivid/F5-TTS
Total Training Duration: 130.000 steps

Training Configuration:

"exp_name": "F5TTS_Base",
"learning_rate": 1e-05,
"batch_size_per_gpu": 2000,
"batch_size_type": "frame",
"max_samples": 64,
"grad_accumulation_steps": 1,
"max_grad_norm": 1,
"epochs": 144,
"num_warmup_updates": 5838,
"save_per_updates": 11676,
"last_per_steps": 2918,
"finetune": true,
"file_checkpoint_train": "",
"tokenizer_type": "char",
"tokenizer_file": "",
"mixed_precision": "fp16",
"logger": "wandb",
"bnb_optimizer": true

Usage Instructions

Go to base repo

To do

Multi-speaker model

sinhprous
/

F5TTS-stabilized-LJSpeech