---
license: cc-by-nc-sa-4.0
datasets:
- LJSpeech
language:
- en
base_model:
- SWivid/F5-TTS
pipeline_tag: text-to-speech
---
## Overview
This is an F5-TTS model fine-tuned on the LJSpeech dataset with an emphasis on stability: the goal is to avoid choppiness, mispronunciations, repetitions, and skipped words.

Differences from the original model: the text input is converted to phonemes rather than used as raw text. Phoneme alignment is used during training, whereas a duration predictor is used during inference.

Source code for phoneme alignment: https://github.com/sinhprous/F5-TTS/blob/main/src/f5_tts/train/datasets/utils_alignment.py

Source code for duration predictor: https://github.com/sinhprous/F5-TTS/blob/main/src/f5_tts/model/duration_predictor.py
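The linked files are the authoritative implementation. As a rough illustration only, the sketch below shows the general idea that per-phoneme durations, whether taken from an alignment or from a predictor, determine how many output frames are generated. The function name, the phoneme symbols, and the frame counts are all hypothetical and are not taken from the repository.

```python
from typing import List, Sequence


def frames_from_durations(phonemes: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Expand a phoneme sequence to frame level by repeating each phoneme.

    Hypothetical helper for illustration; not part of the F5-TTS code base.
    """
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[str] = []
    for phoneme, n_frames in zip(phonemes, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        frames.extend([phoneme] * n_frames)
    return frames


# Made-up phonemes and frame counts, for illustration only.
phonemes = ["D", "EY", "T", "AH"]
durations = [3, 7, 2, 5]

frame_labels = frames_from_durations(phonemes, durations)
total_frames = len(frame_labels)  # equals sum(durations)
```

In this toy setting the durations would come from the alignment during training and from the predictor during inference; how the actual model consumes them may differ.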

Colab demo: [colab](https://colab.research.google.com/drive/1baUdhv7kIdGIU39VQbeCI_bMAYbyjcF0)

## Audio samples
Outputs from the original model were generated with the [E2-F5-TTS Space](https://huggingface.co/spaces/mrfakename/E2-F5-TTS).
The original model often skips words in difficult texts such as the ones below.

*Data - driven AI systems said, "Key data is the key, data is key, data is key, data is the key, and the key to the data is key, the data key is the key to the data that is key to the key". Can you keep up?*

Original model:
<audio controls>
  <source src="https://huggingface.co/sinhprous/F5TTS-stabilized-LJSpeech/resolve/main/audio_samples/sample_origin_1.wav" type="audio/wav">
  Your browser does not support the audio element.
</audio>

Finetuned model:
<audio controls>
  <source src="https://huggingface.co/sinhprous/F5TTS-stabilized-LJSpeech/resolve/main/audio_samples/sample_aligned_1.wav" type="audio/wav">
  Your browser does not support the audio element.
</audio>


*Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.*

Original model:
<audio controls>
  <source src="https://huggingface.co/sinhprous/F5TTS-stabilized-LJSpeech/resolve/main/audio_samples/sample_origin_2.wav" type="audio/wav">
  Your browser does not support the audio element.
</audio>

Finetuned model:
<audio controls>
  <source src="https://huggingface.co/sinhprous/F5TTS-stabilized-LJSpeech/resolve/main/audio_samples/sample_aligned_2.wav" type="audio/wav">
  Your browser does not support the audio element.
</audio>


*Call one two three - one two three - one two three four who call one two three - one two three - one two three four who call one two three - one two three - one two three four who call one two three - one two three - one two three four.*

Original model:
<audio controls>
  <source src="https://huggingface.co/sinhprous/F5TTS-stabilized-LJSpeech/resolve/main/audio_samples/sample_origin_3.wav" type="audio/wav">
  Your browser does not support the audio element.
</audio>

Finetuned model:
<audio controls>
  <source src="https://huggingface.co/sinhprous/F5TTS-stabilized-LJSpeech/resolve/main/audio_samples/sample_aligned_3.wav" type="audio/wav">
  Your browser does not support the audio element.
</audio>

## License
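This model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) license, which permits non-commercial use, modification, and distribution, provided that attribution is given and derivatives are shared under the same license.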

## Model Information
**Base Model:** SWivid/F5-TTS  
**Total Training Duration:** 130,000 steps

**Training Configuration:**
```json
{
  "exp_name": "F5TTS_Base",
  "learning_rate": 1e-05,
  "batch_size_per_gpu": 2000,
  "batch_size_type": "frame",
  "max_samples": 64,
  "grad_accumulation_steps": 1,
  "max_grad_norm": 1,
  "epochs": 144,
  "num_warmup_updates": 5838,
  "save_per_updates": 11676,
  "last_per_steps": 2918,
  "finetune": true,
  "file_checkpoint_train": "",
  "tokenizer_type": "char",
  "tokenizer_file": "",
  "mixed_precision": "fp16",
  "logger": "wandb",
  "bnb_optimizer": true
}
```
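With `batch_size_type` set to `"frame"`, the batch size appears to be counted in mel frames rather than in utterances. The sketch below converts that frame budget into seconds of audio. The sample rate (24 kHz) and hop length (256 samples) are assumptions based on the base model's usual settings and are not stated in this model card; the helper name is hypothetical.

```python
import json

# Subset of the training configuration, written as a JSON object.
config_text = """
{
  "batch_size_per_gpu": 2000,
  "batch_size_type": "frame",
  "max_samples": 64
}
"""
config = json.loads(config_text)

# Assumptions (not stated in this model card): 24 kHz audio, mel hop of 256 samples.
ASSUMED_SAMPLE_RATE_HZ = 24_000
ASSUMED_HOP_LENGTH = 256


def frames_to_seconds(
    n_frames: int,
    sample_rate_hz: int = ASSUMED_SAMPLE_RATE_HZ,
    hop_length: int = ASSUMED_HOP_LENGTH,
) -> float:
    """Convert a number of mel frames to seconds of audio."""
    return n_frames * hop_length / sample_rate_hz


seconds_per_batch = frames_to_seconds(config["batch_size_per_gpu"])
print(f"{seconds_per_batch:.1f} s of audio per GPU batch (under the assumptions above)")
```

Under these assumptions, each GPU batch holds roughly 21 seconds of audio. A different sample rate or hop length would change that figure proportionally.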

## Usage Instructions
Follow the usage instructions in the [base repo](https://github.com/SWivid/F5-TTS).

## To do
- Multi-speaker model

## Other links
- [Github repo](https://github.com/sinhprous/F5-TTS)