---
license: cc-by-nc-sa-4.0
datasets:
- LJSpeech
language:
- en
base_model:
- SWivid/F5-TTS
pipeline_tag: text-to-speech
---
## Overview
The F5-TTS model is fine-tuned on the LJSpeech dataset with an emphasis on stability, reducing choppiness, mispronunciations, repetitions, and skipped words.
Differences from the original model: the input text is converted to phonemes rather than used as raw text, phoneme alignments are used during training, and a duration predictor is used during inference (see the illustrative sketch below).
Source code for phoneme alignment: https://github.com/sinhprous/F5-TTS/blob/main/src/f5_tts/train/datasets/utils_alignment.py
Source code for duration predictor: https://github.com/sinhprous/F5-TTS/blob/main/src/f5_tts/model/duration_predictor.py
Colab demo: [colab](https://colab.research.google.com/drive/1baUdhv7kIdGIU39VQbeCI_bMAYbyjcF0)
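For illustration only, the snippet below shows what the text-to-phoneme front end looks like conceptually, using the `g2p_en` package as a stand-in; the actual conversion and alignment code for this model is in the `utils_alignment.py` file linked above and may differ.
```python
# Illustrative stand-in only: this model's real G2P/alignment code lives in the fork's
# utils_alignment.py; g2p_en is used here just to show the text -> phoneme step.
from g2p_en import G2p

g2p = G2p()

text = "Data driven AI systems said, key data is the key."
phonemes = g2p(text)  # ARPAbet tokens mixed with spaces/punctuation, e.g. ['D', 'EY1', 'T', 'AH0', ' ', ...]

# Join into a whitespace-separated phoneme string that a character-level tokenizer could consume.
phoneme_text = " ".join(p for p in phonemes if p.strip())
print(phoneme_text)
```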
## Audio samples
Outputs from the original model were generated using https://huggingface.co/spaces/mrfakename/E2-F5-TTS
The original model usually skips words in these hard texts.
*Data - driven AI systems said, "Key data is the key, data is key, data is key, data is the key, and the key to the data is key, the data key is the key to the data that is key to the key". Can you keep up?*
Original model:
<audio controls>
<source src="https://huggingface.co/sinhprous/F5TTS-stabilized-LJSpeech/resolve/main/audio_samples/sample_origin_1.wav" type="audio/mp3">
Your browser does not support the audio element.
</audio>
Finetuned model:
<audio controls>
<source src="https://huggingface.co/sinhprous/F5TTS-stabilized-LJSpeech/resolve/main/audio_samples/sample_aligned_1.wav" type="audio/mp3">
Your browser does not support the audio element.
</audio>
*Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.*
Original model:
<audio controls>
<source src="https://huggingface.co/sinhprous/F5TTS-stabilized-LJSpeech/resolve/main/audio_samples/sample_origin_2.wav" type="audio/mp3">
Your browser does not support the audio element.
</audio>
Finetuned model:
<audio controls>
<source src="https://huggingface.co/sinhprous/F5TTS-stabilized-LJSpeech/resolve/main/audio_samples/sample_aligned_2.wav" type="audio/mp3">
Your browser does not support the audio element.
</audio>
*Call one two three - one two three - one two three four who call one two three - one two three - one two three four who call one two three - one two three - one two three four who call one two three - one two three - one two three four.*
Original model:
<audio controls>
<source src="https://huggingface.co/sinhprous/F5TTS-stabilized-LJSpeech/resolve/main/audio_samples/sample_origin_3.wav" type="audio/mp3">
Your browser does not support the audio element.
</audio>
Finetuned model:
<audio controls>
<source src="https://huggingface.co/sinhprous/F5TTS-stabilized-LJSpeech/resolve/main/audio_samples/sample_aligned_3.wav" type="audio/mp3">
Your browser does not support the audio element.
</audio>
## License
This model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) license, which allows free use, modification, and redistribution for non-commercial purposes, provided attribution is given and derivatives are shared under the same license.
## Model Information
**Base Model:** SWivid/F5-TTS
**Total Training Steps:** 130,000
**Training Configuration:**
```json
{
  "exp_name": "F5TTS_Base",
  "learning_rate": 1e-05,
  "batch_size_per_gpu": 2000,
  "batch_size_type": "frame",
  "max_samples": 64,
  "grad_accumulation_steps": 1,
  "max_grad_norm": 1,
  "epochs": 144,
  "num_warmup_updates": 5838,
  "save_per_updates": 11676,
  "last_per_steps": 2918,
  "finetune": true,
  "file_checkpoint_train": "",
  "tokenizer_type": "char",
  "tokenizer_file": "",
  "mixed_precision": "fp16",
  "logger": "wandb",
  "bnb_optimizer": true
}
```
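Purely as a hypothetical sketch, the snippet below shows one way such a JSON configuration could be turned into a fine-tuning launch command; the script path, flag names, and config file name are assumptions that should be checked against the [fork](https://github.com/sinhprous/F5-TTS) before running anything.
```python
# Hypothetical launcher: assumes the fork's finetune_cli.py accepts flags mirroring the
# JSON keys above, which is NOT verified here; check the repo before running.
import json
import shlex

with open("training_config.json") as f:  # hypothetical file holding the JSON shown above
    cfg = json.load(f)

args = []
for key, value in cfg.items():
    if value == "":            # skip empty optional paths
        continue
    if value is True:          # assume booleans are passed as bare flags
        args.append(f"--{key}")
    else:
        args.extend([f"--{key}", str(value)])

cmd = ["accelerate", "launch", "src/f5_tts/train/finetune_cli.py", *args]
print(shlex.join(cmd))         # inspect the command; run it manually once verified
```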
## Usage Instructions
Follow the installation and inference instructions in the [base repo](https://github.com/SWivid/F5-TTS).
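As a non-authoritative sketch, inference with this checkpoint might look roughly like the following, using the upstream `f5_tts.api.F5TTS` wrapper; the argument names follow the upstream package and can change between versions, the file names are placeholders, and the phoneme/duration-predictor changes in this fine-tune may require the fork's own inference code instead.
```python
# Hedged sketch: upstream F5-TTS Python API pointed at this fine-tuned checkpoint.
# File names below are placeholders; the fork's phoneme-based pipeline may need its own entry point.
from f5_tts.api import F5TTS

tts = F5TTS(
    ckpt_file="model_last.safetensors",  # placeholder: checkpoint file from this repo
    vocab_file="vocab.txt",              # placeholder: matching vocab/token file
)

wav, sr, _ = tts.infer(
    ref_file="reference.wav",            # short reference clip in the target voice
    ref_text="The transcript of the reference clip.",
    gen_text="Data is the key, and the key to the data is key. Can you keep up?",
    file_wave="output.wav",              # save the generated audio to disk
)
```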
## To do
- Multi-speaker model
## Other links
- [Github repo](https://github.com/sinhprous/F5-TTS)