# Fine-tuning
If you are reading this page, the few-shot performance of the pre-trained model is probably not good enough for your use case, and you want to fine-tune the model to improve its performance on your own dataset.
In the current version, you only need to fine-tune the `LLAMA` part.
## Fine-tuning LLAMA
### 1. Prepare the dataset
```
.
├── SPK1
│   ├── 21.15-26.44.lab
│   ├── 21.15-26.44.mp3
│   ├── 27.51-29.98.lab
│   ├── 27.51-29.98.mp3
│   ├── 30.1-32.71.lab
│   └── 30.1-32.71.mp3
└── SPK2
    ├── 38.79-40.85.lab
    └── 38.79-40.85.mp3
```
You need to convert your dataset into the above format and place it under `data`. Audio files may use the `.mp3`, `.wav`, or `.flac` extension, and each annotation file must use the `.lab` extension.
!!! info "Dataset Format"
    The `.lab` annotation file only needs to contain the transcription of the audio, with no special formatting required. For example, if `hi.mp3` says "Hello, goodbye," then the `hi.lab` file would contain a single line of text: "Hello, goodbye."
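
Before moving on, it can help to confirm that every audio file has a transcript next to it. The following is a minimal sketch, not part of the project's tooling: the function name `find_unpaired_audio` is hypothetical, and it assumes each transcript shares its audio file's name with the extension swapped for `.lab`, as in the layout above.

```python
from pathlib import Path

# Audio extensions accepted for the dataset, as listed above.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac"}


def find_unpaired_audio(data_dir):
    """Return audio files under data_dir that lack a non-empty .lab transcript.

    Hypothetical helper: assumes the transcript sits beside the audio file
    and differs only in its extension.
    """
    unpaired = []
    for audio in sorted(Path(data_dir).rglob("*")):
        if not audio.is_file() or audio.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        lab = audio.with_suffix(".lab")
        # Flag audio whose transcript is missing or contains only whitespace.
        if not lab.is_file() or not lab.read_text(encoding="utf-8").strip():
            unpaired.append(audio)
    return unpaired
```

Calling `find_unpaired_audio("data")` returns a list of paths to fix; an empty list suggests every clip has a transcript.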
!!! warning
    It's recommended to apply loudness normalization to the dataset. You can use [fish-audio-preprocess](https://github.com/fishaudio/audio-preprocess) to do this.

    ```bash
    fap loudness-norm data-raw data --clean
    ```
### 2. Batch extraction of semantic tokens
Make sure you have downloaded the VQGAN weights. If not, run the following command:
```bash
huggingface-cli download fishaudio/fish-speech-1.4 --local-dir checkpoints/fish-speech-1.4
```
You can then run the following command to extract semantic tokens:
```bash
python tools/vqgan/extract_vq.py data \
    --num-workers 1 --batch-size 16 \
    --config-name "firefly_gan_vq" \
    --checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"
```
!!! note
    You can adjust `--num-workers` and `--batch-size` to increase extraction speed, but make sure not to exceed your GPU memory limit.
    For the VITS format, you can specify a file list using `--filelist xxx.list`.
This command will create `.npy` files in the `data` directory, as shown below:
```
.
├── SPK1
│   ├── 21.15-26.44.lab
│   ├── 21.15-26.44.mp3
│   ├── 21.15-26.44.npy
│   ├── 27.51-29.98.lab
│   ├── 27.51-29.98.mp3
│   ├── 27.51-29.98.npy
│   ├── 30.1-32.71.lab
│   ├── 30.1-32.71.mp3
│   └── 30.1-32.71.npy
└── SPK2
    ├── 38.79-40.85.lab
    ├── 38.79-40.85.mp3
    └── 38.79-40.85.npy
```
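
To check that extraction covered the whole dataset, you could count how many audio files per speaker folder now have a `.npy` file beside them. This is a rough sketch under the assumption that the token file shares the audio file's name, as the tree above shows; `token_coverage` is a hypothetical helper, not a project tool.

```python
from pathlib import Path

# Audio extensions accepted for the dataset.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac"}


def token_coverage(data_dir):
    """Map each speaker folder name to (audio_count, audio_with_npy_count).

    Hypothetical helper: assumes each extracted token file is named like
    its audio file with the extension replaced by .npy.
    """
    coverage = {}
    for audio in sorted(Path(data_dir).rglob("*")):
        if not audio.is_file() or audio.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        speaker = audio.parent.name
        total, done = coverage.get(speaker, (0, 0))
        has_tokens = audio.with_suffix(".npy").is_file()
        coverage[speaker] = (total + 1, done + int(has_tokens))
    return coverage
```

If the two numbers differ for any speaker, some clips were probably skipped and the extraction step may need to be rerun.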
### 3. Pack the dataset into protobuf
```bash
python tools/llama/build_dataset.py \
    --input "data" \
    --output "data/protos" \
    --text-extension .lab \
    --num-workers 16
```
After the command finishes executing, you should see the `quantized-dataset-ft.protos` file in the `data` directory.
### 4. Fine-tune with LoRA
Similarly, make sure you have downloaded the `LLAMA` weights. If not, run the following command:
```bash
huggingface-cli download fishaudio/fish-speech-1.4 --local-dir checkpoints/fish-speech-1.4
```
Finally, you can start the fine-tuning by running the following command:
```bash
python fish_speech/train.py --config-name text2semantic_finetune \
    project=$project \
    +lora@model.model.lora_config=r_8_alpha_16
```
!!! note
    You can adjust training parameters such as `batch_size` and `gradient_accumulation_steps` to fit your GPU memory by editing `fish_speech/configs/text2semantic_finetune.yaml`.
!!! note
    On Windows, you can set `trainer.strategy.process_group_backend=gloo` to avoid `nccl` issues.
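
If you launch training from a script on several machines, the Windows override can be added conditionally. The sketch below only assembles the argument list. `build_train_command` and the project name `my_project` are hypothetical, and the LoRA override is written as `+lora@model.model.lora_config=r_8_alpha_16`, which is an assumption about the intended form of that argument.

```python
import platform
import shlex


def build_train_command(project, system=None):
    """Return the fine-tuning command as an argument list.

    Hypothetical helper: appends the gloo backend override only when the
    platform is Windows. Pass `system` explicitly to override detection.
    """
    system = system or platform.system()
    command = [
        "python", "fish_speech/train.py",
        "--config-name", "text2semantic_finetune",
        f"project={project}",
        # Assumed form of the LoRA config override.
        "+lora@model.model.lora_config=r_8_alpha_16",
    ]
    if system == "Windows":
        command.append("trainer.strategy.process_group_backend=gloo")
    return command


if __name__ == "__main__":
    # Print the command for inspection instead of running it.
    print(shlex.join(build_train_command("my_project")))
```

Returning a list rather than a single string keeps each override as its own argument, so the result can be passed to `subprocess.run` without shell quoting concerns.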
After training is complete, you can refer to the [inference](inference.md) section to generate speech.
!!! info
    By default, the model will only learn the speaker's speech patterns and not the timbre. You still need to use prompts to ensure timbre stability.
    If you want to learn the timbre, you can increase the number of training steps, but this may lead to overfitting.
After training, you need to convert the LoRA weights to regular weights before performing inference.
```bash
python tools/llama/merge_lora.py \
    --lora-config r_8_alpha_16 \
    --base-weight checkpoints/fish-speech-1.4 \
    --lora-weight results/$project/checkpoints/step_000000010.ckpt \
    --output checkpoints/fish-speech-1.4-yth-lora/
```
!!! note
    You may also try other checkpoints. We suggest using the earliest checkpoint that meets your requirements, since earlier checkpoints often perform better on out-of-distribution (OOD) data.
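
To try checkpoints from earliest to latest, you could list them in step order first. This is a small sketch that assumes checkpoint files follow the `step_<number>.ckpt` naming seen in the merge command above; `checkpoints_by_step` is a hypothetical helper.

```python
import re
from pathlib import Path

# Matches names such as step_000000010.ckpt (assumed naming scheme).
STEP_PATTERN = re.compile(r"step_(\d+)\.ckpt$")


def checkpoints_by_step(checkpoint_dir):
    """Return (step, path) pairs sorted from earliest to latest step."""
    found = []
    for path in Path(checkpoint_dir).glob("step_*.ckpt"):
        match = STEP_PATTERN.search(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return sorted(found, key=lambda item: item[0])
```

Each returned path can then be passed as `--lora-weight` to `tools/llama/merge_lora.py`, starting with the first entry and moving later only if the merged model does not meet your requirements.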