JackismyShephard committed
Commit 93b2b84 · 1 Parent(s): 02b2725

train without enhancement
train without enhancement again
- README.md +36 -46
- config.json +2 -3
- generation_config.json +1 -1
- model.safetensors +1 -1
- runs/Feb11_23-38-55_CDT-DESKTOP-LINUX/events.out.tfevents.1707691135.CDT-DESKTOP-LINUX.8611.0 +3 -0
- runs/Feb11_23-38-55_CDT-DESKTOP-LINUX/events.out.tfevents.1707729176.CDT-DESKTOP-LINUX.8611.1 +3 -0
- runs/Feb15_01-39-31_CDT-DESKTOP-LINUX/events.out.tfevents.1707957572.CDT-DESKTOP-LINUX.16926.0 +3 -0
- runs/Feb15_01-39-31_CDT-DESKTOP-LINUX/events.out.tfevents.1707996353.CDT-DESKTOP-LINUX.16926.1 +3 -0
- training_args.bin +1 -1
README.md
CHANGED
@@ -10,82 +10,72 @@ datasets:
 model-index:
 - name: speecht5_tts-finetuned-nst-da
   results: []
-metrics:
-- mse
-pipeline_tag: text-to-speech
 ---
 
+<!-- This model card has been generated automatically according to the information the Trainer had access to. You
+should probably proofread and complete it, then remove this comment. -->
+
 # speecht5_tts-finetuned-nst-da
 
 This model is a fine-tuned version of [microsoft/speecht5_tts](https://huggingface.co/microsoft/speecht5_tts) on the NST Danish ASR Database dataset.
 It achieves the following results on the evaluation set:
-- Loss: 0.
+- Loss: 0.3298
 
 ## Model description
 
-
+More information needed
 
 ## Intended uses & limitations
 
-
-
-The model does not recognize special symbols such as "æ", "ø" and "å", as it uses the default tokenizer of [microsoft/speecht5_tts](https://huggingface.co/microsoft/speecht5_tts). The model performs best for short to medium length input text and expects input text to contain no more than 600 vocabulary tokens. Additionally, for best performance the model should be given a Danish speaker embedding, ideally generated from an audio clip from the training split of [alexandrainst/nst-da](https://huggingface.co/datasets/alexandrainst/nst-da) using [speechbrain/spkrec-xvect-voxceleb](https://huggingface.co/speechbrain/spkrec-xvect-voxceleb).
-
-The output of the model is a log-mel spectrogram, which should be converted to a waveform using [microsoft/speecht5_hifigan](https://huggingface.co/microsoft/speecht5_hifigan). For better quality output the resulting waveform can be enhanced using [ResembleAI/resemble-enhance](https://huggingface.co/ResembleAI/resemble-enhance).
-
-An example script showing how to use the model for inference can be found [here](https://github.com/JackismyShephard/hugging-face-audio-course/blob/main/finetuned_nst-da-inference.ipynb).
-
+More information needed
 
 ## Training and evaluation data
 
-
+More information needed
 
 ## Training procedure
 
-The script used for training the model can be found [here](https://github.com/JackismyShephard/hugging-face-audio-course/blob/main/finetuned-nst-da-training.ipynb)
-
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
-- learning_rate:
-- train_batch_size:
-- eval_batch_size:
+- learning_rate: 5e-05
+- train_batch_size: 16
+- eval_batch_size: 16
 - seed: 42
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
 - lr_scheduler_type: linear
 - lr_scheduler_warmup_ratio: 0.1
 - num_epochs: 20
-- mixed_precision_training: Native AMP
 
 ### Training results
 
-| Training Loss | Epoch | Step
-
-| 0.
-| 0.
-| 0.
-| 0.
-| 0.
-| 0.
-| 0.
-| 0.
-| 0.
-| 0.
-| 0.
-| 0.
-| 0.
-| 0.
-| 0.
-| 0.
-| 0.
-| 0.
-| 0.
-| 0.
+| Training Loss | Epoch | Step   | Validation Loss |
+|:-------------:|:-----:|:------:|:---------------:|
+| 0.3762        | 1.0   | 9429   | 0.3670          |
+| 0.3596        | 2.0   | 18858  | 0.3577          |
+| 0.3498        | 3.0   | 28287  | 0.3535          |
+| 0.3356        | 4.0   | 37716  | 0.3414          |
+| 0.3405        | 5.0   | 47145  | 0.3378          |
+| 0.3312        | 6.0   | 56574  | 0.3397          |
+| 0.3326        | 7.0   | 66003  | 0.3377          |
+| 0.3299        | 8.0   | 75432  | 0.3384          |
+| 0.3279        | 9.0   | 84861  | 0.3363          |
+| 0.3203        | 10.0  | 94290  | 0.3335          |
+| 0.3235        | 11.0  | 103719 | 0.3367          |
+| 0.3188        | 12.0  | 113148 | 0.3365          |
+| 0.3141        | 13.0  | 122577 | 0.3324          |
+| 0.3176        | 14.0  | 132006 | 0.3345          |
+| 0.3221        | 15.0  | 141435 | 0.3331          |
+| 0.3157        | 16.0  | 150864 | 0.3317          |
+| 0.314         | 17.0  | 160293 | 0.3298          |
+| 0.3164        | 18.0  | 169722 | 0.3316          |
+| 0.3172        | 19.0  | 179151 | 0.3315          |
+| 0.3179        | 20.0  | 188580 | 0.3318          |
 
 
 ### Framework versions
 
-- Transformers 4.37.
-- Pytorch 2.1.
-- Datasets 2.
-- Tokenizers 0.15.
+- Transformers 4.37.2
+- Pytorch 2.1.1+cu121
+- Datasets 2.17.0
+- Tokenizers 0.15.2
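The usage notes removed from the card above describe the full inference pipeline: build an x-vector speaker embedding with speechbrain/spkrec-xvect-voxceleb, generate a log-mel spectrogram with the fine-tuned SpeechT5 model, and vocode it with microsoft/speecht5_hifigan. The sketch below is a minimal illustration of that pipeline, not the repo's own inference notebook; the checkpoint id, the nst-da audio column layout and sampling rate, the speechbrain import path, and the example sentence are all assumptions.

```python
# Minimal sketch of the inference pipeline described in the removed
# "Intended uses & limitations" text; NOT the repo's official script
# (that lives in finetuned_nst-da-inference.ipynb).
import torch
import soundfile as sf
from datasets import load_dataset
from speechbrain.pretrained import EncoderClassifier  # speechbrain < 1.0 import path
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

checkpoint = "JackismyShephard/speecht5_tts-finetuned-nst-da"  # assumed repo id
processor = SpeechT5Processor.from_pretrained(checkpoint)
model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# x-vector encoder the card recommends for building speaker embeddings.
speaker_encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="/tmp/spkrec-xvect-voxceleb",
)

# The card suggests a clip from the nst-da training split; stream one example
# rather than downloading the whole dataset (assumes a standard "audio" column
# at 16 kHz, which is what the x-vector encoder expects).
clip = next(iter(load_dataset("alexandrainst/nst-da", split="train", streaming=True)))
waveform = torch.tensor(clip["audio"]["array"], dtype=torch.float32)

with torch.no_grad():
    embedding = speaker_encoder.encode_batch(waveform.unsqueeze(0))  # (1, 1, 512)
    embedding = torch.nn.functional.normalize(embedding, dim=2).squeeze(1)  # (1, 512)

# Keep input short, avoid "æ", "ø" and "å" (the default tokenizer does not
# cover them), and stay within the 600-token max_text_positions limit.
inputs = processor(text="Det er en smuk dag i dag.", return_tensors="pt")

with torch.no_grad():
    spectrogram = model.generate_speech(inputs["input_ids"], embedding)  # log-mel
    speech = vocoder(spectrogram)  # HiFi-GAN converts the spectrogram to a waveform

sf.write("speech.wav", speech.numpy(), samplerate=16000)  # SpeechT5 outputs 16 kHz
```

Per the card, the resulting waveform can optionally be post-processed with ResembleAI/resemble-enhance for higher quality; that step is omitted here.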
config.json
CHANGED
@@ -64,7 +64,6 @@
 "mask_time_length": 10,
 "mask_time_min_masks": 2,
 "mask_time_prob": 0.05,
-"max_length": 1876,
 "max_speech_positions": 1876,
 "max_text_positions": 600,
 "model_type": "speecht5",
@@ -85,8 +84,8 @@
 "speech_decoder_prenet_layers": 2,
 "speech_decoder_prenet_units": 256,
 "torch_dtype": "float32",
-"transformers_version": "4.37.
-"use_cache":
+"transformers_version": "4.37.2",
+"use_cache": true,
 "use_guided_attention_loss": true,
 "vocab_size": 81
 }
generation_config.json
CHANGED
@@ -5,5 +5,5 @@
 "eos_token_id": 2,
 "max_length": 1876,
 "pad_token_id": 1,
-"transformers_version": "4.37.
+"transformers_version": "4.37.2"
 }
model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:dff4359a3c72168158595b8f34cbf6aa51f98f47264b71477a6115de94707c2f
 size 577789320
runs/Feb11_23-38-55_CDT-DESKTOP-LINUX/events.out.tfevents.1707691135.CDT-DESKTOP-LINUX.8611.0
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6a7e3f7e969bf2372a4f78976f5ae3f4ad2aee2adc0d9ded3a68ec5239264a31
+size 1216780
runs/Feb11_23-38-55_CDT-DESKTOP-LINUX/events.out.tfevents.1707729176.CDT-DESKTOP-LINUX.8611.1
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f52f16751e8076e74d5e9235c61b66c3dc1c1735dcbe5d0c18fbfd617fe9b69c
+size 364
runs/Feb15_01-39-31_CDT-DESKTOP-LINUX/events.out.tfevents.1707957572.CDT-DESKTOP-LINUX.16926.0
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:2d28a273a43910b25316ff0f395da24002c14da259b4735e0418ff9b09bcc234
+size 1216806
runs/Feb15_01-39-31_CDT-DESKTOP-LINUX/events.out.tfevents.1707996353.CDT-DESKTOP-LINUX.16926.1
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:716a966597a0b5daf1778ec9ac0fb24885d945fb74b8286facfe19cd079a00b1
+size 364
training_args.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:7826ab175be8e113f8950ea58524622eadd9e37fa5d3f1e7b3d7377d06b01a48
 size 4920