teticio committed
Commit d122744
1 Parent(s): 96e542f

update readme

Files changed (2)
  1. README.md +27 -2
  2. scripts/train_vae.py +4 -3
README.md CHANGED
@@ -15,7 +15,10 @@ license: gpl-3.0
 
 ---
 
-**UPDATES**:
+**UPDATES**:
+
+15/10/2022
+Added latent audio diffusion (see below).
 
 4/10/2022
 It is now possible to mask parts of the input audio during generation, which means you can stitch several samples together (think "out-painting").
@@ -49,6 +52,7 @@ You can play around with some pretrained models on [Google Colab](https://colab.
 ```bash
 pip install .
 ```
+
 #### Training can be run with Mel spectrograms of resolution 64x64 on a single commercial grade GPU (e.g. RTX 2080 Ti). The `hop_length` should be set to 1024 for better results.
 
 ```bash
@@ -58,8 +62,8 @@ python scripts/audio_to_images.py \
   --input_dir path-to-audio-files \
   --output_dir path-to-output-data
 ```
-#### Generate dataset of 256x256 Mel spectrograms and push to hub (you will need to be authenticated with `huggingface-cli login`).
 
+#### Generate dataset of 256x256 Mel spectrograms and push to hub (you will need to be authenticated with `huggingface-cli login`).
 ```bash
 python scripts/audio_to_images.py \
   --resolution 256 \
@@ -67,6 +71,7 @@ python scripts/audio_to_images.py \
   --output_dir data/audio-diffusion-256 \
   --push_to_hub teticio/audio-diffusion-256
 ```
+
 ## Train model
 #### Run training on local machine.
 ```bash
@@ -83,6 +88,7 @@ accelerate launch --config_file config/accelerate_local.yaml \
   --lr_warmup_steps 500 \
   --mixed_precision no
 ```
+
 #### Run training on local machine with a `batch_size` of 2 and `gradient_accumulation_steps` of 8 to compensate, so that the 256x256 resolution model fits on a commercial grade GPU, and push to hub.
 ```bash
 accelerate launch --config_file config/accelerate_local.yaml \
@@ -101,6 +107,7 @@ accelerate launch --config_file config/accelerate_local.yaml \
   --hub_model_id audio-diffusion-256 \
   --hub_token $(cat $HOME/.huggingface/token)
 ```
+
 #### Run training on SageMaker.
 ```bash
 accelerate launch --config_file config/accelerate_sagemaker.yaml \
@@ -115,3 +122,21 @@ accelerate launch --config_file config/accelerate_sagemaker.yaml \
   --lr_warmup_steps 500 \
   --mixed_precision no
 ```
+## Latent Audio Diffusion
+Rather than denoising images directly, it is interesting to work in the "latent space" after first encoding images using an autoencoder. This has a number of advantages. Firstly, the information in the images is compressed into a latent space of much lower dimension, so it is much faster to train denoising diffusion models and run inference with them. Secondly, as the latent space is really an array (tensor) of Gaussian variables with a particular mean, decoded images are invariant to Gaussian noise. Thirdly, similar images tend to be clustered together, and interpolating between two images in latent space can produce meaningful combinations.
+
+At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacks training functionality, rather like its cousin `transformers` in the early days of development. In order to train a VAE (Variational Autoencoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format.
+
+#### Train an autoencoder.
+```bash
+python scripts/train_vae.py \
+  --dataset_name teticio/audio-diffusion-256 \
+  --batch_size 2 \
+  --gradient_accumulation_steps 12
+```
+
+#### Train latent diffusion model.
+```bash
+accelerate launch ...
+  --vae models/autoencoder-kl
+```
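To make the new section concrete: below is a minimal sketch of the encode/decode round trip it describes, assuming a VAE already converted to `diffusers` format and saved under `models/autoencoder-kl` (the path used in the commands above). `AutoencoderKL` is the `diffusers` VAE class; the 3-channel 256x256 input shape is an assumption, and the tensor here is a random stand-in for a Mel spectrogram image.

```python
import torch
from diffusers import AutoencoderKL

# Load the converted VAE checkpoint (path taken from the commands above).
vae = AutoencoderKL.from_pretrained("models/autoencoder-kl")
vae.eval()

# Random stand-in for a batch of one 256x256 Mel spectrogram image in [-1, 1].
image = torch.randn(1, 3, 256, 256)

with torch.no_grad():
    # encode() returns a diagonal Gaussian posterior over the latent space;
    # sample() draws the lower-dimensional latent tensor from it.
    latents = vae.encode(image).latent_dist.sample()
    # Decoding maps the latent back to image space; because the posterior is
    # Gaussian, small perturbations of the latent decode to similar images.
    reconstruction = vae.decode(latents).sample

print(latents.shape)  # e.g. torch.Size([1, 4, 32, 32]) with 8x downsampling
```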
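The third advantage, meaningful interpolations, can be sketched in the same setting. The `slerp` helper below is illustrative rather than part of this repo, and `z0`/`z1` are random stand-ins for two latents obtained as in the previous snippet.

```python
import torch

def slerp(z0: torch.Tensor, z1: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical linear interpolation, a common choice for Gaussian latents."""
    z0_flat, z1_flat = z0.flatten(), z1.flatten()
    cos_omega = torch.dot(z0_flat, z1_flat) / (z0_flat.norm() * z1_flat.norm())
    omega = torch.acos(cos_omega.clamp(-1.0, 1.0))
    so = torch.sin(omega)
    return (torch.sin((1.0 - t) * omega) / so) * z0 + (torch.sin(t * omega) / so) * z1

# Random stand-ins for two encoded spectrogram latents.
z0 = torch.randn(1, 4, 32, 32)
z1 = torch.randn(1, 4, 32, 32)

# Decoding z_mid with the VAE from the previous snippet should yield a blend
# of the two source spectrograms rather than a pixel-space crossfade.
z_mid = slerp(z0, z1, 0.5)
```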
scripts/train_vae.py CHANGED
@@ -4,7 +4,6 @@
 
 # TODO
 # grayscale
-# update README
 
 import os
 import argparse
@@ -107,7 +106,7 @@ class ImageLogger(Callback):
 
 class HFModelCheckpoint(ModelCheckpoint):
 
-    def __init__(self, ldm_config, hf_checkpoint='models/autoencoder-kl', *args, **kwargs):
+    def __init__(self, ldm_config, hf_checkpoint, *args, **kwargs):
         super().__init__(*args, **kwargs)
         self.ldm_config = ldm_config
         self.hf_checkpoint = hf_checkpoint
@@ -131,7 +130,9 @@ if __name__ == "__main__":
     parser.add_argument("--ldm_checkpoint_dir",
                         type=str,
                         default="models/ldm-autoencoder-kl")
-    parser.add_argument("--hf_checkpoint_dir", type=str, default="vae_model")
+    parser.add_argument("--hf_checkpoint_dir",
+                        type=str,
+                        default="models/autoencoder-kl")
     parser.add_argument("-r",
                         "--resume_from_checkpoint",
                         type=str,
 
4
 
5
  # TODO
6
  # grayscale
 
7
 
8
  import os
9
  import argparse
 
106
 
107
  class HFModelCheckpoint(ModelCheckpoint):
108
 
109
+ def __init__(self, ldm_config, hf_checkpoint, *args, **kwargs):
110
  super().__init__(*args, **kwargs)
111
  self.ldm_config = ldm_config
112
  self.hf_checkpoint = hf_checkpoint
 
130
  parser.add_argument("--ldm_checkpoint_dir",
131
  type=str,
132
  default="models/ldm-autoencoder-kl")
133
+ parser.add_argument("--hf_checkpoint_dir",
134
+ type=str,
135
+ default="models/autoencoder-kl")
136
  parser.add_argument("-r",
137
  "--resume_from_checkpoint",
138
  type=str,
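One usage note on the `train_vae.py` changes: the default Hugging Face checkpoint path has moved from `HFModelCheckpoint.__init__` to the `--hf_checkpoint_dir` flag, so the callback must now be passed the path explicitly. A hypothetical call site (the actual one is not shown in this diff) would wire it through like this:

```python
# Hypothetical wiring inside scripts/train_vae.py; names other than
# hf_checkpoint_dir, ldm_checkpoint_dir and HFModelCheckpoint are illustrative.
args = parser.parse_args()
checkpoint_callback = HFModelCheckpoint(
    ldm_config=config,                     # LDM config loaded elsewhere in the script
    hf_checkpoint=args.hf_checkpoint_dir,  # now "models/autoencoder-kl" by default
    dirpath=args.ldm_checkpoint_dir,       # standard pytorch_lightning ModelCheckpoint kwarg
)
```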