Commit 7999343 by sethforsgren (1 parent: eea56d5)
Update README.md

README.md CHANGED
@@ -35,7 +35,7 @@ This repository contains the model files, including:
 * a traced unet for improved inference speed
 * a seed image library for use with riffusion-app
 
-
+## Riffusion v1 Model
 
 Riffusion is a latent text-to-image diffusion model capable of generating spectrogram images given any text input. These spectrograms can be converted into audio clips.
 
@@ -45,14 +45,14 @@ You can use the Riffusion model directly, or try the [Riffusion web app](https:/
 
 The Riffusion model was created by fine-tuning the **Stable-Diffusion-v1-5** checkpoint. Read about Stable Diffusion here [🤗's Stable Diffusion blog](https://huggingface.co/blog/stable_diffusion).
 
-
+### Model Details
 - **Developed by:** Seth Forsgren, Hayk Martiros
 - **Model type:** Diffusion-based text-to-image generation model
 - **Language(s):** English
 - **License:** [The CreativeML OpenRAIL M license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) is an [Open RAIL M license](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses), adapted from the work that [BigScience](https://bigscience.huggingface.co/) and [the RAIL Initiative](https://www.licenses.ai/) are jointly carrying in the area of responsible AI licensing. See also [the article about the BLOOM Open RAIL license](https://bigscience.huggingface.co/blog/the-bigscience-rail-license) on which our license is based.
 - **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a [Latent Diffusion Model](https://arxiv.org/abs/2112.10752) that uses a fixed, pretrained text encoder ([CLIP ViT-L/14](https://arxiv.org/abs/2103.00020)) as suggested in the [Imagen paper](https://arxiv.org/abs/2205.11487).
 
-
+### Direct Use
 The model is intended for research purposes only. Possible research areas and
 tasks include
 
@@ -60,6 +60,13 @@ tasks include
 - Applications in educational or creative tools.
 - Research on generative models.
 
+### Datasets
+The original Stable Diffusion v1.5 was trained on the [LAION-5B](https://arxiv.org/abs/2210.08402) dataset using the [CLIP text encoder](https://openai.com/blog/clip/), which provided an amazing starting point with an in-depth understanding of language, including musical concepts. The team at LAION also compiled a fantastic audio dataset from many general, speech, and music sources that we recommend at [LAION-AI/audio-dataset](https://github.com/LAION-AI/audio-dataset/blob/main/data_collection/README.md).
+
+### Fine Tuning
+
+Check out the [diffusers training examples](https://huggingface.co/docs/diffusers/training/overview) from Hugging Face. Fine tuning requires a dataset of spectrogram images of short audio clips, with associated text describing them. Note that the CLIP encoder is able to understand and connect many words even if they never appear in the dataset. It is also possible to use a [dreambooth](https://huggingface.co/blog/dreambooth) method to get custom styles.
+
 ## Citation
 
 If you build on this work, please cite it as follows: