---
pipeline_tag: text-to-image
inference: true
license: openrail++
language:
- en
tags:
- Deci AI
- DeciDiffusion
---
# DeciDiffusion 2.0
DeciDiffusion 2.0 is a 732 million parameter text-to-image latent diffusion model, generated with the help of AutoNAC, Deci's proprietary Neural Architecture Search technology. Advanced training techniques were used to speed up training, improve training performance, and achieve better inference quality.
## Model Details
- **Developed by:** Deci
- **Model type:** Diffusion-based text-to-image generation model
- **Language(s) (NLP):** English
- **License:** The model is released under the [CreativeML Open RAIL++-M](https://huggingface.co/Deci/DeciDiffusion-v1-0/blob/main/LICENSE-WEIGHTS.md) license.
### Model Resources
- **Blog:** [A technical overview](https://deci.ai/blog/decidiffusion-2-0-text-to-image-generation-optimized-for-cost-effective-hardware/)
- **Demo:** [Experience DeciDiffusion in action](https://huggingface.co/spaces/Deci/DeciDiffusion-v2-0)
- **Notebook:** [Google Colab Notebook](https://colab.research.google.com/drive/11Ui_KRtK2DkLHLrW0aa11MiDciW4dTuB?usp=sharing)
- **Tutorial:** [Run on Qualcomm Cloud AI 100](https://github.com/quic/cloud-ai-sdk/tree/1.12/models/multimodal/text_to_image)
- **AWS:** Run DeciDiffusion on [AWS DL2q instances using the Qualcomm Cloud AI Platform SDK](https://bit.ly/Amazon-EC2-DL2q-Instance)
## Model Architecture
DeciDiffusion 2.0, a state-of-the-art diffusion-based text-to-image generation model, builds upon the core architecture of Stable Diffusion. It incorporates key elements like the Variational Autoencoder (VAE) and the pre-trained CLIP text encoder. A standout feature of DeciDiffusion is its U-Net component, which is optimized for performance on cost-effective hardware. DeciDiffusion’s AutoNAC-generated U-Net-NAS has 525 million parameters, compared with 860 million in Stable Diffusion 1.5’s U-Net. This smaller design reduces per-image latency, making DeciDiffusion an efficient option for text-to-image generation.
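The parameter counts quoted above can be put side by side with a few lines of arithmetic. This is only a sketch: the three figures come from this card, while the idea that the 732 million total is the U-Net plus the VAE and the CLIP text encoder is an assumption, not something the card states.

```python
# Parameter counts quoted in this model card, in millions.
TOTAL_PARAMS_M = 732        # DeciDiffusion 2.0, whole model
DECI_UNET_PARAMS_M = 525    # AutoNAC-generated U-Net-NAS
SD15_UNET_PARAMS_M = 860    # Stable Diffusion 1.5 U-Net

# Relative size reduction of the U-Net compared with Stable Diffusion 1.5.
unet_reduction = 1 - DECI_UNET_PARAMS_M / SD15_UNET_PARAMS_M

# Assumption: the 732M total counts the U-Net plus the VAE and text encoder.
non_unet_params_m = TOTAL_PARAMS_M - DECI_UNET_PARAMS_M

print(f"U-Net reduction vs. SD 1.5: {unet_reduction:.1%}")
print(f"Parameters outside the U-Net: {non_unet_params_m}M")
```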
## Training Details
### Training Procedure
The model was trained in 4 phases:
- **Phase 1:** Trained from scratch for 1.28 million steps at resolution 256x256.
- **Phase 2:** Trained for 870k steps at resolution 512x512 on the same dataset to learn more fine-detailed information.
- **Phase 3:** Trained for 65k steps with EMA, a different learning rate scheduler, and more "qualitative" data.
- **Phase 4:** Fine-tuned on a 2M sample dataset.
### Training Techniques
DeciDiffusion 2.0 marks a significant advancement over previous latent diffusion models, particularly in terms of sample efficiency. This means it can produce high-quality images with fewer diffusion timesteps during the inference process. To attain such efficiency, Deci has refined the DPM++ scheduler, effectively cutting down the number of steps needed to generate a quality image from 16 to just 10.
Additionally, the following training techniques were used to improve the model's sample efficiency:
- **[V-prediction](https://arxiv.org/pdf/2202.00512.pdf)**
- **[Enforcing zero terminal SNR during training](https://arxiv.org/pdf/2305.08891.pdf)**
- **[Using a Min-SNR loss weighting strategy](https://arxiv.org/abs/2303.09556)**
- **[Employing Rescale Classifier-Free Guidance during inference](https://arxiv.org/pdf/2305.08891.pdf)**
- **[Sampling from the last timestep](https://arxiv.org/pdf/2305.08891.pdf)**
- **Training for 870k steps at resolution 512x512 on the same dataset to learn more fine-detailed information**
- **[Utilizing LAMB optimizer with large batch](https://arxiv.org/abs/1904.00962)**
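Several of the techniques listed above fit together, and a small pure-Python sketch may make the relationship clearer. It is not Deci's training code. The scaled-linear beta schedule and `gamma=5.0` are assumptions borrowed from common Stable Diffusion and Min-SNR settings, and all function names are hypothetical.

```python
import math

def scaled_linear_betas(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Stable Diffusion style noise schedule (assumed here for illustration)."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (num_steps - 1)) ** 2 for i in range(num_steps)]

def alphas_cumprod(betas):
    """Cumulative product of (1 - beta_t), usually written alpha_bar_t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def rescale_zero_terminal_snr(alphas_bar):
    """Shift and scale sqrt(alpha_bar) so the final timestep has SNR = 0."""
    sqrt_ab = [math.sqrt(a) for a in alphas_bar]
    first, last = sqrt_ab[0], sqrt_ab[-1]
    scale = first / (first - last)
    return [((s - last) * scale) ** 2 for s in sqrt_ab]

def snr(alpha_bar):
    """Signal-to-noise ratio alpha_bar / (1 - alpha_bar)."""
    return alpha_bar / (1.0 - alpha_bar)

def min_snr_weight_v(alpha_bar, gamma=5.0):
    """Min-SNR-gamma loss weight for a v-prediction objective."""
    s = snr(alpha_bar)
    return min(s, gamma) / (s + 1.0)

def v_target(x0, noise, alpha_bar):
    """v = sqrt(alpha_bar) * noise - sqrt(1 - alpha_bar) * x0, element-wise."""
    a, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * n - sigma * x for x, n in zip(x0, noise)]

alphas_bar = rescale_zero_terminal_snr(alphas_cumprod(scaled_linear_betas()))
print("terminal SNR:", snr(alphas_bar[-1]))
print("loss weight at t=0:", min_snr_weight_v(alphas_bar[0]))
```

At the final timestep the rescaled schedule leaves pure noise, so a noise-prediction target would carry no information about the image. The v-prediction target reduces to `-x0` there, which is one reason the two techniques are typically used together.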
The following techniques were used to shorten training time:
- **Using precomputed VAE and CLIP latents**
- **Using EMA only in the last phase of training**
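For readers unfamiliar with EMA (exponential moving average of model weights), the update rule can be sketched as follows. The helper name and the decay values are hypothetical; the card does not state which decay was used.

```python
def ema_update(ema_params, params, decay=0.9999):
    """Return updated EMA weights: ema = decay * ema + (1 - decay) * param."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Toy example with two scalar "weights" and an exaggerated decay of 0.5.
ema = [0.0, 1.0]
for step_params in ([1.0, 1.0], [1.0, 1.0]):
    ema = ema_update(ema, step_params, decay=0.5)

print(ema)
```

Keeping an EMA copy requires an extra set of weights and an extra update per step, so restricting it to the last phase plausibly saves memory and time during the earlier phases.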
### Additional Details
#### Phase 1
- **Hardware:** 6 x 8 x H100 (80GB)
- **Optimizer:** LAMB
- **Batch:** 18432
- **Learning rate:** 5e-03
#### Phases 2-4
- **Hardware:** 8 x 8 x H100 (80GB)
- **Optimizer:** LAMB
- **Batch:** 7168
- **Learning rate:** 5e-03
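The hardware and batch figures above imply a per-GPU batch size and a rough number of samples processed per phase. The sketch below rests on several assumptions: that "6 x 8" means 6 nodes with 8 GPUs each, that the listed batch is the global batch with no gradient accumulation, and that Phase 2 ran for 870k steps. Phase 4 is left out because its step count is not given.

```python
# Figures taken from this card; the interpretation of each is an assumption.
phases = {
    "phase 1": {"nodes": 6, "gpus_per_node": 8, "batch": 18432, "steps": 1_280_000},
    "phase 2": {"nodes": 8, "gpus_per_node": 8, "batch": 7168, "steps": 870_000},
    "phase 3": {"nodes": 8, "gpus_per_node": 8, "batch": 7168, "steps": 65_000},
}

def summarize(cfg):
    """Derive GPU count, per-GPU batch, and samples seen for one phase."""
    gpus = cfg["nodes"] * cfg["gpus_per_node"]
    return {
        "gpus": gpus,
        "per_gpu_batch": cfg["batch"] // gpus,
        "samples_seen": cfg["batch"] * cfg["steps"],
    }

summary = {name: summarize(cfg) for name, cfg in phases.items()}
for name, stats in summary.items():
    print(name, stats)
```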
## Runtime Benchmarks
The following table compares per-image latency for DeciDiffusion 2.0 and Stable Diffusion v1.5 on AI 100 at FP16 precision.
| Implementation and iterations | DeciDiffusion 2.0 on AI 100 (seconds/image) | Stable Diffusion v1.5 on AI 100 (seconds/image) |
|:----------|:----------|:----------|
| Compiled, 16 iterations | 1.335 | 2.478 |
| Compiled, 10 iterations | 0.971 | 1.684 |
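The latencies in the table translate into speedup and throughput figures with simple arithmetic. The snippet below only re-derives numbers from the table; the dictionary layout and key names are illustrative.

```python
# Seconds per image from the table above, keyed by iteration count.
latency_s = {
    16: {"deci": 1.335, "sd15": 2.478},
    10: {"deci": 0.971, "sd15": 1.684},
}

# Speedup of DeciDiffusion 2.0 over Stable Diffusion v1.5 at equal iterations.
speedup = {its: row["sd15"] / row["deci"] for its, row in latency_s.items()}

# Throughput in images per second for DeciDiffusion 2.0.
throughput = {its: 1.0 / row["deci"] for its, row in latency_s.items()}

for its in sorted(latency_s):
    print(f"{its} iterations: {speedup[its]:.2f}x faster, "
          f"{throughput[its]:.2f} images/s")
```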
## How to Use
**Note:** You must use diffusers v0.21.4 to run the model successfully.
```python
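# pip install diffusers==0.21.4 transformers torch
from diffusers import StableDiffusionPipeline
import torch

# Use the GPU when available; float16 is generally only practical on CUDA devices.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
dtype = torch.float16 if device == 'cuda' else torch.float32
checkpoint = "Deci/DeciDiffusion-v2-0"

# Load the custom pipeline, then swap in the U-Net stored in the 'flexible_unet' subfolder.
pipeline = StableDiffusionPipeline.from_pretrained(checkpoint, custom_pipeline=checkpoint, torch_dtype=dtype)
pipeline.unet = pipeline.unet.from_pretrained(checkpoint, subfolder='flexible_unet', torch_dtype=dtype)
pipeline = pipeline.to(device)

# The pipeline output holds a list of images; take the first one.
img = pipeline(prompt=['A photo of an astronaut riding a horse on Mars']).images[0]
```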
```
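For reproducible runs it is often useful to fix the seed and the step count, and to save the result. The helper below is a hypothetical sketch: it assumes the custom pipeline accepts the usual diffusers arguments `num_inference_steps` and `generator`, which this card does not confirm, and the default of 10 steps simply mirrors the figure quoted under Training Techniques.

```python
def generate_image(prompt, steps=10, seed=0, out_path="deci_output.png"):
    """Generate one image and save it to disk (hypothetical helper)."""
    # Imported lazily so that defining the helper needs no GPU stack.
    import torch
    from diffusers import StableDiffusionPipeline

    checkpoint = "Deci/DeciDiffusion-v2-0"
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    pipeline = StableDiffusionPipeline.from_pretrained(
        checkpoint, custom_pipeline=checkpoint, torch_dtype=dtype
    )
    pipeline.unet = pipeline.unet.from_pretrained(
        checkpoint, subfolder="flexible_unet", torch_dtype=dtype
    )
    pipeline = pipeline.to(device)

    # Assumption: the custom pipeline accepts these standard arguments.
    generator = torch.Generator(device=device).manual_seed(seed)
    image = pipeline(
        prompt=[prompt], num_inference_steps=steps, generator=generator
    ).images[0]
    image.save(out_path)  # the returned image is a PIL image
    return out_path
```

Calling `generate_image("A photo of an astronaut riding a horse on Mars")` downloads the checkpoint on first use, so it needs network access and enough memory for the model.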
## Uses
### Misuse, Malicious Use, and Out-of-Scope Use
The model must not be employed to deliberately produce or spread images that foster hostile or unwelcoming settings for individuals. This encompasses generating visuals that might be predictably upsetting, distressing, or inappropriate, as well as content that perpetuates existing or historical biases.
#### Out-of-Scope Use
The model isn't designed to produce accurate or truthful depictions of people or events. Thus, using it for such purposes exceeds its intended capabilities.
#### Misuse and Malicious Use
Misusing the model to produce content that harms or maligns individuals is strictly discouraged. Such misuses include, but aren't limited to:
- Creating offensive, degrading, or damaging portrayals of individuals, their cultures, religions, or surroundings.
- Intentionally promoting or propagating discriminatory content or harmful stereotypes.
- Posing as someone else without their agreement.
- Generating explicit content without the knowledge or agreement of potential viewers.
- Distributing copyrighted or licensed content against its usage terms.
- Sharing modified versions of copyrighted or licensed content in breach of its usage guidelines.
## Limitations and Bias
### Limitations
The model has certain limitations and may not function optimally in the following scenarios:
- It doesn't produce completely photorealistic images.
- Rendering legible text is beyond its capability.
- Complex compositions, like visualizing “A green sphere to the left of a blue square”, are challenging for the model.
- Generation of faces and human figures may be imprecise.
- It is primarily optimized for English captions and might not be as effective with other languages.
- The autoencoding component of the model is lossy.
### Bias
The remarkable abilities of image-generation models can unintentionally amplify societal biases. DeciDiffusion was trained with a focus on English descriptions. Consequently, non-English communities and cultures might be underrepresented, leading to a bias towards white and Western norms. Outputs from non-English prompts are notably less accurate. Given these biases, users should approach DeciDiffusion with discretion, regardless of input.
## How to Cite
Please cite this model using this format.
```bibtex
@misc{DeciFoundationModels,
title = {DeciDiffusion 2.0},
author = {DeciAI Research Team},
year = {2024},
url = {https://huggingface.co/deci/decidiffusion-v2-0},
}
```