---
pipeline_tag: text-to-image
license: other
license_name: faipl-1.0-sd
license_link: LICENSE
decoder:
- Disty0/sotediffusion-wuerstchen3-decoder
---


# SoteDiffusion Wuerstchen3

Anime finetune of Würstchen V3.  

# Release Notes

- This release is sponsored by <a href="https://fal.ai/grants?rel=sote-diffusion" target="_blank">fal.ai/grants</a>  
- Trained on 6M images for 3 epochs using 8x A100 80G GPUs.  

# API Usage

This model is available through the Fal.AI API.  
For more details: https://fal.ai/models/fal-ai/stable-cascade/sote-diffusion  

<style>
.image {
    float: left;
    margin-left: 10px;
}
</style>

<table>
<img class="image" src="https://cdn-uploads.huggingface.co/production/uploads/6456af6195082f722d178522/9NmbUy1iaenscVLqCt7dA.png" width="320">
<img class="image" src="https://cdn-uploads.huggingface.co/production/uploads/6456af6195082f722d178522/78vAZc1-Ed1LhBst7HAa5.png" width="320">
</table>

# UI Guide

## SD.Next
URL: https://github.com/vladmandic/automatic/

Go to Models -> Huggingface, enter `Disty0/sotediffusion-wuerstchen3-decoder` as the model name, and press Download.  
Once the download is complete, load `Disty0/sotediffusion-wuerstchen3-decoder`.  

Prompt:  
```
newest, extremely aesthetic, best quality,
```

Negative Prompt:  
```
very displeasing, worst quality, monochrome, realistic, oldest, loli,
```

Parameters:  
Sampler: Default  

Steps: 30 or 40  
Refiner Steps: 10  

CFG: 7  
Secondary CFG: 2 or 1  

Resolution: 1024x1536, 2048x1152  
Any resolution works as long as both the width and the height are multiples of 128.
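
The multiple-of-128 rule can be checked before generating. The helper below is a minimal sketch, not part of SD.Next or diffusers: the function names are hypothetical, and rounding to the nearest multiple is an assumption about how you might want to adjust an arbitrary size.

```python
def snap_to_multiple(value: int, multiple: int = 128) -> int:
    """Round value to the nearest multiple (halves round up), with a floor of one multiple."""
    snapped = (value + multiple // 2) // multiple * multiple
    return max(multiple, snapped)


def snap_resolution(width: int, height: int, multiple: int = 128) -> tuple[int, int]:
    """Snap both dimensions so that each is a multiple of `multiple`."""
    return snap_to_multiple(width, multiple), snap_to_multiple(height, multiple)


print(snap_resolution(1920, 1080))  # (1920, 1024)
print(snap_resolution(1024, 1536))  # (1024, 1536), already valid
```

Snapping changes the aspect ratio slightly, so it may be preferable to pick one of the listed resolutions when the exact framing matters.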


## ComfyUI

Please refer to CivitAI: https://civitai.com/models/353284  


# Code Example

```shell
pip install diffusers
```

```python
import torch
from diffusers import StableCascadeCombinedPipeline

device = "cuda"
dtype = torch.bfloat16 # or torch.float16
model = "Disty0/sotediffusion-wuerstchen3-decoder"

pipe = StableCascadeCombinedPipeline.from_pretrained(model, torch_dtype=dtype)

# send everything to the gpu:
pipe = pipe.to(device, dtype=dtype)
pipe.prior_pipe = pipe.prior_pipe.to(device, dtype=dtype)

# or enable model offload to save vram:
# pipe.enable_model_cpu_offload()



prompt = "newest, extremely aesthetic, best quality, 1girl, solo, cat ears, pink hair, orange eyes, long hair, bare shoulders, looking at viewer, smile, indoors, casual, living room, playing guitar,"
negative_prompt = "very displeasing, worst quality, monochrome, realistic, oldest, loli,"
output = pipe(
    width=1024,
    height=1536,
    prompt=prompt,
    negative_prompt=negative_prompt,
    decoder_guidance_scale=2.0,
    prior_guidance_scale=7.0,
    prior_num_inference_steps=30,
    output_type="pil",
    num_inference_steps=10
).images[0]

# do something with the output image, e.g. save it to disk:
output.save("output.png")
```

## Training:
**Software used**: Kohya SD-Scripts with Stable Cascade branch.  
https://github.com/kohya-ss/sd-scripts/tree/stable-cascade   

**GPU used**: 8x Nvidia A100 80GB  
**GPU Hours**: 220  

### Base
| parameter | value |
|---|---|
| **amp** | bf16 |
| **weights** | fp32 |
| **save weights** | fp16 |
| **resolution** | 1024x1024 |
| **effective batch size** | 128 |
| **unet learning rate** | 1e-5 |
| **te learning rate** | 4e-6 |
| **optimizer** | Adafactor |
| **images** | 6M |
| **epochs** | 3 |

### Final

| parameter | value |
|---|---|
| **amp** | bf16 |
| **weights** | fp32 |
| **save weights** | fp16 |
| **resolution** | 1024x1024 |
| **effective batch size** | 128 |
| **unet learning rate** |  4e-6 |
| **te learning rate** | none |
| **optimizer** | Adafactor |
| **images** | 120K |
| **epochs** | 16 |
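
As a rough cross-check of the two tables above, the number of optimizer steps can be estimated from the image count, epoch count, and effective batch size. This is only a sketch: the image counts in the tables are rounded, and whether the last partial batch of an epoch is kept or dropped is not stated here, so the results are approximations rather than the actual step counts of the training runs.

```python
import math


def estimated_steps(images: int, epochs: int, effective_batch_size: int) -> int:
    """Estimate optimizer steps, assuming the last partial batch of each epoch is kept."""
    steps_per_epoch = math.ceil(images / effective_batch_size)
    return steps_per_epoch * epochs


# Values taken from the Base and Final tables (image counts are rounded).
print(estimated_steps(6_000_000, 3, 128))  # 140625
print(estimated_steps(120_000, 16, 128))   # 15008
```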

## Dataset:

**GPU used for captioning**: 1x Intel ARC A770 16GB  
**GPU Hours**: 350  

**Model used for captioning**: SmilingWolf/wd-swinv2-tagger-v3  
**Command:**  
```
python /mnt/DataSSD/AI/Apps/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py --model_dir "/mnt/DataSSD/AI/models/wd14_tagger_model" --repo_id "SmilingWolf/wd-swinv2-tagger-v3" --recursive --remove_underscore --use_rating_tags --character_tags_first --character_tag_expand --append_tags --onnx --caption_separator ", " --general_threshold 0.35 --character_threshold 0.50 --batch_size 4 --caption_extension ".txt" ./
```


| dataset name | total images |
|---|---|
| **newest** | 1.848.331 |
| **recent** | 1.380.630 |
| **mid** | 993.227 |
| **early** | 566.152 |
| **oldest** | 160.397 |
| **pixiv** | 343.614 |
| **visual novel cg** | 231.358 |
| **anime wallpaper** | 104.790 |
| **Total** | 5.628.499 |


**Note**:  
 - The smallest image size is 1280x600 (768.000 pixels).
 - Deduplicated based on image similarity using czkawka-cli.
 - Around 120K very high quality images were intentionally duplicated 5 times, bringing the total image count to 6.2M.
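
The totals above can be reproduced with a few lines of arithmetic. The sketch below sums the per-dataset counts from the table and adds the duplicated images. It assumes "duplicated 5 times" means five extra copies of roughly 120K images, which is the reading that lands on about 6.2M; the exact number of duplicated images is not given.

```python
# Per-dataset image counts, copied from the table above.
dataset_sizes = {
    "newest": 1_848_331,
    "recent": 1_380_630,
    "mid": 993_227,
    "early": 566_152,
    "oldest": 160_397,
    "pixiv": 343_614,
    "visual novel cg": 231_358,
    "anime wallpaper": 104_790,
}

unique_total = sum(dataset_sizes.values())
extra_copies = 120_000 * 5  # assumption: five additional copies of ~120K images

print(unique_total)                 # 5628499
print(unique_total + extra_copies)  # 6228499, i.e. about 6.2M
print(1280 * 600)                   # 768000 pixels for the smallest size
```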


## Tags:

The model was trained with a randomized tag order. For reference, the tags are stored in the dataset in this order:  
```
aesthetic tags, quality tags, date tags, custom tags, rating tags, character, series, rest of the tags
```

### Date:

| tag | date |
|---|---|
| **newest** | 2022 to 2024 |
| **recent** | 2019 to 2021 |
| **mid** | 2015 to 2018 |
| **early** | 2011 to 2014 |
| **oldest** | 2005 to 2010 |

### Aesthetic Tags:
**Model used**: shadowlilac/aesthetic-shadow-v2

| score greater than | tag | count |
|---|---|---|
| **0.90** | extremely aesthetic | 125.451 |
| **0.80** | very aesthetic | 887.382 |
| **0.70** | aesthetic | 1.049.857 |
| **0.50** | slightly aesthetic | 1.643.091 |
| **0.40** | not displeasing | 569.543 |
| **0.30** | not aesthetic | 445.188 |
| **0.20** | slightly displeasing | 341.424 |
| **0.10** | displeasing | 237.660 |
| **rest of them** | very displeasing | 328.712 |
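
The thresholds above can be read as a top-down lookup: the first threshold that the score strictly exceeds decides the tag. The sketch below assumes a score that has already been produced by the aesthetic model; running the model itself is out of scope, and the function name is hypothetical.

```python
# (score must be greater than, tag), copied from the aesthetic table above.
AESTHETIC_THRESHOLDS = [
    (0.90, "extremely aesthetic"),
    (0.80, "very aesthetic"),
    (0.70, "aesthetic"),
    (0.50, "slightly aesthetic"),
    (0.40, "not displeasing"),
    (0.30, "not aesthetic"),
    (0.20, "slightly displeasing"),
    (0.10, "displeasing"),
]


def aesthetic_tag(score: float) -> str:
    """Map an aesthetic score to its tag using strict greater-than thresholds."""
    for threshold, tag in AESTHETIC_THRESHOLDS:
        if score > threshold:
            return tag
    return "very displeasing"  # the "rest of them" row


print(aesthetic_tag(0.95))  # extremely aesthetic
print(aesthetic_tag(0.60))  # slightly aesthetic
print(aesthetic_tag(0.05))  # very displeasing
```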

### Quality Tags:
**Model used**: https://huggingface.co/hakurei/waifu-diffusion-v1-4/blob/main/models/aes-B32-v0.pth

| score greater than | tag | count |
|---|---|---|
| **0.980** | best quality | 1.270.447 |
| **0.900** | high quality | 498.244 |
| **0.750** | great quality | 351.006 |
| **0.500** | medium quality | 366.448 |
| **0.250** | normal quality | 368.380 |
| **0.125** | bad quality | 279.050 |
| **0.025** | low quality | 538.958 |
| **rest of them** | worst quality | 1.955.966 |
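
The quality thresholds are unevenly spaced, so a binary search over the sorted thresholds is a compact way to express the lookup. This is a sketch using the standard library `bisect` module, assuming a precomputed quality score; the function name is hypothetical.

```python
from bisect import bisect_left

# Thresholds in ascending order, copied from the quality table above.
QUALITY_THRESHOLDS = [0.025, 0.125, 0.250, 0.500, 0.750, 0.900, 0.980]
# QUALITY_TAGS[i] applies when the score is strictly greater than exactly i thresholds.
QUALITY_TAGS = [
    "worst quality", "low quality", "bad quality", "normal quality",
    "medium quality", "great quality", "high quality", "best quality",
]


def quality_tag(score: float) -> str:
    """Map a quality score to its tag; bisect_left counts thresholds strictly below the score."""
    return QUALITY_TAGS[bisect_left(QUALITY_THRESHOLDS, score)]


print(quality_tag(0.99))  # best quality
print(quality_tag(0.30))  # normal quality
print(quality_tag(0.01))  # worst quality
```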

### Rating Tags:

| tag | count |
|---|---|
| **general** | 1.416.451 |
| **sensitive** | 3.447.664 |
| **nsfw** | 427.459 |
| **explicit nsfw** | 336.925 |

### Custom Tags:

| dataset name | custom tag |
|---|---|
| **image boards** | date, |
| **text** | The text says "text", |
| **characters** | character, series |
| **pixiv** | art by Display_Name, |
| **visual novel cg** | Full_VN_Name (short_3_letter_name), visual novel cg, |
| **anime wallpaper** | date, anime wallpaper, |


## Limitations and Bias

### Bias

- This model is intended for anime illustrations.  
  Realistic capabilities have not been tested.  

### Limitations

- Output can fall back to a realistic style.  
  Add the "realistic" tag to the negative prompt when this happens.  
- Eyes and hands in far shots can be rendered poorly.  


## License

SoteDiffusion models fall under the [Fair AI Public License 1.0-SD](https://freedevproject.org/faipl-1.0-sd/), which is compatible with the Stable Diffusion models’ license. Key points:

1. **Modification Sharing:** If you modify SoteDiffusion models, you must share both your changes and the original license.
2. **Source Code Accessibility:** If your modified version is network-accessible, provide a way (like a download link) for others to get the source code. This applies to derived models too.
3. **Distribution Terms:** Any distribution must be under this license or another with similar rules.
4. **Compliance:** Non-compliance must be fixed within 30 days to avoid license termination, emphasizing transparency and adherence to open-source values.

**Notes**: Anything not covered by the Fair AI license is inherited from the Stability AI Non-Commercial license, which is provided under the name LICENSE_INHERIT.