	---
	pipeline_tag: text-to-image
	license: other
	license_name: faipl-1.0-sd
	license_link: LICENSE
	base_model: stabilityai/stable-cascade
	tags:
	- text-to-image
	- anime
	library_name: diffusers
	language: en
	inference: false
	decoder: Disty0/sotediffusion-wuerstchen3-decoder
	new_version: Disty0/sotediffusion-v2
	---


	# New version is available: https://huggingface.co/Disty0/sotediffusion-v2


	# SoteDiffusion Wuerstchen3

	Anime finetune of Würstchen V3.

	# Release Notes

	- This release is sponsored by <a href="https://fal.ai/grants?rel=sote-diffusion" target="_blank">fal.ai/grants</a>
	- Trained on 6M images for 3 epochs using 8x A100 80G GPUs.

	# API Usage

	This model can be used via API with Fal.AI.
	For more details: https://fal.ai/models/fal-ai/stable-cascade/sote-diffusion

	<style>
	.image {
	float: left;
	margin-left: 10px;
	}
	</style>

	<table>
	<img class="image" src="https://cdn-uploads.huggingface.co/production/uploads/6456af6195082f722d178522/9NmbUy1iaenscVLqCt7dA.png" width="320">
	<img class="image" src="https://cdn-uploads.huggingface.co/production/uploads/6456af6195082f722d178522/78vAZc1-Ed1LhBst7HAa5.png" width="320">
	</table>

	# UI Guide

	## SD.Next
	URL: https://github.com/vladmandic/automatic/

	Go to Models -> Huggingface, type `Disty0/sotediffusion-wuerstchen3-decoder` into the model name field, and press Download.
	Load `Disty0/sotediffusion-wuerstchen3-decoder` after the download is complete.

	Prompt:
	```
	newest, extremely aesthetic, best quality,
	```

	Negative Prompt:
	```
	very displeasing, worst quality, monochrome, realistic, oldest, loli,
	```

	Parameters:
	Sampler: Default

	Steps: 30 or 40
	Refiner Steps: 10

	CFG: 7
	Secondary CFG: 2 or 1

	Resolution: 1024x1536, 2048x1152
	Any resolution works as long as both dimensions are multiples of 128.
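
The multiple-of-128 rule can be checked before generating. The following is a minimal sketch, not part of SD.Next or diffusers; the helper names `is_valid_resolution` and `snap_to_multiple` are hypothetical and exist only for illustration:

```python
def is_valid_resolution(width: int, height: int, multiple: int = 128) -> bool:
    """Return True when both dimensions are positive multiples of `multiple`."""
    return (
        width > 0
        and height > 0
        and width % multiple == 0
        and height % multiple == 0
    )


def snap_to_multiple(value: int, multiple: int = 128) -> int:
    """Round `value` to the nearest multiple, never going below one multiple."""
    snapped = (value + multiple // 2) // multiple * multiple
    return max(multiple, snapped)


if __name__ == "__main__":
    print(is_valid_resolution(1024, 1536))  # True
    print(is_valid_resolution(1920, 1080))  # False, 1080 is not a multiple of 128
    print(snap_to_multiple(1080))           # 1024
```

Snapping to the nearest multiple changes the aspect ratio slightly, so it may be preferable to pick one of the recommended resolutions directly.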


	## ComfyUI

	Please refer to CivitAI: https://civitai.com/models/353284


	# Code Example

	```shell
	pip install diffusers
	```

	```python
	import torch
	from diffusers import StableCascadeCombinedPipeline

	device = "cuda"
	dtype = torch.bfloat16 # or torch.float16
	model = "Disty0/sotediffusion-wuerstchen3-decoder"

	pipe = StableCascadeCombinedPipeline.from_pretrained(model, torch_dtype=dtype)

	# send everything to the gpu:
	pipe = pipe.to(device, dtype=dtype)
	pipe.prior_pipe = pipe.prior_pipe.to(device, dtype=dtype)

	# or enable model offload to save vram:
	# pipe.enable_model_cpu_offload()



	prompt = "newest, extremely aesthetic, best quality, 1girl, solo, cat ears, pink hair, orange eyes, long hair, bare shoulders, looking at viewer, smile, indoors, casual, living room, playing guitar,"
	negative_prompt = "very displeasing, worst quality, monochrome, realistic, oldest, loli,"
	output = pipe(
	    width=1024,
	    height=1536,
	    prompt=prompt,
	    negative_prompt=negative_prompt,
	    decoder_guidance_scale=2.0,
	    prior_guidance_scale=7.0,
	    prior_num_inference_steps=30,
	    output_type="pil",
	    num_inference_steps=10,
	).images[0]

	# do something with the output image
	```

	## Training:
	Software used: Kohya SD-Scripts with Stable Cascade branch.
	https://github.com/kohya-ss/sd-scripts/tree/stable-cascade

	GPU used: 8x Nvidia A100 80GB
	GPU Hours: 220

	### Base
	| parameter | value |
	|---|---|
	| amp | bf16 |
	| weights | fp32 |
	| save weights | fp16 |
	| resolution | 1024x1024 |
	| effective batch size | 128 |
	| unet learning rate | 1e-5 |
	| te learning rate | 4e-6 |
	| optimizer | Adafactor |
	| images | 6M |
	| epochs | 3 |

	### Final

	| parameter | value |
	|---|---|
	| amp | bf16 |
	| weights | fp32 |
	| save weights | fp16 |
	| resolution | 1024x1024 |
	| effective batch size | 128 |
	| unet learning rate | 4e-6 |
	| te learning rate | none |
	| optimizer | Adafactor |
	| images | 120K |
	| epochs | 16 |

	## Dataset:

	GPU used for captioning: 1x Intel ARC A770 16GB
	GPU Hours: 350

	Model used for captioning: SmilingWolf/wd-swinv2-tagger-v3
	Model used for text: llava-hf/llava-1.5-7b-hf

	Command:
	```
	python /mnt/DataSSD/AI/Apps/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py \
	    --model_dir "/mnt/DataSSD/AI/models/wd14_tagger_model" \
	    --repo_id "SmilingWolf/wd-swinv2-tagger-v3" \
	    --recursive --remove_underscore --use_rating_tags \
	    --character_tags_first --character_tag_expand --append_tags --onnx \
	    --caption_separator ", " \
	    --general_threshold 0.35 --character_threshold 0.50 \
	    --batch_size 4 --caption_extension ".txt" \
	    ./
	```


	| dataset name | total images |
	|---|---|
	| newest | 1,848,331 |
	| recent | 1,380,630 |
	| mid | 993,227 |
	| early | 566,152 |
	| oldest | 160,397 |
	| pixiv | 343,614 |
	| visual novel cg | 231,358 |
	| anime wallpaper | 104,790 |
	| Total | 5,628,499 |


	Note:
	- Smallest size is 1280x600 (768,000 pixels).
	- Deduplicated based on image similarity using czkawka-cli.
	- Around 120K very high quality images were intentionally duplicated 5 times, bringing the total image count to 6.2M.
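
The totals in this section can be cross-checked with a few lines of arithmetic. This is a sketch under two assumptions that the notes do not spell out: "around 120K" is taken as exactly 120,000, and "duplicated 5 times" is read as five extra copies of each image.

```python
# Per-dataset image counts, copied from the table above.
dataset_counts = {
    "newest": 1_848_331,
    "recent": 1_380_630,
    "mid": 993_227,
    "early": 566_152,
    "oldest": 160_397,
    "pixiv": 343_614,
    "visual novel cg": 231_358,
    "anime wallpaper": 104_790,
}

unique_total = sum(dataset_counts.values())

# Assumptions: exactly 120,000 images, each contributing 5 extra copies.
duplicated_images = 120_000
extra_copies = 5
training_total = unique_total + duplicated_images * extra_copies

# Pixel count of the smallest allowed image size.
min_pixels = 1280 * 600

print(unique_total)                       # 5628499
print(f"{training_total / 1e6:.1f}M")     # 6.2M
print(min_pixels)                         # 768000
```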


	## Tags:

	The model is trained with random tag order, but this is the tag order in the dataset, if you are interested:
	```
	aesthetic tags, quality tags, date tags, custom tags, rating tags, character, series, rest of the tags
	```
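
To illustrate the difference between the dataset order and the random order seen during training, here is a minimal sketch. The group keys, the `build_caption` helper, and the shuffling approach are assumptions made for illustration; the training scripts may shuffle captions differently.

```python
import random

# Group order as listed above; the short key names are hypothetical.
DATASET_ORDER = [
    "aesthetic", "quality", "date", "custom",
    "rating", "character", "series", "rest",
]


def build_caption(groups, shuffle=False, rng=None):
    """Join tag groups in dataset order, optionally shuffling all tags."""
    tags = [tag for name in DATASET_ORDER for tag in groups.get(name, [])]
    if shuffle:
        # Assumption: every tag is shuffled freely, with no fixed prefix.
        (rng or random).shuffle(tags)
    return ", ".join(tags)


example_groups = {
    "aesthetic": ["extremely aesthetic"],
    "quality": ["best quality"],
    "date": ["newest"],
    "rating": ["general"],
    "rest": ["1girl", "solo", "cat ears"],
}

if __name__ == "__main__":
    # Dataset order:
    # extremely aesthetic, best quality, newest, general, 1girl, solo, cat ears
    print(build_caption(example_groups))
    # Same tags in a random order, as seen during training.
    print(build_caption(example_groups, shuffle=True, rng=random.Random(0)))
```

Because the order was randomized during training, prompts do not need to follow the dataset order.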

	### Date:

	| tag | date |
	|---|---|
	| newest | 2022 to 2024 |
	| recent | 2019 to 2021 |
	| mid | 2015 to 2018 |
	| early | 2011 to 2014 |
	| oldest | 2005 to 2010 |
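
The date table maps directly to a lookup by year. A minimal sketch follows; the `date_tag` helper is hypothetical, and returning `None` for years outside 2005 to 2024 is an assumption, since the table does not cover them.

```python
# (first year, last year, tag), copied from the table above.
DATE_TAGS = [
    (2022, 2024, "newest"),
    (2019, 2021, "recent"),
    (2015, 2018, "mid"),
    (2011, 2014, "early"),
    (2005, 2010, "oldest"),
]


def date_tag(year):
    """Return the date tag for a year, or None if the table does not cover it."""
    for first, last, tag in DATE_TAGS:
        if first <= year <= last:
            return tag
    return None


if __name__ == "__main__":
    print(date_tag(2023))  # newest
    print(date_tag(2016))  # mid
    print(date_tag(1999))  # None
```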

	### Aesthetic Tags:
	Model used: shadowlilac/aesthetic-shadow-v2

	| score greater than | tag | count |
	|---|---|---|
	| 0.90 | extremely aesthetic | 125,451 |
	| 0.80 | very aesthetic | 887,382 |
	| 0.70 | aesthetic | 1,049,857 |
	| 0.50 | slightly aesthetic | 1,643,091 |
	| 0.40 | not displeasing | 569,543 |
	| 0.30 | not aesthetic | 445,188 |
	| 0.20 | slightly displeasing | 341,424 |
	| 0.10 | displeasing | 237,660 |
	| rest of them | very displeasing | 328,712 |

	### Quality Tags:
	Model used: https://huggingface.co/hakurei/waifu-diffusion-v1-4/blob/main/models/aes-B32-v0.pth

	| score greater than | tag | count |
	|---|---|---|
	| 0.980 | best quality | 1,270,447 |
	| 0.900 | high quality | 498,244 |
	| 0.750 | great quality | 351,006 |
	| 0.500 | medium quality | 366,448 |
	| 0.250 | normal quality | 368,380 |
	| 0.125 | bad quality | 279,050 |
	| 0.025 | low quality | 538,958 |
	| rest of them | worst quality | 1,955,966 |
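
Both the aesthetic and quality tables describe the same kind of mapping: a score is compared against descending thresholds, and the first threshold it exceeds decides the tag. The sketch below assumes a strict "greater than" comparison, as the column header suggests. The `score_to_tag` helper is hypothetical, and the scoring models themselves are not called here.

```python
# Thresholds copied from the tables above, highest first.
AESTHETIC_THRESHOLDS = [
    (0.90, "extremely aesthetic"),
    (0.80, "very aesthetic"),
    (0.70, "aesthetic"),
    (0.50, "slightly aesthetic"),
    (0.40, "not displeasing"),
    (0.30, "not aesthetic"),
    (0.20, "slightly displeasing"),
    (0.10, "displeasing"),
]
AESTHETIC_FALLBACK = "very displeasing"

QUALITY_THRESHOLDS = [
    (0.980, "best quality"),
    (0.900, "high quality"),
    (0.750, "great quality"),
    (0.500, "medium quality"),
    (0.250, "normal quality"),
    (0.125, "bad quality"),
    (0.025, "low quality"),
]
QUALITY_FALLBACK = "worst quality"


def score_to_tag(score, thresholds, fallback):
    """Return the tag of the first threshold the score strictly exceeds."""
    for threshold, tag in thresholds:
        if score > threshold:  # assumption: strict comparison
            return tag
    return fallback


if __name__ == "__main__":
    print(score_to_tag(0.95, AESTHETIC_THRESHOLDS, AESTHETIC_FALLBACK))  # extremely aesthetic
    print(score_to_tag(0.45, AESTHETIC_THRESHOLDS, AESTHETIC_FALLBACK))  # not displeasing
    print(score_to_tag(0.99, QUALITY_THRESHOLDS, QUALITY_FALLBACK))      # best quality
    print(score_to_tag(0.01, QUALITY_THRESHOLDS, QUALITY_FALLBACK))      # worst quality
```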

	### Rating Tags:

	| tag | count |
	|---|---|
	| general | 1,416,451 |
	| sensitive | 3,447,664 |
	| nsfw | 427,459 |
	| explicit nsfw | 336,925 |

	### Custom Tags:

	| dataset name | custom tag |
	|---|---|
	| image boards | date, |
	| text | The text says "text", |
	| characters | character, series |
	| pixiv | art by Display_Name, |
	| visual novel cg | Full_VN_Name (short_3_letter_name), visual novel cg, |
	| anime wallpaper | date, anime wallpaper, |


	## Limitations and Bias

	### Bias

	- This model is intended for anime illustrations.
	Realistic capabilities are not tested at all.

	### Limitations

	- Can fall back to a realistic style.
	Add the "realistic" tag to the negative prompt when this happens.
	- Eyes and hands can be poorly rendered in far shots.


	## License

	SoteDiffusion models fall under the [Fair AI Public License 1.0-SD](https://freedevproject.org/faipl-1.0-sd/), which is compatible with the Stable Diffusion models’ license. Key points:

	1. Modification Sharing: If you modify SoteDiffusion models, you must share both your changes and the original license.
	2. Source Code Accessibility: If your modified version is network-accessible, provide a way (like a download link) for others to get the source code. This applies to derived models too.
	3. Distribution Terms: Any distribution must be under this license or another with similar rules.
	4. Compliance: Non-compliance must be fixed within 30 days to avoid license termination, emphasizing transparency and adherence to open-source values.

	Note: Anything not covered by the Fair AI license is inherited from the Stability AI Non-Commercial license, which is named LICENSE_INHERIT.