codehappy committed on
Commit d8b9071 · verified · 1 Parent(s): 5f23ac4

README -- more training info


additional info about Puzzle Box training and prompting

Files changed (1)
  1. README.md +33 -4
README.md CHANGED
@@ -11,11 +11,38 @@ A latent diffusion model (LDM) geared toward illustration, style composability,
  * Architecture: SD XL (base model is v1.0)
  * Training procedure: U-Net fully unfrozen, all-parameter continued pretraining at LR between 3e-8 and 3e-7 for 14,290,000 steps (at epoch 14, batch size 4).

- Trained on the Puzzle Box dataset, a large collection of permissively licensed images from the public Internet (or generated by previous Puzzle Box models). Each image has from 3 to 15 different captions which are used interchangeably during training. There are 8.2 million images and 54 million captions in the dataset.

- The model is substantially better than the base SDXL model at producing images that look like film photographs, any kind of cartoon art, or old artist styles. It's also heavily tuned toward personal aesthetic preference.

- Prompt adherence is unusually good; aesthetics are improved by human evaluation for generations between 1/4 and 1/2 megapixel in size. CFG scales between 2 and 7 can work well with Puzzle Box; experimenting with resolution or scale for your prompts is encouraged.

  Model checkpoints currently available:

@@ -23,4 +50,6 @@ Model checkpoints currently available:
  - from epoch 13, **11930k** training steps, 15 August 2024
  - from epoch 12, **10570k** training steps, 21 June 2024

- This model has been trained carefully on top of the SDXL base, with a widely diverse training set at a low learning rate. Accordingly, it should *merge* well with most other LDMs built off the SDXL base. (Merging LDMs built off the same base is a form of transfer learning; you can add Puzzle Box concepts to other SDXL models this way. Spherical interpolation is best.) The captions used in training are also varied: you can prompt Puzzle Box XL using English sentences, or booru-style with lists of tags. (If you prompt booru-style, don't use underscores in your tags; replace those with spaces. Tags may be separated by any combination of whitespace or by commas.)
 
  * Architecture: SD XL (base model is v1.0)
  * Training procedure: U-Net fully unfrozen, all-parameter continued pretraining at LR between 3e-8 and 3e-7 for 14,290,000 steps (at epoch 14, batch size 4).

+ Trained on the Puzzle Box dataset, a large collection of permissively licensed images from the public Internet (or generated by previous Puzzle Box models). Each image has from 3 to 15 different captions which are used interchangeably during training. There are 8.2 million images and 54 million captions in the dataset.

+ The model is substantially better than the base SDXL model at producing images that look like film photographs, any kind of cartoon art, or old artist styles. It's also heavily tuned toward personal aesthetic preference.

+ **Prompting:** The captions used in training are varied: you can prompt Puzzle Box XL using English sentences, or booru-style with lists of tags. (If you prompt booru-style, don't use underscores in your tags; replace those with spaces. Tags may be separated by any combination of whitespace or by commas.)
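The underscore-and-separator rules above can be sketched as a small helper. (`booru_to_prompt` is a hypothetical name for illustration, not part of any Puzzle Box tooling.)

```python
import re

def booru_to_prompt(tags: str) -> str:
    """Convert a booru-style tag string into a Puzzle Box XL prompt:
    underscores become spaces, and tags are re-joined with commas so
    multi-word tags stay unambiguous."""
    # Tags in the input may be separated by commas and/or whitespace.
    raw = [t for t in re.split(r"[,\s]+", tags) if t]
    # Replace underscores inside each tag with spaces, per the README.
    return ", ".join(t.replace("_", " ") for t in raw)

print(booru_to_prompt("film_photograph pop_art, pin-up"))
# prints: film photograph, pop art, pin-up
```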
+
+ Vitamin phrases: *top quartile*, *top decile* (there are also anti-vitamins, *bottom quartile* and *bottom decile*). These are the primary aesthetic labels (see below).
+
+ Prompt adherence is unusually good; aesthetics are improved by human evaluation for generations between 1/4 and 1/2 megapixel in size for epochs 12-14, and 1/4 to 2 megapixels for epoch 15. CFG scales between 2 and 7 can work well with Puzzle Box; experimenting with resolution or scale for your prompts is encouraged.
+
+ **Captioning:** About 1.4 million of the captions in the dataset are human-written. The remainder come from a variety of ML models, either vision transformers or classifiers. Models used in captioning the Puzzle Box dataset include: Qwen 2 VL 72b, BLIP 2 OPT-6.5B COCO, Llava 1.5, MiniCPM 2.6, bakllava, Moondream, DeepSeek Janus 7b, Mistral Pixtral 12b, CapPa, and wd-eva02-large-tagger-v3. Only open-weights models were used.
+
+ In addition to the human- or machine-generated main caption, there are a large number of additional human-provided tags referring to style ("pointillism", "caricature", "Winsor McCay"), genre ("pop art", "advertising", "pixel art"), source ("wikiart", "library of congress"), or image content ("fluid expression", "pin-up", "squash and stretch").
+
+ **Aesthetic labelling:** All images in the Puzzle Box dataset have been scored by multiple IQA models. There are also over 700,000 human paired image preferences. This data is combined to label especially high- or low-aesthetic images. Aesthetic breakpoints are chosen on a per-style/genre tag basis (the threshold for "pixel art" is different from that for "classical oil painting").
+
+ Training is broken into three phases: in the first phase, all images (regardless of aesthetic score) are used in training. In the second phase, bottom quartile-labelled images are removed from training. In the final phase, *only* images tagged as top-quartile aesthetics are trained.
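The three-phase schedule can be sketched as a filter over a labelled dataset. (The `aesthetic` field and `phase_filter` helper are illustrative, not the actual training code.)

```python
def phase_filter(images, phase):
    """Select training images per the three-phase schedule:
    phase 1 keeps everything, phase 2 drops bottom-quartile images,
    phase 3 keeps only top-quartile images. Each image is a dict
    whose 'aesthetic' label is 'top quartile', 'bottom quartile', or None."""
    if phase == 1:
        return list(images)
    if phase == 2:
        return [im for im in images if im.get("aesthetic") != "bottom quartile"]
    return [im for im in images if im.get("aesthetic") == "top quartile"]

dataset = [
    {"id": 1, "aesthetic": "top quartile"},
    {"id": 2, "aesthetic": None},
    {"id": 3, "aesthetic": "bottom quartile"},
]
```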
+
+ **Other nifty tricks used:** Some less common techniques used in training Puzzle Box XL include:
+
+ - *Attention masks*: constructed for images to exclude the background or portions of the image not mentioned in captions or important to image content; only blocks that are not masked off receive updates.
+ - *Lores-to-hires*: I save compute by training at lower resolution (512px) until the model learns new concepts satisfactorily, then training at higher resolution (768px). This allows later checkpoints to generate 1+ megapixel images without tiling or stuttering, while greatly speeding up earlier stages of training.
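The attention-mask idea can be illustrated with a masked loss over flat lists. (A toy sketch under my own assumptions; the actual training loop would apply such a mask to latent-space tensors, and the README does not specify the exact loss.)

```python
def masked_mse(pred, target, mask):
    """Mean squared error restricted to unmasked positions.
    mask[i] == 1 keeps position i in the loss; 0 excludes it
    (e.g. background regions not described by the caption)."""
    num = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    den = sum(mask)
    # Average only over positions that contribute to the update.
    return num / den if den else 0.0
```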

  Model checkpoints currently available:

  - from epoch 13, **11930k** training steps, 15 August 2024
  - from epoch 12, **10570k** training steps, 21 June 2024

+ This model has been trained carefully on top of the SDXL base, with a widely diverse training set at a low learning rate. Accordingly, it should *merge* well with most other LDMs built off the SDXL base. (Merging LDMs built off the same base is a form of transfer learning; you can add Puzzle Box concepts to other SDXL models this way. Spherical interpolation is best.)
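The spherical interpolation recommended for merging can be sketched as below. In practice you would apply this per-tensor across the two models' weights; plain Python lists are used here just to show the math.

```python
import math

def slerp(v0, v1, t):
    """Spherical linear interpolation between two flat parameter lists,
    t in [0, 1]. Falls back to linear interpolation when the vectors
    are nearly parallel (sin of the angle approaches zero)."""
    dot = sum(a * b for a, b in zip(v0, v1))
    n0 = math.sqrt(sum(a * a for a in v0))
    n1 = math.sqrt(sum(b * b for b in v1))
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_omega = max(-1.0, min(1.0, dot / (n0 * n1)))
    omega = math.acos(cos_omega)
    if abs(math.sin(omega)) < 1e-6:  # nearly parallel: plain lerp
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]
```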