codehappy committed
Commit d3e7b63 · verified · 1 Parent(s): b8a3bc4

update readme for epoch 17

Files changed (1)
  1. README.md +19 -3
README.md CHANGED
@@ -6,10 +6,10 @@ base_model:
 
 **Puzzle Box XL**
 
-A latent diffusion model (LDM) geared toward illustration, style composability, and sample variety. Addresses a few deficiencies with the SDXL base model.
+A latent diffusion model (LDM) geared toward illustration, style composability, and sample variety. Addresses a few deficiencies of the SDXL base model; it feels more
+like an SD 1.x with better resolution and much better prompt adherence.
 
 * Architecture: SD XL (base model is v1.0)
-* Training procedure: U-Net fully unfrozen, all-parameter continued pretraining at LR between 3e-8 and 3e-7 for 16,950,000 steps (at epoch 16, batch size 4).
+* Training procedure: U-Net fully unfrozen, all-parameter continued pretraining at LR between 3e-8 and 3e-7 for 18,000,000 steps (at epoch 17, batch size 4).
 
 Trained on the Puzzle Box dataset, a large collection of permissively licensed images from the public Internet (or generated by previous Puzzle Box models). Each image
 has from 3 to 17 different captions, which are used interchangeably during training. There are 9.3 million images and 62 million captions in the dataset.
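The training-procedure bullet above is the whole recipe the card gives. As a minimal sketch, assuming the `diffusers`/`torch` stack, full-parameter continued pretraining of the SDXL U-Net would look roughly like this; the optimizer choice and everything else not stated in the bullet are illustrative assumptions:

```python
import torch
from diffusers import UNet2DConditionModel

# Start from the SD XL 1.0 base U-Net, as the card states.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
unet.train()
unet.requires_grad_(True)  # "U-Net fully unfrozen": every parameter trains

# LR is swept between 3e-8 and 3e-7 per the card; the schedule and the
# AdamW choice are assumptions, not documented by the author.
optimizer = torch.optim.AdamW(unet.parameters(), lr=3e-7)
# ...training loop over the Puzzle Box dataset at batch size 4 goes here.
```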
@@ -27,7 +27,8 @@ megapixels for epoch 15+. CFG scales between 2 and 7 can work well with Puzzle B
 
 **Captioning:** About 1.4 million of the captions in the dataset are human-written. The remainder come from a variety of ML models, either vision transformers or
 classifiers. Models used in captioning the Puzzle Box dataset include: Qwen 2 VL 72b, BLIP 2 OPT-6.5B COCO, Llava 1.5, MiniCPM 2.6, bakllava, Moondream, DeepSeek Janus 7b,
-Mistral Pixtral 12b, CapPa, Gemma 3 27b, JoyCaption, and wd-eva02-large-tagger-v3. Only open-weights models were used.
+Mistral Pixtral 12b, CapPa, Gemma 3 27b, JoyCaption, CLIP Interrogator 0.6.0, and wd-eva02-large-tagger-v3. DeepSeek v3 is used to create detailed consensus captions from
+the others. Only open-weights models were used.
 
 In addition to the human- or machine-generated main caption, there are a large number of additional human-provided tags referring to style ("pointillism", "caricature", "Winsor McKay"),
 genre ("pop art", "advertising", "pixel art"), source ("wikiart", "library of congress"), or image content ("fluid expression", "pin-up", "squash and stretch").
@@ -46,12 +47,27 @@ This allows later checkpoints to generate 1+ megapixel images without tiling or
 
 Model checkpoints currently available:
 
+- from epoch 17, **18000k** training steps, 06 July 2025
 - from epoch 16, **16950k** training steps, 05 May 2025
 - from epoch 15, **15800k** training steps, 08 March 2025
 - from epoch 14, **14290k** training steps, 02 December 2024
 - from epoch 13, **11930k** training steps, 15 August 2024
 - from epoch 12, **10570k** training steps, 21 June 2024
 
+*Which checkpoint is best?* Later checkpoints have better aesthetics and better prompt adherence at higher resolution and lower CFG scale, but they are also more
+'opinionated': longer conditioning may be needed to get generations the way you like them. In particular, the latest checkpoints are trained on the most consensus
+captions, which are highly accurate but also quite long. Earlier checkpoints may give greater sample variety on short conditioning, which (at lower resolution) can
+make them useful drafting models for searching out good noise seeds and the like. Earlier checkpoints may also be better for merging with other LDMs based on SD XL.
+
 This model has been trained carefully on top of the SDXL base, with a widely diverse training set at low learning rate. Accordingly, it should *merge* well with most other
 LDMs built off SDXL base. (Merging LDMs built off the same base is a form of transfer learning; you can add Puzzle Box concepts to other SDXL models this way. Spherical
 interpolation is best.)
+
+The U-Net self-attention layers are the layers most modified by the continued pretrain; comparing those layers to SD XL 1.0, the correlation is:
+
+| Epoch | Date       | R-squared |
+| ----- | ---------- | --------- |
+| 17    | 2025-07-06 | 97.705%   |
+| 16    | 2025-05-05 | 97.917%   |
+| 15    | 2025-03-08 | 98.312%   |
+| 14    | 2024-12-02 | 98.573%   |
+| 13    | 2024-08-05 | 98.876%   |
+| 12    | 2024-06-21 | 99.167%   |
+
+(For reference, Pony-family models, which are also based on SD XL 1.0 but are trained at much higher LR, paving over the base weights, are around 40%; Playground-derived
+models, which are trained on the SD XL architecture from scratch, are below 25%.)
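The card doesn't show how the R-squared figures in the new table were computed. A sketch of one plausible measurement, assuming squared Pearson correlation over the U-Net self-attention weights (the `attn1` blocks in `diffusers` key naming), with placeholder file names:

```python
import torch
from safetensors.torch import load_file

base = load_file("sdxl-base-1.0-unet.safetensors")         # placeholder paths
tuned = load_file("puzzle-box-xl-epoch17-unet.safetensors")

# "attn1" marks self-attention blocks in the SDXL U-Net key layout;
# "attn2" would be cross-attention.
keys = [k for k in base if "attn1" in k and k in tuned]
x = torch.cat([base[k].flatten().float() for k in keys])
y = torch.cat([tuned[k].flatten().float() for k in keys])

# Pearson r of the centered weight vectors, squared and shown as a percent.
xc, yc = x - x.mean(), y - y.mean()
r = torch.dot(xc, yc) / (xc.norm() * yc.norm())
print(f"R-squared: {100 * r ** 2:.3f}%")
```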
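Since the card recommends spherical interpolation for merging, here is a minimal per-tensor slerp sketch over two SDXL-based checkpoints; the file names are placeholders and the helper is illustrative, not the author's tooling:

```python
import torch
from safetensors.torch import load_file, save_file

def slerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical interpolation between two tensors, treated as flat vectors."""
    af, bf = a.flatten().float(), b.flatten().float()
    cos = torch.dot(af, bf) / (af.norm() * bf.norm() + 1e-12)
    omega = torch.arccos(cos.clamp(-1.0, 1.0))
    if omega.abs() < 1e-4:                 # nearly parallel: fall back to lerp
        return ((1 - t) * a.float() + t * b.float()).to(a.dtype)
    s = torch.sin(omega)
    out = (torch.sin((1 - t) * omega) / s) * af + (torch.sin(t * omega) / s) * bf
    return out.reshape(a.shape).to(a.dtype)

m1 = load_file("puzzle-box-xl.safetensors")       # placeholder paths
m2 = load_file("other-sdxl-model.safetensors")

# Merge the tensors the two checkpoints share; t=0.5 weights them equally.
merged = {k: slerp(m1[k], m2[k], 0.5) for k in m1 if k in m2}
save_file(merged, "merged.safetensors")
```

Unlike a straight weighted average, slerp follows the arc between the two weight vectors rather than cutting across it, which is one reason it is often preferred when merging checkpoints that have drifted apart.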