rwightman committed
Commit 96d934f · Parent: ef4a41a

Update README.md

Files changed (1):
  README.md +22 -11
README.md CHANGED
@@ -22,31 +22,42 @@ license: mit

A series of CLIP ConvNeXt-XXLarge (a custom `timm` ConvNeXt size) models trained on LAION-2B (english), a subset of [LAION-5B](https://arxiv.org/abs/2210.08402), using [OpenCLIP](https://github.com/mlfoundations/open_clip).

+ | Model | Dataset | Resolution | AugReg | Top-1 ImageNet Zero-Shot (%) |
+ | ----- | ------- | ---------- | ------------ | --------- |
+ | [convnext_xxlarge.laion2b_s34b_b82k-augreg](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg) | LAION-2B | 256x256 | RRC (0.33, 1.0), RE (0.35), SD (0.1), D(0.1) | 79.1 |
+ | [convnext_xxlarge.laion2b_s34b_b82k-augreg-rewind](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind) | LAION-2B | 256x256 | RRC (0.3, 1.0), RE (0.4), SD (0.1) | 79.3 |
+ | [convnext_xxlarge.laion2b_s34b_b82k-augreg-soup](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup) | LAION-2B | 256x256 | N/A | 79.4 |
+ RRC = Random Resize Crop (crop pcts), RE = Random Erasing (prob), SD = Stochastic Depth (prob) -- image tower only
+
+ The core training run was performed in pieces over a period of ~ 2 months. The global batch size for the core run was 81920. The last ~10% of training was re-done at a 95744 global batch size w/ higher LR and aug than original finish. The two were averaged together in a 'soup'. See more details in [Training Details](#training-details).
+
Goals:
* Push the size of largest convolutional CLIP image tower into the performance range of ViT-g to ViT-G w/ improved image size scaling for downstream use.

Firsts:
- * Largest released ConvNeXt model pretrained (847M params,)
+ * Largest released ConvNeXt model pretrained (847M params w/ 198 GMAC and 125 MActs @ 256x256 for image)
* A non-ViT image tower CLIP model (with no previous image tower pretrain) achieving > 79% ImageNet top-1 zero-shot

The models utilize:
* the [timm](https://github.com/rwightman/pytorch-image-models) ConvNeXt-XXLarge model (`convnext_xxlarge`) as the image tower
* a standard projection at end of image tower
* a text tower with same size (with 1024, heads 16, depth 24) as ViT-H-14 and ViT-g-14 models
-
- The models are trained at 256x256 image resolution.

- At 256x256, the ConvNext-XXLarge sits just above a ViT-H-14 CLIP configuration in FLOPS and params while being lower in activation counts. It is well under both g-14 and G-14 while being between them in capabilities.

- | Model | Dataset | Resolution | AugReg | Top-1 ImageNet Zero-Shot (%) |
- | ----- | ------- | ---------- | ------------ | --------- |
- | [convnext_xxlarge.laion2b_s34b_b82k-augreg](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg) | LAION-2B | 256x256 | RRC (0.33, 1.0), RE (0.35), SD (0.1), D(0.1) | 79.1 |
- | [convnext_xxlarge.laion2b_s34b_b82k-augreg-rewind](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind) | LAION-2B | 256x256 | RRC (0.3, 1.0), RE (0.4), SD (0.1) | 79.3 |
- | [convnext_xxlarge.laion2b_s34b_b82k-augreg-soup](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup) | LAION-2B | 256x256 | N/A | 79.4 |
+ The models are trained at 256x256 image resolution. The size of the combined image + text CLIP model is 1.2B params w/ 222 GMAC and 146 MActs. At 256x256, the ConvNext-XXLarge sits just above a ViT-H-14 CLIP configuration in FLOPS and params while being lower in activation counts. It is well under both g-14 and G-14 while being between them in capabilities.

- RRC = Random Resize Crop (crop pcts), RE = Random Erasing (prob), SD = Stochastic Depth (prob) -- image tower only
+ |model |image_size|embed_dim|gmacs |macts |mparams|image_gmacs|image_macts|image_mparams|text_gmacs|text_macts|text_mparams|
+ |--------------------------|----------|---------|------|------|-------|-----------|-----------|-------------|----------|----------|------------|
+ |ViT-H-16 |224 |1024 |150.96|122.01|986.26 |127.4 |100.81 |632.23 |23.57 |21.2 |354.03 |
+ |ViT-H-14 |224 |1024 |190.97|160.61|986.11 |167.4 |139.41 |632.08 |23.57 |21.2 |354.03 |
+ |ViT-L-14-336 |336 |768 |197.76|278.19|427.94 |191.1 |270.24 |304.29 |6.66 |7.95 |123.65 |
+ |convnext_xxlarge |256 |1024 |221.66|145.66|1200.58|198.09 |124.45 |846.54 |23.57 |21.2 |354.03 |
+ |RN50x64 |448 |1024 |276.8 |249.73|623.26 |265.02 |239.13 |420.38 |11.78 |10.6 |202.88 |
+ |ViT-g-14 |224 |1024 |290.74|213.84|1366.68|267.18 |192.64 |1012.65 |23.57 |21.2 |354.03 |
+ |convnext_xxlarge_320 |320 |1024 |333.08|215.66|1200.58|309.52 |194.46 |846.54 |23.57 |21.2 |354.03 |
+ |ViT-H-14-336 |336 |1024 |414.53|428.74|986.52 |390.97 |407.54 |632.49 |23.57 |21.2 |354.03 |
+ |ViT-bigG-14 |224 |1280 |532.92|310.71|2539.57|483.96 |275.37 |1844.91 |48.96 |35.34 |694.66 |

- The core training run was performed in pieces over a period of ~ 2 months. The global batch size for the core run was 81920. The last ~10% of training was re-done at a 95744 global batch size w/ higher LR and aug than original finish. The two were averaged together in a 'soup'. See more details in [Training Details](#training-details).

  Model training done by Ross Wightman across both the [stability.ai](https://stability.ai/) cluster and the [JUWELS Booster](https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html) supercomputer. See acknowledgements below.
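For context on how the checkpoints described in the diff above are typically consumed, here is a minimal zero-shot usage sketch with OpenCLIP. It is not part of the commit: the `laion/` repo prefix in the `hf-hub:` id and the image path are assumptions, and any of the three checkpoints from the results table can be substituted.

```python
# Minimal zero-shot sketch with OpenCLIP (pip install open_clip_torch torch pillow).
# The hf-hub repo id is assumed from the checkpoint names in the table above;
# 'example.jpg' is a placeholder image path.
import torch
from PIL import Image
import open_clip

model_id = 'hf-hub:laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup'
model, _, preprocess = open_clip.create_model_and_transforms(model_id)
tokenizer = open_clip.get_tokenizer(model_id)
model.eval()

image = preprocess(Image.open('example.jpg')).unsqueeze(0)  # preprocessed to the model's 256x256 input
text = tokenizer(['a diagram', 'a dog', 'a cat'])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # one probability per text prompt
```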
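The training note added in this commit says the original finish and the higher-LR "rewind" finish were averaged into a soup. A minimal sketch of that kind of checkpoint averaging follows, assuming two locally downloaded checkpoints with matching keys; the file names are hypothetical and real checkpoints may nest weights under a 'state_dict' entry.

```python
# Minimal checkpoint "soup" sketch: uniform average of two compatible state dicts.
# 'finish.pt' and 'rewind.pt' are hypothetical local paths to the two checkpoints.
import torch

def load_sd(path):
    ckpt = torch.load(path, map_location='cpu')
    # some checkpoints nest the weights under a 'state_dict' key
    return ckpt.get('state_dict', ckpt) if isinstance(ckpt, dict) else ckpt

sd_a = load_sd('finish.pt')
sd_b = load_sd('rewind.pt')

souped = {}
for k, v in sd_a.items():
    if torch.is_floating_point(v):
        souped[k] = (v + sd_b[k]) / 2  # average learnable weights
    else:
        souped[k] = v                  # keep integer buffers unchanged

torch.save(souped, 'soup.pt')
```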