Update README.md
|

A series of CLIP ConvNeXt-XXLarge (a custom `timm` ConvNeXt size) models trained on LAION-2B (English), a subset of [LAION-5B](https://arxiv.org/abs/2210.08402), using [OpenCLIP](https://github.com/mlfoundations/open_clip).
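
The checkpoints can be loaded directly with OpenCLIP. Below is a minimal zero-shot classification sketch; the `hf-hub:laion/...` identifier, the image path, and the label prompts are assumptions based on the checkpoint names linked in the table below, so substitute the repo you actually want.

```python
import torch
import open_clip
from PIL import Image

# Hub id is an assumption based on the "-soup" checkpoint name in the table below.
MODEL_ID = 'hf-hub:laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup'
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)
model.eval()

image = preprocess(Image.open('example.jpg')).unsqueeze(0)  # any local image
text = tokenizer(['a diagram', 'a dog', 'a cat'])           # example label prompts

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # per-label probabilities for the example image
```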

| Model | Dataset | Resolution | AugReg | Top-1 ImageNet Zero-Shot (%) |
| ----- | ------- | ---------- | ------ | ---------------------------- |
| [convnext_xxlarge.laion2b_s34b_b82k-augreg](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg) | LAION-2B | 256x256 | RRC (0.33, 1.0), RE (0.35), SD (0.1), D (0.1) | 79.1 |
| [convnext_xxlarge.laion2b_s34b_b82k-augreg-rewind](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind) | LAION-2B | 256x256 | RRC (0.3, 1.0), RE (0.4), SD (0.1) | 79.3 |
| [convnext_xxlarge.laion2b_s34b_b82k-augreg-soup](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup) | LAION-2B | 256x256 | N/A | 79.4 |

RRC = Random Resize Crop (crop pcts), RE = Random Erasing (prob), SD = Stochastic Depth (prob), D = Dropout (prob) -- image tower only
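
As an illustration only, here is roughly how those knobs map onto common `timm`/`torchvision` arguments, using the `-augreg` row's values; the exact OpenCLIP training flags are not reproduced here and may differ.

```python
import timm
from timm.data.random_erasing import RandomErasing
from torchvision import transforms

# Rough, illustrative mapping of the "-augreg" row; not the exact training config.
train_crop = transforms.RandomResizedCrop(256, scale=(0.33, 1.0))  # RRC (0.33, 1.0): crop pct range
random_erase = RandomErasing(probability=0.35)                      # RE (0.35): applied to normalized tensors

# Stochastic depth and dropout are regularizers set on the image tower at creation time.
image_tower = timm.create_model(
    'convnext_xxlarge',
    num_classes=0,        # feature extractor, no classifier head
    drop_path_rate=0.1,   # SD (0.1): stochastic depth
    drop_rate=0.1,        # D (0.1): dropout
)
```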

The core training run was performed in pieces over roughly two months. The global batch size for the core run was 81920. The last ~10% of training was redone at a global batch size of 95744 with a higher LR and stronger augmentation than the original finish. The two finishes were then averaged together in a 'soup'. See more details in [Training Details](#training-details).
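
The 'soup' step is a plain weight average of the two finished checkpoints. A minimal sketch, assuming hypothetical file names and that both checkpoints share an identical key layout:

```python
import torch

def load_state_dict(path):
    # OpenCLIP training checkpoints typically nest weights under 'state_dict';
    # otherwise treat the file as a plain state dict.
    ckpt = torch.load(path, map_location='cpu')
    return ckpt.get('state_dict', ckpt) if isinstance(ckpt, dict) else ckpt

# Hypothetical file names for the original finish and the rewound finish.
sd_a = load_state_dict('convnext_xxlarge_augreg_finish.pt')
sd_b = load_state_dict('convnext_xxlarge_augreg_rewind_finish.pt')

# Average every tensor; identical keys and shapes assumed.
soup = {k: (sd_a[k].float() + sd_b[k].float()) / 2 for k in sd_a}
torch.save(soup, 'convnext_xxlarge_augreg_soup.pt')
```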

Goals:
* Push the size of the largest convolutional CLIP image tower into the performance range of ViT-g to ViT-G, with improved image-size scaling for downstream use.

Firsts:
* Largest released pretrained ConvNeXt model (847M params, with 198 GMACs and 125 MActs @ 256x256 for the image tower)
* A non-ViT image tower CLIP model (with no prior image tower pretraining) achieving > 79% ImageNet top-1 zero-shot accuracy

The models utilize:
* the [timm](https://github.com/rwightman/pytorch-image-models) ConvNeXt-XXLarge model (`convnext_xxlarge`) as the image tower
* a standard projection at the end of the image tower
* a text tower of the same size (width 1024, 16 heads, depth 24) as the ViT-H-14 and ViT-g-14 models

The models are trained at 256x256 image resolution. The combined image + text CLIP model has 1.2B params, with 222 GMACs and 146 MActs. At 256x256, the ConvNeXt-XXLarge sits just above a ViT-H-14 CLIP configuration in FLOPs and params while being lower in activation counts. It is well under both g-14 and G-14 while sitting between them in capabilities.
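
The per-tower numbers in the table below can be approximately reproduced by building the architecture with OpenCLIP and running a profiler over it. The sketch here uses fvcore, which is an assumption; the tooling behind the table is not stated in this card.

```python
import torch
import open_clip
from fvcore.nn import ActivationCountAnalysis, FlopCountAnalysis

# Build the architecture (random weights are fine for counting) and profile
# the image tower at 256x256. fvcore reports multiply-adds and activations,
# i.e. the same units as the GMAC / MAct columns below.
model = open_clip.create_model('convnext_xxlarge', pretrained=None)
model.eval()

image_params = sum(p.numel() for p in model.visual.parameters())
total_params = sum(p.numel() for p in model.parameters())
print(f'image tower params: {image_params / 1e6:.2f}M / total: {total_params / 1e6:.2f}M')

x = torch.randn(1, 3, 256, 256)
print(f'image GMACs: {FlopCountAnalysis(model.visual, x).total() / 1e9:.2f}')
print(f'image MActs: {ActivationCountAnalysis(model.visual, x).total() / 1e6:.2f}')
```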

|model |image_size|embed_dim|gmacs |macts |mparams|image_gmacs|image_macts|image_mparams|text_gmacs|text_macts|text_mparams|
|--------------------------|----------|---------|------|------|-------|-----------|-----------|-------------|----------|----------|------------|
|ViT-H-16 |224 |1024 |150.96|122.01|986.26 |127.4 |100.81 |632.23 |23.57 |21.2 |354.03 |
|ViT-H-14 |224 |1024 |190.97|160.61|986.11 |167.4 |139.41 |632.08 |23.57 |21.2 |354.03 |
|ViT-L-14-336 |336 |768 |197.76|278.19|427.94 |191.1 |270.24 |304.29 |6.66 |7.95 |123.65 |
|convnext_xxlarge |256 |1024 |221.66|145.66|1200.58|198.09 |124.45 |846.54 |23.57 |21.2 |354.03 |
|RN50x64 |448 |1024 |276.8 |249.73|623.26 |265.02 |239.13 |420.38 |11.78 |10.6 |202.88 |
|ViT-g-14 |224 |1024 |290.74|213.84|1366.68|267.18 |192.64 |1012.65 |23.57 |21.2 |354.03 |
|convnext_xxlarge_320 |320 |1024 |333.08|215.66|1200.58|309.52 |194.46 |846.54 |23.57 |21.2 |354.03 |
|ViT-H-14-336 |336 |1024 |414.53|428.74|986.52 |390.97 |407.54 |632.49 |23.57 |21.2 |354.03 |
|ViT-bigG-14 |224 |1280 |532.92|310.71|2539.57|483.96 |275.37 |1844.91 |48.96 |35.34 |694.66 |

Model training was done by Ross Wightman across both the [stability.ai](https://stability.ai/) cluster and the [JUWELS Booster](https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html) supercomputer. See acknowledgements below.