Update README.md
|

A series of CLIP ConvNeXt-XXLarge (a custom `timm` ConvNeXt size) models trained on LAION-2B (English), a subset of [LAION-5B](https://arxiv.org/abs/2210.08402), using [OpenCLIP](https://github.com/mlfoundations/open_clip).
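
The checkpoints can be loaded directly with OpenCLIP. Below is a minimal zero-shot classification sketch; the `hf-hub:laion/...` identifier, the image path, and the label prompts are assumptions based on the checkpoint names linked in the table below, so substitute the repo you actually want.

```python
import torch
import open_clip
from PIL import Image

# Hub id is an assumption based on the "-soup" checkpoint name in the table below.
MODEL_ID = 'hf-hub:laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup'
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)
model.eval()

image = preprocess(Image.open('example.jpg')).unsqueeze(0)  # any local image
text = tokenizer(['a diagram', 'a dog', 'a cat'])           # example label prompts

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # per-label probabilities for the example image
```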

| Model | Dataset | Resolution | AugReg | Top-1 ImageNet Zero-Shot (%) |
| ----- | ------- | ---------- | ------ | ---------------------------- |
| [convnext_xxlarge.laion2b_s34b_b82k-augreg](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg) | LAION-2B | 256x256 | RRC (0.33, 1.0), RE (0.35), SD (0.1), D (0.1) | 79.1 |
| [convnext_xxlarge.laion2b_s34b_b82k-augreg-rewind](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind) | LAION-2B | 256x256 | RRC (0.3, 1.0), RE (0.4), SD (0.1) | 79.3 |
| [convnext_xxlarge.laion2b_s34b_b82k-augreg-soup](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup) | LAION-2B | 256x256 | N/A | 79.4 |

RRC = Random Resize Crop (crop pcts), RE = Random Erasing (prob), SD = Stochastic Depth (prob), D = Dropout (prob) -- image tower only
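
As an illustration only, here is roughly how those knobs map onto common `timm`/`torchvision` arguments, using the `-augreg` row's values; the exact OpenCLIP training flags are not reproduced here and may differ.

```python
import timm
from timm.data.random_erasing import RandomErasing
from torchvision import transforms

# Rough, illustrative mapping of the "-augreg" row; not the exact training config.
train_crop = transforms.RandomResizedCrop(256, scale=(0.33, 1.0))  # RRC (0.33, 1.0): crop pct range
random_erase = RandomErasing(probability=0.35)                      # RE (0.35): applied to normalized tensors

# Stochastic depth and dropout are regularizers set on the image tower at creation time.
image_tower = timm.create_model(
    'convnext_xxlarge',
    num_classes=0,        # feature extractor, no classifier head
    drop_path_rate=0.1,   # SD (0.1): stochastic depth
    drop_rate=0.1,        # D (0.1): dropout
)
```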

The core training run was performed in pieces over roughly two months. The global batch size for the core run was 81920. The last ~10% of training was redone at a global batch size of 95744 with a higher LR and stronger augmentation than the original finish. The two finishes were then averaged together in a 'soup'. See more details in [Training Details](#training-details).
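
The 'soup' step is a plain weight average of the two finished checkpoints. A minimal sketch, assuming hypothetical file names and that both checkpoints share an identical key layout:

```python
import torch

def load_state_dict(path):
    # OpenCLIP training checkpoints typically nest weights under 'state_dict';
    # otherwise treat the file as a plain state dict.
    ckpt = torch.load(path, map_location='cpu')
    return ckpt.get('state_dict', ckpt) if isinstance(ckpt, dict) else ckpt

# Hypothetical file names for the original finish and the rewound finish.
sd_a = load_state_dict('convnext_xxlarge_augreg_finish.pt')
sd_b = load_state_dict('convnext_xxlarge_augreg_rewind_finish.pt')

# Average every tensor; identical keys and shapes assumed.
soup = {k: (sd_a[k].float() + sd_b[k].float()) / 2 for k in sd_a}
torch.save(soup, 'convnext_xxlarge_augreg_soup.pt')
```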

Goals:
* Push the size of the largest convolutional CLIP image tower into the performance range of ViT-g to ViT-G, with improved image-size scaling for downstream use.

Firsts:
* Largest released pretrained ConvNeXt model (847M params, with 198 GMACs and 125 MActs @ 256x256 for the image tower)
* A non-ViT image tower CLIP model (with no prior image tower pretraining) achieving > 79% ImageNet top-1 zero-shot accuracy

The models utilize:
* the [timm](https://github.com/rwightman/pytorch-image-models) ConvNeXt-XXLarge model (`convnext_xxlarge`) as the image tower
* a standard projection at the end of the image tower
* a text tower of the same size (width 1024, 16 heads, depth 24) as the ViT-H-14 and ViT-g-14 models

The models are trained at 256x256 image resolution. The combined image + text CLIP model has 1.2B params, with 222 GMACs and 146 MActs. At 256x256, the ConvNeXt-XXLarge sits just above a ViT-H-14 CLIP configuration in FLOPs and params while being lower in activation counts. It is well under both g-14 and G-14 while sitting between them in capabilities.
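
The per-tower numbers in the table below can be approximately reproduced by building the architecture with OpenCLIP and running a profiler over it. The sketch here uses fvcore, which is an assumption; the tooling behind the table is not stated in this card.

```python
import torch
import open_clip
from fvcore.nn import ActivationCountAnalysis, FlopCountAnalysis

# Build the architecture (random weights are fine for counting) and profile
# the image tower at 256x256. fvcore reports multiply-adds and activations,
# i.e. the same units as the GMAC / MAct columns below.
model = open_clip.create_model('convnext_xxlarge', pretrained=None)
model.eval()

image_params = sum(p.numel() for p in model.visual.parameters())
total_params = sum(p.numel() for p in model.parameters())
print(f'image tower params: {image_params / 1e6:.2f}M / total: {total_params / 1e6:.2f}M')

x = torch.randn(1, 3, 256, 256)
print(f'image GMACs: {FlopCountAnalysis(model.visual, x).total() / 1e9:.2f}')
print(f'image MActs: {ActivationCountAnalysis(model.visual, x).total() / 1e6:.2f}')
```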

|model |image_size|embed_dim|gmacs |macts |mparams|image_gmacs|image_macts|image_mparams|text_gmacs|text_macts|text_mparams|
|--------------------------|----------|---------|------|------|-------|-----------|-----------|-------------|----------|----------|------------|
|ViT-H-16 |224 |1024 |150.96|122.01|986.26 |127.4 |100.81 |632.23 |23.57 |21.2 |354.03 |
|ViT-H-14 |224 |1024 |190.97|160.61|986.11 |167.4 |139.41 |632.08 |23.57 |21.2 |354.03 |
|ViT-L-14-336 |336 |768 |197.76|278.19|427.94 |191.1 |270.24 |304.29 |6.66 |7.95 |123.65 |
|convnext_xxlarge |256 |1024 |221.66|145.66|1200.58|198.09 |124.45 |846.54 |23.57 |21.2 |354.03 |
|RN50x64 |448 |1024 |276.8 |249.73|623.26 |265.02 |239.13 |420.38 |11.78 |10.6 |202.88 |
|ViT-g-14 |224 |1024 |290.74|213.84|1366.68|267.18 |192.64 |1012.65 |23.57 |21.2 |354.03 |
|convnext_xxlarge_320 |320 |1024 |333.08|215.66|1200.58|309.52 |194.46 |846.54 |23.57 |21.2 |354.03 |
|ViT-H-14-336 |336 |1024 |414.53|428.74|986.52 |390.97 |407.54 |632.49 |23.57 |21.2 |354.03 |
|ViT-bigG-14 |224 |1280 |532.92|310.71|2539.57|483.96 |275.37 |1844.91 |48.96 |35.34 |694.66 |

Model training was done by Ross Wightman across both the [stability.ai](https://stability.ai/) cluster and the [JUWELS Booster](https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html) supercomputer. See acknowledgements below.