rwightman HF staff committed on
Commit
bd0c5a9
·
1 Parent(s): 96d934f

Update README.md

  1. README.md +20 -1
README.md CHANGED
@@ -24,7 +24,7 @@ A series of CLIP ConvNeXt-XXLarge (a custom `timm` ConvNeXt size) models trained
24
 
25
  | Model | Dataset | Resolution | AugReg | Top-1 ImageNet Zero-Shot (%) |
26
  | ----- | ------- | ---------- | ------------ | --------- |
27
- | [convnext_xxlarge.laion2b_s34b_b82k-augreg](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg) | LAION-2B | 256x256 | RRC (0.33, 1.0), RE (0.35), SD (0.1), D(0.1) | 79.1 |
28
  | [convnext_xxlarge.laion2b_s34b_b82k-augreg-rewind](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind) | LAION-2B | 256x256 | RRC (0.3, 1.0), RE (0.4), SD (0.1) | 79.3 |
29
  | [convnext_xxlarge.laion2b_s34b_b82k-augreg-soup](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup) | LAION-2B | 256x256 | N/A | 79.4 |
30
  RRC = Random Resize Crop (crop pcts), RE = Random Erasing (prob), SD = Stochastic Depth (prob) -- image tower only
@@ -99,6 +99,20 @@ This model was trained with LAION-2B -- A 2 billion sample English subset of LAI
99
 
100
  The main training run was done at global batch size of 81920 for 256 checkpoint intervals of 135.6M samples for a total of ~34B samples seen over training.
101
 
102
  For 256x256 models, a slurm script w/ srun below for a 128 8-GPU (40GB A100) configuration:
103
 
104
  ```
@@ -131,6 +145,11 @@ srun --cpu_bind=v --accel-bind=gn python -m training.main \
131
  ```
132
 
133
  For the rewind of last 10%, a higher global batch size of 95744 was used w/ a higher LR and slightly increased augmentation strength. The slurm srun cmd for 136 8-GPU (40GB A100) nodes:
134
  ```
135
  srun --cpu_bind=v --accel-bind=gn python -m training.main \
136
  --save-frequency 1 \
 
24
 
25
  | Model | Dataset | Resolution | AugReg | Top-1 ImageNet Zero-Shot (%) |
26
  | ----- | ------- | ---------- | ------------ | --------- |
27
+ | [convnext_xxlarge.laion2b_s34b_b82k-augreg](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg) | LAION-2B | 256x256 | RRC (0.33, 1.0), RE (0.35), SD (0.1) | 79.1 |
28
  | [convnext_xxlarge.laion2b_s34b_b82k-augreg-rewind](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind) | LAION-2B | 256x256 | RRC (0.3, 1.0), RE (0.4), SD (0.1) | 79.3 |
29
  | [convnext_xxlarge.laion2b_s34b_b82k-augreg-soup](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup) | LAION-2B | 256x256 | N/A | 79.4 |
30
  RRC = Random Resize Crop (crop pcts), RE = Random Erasing (prob), SD = Stochastic Depth (prob) -- image tower only
 
99
 
100
  The main training run was done at global batch size of 81920 for 256 checkpoint intervals of 135.6M samples for a total of ~34B samples seen over training.
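A rough arithmetic check of the run size quoted above, written as a sketch. The variable names are illustrative and do not come from the training code, and the steps-per-interval figure assumes one optimizer step per global batch (no gradient accumulation), which the README does not state.

```python
# Figures quoted in the README; variable names are illustrative only.
global_batch_size = 81_920
checkpoint_intervals = 256
samples_per_interval = 135.6e6  # "135.6M samples" per checkpoint interval

# Total samples seen over the main training run.
total_samples = checkpoint_intervals * samples_per_interval

# Assumption: one optimizer step consumes exactly one global batch.
steps_per_interval = samples_per_interval / global_batch_size

print(f"total samples seen: {total_samples / 1e9:.1f}B")
print(f"optimizer steps per interval: ~{steps_per_interval:.0f}")
```

The product comes to roughly 34.7B samples, in line with the "~34B" quoted above given that 135.6M is itself a rounded figure.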
101
 
102
+ Many difficulties w/ both model numerical stability and cluster stability and performance were encountered while training this model. Initial attempts to train with float16 AMP and the default adam beta2 resulted in loss spikes and eventually NaN blow-ups. `beta2` was reduced to 0.97, which helped, but the loss / zero-shot (zs) curves were not tracking as expected. After switching to PyTorch nightlies, it was possible to use bfloat16 + AMP for training (as with the recent H/14, g/14, and G/14 models); beta2 was returned to 0.98 and metrics improved.
103
+
104
+ |Checkpoint Interval |Cluster |# GPUs|# Nodes|GPU |local BS|sample/s|sample/s/gpu|precision |adam beta2 |
105
+ |--------------------|----------|------|-------|----------|--------|--------|------------|----------|-----------|
106
+ |1 - 2 |Stability |1024 |128 |A100 40GB | 80 |37-40k | 36-39 |amp + fp16|0.97 |
107
+ |3 - 32 |Stability |512 |64 |A100 80GB | 160 |27-32k | 52-62 |amp + fp16|0.97 |
108
+ |33 - 75 |Booster |1024 |256 |A100 40GB | 80 |48k | 47 |amp + fp16|0.97 |
109
+ |76 - 165 |Booster |1024 |256 |A100 40GB | 80 |51k | 50 |amp + bf16|0.98 |
110
+ |166 - 232 |Stability |320 |40 |A100 80GB | 256 |18-19k | 56-59 |amp + bf16|0.98 |
111
+ |233 - 249 |Booster |1024 |256 |A100 40GB | 80 |51k | 50 |amp + bf16|0.98 |
112
+ |250 - 256 |Stability |1024 |128 |A100 40GB | 80 |27-31k | 26-30 |amp + bf16|0.98 |
113
+
114
+ JUWELS Booster has 4x A100 GPUs per node w/ 4x HDR-200 IB adapters per node (200Gbit/sec per GPU). The Stability setup used 8x A100 GPUs per node w/ 400Gbit/sec EFA connectivity per node (~50Gbit/sec per GPU). Significant variation in training efficiency (throughput per GPU) was observed across the various configurations. The 1024 GPU configurations across both clusters were particularly prone to crashing (or very difficult to get running w/ a 'good' set of GPUs).
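The rows of the table above can be cross-checked against the stated global batch size and node layouts. The sketch below is only a consistency check on the published numbers: the function and variable names are hypothetical, and the per-GPU bandwidth figures simply divide the per-node link rates quoted above by the GPUs per node.

```python
# Rows transcribed from the table above:
# (checkpoint intervals, cluster, total GPUs, nodes, local batch size).
ROWS = [
    ("1 - 2", "Stability", 1024, 128, 80),
    ("3 - 32", "Stability", 512, 64, 160),
    ("33 - 75", "Booster", 1024, 256, 80),
    ("76 - 165", "Booster", 1024, 256, 80),
    ("166 - 232", "Stability", 320, 40, 256),
    ("233 - 249", "Booster", 1024, 256, 80),
    ("250 - 256", "Stability", 1024, 128, 80),
]

GLOBAL_BATCH_SIZE = 81_920  # stated for the main training run
GPUS_PER_NODE = {"Stability": 8, "Booster": 4}  # as described in the text
NODE_LINK_GBIT = {"Stability": 400, "Booster": 4 * 200}  # per-node totals


def check_rows(rows):
    """Return the intervals whose numbers disagree with the stated setup."""
    mismatches = []
    for intervals, cluster, gpus, nodes, local_bs in rows:
        batch_ok = gpus * local_bs == GLOBAL_BATCH_SIZE
        layout_ok = nodes * GPUS_PER_NODE[cluster] == gpus
        if not (batch_ok and layout_ok):
            mismatches.append(intervals)
    return mismatches


def per_gpu_gbit(cluster):
    """Per-GPU interconnect bandwidth implied by the per-node figure."""
    return NODE_LINK_GBIT[cluster] / GPUS_PER_NODE[cluster]


print("inconsistent rows:", check_rows(ROWS))
for cluster in GPUS_PER_NODE:
    print(f"{cluster}: {per_gpu_gbit(cluster):.0f} Gbit/s per GPU")
```

All seven rows multiply out to the 81920 global batch size, so the phases differ only in how that batch is split across GPUs and nodes.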
115
+
116
  For 256x256 models, a slurm script w/ srun below for a 128 8-GPU (40GB A100) configuration:
117
 
118
  ```
 
145
  ```
146
 
147
  For the rewind of last 10%, a higher global batch size of 95744 was used w/ a higher LR and slightly increased augmentation strength. The slurm srun cmd for 136 8-GPU (40GB A100) nodes:
148
+
149
+ |Checkpoint Interval |Cluster |# GPUs|# Nodes|GPU |local BS|sample/s|sample/s/gpu|precision |adam beta2 |
150
+ |--------------------|---------|------|-------|----------|--------|--------|------------|----------|-----------|
151
+ |231 - 256 |Stability|1088 |136 |A100 40GB | 88 |32-35k | 29-32 |amp + bf16|0.98 |
152
+
153
  ```
154
  srun --cpu_bind=v --accel-bind=gn python -m training.main \
155
  --save-frequency 1 \