Ross Wightman committed
Commit 8075f1d
Parent(s): b6f7d8d
Update README
README.md
CHANGED

# Table of Contents

1. [Model Details](#model-details)
2. [Uses](#uses)
3. [Training Details](#training-details)
4. [Evaluation](#evaluation)
5. [Acknowledgements](#acknowledgements)
6. [Citation](#citation)
7. [How To Get Started With the Model](#how-to-get-started-with-the-model)

# Model Details

A CLIP ViT L/14 model trained with the LAION-2B English subset of LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip).

Model training ('babysitting') done by Ross Wightman on the [JUWELS Booster](https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html) supercomputer. See acknowledgements below.

# Uses

As per the original OpenAI CLIP models, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models.
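
For a concrete sense of what zero-shot classification with this model looks like, here is a minimal sketch using the OpenCLIP API. The pretrained tag and image path are assumptions for illustration, not taken from this card; check `open_clip.list_pretrained()` for the identifier that corresponds to this checkpoint.

```
# Minimal zero-shot classification sketch with OpenCLIP.
# NOTE: the pretrained tag and image path are assumptions for illustration --
# check open_clip.list_pretrained() for the identifier of this checkpoint.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='laion2b_s32b_b82k')  # assumed tag
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # any RGB image
text = open_clip.tokenize(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # probabilities over the candidate captions
```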

## Training Procedure

The model was trained on 384 A100 GPUs using 200M-sample 'virtual' epochs, where dataset shards were sampled with replacement. The model was trained for 160 virtual epochs, for a total of 32B samples seen.

The first 68 epochs were trained with float16 AMP and a global batch size of 79K (208 per GPU). Training initially ran to epoch 75, where the loss spiked and training failed with NaN.
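
As a quick sanity check of the figures above (not part of the training code), the sample counts and global batch sizes work out as follows; shard sampling with replacement corresponds to the `--dataset-resampled` flag in the Slurm script below.

```
# Sanity check of the figures quoted above (illustrative only).
samples_per_virtual_epoch = 200_000_000
virtual_epochs = 160
print(samples_per_virtual_epoch * virtual_epochs)  # 32,000,000,000 samples seen

gpus = 384
print(gpus * 208)  # 79,872 -> the ~79K global batch size of the float16 AMP phase
print(gpus * 224)  # 86,016 -> the ~86K global batch size after the float32 restart
```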

Romain Beaumont was training H/14 and g/14 models at the same time on the Stability cluster and hit similar instabilities. Collectively we tried restarts with:
* a different dataset shuffle seed
* a different LR
* gradient clipping
* modifications to the architecture
* norm modifications (stable norm for final, post-embed norm for the text transformer) as per https://github.com/mlfoundations/open_clip/pull/153, thanks to Phil Wang
* extra attention block norms à la NormFormer (https://arxiv.org/abs/2110.09456)
* scaled cosine attention à la Swin-V2 (https://arxiv.org/abs/2111.09883); see the sketch below

None of the above ended up working. Most blew up within the same epoch as the original run, with the exception of the architecture mods:
* The NormFormer mods significantly altered the network such that resuming did not quickly converge to the previous performance; this was abandoned, but might be worth trying from the start.
* Scaled cosine attention initially looked promising and lasted until epoch 90, before the loss suddenly increased and appeared to remain 'stuck'.
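
For readers unfamiliar with the last of those techniques, the following is an illustrative sketch of scaled cosine attention in the Swin-V2 style: attention logits are cosine similarities between queries and keys, scaled by a learnable, clamped per-head temperature rather than 1/sqrt(d). This is a generic sketch, not the open_clip code that was actually tried; the module layout and defaults are assumptions.

```
# Illustrative sketch of scaled cosine attention (Swin-V2 style).
# Not the open_clip implementation that was tried -- layout and defaults assumed.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledCosineAttention(nn.Module):
    def __init__(self, dim, num_heads=8, logit_scale_max=math.log(1.0 / 0.01)):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.logit_scale_max = logit_scale_max
        # one learnable log-temperature per head, clamped in forward()
        self.logit_scale = nn.Parameter(torch.log(10.0 * torch.ones(num_heads, 1, 1)))

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)
        # cosine similarity between queries and keys keeps logits bounded
        attn = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
        # clamp the learnable scale so the attention logits cannot blow up
        scale = torch.clamp(self.logit_scale, max=self.logit_scale_max).exp()
        attn = (attn * scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

The appeal is that cosine similarities keep the attention logits bounded regardless of feature scale; as noted above, this variant still stalled around epoch 90.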

In the end, restarting at epoch 69 with `float32` precision solved all the instabilities, and training continued from there with a global batch size of 86K (224 per GPU). On A100 GPUs, `float32` had a minimal impact on throughput once `tf32` matmuls were enabled in PyTorch: approximately 10% slower than `float16 AMP`. Romain similarly changed the precision, but ended up using `bfloat16 AMP` to resolve his issues.
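
For reference, enabling TF32 tensor-core math for fp32 matmuls in PyTorch comes down to two flags; where exactly this is set in the training stack is not shown here.

```
# Enable TF32 tensor-core math for fp32 matmuls/convolutions on Ampere GPUs.
import torch

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```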

### Slurm Script

```
#SBATCH --nodes=96
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=6
#SBATCH --wait-all-nodes=1
#SBATCH --job-name=open_clip_laion2b

# load low-level libraries
ml purge
source /conda/bin/activate pytorch-112

export NCCL_ASYNC_ERROR_HANDLING=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
export MASTER_PORT=12802

### get the first node name as master address - customized for vgg slurm
### e.g. master(gnodee[2-5],gnoded1) == gnodee2
echo "NODELIST="${SLURM_NODELIST}
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr"i"
echo "MASTER_ADDR="$MASTER_ADDR

cd /home/me/open_clip
export PYTHONPATH="$PYTHONPATH:$PWD/src"

# one task per GPU: 96 nodes x 4 GPUs = 384 GPUs total
srun --cpu_bind=none,v --accel-bind=gn python -u src/training/main.py \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --train-data="/data/laion2B-en/{00000..23295}.tar" \
    --train-num-samples=200000000 \
    --warmup 10000 \
    --lr "1e-3" \
    --batch-size=224 \
    --epochs=160 \
    --workers=6 \
    --model ViT-L-14 \
    --name "L14-laion2B" \
    --report-to "tensorboard" \
    --seed 0 \
    --precision 'fp32' \
    --ddp-static-graph \
    --local-loss \
    --dataset-resampled \
    --gather-with-grad \
    --grad-checkpointing
```

# Evaluation

## Results

The model achieves a 75.3% zero-shot top-1 accuracy on ImageNet-1k.

An initial round of benchmarks has been performed on a wider range of datasets, currently viewable at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb

**TODO** - create table for just this model's metrics.
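
For orientation, zero-shot top-1 accuracy for CLIP-style models is typically computed by building a text classifier from class-name prompts and taking the argmax over image-text similarities, roughly as in the sketch below. This is illustrative only; the figures above come from the VTAB+/CLIP_benchmark evaluation, and the pretrained tag and prompt template are assumptions.

```
# Sketch of how zero-shot top-1 accuracy is typically computed for CLIP-style
# models (illustrative; the numbers above come from the CLIP_benchmark suite).
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='laion2b_s32b_b82k')  # assumed tag
model.eval()

class_names = ["tench", "goldfish", "great white shark"]  # ...full ImageNet-1k list
with torch.no_grad():
    text = open_clip.tokenize([f"a photo of a {c}" for c in class_names])
    classifier = model.encode_text(text)
    classifier = classifier / classifier.norm(dim=-1, keepdim=True)

def zero_shot_predict(images):
    """images: a batch of preprocessed images, shape (B, 3, 224, 224)."""
    with torch.no_grad():
        feats = model.encode_image(images)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        return (feats @ classifier.T).argmax(dim=-1)  # predicted class indices

# Top-1 accuracy is then the fraction of predictions matching the true labels.
```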

# Acknowledgements

Acknowledging the Gauss Centre for Supercomputing e.V. (http://gauss-centre.eu) for funding this part of work by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUWELS Booster at Jülich Supercomputing Centre (JSC).

# Citation