2024-09-06,22:40:58 | INFO | No latest resume checkpoint found in /home/breaking_0.1_trained/10% most difficult/checkpoints.
2024-09-06,22:41:00 | INFO | Running in distributed mode with multiple processes. Device: cuda:0. Process (global: 0, local 0), total 2.
2024-09-06,22:41:00 | INFO | Loaded ViT-B-32 model config.
2024-09-06,22:41:01 | INFO | Model:
2024-09-06,22:41:01 | INFO | CLIP(
  (visual): VisionTransformer(
    (patchnorm_pre_ln): Identity()
    (conv1): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), bias=False)
    (patch_dropout): Identity()
    (ln_pre): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (transformer): Transformer(
      (resblocks): ModuleList(
        (0-11): 12 x ResidualAttentionBlock(
          (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (ls_1): Identity()
          (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): Sequential(
            (c_fc): Linear(in_features=768, out_features=3072, bias=True)
            (gelu): GELU(approximate='none')
            (c_proj): Linear(in_features=3072, out_features=768, bias=True)
          )
          (ls_2): Identity()
        )
      )
    )
    (ln_post): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (transformer): Transformer(
    (resblocks): ModuleList(
      (0-11): 12 x ResidualAttentionBlock(
        (ln_1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (ls_1): Identity()
        (ln_2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (c_fc): Linear(in_features=512, out_features=2048, bias=True)
          (gelu): GELU(approximate='none')
          (c_proj): Linear(in_features=2048, out_features=512, bias=True)
        )
        (ls_2): Identity()
      )
    )
  )
  (token_embedding): Embedding(49408, 512)
  (ln_final): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
2024-09-06,22:41:01 | INFO | Params:
2024-09-06,22:41:01 | INFO | accum_freq: 1
2024-09-06,22:41:01 | INFO | aug_cfg: {}
2024-09-06,22:41:01 | INFO | batch_size: 2048
2024-09-06,22:41:01 | INFO | beta1: 0.9
2024-09-06,22:41:01 | INFO | beta2: 0.98
2024-09-06,22:41:01 | INFO | checkpoint_path: /home/breaking_0.1_trained/10% most difficult/checkpoints
2024-09-06,22:41:01 | INFO | coca_caption_loss_weight: 2.0
2024-09-06,22:41:01 | INFO | coca_contrastive_loss_weight: 1.0
2024-09-06,22:41:01 | INFO | copy_codebase: False
2024-09-06,22:41:01 | INFO | csv_caption_key: title
2024-09-06,22:41:01 | INFO | csv_img_key: filepath
2024-09-06,22:41:01 | INFO | csv_separator:
2024-09-06,22:41:01 | INFO | dataset_resampled: True
2024-09-06,22:41:01 | INFO | dataset_type: webdataset
2024-09-06,22:41:01 | INFO | ddp_static_graph: True
2024-09-06,22:41:01 | INFO | debug: False
2024-09-06,22:41:01 | INFO | delete_previous_checkpoint: False
2024-09-06,22:41:01 | INFO | device: cuda:0
2024-09-06,22:41:01 | INFO | dist_backend: nccl
2024-09-06,22:41:01 | INFO | dist_url: env://
2024-09-06,22:41:01 | INFO | distill: False
2024-09-06,22:41:01 | INFO | distill_model: None
2024-09-06,22:41:01 | INFO | distill_pretrained: None
2024-09-06,22:41:01 | INFO | distributed: True
2024-09-06,22:41:01 | INFO | epochs: 5
2024-09-06,22:41:01 | INFO | epochs_cooldown: None
2024-09-06,22:41:01 | INFO | eps: 1e-06
2024-09-06,22:41:01 | INFO | force_custom_text: False
2024-09-06,22:41:01 | INFO | force_image_size: None
2024-09-06,22:41:01 | INFO | force_patch_dropout: None
2024-09-06,22:41:01 | INFO | force_quick_gelu: False
2024-09-06,22:41:01 | INFO | gather_with_grad: True
2024-09-06,22:41:01 | INFO | grad_checkpointing: True
2024-09-06,22:41:01 | INFO | grad_clip_norm: None
2024-09-06,22:41:01 | INFO | horovod: False
2024-09-06,22:41:01 | INFO | image_mean: None
2024-09-06,22:41:01 | INFO | image_std: None
2024-09-06,22:41:01 | INFO | imagenet_v2: None
2024-09-06,22:41:01 | INFO | imagenet_val: None
2024-09-06,22:41:01 | INFO | local_loss: True
2024-09-06,22:41:01 | INFO | local_rank: 0
2024-09-06,22:41:01 | INFO | lock_image: False
2024-09-06,22:41:01 | INFO | lock_image_freeze_bn_stats: False
2024-09-06,22:41:01 | INFO | lock_image_unlocked_groups: 0
2024-09-06,22:41:01 | INFO | lock_text: False
2024-09-06,22:41:01 | INFO | lock_text_freeze_layer_norm: False
2024-09-06,22:41:01 | INFO | lock_text_unlocked_layers: 0
2024-09-06,22:41:01 | INFO | log_every_n_steps: 100
2024-09-06,22:41:01 | INFO | log_level: 20
2024-09-06,22:41:01 | INFO | log_local: False
2024-09-06,22:41:01 | INFO | log_path: /home/breaking_0.1_trained/10% most difficult/out.log
2024-09-06,22:41:01 | INFO | logs: /home/breaking_0.1_trained
2024-09-06,22:41:01 | INFO | lr: 0.0005
2024-09-06,22:41:01 | INFO | lr_cooldown_end: 0.0
2024-09-06,22:41:01 | INFO | lr_cooldown_power: 1.0
2024-09-06,22:41:01 | INFO | lr_scheduler: cosine
2024-09-06,22:41:01 | INFO | model: ViT-B-32
2024-09-06,22:41:01 | INFO | name: 10% most difficult
2024-09-06,22:41:01 | INFO | no_set_device_rank: False
2024-09-06,22:41:01 | INFO | precision: amp
2024-09-06,22:41:01 | INFO | pretrained:
2024-09-06,22:41:01 | INFO | pretrained_image: False
2024-09-06,22:41:01 | INFO | rank: 0
2024-09-06,22:41:01 | INFO | remote_sync: None
2024-09-06,22:41:01 | INFO | remote_sync_frequency: 300
2024-09-06,22:41:01 | INFO | remote_sync_protocol: s3
2024-09-06,22:41:01 | INFO | report_to: wandb
2024-09-06,22:41:01 | INFO | resume: None
2024-09-06,22:41:01 | INFO | save_frequency: 0
2024-09-06,22:41:01 | INFO | save_most_recent: True
2024-09-06,22:41:01 | INFO | seed: 0
2024-09-06,22:41:01 | INFO | skip_scheduler: False
2024-09-06,22:41:01 | INFO | tensorboard: False
2024-09-06,22:41:01 | INFO | tensorboard_path:
2024-09-06,22:41:01 | INFO | torchscript: False
2024-09-06,22:41:01 | INFO | trace: False
2024-09-06,22:41:01 | INFO | train_data: /home/breaking_0.1/{00000000..00000127}.tar
2024-09-06,22:41:01 | INFO | train_data_upsampling_factors: None
2024-09-06,22:41:01 | INFO | train_num_samples: 2560000
2024-09-06,22:41:01 | INFO | use_bn_sync: False
2024-09-06,22:41:01 | INFO | val_data: None
2024-09-06,22:41:01 | INFO | val_frequency: 1
2024-09-06,22:41:01 | INFO | val_num_samples: None
2024-09-06,22:41:01 | INFO | wandb: True
2024-09-06,22:41:01 | INFO | wandb_notes:
2024-09-06,22:41:01 | INFO | wandb_project_name: clip_text_hq_clusters
2024-09-06,22:41:01 | INFO | warmup: 500
2024-09-06,22:41:01 | INFO | wd: 0.2
2024-09-06,22:41:01 | INFO | workers: 4
2024-09-06,22:41:01 | INFO | world_size: 2
2024-09-06,22:41:01 | INFO | zeroshot_frequency: 2
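The parameter dump above corresponds roughly to the following open_clip training launch. This is a hedged reconstruction, not the actual command from the run: the entry-point module path (here `training.main`) varies across open_clip versions, and the exact launcher (torchrun vs. other DDP launchers) is not recorded in the log; the two processes and flag values are taken from the dump (world_size 2, batch_size 2048 per GPU, cosine schedule by default).

```shell
# Approximate reconstruction of the 2-GPU open_clip run whose
# parameters are logged above (flag names per open_clip's training CLI).
torchrun --nproc_per_node 2 -m training.main \
  --model ViT-B-32 \
  --train-data '/home/breaking_0.1/{00000000..00000127}.tar' \
  --dataset-type webdataset \
  --dataset-resampled \
  --train-num-samples 2560000 \
  --batch-size 2048 \
  --epochs 5 \
  --lr 5e-4 --beta1 0.9 --beta2 0.98 --eps 1e-6 --wd 0.2 \
  --warmup 500 \
  --precision amp \
  --local-loss --gather-with-grad \
  --grad-checkpointing --ddp-static-graph \
  --workers 4 \
  --seed 0 \
  --logs /home/breaking_0.1_trained \
  --name '10% most difficult' \
  --save-most-recent \
  --zeroshot-frequency 2 \
  --log-every-n-steps 100 \
  --report-to wandb --wandb-project-name clip_text_hq_clusters
```

Note that with `--pretrained` left empty and no resume checkpoint found, this trains ViT-B-32 from scratch on the 128 webdataset shards, treating 2,560,000 samples as one epoch.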