2023-06-23 08:06:25,990 INFO [train.py:1076] (1/4) Training started
2023-06-23 08:06:25,991 INFO [train.py:1086] (1/4) Device: cuda:1
2023-06-23 08:06:26,034 INFO [train.py:1095] (1/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Debug', 'k2-with-cuda': True, 'k2-git-sha1': '38211604d6a24b15f320578a1a38f6c12d7a711c', 'k2-git-date': 'Mon Jun 12 10:59:44 2023', 'lhotse-version': '1.15.0.dev+git.f1fd23d.clean', 'torch-version': '2.0.0+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.8', 'icefall-git-branch': 'jsalt_ted', 'icefall-git-sha1': '5e817e8-dirty', 'icefall-git-date': 'Thu Jun 22 03:25:04 2023', 'icefall-path': '/exp/draj/jsalt2023/icefall', 'k2-path': '/exp/draj/jsalt2023/k2/k2/python/k2/__init__.py', 'lhotse-path': '/exp/draj/jsalt2023/lhotse/lhotse/__init__.py', 'hostname': 'r3n06', 'IP address': '10.1.3.6'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp/v7'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.04, 'lr_batches': 7500, 'lr_epochs': 5.0, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'rnnt_type': 'modified', 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 1, 'average_period': 200, 'use_fp16': True, 'delay_penalty': 0.0, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/manifests'), 'max_duration': 1000, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'blank_id': 0, 'vocab_size': 500}
2023-06-23 08:06:26,035 INFO [train.py:1097] (1/4) About to create model
2023-06-23 08:06:26,793 INFO [train.py:1101] (1/4) Number of model parameters: 65549011
2023-06-23 08:06:35,085 INFO [train.py:1116] (1/4) Using DDP
2023-06-23 08:06:35,517 INFO [asr_datamodule.py:406] (1/4) About to get train cuts
2023-06-23 08:06:35,574 INFO [asr_datamodule.py:232] (1/4) Enable SpecAugment
2023-06-23 08:06:35,574 INFO [asr_datamodule.py:233] (1/4) Time warp factor: 80
2023-06-23 08:06:35,575 INFO [asr_datamodule.py:249] (1/4) About to get Musan cuts
2023-06-23 08:06:35,575 INFO [asr_datamodule.py:252] (1/4) Enable MUSAN
2023-06-23 08:06:37,163 INFO [asr_datamodule.py:274] (1/4) About to create train dataset
2023-06-23 08:06:37,163 INFO [asr_datamodule.py:300] (1/4) Using DynamicBucketingSampler.
2023-06-23 08:06:39,221 INFO [asr_datamodule.py:321] (1/4) About to create train dataloader
2023-06-23 08:06:39,222 INFO [asr_datamodule.py:411] (1/4) About to get dev cuts
2023-06-23 08:06:39,223 INFO [asr_datamodule.py:342] (1/4) About to create dev dataset
2023-06-23 08:06:39,243 INFO [asr_datamodule.py:361] (1/4) About to create dev dataloader
2023-06-23 08:06:39,243 INFO [train.py:1269] (1/4) Sanity check -- see if any of the batches in epoch 1 would cause OOM.
2023-06-23 08:07:25,979 INFO [scaling.py:962] (1/4) Whitening: name=None, num_groups=4, num_channels=128, metric=13.66 vs. limit=3.0
2023-06-23 08:07:26,291 INFO [scaling.py:962] (1/4) Whitening: name=None, num_groups=1, num_channels=256, metric=45.72 vs. limit=5.0
2023-06-23 08:07:26,628 INFO [train.py:1297] (1/4) Maximum memory allocated so far is 8701MB
2023-06-23 08:07:29,313 INFO [train.py:1297] (1/4) Maximum memory allocated so far is 8824MB
2023-06-23 08:07:40,819 INFO [train.py:1297] (1/4) Maximum memory allocated so far is 11517MB
2023-06-23 08:07:47,183 INFO [train.py:1297] (1/4) Maximum memory allocated so far is 11850MB
2023-06-23 08:08:05,857 INFO [train.py:1297] (1/4) Maximum memory allocated so far is 11850MB
2023-06-23 08:08:15,177 INFO [train.py:1297] (1/4) Maximum memory allocated so far is 11982MB
2023-06-23 08:08:35,413 INFO [train.py:1008] (1/4) Epoch 1, batch 0, loss[loss=6.214, simple_loss=5.661, pruned_loss=5.523, over 19778.00 frames. ], tot_loss[loss=6.214, simple_loss=5.661, pruned_loss=5.523, over 19778.00 frames. ], batch size: 115, lr: 2.00e-02, grad_scale: 1.0
2023-06-23 08:08:35,413 INFO [train.py:1031] (1/4) Computing validation loss
2023-06-23 08:08:41,592 INFO [train.py:1040] (1/4) Epoch 1, validation: loss=6.238, simple_loss=5.687, pruned_loss=5.495, over 143649.00 frames.
2023-06-23 08:08:41,593 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 11982MB
2023-06-23 08:08:48,230 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.14 vs. limit=7.5
2023-06-23 08:08:49,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=0.0, ans=0.2
2023-06-23 08:08:52,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=0.0, ans=0.3
2023-06-23 08:09:00,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=0.0, ans=0.5
2023-06-23 08:09:32,105 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.25 vs. limit=5.016666666666667
2023-06-23 08:09:34,070 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. limit=3.01
2023-06-23 08:09:34,544 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=190.15 vs. limit=7.525
2023-06-23 08:09:37,171 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=499.37 vs. limit=5.033333333333333
2023-06-23 08:09:51,776 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=379.48 vs. limit=7.6
2023-06-23 08:10:05,957 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=208.83 vs. limit=7.575
2023-06-23 08:10:14,151 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=233.20 vs. limit=7.575
2023-06-23 08:10:15,773 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=245.11 vs. limit=7.65
2023-06-23 08:10:36,520 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.92 vs. limit=7.7
2023-06-23 08:10:40,002 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=36.73 vs. limit=7.7
2023-06-23 08:10:40,042 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=43.73 vs. limit=7.6
2023-06-23 08:10:44,679 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=55.70 vs. limit=7.6
2023-06-23 08:10:47,202 INFO [train.py:1008] (1/4) Epoch 1, batch 50, loss[loss=1.432, simple_loss=1.286, pruned_loss=1.324, over 19062.00 frames. ], tot_loss[loss=2.837, simple_loss=2.61, pruned_loss=2.216, over 863355.97 frames. ], batch size: 89, lr: 2.20e-02, grad_scale: 0.5
2023-06-23 08:10:48,651 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=51.93 vs. limit=7.625
2023-06-23 08:10:58,763 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=35.75 vs. limit=7.625
2023-06-23 08:11:13,493 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=194.48 vs. limit=7.65
2023-06-23 08:11:14,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=400.0, ans=0.0975
2023-06-23 08:11:32,799 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=75.57 vs. limit=7.675
2023-06-23 08:11:33,225 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=26.82 vs. limit=7.675
2023-06-23 08:11:34,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=466.6666666666667, ans=0.0895
2023-06-23 08:11:38,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=466.6666666666667, ans=0.1825
2023-06-23 08:11:39,139 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=393.85 vs. limit=5.233333333333333
2023-06-23 08:11:58,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=533.3333333333334, ans=0.475
2023-06-23 08:12:11,916 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=8.59 vs. limit=4.24
2023-06-23 08:12:20,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=600.0, ans=0.471875
2023-06-23 08:12:20,970 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=93.99 vs. limit=7.725
2023-06-23 08:12:31,920 INFO [train.py:1008] (1/4) Epoch 1, batch 100, loss[loss=1.336, simple_loss=1.165, pruned_loss=1.378, over 18317.00 frames. ], tot_loss[loss=1.973, simple_loss=1.79, pruned_loss=1.682, over 1514018.25 frames. ], batch size: 72, lr: 2.40e-02, grad_scale: 1.0
2023-06-23 08:12:35,825 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.225e+02 1.675e+02 4.751e+02 2.973e+03 1.162e+04, threshold=9.501e+02, percent-clipped=0.0
2023-06-23 08:13:00,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=733.3333333333334, ans=0.04770833333333334
2023-06-23 08:13:21,303 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. limit=5.2
2023-06-23 08:13:46,187 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=52.28 vs. limit=7.825
2023-06-23 08:13:50,511 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=144.43 vs. limit=7.825
2023-06-23 08:13:52,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=933.3333333333334, ans=0.29066666666666663
2023-06-23 08:13:53,604 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=165.04 vs. limit=7.85
2023-06-23 08:14:03,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=933.3333333333334, ans=0.09416666666666668
2023-06-23 08:14:07,574 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.51 vs. limit=5.466666666666667
2023-06-23 08:14:14,448 INFO [train.py:1008] (1/4) Epoch 1, batch 150, loss[loss=1.14, simple_loss=0.9849, pruned_loss=1.145, over 16711.00 frames. ], tot_loss[loss=1.616, simple_loss=1.449, pruned_loss=1.453, over 2002865.29 frames. ], batch size: 59, lr: 2.60e-02, grad_scale: 1.0
2023-06-23 08:14:16,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1000.0, ans=0.453125
2023-06-23 08:14:24,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1000.0, ans=0.453125
2023-06-23 08:14:43,363 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=232.16 vs. limit=7.9
2023-06-23 08:14:45,819 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.94 vs. limit=5.266666666666667
2023-06-23 08:14:48,854 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.59 vs. limit=5.266666666666667
2023-06-23 08:14:50,588 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=22.42 vs. limit=7.9
2023-06-23 08:15:12,251 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=236.71 vs. limit=7.925
2023-06-23 08:15:12,263 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=204.56 vs. limit=7.925
2023-06-23 08:15:14,122 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=83.26 vs. limit=7.95
2023-06-23 08:15:15,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1200.0, ans=0.218
2023-06-23 08:15:42,776 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=132.96 vs. limit=7.975
2023-06-23 08:15:53,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1266.6666666666667, ans=0.219
2023-06-23 08:15:58,202 INFO [train.py:1008] (1/4) Epoch 1, batch 200, loss[loss=0.9346, simple_loss=0.8062, pruned_loss=0.8871, over 20090.00 frames. ], tot_loss[loss=1.403, simple_loss=1.246, pruned_loss=1.287, over 2393446.56 frames. ], batch size: 133, lr: 2.80e-02, grad_scale: 2.0
2023-06-23 08:16:02,125 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 9.170e+01 1.130e+02 1.229e+02 1.420e+02 8.859e+02, threshold=2.457e+02, percent-clipped=0.0
2023-06-23 08:16:03,429 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=242.71 vs. limit=5.666666666666667
2023-06-23 08:16:25,486 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=12.98 vs. limit=8.025
2023-06-23 08:16:29,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.whiten.whitening_limit, batch_count=1400.0, ans=4.5600000000000005
2023-06-23 08:16:39,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1466.6666666666667, ans=0.43125
2023-06-23 08:16:43,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=1466.6666666666667, ans=0.2853333333333333
2023-06-23 08:16:45,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1466.6666666666667, ans=0.43125
2023-06-23 08:16:49,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1466.6666666666667, ans=0.43125
2023-06-23 08:17:38,399 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.39 vs. limit=8.75
2023-06-23 08:17:39,815 INFO [train.py:1008] (1/4) Epoch 1, batch 250, loss[loss=1.036, simple_loss=0.888, pruned_loss=0.9546, over 18323.00 frames. ], tot_loss[loss=1.264, simple_loss=1.114, pruned_loss=1.165, over 2707396.18 frames. ], batch size: 72, lr: 3.00e-02, grad_scale: 2.0
2023-06-23 08:17:41,413 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=179.75 vs. limit=8.125
2023-06-23 08:17:44,919 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=16.36 vs. limit=8.125
2023-06-23 08:17:44,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=1666.6666666666667, ans=8.125
2023-06-23 08:17:58,377 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=74.42 vs. limit=8.15
2023-06-23 08:17:58,704 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=225.62 vs. limit=5.866666666666666
2023-06-23 08:18:05,012 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=19.95 vs. limit=8.15
2023-06-23 08:18:09,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1733.3333333333333, ans=0.061000000000000006
2023-06-23 08:18:14,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1733.3333333333333, ans=0.061000000000000006
2023-06-23 08:18:16,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1733.3333333333333, ans=0.061000000000000006
2023-06-23 08:18:28,829 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=16.93 vs. limit=8.175
2023-06-23 08:18:40,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1866.6666666666667, ans=0.4125
2023-06-23 08:18:42,370 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=35.90 vs. limit=8.2
2023-06-23 08:18:46,178 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=28.65 vs. limit=8.2
2023-06-23 08:18:53,031 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=22.34 vs. limit=8.2
2023-06-23 08:19:10,418 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=23.78 vs. limit=8.225
2023-06-23 08:19:21,087 INFO [train.py:1008] (1/4) Epoch 1, batch 300, loss[loss=0.8672, simple_loss=0.7413, pruned_loss=0.7718, over 20544.00 frames. ], tot_loss[loss=1.172, simple_loss=1.026, pruned_loss=1.076, over 2941102.18 frames. ], batch size: 189, lr: 3.20e-02, grad_scale: 4.0
2023-06-23 08:19:21,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2000.0, ans=0.043750000000000004
2023-06-23 08:19:24,878 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 8.030e+01 1.048e+02 1.320e+02 1.714e+02 2.522e+02, threshold=2.641e+02, percent-clipped=2.0
2023-06-23 08:19:31,890 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=54.79 vs. limit=8.25
2023-06-23 08:19:34,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2000.0, ans=0.23
2023-06-23 08:19:41,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2066.6666666666665, ans=0.403125
2023-06-23 08:19:53,331 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=29.59 vs. limit=8.275
2023-06-23 08:20:02,934 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=17.05 vs. limit=8.3
2023-06-23 08:20:10,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2133.3333333333335, ans=0.4
2023-06-23 08:20:16,950 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.26 vs. limit=9.1
2023-06-23 08:20:22,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2200.0, ans=0.27799999999999997
2023-06-23 08:20:28,895 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=20.89 vs. limit=8.325
2023-06-23 08:20:36,536 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.69 vs. limit=8.325
2023-06-23 08:20:59,823 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=26.49 vs. limit=9.2
2023-06-23 08:21:03,121 INFO [train.py:1008] (1/4) Epoch 1, batch 350, loss[loss=0.872, simple_loss=0.7395, pruned_loss=0.7652, over 20554.00 frames. ], tot_loss[loss=1.104, simple_loss=0.9603, pruned_loss=1.005, over 3126909.69 frames. ], batch size: 160, lr: 3.40e-02, grad_scale: 4.0
2023-06-23 08:21:24,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2400.0, ans=0.8160000000000001
2023-06-23 08:21:34,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2400.0, ans=0.5
2023-06-23 08:21:41,886 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=24.36 vs. limit=8.425
2023-06-23 08:21:56,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2466.6666666666665, ans=0.384375
2023-06-23 08:22:09,018 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.90 vs. limit=5.013333333333334
2023-06-23 08:22:11,266 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.29 vs. limit=8.45
2023-06-23 08:22:16,816 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.04 vs. limit=5.633333333333334
2023-06-23 08:22:22,422 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.98 vs. limit=8.475
2023-06-23 08:22:27,672 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=19.11 vs. limit=8.475
2023-06-23 08:22:29,307 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=13.39 vs. limit=9.45
2023-06-23 08:22:31,677 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.62 vs. limit=6.3
2023-06-23 08:22:31,820 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=19.23 vs. limit=8.475
2023-06-23 08:22:33,830 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=11.28 vs. limit=8.475
2023-06-23 08:22:36,930 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=18.87 vs. limit=9.45
2023-06-23 08:22:40,642 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.08 vs. limit=9.45
2023-06-23 08:22:43,344 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.64 vs. limit=6.333333333333333
2023-06-23 08:22:43,996 INFO [train.py:1008] (1/4) Epoch 1, batch 400, loss[loss=0.8609, simple_loss=0.7312, pruned_loss=0.7237, over 19444.00 frames. ], tot_loss[loss=1.054, simple_loss=0.9127, pruned_loss=0.947, over 3268977.90 frames. ], batch size: 105, lr: 3.60e-02, grad_scale: 8.0
2023-06-23 08:22:44,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2666.6666666666665, ans=0.16666666666666669
2023-06-23 08:22:46,677 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.08 vs. limit=8.5
2023-06-23 08:22:47,923 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 8.625e+01 1.356e+02 1.811e+02 2.301e+02 5.147e+02, threshold=3.622e+02, percent-clipped=12.0
2023-06-23 08:22:48,707 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.92 vs. limit=5.666666666666667
2023-06-23 08:22:54,773 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.54 vs. limit=8.5
2023-06-23 08:22:56,437 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.10 vs. limit=8.5
2023-06-23 08:22:58,586 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.25 vs. limit=8.5
2023-06-23 08:23:16,439 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=185.03 vs. limit=8.525
2023-06-23 08:23:27,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2800.0, ans=0.36875
2023-06-23 08:23:37,260 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.87 vs. limit=8.55
2023-06-23 08:23:37,661 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=18.24 vs. limit=8.55
2023-06-23 08:23:38,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2800.0, ans=0.27199999999999996
2023-06-23 08:23:41,848 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=9.6
2023-06-23 08:23:45,812 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=16.98 vs. limit=8.575
2023-06-23 08:23:47,238 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=8.87 vs. limit=8.575
2023-06-23 08:23:49,699 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=21.45 vs. limit=8.575
2023-06-23 08:23:52,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2866.6666666666665, ans=0.7996666666666667
2023-06-23 08:23:54,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2866.6666666666665, ans=0.14166666666666666
2023-06-23 08:23:54,908 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.67 vs. limit=8.575
2023-06-23 08:24:15,369 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=8.6
2023-06-23 08:24:16,967 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=24.88 vs. limit=8.6
2023-06-23 08:24:24,830 INFO [train.py:1008] (1/4) Epoch 1, batch 450, loss[loss=0.8447, simple_loss=0.7197, pruned_loss=0.6804, over 19665.00 frames. ], tot_loss[loss=1.01, simple_loss=0.8711, pruned_loss=0.8907, over 3365722.28 frames. ], batch size: 110, lr: 3.80e-02, grad_scale: 8.0
2023-06-23 08:24:31,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.56 vs. limit=8.625
2023-06-23 08:24:31,323 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=8.625
2023-06-23 08:24:38,900 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=8.625
2023-06-23 08:24:52,906 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=10.98 vs. limit=9.8
2023-06-23 08:24:57,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=3066.6666666666665, ans=5.766666666666667
2023-06-23 08:25:10,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=3133.3333333333335, ans=0.353125
2023-06-23 08:25:20,461 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=6.32 vs. limit=5.253333333333334
2023-06-23 08:25:22,451 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.29 vs. limit=8.7
2023-06-23 08:25:34,020 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.55 vs. limit=8.7
2023-06-23 08:25:36,022 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.93 vs. limit=5.8
2023-06-23 08:25:50,311 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.02 vs. limit=8.725
2023-06-23 08:25:54,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=3266.6666666666665, ans=0.02650000000000001
2023-06-23 08:25:59,213 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=24.34 vs. limit=8.75
2023-06-23 08:25:59,977 INFO [train.py:1008] (1/4) Epoch 1, batch 500, loss[loss=0.7411, simple_loss=0.6371, pruned_loss=0.5646, over 20288.00 frames. ], tot_loss[loss=0.9653, simple_loss=0.8314, pruned_loss=0.8308, over 3459353.06 frames. ], batch size: 149, lr: 4.00e-02, grad_scale: 8.0
2023-06-23 08:26:03,487 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.223e+02 1.809e+02 2.343e+02 3.380e+02 8.520e+02, threshold=4.686e+02, percent-clipped=21.0
2023-06-23 08:26:09,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=3333.3333333333335, ans=0.34375
2023-06-23 08:26:11,848 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=10.66 vs. limit=10.0
2023-06-23 08:26:13,612 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.68 vs. limit=8.75
2023-06-23 08:26:16,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=3400.0, ans=0.340625
2023-06-23 08:26:25,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=3400.0, ans=0.340625
2023-06-23 08:26:36,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=3466.6666666666665, ans=0.252
2023-06-23 08:26:46,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=3466.6666666666665, ans=0.3375
2023-06-23 08:26:56,131 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.08 vs. limit=10.15
2023-06-23 08:27:22,045 INFO [train.py:1008] (1/4) Epoch 2, batch 0, loss[loss=0.7396, simple_loss=0.6419, pruned_loss=0.539, over 19187.00 frames. ], tot_loss[loss=0.7396, simple_loss=0.6419, pruned_loss=0.539, over 19187.00 frames. ], batch size: 92, lr: 3.96e-02, grad_scale: 16.0
2023-06-23 08:27:22,045 INFO [train.py:1031] (1/4) Computing validation loss
2023-06-23 08:27:27,602 INFO [train.py:1040] (1/4) Epoch 2, validation: loss=0.696, simple_loss=0.6181, pruned_loss=0.4715, over 143649.00 frames.
2023-06-23 08:27:27,603 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB
2023-06-23 08:27:36,299 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=8.8325
2023-06-23 08:27:48,777 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.94 vs. limit=6.8100000000000005
2023-06-23 08:28:05,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=3686.6666666666665, ans=0.06174999999999997
2023-06-23 08:28:06,206 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.77 vs. limit=6.843333333333334
2023-06-23 08:28:07,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=3686.6666666666665, ans=0.32718749999999996
2023-06-23 08:28:24,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=3753.3333333333335, ans=0.05924999999999997
2023-06-23 08:28:30,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=3753.3333333333335, ans=0.07654166666666667
2023-06-23 08:29:02,414 INFO [train.py:1008] (1/4) Epoch 2, batch 50, loss[loss=0.6615, simple_loss=0.5766, pruned_loss=0.4661, over 20648.00 frames. ], tot_loss[loss=0.7207, simple_loss=0.6271, pruned_loss=0.5156, over 856118.09 frames. ], batch size: 211, lr: 3.95e-02, grad_scale: 16.0
2023-06-23 08:29:06,849 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.17 vs. limit=5.971666666666667
2023-06-23 08:29:09,320 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.24 vs. limit=5.971666666666667
2023-06-23 08:29:31,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=3953.3333333333335, ans=0.0058333333333333015
2023-06-23 08:29:38,275 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 2.585e+02 4.168e+02 5.706e+02 1.238e+03, threshold=8.337e+02, percent-clipped=41.0
2023-06-23 08:29:46,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=4020.0, ans=0.04991666666666667
2023-06-23 08:30:23,207 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.92 vs. limit=9.0575
2023-06-23 08:30:26,688 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.84 vs. limit=9.0575
2023-06-23 08:30:31,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=4153.333333333333, ans=0.3053125
2023-06-23 08:30:33,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=4153.333333333333, ans=0.7546333333333334
2023-06-23 08:30:37,517 INFO [train.py:1008] (1/4) Epoch 2, batch 100, loss[loss=0.6509, simple_loss=0.5783, pruned_loss=0.4263, over 18627.00 frames. ], tot_loss[loss=0.6909, simple_loss=0.6039, pruned_loss=0.4828, over 1522814.28 frames. ], batch size: 80, lr: 3.95e-02, grad_scale: 16.0
2023-06-23 08:30:42,117 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.96 vs. limit=6.055
2023-06-23 08:30:45,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=4220.0, ans=0.2633
2023-06-23 08:30:45,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=4220.0, ans=0.2578
2023-06-23 08:30:49,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=4220.0, ans=0.3021875
2023-06-23 08:31:19,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=4353.333333333333, ans=0.7476333333333334
2023-06-23 08:31:32,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=4420.0, ans=0.2663
2023-06-23 08:31:38,398 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.21 vs. limit=9.1575
2023-06-23 08:32:02,089 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.70 vs. limit=9.182500000000001
2023-06-23 08:32:03,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=4486.666666666667, ans=0.04797222222222222
2023-06-23 08:32:05,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=4486.666666666667, ans=0.009894202898550725
2023-06-23 08:32:10,569 INFO [train.py:1008] (1/4) Epoch 2, batch 150, loss[loss=0.5913, simple_loss=0.5279, pruned_loss=0.3767, over 19822.00 frames. ], tot_loss[loss=0.6672, simple_loss=0.5859, pruned_loss=0.4557, over 2034110.17 frames. ], batch size: 115, lr: 3.95e-02, grad_scale: 16.0
2023-06-23 08:32:10,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=4553.333333333333, ans=0.2865625
2023-06-23 08:32:15,824 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=11.12 vs. limit=10.915
2023-06-23 08:32:16,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=4553.333333333333, ans=0.04769444444444445
2023-06-23 08:32:42,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=4620.0, ans=0.2693
2023-06-23 08:32:47,166 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.685e+02 3.263e+02 5.019e+02 7.884e+02 1.752e+03, threshold=1.004e+03, percent-clipped=21.0
2023-06-23 08:32:55,628 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.93 vs. limit=9.2575
2023-06-23 08:33:01,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4686.666666666667, ans=0.2531333333333333
2023-06-23 08:33:03,788 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.92 vs. limit=9.2575
2023-06-23 08:33:42,516 INFO [train.py:1008] (1/4) Epoch 2, batch 200, loss[loss=0.5782, simple_loss=0.518, pruned_loss=0.3608, over 19490.00 frames. ], tot_loss[loss=0.6481, simple_loss=0.5715, pruned_loss=0.4337, over 2422128.29 frames. ], batch size: 105, lr: 3.95e-02, grad_scale: 8.0
2023-06-23 08:33:55,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=4886.666666666667, ans=0.04630555555555556
2023-06-23 08:34:03,121 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.05 vs. limit=9.3575
2023-06-23 08:34:29,928 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.94 vs. limit=6.008
2023-06-23 08:34:54,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=5153.333333333333, ans=0.2584375
2023-06-23 08:35:14,003 INFO [train.py:1008] (1/4) Epoch 2, batch 250, loss[loss=0.579, simple_loss=0.5216, pruned_loss=0.3525, over 19313.00 frames. ], tot_loss[loss=0.6307, simple_loss=0.5588, pruned_loss=0.4131, over 2720076.79 frames. ], batch size: 98, lr: 3.95e-02, grad_scale: 8.0
2023-06-23 08:35:31,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=5286.666666666667, ans=0.2521875
2023-06-23 08:35:34,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=5286.666666666667, ans=0.2521875
2023-06-23 08:35:45,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=5286.666666666667, ans=0.2521875
2023-06-23 08:35:50,030 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 3.360e+02 5.707e+02 8.306e+02 1.976e+03, threshold=1.141e+03, percent-clipped=19.0
2023-06-23 08:35:52,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=5353.333333333333, ans=0.044361111111111115
2023-06-23 08:36:10,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=5420.0, ans=0.24593749999999998
2023-06-23 08:36:27,816 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.80 vs. limit=9.557500000000001
2023-06-23 08:36:28,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=5486.666666666667, ans=0.2428125
2023-06-23 08:36:32,713 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=12.02 vs. limit=11.615
2023-06-23 08:36:41,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=5486.666666666667, ans=0.7079666666666666
2023-06-23 08:36:44,288 INFO [train.py:1008] (1/4) Epoch 2, batch 300, loss[loss=0.5616, simple_loss=0.5145, pruned_loss=0.3247, over 19091.00 frames. ], tot_loss[loss=0.6129, simple_loss=0.5455, pruned_loss=0.3934, over 2956150.09 frames. ], batch size: 94, lr: 3.95e-02, grad_scale: 8.0
2023-06-23 08:36:46,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=5553.333333333333, ans=0.7056333333333333
2023-06-23 08:37:18,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=5686.666666666667, ans=0.23343750000000002
2023-06-23 08:37:34,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=5686.666666666667, ans=0.03222916666666667
2023-06-23 08:37:50,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=5753.333333333333, ans=0.19246666666666667
2023-06-23 08:37:51,256 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.51 vs. limit=9.6575
2023-06-23 08:38:12,492 INFO [train.py:1008] (1/4) Epoch 2, batch 350, loss[loss=0.5338, simple_loss=0.4874, pruned_loss=0.3095, over 19317.00 frames. ], tot_loss[loss=0.5961, simple_loss=0.5334, pruned_loss=0.3746, over 3137379.47 frames. ], batch size: 98, lr: 3.95e-02, grad_scale: 8.0
2023-06-23 08:38:18,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=5886.666666666667, ans=0.2411333333333333
2023-06-23 08:38:40,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=5953.333333333333, ans=0.2209375
2023-06-23 08:38:48,692 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 3.515e+02 5.337e+02 9.069e+02 1.620e+03, threshold=1.067e+03, percent-clipped=11.0
2023-06-23 08:38:58,589 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.68 vs. limit=3.903
2023-06-23 08:39:30,307 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.81 vs. limit=9.807500000000001
2023-06-23 08:39:43,040 INFO [train.py:1008] (1/4) Epoch 2, batch 400, loss[loss=0.5575, simple_loss=0.4907, pruned_loss=0.3486, over 20114.00 frames. ], tot_loss[loss=0.5815, simple_loss=0.5228, pruned_loss=0.3586, over 3290732.08 frames. ], batch size: 239, lr: 3.95e-02, grad_scale: 16.0
2023-06-23 08:40:09,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=6286.666666666667, ans=0.04047222222222222
2023-06-23 08:40:31,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=6353.333333333333, ans=0.20218750000000002
2023-06-23 08:40:48,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=6420.0, ans=0.19906249999999998
2023-06-23 08:41:05,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_na.min_abs, batch_count=6486.666666666667, ans=0.02
2023-06-23 08:41:11,438 INFO [train.py:1008] (1/4) Epoch 2, batch 450, loss[loss=0.5539, simple_loss=0.5169, pruned_loss=0.3024, over 18316.00 frames. ], tot_loss[loss=0.5691, simple_loss=0.514, pruned_loss=0.3449, over 3399397.22 frames. ], batch size: 72, lr: 3.94e-02, grad_scale: 4.0
2023-06-23 08:41:18,204 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=9.9575
2023-06-23 08:41:20,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=6553.333333333333, ans=0.03936111111111111
2023-06-23 08:41:36,737 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.96 vs. limit=6.655
2023-06-23 08:41:50,626 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 2.476e+02 3.499e+02 6.035e+02 3.649e+03, threshold=6.998e+02, percent-clipped=11.0
2023-06-23 08:41:51,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=6686.666666666667, ans=0.23313333333333333
2023-06-23 08:41:57,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=6686.666666666667, ans=0.23313333333333333
2023-06-23 08:42:09,709 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=6.54 vs. limit=6.701333333333333
2023-06-23 08:42:23,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=6820.0, ans=0.038250000000000006
2023-06-23 08:42:26,724 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.88 vs. limit=6.705
2023-06-23 08:42:37,674 INFO [train.py:1008] (1/4) Epoch 2, batch 500, loss[loss=0.4949, simple_loss=0.4615, pruned_loss=0.27, over 19470.00 frames. ], tot_loss[loss=0.5542, simple_loss=0.5034, pruned_loss=0.3295, over 3489734.37 frames. ], batch size: 105, lr: 3.94e-02, grad_scale: 8.0
2023-06-23 08:42:44,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=6886.666666666667, ans=0.03797222222222223
2023-06-23 08:43:19,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=7020.0, ans=0.6543
2023-06-23 08:43:45,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=7100.0, ans=0.1671875
2023-06-23 08:43:52,970 INFO [train.py:1008] (1/4) Epoch 3, batch 0, loss[loss=0.53, simple_loss=0.4946, pruned_loss=0.2885, over 16710.00 frames. ], tot_loss[loss=0.53, simple_loss=0.4946, pruned_loss=0.2885, over 16710.00 frames. ], batch size: 59, lr: 3.84e-02, grad_scale: 16.0
2023-06-23 08:43:52,970 INFO [train.py:1031] (1/4) Computing validation loss
2023-06-23 08:43:58,465 INFO [train.py:1040] (1/4) Epoch 3, validation: loss=0.4015, simple_loss=0.4171, pruned_loss=0.1648, over 143649.00 frames.
2023-06-23 08:43:58,466 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB
2023-06-23 08:44:03,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=7100.0, ans=0.1671875
2023-06-23 08:44:24,469 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=5.86 vs. limit=8.583333333333334
2023-06-23 08:44:27,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=7166.666666666667, ans=0.22833333333333333
2023-06-23 08:44:27,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=7166.666666666667, ans=0.03680555555555556
2023-06-23 08:44:39,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=7233.333333333333, ans=0.1609375
2023-06-23 08:44:47,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=7233.333333333333, ans=0.22766666666666668
2023-06-23 08:44:56,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=7300.0, ans=0.22699999999999998
2023-06-23 08:45:09,081 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.685e+02 2.749e+02 5.394e+02 7.566e+02 2.513e+03, threshold=1.079e+03, percent-clipped=34.0
2023-06-23 08:45:23,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=7433.333333333333, ans=0.09899494936611666
2023-06-23 08:45:24,901 INFO [train.py:1008] (1/4) Epoch 3, batch 50, loss[loss=0.4744, simple_loss=0.4456, pruned_loss=0.2543, over 18600.00 frames. ], tot_loss[loss=0.4967, simple_loss=0.4631, pruned_loss=0.2706, over 839287.11 frames. ], batch size: 80, lr: 3.83e-02, grad_scale: 8.0
2023-06-23 08:45:37,470 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=14.24 vs. limit=13.075
2023-06-23 08:45:43,900 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.15 vs. limit=10.3125
2023-06-23 08:45:50,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=7500.0, ans=0.1484375
2023-06-23 08:45:58,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=7566.666666666667, ans=0.1453125
2023-06-23 08:46:35,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=7700.0, ans=0.0
2023-06-23 08:46:50,091 INFO [train.py:1008] (1/4) Epoch 3, batch 100, loss[loss=0.4934, simple_loss=0.4609, pruned_loss=0.2671, over 20294.00 frames. ], tot_loss[loss=0.4909, simple_loss=0.4586, pruned_loss=0.2661, over 1495939.35 frames. ], batch size: 149, lr: 3.83e-02, grad_scale: 8.0
2023-06-23 08:47:14,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=7833.333333333333, ans=0.03402777777777778
2023-06-23 08:47:39,141 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.36 vs. limit=6.991666666666667
2023-06-23 08:47:57,782 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.817e+02 3.094e+02 5.025e+02 7.016e+02 1.761e+03, threshold=1.005e+03, percent-clipped=8.0
2023-06-23 08:48:14,412 INFO [train.py:1008] (1/4) Epoch 3, batch 150, loss[loss=0.4688, simple_loss=0.4471, pruned_loss=0.2436, over 19831.00 frames. ], tot_loss[loss=0.4844, simple_loss=0.4546, pruned_loss=0.2601, over 1985001.14 frames. ], batch size: 120, lr: 3.83e-02, grad_scale: 8.0
2023-06-23 08:48:31,563 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.66 vs. limit=9.083333333333334
2023-06-23 08:48:34,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8166.666666666667, ans=0.21833333333333332
2023-06-23 08:48:48,122 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.86 vs. limit=13.675
2023-06-23 08:48:52,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=8233.333333333334, ans=0.6118333333333335
2023-06-23 08:49:22,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=8366.666666666666, ans=0.125
2023-06-23 08:49:25,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=8366.666666666666, ans=0.6071666666666667
2023-06-23 08:49:37,826 INFO [train.py:1008] (1/4) Epoch 3, batch 200, loss[loss=0.4392, simple_loss=0.4293, pruned_loss=0.2177, over 18311.00 frames. ], tot_loss[loss=0.4804, simple_loss=0.4524, pruned_loss=0.2562, over 2392627.70 frames. ], batch size: 74, lr: 3.83e-02, grad_scale: 8.0
2023-06-23 08:49:38,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=8433.333333333334, ans=0.125
2023-06-23 08:50:16,426 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-23 08:50:29,344 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.76 vs. limit=13.975000000000001
2023-06-23 08:50:45,448 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 3.128e+02 5.211e+02 8.719e+02 2.248e+03, threshold=1.042e+03, percent-clipped=14.0
2023-06-23 08:50:50,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=8700.0, ans=0.125
2023-06-23 08:51:00,815 INFO [train.py:1008] (1/4) Epoch 3, batch 250, loss[loss=0.4652, simple_loss=0.45, pruned_loss=0.2357, over 18310.00 frames. ], tot_loss[loss=0.4759, simple_loss=0.45, pruned_loss=0.2517, over 2707902.76 frames. ], batch size: 74, lr: 3.83e-02, grad_scale: 8.0
2023-06-23 08:51:07,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=8766.666666666666, ans=0.125
2023-06-23 08:51:47,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=8900.0, ans=0.125
2023-06-23 08:51:48,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=8900.0, ans=0.05
2023-06-23 08:52:16,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=9033.333333333334, ans=0.5838333333333334
2023-06-23 08:52:25,329 INFO [train.py:1008] (1/4) Epoch 3, batch 300, loss[loss=0.4406, simple_loss=0.4106, pruned_loss=0.2383, over 19889.00 frames. ], tot_loss[loss=0.4693, simple_loss=0.4455, pruned_loss=0.2465, over 2950851.79 frames. ], batch size: 294, lr: 3.82e-02, grad_scale: 8.0
2023-06-23 08:52:30,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9100.0, ans=0.20900000000000002
2023-06-23 08:52:54,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=9166.666666666666, ans=0.125
2023-06-23 08:53:08,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9233.333333333334, ans=0.20766666666666667
2023-06-23 08:53:19,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=9300.0, ans=0.20700000000000002
2023-06-23 08:53:21,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=9300.0, ans=0.125
2023-06-23 08:53:33,148 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 3.201e+02 4.697e+02 7.108e+02 1.461e+03, threshold=9.395e+02, percent-clipped=6.0
2023-06-23 08:53:36,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9366.666666666666, ans=0.20633333333333334
2023-06-23 08:53:43,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=9366.666666666666, ans=0.125
2023-06-23 08:53:43,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=9366.666666666666, ans=0.125
2023-06-23 08:53:47,514 INFO [train.py:1008] (1/4) Epoch 3, batch 350, loss[loss=0.4467, simple_loss=0.4368, pruned_loss=0.2232, over 19509.00 frames. ], tot_loss[loss=0.4641, simple_loss=0.4424, pruned_loss=0.242, over 3137010.22 frames. ], batch size: 105, lr: 3.82e-02, grad_scale: 8.0
2023-06-23 08:53:58,918 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.30 vs. limit=11.0375
2023-06-23 08:54:17,487 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.33 vs. limit=4.425
2023-06-23 08:54:20,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9566.666666666666, ans=0.20433333333333334
2023-06-23 08:54:20,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=9566.666666666666, ans=0.008789855072463769
2023-06-23 08:54:28,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=9566.666666666666, ans=0.125
2023-06-23 08:54:39,137 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.30 vs. limit=14.725
2023-06-23 08:54:58,564 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=11.1375
2023-06-23 08:55:12,714 INFO [train.py:1008] (1/4) Epoch 3, batch 400, loss[loss=0.4341, simple_loss=0.4215, pruned_loss=0.2198, over 19834.00 frames. ], tot_loss[loss=0.4595, simple_loss=0.4397, pruned_loss=0.2381, over 3268909.84 frames. ], batch size: 115, lr: 3.82e-02, grad_scale: 16.0
2023-06-23 08:55:23,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=9766.666666666666, ans=0.5581666666666667
2023-06-23 08:55:33,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=9833.333333333334, ans=0.09899494936611666
2023-06-23 08:56:10,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=9966.666666666666, ans=0.5511666666666667
2023-06-23 08:56:13,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=9966.666666666666, ans=0.05
2023-06-23 08:56:22,513 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 2.338e+02 3.683e+02 5.824e+02 1.434e+03, threshold=7.366e+02, percent-clipped=14.0
2023-06-23 08:56:26,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=10033.333333333334, ans=0.0
2023-06-23 08:56:37,671 INFO [train.py:1008] (1/4) Epoch 3, batch 450, loss[loss=0.4202, simple_loss=0.4236, pruned_loss=0.2009, over 18604.00 frames. ], tot_loss[loss=0.4539, simple_loss=0.4366, pruned_loss=0.2335, over 3386415.35 frames. ], batch size: 80, lr: 3.82e-02, grad_scale: 16.0
2023-06-23 08:57:10,988 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=15.96 vs. limit=15.175
2023-06-23 08:57:12,653 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.50 vs.
limit=4.535 2023-06-23 08:57:24,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=10233.333333333334, ans=0.125 2023-06-23 08:57:27,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=10300.0, ans=0.023750000000000004 2023-06-23 08:57:54,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=10366.666666666666, ans=0.0 2023-06-23 08:58:00,686 INFO [train.py:1008] (1/4) Epoch 3, batch 500, loss[loss=0.414, simple_loss=0.4152, pruned_loss=0.2006, over 19876.00 frames. ], tot_loss[loss=0.4484, simple_loss=0.4333, pruned_loss=0.2292, over 3475614.48 frames. ], batch size: 120, lr: 3.81e-02, grad_scale: 16.0 2023-06-23 08:59:14,631 INFO [train.py:1008] (1/4) Epoch 4, batch 0, loss[loss=0.4063, simple_loss=0.4106, pruned_loss=0.195, over 19225.00 frames. ], tot_loss[loss=0.4063, simple_loss=0.4106, pruned_loss=0.195, over 19225.00 frames. ], batch size: 92, lr: 3.66e-02, grad_scale: 32.0 2023-06-23 08:59:14,631 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 08:59:20,229 INFO [train.py:1040] (1/4) Epoch 4, validation: loss=0.3166, simple_loss=0.3753, pruned_loss=0.1114, over 143649.00 frames. 2023-06-23 08:59:20,230 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 08:59:32,965 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=11.4925 2023-06-23 08:59:36,445 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.697e+02 2.731e+02 4.204e+02 6.792e+02 1.679e+03, threshold=8.407e+02, percent-clipped=23.0 2023-06-23 09:00:05,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=10780.0, ans=0.125 2023-06-23 09:00:10,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=10846.666666666666, ans=0.125 2023-06-23 09:00:15,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=10846.666666666666, ans=0.125 2023-06-23 09:00:32,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=10913.333333333334, ans=0.125 2023-06-23 09:00:42,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=10980.0, ans=0.125 2023-06-23 09:00:43,457 INFO [train.py:1008] (1/4) Epoch 4, batch 50, loss[loss=0.4634, simple_loss=0.4642, pruned_loss=0.2263, over 17638.00 frames. ], tot_loss[loss=0.4205, simple_loss=0.4191, pruned_loss=0.2065, over 848960.90 frames. ], batch size: 67, lr: 3.66e-02, grad_scale: 16.0 2023-06-23 09:00:59,174 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.88 vs. 
limit=7.761666666666667 2023-06-23 09:01:32,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=11180.0, ans=0.008439130434782609 2023-06-23 09:01:52,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=11246.666666666666, ans=0.125 2023-06-23 09:01:57,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=11246.666666666666, ans=0.125 2023-06-23 09:02:07,116 INFO [train.py:1008] (1/4) Epoch 4, batch 100, loss[loss=0.4265, simple_loss=0.429, pruned_loss=0.2079, over 18289.00 frames. ], tot_loss[loss=0.4196, simple_loss=0.4182, pruned_loss=0.2064, over 1499404.09 frames. ], batch size: 74, lr: 3.66e-02, grad_scale: 16.0 2023-06-23 09:02:07,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=11313.333333333334, ans=0.008410144927536231 2023-06-23 09:02:14,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11313.333333333334, ans=0.18686666666666668 2023-06-23 09:02:19,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=11313.333333333334, ans=0.008410144927536231 2023-06-23 09:02:23,935 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.696e+02 2.593e+02 4.190e+02 6.338e+02 1.351e+03, threshold=8.380e+02, percent-clipped=12.0 2023-06-23 09:02:30,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=11380.0, ans=0.125 2023-06-23 09:02:37,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=11380.0, ans=0.125 2023-06-23 09:02:45,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=11446.666666666666, ans=0.018972222222222227 2023-06-23 09:03:29,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=11646.666666666666, ans=0.49236666666666673 2023-06-23 09:03:30,201 INFO [train.py:1008] (1/4) Epoch 4, batch 150, loss[loss=0.425, simple_loss=0.4327, pruned_loss=0.2046, over 16322.00 frames. ], tot_loss[loss=0.4157, simple_loss=0.4157, pruned_loss=0.204, over 2014017.73 frames. ], batch size: 52, lr: 3.66e-02, grad_scale: 16.0 2023-06-23 09:03:59,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=11713.333333333334, ans=0.125 2023-06-23 09:04:10,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=11780.0, ans=0.017583333333333333 2023-06-23 09:04:22,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=11846.666666666666, ans=0.125 2023-06-23 09:04:51,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=11913.333333333334, ans=0.125 2023-06-23 09:04:54,286 INFO [train.py:1008] (1/4) Epoch 4, batch 200, loss[loss=0.3768, simple_loss=0.3897, pruned_loss=0.1785, over 19218.00 frames. ], tot_loss[loss=0.4122, simple_loss=0.4137, pruned_loss=0.2019, over 2408853.40 frames. 
], batch size: 92, lr: 3.65e-02, grad_scale: 16.0 2023-06-23 09:05:06,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=11980.0, ans=0.4807 2023-06-23 09:05:09,270 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.69 vs. limit=4.797 2023-06-23 09:05:11,079 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.595e+02 3.697e+02 5.580e+02 1.285e+03, threshold=7.394e+02, percent-clipped=7.0 2023-06-23 09:05:14,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=12046.666666666666, ans=0.17953333333333332 2023-06-23 09:05:27,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=12113.333333333334, ans=0.4760333333333333 2023-06-23 09:05:39,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=12113.333333333334, ans=0.125 2023-06-23 09:05:41,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=12113.333333333334, ans=0.04949747468305833 2023-06-23 09:06:06,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=12246.666666666666, ans=0.015638888888888897 2023-06-23 09:06:12,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=12246.666666666666, ans=0.17753333333333332 2023-06-23 09:06:19,932 INFO [train.py:1008] (1/4) Epoch 4, batch 250, loss[loss=0.3965, simple_loss=0.3984, pruned_loss=0.1956, over 20598.00 frames. ], tot_loss[loss=0.4097, simple_loss=0.4114, pruned_loss=0.2009, over 2710934.62 frames. ], batch size: 189, lr: 3.65e-02, grad_scale: 16.0 2023-06-23 09:06:59,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=12446.666666666666, ans=0.014805555555555558 2023-06-23 09:07:08,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=12513.333333333334, ans=0.00814927536231884 2023-06-23 09:07:44,453 INFO [train.py:1008] (1/4) Epoch 4, batch 300, loss[loss=0.3973, simple_loss=0.38, pruned_loss=0.2071, over 19809.00 frames. ], tot_loss[loss=0.4061, simple_loss=0.4101, pruned_loss=0.1983, over 2947871.29 frames. ], batch size: 293, lr: 3.65e-02, grad_scale: 16.0 2023-06-23 09:07:45,522 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.84 vs. limit=12.2425 2023-06-23 09:08:00,651 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.988e+02 4.215e+02 5.995e+02 1.282e+03, threshold=8.430e+02, percent-clipped=15.0 2023-06-23 09:08:02,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=12713.333333333334, ans=0.04949747468305833 2023-06-23 09:09:09,412 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.36 vs. limit=8.245000000000001 2023-06-23 09:09:09,841 INFO [train.py:1008] (1/4) Epoch 4, batch 350, loss[loss=0.3635, simple_loss=0.3883, pruned_loss=0.1682, over 19667.00 frames. 
], tot_loss[loss=0.4023, simple_loss=0.4087, pruned_loss=0.1956, over 3135488.92 frames. ], batch size: 110, lr: 3.64e-02, grad_scale: 16.0 2023-06-23 09:09:24,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=13046.666666666666, ans=0.125 2023-06-23 09:09:57,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=13113.333333333334, ans=0.16886666666666666 2023-06-23 09:10:02,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=13180.0, ans=0.0 2023-06-23 09:10:34,561 INFO [train.py:1008] (1/4) Epoch 4, batch 400, loss[loss=0.3826, simple_loss=0.3945, pruned_loss=0.1853, over 19957.00 frames. ], tot_loss[loss=0.3985, simple_loss=0.4065, pruned_loss=0.1934, over 3291666.81 frames. ], batch size: 126, lr: 3.64e-02, grad_scale: 32.0 2023-06-23 09:10:51,312 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.848e+02 2.547e+02 3.557e+02 5.280e+02 1.006e+03, threshold=7.113e+02, percent-clipped=4.0 2023-06-23 09:11:29,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=13513.333333333334, ans=0.010361111111111106 2023-06-23 09:11:57,232 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.18 vs. limit=12.592500000000001 2023-06-23 09:11:59,703 INFO [train.py:1008] (1/4) Epoch 4, batch 450, loss[loss=0.3436, simple_loss=0.3738, pruned_loss=0.1567, over 19468.00 frames. ], tot_loss[loss=0.395, simple_loss=0.4047, pruned_loss=0.1912, over 3411245.01 frames. ], batch size: 105, lr: 3.64e-02, grad_scale: 32.0 2023-06-23 09:12:49,712 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.02 vs. limit=5.0 2023-06-23 09:13:14,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=13913.333333333334, ans=0.125 2023-06-23 09:13:18,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=13913.333333333334, ans=0.125 2023-06-23 09:13:18,615 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:13:21,467 INFO [train.py:1008] (1/4) Epoch 4, batch 500, loss[loss=0.3824, simple_loss=0.4032, pruned_loss=0.1808, over 19084.00 frames. ], tot_loss[loss=0.3914, simple_loss=0.4037, pruned_loss=0.1885, over 3494577.65 frames. ], batch size: 89, lr: 3.63e-02, grad_scale: 32.0 2023-06-23 09:13:24,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=13980.0, ans=0.125 2023-06-23 09:13:26,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=13980.0, ans=0.125 2023-06-23 09:13:33,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=13980.0, ans=0.125 2023-06-23 09:13:35,038 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.58 vs. 
limit=12.7425 2023-06-23 09:13:37,511 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 3.041e+02 4.268e+02 6.398e+02 1.193e+03, threshold=8.536e+02, percent-clipped=18.0 2023-06-23 09:13:38,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=14046.666666666666, ans=0.125 2023-06-23 09:13:44,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=14046.666666666666, ans=0.125 2023-06-23 09:14:38,024 INFO [train.py:1008] (1/4) Epoch 5, batch 0, loss[loss=0.3759, simple_loss=0.3977, pruned_loss=0.177, over 20118.00 frames. ], tot_loss[loss=0.3759, simple_loss=0.3977, pruned_loss=0.177, over 20118.00 frames. ], batch size: 133, lr: 3.47e-02, grad_scale: 32.0 2023-06-23 09:14:38,024 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 09:14:43,509 INFO [train.py:1040] (1/4) Epoch 5, validation: loss=0.2696, simple_loss=0.3565, pruned_loss=0.09131, over 143649.00 frames. 2023-06-23 09:14:43,509 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 09:15:11,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=14260.0, ans=0.0072499999999999995 2023-06-23 09:15:32,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=14393.333333333334, ans=0.025 2023-06-23 09:15:34,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=14393.333333333334, ans=0.125 2023-06-23 09:15:42,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=14393.333333333334, ans=0.125 2023-06-23 09:16:07,690 INFO [train.py:1008] (1/4) Epoch 5, batch 50, loss[loss=0.3775, simple_loss=0.3924, pruned_loss=0.1813, over 20560.00 frames. ], tot_loss[loss=0.375, simple_loss=0.3934, pruned_loss=0.1783, over 868083.59 frames. ], batch size: 189, lr: 3.46e-02, grad_scale: 32.0 2023-06-23 09:16:13,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=14526.666666666666, ans=0.007711594202898551 2023-06-23 09:16:27,105 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.76 vs. limit=18.445 2023-06-23 09:16:53,634 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.58 vs. 
limit=12.997499999999999 2023-06-23 09:16:55,798 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.912e+02 2.663e+02 3.585e+02 4.767e+02 1.366e+03, threshold=7.170e+02, percent-clipped=3.0 2023-06-23 09:17:05,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=14726.666666666666, ans=0.05 2023-06-23 09:17:12,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=14726.666666666666, ans=0.007668115942028986 2023-06-23 09:17:14,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=14726.666666666666, ans=0.07 2023-06-23 09:17:17,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=14793.333333333334, ans=0.005027777777777777 2023-06-23 09:17:33,192 INFO [train.py:1008] (1/4) Epoch 5, batch 100, loss[loss=0.3854, simple_loss=0.3878, pruned_loss=0.1915, over 20270.00 frames. ], tot_loss[loss=0.3706, simple_loss=0.3918, pruned_loss=0.1747, over 1524242.29 frames. ], batch size: 239, lr: 3.46e-02, grad_scale: 32.0 2023-06-23 09:18:04,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=14926.666666666666, ans=0.3775666666666667 2023-06-23 09:18:09,020 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.49 vs. limit=13.122499999999999 2023-06-23 09:18:17,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=14993.333333333334, ans=0.125 2023-06-23 09:18:21,653 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.14 vs. limit=18.744999999999997 2023-06-23 09:18:29,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=15060.0, ans=0.0039166666666666725 2023-06-23 09:18:37,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=15060.0, ans=0.0039166666666666725 2023-06-23 09:18:41,995 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.96 vs. limit=13.1725 2023-06-23 09:18:59,153 INFO [train.py:1008] (1/4) Epoch 5, batch 150, loss[loss=0.3418, simple_loss=0.3752, pruned_loss=0.1542, over 19109.00 frames. ], tot_loss[loss=0.3711, simple_loss=0.3929, pruned_loss=0.1746, over 2017128.69 frames. ], batch size: 94, lr: 3.46e-02, grad_scale: 32.0 2023-06-23 09:19:18,870 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.46 vs. 
limit=5.289 2023-06-23 09:19:28,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=15260.0, ans=0.125 2023-06-23 09:19:46,746 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.580e+02 3.352e+02 4.730e+02 1.164e+03, threshold=6.705e+02, percent-clipped=8.0 2023-06-23 09:19:53,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=15393.333333333334, ans=0.125 2023-06-23 09:19:59,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=15393.333333333334, ans=0.125 2023-06-23 09:20:24,365 INFO [train.py:1008] (1/4) Epoch 5, batch 200, loss[loss=0.3733, simple_loss=0.4003, pruned_loss=0.1731, over 19318.00 frames. ], tot_loss[loss=0.3698, simple_loss=0.3921, pruned_loss=0.1737, over 2403929.70 frames. ], batch size: 98, lr: 3.45e-02, grad_scale: 32.0 2023-06-23 09:20:29,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=15526.666666666666, ans=0.035 2023-06-23 09:20:45,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=15593.333333333334, ans=0.0016944444444444429 2023-06-23 09:20:51,989 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:20:53,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=15593.333333333334, ans=0.3542333333333333 2023-06-23 09:21:01,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=15660.0, ans=0.3519 2023-06-23 09:21:24,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=15726.666666666666, ans=0.125 2023-06-23 09:21:28,486 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.69 vs. limit=8.931666666666667 2023-06-23 09:21:50,789 INFO [train.py:1008] (1/4) Epoch 5, batch 250, loss[loss=0.3875, simple_loss=0.397, pruned_loss=0.189, over 20515.00 frames. ], tot_loss[loss=0.3682, simple_loss=0.3906, pruned_loss=0.1728, over 2722641.58 frames. ], batch size: 189, lr: 3.45e-02, grad_scale: 32.0 2023-06-23 09:21:51,960 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.02 vs. limit=13.4475 2023-06-23 09:22:39,221 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.807e+02 3.851e+02 5.722e+02 1.154e+03, threshold=7.701e+02, percent-clipped=13.0 2023-06-23 09:22:49,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=16060.0, ans=0.125 2023-06-23 09:23:05,920 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.92 vs. limit=13.5475 2023-06-23 09:23:15,796 INFO [train.py:1008] (1/4) Epoch 5, batch 300, loss[loss=0.3516, simple_loss=0.3858, pruned_loss=0.1587, over 19874.00 frames. ], tot_loss[loss=0.3663, simple_loss=0.3899, pruned_loss=0.1714, over 2955957.19 frames. 
], batch size: 120, lr: 3.45e-02, grad_scale: 32.0 2023-06-23 09:23:26,863 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:24:41,020 INFO [train.py:1008] (1/4) Epoch 5, batch 350, loss[loss=0.3692, simple_loss=0.3901, pruned_loss=0.1742, over 20339.00 frames. ], tot_loss[loss=0.3659, simple_loss=0.3897, pruned_loss=0.171, over 3130739.73 frames. ], batch size: 149, lr: 3.44e-02, grad_scale: 32.0 2023-06-23 09:24:41,958 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.63 vs. limit=13.697500000000002 2023-06-23 09:24:52,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=16526.666666666668, ans=0.0847333333333333 2023-06-23 09:25:01,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=16593.333333333332, ans=0.31923333333333337 2023-06-23 09:25:10,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=16593.333333333332, ans=0.125 2023-06-23 09:25:10,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=16593.333333333332, ans=0.04949747468305833 2023-06-23 09:25:19,326 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.47 vs. limit=13.747499999999999 2023-06-23 09:25:19,332 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=13.747499999999999 2023-06-23 09:25:22,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=16660.0, ans=0.007247826086956522 2023-06-23 09:25:22,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=16660.0, ans=0.125 2023-06-23 09:25:28,107 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.487e+02 3.376e+02 4.312e+02 1.022e+03, threshold=6.752e+02, percent-clipped=3.0 2023-06-23 09:25:34,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=16726.666666666668, ans=0.125 2023-06-23 09:25:48,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=16793.333333333332, ans=0.125 2023-06-23 09:25:48,637 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.41 vs. limit=13.7975 2023-06-23 09:25:56,536 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=13.7975 2023-06-23 09:26:06,038 INFO [train.py:1008] (1/4) Epoch 5, batch 400, loss[loss=0.3719, simple_loss=0.3929, pruned_loss=0.1755, over 19773.00 frames. ], tot_loss[loss=0.3635, simple_loss=0.3875, pruned_loss=0.1698, over 3284798.95 frames. 
], batch size: 115, lr: 3.44e-02, grad_scale: 32.0 2023-06-23 09:26:34,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=16926.666666666668, ans=0.125 2023-06-23 09:26:55,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=17060.0, ans=0.0 2023-06-23 09:27:32,064 INFO [train.py:1008] (1/4) Epoch 5, batch 450, loss[loss=0.3338, simple_loss=0.3726, pruned_loss=0.1475, over 18799.00 frames. ], tot_loss[loss=0.3615, simple_loss=0.3869, pruned_loss=0.168, over 3397780.69 frames. ], batch size: 83, lr: 3.44e-02, grad_scale: 32.0 2023-06-23 09:27:35,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=17193.333333333332, ans=0.125 2023-06-23 09:27:43,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=17193.333333333332, ans=0.125 2023-06-23 09:28:04,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=17326.666666666668, ans=0.125 2023-06-23 09:28:17,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=17326.666666666668, ans=0.125 2023-06-23 09:28:19,060 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.791e+02 2.771e+02 3.568e+02 4.658e+02 8.532e+02, threshold=7.136e+02, percent-clipped=9.0 2023-06-23 09:28:32,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=17393.333333333332, ans=0.125 2023-06-23 09:28:38,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=17460.0, ans=0.125 2023-06-23 09:28:48,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=17460.0, ans=0.0 2023-06-23 09:28:53,951 INFO [train.py:1008] (1/4) Epoch 5, batch 500, loss[loss=0.3368, simple_loss=0.3708, pruned_loss=0.1514, over 19330.00 frames. ], tot_loss[loss=0.3597, simple_loss=0.3859, pruned_loss=0.1667, over 3494690.19 frames. ], batch size: 98, lr: 3.43e-02, grad_scale: 32.0 2023-06-23 09:29:05,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=17526.666666666668, ans=0.28656666666666675 2023-06-23 09:29:13,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=17593.333333333332, ans=0.07 2023-06-23 09:29:21,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=17593.333333333332, ans=0.125 2023-06-23 09:29:22,496 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=8.24 vs. 
limit=9.398333333333333 2023-06-23 09:29:33,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=17660.0, ans=0.0 2023-06-23 09:29:33,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=17660.0, ans=0.0 2023-06-23 09:30:09,879 INFO [train.py:1008] (1/4) Epoch 6, batch 0, loss[loss=0.3446, simple_loss=0.3768, pruned_loss=0.1562, over 20304.00 frames. ], tot_loss[loss=0.3446, simple_loss=0.3768, pruned_loss=0.1562, over 20304.00 frames. ], batch size: 141, lr: 3.27e-02, grad_scale: 32.0 2023-06-23 09:30:09,880 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 09:30:15,555 INFO [train.py:1040] (1/4) Epoch 6, validation: loss=0.257, simple_loss=0.3485, pruned_loss=0.08271, over 143649.00 frames. 2023-06-23 09:30:15,556 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 09:30:42,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=17813.333333333332, ans=0.125 2023-06-23 09:30:59,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=17880.0, ans=0.2742 2023-06-23 09:31:06,781 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:31:08,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=17946.666666666668, ans=0.2718666666666667 2023-06-23 09:31:31,229 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.890e+02 3.718e+02 5.042e+02 9.661e+02, threshold=7.435e+02, percent-clipped=9.0 2023-06-23 09:31:40,608 INFO [train.py:1008] (1/4) Epoch 6, batch 50, loss[loss=0.342, simple_loss=0.3708, pruned_loss=0.1565, over 20279.00 frames. ], tot_loss[loss=0.3495, simple_loss=0.3799, pruned_loss=0.1596, over 864048.44 frames. ], batch size: 141, lr: 3.26e-02, grad_scale: 32.0 2023-06-23 09:32:29,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=18280.0, ans=0.125 2023-06-23 09:32:37,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=18280.0, ans=0.125 2023-06-23 09:33:03,853 INFO [train.py:1008] (1/4) Epoch 6, batch 100, loss[loss=0.3444, simple_loss=0.3852, pruned_loss=0.1518, over 19218.00 frames. ], tot_loss[loss=0.3505, simple_loss=0.3816, pruned_loss=0.1597, over 1500917.49 frames. ], batch size: 92, lr: 3.26e-02, grad_scale: 32.0 2023-06-23 09:33:06,115 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.42 vs. limit=21.310000000000002 2023-06-23 09:33:09,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=18413.333333333332, ans=0.2555333333333334 2023-06-23 09:33:35,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.34 vs. limit=21.36 2023-06-23 09:33:36,963 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.14 vs. 
limit=21.41 2023-06-23 09:33:59,014 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.05 vs. limit=21.46 2023-06-23 09:34:10,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=18680.0, ans=0.006808695652173913 2023-06-23 09:34:12,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=18680.0, ans=0.125 2023-06-23 09:34:20,199 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.573e+02 3.085e+02 4.128e+02 9.767e+02, threshold=6.169e+02, percent-clipped=1.0 2023-06-23 09:34:28,055 INFO [train.py:1008] (1/4) Epoch 6, batch 150, loss[loss=0.3473, simple_loss=0.3865, pruned_loss=0.1541, over 18257.00 frames. ], tot_loss[loss=0.3476, simple_loss=0.3794, pruned_loss=0.1579, over 2000802.33 frames. ], batch size: 74, lr: 3.25e-02, grad_scale: 32.0 2023-06-23 09:34:42,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=18746.666666666668, ans=0.125 2023-06-23 09:35:00,143 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:35:09,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.64 vs. limit=14.58 2023-06-23 09:35:19,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=18946.666666666668, ans=0.0 2023-06-23 09:35:29,045 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.85 vs. limit=11.578666666666667 2023-06-23 09:35:52,857 INFO [train.py:1008] (1/4) Epoch 6, batch 200, loss[loss=0.3366, simple_loss=0.3636, pruned_loss=0.1548, over 20567.00 frames. ], tot_loss[loss=0.3466, simple_loss=0.3785, pruned_loss=0.1573, over 2395228.65 frames. ], batch size: 173, lr: 3.25e-02, grad_scale: 32.0 2023-06-23 09:35:56,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=19080.0, ans=0.125 2023-06-23 09:36:17,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=19146.666666666668, ans=0.125 2023-06-23 09:36:22,830 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.85 vs. limit=14.68 2023-06-23 09:36:26,049 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.44 vs. limit=21.91 2023-06-23 09:36:27,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=19213.333333333332, ans=0.125 2023-06-23 09:37:08,622 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.678e+02 3.588e+02 4.915e+02 9.806e+02, threshold=7.176e+02, percent-clipped=9.0 2023-06-23 09:37:11,852 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.09 vs. 
limit=22.009999999999998 2023-06-23 09:37:17,043 INFO [train.py:1008] (1/4) Epoch 6, batch 250, loss[loss=0.3374, simple_loss=0.3728, pruned_loss=0.151, over 19656.00 frames. ], tot_loss[loss=0.3463, simple_loss=0.3781, pruned_loss=0.1573, over 2698093.90 frames. ], batch size: 110, lr: 3.25e-02, grad_scale: 32.0 2023-06-23 09:37:20,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=19413.333333333332, ans=10.0 2023-06-23 09:37:34,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=19480.0, ans=0.10520000000000002 2023-06-23 09:37:44,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=19480.0, ans=0.125 2023-06-23 09:38:12,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=19613.333333333332, ans=0.0 2023-06-23 09:38:19,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=19613.333333333332, ans=0.21353333333333346 2023-06-23 09:38:21,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19613.333333333332, ans=0.10386666666666669 2023-06-23 09:38:23,829 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.86 vs. limit=9.92 2023-06-23 09:38:31,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=19680.0, ans=0.07 2023-06-23 09:38:41,159 INFO [train.py:1008] (1/4) Epoch 6, batch 300, loss[loss=0.3133, simple_loss=0.3544, pruned_loss=0.1361, over 19125.00 frames. ], tot_loss[loss=0.3454, simple_loss=0.3775, pruned_loss=0.1566, over 2920432.72 frames. ], batch size: 94, lr: 3.24e-02, grad_scale: 32.0 2023-06-23 09:38:46,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=19746.666666666668, ans=0.0 2023-06-23 09:38:56,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=19813.333333333332, ans=0.125 2023-06-23 09:39:10,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=19813.333333333332, ans=0.0 2023-06-23 09:39:18,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=19880.0, ans=0.125 2023-06-23 09:39:30,494 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.09 vs. 
limit=14.98 2023-06-23 09:39:34,601 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:39:46,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=20013.333333333332, ans=0.1 2023-06-23 09:39:51,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=20013.333333333332, ans=0.0 2023-06-23 09:39:56,400 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.604e+02 3.378e+02 4.786e+02 8.678e+02, threshold=6.756e+02, percent-clipped=6.0 2023-06-23 09:40:05,064 INFO [train.py:1008] (1/4) Epoch 6, batch 350, loss[loss=0.3462, simple_loss=0.3813, pruned_loss=0.1556, over 19058.00 frames. ], tot_loss[loss=0.3439, simple_loss=0.3764, pruned_loss=0.1557, over 3120321.46 frames. ], batch size: 89, lr: 3.24e-02, grad_scale: 32.0 2023-06-23 09:40:06,062 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.97 vs. limit=15.0 2023-06-23 09:40:23,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=20146.666666666668, ans=0.125 2023-06-23 09:40:35,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=20146.666666666668, ans=0.125 2023-06-23 09:40:45,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=20213.333333333332, ans=0.125 2023-06-23 09:40:51,141 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.24 vs. limit=15.0 2023-06-23 09:41:06,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=20280.0, ans=0.0 2023-06-23 09:41:17,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=20346.666666666668, ans=0.5 2023-06-23 09:41:24,721 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.21 vs. limit=22.5 2023-06-23 09:41:28,293 INFO [train.py:1008] (1/4) Epoch 6, batch 400, loss[loss=0.3182, simple_loss=0.3627, pruned_loss=0.1369, over 19829.00 frames. ], tot_loss[loss=0.3427, simple_loss=0.3763, pruned_loss=0.1546, over 3275019.40 frames. ], batch size: 115, lr: 3.24e-02, grad_scale: 32.0 2023-06-23 09:41:30,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=20413.333333333332, ans=0.2 2023-06-23 09:42:25,953 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=33.39 vs. limit=22.5 2023-06-23 09:42:30,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=20613.333333333332, ans=0.125 2023-06-23 09:42:32,649 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.98 vs. 
limit=22.5 2023-06-23 09:42:37,102 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.72 vs. limit=15.0 2023-06-23 09:42:42,559 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.754e+02 2.860e+02 3.612e+02 4.527e+02 7.434e+02, threshold=7.224e+02, percent-clipped=4.0 2023-06-23 09:42:52,415 INFO [train.py:1008] (1/4) Epoch 6, batch 450, loss[loss=0.3277, simple_loss=0.3637, pruned_loss=0.1459, over 18804.00 frames. ], tot_loss[loss=0.3417, simple_loss=0.3754, pruned_loss=0.154, over 3370802.32 frames. ], batch size: 83, lr: 3.23e-02, grad_scale: 32.0 2023-06-23 09:42:59,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=20746.666666666668, ans=0.006359420289855073 2023-06-23 09:43:02,750 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=5.54 vs. limit=12.0 2023-06-23 09:43:38,852 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.22 vs. limit=15.0 2023-06-23 09:43:46,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=20946.666666666668, ans=0.95 2023-06-23 09:44:09,189 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.12 vs. limit=15.0 2023-06-23 09:44:12,765 INFO [train.py:1008] (1/4) Epoch 6, batch 500, loss[loss=0.3733, simple_loss=0.4129, pruned_loss=0.1668, over 16097.00 frames. ], tot_loss[loss=0.3418, simple_loss=0.3754, pruned_loss=0.1541, over 3464126.09 frames. ], batch size: 51, lr: 3.23e-02, grad_scale: 32.0 2023-06-23 09:44:16,594 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.64 vs. limit=6.0 2023-06-23 09:44:18,341 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.81 vs. limit=22.5 2023-06-23 09:44:20,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=21080.0, ans=0.125 2023-06-23 09:44:33,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=21146.666666666668, ans=0.006272463768115942 2023-06-23 09:44:36,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=21146.666666666668, ans=0.2 2023-06-23 09:44:43,917 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=15.0 2023-06-23 09:44:52,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=21213.333333333332, ans=0.1 2023-06-23 09:45:26,850 INFO [train.py:1008] (1/4) Epoch 7, batch 0, loss[loss=0.3478, simple_loss=0.3795, pruned_loss=0.158, over 19206.00 frames. ], tot_loss[loss=0.3478, simple_loss=0.3795, pruned_loss=0.158, over 19206.00 frames. 
], batch size: 92, lr: 3.07e-02, grad_scale: 32.0 2023-06-23 09:45:26,850 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 09:45:32,635 INFO [train.py:1040] (1/4) Epoch 7, validation: loss=0.2465, simple_loss=0.3396, pruned_loss=0.07665, over 143649.00 frames. 2023-06-23 09:45:32,636 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 09:45:38,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=21300.0, ans=0.125 2023-06-23 09:45:51,173 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.42 vs. limit=10.0 2023-06-23 09:45:53,228 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.639e+02 3.413e+02 4.875e+02 7.305e+02, threshold=6.826e+02, percent-clipped=1.0 2023-06-23 09:46:20,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=21433.333333333332, ans=0.006210144927536233 2023-06-23 09:46:33,354 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-23 09:46:42,331 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.85 vs. limit=6.0 2023-06-23 09:46:56,425 INFO [train.py:1008] (1/4) Epoch 7, batch 50, loss[loss=0.3294, simple_loss=0.3669, pruned_loss=0.146, over 18448.00 frames. ], tot_loss[loss=0.3301, simple_loss=0.3698, pruned_loss=0.1452, over 850573.47 frames. ], batch size: 77, lr: 3.07e-02, grad_scale: 32.0 2023-06-23 09:47:01,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=21633.333333333332, ans=0.0 2023-06-23 09:48:00,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=21833.333333333332, ans=0.1 2023-06-23 09:48:06,929 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:48:07,608 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.89 vs. limit=6.0 2023-06-23 09:48:14,310 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.05 vs. limit=22.5 2023-06-23 09:48:19,723 INFO [train.py:1008] (1/4) Epoch 7, batch 100, loss[loss=0.3114, simple_loss=0.3551, pruned_loss=0.1338, over 18621.00 frames. ], tot_loss[loss=0.3305, simple_loss=0.3663, pruned_loss=0.1473, over 1493929.36 frames. ], batch size: 80, lr: 3.06e-02, grad_scale: 32.0 2023-06-23 09:48:39,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=22033.333333333332, ans=0.0060797101449275364 2023-06-23 09:48:40,163 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.721e+02 3.392e+02 4.507e+02 8.334e+02, threshold=6.785e+02, percent-clipped=4.0 2023-06-23 09:48:41,472 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.17 vs. 
limit=22.5 2023-06-23 09:49:12,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=22166.666666666668, ans=0.125 2023-06-23 09:49:39,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=22233.333333333332, ans=0.125 2023-06-23 09:49:42,089 INFO [train.py:1008] (1/4) Epoch 7, batch 150, loss[loss=0.3019, simple_loss=0.3446, pruned_loss=0.1296, over 10878.00 frames. ], tot_loss[loss=0.3307, simple_loss=0.3668, pruned_loss=0.1473, over 1992199.61 frames. ], batch size: 30, lr: 3.06e-02, grad_scale: 32.0 2023-06-23 09:49:44,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=22300.0, ans=0.125 2023-06-23 09:49:59,095 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-06-23 09:50:31,373 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=15.59 vs. limit=15.0 2023-06-23 09:51:00,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=22566.666666666668, ans=0.0 2023-06-23 09:51:05,079 INFO [train.py:1008] (1/4) Epoch 7, batch 200, loss[loss=0.3454, simple_loss=0.3934, pruned_loss=0.1487, over 16413.00 frames. ], tot_loss[loss=0.3307, simple_loss=0.3677, pruned_loss=0.1468, over 2387775.62 frames. ], batch size: 52, lr: 3.05e-02, grad_scale: 32.0 2023-06-23 09:51:05,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=22633.333333333332, ans=0.0 2023-06-23 09:51:15,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=22633.333333333332, ans=0.005949275362318841 2023-06-23 09:51:15,781 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.75 vs. limit=22.5 2023-06-23 09:51:22,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=22700.0, ans=0.0059347826086956525 2023-06-23 09:51:26,010 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.559e+02 3.231e+02 3.874e+02 6.394e+02, threshold=6.463e+02, percent-clipped=0.0 2023-06-23 09:52:10,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=22900.0, ans=0.2 2023-06-23 09:52:14,596 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.98 vs. limit=22.5 2023-06-23 09:52:23,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=22900.0, ans=0.2 2023-06-23 09:52:27,984 INFO [train.py:1008] (1/4) Epoch 7, batch 250, loss[loss=0.3156, simple_loss=0.3581, pruned_loss=0.1366, over 19111.00 frames. ], tot_loss[loss=0.3314, simple_loss=0.3673, pruned_loss=0.1477, over 2708295.83 frames. ], batch size: 94, lr: 3.05e-02, grad_scale: 32.0 2023-06-23 09:52:50,253 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.69 vs. 
limit=15.0 2023-06-23 09:52:52,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=23033.333333333332, ans=0.0 2023-06-23 09:53:19,566 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.37 vs. limit=15.0 2023-06-23 09:53:36,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=23233.333333333332, ans=0.005818840579710146 2023-06-23 09:53:46,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=23233.333333333332, ans=0.2 2023-06-23 09:53:50,822 INFO [train.py:1008] (1/4) Epoch 7, batch 300, loss[loss=0.3437, simple_loss=0.3735, pruned_loss=0.1569, over 20298.00 frames. ], tot_loss[loss=0.3323, simple_loss=0.3682, pruned_loss=0.1482, over 2938626.13 frames. ], batch size: 141, lr: 3.05e-02, grad_scale: 32.0 2023-06-23 09:54:03,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=23300.0, ans=0.0 2023-06-23 09:54:09,416 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.661e+02 2.566e+02 3.013e+02 4.099e+02 6.159e+02, threshold=6.026e+02, percent-clipped=0.0 2023-06-23 09:54:17,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=23366.666666666668, ans=0.1 2023-06-23 09:54:21,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=23366.666666666668, ans=0.125 2023-06-23 09:54:24,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=23433.333333333332, ans=0.125 2023-06-23 09:55:13,437 INFO [train.py:1008] (1/4) Epoch 7, batch 350, loss[loss=0.3359, simple_loss=0.3561, pruned_loss=0.1579, over 20288.00 frames. ], tot_loss[loss=0.3305, simple_loss=0.3661, pruned_loss=0.1475, over 3122003.88 frames. ], batch size: 239, lr: 3.04e-02, grad_scale: 32.0 2023-06-23 09:55:17,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=23633.333333333332, ans=0.125 2023-06-23 09:55:23,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=23633.333333333332, ans=0.005731884057971015 2023-06-23 09:55:33,723 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.11 vs. limit=15.0 2023-06-23 09:55:39,370 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.37 vs. limit=15.0 2023-06-23 09:55:42,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=23700.0, ans=0.0 2023-06-23 09:56:36,502 INFO [train.py:1008] (1/4) Epoch 7, batch 400, loss[loss=0.2977, simple_loss=0.3509, pruned_loss=0.1222, over 18309.00 frames. ], tot_loss[loss=0.3299, simple_loss=0.3662, pruned_loss=0.1469, over 3272747.90 frames. 
], batch size: 74, lr: 3.04e-02, grad_scale: 32.0 2023-06-23 09:56:55,718 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.878e+02 2.370e+02 2.886e+02 3.873e+02 7.478e+02, threshold=5.773e+02, percent-clipped=7.0 2023-06-23 09:56:57,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24033.333333333332, ans=0.1 2023-06-23 09:57:08,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=24100.0, ans=0.125 2023-06-23 09:57:34,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=24166.666666666668, ans=0.125 2023-06-23 09:57:36,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=24166.666666666668, ans=0.125 2023-06-23 09:57:50,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=24233.333333333332, ans=0.1 2023-06-23 09:57:58,896 INFO [train.py:1008] (1/4) Epoch 7, batch 450, loss[loss=0.3245, simple_loss=0.3669, pruned_loss=0.1411, over 18910.00 frames. ], tot_loss[loss=0.3295, simple_loss=0.367, pruned_loss=0.1459, over 3375877.00 frames. ], batch size: 86, lr: 3.04e-02, grad_scale: 64.0 2023-06-23 09:58:03,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=24300.0, ans=0.125 2023-06-23 09:58:23,801 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.61 vs. limit=22.5 2023-06-23 09:58:34,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=24433.333333333332, ans=0.0 2023-06-23 09:58:47,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=24500.0, ans=0.125 2023-06-23 09:58:59,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=24500.0, ans=0.005543478260869566 2023-06-23 09:59:12,160 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.97 vs. limit=15.0 2023-06-23 09:59:17,628 INFO [train.py:1008] (1/4) Epoch 7, batch 500, loss[loss=0.3454, simple_loss=0.3706, pruned_loss=0.1601, over 20082.00 frames. ], tot_loss[loss=0.3276, simple_loss=0.3651, pruned_loss=0.1451, over 3470415.14 frames. ], batch size: 133, lr: 3.03e-02, grad_scale: 64.0 2023-06-23 09:59:36,075 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.772e+02 2.330e+02 2.795e+02 3.451e+02 4.911e+02, threshold=5.589e+02, percent-clipped=0.0 2023-06-23 09:59:41,986 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.12 vs. limit=15.0 2023-06-23 09:59:59,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=24766.666666666668, ans=22.5 2023-06-23 10:00:04,279 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=24.50 vs. 
limit=22.5 2023-06-23 10:00:30,076 INFO [train.py:1008] (1/4) Epoch 8, batch 0, loss[loss=0.3242, simple_loss=0.356, pruned_loss=0.1462, over 20475.00 frames. ], tot_loss[loss=0.3242, simple_loss=0.356, pruned_loss=0.1462, over 20475.00 frames. ], batch size: 189, lr: 2.89e-02, grad_scale: 64.0 2023-06-23 10:00:30,076 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 10:00:35,761 INFO [train.py:1040] (1/4) Epoch 8, validation: loss=0.2397, simple_loss=0.334, pruned_loss=0.07275, over 143649.00 frames. 2023-06-23 10:00:35,762 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 10:00:42,201 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.82 vs. limit=15.0 2023-06-23 10:00:43,498 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.09 vs. limit=10.0 2023-06-23 10:01:26,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=25046.666666666668, ans=0.0 2023-06-23 10:01:44,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=25113.333333333332, ans=0.2 2023-06-23 10:01:57,294 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.094e-01 2023-06-23 10:01:58,313 INFO [train.py:1008] (1/4) Epoch 8, batch 50, loss[loss=0.3249, simple_loss=0.362, pruned_loss=0.1439, over 19485.00 frames. ], tot_loss[loss=0.3182, simple_loss=0.3601, pruned_loss=0.1382, over 864848.42 frames. ], batch size: 105, lr: 2.88e-02, grad_scale: 64.0 2023-06-23 10:02:06,781 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.23 vs. limit=15.0 2023-06-23 10:02:10,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=25180.0, ans=0.2 2023-06-23 10:02:14,714 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:02:26,598 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.86 vs. limit=12.0 2023-06-23 10:02:49,149 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.335e+02 2.730e+02 3.502e+02 6.500e+02, threshold=5.461e+02, percent-clipped=1.0 2023-06-23 10:02:56,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=25380.0, ans=0.0053521739130434785 2023-06-23 10:03:02,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=25380.0, ans=0.1 2023-06-23 10:03:22,484 INFO [train.py:1008] (1/4) Epoch 8, batch 100, loss[loss=0.3251, simple_loss=0.3783, pruned_loss=0.136, over 16524.00 frames. ], tot_loss[loss=0.3194, simple_loss=0.3592, pruned_loss=0.1398, over 1508596.49 frames. 
], batch size: 52, lr: 2.88e-02, grad_scale: 64.0 2023-06-23 10:03:24,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=25513.333333333332, ans=0.005323188405797102 2023-06-23 10:03:38,737 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.55 vs. limit=22.5 2023-06-23 10:03:54,708 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.44 vs. limit=15.0 2023-06-23 10:04:03,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=25646.666666666668, ans=0.2 2023-06-23 10:04:33,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=25780.0, ans=0.125 2023-06-23 10:04:37,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=25780.0, ans=0.125 2023-06-23 10:04:45,901 INFO [train.py:1008] (1/4) Epoch 8, batch 150, loss[loss=0.3227, simple_loss=0.3537, pruned_loss=0.1458, over 20678.00 frames. ], tot_loss[loss=0.3192, simple_loss=0.3596, pruned_loss=0.1394, over 2029627.12 frames. ], batch size: 211, lr: 2.87e-02, grad_scale: 64.0 2023-06-23 10:05:29,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=25980.0, ans=0.125 2023-06-23 10:05:33,083 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:05:36,168 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.682e+02 2.398e+02 2.849e+02 3.834e+02 8.685e+02, threshold=5.698e+02, percent-clipped=3.0 2023-06-23 10:05:46,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=26046.666666666668, ans=0.0 2023-06-23 10:05:48,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=26046.666666666668, ans=0.0 2023-06-23 10:06:00,078 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.11 vs. limit=15.0 2023-06-23 10:06:04,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=26113.333333333332, ans=0.125 2023-06-23 10:06:09,737 INFO [train.py:1008] (1/4) Epoch 8, batch 200, loss[loss=0.326, simple_loss=0.3664, pruned_loss=0.1428, over 20119.00 frames. ], tot_loss[loss=0.3186, simple_loss=0.3592, pruned_loss=0.139, over 2426601.11 frames. ], batch size: 133, lr: 2.87e-02, grad_scale: 64.0 2023-06-23 10:06:31,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=26246.666666666668, ans=0.125 2023-06-23 10:06:39,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=26246.666666666668, ans=0.07 2023-06-23 10:06:45,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=26313.333333333332, ans=0.125 2023-06-23 10:07:31,944 INFO [train.py:1008] (1/4) Epoch 8, batch 250, loss[loss=0.3265, simple_loss=0.3791, pruned_loss=0.1369, over 18333.00 frames. 
], tot_loss[loss=0.3185, simple_loss=0.359, pruned_loss=0.1389, over 2710457.55 frames. ], batch size: 72, lr: 2.87e-02, grad_scale: 64.0 2023-06-23 10:07:55,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=26580.0, ans=0.005091304347826087 2023-06-23 10:08:23,647 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.719e+02 2.405e+02 3.162e+02 4.078e+02 7.691e+02, threshold=6.324e+02, percent-clipped=7.0 2023-06-23 10:08:45,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=26780.0, ans=0.125 2023-06-23 10:08:47,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=26780.0, ans=0.05 2023-06-23 10:08:56,871 INFO [train.py:1008] (1/4) Epoch 8, batch 300, loss[loss=0.3292, simple_loss=0.3633, pruned_loss=0.1476, over 20004.00 frames. ], tot_loss[loss=0.317, simple_loss=0.3582, pruned_loss=0.1379, over 2958084.97 frames. ], batch size: 126, lr: 2.86e-02, grad_scale: 64.0 2023-06-23 10:09:08,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=26846.666666666668, ans=0.1 2023-06-23 10:09:11,216 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=13.21 vs. limit=15.0 2023-06-23 10:09:31,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=26980.0, ans=0.125 2023-06-23 10:09:34,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=26980.0, ans=0.05 2023-06-23 10:09:40,257 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:09:41,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=26980.0, ans=0.0 2023-06-23 10:09:51,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=27046.666666666668, ans=0.2 2023-06-23 10:09:55,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=27046.666666666668, ans=0.125 2023-06-23 10:10:17,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=27180.0, ans=0.125 2023-06-23 10:10:18,661 INFO [train.py:1008] (1/4) Epoch 8, batch 350, loss[loss=0.3336, simple_loss=0.3795, pruned_loss=0.1439, over 18332.00 frames. ], tot_loss[loss=0.3163, simple_loss=0.3576, pruned_loss=0.1375, over 3147150.59 frames. 
], batch size: 72, lr: 2.86e-02, grad_scale: 64.0 2023-06-23 10:11:00,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=27313.333333333332, ans=10.0 2023-06-23 10:11:00,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=27313.333333333332, ans=0.125 2023-06-23 10:11:08,209 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.700e+02 2.345e+02 2.694e+02 3.348e+02 7.255e+02, threshold=5.388e+02, percent-clipped=3.0 2023-06-23 10:11:17,564 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.64 vs. limit=22.5 2023-06-23 10:11:23,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=27446.666666666668, ans=0.125 2023-06-23 10:11:41,432 INFO [train.py:1008] (1/4) Epoch 8, batch 400, loss[loss=0.3431, simple_loss=0.3548, pruned_loss=0.1657, over 19905.00 frames. ], tot_loss[loss=0.3161, simple_loss=0.3574, pruned_loss=0.1374, over 3297970.30 frames. ], batch size: 293, lr: 2.85e-02, grad_scale: 64.0 2023-06-23 10:11:46,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=27513.333333333332, ans=0.2 2023-06-23 10:11:46,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=27513.333333333332, ans=0.0 2023-06-23 10:12:10,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=27580.0, ans=0.0 2023-06-23 10:12:14,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=27646.666666666668, ans=0.125 2023-06-23 10:12:34,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=27713.333333333332, ans=0.09899494936611666 2023-06-23 10:12:36,341 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.75 vs. limit=22.5 2023-06-23 10:13:03,988 INFO [train.py:1008] (1/4) Epoch 8, batch 450, loss[loss=0.2993, simple_loss=0.3483, pruned_loss=0.1252, over 19868.00 frames. ], tot_loss[loss=0.3163, simple_loss=0.3576, pruned_loss=0.1375, over 3400765.26 frames. 
], batch size: 120, lr: 2.85e-02, grad_scale: 64.0 2023-06-23 10:13:27,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=27913.333333333332, ans=0.035 2023-06-23 10:13:32,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=27913.333333333332, ans=0.125 2023-06-23 10:13:33,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=27913.333333333332, ans=0.125 2023-06-23 10:13:36,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=27980.0, ans=0.125 2023-06-23 10:13:47,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=27980.0, ans=0.04949747468305833 2023-06-23 10:13:52,894 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.385e+02 2.830e+02 3.601e+02 6.779e+02, threshold=5.659e+02, percent-clipped=6.0 2023-06-23 10:14:20,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=28113.333333333332, ans=0.125 2023-06-23 10:14:23,517 INFO [train.py:1008] (1/4) Epoch 8, batch 500, loss[loss=0.3245, simple_loss=0.3529, pruned_loss=0.148, over 20478.00 frames. ], tot_loss[loss=0.3157, simple_loss=0.3575, pruned_loss=0.1369, over 3475933.34 frames. ], batch size: 160, lr: 2.85e-02, grad_scale: 64.0 2023-06-23 10:14:23,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=28180.0, ans=0.2 2023-06-23 10:14:32,786 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.00 vs. limit=10.0 2023-06-23 10:14:33,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=28180.0, ans=0.125 2023-06-23 10:15:01,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=28313.333333333332, ans=0.07 2023-06-23 10:15:04,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28313.333333333332, ans=0.1 2023-06-23 10:15:36,364 INFO [train.py:1008] (1/4) Epoch 9, batch 0, loss[loss=0.301, simple_loss=0.35, pruned_loss=0.126, over 19850.00 frames. ], tot_loss[loss=0.301, simple_loss=0.35, pruned_loss=0.126, over 19850.00 frames. ], batch size: 120, lr: 2.72e-02, grad_scale: 64.0 2023-06-23 10:15:36,364 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 10:15:42,531 INFO [train.py:1040] (1/4) Epoch 9, validation: loss=0.2321, simple_loss=0.3284, pruned_loss=0.06788, over 143649.00 frames. 
2023-06-23 10:15:42,532 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 10:15:47,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=28393.333333333332, ans=0.125 2023-06-23 10:16:28,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=28526.666666666668, ans=0.0 2023-06-23 10:16:31,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=28593.333333333332, ans=0.05 2023-06-23 10:16:47,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=28660.0, ans=0.125 2023-06-23 10:17:01,831 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.740e+02 2.235e+02 2.795e+02 3.608e+02 5.356e+02, threshold=5.591e+02, percent-clipped=0.0 2023-06-23 10:17:02,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=28660.0, ans=0.125 2023-06-23 10:17:05,077 INFO [train.py:1008] (1/4) Epoch 9, batch 50, loss[loss=0.3319, simple_loss=0.3546, pruned_loss=0.1546, over 20366.00 frames. ], tot_loss[loss=0.313, simple_loss=0.3526, pruned_loss=0.1367, over 863863.24 frames. ], batch size: 239, lr: 2.71e-02, grad_scale: 64.0 2023-06-23 10:17:15,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=28726.666666666668, ans=0.125 2023-06-23 10:17:18,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=28726.666666666668, ans=0.025 2023-06-23 10:17:23,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=28793.333333333332, ans=0.125 2023-06-23 10:17:45,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=28860.0, ans=0.125 2023-06-23 10:17:51,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=28926.666666666668, ans=0.125 2023-06-23 10:17:55,213 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.56 vs. limit=22.5 2023-06-23 10:18:11,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=28993.333333333332, ans=0.125 2023-06-23 10:18:12,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28993.333333333332, ans=0.1 2023-06-23 10:18:18,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=28993.333333333332, ans=0.2 2023-06-23 10:18:19,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=28993.333333333332, ans=0.0 2023-06-23 10:18:26,266 INFO [train.py:1008] (1/4) Epoch 9, batch 100, loss[loss=0.3125, simple_loss=0.3562, pruned_loss=0.1344, over 19067.00 frames. ], tot_loss[loss=0.3111, simple_loss=0.3535, pruned_loss=0.1344, over 1501363.04 frames. 
], batch size: 89, lr: 2.71e-02, grad_scale: 64.0 2023-06-23 10:18:32,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=29060.0, ans=0.125 2023-06-23 10:18:58,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=29193.333333333332, ans=0.125 2023-06-23 10:19:09,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=29193.333333333332, ans=0.07 2023-06-23 10:19:12,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=29193.333333333332, ans=0.0 2023-06-23 10:19:19,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=29260.0, ans=0.1 2023-06-23 10:19:33,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=29326.666666666668, ans=0.125 2023-06-23 10:19:41,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=29326.666666666668, ans=0.1 2023-06-23 10:19:45,728 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.343e+02 2.904e+02 3.644e+02 6.921e+02, threshold=5.807e+02, percent-clipped=3.0 2023-06-23 10:19:47,378 INFO [train.py:1008] (1/4) Epoch 9, batch 150, loss[loss=0.3372, simple_loss=0.3652, pruned_loss=0.1546, over 20694.00 frames. ], tot_loss[loss=0.3096, simple_loss=0.3535, pruned_loss=0.1328, over 1995033.70 frames. ], batch size: 211, lr: 2.70e-02, grad_scale: 32.0 2023-06-23 10:19:56,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=29393.333333333332, ans=0.125 2023-06-23 10:20:12,233 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.90 vs. limit=22.5 2023-06-23 10:20:24,563 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.29 vs. limit=22.5 2023-06-23 10:21:05,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=29660.0, ans=0.035 2023-06-23 10:21:09,333 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.33 vs. limit=22.5 2023-06-23 10:21:09,943 INFO [train.py:1008] (1/4) Epoch 9, batch 200, loss[loss=0.3327, simple_loss=0.351, pruned_loss=0.1572, over 20177.00 frames. ], tot_loss[loss=0.3078, simple_loss=0.3524, pruned_loss=0.1315, over 2391695.64 frames. ], batch size: 239, lr: 2.70e-02, grad_scale: 32.0 2023-06-23 10:21:16,852 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.06 vs. 
limit=15.0 2023-06-23 10:21:19,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=29726.666666666668, ans=0.04949747468305833 2023-06-23 10:21:27,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=29793.333333333332, ans=0.0 2023-06-23 10:21:33,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=29793.333333333332, ans=0.2 2023-06-23 10:22:03,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=29926.666666666668, ans=0.004363768115942028 2023-06-23 10:22:09,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29926.666666666668, ans=0.1 2023-06-23 10:22:31,027 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.762e+02 2.339e+02 2.716e+02 3.327e+02 5.463e+02, threshold=5.433e+02, percent-clipped=0.0 2023-06-23 10:22:32,591 INFO [train.py:1008] (1/4) Epoch 9, batch 250, loss[loss=0.2815, simple_loss=0.3341, pruned_loss=0.1145, over 18913.00 frames. ], tot_loss[loss=0.307, simple_loss=0.3519, pruned_loss=0.131, over 2698498.68 frames. ], batch size: 86, lr: 2.70e-02, grad_scale: 32.0 2023-06-23 10:22:40,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=30060.0, ans=0.1 2023-06-23 10:22:42,193 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-23 10:22:42,595 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.05 vs. limit=22.5 2023-06-23 10:23:09,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=30193.333333333332, ans=0.1 2023-06-23 10:23:13,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=30193.333333333332, ans=0.125 2023-06-23 10:23:53,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=30326.666666666668, ans=0.0 2023-06-23 10:23:55,983 INFO [train.py:1008] (1/4) Epoch 9, batch 300, loss[loss=0.3105, simple_loss=0.3655, pruned_loss=0.1277, over 15401.00 frames. ], tot_loss[loss=0.3074, simple_loss=0.3521, pruned_loss=0.1314, over 2926075.25 frames. ], batch size: 44, lr: 2.69e-02, grad_scale: 32.0 2023-06-23 10:24:30,476 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.09 vs. 
limit=15.0 2023-06-23 10:24:36,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=30526.666666666668, ans=10.0 2023-06-23 10:24:55,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=30593.333333333332, ans=0.125 2023-06-23 10:24:55,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=30593.333333333332, ans=0.125 2023-06-23 10:25:06,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=30660.0, ans=0.125 2023-06-23 10:25:08,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=30660.0, ans=0.0 2023-06-23 10:25:17,348 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=12.43 vs. limit=15.0 2023-06-23 10:25:17,840 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.462e+02 2.891e+02 3.800e+02 6.993e+02, threshold=5.782e+02, percent-clipped=6.0 2023-06-23 10:25:20,053 INFO [train.py:1008] (1/4) Epoch 9, batch 350, loss[loss=0.2759, simple_loss=0.3279, pruned_loss=0.112, over 19816.00 frames. ], tot_loss[loss=0.3052, simple_loss=0.3511, pruned_loss=0.1297, over 3114795.84 frames. ], batch size: 120, lr: 2.69e-02, grad_scale: 32.0 2023-06-23 10:25:29,573 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.85 vs. limit=15.0 2023-06-23 10:25:42,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=30793.333333333332, ans=0.125 2023-06-23 10:25:51,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=30860.0, ans=0.125 2023-06-23 10:26:02,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=30860.0, ans=0.125 2023-06-23 10:26:43,140 INFO [train.py:1008] (1/4) Epoch 9, batch 400, loss[loss=0.3056, simple_loss=0.3437, pruned_loss=0.1337, over 20559.00 frames. ], tot_loss[loss=0.3049, simple_loss=0.3505, pruned_loss=0.1297, over 3264364.89 frames. ], batch size: 173, lr: 2.68e-02, grad_scale: 32.0 2023-06-23 10:26:48,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=31060.0, ans=0.125 2023-06-23 10:26:50,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=31060.0, ans=0.125 2023-06-23 10:27:03,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31126.666666666668, ans=0.1 2023-06-23 10:27:48,185 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.60 vs. 
limit=22.5 2023-06-23 10:27:57,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=31326.666666666668, ans=0.0 2023-06-23 10:28:03,167 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.693e+02 2.233e+02 2.638e+02 3.310e+02 5.527e+02, threshold=5.277e+02, percent-clipped=0.0 2023-06-23 10:28:04,774 INFO [train.py:1008] (1/4) Epoch 9, batch 450, loss[loss=0.3299, simple_loss=0.3779, pruned_loss=0.141, over 18929.00 frames. ], tot_loss[loss=0.3047, simple_loss=0.3507, pruned_loss=0.1294, over 3384944.43 frames. ], batch size: 86, lr: 2.68e-02, grad_scale: 32.0 2023-06-23 10:28:10,486 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:28:24,664 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.50 vs. limit=15.0 2023-06-23 10:28:27,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=31460.0, ans=0.0 2023-06-23 10:28:33,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=31460.0, ans=0.04949747468305833 2023-06-23 10:28:59,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=31593.333333333332, ans=0.125 2023-06-23 10:29:05,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=31593.333333333332, ans=0.125 2023-06-23 10:29:09,250 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=6.0 2023-06-23 10:29:25,463 INFO [train.py:1008] (1/4) Epoch 9, batch 500, loss[loss=0.3179, simple_loss=0.3541, pruned_loss=0.1409, over 20558.00 frames. ], tot_loss[loss=0.3044, simple_loss=0.35, pruned_loss=0.1294, over 3480217.53 frames. ], batch size: 189, lr: 2.68e-02, grad_scale: 32.0 2023-06-23 10:29:31,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=31726.666666666668, ans=0.125 2023-06-23 10:29:36,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=31726.666666666668, ans=0.125 2023-06-23 10:29:57,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=31860.0, ans=0.125 2023-06-23 10:30:03,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31860.0, ans=0.1 2023-06-23 10:30:41,980 INFO [train.py:1008] (1/4) Epoch 10, batch 0, loss[loss=0.2844, simple_loss=0.3406, pruned_loss=0.1141, over 19078.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.3406, pruned_loss=0.1141, over 19078.00 frames. ], batch size: 89, lr: 2.56e-02, grad_scale: 32.0 2023-06-23 10:30:41,981 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 10:30:47,654 INFO [train.py:1040] (1/4) Epoch 10, validation: loss=0.2252, simple_loss=0.3233, pruned_loss=0.06351, over 143649.00 frames. 
2023-06-23 10:30:47,655 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 10:31:14,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=32006.666666666668, ans=0.0 2023-06-23 10:31:15,958 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.771e+02 2.285e+02 2.652e+02 3.209e+02 5.793e+02, threshold=5.303e+02, percent-clipped=2.0 2023-06-23 10:31:28,921 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.37 vs. limit=6.0 2023-06-23 10:31:34,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=32073.333333333332, ans=0.125 2023-06-23 10:32:10,759 INFO [train.py:1008] (1/4) Epoch 10, batch 50, loss[loss=0.2806, simple_loss=0.3316, pruned_loss=0.1148, over 20320.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3469, pruned_loss=0.1255, over 857995.51 frames. ], batch size: 149, lr: 2.56e-02, grad_scale: 32.0 2023-06-23 10:32:36,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=32340.0, ans=0.1 2023-06-23 10:32:42,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=32406.666666666668, ans=0.00382463768115942 2023-06-23 10:32:42,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=32406.666666666668, ans=0.2 2023-06-23 10:32:47,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=32406.666666666668, ans=0.0 2023-06-23 10:33:03,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=32473.333333333332, ans=0.07 2023-06-23 10:33:08,337 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:33:09,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=32473.333333333332, ans=0.125 2023-06-23 10:33:22,696 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.92 vs. limit=15.0 2023-06-23 10:33:26,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=32540.0, ans=0.0 2023-06-23 10:33:33,229 INFO [train.py:1008] (1/4) Epoch 10, batch 100, loss[loss=0.2982, simple_loss=0.3564, pruned_loss=0.12, over 16841.00 frames. ], tot_loss[loss=0.2991, simple_loss=0.3462, pruned_loss=0.126, over 1510651.47 frames. 
], batch size: 59, lr: 2.55e-02, grad_scale: 32.0 2023-06-23 10:34:00,487 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 2.231e+02 2.647e+02 3.330e+02 6.227e+02, threshold=5.294e+02, percent-clipped=3.0 2023-06-23 10:34:07,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=32740.0, ans=0.0037521739130434778 2023-06-23 10:34:33,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_positive, batch_count=32806.666666666664, ans=0.05 2023-06-23 10:34:38,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=32873.333333333336, ans=0.0037231884057971013 2023-06-23 10:34:54,558 INFO [train.py:1008] (1/4) Epoch 10, batch 150, loss[loss=0.2946, simple_loss=0.3409, pruned_loss=0.1242, over 20304.00 frames. ], tot_loss[loss=0.2985, simple_loss=0.3466, pruned_loss=0.1252, over 2008838.81 frames. ], batch size: 141, lr: 2.55e-02, grad_scale: 32.0 2023-06-23 10:34:58,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=32940.0, ans=0.125 2023-06-23 10:35:07,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=32940.0, ans=0.125 2023-06-23 10:36:17,393 INFO [train.py:1008] (1/4) Epoch 10, batch 200, loss[loss=0.289, simple_loss=0.338, pruned_loss=0.12, over 20148.00 frames. ], tot_loss[loss=0.2974, simple_loss=0.3454, pruned_loss=0.1247, over 2416134.18 frames. ], batch size: 133, lr: 2.54e-02, grad_scale: 32.0 2023-06-23 10:36:21,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=33273.333333333336, ans=0.0 2023-06-23 10:36:34,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=33340.0, ans=0.125 2023-06-23 10:36:43,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33340.0, ans=0.1 2023-06-23 10:36:43,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=33340.0, ans=0.0036217391304347825 2023-06-23 10:36:44,636 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.786e+02 2.278e+02 2.899e+02 3.563e+02 6.245e+02, threshold=5.798e+02, percent-clipped=3.0 2023-06-23 10:36:59,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=33406.666666666664, ans=0.0 2023-06-23 10:37:02,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=33406.666666666664, ans=0.125 2023-06-23 10:37:03,922 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.40 vs. limit=5.0 2023-06-23 10:37:15,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=33473.333333333336, ans=0.09899494936611666 2023-06-23 10:37:17,771 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.66 vs. 
limit=15.0 2023-06-23 10:37:29,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=33540.0, ans=15.0 2023-06-23 10:37:35,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=33540.0, ans=0.0035782608695652165 2023-06-23 10:37:40,534 INFO [train.py:1008] (1/4) Epoch 10, batch 250, loss[loss=0.2904, simple_loss=0.3446, pruned_loss=0.1181, over 18611.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.3461, pruned_loss=0.1253, over 2720587.36 frames. ], batch size: 80, lr: 2.54e-02, grad_scale: 32.0 2023-06-23 10:37:49,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=33606.666666666664, ans=0.0 2023-06-23 10:37:55,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=33673.333333333336, ans=0.09899494936611666 2023-06-23 10:37:57,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=33673.333333333336, ans=0.125 2023-06-23 10:38:12,148 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:38:12,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=33740.0, ans=0.125 2023-06-23 10:38:12,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=33740.0, ans=0.0 2023-06-23 10:38:17,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=33740.0, ans=0.0 2023-06-23 10:38:34,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=33806.666666666664, ans=0.125 2023-06-23 10:38:35,977 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.40 vs. limit=22.5 2023-06-23 10:38:40,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=33806.666666666664, ans=0.125 2023-06-23 10:39:02,818 INFO [train.py:1008] (1/4) Epoch 10, batch 300, loss[loss=0.2816, simple_loss=0.3365, pruned_loss=0.1133, over 18620.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.3465, pruned_loss=0.125, over 2944101.67 frames. 
], batch size: 80, lr: 2.54e-02, grad_scale: 32.0 2023-06-23 10:39:29,954 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 2.326e+02 2.732e+02 3.307e+02 6.159e+02, threshold=5.464e+02, percent-clipped=1.0 2023-06-23 10:39:35,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=34073.333333333336, ans=0.00346231884057971 2023-06-23 10:39:45,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=34073.333333333336, ans=0.00346231884057971 2023-06-23 10:40:01,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=34140.0, ans=10.0 2023-06-23 10:40:23,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=34273.333333333336, ans=0.0034188405797101447 2023-06-23 10:40:24,790 INFO [train.py:1008] (1/4) Epoch 10, batch 350, loss[loss=0.2835, simple_loss=0.3367, pruned_loss=0.1152, over 18467.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.3455, pruned_loss=0.1244, over 3132739.27 frames. ], batch size: 77, lr: 2.53e-02, grad_scale: 32.0 2023-06-23 10:40:31,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=34273.333333333336, ans=0.0 2023-06-23 10:40:34,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=34273.333333333336, ans=0.0034188405797101447 2023-06-23 10:40:46,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=34340.0, ans=0.125 2023-06-23 10:40:56,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=34406.666666666664, ans=0.125 2023-06-23 10:41:36,407 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=12.96 vs. limit=15.0 2023-06-23 10:41:43,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=34540.0, ans=0.125 2023-06-23 10:41:47,482 INFO [train.py:1008] (1/4) Epoch 10, batch 400, loss[loss=0.2672, simple_loss=0.3215, pruned_loss=0.1064, over 19327.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.3449, pruned_loss=0.1247, over 3274971.56 frames. 
], batch size: 98, lr: 2.53e-02, grad_scale: 32.0 2023-06-23 10:42:08,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=34673.333333333336, ans=0.125 2023-06-23 10:42:15,842 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 2.198e+02 2.659e+02 3.246e+02 4.895e+02, threshold=5.319e+02, percent-clipped=0.0 2023-06-23 10:42:16,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=34673.333333333336, ans=0.125 2023-06-23 10:42:22,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=34740.0, ans=0.0 2023-06-23 10:42:22,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=34740.0, ans=0.125 2023-06-23 10:42:34,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=34740.0, ans=0.0 2023-06-23 10:43:02,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=34873.333333333336, ans=0.0 2023-06-23 10:43:09,659 INFO [train.py:1008] (1/4) Epoch 10, batch 450, loss[loss=0.316, simple_loss=0.3505, pruned_loss=0.1407, over 20663.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.3449, pruned_loss=0.1244, over 3372260.49 frames. ], batch size: 211, lr: 2.52e-02, grad_scale: 32.0 2023-06-23 10:43:20,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=34940.0, ans=0.125 2023-06-23 10:43:31,045 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.78 vs. limit=15.0 2023-06-23 10:43:58,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=35140.0, ans=0.125 2023-06-23 10:44:06,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=35140.0, ans=0.125 2023-06-23 10:44:29,450 INFO [train.py:1008] (1/4) Epoch 10, batch 500, loss[loss=0.3221, simple_loss=0.379, pruned_loss=0.1326, over 18316.00 frames. ], tot_loss[loss=0.2973, simple_loss=0.3454, pruned_loss=0.1246, over 3463374.28 frames. ], batch size: 72, lr: 2.52e-02, grad_scale: 32.0 2023-06-23 10:44:37,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=35273.333333333336, ans=0.125 2023-06-23 10:44:57,137 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.221e+02 2.542e+02 2.887e+02 4.179e+02, threshold=5.084e+02, percent-clipped=0.0 2023-06-23 10:44:57,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=35340.0, ans=0.1 2023-06-23 10:45:46,045 INFO [train.py:1008] (1/4) Epoch 11, batch 0, loss[loss=0.3175, simple_loss=0.3412, pruned_loss=0.1469, over 19930.00 frames. ], tot_loss[loss=0.3175, simple_loss=0.3412, pruned_loss=0.1469, over 19930.00 frames. ], batch size: 293, lr: 2.42e-02, grad_scale: 32.0 2023-06-23 10:45:46,046 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 10:45:51,641 INFO [train.py:1040] (1/4) Epoch 11, validation: loss=0.2248, simple_loss=0.3217, pruned_loss=0.06391, over 143649.00 frames. 
2023-06-23 10:45:51,641 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 10:45:59,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=35493.333333333336, ans=0.125 2023-06-23 10:46:25,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=35626.666666666664, ans=0.003124637681159421 2023-06-23 10:46:27,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=35626.666666666664, ans=0.2 2023-06-23 10:46:30,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=35626.666666666664, ans=0.1 2023-06-23 10:46:41,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=35693.333333333336, ans=0.125 2023-06-23 10:46:48,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=35693.333333333336, ans=0.125 2023-06-23 10:47:14,638 INFO [train.py:1008] (1/4) Epoch 11, batch 50, loss[loss=0.3222, simple_loss=0.3728, pruned_loss=0.1358, over 17600.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.3456, pruned_loss=0.124, over 853845.91 frames. ], batch size: 67, lr: 2.41e-02, grad_scale: 32.0 2023-06-23 10:47:22,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=35826.666666666664, ans=0.2 2023-06-23 10:47:26,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=35826.666666666664, ans=0.125 2023-06-23 10:47:34,541 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.98 vs. limit=6.0 2023-06-23 10:47:43,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=35893.333333333336, ans=0.125 2023-06-23 10:48:10,006 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 2.233e+02 2.520e+02 2.847e+02 4.196e+02, threshold=5.041e+02, percent-clipped=0.0 2023-06-23 10:48:21,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=36093.333333333336, ans=0.0 2023-06-23 10:48:21,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=36093.333333333336, ans=0.2 2023-06-23 10:48:36,762 INFO [train.py:1008] (1/4) Epoch 11, batch 100, loss[loss=0.3158, simple_loss=0.3445, pruned_loss=0.1436, over 20347.00 frames. ], tot_loss[loss=0.2936, simple_loss=0.3441, pruned_loss=0.1215, over 1498495.52 frames. ], batch size: 239, lr: 2.41e-02, grad_scale: 32.0 2023-06-23 10:48:56,940 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.76 vs. limit=22.5 2023-06-23 10:49:07,326 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:49:15,305 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.23 vs. 
limit=22.5 2023-06-23 10:49:32,357 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.57 vs. limit=15.0 2023-06-23 10:49:41,052 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:49:42,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=36426.666666666664, ans=0.0029507246376811597 2023-06-23 10:49:55,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=36426.666666666664, ans=0.07 2023-06-23 10:50:00,208 INFO [train.py:1008] (1/4) Epoch 11, batch 150, loss[loss=0.3121, simple_loss=0.3639, pruned_loss=0.1301, over 18283.00 frames. ], tot_loss[loss=0.2944, simple_loss=0.3433, pruned_loss=0.1228, over 2008329.36 frames. ], batch size: 74, lr: 2.40e-02, grad_scale: 32.0 2023-06-23 10:50:08,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=36493.333333333336, ans=0.125 2023-06-23 10:50:49,590 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:50:49,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=36693.333333333336, ans=0.1 2023-06-23 10:50:55,670 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.693e+02 2.157e+02 2.625e+02 2.983e+02 4.556e+02, threshold=5.251e+02, percent-clipped=0.0 2023-06-23 10:51:04,460 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.88 vs. limit=15.0 2023-06-23 10:51:08,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=36760.0, ans=0.0 2023-06-23 10:51:22,153 INFO [train.py:1008] (1/4) Epoch 11, batch 200, loss[loss=0.2812, simple_loss=0.325, pruned_loss=0.1188, over 20510.00 frames. ], tot_loss[loss=0.294, simple_loss=0.3427, pruned_loss=0.1226, over 2379955.37 frames. ], batch size: 173, lr: 2.40e-02, grad_scale: 32.0 2023-06-23 10:51:24,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=36826.666666666664, ans=0.125 2023-06-23 10:51:53,707 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.48 vs. limit=22.5 2023-06-23 10:52:04,349 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.43 vs. limit=15.0 2023-06-23 10:52:29,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=37093.333333333336, ans=0.0 2023-06-23 10:52:30,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=37093.333333333336, ans=0.1 2023-06-23 10:52:43,209 INFO [train.py:1008] (1/4) Epoch 11, batch 250, loss[loss=0.3454, simple_loss=0.4012, pruned_loss=0.1448, over 18327.00 frames. ], tot_loss[loss=0.2934, simple_loss=0.3428, pruned_loss=0.122, over 2676074.22 frames. 
], batch size: 72, lr: 2.40e-02, grad_scale: 32.0 2023-06-23 10:52:58,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=37160.0, ans=0.1 2023-06-23 10:53:24,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=37293.333333333336, ans=0.125 2023-06-23 10:53:28,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=37293.333333333336, ans=0.05 2023-06-23 10:53:33,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=37360.0, ans=0.2 2023-06-23 10:53:40,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=37360.0, ans=0.125 2023-06-23 10:53:41,579 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 2.088e+02 2.350e+02 3.005e+02 5.122e+02, threshold=4.700e+02, percent-clipped=0.0 2023-06-23 10:53:47,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=37360.0, ans=0.125 2023-06-23 10:54:07,915 INFO [train.py:1008] (1/4) Epoch 11, batch 300, loss[loss=0.2667, simple_loss=0.3262, pruned_loss=0.1036, over 19110.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3419, pruned_loss=0.1213, over 2925880.16 frames. ], batch size: 89, lr: 2.39e-02, grad_scale: 16.0 2023-06-23 10:55:32,512 INFO [train.py:1008] (1/4) Epoch 11, batch 350, loss[loss=0.3244, simple_loss=0.3599, pruned_loss=0.1445, over 20528.00 frames. ], tot_loss[loss=0.2924, simple_loss=0.3414, pruned_loss=0.1217, over 3128318.71 frames. ], batch size: 160, lr: 2.39e-02, grad_scale: 16.0 2023-06-23 10:56:08,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=37960.0, ans=0.1 2023-06-23 10:56:24,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=38026.666666666664, ans=0.125 2023-06-23 10:56:32,937 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.815e+02 2.281e+02 2.772e+02 3.917e+02 6.266e+02, threshold=5.543e+02, percent-clipped=13.0 2023-06-23 10:56:33,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=38026.666666666664, ans=0.125 2023-06-23 10:56:40,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=38093.333333333336, ans=0.0025884057971014493 2023-06-23 10:56:45,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=38093.333333333336, ans=0.125 2023-06-23 10:56:57,415 INFO [train.py:1008] (1/4) Epoch 11, batch 400, loss[loss=0.3009, simple_loss=0.3469, pruned_loss=0.1274, over 19101.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3397, pruned_loss=0.1201, over 3289244.43 frames. 
], batch size: 89, lr: 2.38e-02, grad_scale: 32.0 2023-06-23 10:57:03,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=38160.0, ans=0.125 2023-06-23 10:57:19,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=38226.666666666664, ans=0.2 2023-06-23 10:57:19,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=38226.666666666664, ans=0.125 2023-06-23 10:57:27,010 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.63 vs. limit=10.0 2023-06-23 10:57:31,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=38293.333333333336, ans=0.125 2023-06-23 10:58:17,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=38426.666666666664, ans=0.2 2023-06-23 10:58:22,122 INFO [train.py:1008] (1/4) Epoch 11, batch 450, loss[loss=0.3183, simple_loss=0.3834, pruned_loss=0.1266, over 15490.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3395, pruned_loss=0.1193, over 3402671.29 frames. ], batch size: 44, lr: 2.38e-02, grad_scale: 32.0 2023-06-23 10:58:26,424 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.10 vs. limit=6.0 2023-06-23 10:58:53,309 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. limit=6.0 2023-06-23 10:59:13,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=38693.333333333336, ans=0.002457971014492754 2023-06-23 10:59:20,873 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 2.099e+02 2.478e+02 2.998e+02 4.300e+02, threshold=4.956e+02, percent-clipped=0.0 2023-06-23 10:59:44,927 INFO [train.py:1008] (1/4) Epoch 11, batch 500, loss[loss=0.3119, simple_loss=0.3637, pruned_loss=0.13, over 17089.00 frames. ], tot_loss[loss=0.2886, simple_loss=0.3393, pruned_loss=0.119, over 3485351.74 frames. ], batch size: 60, lr: 2.38e-02, grad_scale: 32.0 2023-06-23 10:59:48,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=38826.666666666664, ans=0.125 2023-06-23 11:00:10,022 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.03 vs. limit=15.0 2023-06-23 11:01:00,452 INFO [train.py:1008] (1/4) Epoch 12, batch 0, loss[loss=0.2742, simple_loss=0.3375, pruned_loss=0.1054, over 16943.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3375, pruned_loss=0.1054, over 16943.00 frames. ], batch size: 60, lr: 2.28e-02, grad_scale: 32.0 2023-06-23 11:01:00,452 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 11:01:06,087 INFO [train.py:1040] (1/4) Epoch 12, validation: loss=0.2211, simple_loss=0.3184, pruned_loss=0.06189, over 143649.00 frames. 
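Each train.py:1008 entry above pairs two readings: the first loss[...] block is the current batch (loss, simple_loss and pruned_loss averaged over that batch's frames), while tot_loss[...] is a running aggregate over the batches seen so far, again reported per frame over the frames it covers, which is why its frame count grows from entry to entry (not strictly additively, so older batches appear to be down-weighted). The logged loss is also consistent with a weighted sum of the two components, roughly 0.5 × simple_loss + pruned_loss; the Epoch 12 validation entry checks out as 0.5 × 0.3184 + 0.06189 ≈ 0.2211. The sketch below reproduces only this bookkeeping under those assumptions; the class name, the 0.5 weight, and the plain (undecayed) accumulation are illustrative, not icefall's actual MetricsTracker or loss code.

```python
from collections import defaultdict

# Hypothetical helpers mirroring the arithmetic behind the log lines;
# the names and the 0.5 weight are assumptions, not icefall's real code.
def combine_losses(simple_loss, pruned_loss, simple_scale=0.5):
    # per-frame loss = simple_scale * simple_loss + pruned_loss
    return simple_scale * simple_loss + pruned_loss

class FrameWeightedTracker:
    """Keeps loss *sums* plus the frames they cover, so the printed
    values are per-frame averages ("loss=..., over N frames").
    The real tracker also seems to down-weight older batches, which
    this simplified sketch omits."""

    def __init__(self):
        self.sums = defaultdict(float)
        self.frames = 0.0

    def update(self, batch_frames, **per_frame_losses):
        self.frames += batch_frames
        for name, value in per_frame_losses.items():
            self.sums[name] += value * batch_frames

    def averages(self):
        return {k: v / self.frames for k, v in self.sums.items()}

# One batch from the log above (Epoch 11, batch 400):
tot = FrameWeightedTracker()
tot.update(19101.0, simple_loss=0.3469, pruned_loss=0.1274,
           loss=combine_losses(0.3469, 0.1274))   # ~0.3009, as logged
print(tot.averages(), "over", tot.frames, "frames")
```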
2023-06-23 11:01:06,088 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 11:01:10,334 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.83 vs. limit=15.0 2023-06-23 11:01:28,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=39106.666666666664, ans=0.0 2023-06-23 11:01:48,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=39173.333333333336, ans=0.125 2023-06-23 11:01:55,171 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.20 vs. limit=10.0 2023-06-23 11:02:11,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=39306.666666666664, ans=0.125 2023-06-23 11:02:29,855 INFO [train.py:1008] (1/4) Epoch 12, batch 50, loss[loss=0.2656, simple_loss=0.3401, pruned_loss=0.09557, over 17598.00 frames. ], tot_loss[loss=0.2859, simple_loss=0.3355, pruned_loss=0.1182, over 850446.64 frames. ], batch size: 67, lr: 2.28e-02, grad_scale: 32.0 2023-06-23 11:02:34,458 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 2.173e+02 2.414e+02 2.819e+02 3.920e+02, threshold=4.827e+02, percent-clipped=0.0 2023-06-23 11:02:40,027 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.39 vs. limit=6.0 2023-06-23 11:02:47,937 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.89 vs. limit=15.0 2023-06-23 11:03:23,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=39573.333333333336, ans=0.002266666666666667 2023-06-23 11:03:48,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=39640.0, ans=0.04949747468305833 2023-06-23 11:03:52,870 INFO [train.py:1008] (1/4) Epoch 12, batch 100, loss[loss=0.2983, simple_loss=0.3546, pruned_loss=0.1209, over 17035.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3342, pruned_loss=0.1171, over 1513054.89 frames. ], batch size: 60, lr: 2.28e-02, grad_scale: 32.0 2023-06-23 11:04:00,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=39706.666666666664, ans=0.5 2023-06-23 11:04:09,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=39773.333333333336, ans=0.0 2023-06-23 11:04:18,251 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0 2023-06-23 11:04:30,471 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.64 vs. 
limit=22.5 2023-06-23 11:05:02,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=39973.333333333336, ans=0.125 2023-06-23 11:05:04,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=39973.333333333336, ans=0.125 2023-06-23 11:05:15,883 INFO [train.py:1008] (1/4) Epoch 12, batch 150, loss[loss=0.2713, simple_loss=0.3226, pruned_loss=0.11, over 20566.00 frames. ], tot_loss[loss=0.2826, simple_loss=0.3338, pruned_loss=0.1157, over 2040883.27 frames. ], batch size: 189, lr: 2.27e-02, grad_scale: 32.0 2023-06-23 11:05:20,576 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.671e+02 2.197e+02 2.613e+02 3.167e+02 4.551e+02, threshold=5.226e+02, percent-clipped=0.0 2023-06-23 11:06:07,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=40240.0, ans=0.0 2023-06-23 11:06:10,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=40240.0, ans=0.125 2023-06-23 11:06:17,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=40240.0, ans=0.04949747468305833 2023-06-23 11:06:34,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40306.666666666664, ans=0.1 2023-06-23 11:06:38,961 INFO [train.py:1008] (1/4) Epoch 12, batch 200, loss[loss=0.2688, simple_loss=0.3409, pruned_loss=0.09837, over 16967.00 frames. ], tot_loss[loss=0.2814, simple_loss=0.3333, pruned_loss=0.1148, over 2435969.12 frames. ], batch size: 60, lr: 2.27e-02, grad_scale: 32.0 2023-06-23 11:06:52,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=40373.333333333336, ans=0.125 2023-06-23 11:06:58,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=40440.0, ans=0.2 2023-06-23 11:07:07,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=40440.0, ans=0.125 2023-06-23 11:07:43,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40640.0, ans=0.1 2023-06-23 11:08:03,154 INFO [train.py:1008] (1/4) Epoch 12, batch 250, loss[loss=0.3205, simple_loss=0.3743, pruned_loss=0.1333, over 16180.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3343, pruned_loss=0.115, over 2732730.19 frames. ], batch size: 52, lr: 2.27e-02, grad_scale: 32.0 2023-06-23 11:08:07,869 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.708e+02 2.162e+02 2.502e+02 3.086e+02 4.353e+02, threshold=5.005e+02, percent-clipped=0.0 2023-06-23 11:08:08,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=40706.666666666664, ans=0.002020289855072464 2023-06-23 11:08:54,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=40906.666666666664, ans=0.001976811594202899 2023-06-23 11:08:54,909 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.99 vs. 
limit=15.0 2023-06-23 11:09:06,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=40906.666666666664, ans=0.125 2023-06-23 11:09:07,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=40906.666666666664, ans=0.5 2023-06-23 11:09:26,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=41040.0, ans=0.2 2023-06-23 11:09:27,217 INFO [train.py:1008] (1/4) Epoch 12, batch 300, loss[loss=0.294, simple_loss=0.3502, pruned_loss=0.1189, over 18317.00 frames. ], tot_loss[loss=0.2816, simple_loss=0.3343, pruned_loss=0.1145, over 2970034.58 frames. ], batch size: 74, lr: 2.26e-02, grad_scale: 32.0 2023-06-23 11:09:38,696 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.66 vs. limit=15.0 2023-06-23 11:09:45,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=41106.666666666664, ans=0.125 2023-06-23 11:10:26,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=41240.0, ans=0.125 2023-06-23 11:10:36,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=41306.666666666664, ans=0.125 2023-06-23 11:10:38,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=41306.666666666664, ans=0.0018898550724637687 2023-06-23 11:10:41,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=41306.666666666664, ans=0.0018898550724637687 2023-06-23 11:10:47,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=41306.666666666664, ans=0.125 2023-06-23 11:10:50,364 INFO [train.py:1008] (1/4) Epoch 12, batch 350, loss[loss=0.3035, simple_loss=0.367, pruned_loss=0.12, over 18300.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3341, pruned_loss=0.1148, over 3172405.01 frames. ], batch size: 72, lr: 2.26e-02, grad_scale: 32.0 2023-06-23 11:10:54,993 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 2.164e+02 2.476e+02 2.969e+02 4.432e+02, threshold=4.951e+02, percent-clipped=0.0 2023-06-23 11:11:21,896 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.10 vs. limit=15.0 2023-06-23 11:12:11,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=41706.666666666664, ans=0.125 2023-06-23 11:12:13,273 INFO [train.py:1008] (1/4) Epoch 12, batch 400, loss[loss=0.2754, simple_loss=0.3341, pruned_loss=0.1084, over 19330.00 frames. ], tot_loss[loss=0.2817, simple_loss=0.3341, pruned_loss=0.1147, over 3312137.01 frames. 
], batch size: 98, lr: 2.25e-02, grad_scale: 32.0 2023-06-23 11:12:37,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=41773.333333333336, ans=0.1 2023-06-23 11:13:26,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=41973.333333333336, ans=0.125 2023-06-23 11:13:35,923 INFO [train.py:1008] (1/4) Epoch 12, batch 450, loss[loss=0.2926, simple_loss=0.3417, pruned_loss=0.1218, over 19662.00 frames. ], tot_loss[loss=0.2815, simple_loss=0.3336, pruned_loss=0.1147, over 3414948.01 frames. ], batch size: 110, lr: 2.25e-02, grad_scale: 32.0 2023-06-23 11:13:38,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=42040.0, ans=0.2 2023-06-23 11:13:41,745 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 2.136e+02 2.493e+02 3.022e+02 5.993e+02, threshold=4.985e+02, percent-clipped=1.0 2023-06-23 11:14:15,949 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=12.0 2023-06-23 11:14:17,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=42173.333333333336, ans=0.125 2023-06-23 11:14:30,301 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=14.99 vs. limit=15.0 2023-06-23 11:14:48,564 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=15.77 vs. limit=15.0 2023-06-23 11:14:57,109 INFO [train.py:1008] (1/4) Epoch 12, batch 500, loss[loss=0.2654, simple_loss=0.3255, pruned_loss=0.1027, over 19353.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.3333, pruned_loss=0.1153, over 3496487.71 frames. ], batch size: 98, lr: 2.25e-02, grad_scale: 32.0 2023-06-23 11:15:05,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=42373.333333333336, ans=0.1 2023-06-23 11:15:20,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=42440.0, ans=0.125 2023-06-23 11:15:22,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=42440.0, ans=0.001643478260869564 2023-06-23 11:15:26,276 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.08 vs. limit=15.0 2023-06-23 11:15:31,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=42506.666666666664, ans=0.1 2023-06-23 11:15:32,297 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.60 vs. limit=15.0 2023-06-23 11:16:12,192 INFO [train.py:1008] (1/4) Epoch 13, batch 0, loss[loss=0.2821, simple_loss=0.3328, pruned_loss=0.1157, over 20266.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3328, pruned_loss=0.1157, over 20266.00 frames. 
], batch size: 141, lr: 2.16e-02, grad_scale: 32.0 2023-06-23 11:16:12,192 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 11:16:18,215 INFO [train.py:1040] (1/4) Epoch 13, validation: loss=0.2164, simple_loss=0.3138, pruned_loss=0.05947, over 143649.00 frames. 2023-06-23 11:16:18,216 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 11:16:27,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=42593.333333333336, ans=0.0 2023-06-23 11:16:48,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=42660.0, ans=0.125 2023-06-23 11:16:51,936 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.682e+02 2.193e+02 2.511e+02 3.488e+02 6.127e+02, threshold=5.022e+02, percent-clipped=8.0 2023-06-23 11:17:41,412 INFO [train.py:1008] (1/4) Epoch 13, batch 50, loss[loss=0.2639, simple_loss=0.3245, pruned_loss=0.1017, over 19721.00 frames. ], tot_loss[loss=0.2786, simple_loss=0.3322, pruned_loss=0.1125, over 855627.15 frames. ], batch size: 110, lr: 2.16e-02, grad_scale: 32.0 2023-06-23 11:18:13,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=43060.0, ans=0.125 2023-06-23 11:18:20,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=43060.0, ans=0.125 2023-06-23 11:18:36,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=43126.666666666664, ans=0.125 2023-06-23 11:19:04,636 INFO [train.py:1008] (1/4) Epoch 13, batch 100, loss[loss=0.2842, simple_loss=0.3385, pruned_loss=0.1149, over 18764.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3303, pruned_loss=0.1122, over 1506835.38 frames. ], batch size: 83, lr: 2.16e-02, grad_scale: 32.0 2023-06-23 11:19:16,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=43260.0, ans=0.2 2023-06-23 11:19:19,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=43326.666666666664, ans=0.04949747468305833 2023-06-23 11:19:24,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=43326.666666666664, ans=0.0 2023-06-23 11:19:35,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=43326.666666666664, ans=0.0 2023-06-23 11:19:37,885 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.683e+02 2.152e+02 2.569e+02 2.957e+02 4.748e+02, threshold=5.138e+02, percent-clipped=0.0 2023-06-23 11:19:41,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=43393.333333333336, ans=0.125 2023-06-23 11:19:48,632 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=12.29 vs. limit=15.0 2023-06-23 11:20:09,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=43526.666666666664, ans=0.0 2023-06-23 11:20:27,030 INFO [train.py:1008] (1/4) Epoch 13, batch 150, loss[loss=0.2775, simple_loss=0.3243, pruned_loss=0.1154, over 20555.00 frames. 
], tot_loss[loss=0.2764, simple_loss=0.3302, pruned_loss=0.1112, over 2007301.50 frames. ], batch size: 189, lr: 2.15e-02, grad_scale: 32.0 2023-06-23 11:20:29,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=43593.333333333336, ans=0.125 2023-06-23 11:20:32,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=43593.333333333336, ans=0.2 2023-06-23 11:20:40,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=43593.333333333336, ans=0.2 2023-06-23 11:20:45,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=43660.0, ans=0.1 2023-06-23 11:20:48,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=43660.0, ans=0.125 2023-06-23 11:21:20,789 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-06-23 11:21:23,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=43793.333333333336, ans=0.125 2023-06-23 11:21:32,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=43860.0, ans=0.2 2023-06-23 11:21:50,249 INFO [train.py:1008] (1/4) Epoch 13, batch 200, loss[loss=0.2616, simple_loss=0.3177, pruned_loss=0.1028, over 19870.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3316, pruned_loss=0.1121, over 2403591.74 frames. ], batch size: 120, lr: 2.15e-02, grad_scale: 32.0 2023-06-23 11:22:05,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=43993.333333333336, ans=0.125 2023-06-23 11:22:23,033 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.640e+02 2.129e+02 2.385e+02 2.873e+02 5.055e+02, threshold=4.770e+02, percent-clipped=0.0 2023-06-23 11:22:33,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=44060.0, ans=0.125 2023-06-23 11:22:35,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=44060.0, ans=0.0 2023-06-23 11:22:37,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=44060.0, ans=0.5 2023-06-23 11:22:40,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=44126.666666666664, ans=0.1 2023-06-23 11:22:40,493 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:23:13,624 INFO [train.py:1008] (1/4) Epoch 13, batch 250, loss[loss=0.2476, simple_loss=0.3138, pruned_loss=0.09069, over 19697.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.3316, pruned_loss=0.1118, over 2713851.36 frames. 
], batch size: 110, lr: 2.15e-02, grad_scale: 32.0 2023-06-23 11:23:41,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=44326.666666666664, ans=0.125 2023-06-23 11:23:57,846 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.88 vs. limit=15.0 2023-06-23 11:24:04,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=44460.0, ans=0.1 2023-06-23 11:24:24,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=44526.666666666664, ans=0.0011898550724637694 2023-06-23 11:24:29,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=44526.666666666664, ans=0.0011898550724637694 2023-06-23 11:24:34,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=44526.666666666664, ans=0.125 2023-06-23 11:24:37,094 INFO [train.py:1008] (1/4) Epoch 13, batch 300, loss[loss=0.2561, simple_loss=0.3239, pruned_loss=0.09418, over 11146.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.3312, pruned_loss=0.1121, over 2945412.69 frames. ], batch size: 31, lr: 2.14e-02, grad_scale: 32.0 2023-06-23 11:24:49,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=44593.333333333336, ans=0.125 2023-06-23 11:25:07,272 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:25:10,125 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 2.051e+02 2.313e+02 2.745e+02 4.176e+02, threshold=4.625e+02, percent-clipped=1.0 2023-06-23 11:26:00,769 INFO [train.py:1008] (1/4) Epoch 13, batch 350, loss[loss=0.2839, simple_loss=0.3365, pruned_loss=0.1156, over 19691.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3301, pruned_loss=0.1122, over 3138783.01 frames. ], batch size: 110, lr: 2.14e-02, grad_scale: 32.0 2023-06-23 11:26:42,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=45060.0, ans=0.2 2023-06-23 11:26:52,183 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:27:23,847 INFO [train.py:1008] (1/4) Epoch 13, batch 400, loss[loss=0.2852, simple_loss=0.3447, pruned_loss=0.1129, over 17648.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3302, pruned_loss=0.1114, over 3289914.15 frames. ], batch size: 67, lr: 2.14e-02, grad_scale: 32.0 2023-06-23 11:27:24,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=45260.0, ans=0.125 2023-06-23 11:27:37,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=45260.0, ans=0.0 2023-06-23 11:27:58,228 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 2.091e+02 2.460e+02 2.937e+02 5.771e+02, threshold=4.919e+02, percent-clipped=2.0 2023-06-23 11:28:48,322 INFO [train.py:1008] (1/4) Epoch 13, batch 450, loss[loss=0.2578, simple_loss=0.3316, pruned_loss=0.09199, over 17627.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3297, pruned_loss=0.1113, over 3406702.17 frames. 
], batch size: 67, lr: 2.13e-02, grad_scale: 32.0 2023-06-23 11:28:58,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=45593.333333333336, ans=0.0 2023-06-23 11:29:28,557 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.79 vs. limit=15.0 2023-06-23 11:29:44,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=45793.333333333336, ans=0.125 2023-06-23 11:29:44,871 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.19 vs. limit=15.0 2023-06-23 11:29:56,761 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:30:08,820 INFO [train.py:1008] (1/4) Epoch 13, batch 500, loss[loss=0.2875, simple_loss=0.3356, pruned_loss=0.1197, over 20526.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3305, pruned_loss=0.1118, over 3489096.56 frames. ], batch size: 189, lr: 2.13e-02, grad_scale: 32.0 2023-06-23 11:30:09,052 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:30:40,871 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.702e+02 2.058e+02 2.521e+02 2.960e+02 4.342e+02, threshold=5.042e+02, percent-clipped=0.0 2023-06-23 11:30:53,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=46060.0, ans=0.125 2023-06-23 11:31:22,276 INFO [train.py:1008] (1/4) Epoch 14, batch 0, loss[loss=0.2875, simple_loss=0.3329, pruned_loss=0.121, over 20094.00 frames. ], tot_loss[loss=0.2875, simple_loss=0.3329, pruned_loss=0.121, over 20094.00 frames. ], batch size: 133, lr: 2.05e-02, grad_scale: 32.0 2023-06-23 11:31:22,276 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 11:31:28,287 INFO [train.py:1040] (1/4) Epoch 14, validation: loss=0.2154, simple_loss=0.3133, pruned_loss=0.05873, over 143649.00 frames. 2023-06-23 11:31:28,288 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 11:31:44,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=46206.666666666664, ans=0.0 2023-06-23 11:31:44,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=46206.666666666664, ans=0.2 2023-06-23 11:32:13,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=46273.333333333336, ans=0.0008101449275362323 2023-06-23 11:32:32,028 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.11 vs. limit=12.0 2023-06-23 11:32:49,794 INFO [train.py:1008] (1/4) Epoch 14, batch 50, loss[loss=0.2719, simple_loss=0.3213, pruned_loss=0.1113, over 20354.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3292, pruned_loss=0.1085, over 851135.73 frames. 
], batch size: 149, lr: 2.05e-02, grad_scale: 32.0 2023-06-23 11:33:00,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=46473.333333333336, ans=0.125 2023-06-23 11:33:10,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=46540.0, ans=0.125 2023-06-23 11:33:21,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=46606.666666666664, ans=0.125 2023-06-23 11:33:37,728 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.80 vs. limit=15.0 2023-06-23 11:33:52,835 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.682e+02 2.120e+02 2.475e+02 2.928e+02 4.576e+02, threshold=4.949e+02, percent-clipped=0.0 2023-06-23 11:33:59,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=46740.0, ans=0.2 2023-06-23 11:34:05,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=46740.0, ans=0.0 2023-06-23 11:34:12,920 INFO [train.py:1008] (1/4) Epoch 14, batch 100, loss[loss=0.2409, simple_loss=0.3065, pruned_loss=0.08763, over 19878.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3281, pruned_loss=0.1097, over 1491556.44 frames. ], batch size: 120, lr: 2.05e-02, grad_scale: 32.0 2023-06-23 11:34:32,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=46873.333333333336, ans=0.125 2023-06-23 11:34:34,746 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.30 vs. limit=12.0 2023-06-23 11:34:44,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=46940.0, ans=0.05 2023-06-23 11:35:12,197 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-06-23 11:35:24,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=47073.333333333336, ans=0.0 2023-06-23 11:35:35,495 INFO [train.py:1008] (1/4) Epoch 14, batch 150, loss[loss=0.2741, simple_loss=0.3336, pruned_loss=0.1073, over 19222.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3277, pruned_loss=0.1102, over 2024895.99 frames. 
], batch size: 92, lr: 2.04e-02, grad_scale: 32.0 2023-06-23 11:35:41,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=47140.0, ans=0.125 2023-06-23 11:35:49,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=47140.0, ans=0.05 2023-06-23 11:36:13,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=47273.333333333336, ans=0.125 2023-06-23 11:36:37,862 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 2.008e+02 2.294e+02 2.617e+02 4.085e+02, threshold=4.588e+02, percent-clipped=0.0 2023-06-23 11:36:47,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=47406.666666666664, ans=0.0 2023-06-23 11:36:49,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=47406.666666666664, ans=0.125 2023-06-23 11:36:54,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=47406.666666666664, ans=0.125 2023-06-23 11:36:56,396 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.64 vs. limit=15.0 2023-06-23 11:36:57,104 INFO [train.py:1008] (1/4) Epoch 14, batch 200, loss[loss=0.2642, simple_loss=0.3212, pruned_loss=0.1036, over 20272.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3283, pruned_loss=0.1096, over 2412290.79 frames. ], batch size: 149, lr: 2.04e-02, grad_scale: 32.0 2023-06-23 11:36:57,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=47473.333333333336, ans=0.1 2023-06-23 11:37:20,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=47540.0, ans=0.1 2023-06-23 11:37:31,328 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.01 vs. limit=22.5 2023-06-23 11:38:15,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=47740.0, ans=0.1 2023-06-23 11:38:19,881 INFO [train.py:1008] (1/4) Epoch 14, batch 250, loss[loss=0.2701, simple_loss=0.3247, pruned_loss=0.1077, over 20510.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3275, pruned_loss=0.1087, over 2723145.18 frames. 
], batch size: 160, lr: 2.04e-02, grad_scale: 32.0 2023-06-23 11:38:28,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=47806.666666666664, ans=0.0 2023-06-23 11:38:45,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=47873.333333333336, ans=10.0 2023-06-23 11:39:13,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=48006.666666666664, ans=0.0 2023-06-23 11:39:23,140 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.616e+02 2.088e+02 2.363e+02 2.839e+02 5.052e+02, threshold=4.726e+02, percent-clipped=1.0 2023-06-23 11:39:43,354 INFO [train.py:1008] (1/4) Epoch 14, batch 300, loss[loss=0.2727, simple_loss=0.3329, pruned_loss=0.1062, over 18627.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3272, pruned_loss=0.1085, over 2961215.04 frames. ], batch size: 80, lr: 2.03e-02, grad_scale: 32.0 2023-06-23 11:39:57,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=48140.0, ans=0.0 2023-06-23 11:40:05,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=48206.666666666664, ans=0.04949747468305833 2023-06-23 11:40:08,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=48206.666666666664, ans=0.1 2023-06-23 11:40:41,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=48340.0, ans=0.125 2023-06-23 11:41:06,352 INFO [train.py:1008] (1/4) Epoch 14, batch 350, loss[loss=0.2504, simple_loss=0.3136, pruned_loss=0.09359, over 18622.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3263, pruned_loss=0.1079, over 3161815.85 frames. ], batch size: 80, lr: 2.03e-02, grad_scale: 32.0 2023-06-23 11:41:26,056 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.83 vs. limit=22.5 2023-06-23 11:41:39,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=48606.666666666664, ans=0.95 2023-06-23 11:41:43,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=48606.666666666664, ans=0.00030289855072463713 2023-06-23 11:42:08,173 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.701e+02 2.151e+02 2.394e+02 2.779e+02 4.644e+02, threshold=4.788e+02, percent-clipped=0.0 2023-06-23 11:42:08,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=48673.333333333336, ans=0.125 2023-06-23 11:42:21,078 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.65 vs. limit=15.0 2023-06-23 11:42:23,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=48740.0, ans=0.125 2023-06-23 11:42:28,598 INFO [train.py:1008] (1/4) Epoch 14, batch 400, loss[loss=0.2499, simple_loss=0.3199, pruned_loss=0.09001, over 19234.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3255, pruned_loss=0.1073, over 3317311.71 frames. 
], batch size: 92, lr: 2.03e-02, grad_scale: 32.0 2023-06-23 11:42:28,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=48806.666666666664, ans=0.2 2023-06-23 11:43:10,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=48940.0, ans=0.125 2023-06-23 11:43:24,927 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:43:51,077 INFO [train.py:1008] (1/4) Epoch 14, batch 450, loss[loss=0.2346, simple_loss=0.3017, pruned_loss=0.0837, over 19199.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3257, pruned_loss=0.1075, over 3431078.44 frames. ], batch size: 92, lr: 2.02e-02, grad_scale: 32.0 2023-06-23 11:44:05,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=49206.666666666664, ans=0.125 2023-06-23 11:44:18,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=49206.666666666664, ans=0.0 2023-06-23 11:44:18,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=49206.666666666664, ans=0.125 2023-06-23 11:44:24,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff3.min_abs, batch_count=49273.333333333336, ans=0.2 2023-06-23 11:44:32,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=49273.333333333336, ans=0.125 2023-06-23 11:44:51,420 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 2.043e+02 2.389e+02 2.856e+02 3.989e+02, threshold=4.778e+02, percent-clipped=0.0 2023-06-23 11:45:10,419 INFO [train.py:1008] (1/4) Epoch 14, batch 500, loss[loss=0.2441, simple_loss=0.3141, pruned_loss=0.08711, over 19335.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3259, pruned_loss=0.107, over 3512355.30 frames. ], batch size: 98, lr: 2.02e-02, grad_scale: 32.0 2023-06-23 11:45:29,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=49540.0, ans=0.125 2023-06-23 11:45:34,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=49540.0, ans=0.2 2023-06-23 11:45:38,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=49540.0, ans=0.0 2023-06-23 11:45:46,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=49606.666666666664, ans=0.125 2023-06-23 11:46:24,383 INFO [train.py:1008] (1/4) Epoch 15, batch 0, loss[loss=0.2669, simple_loss=0.325, pruned_loss=0.1044, over 19075.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.325, pruned_loss=0.1044, over 19075.00 frames. ], batch size: 89, lr: 1.95e-02, grad_scale: 32.0 2023-06-23 11:46:24,384 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 11:46:31,368 INFO [train.py:1040] (1/4) Epoch 15, validation: loss=0.2123, simple_loss=0.3108, pruned_loss=0.05692, over 143649.00 frames. 
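The optim.py:471 entries interleaved with the batches summarize the recent distribution of gradient norms: the five numbers read naturally as the minimum, 25th, 50th and 75th percentiles, and maximum of recently observed norms, followed by the clipping threshold in force and the share of updates that exceeded it (percent-clipped). Below is a minimal sketch of how such a summary could be produced, assuming a sliding window of global grad norms and a threshold of clipping_scale times the window median; the window size, the threshold rule, and all names are illustrative assumptions rather than the ScaledAdam optimizer's actual logic.

```python
import numpy as np
import torch

class GradNormMonitor:
    """Illustrative grad-norm diagnostic shaped like the optim.py lines.

    Assumed: the threshold is clipping_scale times the median of recent
    global gradient norms; the real optimizer differs in detail."""

    def __init__(self, clipping_scale=2.0, window=128):
        self.clipping_scale = clipping_scale
        self.window = window
        self.norms = []          # recent global grad norms
        self.clipped = 0
        self.steps = 0

    def clip_(self, parameters):
        parameters = [p for p in parameters if p.grad is not None]
        # global 2-norm over all parameter gradients
        norm = torch.norm(torch.stack([p.grad.norm() for p in parameters])).item()
        self.norms = (self.norms + [norm])[-self.window:]
        threshold = self.clipping_scale * float(np.median(self.norms))
        self.steps += 1
        if norm > threshold:
            self.clipped += 1
            for p in parameters:      # scale gradients down to the threshold
                p.grad.mul_(threshold / norm)
        return norm, threshold

    def summary(self):
        q = np.percentile(self.norms, [0, 25, 50, 75, 100])
        return ("Clipping_scale=%.1f, grad-norm quartiles %s, "
                "threshold=%.3e, percent-clipped=%.1f"
                % (self.clipping_scale,
                   " ".join("%.3e" % v for v in q),
                   self.clipping_scale * q[2],
                   100.0 * self.clipped / max(1, self.steps)))
```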
2023-06-23 11:46:31,369 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 11:46:36,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=49693.333333333336, ans=0.125 2023-06-23 11:46:45,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=49693.333333333336, ans=0.125 2023-06-23 11:47:43,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=49960.0, ans=0.07 2023-06-23 11:47:52,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=50026.666666666664, ans=0.09899494936611666 2023-06-23 11:47:54,012 INFO [train.py:1008] (1/4) Epoch 15, batch 50, loss[loss=0.2735, simple_loss=0.315, pruned_loss=0.116, over 20115.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.325, pruned_loss=0.1071, over 859621.87 frames. ], batch size: 239, lr: 1.95e-02, grad_scale: 32.0 2023-06-23 11:48:00,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=50026.666666666664, ans=0.125 2023-06-23 11:48:01,754 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 2.176e+02 2.428e+02 2.954e+02 4.942e+02, threshold=4.857e+02, percent-clipped=1.0 2023-06-23 11:49:03,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=50293.333333333336, ans=0.0 2023-06-23 11:49:04,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=50293.333333333336, ans=0.0 2023-06-23 11:49:15,446 INFO [train.py:1008] (1/4) Epoch 15, batch 100, loss[loss=0.259, simple_loss=0.3109, pruned_loss=0.1036, over 20516.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3239, pruned_loss=0.1056, over 1509464.03 frames. ], batch size: 173, lr: 1.95e-02, grad_scale: 32.0 2023-06-23 11:49:22,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=50360.0, ans=0.125 2023-06-23 11:49:32,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=50426.666666666664, ans=0.125 2023-06-23 11:49:34,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=50426.666666666664, ans=0.125 2023-06-23 11:49:47,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=50493.333333333336, ans=0.09899494936611666 2023-06-23 11:50:11,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=50560.0, ans=0.0 2023-06-23 11:50:38,475 INFO [train.py:1008] (1/4) Epoch 15, batch 150, loss[loss=0.2836, simple_loss=0.3379, pruned_loss=0.1147, over 19870.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3249, pruned_loss=0.106, over 2020084.19 frames. 
], batch size: 120, lr: 1.94e-02, grad_scale: 32.0 2023-06-23 11:50:46,282 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.892e+02 2.103e+02 2.343e+02 3.831e+02, threshold=4.207e+02, percent-clipped=0.0 2023-06-23 11:50:59,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=50760.0, ans=0.1 2023-06-23 11:51:00,500 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.75 vs. limit=22.5 2023-06-23 11:51:20,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=50826.666666666664, ans=0.2 2023-06-23 11:51:34,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=50893.333333333336, ans=0.1 2023-06-23 11:51:36,380 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.32 vs. limit=10.0 2023-06-23 11:51:44,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.43 vs. limit=15.0 2023-06-23 11:51:53,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=50960.0, ans=0.125 2023-06-23 11:52:01,033 INFO [train.py:1008] (1/4) Epoch 15, batch 200, loss[loss=0.2554, simple_loss=0.3143, pruned_loss=0.09823, over 19071.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3242, pruned_loss=0.1066, over 2406198.79 frames. ], batch size: 94, lr: 1.94e-02, grad_scale: 64.0 2023-06-23 11:52:13,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=51026.666666666664, ans=0.125 2023-06-23 11:52:35,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=51160.0, ans=0.2 2023-06-23 11:52:50,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=51226.666666666664, ans=0.125 2023-06-23 11:52:50,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=51226.666666666664, ans=0.1 2023-06-23 11:53:11,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=51293.333333333336, ans=0.0 2023-06-23 11:53:23,353 INFO [train.py:1008] (1/4) Epoch 15, batch 250, loss[loss=0.2733, simple_loss=0.3363, pruned_loss=0.1052, over 19122.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3229, pruned_loss=0.1053, over 2713234.50 frames. ], batch size: 94, lr: 1.94e-02, grad_scale: 64.0 2023-06-23 11:53:25,907 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.50 vs. 
limit=22.5 2023-06-23 11:53:31,300 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 2.086e+02 2.413e+02 2.905e+02 4.106e+02, threshold=4.826e+02, percent-clipped=0.0 2023-06-23 11:54:08,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=51493.333333333336, ans=0.1 2023-06-23 11:54:44,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=51693.333333333336, ans=0.1 2023-06-23 11:54:45,808 INFO [train.py:1008] (1/4) Epoch 15, batch 300, loss[loss=0.2627, simple_loss=0.331, pruned_loss=0.09714, over 18316.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3227, pruned_loss=0.1057, over 2947014.64 frames. ], batch size: 74, lr: 1.93e-02, grad_scale: 64.0 2023-06-23 11:55:04,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff2.min_abs, batch_count=51760.0, ans=0.1 2023-06-23 11:55:07,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=51760.0, ans=0.0 2023-06-23 11:55:16,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=51826.666666666664, ans=0.1 2023-06-23 11:55:25,915 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.60 vs. limit=10.0 2023-06-23 11:55:30,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=51826.666666666664, ans=0.0 2023-06-23 11:56:07,184 INFO [train.py:1008] (1/4) Epoch 15, batch 350, loss[loss=0.2765, simple_loss=0.3529, pruned_loss=0.1001, over 17642.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3228, pruned_loss=0.1046, over 3132620.73 frames. ], batch size: 67, lr: 1.93e-02, grad_scale: 64.0 2023-06-23 11:56:13,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=52026.666666666664, ans=0.5 2023-06-23 11:56:16,020 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.692e+02 2.082e+02 2.457e+02 3.003e+02 4.853e+02, threshold=4.914e+02, percent-clipped=1.0 2023-06-23 11:56:27,751 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.18 vs. limit=22.5 2023-06-23 11:56:43,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=52160.0, ans=0.0 2023-06-23 11:57:01,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=52226.666666666664, ans=0.1 2023-06-23 11:57:18,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=52293.333333333336, ans=0.2 2023-06-23 11:57:20,362 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.81 vs. limit=6.0 2023-06-23 11:57:22,274 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.74 vs. 
limit=15.0 2023-06-23 11:57:28,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=52360.0, ans=0.09899494936611666 2023-06-23 11:57:28,786 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.80 vs. limit=15.0 2023-06-23 11:57:29,261 INFO [train.py:1008] (1/4) Epoch 15, batch 400, loss[loss=0.2452, simple_loss=0.3122, pruned_loss=0.0891, over 19518.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.323, pruned_loss=0.1046, over 3270584.23 frames. ], batch size: 102, lr: 1.93e-02, grad_scale: 64.0 2023-06-23 11:57:46,638 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.53 vs. limit=15.0 2023-06-23 11:57:52,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=52426.666666666664, ans=0.125 2023-06-23 11:58:01,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_na.min_abs, batch_count=52493.333333333336, ans=0.02 2023-06-23 11:58:04,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=52493.333333333336, ans=0.0 2023-06-23 11:58:08,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=52493.333333333336, ans=0.0 2023-06-23 11:58:46,746 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.90 vs. limit=22.5 2023-06-23 11:58:51,044 INFO [train.py:1008] (1/4) Epoch 15, batch 450, loss[loss=0.2767, simple_loss=0.3317, pruned_loss=0.1109, over 19068.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3233, pruned_loss=0.1043, over 3380445.98 frames. ], batch size: 94, lr: 1.92e-02, grad_scale: 64.0 2023-06-23 11:58:58,841 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 1.980e+02 2.151e+02 2.530e+02 3.585e+02, threshold=4.302e+02, percent-clipped=0.0 2023-06-23 11:59:00,167 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.72 vs. limit=15.0 2023-06-23 11:59:03,506 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.65 vs. limit=22.5 2023-06-23 11:59:21,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=52760.0, ans=0.2 2023-06-23 11:59:22,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=52826.666666666664, ans=0.2 2023-06-23 11:59:34,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=52826.666666666664, ans=0.05 2023-06-23 11:59:43,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=52893.333333333336, ans=0.2 2023-06-23 12:00:11,080 INFO [train.py:1008] (1/4) Epoch 15, batch 500, loss[loss=0.2594, simple_loss=0.3223, pruned_loss=0.09828, over 19233.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3227, pruned_loss=0.1039, over 3488028.61 frames. 
], batch size: 92, lr: 1.92e-02, grad_scale: 64.0 2023-06-23 12:01:23,404 INFO [train.py:1008] (1/4) Epoch 16, batch 0, loss[loss=0.3022, simple_loss=0.3089, pruned_loss=0.1477, over 16940.00 frames. ], tot_loss[loss=0.3022, simple_loss=0.3089, pruned_loss=0.1477, over 16940.00 frames. ], batch size: 391, lr: 1.86e-02, grad_scale: 64.0 2023-06-23 12:01:23,404 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 12:01:28,949 INFO [train.py:1040] (1/4) Epoch 16, validation: loss=0.2067, simple_loss=0.3069, pruned_loss=0.05322, over 143649.00 frames. 2023-06-23 12:01:28,950 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 12:01:34,124 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.41 vs. limit=15.0 2023-06-23 12:01:49,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=53306.666666666664, ans=0.125 2023-06-23 12:02:08,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=53373.333333333336, ans=0.0 2023-06-23 12:02:10,845 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 2.015e+02 2.320e+02 2.731e+02 3.993e+02, threshold=4.641e+02, percent-clipped=0.0 2023-06-23 12:02:34,741 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:02:53,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=53573.333333333336, ans=0.0 2023-06-23 12:02:53,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=53573.333333333336, ans=0.07 2023-06-23 12:02:54,607 INFO [train.py:1008] (1/4) Epoch 16, batch 50, loss[loss=0.2938, simple_loss=0.3425, pruned_loss=0.1226, over 20308.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3208, pruned_loss=0.1045, over 853084.44 frames. ], batch size: 149, lr: 1.86e-02, grad_scale: 32.0 2023-06-23 12:03:02,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=53573.333333333336, ans=0.0 2023-06-23 12:03:38,177 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:03:45,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=53773.333333333336, ans=0.2 2023-06-23 12:04:08,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=53840.0, ans=0.025 2023-06-23 12:04:16,527 INFO [train.py:1008] (1/4) Epoch 16, batch 100, loss[loss=0.2805, simple_loss=0.33, pruned_loss=0.1156, over 20102.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3208, pruned_loss=0.1034, over 1518433.39 frames. ], batch size: 133, lr: 1.85e-02, grad_scale: 32.0 2023-06-23 12:04:28,527 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.97 vs. 
limit=8.0 2023-06-23 12:04:56,725 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 2.011e+02 2.286e+02 2.547e+02 3.194e+02, threshold=4.572e+02, percent-clipped=0.0 2023-06-23 12:04:57,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=54040.0, ans=0.125 2023-06-23 12:05:04,868 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.78 vs. limit=15.0 2023-06-23 12:05:14,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=54106.666666666664, ans=0.125 2023-06-23 12:05:33,234 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.36 vs. limit=15.0 2023-06-23 12:05:40,382 INFO [train.py:1008] (1/4) Epoch 16, batch 150, loss[loss=0.294, simple_loss=0.3613, pruned_loss=0.1133, over 18322.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3212, pruned_loss=0.1035, over 2032927.23 frames. ], batch size: 72, lr: 1.85e-02, grad_scale: 32.0 2023-06-23 12:05:41,295 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.53 vs. limit=10.0 2023-06-23 12:06:12,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=54306.666666666664, ans=0.125 2023-06-23 12:06:42,580 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.30 vs. limit=15.0 2023-06-23 12:06:59,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=54506.666666666664, ans=0.1 2023-06-23 12:07:05,393 INFO [train.py:1008] (1/4) Epoch 16, batch 200, loss[loss=0.2856, simple_loss=0.3482, pruned_loss=0.1116, over 18288.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3216, pruned_loss=0.1031, over 2420733.94 frames. ], batch size: 74, lr: 1.85e-02, grad_scale: 32.0 2023-06-23 12:07:26,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=54640.0, ans=0.1 2023-06-23 12:07:45,470 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 1.985e+02 2.333e+02 2.775e+02 4.718e+02, threshold=4.665e+02, percent-clipped=2.0 2023-06-23 12:08:03,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=54773.333333333336, ans=0.1 2023-06-23 12:08:11,038 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.99 vs. limit=10.0 2023-06-23 12:08:29,509 INFO [train.py:1008] (1/4) Epoch 16, batch 250, loss[loss=0.2523, simple_loss=0.3139, pruned_loss=0.09536, over 19827.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3211, pruned_loss=0.1028, over 2715609.97 frames. 
], batch size: 115, lr: 1.85e-02, grad_scale: 32.0 2023-06-23 12:08:29,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=54906.666666666664, ans=0.125 2023-06-23 12:08:31,948 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.75 vs. limit=22.5 2023-06-23 12:08:36,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=54906.666666666664, ans=0.0 2023-06-23 12:09:11,755 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:09:14,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=55040.0, ans=0.125 2023-06-23 12:09:29,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=55106.666666666664, ans=0.125 2023-06-23 12:09:38,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=55173.333333333336, ans=0.125 2023-06-23 12:09:49,042 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.10 vs. limit=22.5 2023-06-23 12:09:50,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=55173.333333333336, ans=0.2 2023-06-23 12:09:53,088 INFO [train.py:1008] (1/4) Epoch 16, batch 300, loss[loss=0.2635, simple_loss=0.3374, pruned_loss=0.09482, over 15339.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.321, pruned_loss=0.1025, over 2947872.87 frames. ], batch size: 44, lr: 1.84e-02, grad_scale: 32.0 2023-06-23 12:10:01,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=55240.0, ans=0.125 2023-06-23 12:10:07,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=55240.0, ans=0.0 2023-06-23 12:10:12,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=55306.666666666664, ans=0.1 2023-06-23 12:10:34,241 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 2.012e+02 2.441e+02 2.908e+02 3.801e+02, threshold=4.883e+02, percent-clipped=0.0 2023-06-23 12:11:18,528 INFO [train.py:1008] (1/4) Epoch 16, batch 350, loss[loss=0.2461, simple_loss=0.3178, pruned_loss=0.08721, over 19813.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.32, pruned_loss=0.1014, over 3139011.46 frames. ], batch size: 115, lr: 1.84e-02, grad_scale: 32.0 2023-06-23 12:11:22,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=55573.333333333336, ans=0.2 2023-06-23 12:11:46,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=55640.0, ans=0.125 2023-06-23 12:12:43,448 INFO [train.py:1008] (1/4) Epoch 16, batch 400, loss[loss=0.2586, simple_loss=0.3253, pruned_loss=0.09597, over 18931.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3193, pruned_loss=0.1012, over 3285090.40 frames. 
], batch size: 86, lr: 1.84e-02, grad_scale: 32.0 2023-06-23 12:13:24,802 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 2.066e+02 2.371e+02 2.737e+02 3.800e+02, threshold=4.742e+02, percent-clipped=0.0 2023-06-23 12:13:31,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=56040.0, ans=0.2 2023-06-23 12:13:41,006 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.11 vs. limit=22.5 2023-06-23 12:13:59,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=56173.333333333336, ans=0.0 2023-06-23 12:14:04,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=56173.333333333336, ans=0.0 2023-06-23 12:14:09,390 INFO [train.py:1008] (1/4) Epoch 16, batch 450, loss[loss=0.2349, simple_loss=0.306, pruned_loss=0.08189, over 19666.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3191, pruned_loss=0.1018, over 3394643.73 frames. ], batch size: 110, lr: 1.83e-02, grad_scale: 32.0 2023-06-23 12:14:21,996 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.69 vs. limit=15.0 2023-06-23 12:14:25,361 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.42 vs. limit=10.0 2023-06-23 12:14:26,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=56306.666666666664, ans=0.125 2023-06-23 12:14:34,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=56306.666666666664, ans=0.125 2023-06-23 12:14:35,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=56306.666666666664, ans=0.2 2023-06-23 12:14:41,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=56373.333333333336, ans=0.0 2023-06-23 12:15:22,542 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.00 vs. limit=15.0 2023-06-23 12:15:31,262 INFO [train.py:1008] (1/4) Epoch 16, batch 500, loss[loss=0.2525, simple_loss=0.3144, pruned_loss=0.09529, over 19558.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3183, pruned_loss=0.1012, over 3477263.55 frames. ], batch size: 102, lr: 1.83e-02, grad_scale: 32.0 2023-06-23 12:15:47,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=56640.0, ans=0.125 2023-06-23 12:15:53,235 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=7.62 vs. 
limit=15.0 2023-06-23 12:15:57,420 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:16:10,046 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.938e+02 2.207e+02 2.461e+02 4.032e+02, threshold=4.415e+02, percent-clipped=0.0 2023-06-23 12:16:14,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=56706.666666666664, ans=0.0 2023-06-23 12:16:44,472 INFO [train.py:1008] (1/4) Epoch 17, batch 0, loss[loss=0.2648, simple_loss=0.329, pruned_loss=0.1003, over 19553.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.329, pruned_loss=0.1003, over 19553.00 frames. ], batch size: 102, lr: 1.78e-02, grad_scale: 32.0 2023-06-23 12:16:44,472 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 12:16:50,189 INFO [train.py:1040] (1/4) Epoch 17, validation: loss=0.2079, simple_loss=0.3059, pruned_loss=0.0549, over 143649.00 frames. 2023-06-23 12:16:50,191 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 12:17:08,710 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.51 vs. limit=15.0 2023-06-23 12:17:24,039 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.36 vs. limit=15.0 2023-06-23 12:17:24,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=56920.0, ans=0.0 2023-06-23 12:17:24,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=56920.0, ans=0.1 2023-06-23 12:17:25,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=56920.0, ans=0.125 2023-06-23 12:17:41,691 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.89 vs. limit=22.5 2023-06-23 12:18:13,416 INFO [train.py:1008] (1/4) Epoch 17, batch 50, loss[loss=0.2489, simple_loss=0.3167, pruned_loss=0.09058, over 18776.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3189, pruned_loss=0.09931, over 863742.02 frames. ], batch size: 83, lr: 1.77e-02, grad_scale: 32.0 2023-06-23 12:18:17,019 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:18:20,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=57120.0, ans=0.2 2023-06-23 12:18:36,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=57186.666666666664, ans=0.2 2023-06-23 12:19:01,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=57320.0, ans=0.0 2023-06-23 12:19:23,004 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.944e+02 2.187e+02 2.439e+02 3.665e+02, threshold=4.374e+02, percent-clipped=0.0 2023-06-23 12:19:30,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=57386.666666666664, ans=0.125 2023-06-23 12:19:36,922 INFO [train.py:1008] (1/4) Epoch 17, batch 100, loss[loss=0.2546, simple_loss=0.3151, pruned_loss=0.09699, over 18461.00 frames. 
], tot_loss[loss=0.2581, simple_loss=0.3188, pruned_loss=0.09876, over 1512554.14 frames. ], batch size: 77, lr: 1.77e-02, grad_scale: 32.0 2023-06-23 12:20:01,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=57520.0, ans=0.125 2023-06-23 12:21:00,447 INFO [train.py:1008] (1/4) Epoch 17, batch 150, loss[loss=0.2537, simple_loss=0.2993, pruned_loss=0.104, over 20013.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3172, pruned_loss=0.09819, over 2026861.81 frames. ], batch size: 294, lr: 1.77e-02, grad_scale: 32.0 2023-06-23 12:21:03,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=57786.666666666664, ans=0.1 2023-06-23 12:21:15,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=57853.333333333336, ans=0.125 2023-06-23 12:21:43,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=57920.0, ans=0.125 2023-06-23 12:21:51,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=57986.666666666664, ans=0.125 2023-06-23 12:22:00,384 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.52 vs. limit=15.0 2023-06-23 12:22:07,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=58053.333333333336, ans=0.125 2023-06-23 12:22:08,707 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.366e+02 2.008e+02 2.186e+02 2.450e+02 4.680e+02, threshold=4.373e+02, percent-clipped=1.0 2023-06-23 12:22:22,860 INFO [train.py:1008] (1/4) Epoch 17, batch 200, loss[loss=0.2365, simple_loss=0.3004, pruned_loss=0.08627, over 19849.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3178, pruned_loss=0.09849, over 2419940.30 frames. ], batch size: 120, lr: 1.76e-02, grad_scale: 32.0 2023-06-23 12:22:30,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=58120.0, ans=0.1 2023-06-23 12:22:44,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=58186.666666666664, ans=0.125 2023-06-23 12:22:44,593 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.52 vs. 
limit=15.0 2023-06-23 12:22:45,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=58186.666666666664, ans=0.0 2023-06-23 12:22:47,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=58186.666666666664, ans=0.125 2023-06-23 12:22:50,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=58186.666666666664, ans=0.025 2023-06-23 12:22:51,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=58186.666666666664, ans=22.5 2023-06-23 12:22:59,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=58253.333333333336, ans=0.1 2023-06-23 12:23:06,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=58253.333333333336, ans=0.1 2023-06-23 12:23:08,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=15.0 2023-06-23 12:23:27,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=58386.666666666664, ans=0.125 2023-06-23 12:23:45,482 INFO [train.py:1008] (1/4) Epoch 17, batch 250, loss[loss=0.2886, simple_loss=0.3028, pruned_loss=0.1371, over 16829.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3172, pruned_loss=0.09848, over 2725472.88 frames. ], batch size: 391, lr: 1.76e-02, grad_scale: 32.0 2023-06-23 12:23:47,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=58453.333333333336, ans=0.125 2023-06-23 12:23:57,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=58453.333333333336, ans=0.125 2023-06-23 12:23:58,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=58453.333333333336, ans=0.0 2023-06-23 12:24:54,506 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 1.998e+02 2.239e+02 2.590e+02 4.394e+02, threshold=4.478e+02, percent-clipped=1.0 2023-06-23 12:24:55,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=58720.0, ans=0.0 2023-06-23 12:25:08,875 INFO [train.py:1008] (1/4) Epoch 17, batch 300, loss[loss=0.2549, simple_loss=0.3185, pruned_loss=0.09563, over 18278.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3178, pruned_loss=0.09892, over 2967104.66 frames. ], batch size: 74, lr: 1.76e-02, grad_scale: 32.0 2023-06-23 12:25:12,817 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.31 vs. 
limit=15.0 2023-06-23 12:25:23,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=58853.333333333336, ans=0.125 2023-06-23 12:25:23,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=58853.333333333336, ans=0.125 2023-06-23 12:25:44,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=58920.0, ans=0.0 2023-06-23 12:26:05,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=58986.666666666664, ans=0.0 2023-06-23 12:26:13,494 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.38 vs. limit=15.0 2023-06-23 12:26:21,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=59053.333333333336, ans=0.0 2023-06-23 12:26:30,797 INFO [train.py:1008] (1/4) Epoch 17, batch 350, loss[loss=0.3053, simple_loss=0.3563, pruned_loss=0.1272, over 17049.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3175, pruned_loss=0.09853, over 3137978.11 frames. ], batch size: 60, lr: 1.76e-02, grad_scale: 32.0 2023-06-23 12:26:50,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=59186.666666666664, ans=0.0 2023-06-23 12:26:53,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=59186.666666666664, ans=0.125 2023-06-23 12:27:04,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=59253.333333333336, ans=0.125 2023-06-23 12:27:41,366 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.927e+02 2.178e+02 2.510e+02 4.895e+02, threshold=4.356e+02, percent-clipped=2.0 2023-06-23 12:27:45,809 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.01 vs. limit=15.0 2023-06-23 12:27:54,116 INFO [train.py:1008] (1/4) Epoch 17, batch 400, loss[loss=0.2265, simple_loss=0.299, pruned_loss=0.07695, over 18779.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3169, pruned_loss=0.09796, over 3296369.98 frames. ], batch size: 83, lr: 1.75e-02, grad_scale: 32.0 2023-06-23 12:28:07,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=59453.333333333336, ans=0.0 2023-06-23 12:28:20,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=59520.0, ans=0.125 2023-06-23 12:28:23,709 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.61 vs. limit=15.0 2023-06-23 12:28:49,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=59653.333333333336, ans=0.125 2023-06-23 12:29:10,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=59720.0, ans=0.1 2023-06-23 12:29:16,914 INFO [train.py:1008] (1/4) Epoch 17, batch 450, loss[loss=0.2653, simple_loss=0.3349, pruned_loss=0.09786, over 18273.00 frames. 
], tot_loss[loss=0.2555, simple_loss=0.3158, pruned_loss=0.09756, over 3388085.05 frames. ], batch size: 74, lr: 1.75e-02, grad_scale: 32.0 2023-06-23 12:29:17,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=59786.666666666664, ans=0.125 2023-06-23 12:29:35,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=59853.333333333336, ans=0.125 2023-06-23 12:29:39,634 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:29:42,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=59853.333333333336, ans=0.0 2023-06-23 12:29:44,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=59853.333333333336, ans=0.1 2023-06-23 12:30:09,178 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:30:24,547 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 2.084e+02 2.367e+02 2.867e+02 4.445e+02, threshold=4.735e+02, percent-clipped=1.0 2023-06-23 12:30:37,432 INFO [train.py:1008] (1/4) Epoch 17, batch 500, loss[loss=0.282, simple_loss=0.3199, pruned_loss=0.1221, over 19838.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3156, pruned_loss=0.09752, over 3484426.60 frames. ], batch size: 293, lr: 1.75e-02, grad_scale: 32.0 2023-06-23 12:30:46,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=60120.0, ans=0.125 2023-06-23 12:30:50,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=60120.0, ans=0.125 2023-06-23 12:30:52,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=60186.666666666664, ans=0.125 2023-06-23 12:31:49,833 INFO [train.py:1008] (1/4) Epoch 18, batch 0, loss[loss=0.2511, simple_loss=0.3152, pruned_loss=0.09351, over 19695.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3152, pruned_loss=0.09351, over 19695.00 frames. ], batch size: 110, lr: 1.70e-02, grad_scale: 32.0 2023-06-23 12:31:49,834 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 12:31:55,526 INFO [train.py:1040] (1/4) Epoch 18, validation: loss=0.2057, simple_loss=0.3034, pruned_loss=0.05401, over 143649.00 frames. 2023-06-23 12:31:55,527 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 12:32:02,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=60333.333333333336, ans=0.125 2023-06-23 12:32:02,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=60333.333333333336, ans=0.1 2023-06-23 12:32:14,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=60400.0, ans=0.05 2023-06-23 12:32:37,995 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.54 vs. 
limit=15.0 2023-06-23 12:32:41,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=60466.666666666664, ans=0.125 2023-06-23 12:32:42,396 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.11 vs. limit=22.5 2023-06-23 12:32:52,001 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.27 vs. limit=15.0 2023-06-23 12:33:07,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=60600.0, ans=0.0 2023-06-23 12:33:16,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=60666.666666666664, ans=0.0 2023-06-23 12:33:16,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=60666.666666666664, ans=0.0 2023-06-23 12:33:17,515 INFO [train.py:1008] (1/4) Epoch 18, batch 50, loss[loss=0.2658, simple_loss=0.3274, pruned_loss=0.1021, over 19106.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3163, pruned_loss=0.09652, over 856169.31 frames. ], batch size: 94, lr: 1.69e-02, grad_scale: 32.0 2023-06-23 12:33:27,991 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2023-06-23 12:33:33,572 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.949e+02 2.212e+02 2.514e+02 4.263e+02, threshold=4.424e+02, percent-clipped=0.0 2023-06-23 12:33:58,176 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.07 vs. limit=15.0 2023-06-23 12:34:39,502 INFO [train.py:1008] (1/4) Epoch 18, batch 100, loss[loss=0.2643, simple_loss=0.3227, pruned_loss=0.1029, over 19099.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3144, pruned_loss=0.09533, over 1525206.32 frames. ], batch size: 94, lr: 1.69e-02, grad_scale: 32.0 2023-06-23 12:34:42,028 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:35:05,507 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.78 vs. limit=12.0 2023-06-23 12:35:19,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=61133.333333333336, ans=0.2 2023-06-23 12:35:21,331 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=15.0 2023-06-23 12:35:24,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=61133.333333333336, ans=0.0 2023-06-23 12:35:42,028 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.81 vs. limit=15.0 2023-06-23 12:35:52,871 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.05 vs. 
limit=15.0 2023-06-23 12:35:54,244 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-06-23 12:36:02,341 INFO [train.py:1008] (1/4) Epoch 18, batch 150, loss[loss=0.3158, simple_loss=0.37, pruned_loss=0.1307, over 17662.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3146, pruned_loss=0.09626, over 2039097.88 frames. ], batch size: 67, lr: 1.69e-02, grad_scale: 32.0 2023-06-23 12:36:09,430 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:36:18,956 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.917e+02 2.135e+02 2.463e+02 3.823e+02, threshold=4.269e+02, percent-clipped=0.0 2023-06-23 12:36:27,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=61400.0, ans=0.0 2023-06-23 12:36:35,295 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:36:53,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=61533.333333333336, ans=0.07 2023-06-23 12:36:53,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=61533.333333333336, ans=0.125 2023-06-23 12:37:24,477 INFO [train.py:1008] (1/4) Epoch 18, batch 200, loss[loss=0.259, simple_loss=0.3367, pruned_loss=0.09059, over 18342.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3149, pruned_loss=0.0958, over 2436367.63 frames. ], batch size: 72, lr: 1.69e-02, grad_scale: 32.0 2023-06-23 12:37:51,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=61733.333333333336, ans=0.1 2023-06-23 12:37:51,319 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=21.38 vs. limit=22.5 2023-06-23 12:38:25,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=61866.666666666664, ans=0.0 2023-06-23 12:38:33,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=61933.333333333336, ans=0.0 2023-06-23 12:38:45,969 INFO [train.py:1008] (1/4) Epoch 18, batch 250, loss[loss=0.2641, simple_loss=0.3135, pruned_loss=0.1074, over 20250.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3143, pruned_loss=0.0954, over 2731628.86 frames. ], batch size: 239, lr: 1.68e-02, grad_scale: 32.0 2023-06-23 12:39:02,740 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.868e+02 2.092e+02 2.337e+02 4.828e+02, threshold=4.183e+02, percent-clipped=1.0 2023-06-23 12:39:11,101 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.31 vs. 
limit=15.0 2023-06-23 12:39:11,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=62066.666666666664, ans=0.125 2023-06-23 12:39:13,726 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:39:16,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=62066.666666666664, ans=0.0 2023-06-23 12:39:20,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=62133.333333333336, ans=0.1 2023-06-23 12:39:23,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=62133.333333333336, ans=0.0 2023-06-23 12:40:01,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=62266.666666666664, ans=0.1 2023-06-23 12:40:09,902 INFO [train.py:1008] (1/4) Epoch 18, batch 300, loss[loss=0.2707, simple_loss=0.3407, pruned_loss=0.1004, over 17578.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3136, pruned_loss=0.09451, over 2982303.29 frames. ], batch size: 67, lr: 1.68e-02, grad_scale: 32.0 2023-06-23 12:40:18,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=62333.333333333336, ans=0.0 2023-06-23 12:40:18,484 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-23 12:40:29,970 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=12.0 2023-06-23 12:41:25,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=62600.0, ans=0.125 2023-06-23 12:41:27,511 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.87 vs. limit=10.0 2023-06-23 12:41:31,702 INFO [train.py:1008] (1/4) Epoch 18, batch 350, loss[loss=0.2502, simple_loss=0.3092, pruned_loss=0.09555, over 20304.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3128, pruned_loss=0.09421, over 3173045.29 frames. ], batch size: 141, lr: 1.68e-02, grad_scale: 32.0 2023-06-23 12:41:42,324 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.80 vs. limit=15.0 2023-06-23 12:41:45,533 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.23 vs. 
limit=15.0 2023-06-23 12:41:46,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=62733.333333333336, ans=0.125 2023-06-23 12:41:47,678 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.885e+02 2.140e+02 2.544e+02 4.064e+02, threshold=4.281e+02, percent-clipped=0.0 2023-06-23 12:42:34,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=62866.666666666664, ans=0.1 2023-06-23 12:42:36,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=62933.333333333336, ans=0.1 2023-06-23 12:42:46,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=62933.333333333336, ans=0.2 2023-06-23 12:42:49,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=62933.333333333336, ans=0.0 2023-06-23 12:42:49,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=62933.333333333336, ans=0.0 2023-06-23 12:42:54,399 INFO [train.py:1008] (1/4) Epoch 18, batch 400, loss[loss=0.2789, simple_loss=0.2991, pruned_loss=0.1293, over 16761.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3124, pruned_loss=0.0946, over 3313252.26 frames. ], batch size: 392, lr: 1.68e-02, grad_scale: 32.0 2023-06-23 12:43:03,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=63000.0, ans=0.1 2023-06-23 12:44:17,927 INFO [train.py:1008] (1/4) Epoch 18, batch 450, loss[loss=0.253, simple_loss=0.32, pruned_loss=0.093, over 18617.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.312, pruned_loss=0.09423, over 3426350.64 frames. ], batch size: 80, lr: 1.67e-02, grad_scale: 32.0 2023-06-23 12:44:19,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=63333.333333333336, ans=0.2 2023-06-23 12:44:33,786 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 1.966e+02 2.259e+02 2.581e+02 5.300e+02, threshold=4.517e+02, percent-clipped=2.0 2023-06-23 12:44:39,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=63400.0, ans=0.125 2023-06-23 12:44:39,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=63400.0, ans=0.125 2023-06-23 12:45:04,099 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.94 vs. limit=15.0 2023-06-23 12:45:09,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=63533.333333333336, ans=0.125 2023-06-23 12:45:28,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=63600.0, ans=0.0 2023-06-23 12:45:33,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=63600.0, ans=0.0 2023-06-23 12:45:38,026 INFO [train.py:1008] (1/4) Epoch 18, batch 500, loss[loss=0.2843, simple_loss=0.3158, pruned_loss=0.1264, over 19714.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.312, pruned_loss=0.09477, over 3525046.41 frames. 
], batch size: 293, lr: 1.67e-02, grad_scale: 32.0 2023-06-23 12:45:54,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=63733.333333333336, ans=0.125 2023-06-23 12:46:06,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=63733.333333333336, ans=0.1 2023-06-23 12:46:18,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=63800.0, ans=0.0 2023-06-23 12:46:21,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=63800.0, ans=0.0 2023-06-23 12:46:50,136 INFO [train.py:1008] (1/4) Epoch 19, batch 0, loss[loss=0.2518, simple_loss=0.3074, pruned_loss=0.09807, over 20721.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3074, pruned_loss=0.09807, over 20721.00 frames. ], batch size: 211, lr: 1.62e-02, grad_scale: 32.0 2023-06-23 12:46:50,136 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 12:46:55,825 INFO [train.py:1040] (1/4) Epoch 19, validation: loss=0.2047, simple_loss=0.3033, pruned_loss=0.05308, over 143649.00 frames. 2023-06-23 12:46:55,826 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 12:46:59,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=63880.0, ans=0.1 2023-06-23 12:47:25,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=63946.666666666664, ans=15.0 2023-06-23 12:47:41,322 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.842e+02 2.018e+02 2.417e+02 3.791e+02, threshold=4.036e+02, percent-clipped=0.0 2023-06-23 12:47:53,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=64080.0, ans=0.125 2023-06-23 12:47:58,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=64080.0, ans=0.125 2023-06-23 12:48:15,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=64146.666666666664, ans=0.125 2023-06-23 12:48:17,893 INFO [train.py:1008] (1/4) Epoch 19, batch 50, loss[loss=0.233, simple_loss=0.3001, pruned_loss=0.083, over 19110.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3132, pruned_loss=0.09274, over 840424.45 frames. ], batch size: 94, lr: 1.62e-02, grad_scale: 32.0 2023-06-23 12:48:32,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=64280.0, ans=0.09899494936611666 2023-06-23 12:48:32,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=64280.0, ans=0.0 2023-06-23 12:48:41,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.68 vs. 
limit=6.0 2023-06-23 12:48:45,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=64280.0, ans=0.125 2023-06-23 12:49:05,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=64413.333333333336, ans=0.2 2023-06-23 12:49:23,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=64480.0, ans=0.1 2023-06-23 12:49:39,868 INFO [train.py:1008] (1/4) Epoch 19, batch 100, loss[loss=0.2613, simple_loss=0.3268, pruned_loss=0.09795, over 18449.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3118, pruned_loss=0.09481, over 1484955.25 frames. ], batch size: 77, lr: 1.62e-02, grad_scale: 32.0 2023-06-23 12:49:44,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=64546.666666666664, ans=0.05 2023-06-23 12:49:57,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=64613.333333333336, ans=0.2 2023-06-23 12:50:07,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=64613.333333333336, ans=0.125 2023-06-23 12:50:08,107 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=17.57 vs. limit=22.5 2023-06-23 12:50:25,074 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.991e+02 2.213e+02 2.609e+02 4.659e+02, threshold=4.425e+02, percent-clipped=3.0 2023-06-23 12:50:38,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=64746.666666666664, ans=0.2 2023-06-23 12:51:01,324 INFO [train.py:1008] (1/4) Epoch 19, batch 150, loss[loss=0.2331, simple_loss=0.2992, pruned_loss=0.08347, over 19094.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3128, pruned_loss=0.09355, over 1980454.31 frames. ], batch size: 89, lr: 1.62e-02, grad_scale: 32.0 2023-06-23 12:51:24,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=64946.666666666664, ans=0.125 2023-06-23 12:52:19,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=65146.666666666664, ans=0.125 2023-06-23 12:52:21,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=65146.666666666664, ans=0.125 2023-06-23 12:52:24,637 INFO [train.py:1008] (1/4) Epoch 19, batch 200, loss[loss=0.2736, simple_loss=0.3246, pruned_loss=0.1113, over 20133.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3124, pruned_loss=0.09383, over 2404588.39 frames. ], batch size: 133, lr: 1.61e-02, grad_scale: 32.0 2023-06-23 12:52:28,635 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.13 vs. 
limit=15.0 2023-06-23 12:52:31,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=65213.333333333336, ans=0.125 2023-06-23 12:52:33,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=65213.333333333336, ans=0.1 2023-06-23 12:52:44,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=65280.0, ans=0.125 2023-06-23 12:52:44,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=65280.0, ans=0.04949747468305833 2023-06-23 12:53:10,921 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.572e+02 1.917e+02 2.134e+02 2.466e+02 4.511e+02, threshold=4.268e+02, percent-clipped=1.0 2023-06-23 12:53:19,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=65413.333333333336, ans=0.07 2023-06-23 12:53:30,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=65480.0, ans=10.0 2023-06-23 12:53:32,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=65480.0, ans=0.07 2023-06-23 12:53:41,082 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:53:47,871 INFO [train.py:1008] (1/4) Epoch 19, batch 250, loss[loss=0.247, simple_loss=0.2994, pruned_loss=0.09736, over 20334.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3115, pruned_loss=0.0935, over 2707212.29 frames. ], batch size: 239, lr: 1.61e-02, grad_scale: 32.0 2023-06-23 12:53:51,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=65546.66666666667, ans=0.0 2023-06-23 12:53:51,899 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.20 vs. limit=22.5 2023-06-23 12:54:04,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=65613.33333333333, ans=0.0 2023-06-23 12:54:13,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=65613.33333333333, ans=0.125 2023-06-23 12:54:23,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=15.0 2023-06-23 12:54:25,579 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.79 vs. limit=12.0 2023-06-23 12:54:48,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=65746.66666666667, ans=0.2 2023-06-23 12:54:54,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=65813.33333333333, ans=0.125 2023-06-23 12:55:08,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=65813.33333333333, ans=0.0 2023-06-23 12:55:10,891 INFO [train.py:1008] (1/4) Epoch 19, batch 300, loss[loss=0.2344, simple_loss=0.2936, pruned_loss=0.08755, over 20623.00 frames. 
], tot_loss[loss=0.2487, simple_loss=0.3115, pruned_loss=0.09298, over 2954937.30 frames. ], batch size: 173, lr: 1.61e-02, grad_scale: 32.0 2023-06-23 12:55:17,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=65880.0, ans=0.125 2023-06-23 12:55:24,698 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.90 vs. limit=22.5 2023-06-23 12:55:30,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=65946.66666666667, ans=0.025 2023-06-23 12:55:43,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=66013.33333333333, ans=0.0 2023-06-23 12:55:56,537 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.851e+02 2.103e+02 2.314e+02 3.495e+02, threshold=4.207e+02, percent-clipped=0.0 2023-06-23 12:55:56,837 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:56:17,084 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.81 vs. limit=15.0 2023-06-23 12:56:28,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=66146.66666666667, ans=0.1 2023-06-23 12:56:28,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=66146.66666666667, ans=0.0 2023-06-23 12:56:29,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=66146.66666666667, ans=0.125 2023-06-23 12:56:32,558 INFO [train.py:1008] (1/4) Epoch 19, batch 350, loss[loss=0.2505, simple_loss=0.3037, pruned_loss=0.09859, over 20587.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3108, pruned_loss=0.09273, over 3144697.04 frames. ], batch size: 189, lr: 1.61e-02, grad_scale: 32.0 2023-06-23 12:56:33,909 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.41 vs. limit=15.0 2023-06-23 12:56:38,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=66213.33333333333, ans=0.1 2023-06-23 12:57:25,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=66413.33333333333, ans=0.125 2023-06-23 12:57:27,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=66413.33333333333, ans=0.2 2023-06-23 12:57:35,281 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:57:51,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=66480.0, ans=0.0 2023-06-23 12:57:54,138 INFO [train.py:1008] (1/4) Epoch 19, batch 400, loss[loss=0.2479, simple_loss=0.3195, pruned_loss=0.08812, over 18776.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3114, pruned_loss=0.09291, over 3261630.07 frames. 
], batch size: 83, lr: 1.60e-02, grad_scale: 32.0 2023-06-23 12:58:08,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=66546.66666666667, ans=0.1 2023-06-23 12:58:16,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=66613.33333333333, ans=0.125 2023-06-23 12:58:37,498 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:58:40,263 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.858e+02 2.159e+02 2.467e+02 4.399e+02, threshold=4.318e+02, percent-clipped=1.0 2023-06-23 12:59:16,972 INFO [train.py:1008] (1/4) Epoch 19, batch 450, loss[loss=0.2464, simple_loss=0.3128, pruned_loss=0.09001, over 19111.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3114, pruned_loss=0.09277, over 3383871.16 frames. ], batch size: 94, lr: 1.60e-02, grad_scale: 64.0 2023-06-23 12:59:17,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=66880.0, ans=0.2 2023-06-23 12:59:36,978 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.34 vs. limit=12.0 2023-06-23 12:59:57,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=67013.33333333333, ans=0.0 2023-06-23 12:59:59,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=67013.33333333333, ans=0.2 2023-06-23 13:00:11,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=67080.0, ans=0.0 2023-06-23 13:00:13,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=67080.0, ans=0.125 2023-06-23 13:00:19,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=67080.0, ans=0.125 2023-06-23 13:00:21,843 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.11 vs. limit=10.0 2023-06-23 13:00:24,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=67146.66666666667, ans=0.0 2023-06-23 13:00:26,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=67146.66666666667, ans=0.0 2023-06-23 13:00:38,151 INFO [train.py:1008] (1/4) Epoch 19, batch 500, loss[loss=0.23, simple_loss=0.297, pruned_loss=0.08147, over 19867.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.311, pruned_loss=0.09229, over 3476516.37 frames. ], batch size: 120, lr: 1.60e-02, grad_scale: 64.0 2023-06-23 13:00:41,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=67213.33333333333, ans=0.0 2023-06-23 13:00:54,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=67280.0, ans=0.0 2023-06-23 13:01:01,207 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.04 vs. 
limit=15.0 2023-06-23 13:01:02,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=67280.0, ans=0.125 2023-06-23 13:01:22,699 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.865e+02 2.077e+02 2.519e+02 3.781e+02, threshold=4.153e+02, percent-clipped=0.0 2023-06-23 13:01:23,277 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.66 vs. limit=22.5 2023-06-23 13:01:52,654 INFO [train.py:1008] (1/4) Epoch 20, batch 0, loss[loss=0.262, simple_loss=0.3025, pruned_loss=0.1107, over 19989.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3025, pruned_loss=0.1107, over 19989.00 frames. ], batch size: 293, lr: 1.56e-02, grad_scale: 64.0 2023-06-23 13:01:52,655 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 13:01:58,302 INFO [train.py:1040] (1/4) Epoch 20, validation: loss=0.2036, simple_loss=0.3021, pruned_loss=0.05255, over 143649.00 frames. 2023-06-23 13:01:58,303 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 13:02:18,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=67500.0, ans=0.125 2023-06-23 13:02:28,040 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.62 vs. limit=15.0 2023-06-23 13:02:31,373 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.48 vs. limit=15.0 2023-06-23 13:02:54,380 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.90 vs. limit=15.0 2023-06-23 13:03:01,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=67633.33333333333, ans=0.0 2023-06-23 13:03:09,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=67700.0, ans=0.04949747468305833 2023-06-23 13:03:11,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=67700.0, ans=0.125 2023-06-23 13:03:21,123 INFO [train.py:1008] (1/4) Epoch 20, batch 50, loss[loss=0.2245, simple_loss=0.2995, pruned_loss=0.07476, over 19523.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3103, pruned_loss=0.09264, over 857305.69 frames. ], batch size: 102, lr: 1.55e-02, grad_scale: 64.0 2023-06-23 13:03:40,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=67833.33333333333, ans=0.1 2023-06-23 13:03:51,758 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:04:09,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=67966.66666666667, ans=0.95 2023-06-23 13:04:17,780 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.83 vs. 
limit=22.5 2023-06-23 13:04:34,900 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.894e+02 2.119e+02 2.506e+02 3.679e+02, threshold=4.238e+02, percent-clipped=0.0 2023-06-23 13:04:43,801 INFO [train.py:1008] (1/4) Epoch 20, batch 100, loss[loss=0.263, simple_loss=0.3169, pruned_loss=0.1046, over 19942.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3085, pruned_loss=0.09196, over 1517625.65 frames. ], batch size: 126, lr: 1.55e-02, grad_scale: 64.0 2023-06-23 13:05:06,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=68166.66666666667, ans=0.07 2023-06-23 13:05:26,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=68233.33333333333, ans=0.1 2023-06-23 13:05:36,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=68300.0, ans=0.0 2023-06-23 13:05:49,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=68366.66666666667, ans=10.0 2023-06-23 13:06:01,840 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.56 vs. limit=15.0 2023-06-23 13:06:05,411 INFO [train.py:1008] (1/4) Epoch 20, batch 150, loss[loss=0.2464, simple_loss=0.2936, pruned_loss=0.09956, over 19997.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3078, pruned_loss=0.09186, over 2033243.26 frames. ], batch size: 293, lr: 1.55e-02, grad_scale: 64.0 2023-06-23 13:06:07,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=68433.33333333333, ans=0.125 2023-06-23 13:06:16,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=68433.33333333333, ans=0.1 2023-06-23 13:06:17,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=68433.33333333333, ans=0.125 2023-06-23 13:07:17,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=68700.0, ans=0.0 2023-06-23 13:07:18,945 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.872e+02 2.112e+02 2.327e+02 3.463e+02, threshold=4.224e+02, percent-clipped=0.0 2023-06-23 13:07:20,024 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=6.87 vs. limit=15.0 2023-06-23 13:07:26,780 INFO [train.py:1008] (1/4) Epoch 20, batch 200, loss[loss=0.2961, simple_loss=0.3522, pruned_loss=0.1201, over 16360.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3097, pruned_loss=0.09212, over 2418026.91 frames. ], batch size: 52, lr: 1.55e-02, grad_scale: 64.0 2023-06-23 13:07:49,814 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.82 vs. 
limit=10.0 2023-06-23 13:08:17,283 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:08:23,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=68966.66666666667, ans=0.1 2023-06-23 13:08:30,085 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:08:33,734 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.82 vs. limit=15.0 2023-06-23 13:08:48,888 INFO [train.py:1008] (1/4) Epoch 20, batch 250, loss[loss=0.2349, simple_loss=0.2937, pruned_loss=0.08807, over 20570.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3093, pruned_loss=0.09173, over 2720876.42 frames. ], batch size: 189, lr: 1.54e-02, grad_scale: 32.0 2023-06-23 13:09:13,711 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.40 vs. limit=22.5 2023-06-23 13:09:31,415 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.93 vs. limit=15.0 2023-06-23 13:09:34,320 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.71 vs. limit=6.0 2023-06-23 13:09:34,473 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.53 vs. limit=22.5 2023-06-23 13:09:40,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=69300.0, ans=0.0 2023-06-23 13:09:44,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=69300.0, ans=0.125 2023-06-23 13:09:47,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=69300.0, ans=0.0 2023-06-23 13:10:05,271 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 2.000e+02 2.200e+02 2.591e+02 4.331e+02, threshold=4.400e+02, percent-clipped=1.0 2023-06-23 13:10:06,152 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.43 vs. limit=15.0 2023-06-23 13:10:12,167 INFO [train.py:1008] (1/4) Epoch 20, batch 300, loss[loss=0.2462, simple_loss=0.3035, pruned_loss=0.0944, over 20694.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3091, pruned_loss=0.09176, over 2958335.92 frames. ], batch size: 211, lr: 1.54e-02, grad_scale: 32.0 2023-06-23 13:10:28,202 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.71 vs. limit=22.5 2023-06-23 13:11:34,767 INFO [train.py:1008] (1/4) Epoch 20, batch 350, loss[loss=0.2052, simple_loss=0.2709, pruned_loss=0.06973, over 19345.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3085, pruned_loss=0.0911, over 3140299.19 frames. 
], batch size: 98, lr: 1.54e-02, grad_scale: 32.0 2023-06-23 13:11:58,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=69833.33333333333, ans=0.0 2023-06-23 13:12:23,454 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.92 vs. limit=22.5 2023-06-23 13:12:31,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=69966.66666666667, ans=0.0 2023-06-23 13:12:42,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=70033.33333333333, ans=0.2 2023-06-23 13:12:43,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=70033.33333333333, ans=0.125 2023-06-23 13:12:50,071 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.960e+02 2.265e+02 2.644e+02 4.141e+02, threshold=4.531e+02, percent-clipped=0.0 2023-06-23 13:12:56,226 INFO [train.py:1008] (1/4) Epoch 20, batch 400, loss[loss=0.234, simple_loss=0.3094, pruned_loss=0.07933, over 18808.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3091, pruned_loss=0.09138, over 3280373.02 frames. ], batch size: 83, lr: 1.54e-02, grad_scale: 32.0 2023-06-23 13:12:56,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=70100.0, ans=0.5 2023-06-23 13:13:09,790 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:13:23,622 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.54 vs. limit=15.0 2023-06-23 13:13:33,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=70233.33333333333, ans=0.125 2023-06-23 13:13:36,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=70233.33333333333, ans=0.0 2023-06-23 13:13:41,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=70233.33333333333, ans=0.125 2023-06-23 13:13:57,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=70300.0, ans=0.125 2023-06-23 13:14:10,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=70366.66666666667, ans=0.125 2023-06-23 13:14:18,936 INFO [train.py:1008] (1/4) Epoch 20, batch 450, loss[loss=0.2504, simple_loss=0.3253, pruned_loss=0.08769, over 16337.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3087, pruned_loss=0.09109, over 3390792.58 frames. 
], batch size: 52, lr: 1.54e-02, grad_scale: 32.0 2023-06-23 13:14:40,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=70500.0, ans=0.2 2023-06-23 13:14:45,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=70500.0, ans=0.1 2023-06-23 13:14:53,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=70566.66666666667, ans=0.2 2023-06-23 13:14:59,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=70566.66666666667, ans=0.125 2023-06-23 13:15:32,559 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.641e+02 1.927e+02 2.175e+02 2.531e+02 3.833e+02, threshold=4.351e+02, percent-clipped=0.0 2023-06-23 13:15:38,741 INFO [train.py:1008] (1/4) Epoch 20, batch 500, loss[loss=0.2271, simple_loss=0.2975, pruned_loss=0.07839, over 19102.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3083, pruned_loss=0.09164, over 3500031.64 frames. ], batch size: 94, lr: 1.53e-02, grad_scale: 32.0 2023-06-23 13:15:46,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=70766.66666666667, ans=0.0 2023-06-23 13:16:05,573 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.37 vs. limit=15.0 2023-06-23 13:16:49,186 INFO [train.py:1008] (1/4) Epoch 21, batch 0, loss[loss=0.2728, simple_loss=0.3291, pruned_loss=0.1082, over 18487.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3291, pruned_loss=0.1082, over 18487.00 frames. ], batch size: 77, lr: 1.49e-02, grad_scale: 32.0 2023-06-23 13:16:49,187 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 13:16:54,884 INFO [train.py:1040] (1/4) Epoch 21, validation: loss=0.2034, simple_loss=0.3003, pruned_loss=0.05328, over 143649.00 frames. 2023-06-23 13:16:54,884 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 13:16:55,844 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.08 vs. limit=15.0 2023-06-23 13:17:14,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=71046.66666666667, ans=0.2 2023-06-23 13:18:16,793 INFO [train.py:1008] (1/4) Epoch 21, batch 50, loss[loss=0.2616, simple_loss=0.2866, pruned_loss=0.1183, over 16885.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3042, pruned_loss=0.09122, over 852927.59 frames. ], batch size: 392, lr: 1.49e-02, grad_scale: 32.0 2023-06-23 13:18:17,524 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.64 vs. 
limit=22.5 2023-06-23 13:18:24,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=71313.33333333333, ans=0.2 2023-06-23 13:18:27,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=71313.33333333333, ans=0.0 2023-06-23 13:18:28,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=71313.33333333333, ans=0.125 2023-06-23 13:18:32,851 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.45 vs. limit=22.5 2023-06-23 13:18:40,342 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.956e+02 2.297e+02 2.608e+02 3.935e+02, threshold=4.595e+02, percent-clipped=0.0 2023-06-23 13:18:43,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=71380.0, ans=0.125 2023-06-23 13:19:38,751 INFO [train.py:1008] (1/4) Epoch 21, batch 100, loss[loss=0.221, simple_loss=0.2904, pruned_loss=0.07575, over 19806.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3069, pruned_loss=0.09062, over 1514779.54 frames. ], batch size: 115, lr: 1.49e-02, grad_scale: 32.0 2023-06-23 13:19:47,491 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.11 vs. limit=15.0 2023-06-23 13:19:59,083 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.69 vs. limit=15.0 2023-06-23 13:20:36,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=71846.66666666667, ans=0.1 2023-06-23 13:20:40,607 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.30 vs. limit=12.0 2023-06-23 13:20:45,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=71913.33333333333, ans=0.2 2023-06-23 13:20:48,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.whiten.whitening_limit, batch_count=71913.33333333333, ans=12.0 2023-06-23 13:21:01,255 INFO [train.py:1008] (1/4) Epoch 21, batch 150, loss[loss=0.2526, simple_loss=0.3218, pruned_loss=0.09169, over 15346.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.307, pruned_loss=0.09058, over 2023663.78 frames. 
], batch size: 44, lr: 1.49e-02, grad_scale: 32.0 2023-06-23 13:21:01,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=71980.0, ans=0.125 2023-06-23 13:21:25,295 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.847e+02 2.064e+02 2.318e+02 4.387e+02, threshold=4.128e+02, percent-clipped=0.0 2023-06-23 13:21:42,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=72113.33333333333, ans=0.125 2023-06-23 13:21:45,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=72113.33333333333, ans=0.125 2023-06-23 13:22:08,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=72246.66666666667, ans=0.125 2023-06-23 13:22:12,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=72246.66666666667, ans=0.125 2023-06-23 13:22:19,925 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.33 vs. limit=22.5 2023-06-23 13:22:23,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=72313.33333333333, ans=0.125 2023-06-23 13:22:25,803 INFO [train.py:1008] (1/4) Epoch 21, batch 200, loss[loss=0.2535, simple_loss=0.3203, pruned_loss=0.09339, over 17645.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3064, pruned_loss=0.09001, over 2415547.48 frames. ], batch size: 67, lr: 1.49e-02, grad_scale: 32.0 2023-06-23 13:22:32,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=72313.33333333333, ans=0.0 2023-06-23 13:23:07,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=72446.66666666667, ans=0.125 2023-06-23 13:23:47,625 INFO [train.py:1008] (1/4) Epoch 21, batch 250, loss[loss=0.2434, simple_loss=0.2975, pruned_loss=0.09463, over 20208.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3061, pruned_loss=0.0901, over 2735027.17 frames. ], batch size: 239, lr: 1.48e-02, grad_scale: 32.0 2023-06-23 13:24:07,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=72713.33333333333, ans=0.125 2023-06-23 13:24:10,332 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.903e+02 2.120e+02 2.539e+02 3.451e+02, threshold=4.241e+02, percent-clipped=0.0 2023-06-23 13:24:15,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=72713.33333333333, ans=0.1 2023-06-23 13:24:27,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=72780.0, ans=0.125 2023-06-23 13:24:54,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=72913.33333333333, ans=0.125 2023-06-23 13:25:10,681 INFO [train.py:1008] (1/4) Epoch 21, batch 300, loss[loss=0.2262, simple_loss=0.2968, pruned_loss=0.07779, over 19122.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3065, pruned_loss=0.09005, over 2964102.24 frames. 
], batch size: 94, lr: 1.48e-02, grad_scale: 32.0 2023-06-23 13:25:47,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=73113.33333333333, ans=0.125 2023-06-23 13:26:31,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=73313.33333333333, ans=10.0 2023-06-23 13:26:32,892 INFO [train.py:1008] (1/4) Epoch 21, batch 350, loss[loss=0.2266, simple_loss=0.2908, pruned_loss=0.08117, over 20560.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3056, pruned_loss=0.08937, over 3146489.07 frames. ], batch size: 189, lr: 1.48e-02, grad_scale: 32.0 2023-06-23 13:26:43,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=73313.33333333333, ans=0.2 2023-06-23 13:26:45,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=73313.33333333333, ans=0.125 2023-06-23 13:26:48,052 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:26:48,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=73380.0, ans=0.2 2023-06-23 13:26:55,876 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.792e+02 1.983e+02 2.184e+02 3.913e+02, threshold=3.966e+02, percent-clipped=0.0 2023-06-23 13:27:40,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=73580.0, ans=0.125 2023-06-23 13:27:48,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=73580.0, ans=0.2 2023-06-23 13:27:55,450 INFO [train.py:1008] (1/4) Epoch 21, batch 400, loss[loss=0.2435, simple_loss=0.3209, pruned_loss=0.08305, over 17626.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3055, pruned_loss=0.08857, over 3288041.15 frames. ], batch size: 67, lr: 1.48e-02, grad_scale: 32.0 2023-06-23 13:27:55,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=73646.66666666667, ans=0.125 2023-06-23 13:28:02,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=73646.66666666667, ans=0.2 2023-06-23 13:28:09,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=73646.66666666667, ans=0.125 2023-06-23 13:28:14,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=73713.33333333333, ans=0.125 2023-06-23 13:28:24,375 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=15.0 2023-06-23 13:29:18,726 INFO [train.py:1008] (1/4) Epoch 21, batch 450, loss[loss=0.262, simple_loss=0.3303, pruned_loss=0.09686, over 16276.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3053, pruned_loss=0.08817, over 3399333.91 frames. 
], batch size: 52, lr: 1.47e-02, grad_scale: 32.0 2023-06-23 13:29:24,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=73980.0, ans=0.125 2023-06-23 13:29:42,745 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.982e+02 2.281e+02 2.665e+02 3.713e+02, threshold=4.563e+02, percent-clipped=0.0 2023-06-23 13:29:44,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=74046.66666666667, ans=0.125 2023-06-23 13:29:49,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=74046.66666666667, ans=0.05 2023-06-23 13:30:13,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=74180.0, ans=0.0 2023-06-23 13:30:22,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=74180.0, ans=0.125 2023-06-23 13:30:40,110 INFO [train.py:1008] (1/4) Epoch 21, batch 500, loss[loss=0.2606, simple_loss=0.3333, pruned_loss=0.09392, over 18284.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3046, pruned_loss=0.08774, over 3499608.15 frames. ], batch size: 74, lr: 1.47e-02, grad_scale: 32.0 2023-06-23 13:30:43,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=74313.33333333333, ans=0.125 2023-06-23 13:31:03,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=74380.0, ans=0.0 2023-06-23 13:31:04,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=74380.0, ans=0.125 2023-06-23 13:31:12,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=74446.66666666667, ans=0.0 2023-06-23 13:31:18,205 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:31:18,960 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.91 vs. limit=22.5 2023-06-23 13:31:51,773 INFO [train.py:1008] (1/4) Epoch 22, batch 0, loss[loss=0.216, simple_loss=0.2904, pruned_loss=0.07074, over 19643.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2904, pruned_loss=0.07074, over 19643.00 frames. ], batch size: 110, lr: 1.44e-02, grad_scale: 32.0 2023-06-23 13:31:51,774 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 13:31:57,327 INFO [train.py:1040] (1/4) Epoch 22, validation: loss=0.2, simple_loss=0.2979, pruned_loss=0.05103, over 143649.00 frames. 2023-06-23 13:31:57,328 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 13:31:57,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=74533.33333333333, ans=0.2 2023-06-23 13:32:02,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=74533.33333333333, ans=0.125 2023-06-23 13:32:03,187 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.44 vs. 
limit=15.0 2023-06-23 13:32:27,009 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.76 vs. limit=15.0 2023-06-23 13:32:34,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff2.min_abs, batch_count=74666.66666666667, ans=0.1 2023-06-23 13:32:48,450 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.893e+02 2.081e+02 2.479e+02 4.096e+02, threshold=4.162e+02, percent-clipped=0.0 2023-06-23 13:32:52,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=74733.33333333333, ans=0.125 2023-06-23 13:32:53,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=74733.33333333333, ans=0.0 2023-06-23 13:32:54,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=74733.33333333333, ans=0.1 2023-06-23 13:33:14,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=74800.0, ans=0.2 2023-06-23 13:33:20,232 INFO [train.py:1008] (1/4) Epoch 22, batch 50, loss[loss=0.2337, simple_loss=0.2948, pruned_loss=0.08634, over 20592.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3036, pruned_loss=0.08656, over 857235.75 frames. ], batch size: 173, lr: 1.43e-02, grad_scale: 32.0 2023-06-23 13:33:22,790 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.41 vs. limit=15.0 2023-06-23 13:33:32,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=74866.66666666667, ans=0.0 2023-06-23 13:34:19,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=75066.66666666667, ans=0.125 2023-06-23 13:34:42,065 INFO [train.py:1008] (1/4) Epoch 22, batch 100, loss[loss=0.2425, simple_loss=0.3163, pruned_loss=0.08436, over 17639.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3058, pruned_loss=0.08716, over 1508487.56 frames. ], batch size: 67, lr: 1.43e-02, grad_scale: 32.0 2023-06-23 13:34:50,619 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.52 vs. 
limit=15.0 2023-06-23 13:35:01,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=75266.66666666667, ans=0.125 2023-06-23 13:35:03,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=75266.66666666667, ans=0.07 2023-06-23 13:35:18,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=75333.33333333333, ans=0.125 2023-06-23 13:35:26,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=75333.33333333333, ans=0.1 2023-06-23 13:35:29,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=75400.0, ans=0.125 2023-06-23 13:35:32,052 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.896e+02 2.104e+02 2.417e+02 4.108e+02, threshold=4.207e+02, percent-clipped=0.0 2023-06-23 13:35:43,793 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.57 vs. limit=12.0 2023-06-23 13:35:50,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=75466.66666666667, ans=0.035 2023-06-23 13:36:04,952 INFO [train.py:1008] (1/4) Epoch 22, batch 150, loss[loss=0.2374, simple_loss=0.3062, pruned_loss=0.0843, over 18273.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3054, pruned_loss=0.08732, over 2015850.99 frames. ], batch size: 74, lr: 1.43e-02, grad_scale: 32.0 2023-06-23 13:36:48,754 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:36:53,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=75733.33333333333, ans=0.125 2023-06-23 13:36:57,095 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.70 vs. limit=22.5 2023-06-23 13:37:14,260 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.12 vs. limit=15.0 2023-06-23 13:37:26,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=75866.66666666667, ans=0.0 2023-06-23 13:37:27,978 INFO [train.py:1008] (1/4) Epoch 22, batch 200, loss[loss=0.2518, simple_loss=0.3275, pruned_loss=0.08806, over 17551.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.305, pruned_loss=0.08682, over 2415309.56 frames. 
], batch size: 67, lr: 1.43e-02, grad_scale: 32.0 2023-06-23 13:38:07,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=76000.0, ans=0.2 2023-06-23 13:38:16,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=76000.0, ans=0.125 2023-06-23 13:38:21,022 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.823e+02 2.064e+02 2.218e+02 4.834e+02, threshold=4.128e+02, percent-clipped=1.0 2023-06-23 13:38:21,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=76066.66666666667, ans=0.125 2023-06-23 13:38:25,516 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.37 vs. limit=15.0 2023-06-23 13:38:34,388 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.55 vs. limit=15.0 2023-06-23 13:38:35,929 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.18 vs. limit=15.0 2023-06-23 13:38:40,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=76133.33333333333, ans=10.0 2023-06-23 13:38:52,336 INFO [train.py:1008] (1/4) Epoch 22, batch 250, loss[loss=0.2464, simple_loss=0.2993, pruned_loss=0.09675, over 20433.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3049, pruned_loss=0.08705, over 2728766.30 frames. ], batch size: 160, lr: 1.43e-02, grad_scale: 32.0 2023-06-23 13:39:10,879 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.60 vs. limit=6.0 2023-06-23 13:39:15,653 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.03 vs. limit=10.0 2023-06-23 13:39:16,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76266.66666666667, ans=0.1 2023-06-23 13:39:18,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=76266.66666666667, ans=0.125 2023-06-23 13:39:25,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=76333.33333333333, ans=0.0 2023-06-23 13:40:16,054 INFO [train.py:1008] (1/4) Epoch 22, batch 300, loss[loss=0.2252, simple_loss=0.2886, pruned_loss=0.08088, over 20721.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3041, pruned_loss=0.08623, over 2955323.71 frames. 
], batch size: 211, lr: 1.42e-02, grad_scale: 32.0 2023-06-23 13:40:39,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=76600.0, ans=0.125 2023-06-23 13:40:42,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=76600.0, ans=0.125 2023-06-23 13:41:07,041 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.581e+02 1.877e+02 2.058e+02 2.358e+02 3.896e+02, threshold=4.117e+02, percent-clipped=0.0 2023-06-23 13:41:26,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=76800.0, ans=0.125 2023-06-23 13:41:37,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=76866.66666666667, ans=0.125 2023-06-23 13:41:39,128 INFO [train.py:1008] (1/4) Epoch 22, batch 350, loss[loss=0.288, simple_loss=0.3537, pruned_loss=0.1111, over 18317.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3051, pruned_loss=0.08589, over 3136535.14 frames. ], batch size: 72, lr: 1.42e-02, grad_scale: 32.0 2023-06-23 13:41:52,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=76866.66666666667, ans=0.2 2023-06-23 13:42:16,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=77000.0, ans=0.1 2023-06-23 13:42:32,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=77066.66666666667, ans=0.2 2023-06-23 13:42:37,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=77066.66666666667, ans=0.125 2023-06-23 13:43:02,748 INFO [train.py:1008] (1/4) Epoch 22, batch 400, loss[loss=0.2266, simple_loss=0.2978, pruned_loss=0.07763, over 19524.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3046, pruned_loss=0.08613, over 3273980.74 frames. ], batch size: 102, lr: 1.42e-02, grad_scale: 32.0 2023-06-23 13:43:22,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=77266.66666666667, ans=0.125 2023-06-23 13:43:43,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=77333.33333333333, ans=0.0 2023-06-23 13:43:48,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=77333.33333333333, ans=0.0 2023-06-23 13:43:54,276 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.898e+02 2.176e+02 2.564e+02 3.831e+02, threshold=4.352e+02, percent-clipped=0.0 2023-06-23 13:44:20,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=77466.66666666667, ans=0.0 2023-06-23 13:44:26,724 INFO [train.py:1008] (1/4) Epoch 22, batch 450, loss[loss=0.2268, simple_loss=0.2949, pruned_loss=0.07933, over 19201.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3036, pruned_loss=0.08549, over 3395305.03 frames. 
], batch size: 92, lr: 1.42e-02, grad_scale: 32.0 2023-06-23 13:44:52,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=77600.0, ans=0.2 2023-06-23 13:44:54,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=77600.0, ans=0.125 2023-06-23 13:45:14,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=77733.33333333333, ans=0.125 2023-06-23 13:45:42,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=77800.0, ans=0.125 2023-06-23 13:45:47,955 INFO [train.py:1008] (1/4) Epoch 22, batch 500, loss[loss=0.2727, simple_loss=0.3465, pruned_loss=0.09945, over 16398.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.303, pruned_loss=0.08534, over 3490255.78 frames. ], batch size: 52, lr: 1.42e-02, grad_scale: 32.0 2023-06-23 13:45:56,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=77866.66666666667, ans=0.125 2023-06-23 13:45:59,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=77866.66666666667, ans=0.125 2023-06-23 13:46:21,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=78000.0, ans=0.125 2023-06-23 13:46:23,451 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.63 vs. limit=10.0 2023-06-23 13:46:27,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=78000.0, ans=0.2 2023-06-23 13:46:32,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=78000.0, ans=0.125 2023-06-23 13:46:36,461 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.861e+02 2.045e+02 2.383e+02 4.014e+02, threshold=4.089e+02, percent-clipped=0.0 2023-06-23 13:47:00,144 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.40 vs. limit=15.0 2023-06-23 13:47:00,591 INFO [train.py:1008] (1/4) Epoch 23, batch 0, loss[loss=0.2431, simple_loss=0.3026, pruned_loss=0.09181, over 20555.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3026, pruned_loss=0.09181, over 20555.00 frames. ], batch size: 189, lr: 1.38e-02, grad_scale: 32.0 2023-06-23 13:47:00,591 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 13:47:06,604 INFO [train.py:1040] (1/4) Epoch 23, validation: loss=0.1991, simple_loss=0.2976, pruned_loss=0.05029, over 143649.00 frames. 
2023-06-23 13:47:06,605 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 13:47:10,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=78080.0, ans=0.0 2023-06-23 13:47:34,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=78146.66666666667, ans=0.125 2023-06-23 13:47:41,732 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.51 vs. limit=6.0 2023-06-23 13:48:00,518 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.04 vs. limit=15.0 2023-06-23 13:48:03,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=78280.0, ans=0.125 2023-06-23 13:48:13,719 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.09 vs. limit=6.0 2023-06-23 13:48:28,462 INFO [train.py:1008] (1/4) Epoch 23, batch 50, loss[loss=0.2598, simple_loss=0.3313, pruned_loss=0.09415, over 18295.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.302, pruned_loss=0.08531, over 862180.22 frames. ], batch size: 74, lr: 1.38e-02, grad_scale: 32.0 2023-06-23 13:48:46,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=78480.0, ans=0.1 2023-06-23 13:48:49,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=78480.0, ans=0.2 2023-06-23 13:48:57,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=78480.0, ans=0.0 2023-06-23 13:49:09,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=78546.66666666667, ans=0.125 2023-06-23 13:49:38,673 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.81 vs. limit=12.0 2023-06-23 13:49:42,297 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.09 vs. limit=5.0 2023-06-23 13:49:50,571 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.849e+02 2.027e+02 2.431e+02 4.295e+02, threshold=4.054e+02, percent-clipped=1.0 2023-06-23 13:49:52,207 INFO [train.py:1008] (1/4) Epoch 23, batch 100, loss[loss=0.242, simple_loss=0.3145, pruned_loss=0.08473, over 18493.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3021, pruned_loss=0.08464, over 1525546.52 frames. ], batch size: 77, lr: 1.38e-02, grad_scale: 32.0 2023-06-23 13:49:59,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=78746.66666666667, ans=0.125 2023-06-23 13:50:01,134 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:51:15,957 INFO [train.py:1008] (1/4) Epoch 23, batch 150, loss[loss=0.262, simple_loss=0.2864, pruned_loss=0.1188, over 17059.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.302, pruned_loss=0.08534, over 2014408.65 frames. 
], batch size: 391, lr: 1.38e-02, grad_scale: 32.0 2023-06-23 13:51:35,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=79146.66666666667, ans=0.125 2023-06-23 13:52:38,811 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 1.946e+02 2.168e+02 2.387e+02 3.422e+02, threshold=4.335e+02, percent-clipped=0.0 2023-06-23 13:52:40,485 INFO [train.py:1008] (1/4) Epoch 23, batch 200, loss[loss=0.2392, simple_loss=0.2948, pruned_loss=0.09186, over 20691.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3015, pruned_loss=0.08443, over 2414044.42 frames. ], batch size: 211, lr: 1.37e-02, grad_scale: 32.0 2023-06-23 13:52:59,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=79480.0, ans=0.1 2023-06-23 13:53:29,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=79613.33333333333, ans=0.125 2023-06-23 13:53:33,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=79613.33333333333, ans=0.1 2023-06-23 13:53:38,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=79613.33333333333, ans=0.1 2023-06-23 13:53:43,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=79613.33333333333, ans=0.0 2023-06-23 13:53:47,031 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0 2023-06-23 13:53:56,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer2.prob, batch_count=79680.0, ans=0.125 2023-06-23 13:54:03,898 INFO [train.py:1008] (1/4) Epoch 23, batch 250, loss[loss=0.2419, simple_loss=0.3084, pruned_loss=0.08768, over 19883.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.302, pruned_loss=0.0846, over 2716418.02 frames. ], batch size: 120, lr: 1.37e-02, grad_scale: 16.0 2023-06-23 13:54:18,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=79813.33333333333, ans=0.125 2023-06-23 13:54:36,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=79880.0, ans=0.2 2023-06-23 13:54:57,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=79946.66666666667, ans=0.125 2023-06-23 13:55:00,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=79946.66666666667, ans=0.125 2023-06-23 13:55:04,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=79946.66666666667, ans=0.125 2023-06-23 13:55:21,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=80013.33333333333, ans=0.1 2023-06-23 13:55:25,484 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.89 vs. 
limit=15.0 2023-06-23 13:55:26,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=80013.33333333333, ans=0.0 2023-06-23 13:55:29,304 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.905e+02 2.118e+02 2.432e+02 3.559e+02, threshold=4.237e+02, percent-clipped=0.0 2023-06-23 13:55:29,356 INFO [train.py:1008] (1/4) Epoch 23, batch 300, loss[loss=0.2284, simple_loss=0.2994, pruned_loss=0.07863, over 18461.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3022, pruned_loss=0.08479, over 2957422.04 frames. ], batch size: 77, lr: 1.37e-02, grad_scale: 16.0 2023-06-23 13:55:51,763 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=15.0 2023-06-23 13:56:12,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=80213.33333333333, ans=0.125 2023-06-23 13:56:27,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=80280.0, ans=10.0 2023-06-23 13:56:41,292 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.81 vs. limit=15.0 2023-06-23 13:56:49,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=80346.66666666667, ans=0.125 2023-06-23 13:56:50,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=80346.66666666667, ans=0.125 2023-06-23 13:56:52,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.max_positive, batch_count=80413.33333333333, ans=0.95 2023-06-23 13:56:53,740 INFO [train.py:1008] (1/4) Epoch 23, batch 350, loss[loss=0.2471, simple_loss=0.313, pruned_loss=0.09066, over 18643.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3025, pruned_loss=0.0854, over 3141196.77 frames. 
], batch size: 80, lr: 1.37e-02, grad_scale: 16.0 2023-06-23 13:57:02,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=80413.33333333333, ans=0.125 2023-06-23 13:57:03,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=80413.33333333333, ans=0.125 2023-06-23 13:57:34,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=80546.66666666667, ans=0.125 2023-06-23 13:58:00,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=80680.0, ans=0.1 2023-06-23 13:58:05,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=80680.0, ans=0.0 2023-06-23 13:58:12,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=80680.0, ans=0.035 2023-06-23 13:58:17,443 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.818e+02 2.008e+02 2.403e+02 5.012e+02, threshold=4.015e+02, percent-clipped=2.0 2023-06-23 13:58:17,488 INFO [train.py:1008] (1/4) Epoch 23, batch 400, loss[loss=0.2464, simple_loss=0.3175, pruned_loss=0.08764, over 19463.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3024, pruned_loss=0.08476, over 3301994.42 frames. ], batch size: 105, lr: 1.37e-02, grad_scale: 32.0 2023-06-23 13:58:35,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=80813.33333333333, ans=0.0 2023-06-23 13:58:38,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=80813.33333333333, ans=0.125 2023-06-23 13:58:59,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=80880.0, ans=0.125 2023-06-23 13:59:01,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=80880.0, ans=0.125 2023-06-23 13:59:01,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=80880.0, ans=0.0 2023-06-23 13:59:12,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=80946.66666666667, ans=0.125 2023-06-23 13:59:40,691 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.70 vs. limit=15.0 2023-06-23 13:59:41,204 INFO [train.py:1008] (1/4) Epoch 23, batch 450, loss[loss=0.2199, simple_loss=0.2943, pruned_loss=0.07279, over 19308.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3027, pruned_loss=0.08491, over 3417209.68 frames. 
], batch size: 98, lr: 1.36e-02, grad_scale: 32.0 2023-06-23 14:00:46,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=81346.66666666667, ans=0.125 2023-06-23 14:00:46,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=81346.66666666667, ans=0.125 2023-06-23 14:01:02,841 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.592e+02 1.844e+02 2.145e+02 2.439e+02 3.625e+02, threshold=4.291e+02, percent-clipped=0.0 2023-06-23 14:01:02,888 INFO [train.py:1008] (1/4) Epoch 23, batch 500, loss[loss=0.2474, simple_loss=0.306, pruned_loss=0.09437, over 20091.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3009, pruned_loss=0.08461, over 3514521.17 frames. ], batch size: 133, lr: 1.36e-02, grad_scale: 32.0 2023-06-23 14:01:04,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81413.33333333333, ans=0.1 2023-06-23 14:01:39,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=81546.66666666667, ans=0.125 2023-06-23 14:02:13,111 INFO [train.py:1008] (1/4) Epoch 24, batch 0, loss[loss=0.23, simple_loss=0.3088, pruned_loss=0.07558, over 18279.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3088, pruned_loss=0.07558, over 18279.00 frames. ], batch size: 74, lr: 1.33e-02, grad_scale: 32.0 2023-06-23 14:02:13,112 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 14:02:19,037 INFO [train.py:1040] (1/4) Epoch 24, validation: loss=0.2009, simple_loss=0.2983, pruned_loss=0.05178, over 143649.00 frames. 2023-06-23 14:02:19,038 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 14:02:47,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=81693.33333333333, ans=0.0 2023-06-23 14:02:49,562 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:02:51,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=81760.0, ans=0.0 2023-06-23 14:03:28,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=81893.33333333333, ans=0.0 2023-06-23 14:03:32,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=81893.33333333333, ans=0.125 2023-06-23 14:03:41,725 INFO [train.py:1008] (1/4) Epoch 24, batch 50, loss[loss=0.2205, simple_loss=0.2891, pruned_loss=0.07594, over 19704.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.2959, pruned_loss=0.08253, over 873270.30 frames. 
], batch size: 110, lr: 1.33e-02, grad_scale: 32.0 2023-06-23 14:04:06,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=82026.66666666667, ans=0.125 2023-06-23 14:04:06,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=82026.66666666667, ans=0.0 2023-06-23 14:04:11,068 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.845e+02 2.024e+02 2.275e+02 3.151e+02, threshold=4.048e+02, percent-clipped=0.0 2023-06-23 14:04:36,751 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.22 vs. limit=15.0 2023-06-23 14:04:46,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=82226.66666666667, ans=0.2 2023-06-23 14:05:04,402 INFO [train.py:1008] (1/4) Epoch 24, batch 100, loss[loss=0.2248, simple_loss=0.2982, pruned_loss=0.0757, over 19544.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.2987, pruned_loss=0.08244, over 1528082.29 frames. ], batch size: 102, lr: 1.33e-02, grad_scale: 32.0 2023-06-23 14:05:14,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=82293.33333333333, ans=0.04949747468305833 2023-06-23 14:05:26,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=82360.0, ans=0.125 2023-06-23 14:05:32,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=82360.0, ans=0.2 2023-06-23 14:05:59,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=82493.33333333333, ans=0.125 2023-06-23 14:06:03,476 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-06-23 14:06:27,409 INFO [train.py:1008] (1/4) Epoch 24, batch 150, loss[loss=0.2191, simple_loss=0.2878, pruned_loss=0.07522, over 19329.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.2996, pruned_loss=0.08245, over 2022419.63 frames. 
], batch size: 98, lr: 1.33e-02, grad_scale: 32.0 2023-06-23 14:06:28,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=82626.66666666667, ans=0.0 2023-06-23 14:06:51,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=82693.33333333333, ans=0.2 2023-06-23 14:06:57,414 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.876e+02 2.097e+02 2.518e+02 3.387e+02, threshold=4.194e+02, percent-clipped=0.0 2023-06-23 14:07:26,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=82826.66666666667, ans=0.0 2023-06-23 14:07:33,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=82893.33333333333, ans=0.1 2023-06-23 14:07:45,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=82893.33333333333, ans=0.125 2023-06-23 14:07:45,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=82893.33333333333, ans=0.0 2023-06-23 14:07:50,549 INFO [train.py:1008] (1/4) Epoch 24, batch 200, loss[loss=0.226, simple_loss=0.299, pruned_loss=0.07648, over 18609.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.2993, pruned_loss=0.08226, over 2418032.77 frames. ], batch size: 80, lr: 1.32e-02, grad_scale: 32.0 2023-06-23 14:07:57,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=82960.0, ans=0.125 2023-06-23 14:08:02,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=82960.0, ans=0.1 2023-06-23 14:08:04,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=82960.0, ans=0.125 2023-06-23 14:08:23,648 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.26 vs. limit=15.0 2023-06-23 14:09:04,409 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:09:14,269 INFO [train.py:1008] (1/4) Epoch 24, batch 250, loss[loss=0.2341, simple_loss=0.3025, pruned_loss=0.0828, over 18936.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.2993, pruned_loss=0.08211, over 2736925.85 frames. ], batch size: 86, lr: 1.32e-02, grad_scale: 32.0 2023-06-23 14:09:43,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=83360.0, ans=0.0 2023-06-23 14:09:46,005 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.854e+02 2.071e+02 2.345e+02 4.579e+02, threshold=4.143e+02, percent-clipped=1.0 2023-06-23 14:09:56,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=83426.66666666667, ans=0.1 2023-06-23 14:10:38,022 INFO [train.py:1008] (1/4) Epoch 24, batch 300, loss[loss=0.2201, simple_loss=0.2947, pruned_loss=0.07273, over 19078.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.2998, pruned_loss=0.08237, over 2981384.38 frames. 
], batch size: 89, lr: 1.32e-02, grad_scale: 16.0 2023-06-23 14:10:52,174 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.89 vs. limit=15.0 2023-06-23 14:11:04,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=83693.33333333333, ans=0.2 2023-06-23 14:11:13,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=83760.0, ans=0.04949747468305833 2023-06-23 14:11:37,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=83826.66666666667, ans=0.0 2023-06-23 14:11:37,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=83826.66666666667, ans=0.025 2023-06-23 14:11:50,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=83893.33333333333, ans=0.1 2023-06-23 14:11:53,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=83893.33333333333, ans=0.1 2023-06-23 14:12:02,742 INFO [train.py:1008] (1/4) Epoch 24, batch 350, loss[loss=0.2074, simple_loss=0.2766, pruned_loss=0.06916, over 20125.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.2988, pruned_loss=0.08245, over 3150460.35 frames. ], batch size: 133, lr: 1.32e-02, grad_scale: 16.0 2023-06-23 14:12:07,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=83960.0, ans=0.2 2023-06-23 14:12:25,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=84026.66666666667, ans=0.1 2023-06-23 14:12:35,438 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.880e+02 2.120e+02 2.567e+02 4.330e+02, threshold=4.240e+02, percent-clipped=2.0 2023-06-23 14:12:35,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=84093.33333333333, ans=0.125 2023-06-23 14:12:38,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=84093.33333333333, ans=0.125 2023-06-23 14:13:10,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=84226.66666666667, ans=0.125 2023-06-23 14:13:26,825 INFO [train.py:1008] (1/4) Epoch 24, batch 400, loss[loss=0.2196, simple_loss=0.2909, pruned_loss=0.07412, over 18941.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.2988, pruned_loss=0.08295, over 3285655.90 frames. 
], batch size: 86, lr: 1.32e-02, grad_scale: 32.0 2023-06-23 14:13:30,250 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:14:02,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=84426.66666666667, ans=0.1 2023-06-23 14:14:24,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=84493.33333333333, ans=0.125 2023-06-23 14:14:25,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=84493.33333333333, ans=0.125 2023-06-23 14:14:30,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=84493.33333333333, ans=0.0 2023-06-23 14:14:50,366 INFO [train.py:1008] (1/4) Epoch 24, batch 450, loss[loss=0.2541, simple_loss=0.3031, pruned_loss=0.1026, over 20026.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.2992, pruned_loss=0.08327, over 3398051.23 frames. ], batch size: 293, lr: 1.31e-02, grad_scale: 32.0 2023-06-23 14:15:14,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=84693.33333333333, ans=0.125 2023-06-23 14:15:18,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=84693.33333333333, ans=0.0 2023-06-23 14:15:22,866 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.867e+02 2.068e+02 2.284e+02 2.784e+02, threshold=4.137e+02, percent-clipped=0.0 2023-06-23 14:15:38,108 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.82 vs. limit=6.0 2023-06-23 14:15:45,394 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.14 vs. limit=10.0 2023-06-23 14:15:50,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=84826.66666666667, ans=0.0 2023-06-23 14:15:54,092 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.78 vs. limit=22.5 2023-06-23 14:15:55,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=84893.33333333333, ans=0.05 2023-06-23 14:16:12,064 INFO [train.py:1008] (1/4) Epoch 24, batch 500, loss[loss=0.2209, simple_loss=0.2871, pruned_loss=0.07739, over 19329.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.2981, pruned_loss=0.08276, over 3502603.31 frames. 
], batch size: 98, lr: 1.31e-02, grad_scale: 32.0 2023-06-23 14:16:12,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=84960.0, ans=0.1 2023-06-23 14:16:17,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=84960.0, ans=0.0 2023-06-23 14:16:27,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=85026.66666666667, ans=0.125 2023-06-23 14:16:41,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=85026.66666666667, ans=0.1 2023-06-23 14:16:41,674 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:17:24,928 INFO [train.py:1008] (1/4) Epoch 25, batch 0, loss[loss=0.2363, simple_loss=0.304, pruned_loss=0.08428, over 19773.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.304, pruned_loss=0.08428, over 19773.00 frames. ], batch size: 115, lr: 1.29e-02, grad_scale: 32.0 2023-06-23 14:17:24,929 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 14:17:30,721 INFO [train.py:1040] (1/4) Epoch 25, validation: loss=0.2009, simple_loss=0.2979, pruned_loss=0.05193, over 143649.00 frames. 2023-06-23 14:17:30,722 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 14:17:40,583 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.36 vs. limit=15.0 2023-06-23 14:18:28,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=85373.33333333333, ans=0.0 2023-06-23 14:18:32,977 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.781e+02 2.037e+02 2.242e+02 3.347e+02, threshold=4.075e+02, percent-clipped=0.0 2023-06-23 14:18:41,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=85440.0, ans=0.07 2023-06-23 14:18:54,959 INFO [train.py:1008] (1/4) Epoch 25, batch 50, loss[loss=0.2146, simple_loss=0.2896, pruned_loss=0.06977, over 18479.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.2954, pruned_loss=0.08262, over 850667.53 frames. ], batch size: 77, lr: 1.28e-02, grad_scale: 32.0 2023-06-23 14:18:55,952 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.24 vs. limit=6.0 2023-06-23 14:19:01,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=85506.66666666667, ans=0.125 2023-06-23 14:19:13,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=85573.33333333333, ans=0.0 2023-06-23 14:20:16,944 INFO [train.py:1008] (1/4) Epoch 25, batch 100, loss[loss=0.2257, simple_loss=0.3018, pruned_loss=0.07483, over 18454.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.2968, pruned_loss=0.0811, over 1501853.45 frames. 
], batch size: 77, lr: 1.28e-02, grad_scale: 32.0 2023-06-23 14:20:53,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=85973.33333333333, ans=0.0 2023-06-23 14:21:18,855 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.789e+02 1.946e+02 2.136e+02 3.859e+02, threshold=3.893e+02, percent-clipped=0.0 2023-06-23 14:21:21,757 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.01 vs. limit=15.0 2023-06-23 14:21:40,873 INFO [train.py:1008] (1/4) Epoch 25, batch 150, loss[loss=0.2306, simple_loss=0.2807, pruned_loss=0.09027, over 19942.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.2977, pruned_loss=0.08153, over 2002802.49 frames. ], batch size: 294, lr: 1.28e-02, grad_scale: 32.0 2023-06-23 14:22:14,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=86306.66666666667, ans=0.2 2023-06-23 14:22:21,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=86306.66666666667, ans=0.125 2023-06-23 14:22:31,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=86373.33333333333, ans=0.2 2023-06-23 14:22:45,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=86373.33333333333, ans=0.0 2023-06-23 14:22:53,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=86440.0, ans=0.0 2023-06-23 14:22:58,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=86440.0, ans=0.125 2023-06-23 14:23:04,979 INFO [train.py:1008] (1/4) Epoch 25, batch 200, loss[loss=0.2142, simple_loss=0.2846, pruned_loss=0.07185, over 18567.00 frames. ], tot_loss[loss=0.231, simple_loss=0.2972, pruned_loss=0.08235, over 2403209.60 frames. ], batch size: 80, lr: 1.28e-02, grad_scale: 32.0 2023-06-23 14:23:21,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=86573.33333333333, ans=10.0 2023-06-23 14:23:40,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.min_positive, batch_count=86640.0, ans=0.025 2023-06-23 14:23:57,720 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.02 vs. 
limit=12.0 2023-06-23 14:23:58,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=86706.66666666667, ans=0.125 2023-06-23 14:24:00,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=86706.66666666667, ans=0.125 2023-06-23 14:24:00,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=86706.66666666667, ans=0.125 2023-06-23 14:24:01,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=86706.66666666667, ans=0.125 2023-06-23 14:24:06,848 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.790e+02 2.060e+02 2.451e+02 3.773e+02, threshold=4.120e+02, percent-clipped=0.0 2023-06-23 14:24:08,722 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.65 vs. limit=15.0 2023-06-23 14:24:14,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=86773.33333333333, ans=0.125 2023-06-23 14:24:23,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=86773.33333333333, ans=0.125 2023-06-23 14:24:23,530 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.11 vs. limit=15.0 2023-06-23 14:24:29,892 INFO [train.py:1008] (1/4) Epoch 25, batch 250, loss[loss=0.2502, simple_loss=0.3221, pruned_loss=0.0891, over 18610.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.2976, pruned_loss=0.08186, over 2702849.70 frames. ], batch size: 80, lr: 1.28e-02, grad_scale: 32.0 2023-06-23 14:24:47,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=86906.66666666667, ans=0.125 2023-06-23 14:24:54,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=86906.66666666667, ans=0.125 2023-06-23 14:24:59,381 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=6.034e-02 2023-06-23 14:25:27,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=87040.0, ans=0.1 2023-06-23 14:25:42,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=87106.66666666667, ans=0.125 2023-06-23 14:25:51,645 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.04 vs. limit=12.0 2023-06-23 14:25:53,879 INFO [train.py:1008] (1/4) Epoch 25, batch 300, loss[loss=0.2368, simple_loss=0.3123, pruned_loss=0.08062, over 18929.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.2968, pruned_loss=0.08184, over 2947473.73 frames. 
], batch size: 86, lr: 1.27e-02, grad_scale: 32.0 2023-06-23 14:26:47,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=87373.33333333333, ans=0.125 2023-06-23 14:26:55,939 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.827e+02 2.003e+02 2.228e+02 3.545e+02, threshold=4.005e+02, percent-clipped=0.0 2023-06-23 14:26:58,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.80 vs. limit=15.0 2023-06-23 14:27:08,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=87440.0, ans=0.2 2023-06-23 14:27:17,109 INFO [train.py:1008] (1/4) Epoch 25, batch 350, loss[loss=0.2609, simple_loss=0.3335, pruned_loss=0.09411, over 16387.00 frames. ], tot_loss[loss=0.23, simple_loss=0.2977, pruned_loss=0.08113, over 3141163.48 frames. ], batch size: 52, lr: 1.27e-02, grad_scale: 32.0 2023-06-23 14:27:27,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=87506.66666666667, ans=0.125 2023-06-23 14:27:50,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=87640.0, ans=0.1 2023-06-23 14:28:08,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=87706.66666666667, ans=0.05 2023-06-23 14:28:38,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=87773.33333333333, ans=0.125 2023-06-23 14:28:41,117 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.01 vs. limit=15.0 2023-06-23 14:28:41,730 INFO [train.py:1008] (1/4) Epoch 25, batch 400, loss[loss=0.2282, simple_loss=0.2966, pruned_loss=0.07987, over 18913.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.2975, pruned_loss=0.08109, over 3291961.47 frames. 
], batch size: 86, lr: 1.27e-02, grad_scale: 32.0 2023-06-23 14:29:16,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=87973.33333333333, ans=0.0 2023-06-23 14:29:28,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=87973.33333333333, ans=0.1 2023-06-23 14:29:30,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=87973.33333333333, ans=15.0 2023-06-23 14:29:35,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88040.0, ans=0.1 2023-06-23 14:29:36,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88040.0, ans=0.1 2023-06-23 14:29:43,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88040.0, ans=0.1 2023-06-23 14:29:44,286 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.862e+02 1.997e+02 2.222e+02 3.084e+02, threshold=3.994e+02, percent-clipped=0.0 2023-06-23 14:29:47,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=88106.66666666667, ans=0.125 2023-06-23 14:29:50,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=88106.66666666667, ans=0.125 2023-06-23 14:29:51,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=88106.66666666667, ans=0.125 2023-06-23 14:30:06,716 INFO [train.py:1008] (1/4) Epoch 25, batch 450, loss[loss=0.2376, simple_loss=0.3222, pruned_loss=0.07653, over 18324.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.2973, pruned_loss=0.08144, over 3397631.07 frames. ], batch size: 72, lr: 1.27e-02, grad_scale: 32.0 2023-06-23 14:30:15,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=88173.33333333333, ans=0.125 2023-06-23 14:30:25,586 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.91 vs. limit=22.5 2023-06-23 14:31:14,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=88440.0, ans=15.0 2023-06-23 14:31:27,987 INFO [train.py:1008] (1/4) Epoch 25, batch 500, loss[loss=0.2207, simple_loss=0.2997, pruned_loss=0.07088, over 19046.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.2973, pruned_loss=0.08117, over 3482907.03 frames. 
], batch size: 89, lr: 1.27e-02, grad_scale: 32.0 2023-06-23 14:31:28,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=88506.66666666667, ans=0.125 2023-06-23 14:31:46,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88573.33333333333, ans=0.1 2023-06-23 14:32:11,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=88640.0, ans=0.1 2023-06-23 14:32:38,437 INFO [train.py:1008] (1/4) Epoch 26, batch 0, loss[loss=0.2253, simple_loss=0.3023, pruned_loss=0.07413, over 18272.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3023, pruned_loss=0.07413, over 18272.00 frames. ], batch size: 74, lr: 1.24e-02, grad_scale: 32.0 2023-06-23 14:32:38,437 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 14:32:44,186 INFO [train.py:1040] (1/4) Epoch 26, validation: loss=0.1977, simple_loss=0.2958, pruned_loss=0.04978, over 143649.00 frames. 2023-06-23 14:32:44,186 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 14:32:52,805 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.812e+02 2.007e+02 2.182e+02 2.880e+02, threshold=4.014e+02, percent-clipped=0.0 2023-06-23 14:32:59,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88786.66666666667, ans=0.1 2023-06-23 14:33:05,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=88786.66666666667, ans=0.1 2023-06-23 14:33:05,772 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.70 vs. limit=22.5 2023-06-23 14:34:07,104 INFO [train.py:1008] (1/4) Epoch 26, batch 50, loss[loss=0.2298, simple_loss=0.2887, pruned_loss=0.08544, over 20272.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.2973, pruned_loss=0.08071, over 846362.56 frames. ], batch size: 239, lr: 1.24e-02, grad_scale: 32.0 2023-06-23 14:34:09,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=89053.33333333333, ans=0.0 2023-06-23 14:34:11,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=89053.33333333333, ans=0.1 2023-06-23 14:34:19,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=89053.33333333333, ans=0.0 2023-06-23 14:34:23,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=89120.0, ans=0.1 2023-06-23 14:34:23,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=89120.0, ans=0.125 2023-06-23 14:35:26,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=89320.0, ans=0.0 2023-06-23 14:35:29,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=89386.66666666667, ans=0.125 2023-06-23 14:35:30,373 INFO [train.py:1008] (1/4) Epoch 26, batch 100, loss[loss=0.2335, simple_loss=0.3192, pruned_loss=0.07388, over 17620.00 frames. 
], tot_loss[loss=0.2262, simple_loss=0.2962, pruned_loss=0.07806, over 1499521.65 frames. ], batch size: 67, lr: 1.24e-02, grad_scale: 32.0 2023-06-23 14:35:39,096 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.887e+02 2.093e+02 2.327e+02 3.480e+02, threshold=4.186e+02, percent-clipped=0.0 2023-06-23 14:35:57,709 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.55 vs. limit=6.0 2023-06-23 14:36:05,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=89520.0, ans=0.0 2023-06-23 14:36:13,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=89520.0, ans=0.125 2023-06-23 14:36:18,874 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.69 vs. limit=6.0 2023-06-23 14:36:51,926 INFO [train.py:1008] (1/4) Epoch 26, batch 150, loss[loss=0.2224, simple_loss=0.2871, pruned_loss=0.07881, over 20128.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2975, pruned_loss=0.07983, over 2004627.68 frames. ], batch size: 133, lr: 1.24e-02, grad_scale: 32.0 2023-06-23 14:37:06,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=89720.0, ans=12.0 2023-06-23 14:37:22,866 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.89 vs. limit=22.5 2023-06-23 14:38:00,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=89986.66666666667, ans=0.0 2023-06-23 14:38:13,806 INFO [train.py:1008] (1/4) Epoch 26, batch 200, loss[loss=0.2287, simple_loss=0.2954, pruned_loss=0.081, over 20335.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2971, pruned_loss=0.08049, over 2416400.48 frames. ], batch size: 149, lr: 1.23e-02, grad_scale: 32.0 2023-06-23 14:38:22,160 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.871e+02 2.185e+02 2.799e+02 4.409e+02, threshold=4.369e+02, percent-clipped=2.0 2023-06-23 14:38:58,660 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.56 vs. limit=15.0 2023-06-23 14:39:06,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=90253.33333333333, ans=0.125 2023-06-23 14:39:13,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=90253.33333333333, ans=0.125 2023-06-23 14:39:13,665 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.52 vs. limit=15.0 2023-06-23 14:39:33,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=90320.0, ans=0.0 2023-06-23 14:39:35,923 INFO [train.py:1008] (1/4) Epoch 26, batch 250, loss[loss=0.2315, simple_loss=0.3055, pruned_loss=0.07871, over 16250.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.2977, pruned_loss=0.0806, over 2701997.79 frames. 
], batch size: 52, lr: 1.23e-02, grad_scale: 32.0 2023-06-23 14:40:21,716 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.84 vs. limit=6.0 2023-06-23 14:40:27,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=90586.66666666667, ans=0.125 2023-06-23 14:40:30,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=90586.66666666667, ans=0.125 2023-06-23 14:40:57,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=90720.0, ans=0.125 2023-06-23 14:40:59,291 INFO [train.py:1008] (1/4) Epoch 26, batch 300, loss[loss=0.2208, simple_loss=0.2923, pruned_loss=0.07466, over 18923.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2971, pruned_loss=0.07998, over 2944736.37 frames. ], batch size: 86, lr: 1.23e-02, grad_scale: 32.0 2023-06-23 14:41:07,098 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.789e+02 2.016e+02 2.294e+02 4.058e+02, threshold=4.033e+02, percent-clipped=0.0 2023-06-23 14:41:14,652 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.07 vs. limit=15.0 2023-06-23 14:41:23,358 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.78 vs. limit=15.0 2023-06-23 14:41:49,092 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.13 vs. limit=15.0 2023-06-23 14:41:59,073 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.10 vs. limit=22.5 2023-06-23 14:42:00,007 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:42:03,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=90986.66666666667, ans=0.125 2023-06-23 14:42:10,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=90986.66666666667, ans=0.035 2023-06-23 14:42:19,787 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:42:21,045 INFO [train.py:1008] (1/4) Epoch 26, batch 350, loss[loss=0.2389, simple_loss=0.3168, pruned_loss=0.08046, over 17581.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2961, pruned_loss=0.08032, over 3122026.94 frames. 
], batch size: 67, lr: 1.23e-02, grad_scale: 8.0 2023-06-23 14:42:42,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=91120.0, ans=0.0 2023-06-23 14:42:51,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=91186.66666666667, ans=0.125 2023-06-23 14:42:58,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=91186.66666666667, ans=0.125 2023-06-23 14:43:03,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=91186.66666666667, ans=0.2 2023-06-23 14:43:21,530 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.56 vs. limit=22.5 2023-06-23 14:43:29,712 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.66 vs. limit=15.0 2023-06-23 14:43:42,024 INFO [train.py:1008] (1/4) Epoch 26, batch 400, loss[loss=0.2206, simple_loss=0.2885, pruned_loss=0.07633, over 19703.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2958, pruned_loss=0.07964, over 3264742.92 frames. ], batch size: 110, lr: 1.23e-02, grad_scale: 16.0 2023-06-23 14:43:48,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=91386.66666666667, ans=0.0 2023-06-23 14:43:52,736 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.810e+02 2.065e+02 2.376e+02 3.330e+02, threshold=4.130e+02, percent-clipped=0.0 2023-06-23 14:44:01,054 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.489e-01 2023-06-23 14:44:12,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=91520.0, ans=0.07 2023-06-23 14:44:15,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=91520.0, ans=0.0 2023-06-23 14:44:27,969 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.83 vs. limit=15.0 2023-06-23 14:44:29,959 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=10.95 vs. limit=15.0 2023-06-23 14:44:50,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=91653.33333333333, ans=0.125 2023-06-23 14:44:52,914 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.48 vs. limit=6.0 2023-06-23 14:45:03,409 INFO [train.py:1008] (1/4) Epoch 26, batch 450, loss[loss=0.2298, simple_loss=0.2956, pruned_loss=0.082, over 20313.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2956, pruned_loss=0.07945, over 3391066.23 frames. ], batch size: 141, lr: 1.23e-02, grad_scale: 16.0 2023-06-23 14:45:03,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=91720.0, ans=0.125 2023-06-23 14:45:14,008 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.11 vs. 
limit=15.0 2023-06-23 14:46:05,247 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=22.5 2023-06-23 14:46:09,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=91986.66666666667, ans=0.0 2023-06-23 14:46:23,510 INFO [train.py:1008] (1/4) Epoch 26, batch 500, loss[loss=0.2574, simple_loss=0.3143, pruned_loss=0.1003, over 20088.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2964, pruned_loss=0.07951, over 3465400.73 frames. ], batch size: 133, lr: 1.22e-02, grad_scale: 16.0 2023-06-23 14:46:34,085 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.741e+02 1.935e+02 2.246e+02 4.045e+02, threshold=3.869e+02, percent-clipped=0.0 2023-06-23 14:46:43,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=92120.0, ans=0.0 2023-06-23 14:46:51,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=92120.0, ans=0.125 2023-06-23 14:47:36,847 INFO [train.py:1008] (1/4) Epoch 27, batch 0, loss[loss=0.2251, simple_loss=0.3057, pruned_loss=0.07226, over 10733.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3057, pruned_loss=0.07226, over 10733.00 frames. ], batch size: 30, lr: 1.20e-02, grad_scale: 32.0 2023-06-23 14:47:36,848 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 14:47:42,453 INFO [train.py:1040] (1/4) Epoch 27, validation: loss=0.1977, simple_loss=0.2951, pruned_loss=0.05021, over 143649.00 frames. 2023-06-23 14:47:42,454 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 14:47:47,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=92273.33333333333, ans=0.1 2023-06-23 14:48:37,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=92473.33333333333, ans=0.1 2023-06-23 14:48:44,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=92473.33333333333, ans=0.125 2023-06-23 14:48:48,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=92473.33333333333, ans=0.125 2023-06-23 14:48:50,390 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.36 vs. limit=15.0 2023-06-23 14:49:08,338 INFO [train.py:1008] (1/4) Epoch 27, batch 50, loss[loss=0.2237, simple_loss=0.2943, pruned_loss=0.07658, over 19962.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2936, pruned_loss=0.08033, over 855173.58 frames. ], batch size: 126, lr: 1.20e-02, grad_scale: 32.0 2023-06-23 14:49:20,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=92606.66666666667, ans=0.0 2023-06-23 14:49:21,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=92606.66666666667, ans=0.125 2023-06-23 14:49:30,543 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.27 vs. 
limit=15.0 2023-06-23 14:49:37,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=92673.33333333333, ans=0.125 2023-06-23 14:49:48,388 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.807e+02 2.048e+02 2.353e+02 3.369e+02, threshold=4.097e+02, percent-clipped=0.0 2023-06-23 14:49:53,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=92740.0, ans=0.015 2023-06-23 14:50:12,959 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.78 vs. limit=22.5 2023-06-23 14:50:32,179 INFO [train.py:1008] (1/4) Epoch 27, batch 100, loss[loss=0.2187, simple_loss=0.2899, pruned_loss=0.07375, over 19133.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2957, pruned_loss=0.08009, over 1502895.29 frames. ], batch size: 94, lr: 1.20e-02, grad_scale: 32.0 2023-06-23 14:50:41,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=92940.0, ans=0.125 2023-06-23 14:50:41,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=92940.0, ans=0.1 2023-06-23 14:50:42,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=92940.0, ans=0.0 2023-06-23 14:50:56,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=93006.66666666667, ans=0.125 2023-06-23 14:51:00,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=93006.66666666667, ans=0.0 2023-06-23 14:51:03,135 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.49 vs. limit=15.0 2023-06-23 14:51:55,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=93273.33333333333, ans=0.125 2023-06-23 14:51:56,220 INFO [train.py:1008] (1/4) Epoch 27, batch 150, loss[loss=0.2167, simple_loss=0.2877, pruned_loss=0.07288, over 20105.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2952, pruned_loss=0.07819, over 1996438.43 frames. ], batch size: 133, lr: 1.19e-02, grad_scale: 32.0 2023-06-23 14:51:59,304 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.83 vs. limit=10.0 2023-06-23 14:52:29,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=93406.66666666667, ans=0.125 2023-06-23 14:52:38,094 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.808e+02 2.052e+02 2.276e+02 3.582e+02, threshold=4.103e+02, percent-clipped=0.0 2023-06-23 14:52:40,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=93406.66666666667, ans=0.125 2023-06-23 14:52:57,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=93473.33333333333, ans=0.125 2023-06-23 14:53:02,229 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.70 vs. 
limit=15.0 2023-06-23 14:53:07,818 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:53:08,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=93540.0, ans=0.125 2023-06-23 14:53:22,641 INFO [train.py:1008] (1/4) Epoch 27, batch 200, loss[loss=0.2041, simple_loss=0.2818, pruned_loss=0.06317, over 19093.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2954, pruned_loss=0.07901, over 2401337.49 frames. ], batch size: 94, lr: 1.19e-02, grad_scale: 32.0 2023-06-23 14:53:27,010 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-23 14:53:44,190 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.14 vs. limit=22.5 2023-06-23 14:53:45,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=93673.33333333333, ans=0.0 2023-06-23 14:53:53,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=93673.33333333333, ans=0.09899494936611666 2023-06-23 14:54:01,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=93740.0, ans=0.09899494936611666 2023-06-23 14:54:18,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=93806.66666666667, ans=0.05 2023-06-23 14:54:31,244 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.85 vs. limit=22.5 2023-06-23 14:54:45,938 INFO [train.py:1008] (1/4) Epoch 27, batch 250, loss[loss=0.2158, simple_loss=0.2943, pruned_loss=0.06866, over 18950.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2945, pruned_loss=0.07831, over 2711456.11 frames. ], batch size: 86, lr: 1.19e-02, grad_scale: 16.0 2023-06-23 14:55:00,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=93940.0, ans=0.0 2023-06-23 14:55:27,789 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.850e+02 2.043e+02 2.344e+02 3.255e+02, threshold=4.087e+02, percent-clipped=0.0 2023-06-23 14:55:30,490 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.25 vs. 
limit=15.0 2023-06-23 14:55:36,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=94140.0, ans=0.125 2023-06-23 14:55:36,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=94140.0, ans=0.125 2023-06-23 14:55:50,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=94140.0, ans=0.09899494936611666 2023-06-23 14:55:54,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=94206.66666666667, ans=0.2 2023-06-23 14:56:05,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=94206.66666666667, ans=0.125 2023-06-23 14:56:09,897 INFO [train.py:1008] (1/4) Epoch 27, batch 300, loss[loss=0.2497, simple_loss=0.3212, pruned_loss=0.08911, over 15165.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2945, pruned_loss=0.07853, over 2936259.28 frames. ], batch size: 43, lr: 1.19e-02, grad_scale: 16.0 2023-06-23 14:56:12,321 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.07 vs. limit=15.0 2023-06-23 14:56:13,802 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.11 vs. limit=12.0 2023-06-23 14:56:28,974 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:56:44,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=94406.66666666667, ans=0.0 2023-06-23 14:56:55,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=94406.66666666667, ans=0.125 2023-06-23 14:57:12,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94473.33333333333, ans=0.1 2023-06-23 14:57:16,140 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.96 vs. limit=22.5 2023-06-23 14:57:26,383 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=20.82 vs. limit=22.5 2023-06-23 14:57:33,341 INFO [train.py:1008] (1/4) Epoch 27, batch 350, loss[loss=0.2084, simple_loss=0.2847, pruned_loss=0.06604, over 19487.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2939, pruned_loss=0.0782, over 3119023.79 frames. 
], batch size: 105, lr: 1.19e-02, grad_scale: 16.0 2023-06-23 14:57:49,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=94673.33333333333, ans=0.0 2023-06-23 14:58:05,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=94740.0, ans=0.125 2023-06-23 14:58:15,017 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.806e+02 2.000e+02 2.279e+02 3.758e+02, threshold=4.000e+02, percent-clipped=0.0 2023-06-23 14:58:33,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94806.66666666667, ans=0.1 2023-06-23 14:58:56,653 INFO [train.py:1008] (1/4) Epoch 27, batch 400, loss[loss=0.2103, simple_loss=0.2853, pruned_loss=0.06764, over 19713.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2937, pruned_loss=0.0783, over 3274996.67 frames. ], batch size: 110, lr: 1.19e-02, grad_scale: 32.0 2023-06-23 14:58:59,392 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.99 vs. limit=22.5 2023-06-23 14:59:08,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=94940.0, ans=0.125 2023-06-23 14:59:08,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=94940.0, ans=0.0 2023-06-23 14:59:20,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=95006.66666666667, ans=0.125 2023-06-23 14:59:23,065 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.88 vs. limit=15.0 2023-06-23 14:59:37,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=95073.33333333333, ans=0.0 2023-06-23 14:59:37,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=95073.33333333333, ans=0.0 2023-06-23 14:59:41,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=95073.33333333333, ans=0.0 2023-06-23 14:59:52,661 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.91 vs. limit=15.0 2023-06-23 14:59:55,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=95140.0, ans=0.1 2023-06-23 15:00:20,246 INFO [train.py:1008] (1/4) Epoch 27, batch 450, loss[loss=0.2155, simple_loss=0.29, pruned_loss=0.07046, over 18936.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2938, pruned_loss=0.07802, over 3384368.70 frames. ], batch size: 86, lr: 1.18e-02, grad_scale: 32.0 2023-06-23 15:00:35,325 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.37 vs. 
limit=15.0 2023-06-23 15:00:50,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=95340.0, ans=0.0 2023-06-23 15:01:02,083 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.792e+02 1.984e+02 2.286e+02 3.799e+02, threshold=3.968e+02, percent-clipped=0.0 2023-06-23 15:01:12,002 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.91 vs. limit=22.5 2023-06-23 15:01:40,189 INFO [train.py:1008] (1/4) Epoch 27, batch 500, loss[loss=0.227, simple_loss=0.2992, pruned_loss=0.07742, over 19124.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2945, pruned_loss=0.07866, over 3485084.54 frames. ], batch size: 94, lr: 1.18e-02, grad_scale: 32.0 2023-06-23 15:02:07,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=95673.33333333333, ans=0.2 2023-06-23 15:02:12,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=95740.0, ans=0.125 2023-06-23 15:02:12,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=95740.0, ans=0.0 2023-06-23 15:02:13,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=95740.0, ans=0.1 2023-06-23 15:02:23,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=95740.0, ans=0.05 2023-06-23 15:02:26,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=95806.66666666667, ans=0.1 2023-06-23 15:02:29,164 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.66 vs. limit=15.0 2023-06-23 15:02:54,453 INFO [train.py:1008] (1/4) Epoch 28, batch 0, loss[loss=0.214, simple_loss=0.2836, pruned_loss=0.07223, over 19563.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2836, pruned_loss=0.07223, over 19563.00 frames. ], batch size: 102, lr: 1.16e-02, grad_scale: 32.0 2023-06-23 15:02:54,453 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 15:03:00,353 INFO [train.py:1040] (1/4) Epoch 28, validation: loss=0.1967, simple_loss=0.2958, pruned_loss=0.0488, over 143649.00 frames. 2023-06-23 15:03:00,354 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 15:03:40,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=95960.0, ans=0.0 2023-06-23 15:03:47,875 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.04 vs. limit=15.0 2023-06-23 15:04:08,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=96093.33333333333, ans=0.125 2023-06-23 15:04:09,226 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.841e+02 2.095e+02 2.528e+02 5.157e+02, threshold=4.189e+02, percent-clipped=2.0 2023-06-23 15:04:23,617 INFO [train.py:1008] (1/4) Epoch 28, batch 50, loss[loss=0.2153, simple_loss=0.28, pruned_loss=0.07529, over 20590.00 frames. 
], tot_loss[loss=0.2247, simple_loss=0.2947, pruned_loss=0.07734, over 856227.74 frames. ], batch size: 189, lr: 1.16e-02, grad_scale: 32.0 2023-06-23 15:04:24,999 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.50 vs. limit=5.0 2023-06-23 15:04:34,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=96160.0, ans=0.0 2023-06-23 15:04:45,315 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.68 vs. limit=22.5 2023-06-23 15:04:51,389 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:05:01,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=96293.33333333333, ans=0.0 2023-06-23 15:05:44,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=96493.33333333333, ans=0.0 2023-06-23 15:05:46,203 INFO [train.py:1008] (1/4) Epoch 28, batch 100, loss[loss=0.2395, simple_loss=0.3219, pruned_loss=0.07851, over 17595.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2924, pruned_loss=0.07673, over 1519149.92 frames. ], batch size: 67, lr: 1.16e-02, grad_scale: 32.0 2023-06-23 15:05:47,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=96493.33333333333, ans=0.1 2023-06-23 15:06:01,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=96560.0, ans=0.125 2023-06-23 15:06:02,165 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.02 vs. limit=15.0 2023-06-23 15:06:03,265 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:06:26,759 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.06 vs. limit=22.5 2023-06-23 15:06:27,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=96626.66666666667, ans=0.125 2023-06-23 15:06:35,642 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.19 vs. limit=15.0 2023-06-23 15:06:41,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=96693.33333333333, ans=0.125 2023-06-23 15:06:42,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=96693.33333333333, ans=0.1 2023-06-23 15:06:50,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=96760.0, ans=0.125 2023-06-23 15:06:54,819 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.744e+02 1.940e+02 2.256e+02 4.424e+02, threshold=3.879e+02, percent-clipped=2.0 2023-06-23 15:07:08,092 INFO [train.py:1008] (1/4) Epoch 28, batch 150, loss[loss=0.2289, simple_loss=0.303, pruned_loss=0.07737, over 18253.00 frames. 
], tot_loss[loss=0.2221, simple_loss=0.2926, pruned_loss=0.07587, over 2025008.67 frames. ], batch size: 74, lr: 1.16e-02, grad_scale: 32.0 2023-06-23 15:07:17,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=96826.66666666667, ans=0.0 2023-06-23 15:07:28,398 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:08:29,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=97160.0, ans=0.125 2023-06-23 15:08:31,642 INFO [train.py:1008] (1/4) Epoch 28, batch 200, loss[loss=0.2205, simple_loss=0.2898, pruned_loss=0.07564, over 19107.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2937, pruned_loss=0.07619, over 2412463.68 frames. ], batch size: 94, lr: 1.15e-02, grad_scale: 32.0 2023-06-23 15:08:50,643 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.30 vs. limit=10.0 2023-06-23 15:09:01,802 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.11 vs. limit=8.0 2023-06-23 15:09:04,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=97293.33333333333, ans=0.0 2023-06-23 15:09:26,736 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.59 vs. limit=15.0 2023-06-23 15:09:41,909 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.794e+02 1.961e+02 2.215e+02 3.143e+02, threshold=3.923e+02, percent-clipped=0.0 2023-06-23 15:09:54,736 INFO [train.py:1008] (1/4) Epoch 28, batch 250, loss[loss=0.2203, simple_loss=0.2927, pruned_loss=0.07398, over 18312.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2928, pruned_loss=0.07574, over 2712715.56 frames. ], batch size: 74, lr: 1.15e-02, grad_scale: 32.0 2023-06-23 15:10:09,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=97493.33333333333, ans=0.125 2023-06-23 15:10:21,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=97560.0, ans=0.1 2023-06-23 15:10:48,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=97693.33333333333, ans=0.0 2023-06-23 15:11:20,039 INFO [train.py:1008] (1/4) Epoch 28, batch 300, loss[loss=0.2284, simple_loss=0.3036, pruned_loss=0.07659, over 18304.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2926, pruned_loss=0.07624, over 2946667.53 frames. 
], batch size: 74, lr: 1.15e-02, grad_scale: 32.0 2023-06-23 15:11:41,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=97893.33333333333, ans=0.0 2023-06-23 15:11:48,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=97893.33333333333, ans=0.0 2023-06-23 15:11:50,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=97893.33333333333, ans=0.0 2023-06-23 15:12:07,336 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-23 15:12:33,706 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.834e+02 2.088e+02 2.410e+02 3.645e+02, threshold=4.176e+02, percent-clipped=0.0 2023-06-23 15:12:44,780 INFO [train.py:1008] (1/4) Epoch 28, batch 350, loss[loss=0.2187, simple_loss=0.2918, pruned_loss=0.07278, over 19456.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2921, pruned_loss=0.07592, over 3137101.01 frames. ], batch size: 105, lr: 1.15e-02, grad_scale: 16.0 2023-06-23 15:13:15,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=98226.66666666667, ans=0.1 2023-06-23 15:13:30,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=98293.33333333333, ans=0.2 2023-06-23 15:13:52,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=98426.66666666667, ans=0.0 2023-06-23 15:13:59,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=98426.66666666667, ans=0.125 2023-06-23 15:14:08,969 INFO [train.py:1008] (1/4) Epoch 28, batch 400, loss[loss=0.2296, simple_loss=0.3111, pruned_loss=0.07403, over 17582.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2927, pruned_loss=0.07615, over 3266558.02 frames. 
], batch size: 67, lr: 1.15e-02, grad_scale: 32.0 2023-06-23 15:14:10,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=98493.33333333333, ans=0.0 2023-06-23 15:14:22,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=98493.33333333333, ans=0.2 2023-06-23 15:14:32,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=98560.0, ans=0.09899494936611666 2023-06-23 15:14:35,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=98560.0, ans=0.125 2023-06-23 15:14:50,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=98626.66666666667, ans=0.2 2023-06-23 15:14:54,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=98626.66666666667, ans=0.04949747468305833 2023-06-23 15:15:10,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=98693.33333333333, ans=0.1 2023-06-23 15:15:21,706 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.819e+02 2.049e+02 2.416e+02 3.292e+02, threshold=4.098e+02, percent-clipped=0.0 2023-06-23 15:15:33,629 INFO [train.py:1008] (1/4) Epoch 28, batch 450, loss[loss=0.2137, simple_loss=0.2859, pruned_loss=0.07076, over 18787.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2925, pruned_loss=0.0763, over 3396054.94 frames. ], batch size: 83, lr: 1.15e-02, grad_scale: 32.0 2023-06-23 15:15:41,075 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.20 vs. limit=15.0 2023-06-23 15:15:48,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=98893.33333333333, ans=10.0 2023-06-23 15:15:48,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=98893.33333333333, ans=0.2 2023-06-23 15:16:12,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=98960.0, ans=0.0 2023-06-23 15:16:28,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=99026.66666666667, ans=0.0 2023-06-23 15:16:33,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=99026.66666666667, ans=0.1 2023-06-23 15:16:42,275 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.19 vs. limit=15.0 2023-06-23 15:16:55,964 INFO [train.py:1008] (1/4) Epoch 28, batch 500, loss[loss=0.2362, simple_loss=0.3127, pruned_loss=0.07986, over 15543.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2927, pruned_loss=0.07579, over 3482307.69 frames. 
], batch size: 44, lr: 1.15e-02, grad_scale: 32.0 2023-06-23 15:16:57,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=99160.0, ans=0.0 2023-06-23 15:17:18,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=99226.66666666667, ans=0.1 2023-06-23 15:18:09,966 INFO [train.py:1008] (1/4) Epoch 29, batch 0, loss[loss=0.2304, simple_loss=0.303, pruned_loss=0.07892, over 18441.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.303, pruned_loss=0.07892, over 18441.00 frames. ], batch size: 77, lr: 1.12e-02, grad_scale: 32.0 2023-06-23 15:18:09,966 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 15:18:15,773 INFO [train.py:1040] (1/4) Epoch 29, validation: loss=0.1936, simple_loss=0.2924, pruned_loss=0.04736, over 143649.00 frames. 2023-06-23 15:18:15,774 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 15:18:18,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=99380.0, ans=0.125 2023-06-23 15:18:32,882 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.762e+02 1.895e+02 2.191e+02 3.587e+02, threshold=3.791e+02, percent-clipped=0.0 2023-06-23 15:19:16,944 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=12.01 vs. limit=12.0 2023-06-23 15:19:30,519 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.70 vs. limit=6.0 2023-06-23 15:19:39,090 INFO [train.py:1008] (1/4) Epoch 29, batch 50, loss[loss=0.2322, simple_loss=0.3031, pruned_loss=0.08064, over 20298.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2914, pruned_loss=0.07428, over 862938.56 frames. ], batch size: 149, lr: 1.12e-02, grad_scale: 32.0 2023-06-23 15:19:46,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=99713.33333333333, ans=0.125 2023-06-23 15:19:58,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=99780.0, ans=0.025 2023-06-23 15:20:07,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=99780.0, ans=0.0 2023-06-23 15:20:19,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=99846.66666666667, ans=0.0 2023-06-23 15:20:21,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=99846.66666666667, ans=0.125 2023-06-23 15:20:21,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=99846.66666666667, ans=0.125 2023-06-23 15:20:39,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=99913.33333333333, ans=0.125 2023-06-23 15:20:40,105 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.44 vs. 
limit=22.5 2023-06-23 15:20:48,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=99980.0, ans=0.125 2023-06-23 15:20:48,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=99980.0, ans=0.125 2023-06-23 15:21:02,862 INFO [train.py:1008] (1/4) Epoch 29, batch 100, loss[loss=0.2399, simple_loss=0.3023, pruned_loss=0.08868, over 19995.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2907, pruned_loss=0.07443, over 1539462.96 frames. ], batch size: 126, lr: 1.12e-02, grad_scale: 32.0 2023-06-23 15:21:04,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=100046.66666666667, ans=0.125 2023-06-23 15:21:12,193 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.54 vs. limit=22.5 2023-06-23 15:21:19,912 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.800e+02 2.046e+02 2.472e+02 3.527e+02, threshold=4.093e+02, percent-clipped=0.0 2023-06-23 15:21:35,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=100180.0, ans=0.0 2023-06-23 15:21:40,836 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.93 vs. limit=15.0 2023-06-23 15:21:50,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=100180.0, ans=0.0 2023-06-23 15:22:00,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=100246.66666666667, ans=0.0 2023-06-23 15:22:11,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=100313.33333333333, ans=0.125 2023-06-23 15:22:27,001 INFO [train.py:1008] (1/4) Epoch 29, batch 150, loss[loss=0.2135, simple_loss=0.29, pruned_loss=0.06848, over 18468.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2914, pruned_loss=0.07609, over 2025754.57 frames. ], batch size: 77, lr: 1.12e-02, grad_scale: 32.0 2023-06-23 15:22:59,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=100513.33333333333, ans=0.1 2023-06-23 15:23:19,881 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.03 vs. limit=22.5 2023-06-23 15:23:29,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=100580.0, ans=0.125 2023-06-23 15:23:36,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=100646.66666666667, ans=0.125 2023-06-23 15:23:36,538 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.91 vs. limit=15.0 2023-06-23 15:23:50,793 INFO [train.py:1008] (1/4) Epoch 29, batch 200, loss[loss=0.2585, simple_loss=0.3329, pruned_loss=0.092, over 16341.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2925, pruned_loss=0.07638, over 2422787.54 frames. 
], batch size: 52, lr: 1.12e-02, grad_scale: 32.0 2023-06-23 15:24:04,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=100713.33333333333, ans=0.0 2023-06-23 15:24:07,780 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.845e+02 2.075e+02 2.500e+02 4.166e+02, threshold=4.151e+02, percent-clipped=1.0 2023-06-23 15:24:22,877 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.48 vs. limit=15.0 2023-06-23 15:24:27,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=100846.66666666667, ans=0.1 2023-06-23 15:24:36,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=100846.66666666667, ans=0.125 2023-06-23 15:24:52,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=100913.33333333333, ans=0.0 2023-06-23 15:24:55,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=100913.33333333333, ans=0.035 2023-06-23 15:24:59,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=100980.0, ans=0.0 2023-06-23 15:25:00,061 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.87 vs. limit=10.0 2023-06-23 15:25:15,548 INFO [train.py:1008] (1/4) Epoch 29, batch 250, loss[loss=0.2099, simple_loss=0.2759, pruned_loss=0.072, over 20502.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2918, pruned_loss=0.07577, over 2735803.05 frames. ], batch size: 160, lr: 1.12e-02, grad_scale: 32.0 2023-06-23 15:25:21,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=101046.66666666667, ans=0.2 2023-06-23 15:25:32,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=101113.33333333333, ans=0.1 2023-06-23 15:25:33,200 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.12 vs. limit=10.0 2023-06-23 15:25:56,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=101180.0, ans=0.0 2023-06-23 15:26:18,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=101246.66666666667, ans=0.125 2023-06-23 15:26:40,105 INFO [train.py:1008] (1/4) Epoch 29, batch 300, loss[loss=0.2107, simple_loss=0.2938, pruned_loss=0.06375, over 17015.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2908, pruned_loss=0.07533, over 2968364.71 frames. 
], batch size: 60, lr: 1.12e-02, grad_scale: 32.0 2023-06-23 15:26:57,098 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.868e+02 2.052e+02 2.316e+02 3.360e+02, threshold=4.104e+02, percent-clipped=0.0 2023-06-23 15:27:05,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=101446.66666666667, ans=0.05 2023-06-23 15:27:05,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=101446.66666666667, ans=0.125 2023-06-23 15:27:06,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=101446.66666666667, ans=0.95 2023-06-23 15:27:10,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=101446.66666666667, ans=0.125 2023-06-23 15:27:40,185 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.83 vs. limit=12.0 2023-06-23 15:28:03,678 INFO [train.py:1008] (1/4) Epoch 29, batch 350, loss[loss=0.2155, simple_loss=0.2909, pruned_loss=0.06999, over 20091.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2905, pruned_loss=0.07548, over 3141335.61 frames. ], batch size: 133, lr: 1.11e-02, grad_scale: 32.0 2023-06-23 15:28:08,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=101713.33333333333, ans=0.1 2023-06-23 15:28:10,044 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.40 vs. limit=10.0 2023-06-23 15:28:11,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=101713.33333333333, ans=0.1 2023-06-23 15:28:34,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=101780.0, ans=0.125 2023-06-23 15:28:37,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=101846.66666666667, ans=0.0 2023-06-23 15:29:10,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=101980.0, ans=0.2 2023-06-23 15:29:18,444 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=7.77 vs. limit=15.0 2023-06-23 15:29:27,176 INFO [train.py:1008] (1/4) Epoch 29, batch 400, loss[loss=0.2272, simple_loss=0.2827, pruned_loss=0.08581, over 20036.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2902, pruned_loss=0.0753, over 3288558.70 frames. 
], batch size: 293, lr: 1.11e-02, grad_scale: 32.0 2023-06-23 15:29:45,539 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.818e+02 2.034e+02 2.468e+02 3.455e+02, threshold=4.068e+02, percent-clipped=0.0 2023-06-23 15:29:45,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=102113.33333333333, ans=0.0 2023-06-23 15:29:49,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=102113.33333333333, ans=0.125 2023-06-23 15:29:56,069 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.97 vs. limit=22.5 2023-06-23 15:29:59,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=102113.33333333333, ans=0.07 2023-06-23 15:30:04,440 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.56 vs. limit=15.0 2023-06-23 15:30:09,337 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2023-06-23 15:30:25,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=102246.66666666667, ans=0.0 2023-06-23 15:30:26,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=102246.66666666667, ans=0.1 2023-06-23 15:30:51,691 INFO [train.py:1008] (1/4) Epoch 29, batch 450, loss[loss=0.2074, simple_loss=0.2809, pruned_loss=0.06691, over 19852.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2907, pruned_loss=0.07574, over 3389901.54 frames. ], batch size: 115, lr: 1.11e-02, grad_scale: 32.0 2023-06-23 15:31:00,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=102380.0, ans=0.0 2023-06-23 15:31:06,798 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.03 vs. limit=12.0 2023-06-23 15:31:20,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=102446.66666666667, ans=0.2 2023-06-23 15:31:39,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=102580.0, ans=0.0 2023-06-23 15:31:45,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=102580.0, ans=0.0 2023-06-23 15:32:03,770 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.18 vs. limit=15.0 2023-06-23 15:32:12,052 INFO [train.py:1008] (1/4) Epoch 29, batch 500, loss[loss=0.2165, simple_loss=0.2823, pruned_loss=0.07537, over 20535.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2905, pruned_loss=0.07511, over 3484133.25 frames. 
], batch size: 173, lr: 1.11e-02, grad_scale: 32.0 2023-06-23 15:32:28,495 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.876e+02 2.021e+02 2.378e+02 4.301e+02, threshold=4.041e+02, percent-clipped=2.0 2023-06-23 15:32:29,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=102780.0, ans=0.125 2023-06-23 15:32:33,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=102780.0, ans=0.125 2023-06-23 15:32:37,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=102780.0, ans=0.125 2023-06-23 15:32:41,407 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.07 vs. limit=15.0 2023-06-23 15:32:51,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=102846.66666666667, ans=0.125 2023-06-23 15:32:53,991 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.03 vs. limit=15.0 2023-06-23 15:33:24,635 INFO [train.py:1008] (1/4) Epoch 30, batch 0, loss[loss=0.2506, simple_loss=0.2987, pruned_loss=0.1012, over 20284.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.2987, pruned_loss=0.1012, over 20284.00 frames. ], batch size: 239, lr: 1.09e-02, grad_scale: 32.0 2023-06-23 15:33:24,635 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 15:33:30,304 INFO [train.py:1040] (1/4) Epoch 30, validation: loss=0.1959, simple_loss=0.2936, pruned_loss=0.04913, over 143649.00 frames. 2023-06-23 15:33:30,305 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 15:33:39,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=102926.66666666667, ans=0.0 2023-06-23 15:33:59,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=102993.33333333333, ans=0.125 2023-06-23 15:34:04,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=103060.0, ans=0.0 2023-06-23 15:34:53,417 INFO [train.py:1008] (1/4) Epoch 30, batch 50, loss[loss=0.2325, simple_loss=0.2974, pruned_loss=0.08376, over 20016.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2887, pruned_loss=0.07682, over 864621.66 frames. 
], batch size: 126, lr: 1.09e-02, grad_scale: 32.0 2023-06-23 15:34:53,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=103260.0, ans=0.125 2023-06-23 15:34:58,659 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:35:09,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=103326.66666666667, ans=0.0 2023-06-23 15:35:20,597 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:35:35,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=103393.33333333333, ans=0.0 2023-06-23 15:35:35,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=103393.33333333333, ans=0.1 2023-06-23 15:35:39,398 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.710e+02 1.932e+02 2.166e+02 3.409e+02, threshold=3.865e+02, percent-clipped=0.0 2023-06-23 15:35:41,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=103460.0, ans=0.1 2023-06-23 15:35:46,265 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:36:01,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=103526.66666666667, ans=0.125 2023-06-23 15:36:16,855 INFO [train.py:1008] (1/4) Epoch 30, batch 100, loss[loss=0.2195, simple_loss=0.2837, pruned_loss=0.07764, over 20614.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2895, pruned_loss=0.07584, over 1530340.43 frames. ], batch size: 189, lr: 1.09e-02, grad_scale: 32.0 2023-06-23 15:36:17,961 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=13.68 vs. limit=15.0 2023-06-23 15:36:36,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=103660.0, ans=0.125 2023-06-23 15:36:36,882 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.03 vs. limit=22.5 2023-06-23 15:36:40,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=103660.0, ans=0.0 2023-06-23 15:37:22,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=103860.0, ans=0.1 2023-06-23 15:37:39,581 INFO [train.py:1008] (1/4) Epoch 30, batch 150, loss[loss=0.2051, simple_loss=0.2839, pruned_loss=0.06312, over 18934.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2897, pruned_loss=0.07555, over 2016469.07 frames. ], batch size: 86, lr: 1.09e-02, grad_scale: 32.0 2023-06-23 15:37:54,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=103993.33333333333, ans=0.125 2023-06-23 15:37:55,524 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.30 vs. 
limit=15.0 2023-06-23 15:38:28,112 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.832e+02 2.048e+02 2.358e+02 3.520e+02, threshold=4.096e+02, percent-clipped=0.0 2023-06-23 15:38:31,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=104126.66666666667, ans=0.125 2023-06-23 15:38:39,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=104126.66666666667, ans=0.2 2023-06-23 15:38:43,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=104126.66666666667, ans=0.0 2023-06-23 15:38:48,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=104193.33333333333, ans=0.0 2023-06-23 15:38:51,730 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.57 vs. limit=22.5 2023-06-23 15:39:00,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=104193.33333333333, ans=0.0 2023-06-23 15:39:02,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=104193.33333333333, ans=0.125 2023-06-23 15:39:05,390 INFO [train.py:1008] (1/4) Epoch 30, batch 200, loss[loss=0.2217, simple_loss=0.2955, pruned_loss=0.07393, over 18632.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2889, pruned_loss=0.07418, over 2424702.43 frames. ], batch size: 80, lr: 1.08e-02, grad_scale: 32.0 2023-06-23 15:39:05,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=104260.0, ans=0.5 2023-06-23 15:39:29,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=104326.66666666667, ans=0.125 2023-06-23 15:40:25,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=104526.66666666667, ans=0.5 2023-06-23 15:40:27,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=104593.33333333333, ans=0.2 2023-06-23 15:40:28,822 INFO [train.py:1008] (1/4) Epoch 30, batch 250, loss[loss=0.2029, simple_loss=0.283, pruned_loss=0.06135, over 18924.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2891, pruned_loss=0.07404, over 2736175.93 frames. ], batch size: 86, lr: 1.08e-02, grad_scale: 32.0 2023-06-23 15:40:30,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=104593.33333333333, ans=0.125 2023-06-23 15:41:16,341 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 1.895e+02 2.101e+02 2.429e+02 4.312e+02, threshold=4.202e+02, percent-clipped=1.0 2023-06-23 15:41:28,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=104793.33333333333, ans=0.125 2023-06-23 15:41:54,050 INFO [train.py:1008] (1/4) Epoch 30, batch 300, loss[loss=0.2349, simple_loss=0.3126, pruned_loss=0.0786, over 17590.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2895, pruned_loss=0.07441, over 2976974.25 frames. 
], batch size: 67, lr: 1.08e-02, grad_scale: 16.0 2023-06-23 15:42:10,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=104993.33333333333, ans=0.2 2023-06-23 15:42:10,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=104993.33333333333, ans=0.125 2023-06-23 15:42:16,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=104993.33333333333, ans=0.0 2023-06-23 15:42:19,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=104993.33333333333, ans=0.5 2023-06-23 15:42:33,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=105060.0, ans=0.125 2023-06-23 15:42:38,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=105060.0, ans=0.2 2023-06-23 15:42:59,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=105193.33333333333, ans=0.0 2023-06-23 15:43:04,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=105193.33333333333, ans=0.125 2023-06-23 15:43:16,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=105260.0, ans=0.125 2023-06-23 15:43:17,703 INFO [train.py:1008] (1/4) Epoch 30, batch 350, loss[loss=0.2185, simple_loss=0.2862, pruned_loss=0.07538, over 20269.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2901, pruned_loss=0.07448, over 3157744.26 frames. ], batch size: 141, lr: 1.08e-02, grad_scale: 16.0 2023-06-23 15:44:07,286 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.775e+02 1.994e+02 2.405e+02 3.305e+02, threshold=3.989e+02, percent-clipped=0.0 2023-06-23 15:44:14,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=105460.0, ans=0.125 2023-06-23 15:44:15,600 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.29 vs. limit=15.0 2023-06-23 15:44:16,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=105460.0, ans=0.0 2023-06-23 15:44:27,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=105526.66666666667, ans=0.1 2023-06-23 15:44:42,291 INFO [train.py:1008] (1/4) Epoch 30, batch 400, loss[loss=0.2448, simple_loss=0.294, pruned_loss=0.09774, over 19922.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2903, pruned_loss=0.07455, over 3307414.19 frames. 
], batch size: 294, lr: 1.08e-02, grad_scale: 32.0 2023-06-23 15:44:42,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=105593.33333333333, ans=0.0 2023-06-23 15:45:17,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=105726.66666666667, ans=0.0 2023-06-23 15:45:33,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=105793.33333333333, ans=0.125 2023-06-23 15:45:45,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=105793.33333333333, ans=0.1 2023-06-23 15:46:06,209 INFO [train.py:1008] (1/4) Epoch 30, batch 450, loss[loss=0.1884, simple_loss=0.2688, pruned_loss=0.05405, over 18943.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2892, pruned_loss=0.07385, over 3419794.81 frames. ], batch size: 86, lr: 1.08e-02, grad_scale: 32.0 2023-06-23 15:46:16,306 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.07 vs. limit=6.0 2023-06-23 15:46:31,382 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.47 vs. limit=15.0 2023-06-23 15:46:49,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=106060.0, ans=0.0 2023-06-23 15:46:55,356 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.841e+02 2.075e+02 2.342e+02 3.586e+02, threshold=4.150e+02, percent-clipped=0.0 2023-06-23 15:47:13,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=106193.33333333333, ans=0.125 2023-06-23 15:47:29,132 INFO [train.py:1008] (1/4) Epoch 30, batch 500, loss[loss=0.248, simple_loss=0.3342, pruned_loss=0.08089, over 18316.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2893, pruned_loss=0.07412, over 3500368.66 frames. ], batch size: 72, lr: 1.08e-02, grad_scale: 32.0 2023-06-23 15:47:59,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=106393.33333333333, ans=0.2 2023-06-23 15:47:59,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=106393.33333333333, ans=0.125 2023-06-23 15:48:05,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=106393.33333333333, ans=0.125 2023-06-23 15:48:08,259 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2023-06-23 15:48:10,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=106393.33333333333, ans=0.125 2023-06-23 15:48:42,479 INFO [train.py:1008] (1/4) Epoch 31, batch 0, loss[loss=0.2263, simple_loss=0.3081, pruned_loss=0.07229, over 18302.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3081, pruned_loss=0.07229, over 18302.00 frames. 
], batch size: 72, lr: 1.06e-02, grad_scale: 32.0 2023-06-23 15:48:42,479 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 15:48:49,575 INFO [train.py:1040] (1/4) Epoch 31, validation: loss=0.1972, simple_loss=0.2938, pruned_loss=0.05034, over 143649.00 frames. 2023-06-23 15:48:49,576 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 15:48:54,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=106480.0, ans=0.125 2023-06-23 15:49:45,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=106680.0, ans=0.0 2023-06-23 15:49:47,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff2.min_abs, batch_count=106680.0, ans=0.1 2023-06-23 15:49:54,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=106680.0, ans=0.0 2023-06-23 15:49:58,058 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=22.5 2023-06-23 15:50:07,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=106746.66666666667, ans=0.025 2023-06-23 15:50:08,417 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.811e+02 2.098e+02 2.397e+02 4.313e+02, threshold=4.196e+02, percent-clipped=1.0 2023-06-23 15:50:12,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=106746.66666666667, ans=0.1 2023-06-23 15:50:14,595 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.00 vs. limit=15.0 2023-06-23 15:50:15,211 INFO [train.py:1008] (1/4) Epoch 31, batch 50, loss[loss=0.195, simple_loss=0.2705, pruned_loss=0.05978, over 19527.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.289, pruned_loss=0.07239, over 839401.64 frames. ], batch size: 102, lr: 1.06e-02, grad_scale: 32.0 2023-06-23 15:50:27,033 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.53 vs. limit=15.0 2023-06-23 15:51:02,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=106946.66666666667, ans=0.125 2023-06-23 15:51:05,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=107013.33333333333, ans=0.2 2023-06-23 15:51:06,008 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=2.83 vs. limit=15.0 2023-06-23 15:51:06,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=107013.33333333333, ans=0.125 2023-06-23 15:51:15,516 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.85 vs. 
limit=15.0 2023-06-23 15:51:30,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=107080.0, ans=0.0 2023-06-23 15:51:34,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=107080.0, ans=0.0 2023-06-23 15:51:38,180 INFO [train.py:1008] (1/4) Epoch 31, batch 100, loss[loss=0.2196, simple_loss=0.2886, pruned_loss=0.07527, over 20284.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2893, pruned_loss=0.07237, over 1482361.02 frames. ], batch size: 141, lr: 1.05e-02, grad_scale: 16.0 2023-06-23 15:51:45,548 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.36 vs. limit=15.0 2023-06-23 15:52:03,188 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:52:08,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=107213.33333333333, ans=0.1 2023-06-23 15:52:47,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=107413.33333333333, ans=0.0 2023-06-23 15:52:53,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=107413.33333333333, ans=0.0 2023-06-23 15:52:56,571 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.362e+02 1.706e+02 1.875e+02 2.242e+02 3.229e+02, threshold=3.751e+02, percent-clipped=0.0 2023-06-23 15:53:01,953 INFO [train.py:1008] (1/4) Epoch 31, batch 150, loss[loss=0.211, simple_loss=0.2871, pruned_loss=0.06748, over 19139.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2877, pruned_loss=0.07185, over 2011945.69 frames. ], batch size: 94, lr: 1.05e-02, grad_scale: 16.0 2023-06-23 15:53:26,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=107546.66666666667, ans=0.05 2023-06-23 15:53:56,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=107680.0, ans=0.0 2023-06-23 15:54:14,997 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.44 vs. limit=15.0 2023-06-23 15:54:20,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=107746.66666666667, ans=0.0 2023-06-23 15:54:23,770 INFO [train.py:1008] (1/4) Epoch 31, batch 200, loss[loss=0.2032, simple_loss=0.2833, pruned_loss=0.06156, over 19223.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2882, pruned_loss=0.07203, over 2398456.00 frames. ], batch size: 92, lr: 1.05e-02, grad_scale: 16.0 2023-06-23 15:54:25,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=107813.33333333333, ans=0.2 2023-06-23 15:54:32,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=107813.33333333333, ans=0.0 2023-06-23 15:55:05,370 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.67 vs. 
limit=15.0 2023-06-23 15:55:42,529 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.794e+02 2.048e+02 2.340e+02 3.476e+02, threshold=4.095e+02, percent-clipped=0.0 2023-06-23 15:55:47,877 INFO [train.py:1008] (1/4) Epoch 31, batch 250, loss[loss=0.2355, simple_loss=0.3235, pruned_loss=0.07372, over 18311.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.289, pruned_loss=0.07225, over 2698552.37 frames. ], batch size: 72, lr: 1.05e-02, grad_scale: 16.0 2023-06-23 15:55:51,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=108146.66666666667, ans=0.04949747468305833 2023-06-23 15:56:07,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=108213.33333333333, ans=0.125 2023-06-23 15:56:29,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=108280.0, ans=0.125 2023-06-23 15:57:09,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=108480.0, ans=0.0 2023-06-23 15:57:10,799 INFO [train.py:1008] (1/4) Epoch 31, batch 300, loss[loss=0.2108, simple_loss=0.2837, pruned_loss=0.06896, over 19463.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2883, pruned_loss=0.07275, over 2946190.76 frames. ], batch size: 105, lr: 1.05e-02, grad_scale: 16.0 2023-06-23 15:57:27,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=108546.66666666667, ans=0.125 2023-06-23 15:58:05,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=108680.0, ans=0.125 2023-06-23 15:58:10,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=108680.0, ans=0.0 2023-06-23 15:58:12,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=108680.0, ans=0.2 2023-06-23 15:58:27,904 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.826e+02 1.984e+02 2.389e+02 5.943e+02, threshold=3.969e+02, percent-clipped=1.0 2023-06-23 15:58:33,180 INFO [train.py:1008] (1/4) Epoch 31, batch 350, loss[loss=0.216, simple_loss=0.2938, pruned_loss=0.06908, over 19726.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2882, pruned_loss=0.07234, over 3135414.42 frames. ], batch size: 110, lr: 1.05e-02, grad_scale: 16.0 2023-06-23 15:58:44,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=108813.33333333333, ans=0.0 2023-06-23 15:58:53,826 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.34 vs. limit=10.0 2023-06-23 15:58:54,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=108880.0, ans=0.0 2023-06-23 15:59:03,180 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.44 vs. limit=12.0 2023-06-23 15:59:10,228 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.45 vs. 
limit=12.0 2023-06-23 15:59:30,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=109013.33333333333, ans=0.125 2023-06-23 15:59:31,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=109013.33333333333, ans=0.0 2023-06-23 15:59:42,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=109080.0, ans=0.125 2023-06-23 15:59:55,488 INFO [train.py:1008] (1/4) Epoch 31, batch 400, loss[loss=0.2201, simple_loss=0.2808, pruned_loss=0.07971, over 20657.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2883, pruned_loss=0.07214, over 3268106.82 frames. ], batch size: 211, lr: 1.05e-02, grad_scale: 32.0 2023-06-23 15:59:59,780 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.75 vs. limit=22.5 2023-06-23 16:00:40,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=109280.0, ans=0.125 2023-06-23 16:00:56,857 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.37 vs. limit=10.0 2023-06-23 16:00:59,920 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:01:13,833 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.820e+02 2.015e+02 2.494e+02 3.790e+02, threshold=4.031e+02, percent-clipped=0.0 2023-06-23 16:01:19,319 INFO [train.py:1008] (1/4) Epoch 31, batch 450, loss[loss=0.2187, simple_loss=0.2929, pruned_loss=0.07227, over 18501.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2881, pruned_loss=0.07211, over 3386553.32 frames. ], batch size: 77, lr: 1.05e-02, grad_scale: 32.0 2023-06-23 16:01:24,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=109480.0, ans=0.125 2023-06-23 16:01:34,849 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=22.5 2023-06-23 16:01:56,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=109613.33333333333, ans=0.125 2023-06-23 16:02:25,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=109746.66666666667, ans=0.0 2023-06-23 16:02:26,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=109746.66666666667, ans=0.125 2023-06-23 16:02:31,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=109746.66666666667, ans=0.0 2023-06-23 16:02:40,580 INFO [train.py:1008] (1/4) Epoch 31, batch 500, loss[loss=0.2153, simple_loss=0.2939, pruned_loss=0.06834, over 16741.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2877, pruned_loss=0.07187, over 3477474.95 frames. ], batch size: 59, lr: 1.04e-02, grad_scale: 32.0 2023-06-23 16:02:54,785 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.39 vs. 
limit=15.0 2023-06-23 16:02:57,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=109880.0, ans=0.0 2023-06-23 16:03:13,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=109946.66666666667, ans=0.125 2023-06-23 16:03:51,637 INFO [train.py:1008] (1/4) Epoch 32, batch 0, loss[loss=0.2097, simple_loss=0.2841, pruned_loss=0.06768, over 19536.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2841, pruned_loss=0.06768, over 19536.00 frames. ], batch size: 102, lr: 1.03e-02, grad_scale: 32.0 2023-06-23 16:03:51,638 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 16:03:57,278 INFO [train.py:1040] (1/4) Epoch 32, validation: loss=0.1948, simple_loss=0.2928, pruned_loss=0.04842, over 143649.00 frames. 2023-06-23 16:03:57,278 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 16:03:59,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=110026.66666666667, ans=0.125 2023-06-23 16:04:21,937 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.764e+02 1.926e+02 2.188e+02 3.298e+02, threshold=3.851e+02, percent-clipped=0.0 2023-06-23 16:04:42,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=110160.0, ans=0.0 2023-06-23 16:04:42,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=110160.0, ans=0.125 2023-06-23 16:05:04,526 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.11 vs. limit=15.0 2023-06-23 16:05:13,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=110293.33333333333, ans=0.125 2023-06-23 16:05:15,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=110293.33333333333, ans=0.0 2023-06-23 16:05:16,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=110293.33333333333, ans=0.1 2023-06-23 16:05:20,968 INFO [train.py:1008] (1/4) Epoch 32, batch 50, loss[loss=0.2162, simple_loss=0.2863, pruned_loss=0.073, over 18777.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2863, pruned_loss=0.07044, over 848346.29 frames. ], batch size: 83, lr: 1.03e-02, grad_scale: 32.0 2023-06-23 16:05:29,882 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.61 vs. 
limit=15.0 2023-06-23 16:05:47,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=110426.66666666667, ans=0.0 2023-06-23 16:06:09,314 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:06:09,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=110560.0, ans=0.1 2023-06-23 16:06:17,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=110560.0, ans=0.125 2023-06-23 16:06:34,505 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-23 16:06:43,355 INFO [train.py:1008] (1/4) Epoch 32, batch 100, loss[loss=0.2095, simple_loss=0.286, pruned_loss=0.06644, over 19069.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2876, pruned_loss=0.07416, over 1509893.31 frames. ], batch size: 89, lr: 1.02e-02, grad_scale: 32.0 2023-06-23 16:07:07,343 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.733e+02 1.900e+02 2.160e+02 3.250e+02, threshold=3.800e+02, percent-clipped=0.0 2023-06-23 16:07:20,178 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.12 vs. limit=6.0 2023-06-23 16:07:24,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=110826.66666666667, ans=0.125 2023-06-23 16:07:26,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=110826.66666666667, ans=0.1 2023-06-23 16:07:41,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=110893.33333333333, ans=0.125 2023-06-23 16:07:50,328 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:08:00,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=110960.0, ans=0.125 2023-06-23 16:08:05,658 INFO [train.py:1008] (1/4) Epoch 32, batch 150, loss[loss=0.2086, simple_loss=0.2884, pruned_loss=0.06444, over 18776.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2883, pruned_loss=0.07328, over 2016811.11 frames. ], batch size: 83, lr: 1.02e-02, grad_scale: 32.0 2023-06-23 16:08:09,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=111026.66666666667, ans=0.125 2023-06-23 16:08:09,777 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.40 vs. limit=15.0 2023-06-23 16:08:57,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=111226.66666666667, ans=0.2 2023-06-23 16:09:26,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=111293.33333333333, ans=0.0 2023-06-23 16:09:29,364 INFO [train.py:1008] (1/4) Epoch 32, batch 200, loss[loss=0.2019, simple_loss=0.2758, pruned_loss=0.06404, over 19519.00 frames. 
], tot_loss[loss=0.2161, simple_loss=0.287, pruned_loss=0.07262, over 2420769.02 frames. ], batch size: 102, lr: 1.02e-02, grad_scale: 32.0 2023-06-23 16:09:30,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.76 vs. limit=15.0 2023-06-23 16:09:37,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=111360.0, ans=0.125 2023-06-23 16:09:53,758 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.794e+02 2.041e+02 2.395e+02 3.648e+02, threshold=4.083e+02, percent-clipped=0.0 2023-06-23 16:09:58,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=111426.66666666667, ans=0.2 2023-06-23 16:10:00,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=111493.33333333333, ans=0.125 2023-06-23 16:10:04,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=111493.33333333333, ans=0.09899494936611666 2023-06-23 16:10:05,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=111493.33333333333, ans=0.125 2023-06-23 16:10:34,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=111626.66666666667, ans=0.125 2023-06-23 16:10:35,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=111626.66666666667, ans=0.2 2023-06-23 16:10:35,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=111626.66666666667, ans=0.0 2023-06-23 16:10:40,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=111626.66666666667, ans=0.125 2023-06-23 16:10:52,267 INFO [train.py:1008] (1/4) Epoch 32, batch 250, loss[loss=0.1993, simple_loss=0.2902, pruned_loss=0.05422, over 15015.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2868, pruned_loss=0.07218, over 2706264.27 frames. ], batch size: 43, lr: 1.02e-02, grad_scale: 32.0 2023-06-23 16:10:59,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=111693.33333333333, ans=0.125 2023-06-23 16:11:56,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=111893.33333333333, ans=0.0 2023-06-23 16:12:09,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=111960.0, ans=0.0 2023-06-23 16:12:15,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=112026.66666666667, ans=0.0 2023-06-23 16:12:16,953 INFO [train.py:1008] (1/4) Epoch 32, batch 300, loss[loss=0.2027, simple_loss=0.281, pruned_loss=0.06219, over 19075.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2866, pruned_loss=0.0723, over 2937571.97 frames. 
], batch size: 94, lr: 1.02e-02, grad_scale: 32.0 2023-06-23 16:12:17,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=112026.66666666667, ans=0.125 2023-06-23 16:12:26,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=112026.66666666667, ans=0.05 2023-06-23 16:12:35,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=112093.33333333333, ans=0.0 2023-06-23 16:12:42,332 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.747e+02 1.919e+02 2.144e+02 3.308e+02, threshold=3.838e+02, percent-clipped=0.0 2023-06-23 16:13:40,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=112360.0, ans=0.1 2023-06-23 16:13:41,602 INFO [train.py:1008] (1/4) Epoch 32, batch 350, loss[loss=0.2165, simple_loss=0.2855, pruned_loss=0.07377, over 20493.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2861, pruned_loss=0.07244, over 3123272.67 frames. ], batch size: 160, lr: 1.02e-02, grad_scale: 32.0 2023-06-23 16:15:01,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=112626.66666666667, ans=0.02 2023-06-23 16:15:03,587 INFO [train.py:1008] (1/4) Epoch 32, batch 400, loss[loss=0.2035, simple_loss=0.2826, pruned_loss=0.06217, over 19336.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2874, pruned_loss=0.07188, over 3249872.41 frames. ], batch size: 98, lr: 1.02e-02, grad_scale: 32.0 2023-06-23 16:15:05,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=112693.33333333333, ans=0.5 2023-06-23 16:15:16,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=112693.33333333333, ans=0.125 2023-06-23 16:15:21,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=112760.0, ans=0.0 2023-06-23 16:15:28,200 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.955e+02 2.266e+02 2.752e+02 4.281e+02, threshold=4.532e+02, percent-clipped=1.0 2023-06-23 16:16:22,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=112960.0, ans=0.125 2023-06-23 16:16:26,230 INFO [train.py:1008] (1/4) Epoch 32, batch 450, loss[loss=0.2067, simple_loss=0.2848, pruned_loss=0.06428, over 19536.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2864, pruned_loss=0.07125, over 3384771.79 frames. ], batch size: 102, lr: 1.02e-02, grad_scale: 32.0 2023-06-23 16:17:21,897 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:17:47,504 INFO [train.py:1008] (1/4) Epoch 32, batch 500, loss[loss=0.199, simple_loss=0.2729, pruned_loss=0.06253, over 19822.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2862, pruned_loss=0.07111, over 3486003.42 frames. 
], batch size: 115, lr: 1.01e-02, grad_scale: 32.0 2023-06-23 16:17:57,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=113360.0, ans=0.0 2023-06-23 16:18:09,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=113426.66666666667, ans=0.07 2023-06-23 16:18:10,897 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.754e+02 1.901e+02 2.303e+02 3.486e+02, threshold=3.802e+02, percent-clipped=0.0 2023-06-23 16:18:20,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=113493.33333333333, ans=0.125 2023-06-23 16:18:53,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=113573.33333333333, ans=0.1 2023-06-23 16:18:59,134 INFO [train.py:1008] (1/4) Epoch 33, batch 0, loss[loss=0.2153, simple_loss=0.3, pruned_loss=0.0653, over 11384.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.3, pruned_loss=0.0653, over 11384.00 frames. ], batch size: 32, lr: 9.98e-03, grad_scale: 32.0 2023-06-23 16:18:59,135 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 16:19:05,318 INFO [train.py:1040] (1/4) Epoch 33, validation: loss=0.1977, simple_loss=0.2933, pruned_loss=0.05106, over 143649.00 frames. 2023-06-23 16:19:05,318 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 16:19:57,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=113773.33333333333, ans=0.5 2023-06-23 16:20:06,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=113773.33333333333, ans=0.0 2023-06-23 16:20:14,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=113840.0, ans=0.125 2023-06-23 16:20:27,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=113906.66666666667, ans=0.1 2023-06-23 16:20:28,509 INFO [train.py:1008] (1/4) Epoch 33, batch 50, loss[loss=0.1966, simple_loss=0.2769, pruned_loss=0.05812, over 18296.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2844, pruned_loss=0.07105, over 852949.24 frames. ], batch size: 74, lr: 9.96e-03, grad_scale: 32.0 2023-06-23 16:20:31,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=113906.66666666667, ans=0.0 2023-06-23 16:20:42,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=113906.66666666667, ans=0.0 2023-06-23 16:21:11,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=114040.0, ans=0.125 2023-06-23 16:21:23,158 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.777e+02 1.999e+02 2.231e+02 3.108e+02, threshold=3.997e+02, percent-clipped=0.0 2023-06-23 16:21:51,099 INFO [train.py:1008] (1/4) Epoch 33, batch 100, loss[loss=0.2212, simple_loss=0.2868, pruned_loss=0.07778, over 20318.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2846, pruned_loss=0.07028, over 1511218.16 frames. 
], batch size: 149, lr: 9.95e-03, grad_scale: 32.0 2023-06-23 16:22:05,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=114306.66666666667, ans=0.125 2023-06-23 16:22:07,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=114306.66666666667, ans=0.125 2023-06-23 16:22:07,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=114306.66666666667, ans=0.2 2023-06-23 16:22:14,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=114306.66666666667, ans=0.05 2023-06-23 16:22:30,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=114373.33333333333, ans=0.07 2023-06-23 16:22:48,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=114440.0, ans=0.0 2023-06-23 16:23:13,385 INFO [train.py:1008] (1/4) Epoch 33, batch 150, loss[loss=0.2002, simple_loss=0.2804, pruned_loss=0.05994, over 19222.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2859, pruned_loss=0.06922, over 2005640.81 frames. ], batch size: 92, lr: 9.94e-03, grad_scale: 32.0 2023-06-23 16:23:38,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=114640.0, ans=0.1 2023-06-23 16:23:38,750 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.00 vs. limit=22.5 2023-06-23 16:24:07,342 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.778e+02 2.103e+02 2.536e+02 3.974e+02, threshold=4.205e+02, percent-clipped=0.0 2023-06-23 16:24:16,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=114773.33333333333, ans=0.2 2023-06-23 16:24:19,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=114840.0, ans=0.0 2023-06-23 16:24:23,457 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.26 vs. limit=22.5 2023-06-23 16:24:27,434 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.41 vs. limit=15.0 2023-06-23 16:24:35,511 INFO [train.py:1008] (1/4) Epoch 33, batch 200, loss[loss=0.1994, simple_loss=0.2796, pruned_loss=0.05961, over 19204.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2851, pruned_loss=0.06911, over 2416893.74 frames. 
], batch size: 92, lr: 9.93e-03, grad_scale: 32.0 2023-06-23 16:24:45,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=114906.66666666667, ans=0.0 2023-06-23 16:24:50,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=114973.33333333333, ans=0.1 2023-06-23 16:25:34,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=115106.66666666667, ans=0.125 2023-06-23 16:25:51,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=115173.33333333333, ans=0.125 2023-06-23 16:25:59,197 INFO [train.py:1008] (1/4) Epoch 33, batch 250, loss[loss=0.23, simple_loss=0.2892, pruned_loss=0.08542, over 20702.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2859, pruned_loss=0.06941, over 2716823.04 frames. ], batch size: 211, lr: 9.92e-03, grad_scale: 32.0 2023-06-23 16:25:59,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=115240.0, ans=0.125 2023-06-23 16:26:23,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=115306.66666666667, ans=0.125 2023-06-23 16:26:23,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=115306.66666666667, ans=0.2 2023-06-23 16:26:29,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=115306.66666666667, ans=0.125 2023-06-23 16:26:33,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=115373.33333333333, ans=0.125 2023-06-23 16:26:35,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=115373.33333333333, ans=0.035 2023-06-23 16:26:50,009 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=15.0 2023-06-23 16:26:54,125 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.756e+02 1.892e+02 2.103e+02 3.317e+02, threshold=3.784e+02, percent-clipped=0.0 2023-06-23 16:27:21,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=115573.33333333333, ans=0.0 2023-06-23 16:27:23,076 INFO [train.py:1008] (1/4) Epoch 33, batch 300, loss[loss=0.1986, simple_loss=0.2749, pruned_loss=0.06112, over 19057.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2857, pruned_loss=0.06966, over 2959741.18 frames. 
], batch size: 89, lr: 9.90e-03, grad_scale: 32.0 2023-06-23 16:27:45,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=115640.0, ans=0.0 2023-06-23 16:27:52,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=115640.0, ans=0.0 2023-06-23 16:28:05,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=115706.66666666667, ans=0.125 2023-06-23 16:28:21,766 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.02 vs. limit=15.0 2023-06-23 16:28:22,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=115773.33333333333, ans=0.2 2023-06-23 16:28:46,489 INFO [train.py:1008] (1/4) Epoch 33, batch 350, loss[loss=0.2131, simple_loss=0.2778, pruned_loss=0.07418, over 20632.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2853, pruned_loss=0.06929, over 3142994.51 frames. ], batch size: 189, lr: 9.89e-03, grad_scale: 32.0 2023-06-23 16:29:04,253 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.95 vs. limit=15.0 2023-06-23 16:29:42,033 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.796e+02 1.968e+02 2.174e+02 3.064e+02, threshold=3.936e+02, percent-clipped=0.0 2023-06-23 16:30:10,459 INFO [train.py:1008] (1/4) Epoch 33, batch 400, loss[loss=0.2036, simple_loss=0.2805, pruned_loss=0.06336, over 19872.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2856, pruned_loss=0.06986, over 3293885.82 frames. ], batch size: 120, lr: 9.88e-03, grad_scale: 32.0 2023-06-23 16:30:17,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=116240.0, ans=0.0 2023-06-23 16:30:28,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=116306.66666666667, ans=0.125 2023-06-23 16:30:37,164 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.37 vs. limit=6.0 2023-06-23 16:30:40,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=116306.66666666667, ans=0.1 2023-06-23 16:31:32,677 INFO [train.py:1008] (1/4) Epoch 33, batch 450, loss[loss=0.2391, simple_loss=0.2984, pruned_loss=0.0899, over 20308.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2855, pruned_loss=0.0697, over 3410570.46 frames. 
], batch size: 149, lr: 9.87e-03, grad_scale: 32.0 2023-06-23 16:31:33,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=116573.33333333333, ans=0.0 2023-06-23 16:31:46,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=116573.33333333333, ans=0.125 2023-06-23 16:31:50,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=116640.0, ans=0.0 2023-06-23 16:32:27,200 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.736e+02 1.935e+02 2.162e+02 3.591e+02, threshold=3.870e+02, percent-clipped=0.0 2023-06-23 16:32:34,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=116773.33333333333, ans=0.025 2023-06-23 16:32:40,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=116840.0, ans=0.125 2023-06-23 16:32:54,601 INFO [train.py:1008] (1/4) Epoch 33, batch 500, loss[loss=0.2217, simple_loss=0.2849, pruned_loss=0.07923, over 19946.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2859, pruned_loss=0.06973, over 3473095.79 frames. ], batch size: 126, lr: 9.86e-03, grad_scale: 32.0 2023-06-23 16:32:56,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=116906.66666666667, ans=0.2 2023-06-23 16:33:06,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=116906.66666666667, ans=22.5 2023-06-23 16:34:05,557 INFO [train.py:1008] (1/4) Epoch 34, batch 0, loss[loss=0.2136, simple_loss=0.2858, pruned_loss=0.07069, over 20559.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2858, pruned_loss=0.07069, over 20559.00 frames. ], batch size: 173, lr: 9.70e-03, grad_scale: 32.0 2023-06-23 16:34:05,558 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 16:34:13,123 INFO [train.py:1040] (1/4) Epoch 34, validation: loss=0.1987, simple_loss=0.2934, pruned_loss=0.05199, over 143649.00 frames. 2023-06-23 16:34:13,124 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 16:34:18,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=117126.66666666667, ans=0.2 2023-06-23 16:34:45,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=117260.0, ans=0.125 2023-06-23 16:35:17,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=117326.66666666667, ans=0.0 2023-06-23 16:35:37,122 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.307e+02 1.709e+02 1.865e+02 2.103e+02 3.251e+02, threshold=3.730e+02, percent-clipped=0.0 2023-06-23 16:35:37,180 INFO [train.py:1008] (1/4) Epoch 34, batch 50, loss[loss=0.1867, simple_loss=0.2616, pruned_loss=0.05588, over 19557.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2861, pruned_loss=0.07022, over 860371.45 frames. 
], batch size: 102, lr: 9.69e-03, grad_scale: 32.0 2023-06-23 16:36:20,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=117593.33333333333, ans=0.04949747468305833 2023-06-23 16:36:24,468 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=117660.0, ans=0.0 2023-06-23 16:36:39,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=117660.0, ans=0.125 2023-06-23 16:36:58,674 INFO [train.py:1008] (1/4) Epoch 34, batch 100, loss[loss=0.1986, simple_loss=0.2794, pruned_loss=0.05886, over 19479.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2857, pruned_loss=0.0688, over 1516111.59 frames. ], batch size: 105, lr: 9.68e-03, grad_scale: 32.0 2023-06-23 16:37:17,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=117860.0, ans=0.125 2023-06-23 16:37:22,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=117860.0, ans=0.0 2023-06-23 16:37:35,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=117926.66666666667, ans=0.0 2023-06-23 16:37:55,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=117993.33333333333, ans=0.0 2023-06-23 16:38:13,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=118060.0, ans=0.07 2023-06-23 16:38:13,652 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-06-23 16:38:21,413 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.747e+02 1.970e+02 2.348e+02 3.776e+02, threshold=3.940e+02, percent-clipped=1.0 2023-06-23 16:38:21,460 INFO [train.py:1008] (1/4) Epoch 34, batch 150, loss[loss=0.2049, simple_loss=0.2809, pruned_loss=0.06447, over 19822.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2865, pruned_loss=0.07024, over 2011597.31 frames. ], batch size: 115, lr: 9.67e-03, grad_scale: 32.0 2023-06-23 16:39:03,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer_ff3.min_abs, batch_count=118260.0, ans=0.2 2023-06-23 16:39:23,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=118326.66666666667, ans=0.1 2023-06-23 16:39:36,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=118393.33333333333, ans=0.125 2023-06-23 16:39:42,106 INFO [train.py:1008] (1/4) Epoch 34, batch 200, loss[loss=0.2297, simple_loss=0.3033, pruned_loss=0.07805, over 18612.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2865, pruned_loss=0.06981, over 2399426.47 frames. 
], batch size: 80, lr: 9.65e-03, grad_scale: 32.0 2023-06-23 16:39:52,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=118460.0, ans=0.1 2023-06-23 16:39:54,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=118460.0, ans=0.125 2023-06-23 16:40:11,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=118526.66666666667, ans=0.0 2023-06-23 16:40:23,132 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.23 vs. limit=12.0 2023-06-23 16:40:55,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=118726.66666666667, ans=0.125 2023-06-23 16:40:55,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=118726.66666666667, ans=0.125 2023-06-23 16:41:05,107 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.757e+02 2.010e+02 2.304e+02 3.025e+02, threshold=4.020e+02, percent-clipped=0.0 2023-06-23 16:41:05,155 INFO [train.py:1008] (1/4) Epoch 34, batch 250, loss[loss=0.216, simple_loss=0.2877, pruned_loss=0.07217, over 19060.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2855, pruned_loss=0.06955, over 2720807.90 frames. ], batch size: 89, lr: 9.64e-03, grad_scale: 32.0 2023-06-23 16:41:07,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=118793.33333333333, ans=0.125 2023-06-23 16:41:22,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=118860.0, ans=0.125 2023-06-23 16:41:35,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=118860.0, ans=0.125 2023-06-23 16:41:58,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=118993.33333333333, ans=0.1 2023-06-23 16:41:59,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=118993.33333333333, ans=0.2 2023-06-23 16:42:06,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=118993.33333333333, ans=0.125 2023-06-23 16:42:11,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=119060.0, ans=0.2 2023-06-23 16:42:21,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=119060.0, ans=0.125 2023-06-23 16:42:27,335 INFO [train.py:1008] (1/4) Epoch 34, batch 300, loss[loss=0.2266, simple_loss=0.3042, pruned_loss=0.0745, over 16959.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2855, pruned_loss=0.06919, over 2944536.95 frames. 
], batch size: 60, lr: 9.63e-03, grad_scale: 32.0 2023-06-23 16:43:27,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=119326.66666666667, ans=0.0 2023-06-23 16:43:49,850 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.773e+02 2.027e+02 2.280e+02 3.489e+02, threshold=4.054e+02, percent-clipped=0.0 2023-06-23 16:43:49,897 INFO [train.py:1008] (1/4) Epoch 34, batch 350, loss[loss=0.202, simple_loss=0.2786, pruned_loss=0.06268, over 19471.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2847, pruned_loss=0.06902, over 3136487.95 frames. ], batch size: 105, lr: 9.62e-03, grad_scale: 32.0 2023-06-23 16:43:56,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=119460.0, ans=0.2 2023-06-23 16:44:13,082 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-23 16:44:57,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=119726.66666666667, ans=0.125 2023-06-23 16:45:11,386 INFO [train.py:1008] (1/4) Epoch 34, batch 400, loss[loss=0.1901, simple_loss=0.273, pruned_loss=0.05359, over 19538.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2847, pruned_loss=0.06883, over 3271502.33 frames. ], batch size: 102, lr: 9.61e-03, grad_scale: 32.0 2023-06-23 16:45:19,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=119793.33333333333, ans=0.2 2023-06-23 16:45:24,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119793.33333333333, ans=0.1 2023-06-23 16:45:25,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119860.0, ans=0.1 2023-06-23 16:45:33,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=119860.0, ans=0.2 2023-06-23 16:45:48,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=119926.66666666667, ans=0.0 2023-06-23 16:45:48,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=119926.66666666667, ans=0.125 2023-06-23 16:45:52,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=119926.66666666667, ans=0.125 2023-06-23 16:46:12,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=119993.33333333333, ans=0.0 2023-06-23 16:46:18,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=120060.0, ans=0.0 2023-06-23 16:46:33,400 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.782e+02 1.974e+02 2.224e+02 3.060e+02, threshold=3.949e+02, percent-clipped=0.0 2023-06-23 16:46:33,446 INFO [train.py:1008] (1/4) Epoch 34, batch 450, loss[loss=0.2284, simple_loss=0.3035, pruned_loss=0.07664, over 15128.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2842, pruned_loss=0.06922, over 3384829.02 frames. 
], batch size: 43, lr: 9.60e-03, grad_scale: 32.0 2023-06-23 16:46:37,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=120126.66666666667, ans=0.1 2023-06-23 16:46:39,345 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-23 16:46:40,016 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.44 vs. limit=15.0 2023-06-23 16:46:45,782 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.56 vs. limit=12.0 2023-06-23 16:47:15,634 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.27 vs. limit=15.0 2023-06-23 16:47:20,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-23 16:47:24,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=120326.66666666667, ans=0.125 2023-06-23 16:47:50,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=120393.33333333333, ans=0.125 2023-06-23 16:47:53,646 INFO [train.py:1008] (1/4) Epoch 34, batch 500, loss[loss=0.2038, simple_loss=0.2801, pruned_loss=0.06368, over 19540.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2843, pruned_loss=0.06938, over 3478620.60 frames. ], batch size: 102, lr: 9.59e-03, grad_scale: 64.0 2023-06-23 16:48:11,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=120526.66666666667, ans=0.0 2023-06-23 16:48:17,885 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.81 vs. limit=15.0 2023-06-23 16:48:29,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=120593.33333333333, ans=0.125 2023-06-23 16:48:33,385 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.37 vs. limit=15.0 2023-06-23 16:49:04,713 INFO [train.py:1008] (1/4) Epoch 35, batch 0, loss[loss=0.2073, simple_loss=0.2841, pruned_loss=0.06526, over 19865.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2841, pruned_loss=0.06526, over 19865.00 frames. ], batch size: 120, lr: 9.44e-03, grad_scale: 32.0 2023-06-23 16:49:04,713 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 16:49:10,322 INFO [train.py:1040] (1/4) Epoch 35, validation: loss=0.1946, simple_loss=0.2902, pruned_loss=0.04948, over 143649.00 frames. 
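A brief reading of the recurring [optim.py:471] entries in this log: the five numbers after "grad-norm quartiles" appear to be the minimum, 25th, 50th, 75th percentile and maximum of recent total gradient norms; "threshold" tracks Clipping_scale times the median (for example 2.0 x 1.966e+02 ~ 3.932e+02 in the entry just below, and the same relationship holds, within rounding, for the other entries in this section); and "percent-clipped" is presumably the share of recent batches whose norm exceeded that threshold. The short Python sketch below only reproduces this logged relationship and is not the training code's actual implementation; the function name summarize_grad_norms and the norm_history buffer are hypothetical.

import torch

def summarize_grad_norms(norm_history: torch.Tensor, clipping_scale: float = 2.0):
    """Illustrative sketch of the logged grad-norm statistics.

    norm_history is a hypothetical 1-D float tensor of per-batch total gradient
    norms (the real training loop keeps its own bookkeeping).  The threshold is
    taken as clipping_scale * median, which is the relationship the logged
    'threshold' values satisfy relative to the logged quartiles.
    """
    # min / 25% / 50% / 75% / max of the recent gradient norms
    q = torch.quantile(norm_history, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2]  # clipping_scale times the median
    # fraction of recent batches whose norm exceeded the threshold, as a percentage
    percent_clipped = 100.0 * (norm_history > threshold).float().mean()
    return q, threshold.item(), percent_clipped.item()

# Illustrative values shaped like one of the logged entries:
norms = torch.tensor([154.2, 176.7, 196.6, 218.4, 285.9])
quartiles, threshold, pct = summarize_grad_norms(norms)
# threshold == 2.0 * 196.6 == 393.2; no norm exceeds it, so pct == 0.0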
2023-06-23 16:49:10,322 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 16:49:39,553 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.767e+02 1.966e+02 2.184e+02 2.859e+02, threshold=3.932e+02, percent-clipped=0.0 2023-06-23 16:49:59,495 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:50:07,338 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-23 16:50:32,616 INFO [train.py:1008] (1/4) Epoch 35, batch 50, loss[loss=0.217, simple_loss=0.2852, pruned_loss=0.07442, over 19958.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2852, pruned_loss=0.06755, over 861414.61 frames. ], batch size: 126, lr: 9.43e-03, grad_scale: 32.0 2023-06-23 16:50:53,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=121080.0, ans=0.125 2023-06-23 16:51:00,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=121080.0, ans=0.125 2023-06-23 16:51:21,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=121213.33333333333, ans=0.2 2023-06-23 16:51:54,134 INFO [train.py:1008] (1/4) Epoch 35, batch 100, loss[loss=0.2134, simple_loss=0.2967, pruned_loss=0.06502, over 18476.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2855, pruned_loss=0.06767, over 1503436.07 frames. ], batch size: 77, lr: 9.42e-03, grad_scale: 16.0 2023-06-23 16:51:58,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=121346.66666666667, ans=0.125 2023-06-23 16:52:10,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=121413.33333333333, ans=0.125 2023-06-23 16:52:25,022 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.884e+02 2.196e+02 2.729e+02 3.922e+02, threshold=4.391e+02, percent-clipped=0.0 2023-06-23 16:52:34,891 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.72 vs. limit=22.5 2023-06-23 16:52:39,975 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.02 vs. limit=15.0 2023-06-23 16:52:42,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=121546.66666666667, ans=0.125 2023-06-23 16:52:48,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=121546.66666666667, ans=0.1 2023-06-23 16:53:15,635 INFO [train.py:1008] (1/4) Epoch 35, batch 150, loss[loss=0.2329, simple_loss=0.3088, pruned_loss=0.0785, over 18471.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2852, pruned_loss=0.06821, over 2000095.12 frames. ], batch size: 77, lr: 9.41e-03, grad_scale: 16.0 2023-06-23 16:53:28,127 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.35 vs. 
limit=6.0 2023-06-23 16:53:42,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=121746.66666666667, ans=0.1 2023-06-23 16:53:42,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=121746.66666666667, ans=0.125 2023-06-23 16:53:43,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=121746.66666666667, ans=0.2 2023-06-23 16:53:46,294 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.30 vs. limit=6.0 2023-06-23 16:53:51,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=121813.33333333333, ans=0.035 2023-06-23 16:54:00,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=121813.33333333333, ans=0.125 2023-06-23 16:54:12,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=121880.0, ans=0.1 2023-06-23 16:54:37,537 INFO [train.py:1008] (1/4) Epoch 35, batch 200, loss[loss=0.2084, simple_loss=0.2837, pruned_loss=0.06657, over 19089.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2858, pruned_loss=0.0684, over 2388466.73 frames. ], batch size: 89, lr: 9.40e-03, grad_scale: 16.0 2023-06-23 16:54:43,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122013.33333333333, ans=0.1 2023-06-23 16:54:43,974 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.27 vs. limit=15.0 2023-06-23 16:55:01,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=122080.0, ans=15.0 2023-06-23 16:55:05,862 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-23 16:55:08,351 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.752e+02 1.972e+02 2.179e+02 3.541e+02, threshold=3.944e+02, percent-clipped=0.0 2023-06-23 16:55:08,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=122146.66666666667, ans=0.125 2023-06-23 16:55:15,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=122146.66666666667, ans=0.125 2023-06-23 16:55:59,555 INFO [train.py:1008] (1/4) Epoch 35, batch 250, loss[loss=0.228, simple_loss=0.2631, pruned_loss=0.09648, over 17126.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.285, pruned_loss=0.06832, over 2703825.90 frames. ], batch size: 391, lr: 9.38e-03, grad_scale: 16.0 2023-06-23 16:56:00,251 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.37 vs. 
limit=15.0 2023-06-23 16:56:24,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=122413.33333333333, ans=0.125 2023-06-23 16:56:40,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=122480.0, ans=0.125 2023-06-23 16:56:42,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122480.0, ans=0.1 2023-06-23 16:56:50,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=122546.66666666667, ans=10.0 2023-06-23 16:56:52,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=122546.66666666667, ans=0.125 2023-06-23 16:57:10,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=122613.33333333333, ans=0.125 2023-06-23 16:57:21,552 INFO [train.py:1008] (1/4) Epoch 35, batch 300, loss[loss=0.1915, simple_loss=0.2727, pruned_loss=0.05519, over 19802.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2849, pruned_loss=0.06846, over 2928838.04 frames. ], batch size: 115, lr: 9.37e-03, grad_scale: 16.0 2023-06-23 16:57:52,638 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.728e+02 1.902e+02 2.188e+02 3.019e+02, threshold=3.803e+02, percent-clipped=0.0 2023-06-23 16:58:16,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=122880.0, ans=0.1 2023-06-23 16:58:28,220 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=12.0 2023-06-23 16:58:35,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=122946.66666666667, ans=0.125 2023-06-23 16:58:43,122 INFO [train.py:1008] (1/4) Epoch 35, batch 350, loss[loss=0.2007, simple_loss=0.2805, pruned_loss=0.06039, over 18272.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2839, pruned_loss=0.06858, over 3125238.80 frames. ], batch size: 74, lr: 9.36e-03, grad_scale: 16.0 2023-06-23 16:58:47,546 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.01 vs. limit=10.0 2023-06-23 16:58:52,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=123013.33333333333, ans=0.04949747468305833 2023-06-23 16:59:00,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=123080.0, ans=0.125 2023-06-23 16:59:42,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=123213.33333333333, ans=0.125 2023-06-23 16:59:43,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=123213.33333333333, ans=0.0 2023-06-23 16:59:53,113 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:00:03,639 INFO [train.py:1008] (1/4) Epoch 35, batch 400, loss[loss=0.194, simple_loss=0.2753, pruned_loss=0.05638, over 19102.00 frames. 
], tot_loss[loss=0.2104, simple_loss=0.2833, pruned_loss=0.06878, over 3284640.07 frames. ], batch size: 94, lr: 9.35e-03, grad_scale: 32.0 2023-06-23 17:00:14,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=123346.66666666667, ans=0.0 2023-06-23 17:00:19,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=123413.33333333333, ans=0.125 2023-06-23 17:00:35,369 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.770e+02 1.936e+02 2.100e+02 3.299e+02, threshold=3.871e+02, percent-clipped=0.0 2023-06-23 17:00:38,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=123480.0, ans=0.0 2023-06-23 17:00:38,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=123480.0, ans=0.125 2023-06-23 17:00:43,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=123480.0, ans=0.025 2023-06-23 17:00:57,835 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0 2023-06-23 17:01:15,201 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.53 vs. limit=12.0 2023-06-23 17:01:17,100 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=2.90 vs. limit=15.0 2023-06-23 17:01:25,581 INFO [train.py:1008] (1/4) Epoch 35, batch 450, loss[loss=0.1928, simple_loss=0.2797, pruned_loss=0.05293, over 18289.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2837, pruned_loss=0.06853, over 3375054.68 frames. ], batch size: 74, lr: 9.34e-03, grad_scale: 32.0 2023-06-23 17:01:36,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=123680.0, ans=0.04949747468305833 2023-06-23 17:01:50,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=123746.66666666667, ans=0.1 2023-06-23 17:02:18,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=123880.0, ans=0.5 2023-06-23 17:02:44,857 INFO [train.py:1008] (1/4) Epoch 35, batch 500, loss[loss=0.2305, simple_loss=0.3167, pruned_loss=0.07217, over 15411.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2839, pruned_loss=0.06874, over 3464079.18 frames. 
], batch size: 44, lr: 9.33e-03, grad_scale: 32.0 2023-06-23 17:02:50,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=124013.33333333333, ans=0.09899494936611666 2023-06-23 17:02:50,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=124013.33333333333, ans=0.125 2023-06-23 17:03:02,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=124080.0, ans=0.1 2023-06-23 17:03:03,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=124080.0, ans=0.0 2023-06-23 17:03:15,049 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.683e+02 1.866e+02 2.092e+02 3.633e+02, threshold=3.733e+02, percent-clipped=0.0 2023-06-23 17:03:16,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=124146.66666666667, ans=0.125 2023-06-23 17:03:20,440 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.64 vs. limit=15.0 2023-06-23 17:03:26,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=124146.66666666667, ans=0.1 2023-06-23 17:03:55,161 INFO [train.py:1008] (1/4) Epoch 36, batch 0, loss[loss=0.2412, simple_loss=0.3043, pruned_loss=0.08905, over 19982.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3043, pruned_loss=0.08905, over 19982.00 frames. ], batch size: 126, lr: 9.19e-03, grad_scale: 32.0 2023-06-23 17:03:55,161 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 17:04:00,893 INFO [train.py:1040] (1/4) Epoch 36, validation: loss=0.1946, simple_loss=0.2906, pruned_loss=0.04927, over 143649.00 frames. 2023-06-23 17:04:00,894 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 17:04:18,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=124293.33333333333, ans=0.09899494936611666 2023-06-23 17:04:18,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=124293.33333333333, ans=0.2 2023-06-23 17:05:00,615 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.78 vs. limit=15.0 2023-06-23 17:05:02,492 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.68 vs. limit=12.0 2023-06-23 17:05:23,509 INFO [train.py:1008] (1/4) Epoch 36, batch 50, loss[loss=0.2018, simple_loss=0.277, pruned_loss=0.06331, over 19856.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2835, pruned_loss=0.06695, over 851108.36 frames. 
], batch size: 120, lr: 9.18e-03, grad_scale: 32.0 2023-06-23 17:05:26,899 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:05:33,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=124560.0, ans=0.1 2023-06-23 17:05:44,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=124626.66666666667, ans=0.125 2023-06-23 17:05:57,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=124693.33333333333, ans=0.05 2023-06-23 17:06:01,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=124693.33333333333, ans=0.09899494936611666 2023-06-23 17:06:23,165 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.316e+02 1.772e+02 1.989e+02 2.325e+02 4.599e+02, threshold=3.978e+02, percent-clipped=6.0 2023-06-23 17:06:32,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=124826.66666666667, ans=0.125 2023-06-23 17:06:35,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=124826.66666666667, ans=0.0 2023-06-23 17:06:45,510 INFO [train.py:1008] (1/4) Epoch 36, batch 100, loss[loss=0.2114, simple_loss=0.278, pruned_loss=0.07238, over 20329.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2816, pruned_loss=0.06649, over 1503020.51 frames. ], batch size: 149, lr: 9.17e-03, grad_scale: 32.0 2023-06-23 17:07:00,941 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.78 vs. limit=15.0 2023-06-23 17:07:32,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=125026.66666666667, ans=0.0 2023-06-23 17:07:46,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=125093.33333333333, ans=0.0 2023-06-23 17:07:48,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=125093.33333333333, ans=0.0 2023-06-23 17:08:08,853 INFO [train.py:1008] (1/4) Epoch 36, batch 150, loss[loss=0.2212, simple_loss=0.31, pruned_loss=0.06618, over 15658.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2827, pruned_loss=0.06736, over 2000829.04 frames. ], batch size: 44, lr: 9.16e-03, grad_scale: 32.0 2023-06-23 17:08:09,813 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.03 vs. 
limit=22.5 2023-06-23 17:08:22,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=125226.66666666667, ans=0.2 2023-06-23 17:08:28,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=125293.33333333333, ans=0.1 2023-06-23 17:08:48,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=125360.0, ans=0.125 2023-06-23 17:09:08,242 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.28 vs. limit=22.5 2023-06-23 17:09:09,870 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-06-23 17:09:10,269 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.880e+02 2.081e+02 2.354e+02 3.612e+02, threshold=4.162e+02, percent-clipped=0.0 2023-06-23 17:09:32,152 INFO [train.py:1008] (1/4) Epoch 36, batch 200, loss[loss=0.2148, simple_loss=0.2954, pruned_loss=0.06706, over 16448.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2827, pruned_loss=0.06769, over 2393959.36 frames. ], batch size: 52, lr: 9.15e-03, grad_scale: 32.0 2023-06-23 17:09:42,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=125560.0, ans=0.0 2023-06-23 17:10:36,433 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. limit=6.0 2023-06-23 17:10:55,766 INFO [train.py:1008] (1/4) Epoch 36, batch 250, loss[loss=0.237, simple_loss=0.3048, pruned_loss=0.08465, over 19540.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2822, pruned_loss=0.06807, over 2710772.90 frames. ], batch size: 102, lr: 9.14e-03, grad_scale: 32.0 2023-06-23 17:11:57,046 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.714e+02 1.828e+02 2.101e+02 3.725e+02, threshold=3.656e+02, percent-clipped=0.0 2023-06-23 17:12:17,695 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.87 vs. limit=15.0 2023-06-23 17:12:20,006 INFO [train.py:1008] (1/4) Epoch 36, batch 300, loss[loss=0.2036, simple_loss=0.2855, pruned_loss=0.06082, over 18762.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2821, pruned_loss=0.06754, over 2944802.55 frames. ], batch size: 83, lr: 9.13e-03, grad_scale: 32.0 2023-06-23 17:13:08,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=126426.66666666667, ans=0.0 2023-06-23 17:13:18,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=126426.66666666667, ans=0.125 2023-06-23 17:13:43,221 INFO [train.py:1008] (1/4) Epoch 36, batch 350, loss[loss=0.1968, simple_loss=0.2619, pruned_loss=0.06586, over 20593.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2822, pruned_loss=0.06751, over 3125969.52 frames. 
], batch size: 189, lr: 9.12e-03, grad_scale: 32.0 2023-06-23 17:14:28,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=126693.33333333333, ans=0.125 2023-06-23 17:14:30,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=126693.33333333333, ans=0.0 2023-06-23 17:14:44,142 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.718e+02 1.947e+02 2.162e+02 2.915e+02, threshold=3.893e+02, percent-clipped=1.0 2023-06-23 17:14:49,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=126826.66666666667, ans=0.1 2023-06-23 17:14:59,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=126826.66666666667, ans=0.0 2023-06-23 17:15:06,643 INFO [train.py:1008] (1/4) Epoch 36, batch 400, loss[loss=0.2063, simple_loss=0.2827, pruned_loss=0.06498, over 19528.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2818, pruned_loss=0.06697, over 3265059.98 frames. ], batch size: 102, lr: 9.11e-03, grad_scale: 32.0 2023-06-23 17:15:59,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=127093.33333333333, ans=0.125 2023-06-23 17:16:28,389 INFO [train.py:1008] (1/4) Epoch 36, batch 450, loss[loss=0.2062, simple_loss=0.2842, pruned_loss=0.06416, over 19805.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2821, pruned_loss=0.06711, over 3385722.31 frames. ], batch size: 115, lr: 9.10e-03, grad_scale: 32.0 2023-06-23 17:17:06,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=127360.0, ans=0.125 2023-06-23 17:17:21,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=127426.66666666667, ans=0.5 2023-06-23 17:17:24,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=127426.66666666667, ans=0.2 2023-06-23 17:17:28,954 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 1.767e+02 1.977e+02 2.245e+02 3.178e+02, threshold=3.955e+02, percent-clipped=0.0 2023-06-23 17:17:39,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=127493.33333333333, ans=0.125 2023-06-23 17:17:42,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=127493.33333333333, ans=0.125 2023-06-23 17:17:49,642 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.87 vs. limit=15.0 2023-06-23 17:17:50,302 INFO [train.py:1008] (1/4) Epoch 36, batch 500, loss[loss=0.2551, simple_loss=0.3263, pruned_loss=0.09196, over 16660.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2824, pruned_loss=0.06754, over 3463058.84 frames. ], batch size: 59, lr: 9.09e-03, grad_scale: 32.0 2023-06-23 17:17:56,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=127560.0, ans=0.2 2023-06-23 17:19:03,430 INFO [train.py:1008] (1/4) Epoch 37, batch 0, loss[loss=0.2204, simple_loss=0.2746, pruned_loss=0.08309, over 19995.00 frames. 
], tot_loss[loss=0.2204, simple_loss=0.2746, pruned_loss=0.08309, over 19995.00 frames. ], batch size: 294, lr: 8.96e-03, grad_scale: 32.0 2023-06-23 17:19:03,430 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 17:19:09,069 INFO [train.py:1040] (1/4) Epoch 37, validation: loss=0.1945, simple_loss=0.2907, pruned_loss=0.04917, over 143649.00 frames. 2023-06-23 17:19:09,070 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 17:19:12,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=127780.0, ans=0.2 2023-06-23 17:19:15,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=127780.0, ans=0.0 2023-06-23 17:19:20,608 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:19:33,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=127846.66666666667, ans=0.0 2023-06-23 17:19:37,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=127846.66666666667, ans=0.0 2023-06-23 17:19:40,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=127913.33333333333, ans=0.125 2023-06-23 17:19:58,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=127980.0, ans=0.125 2023-06-23 17:20:13,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=128046.66666666667, ans=0.2 2023-06-23 17:20:31,314 INFO [train.py:1008] (1/4) Epoch 37, batch 50, loss[loss=0.2089, simple_loss=0.2939, pruned_loss=0.062, over 17060.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2806, pruned_loss=0.06862, over 857185.41 frames. ], batch size: 60, lr: 8.95e-03, grad_scale: 16.0 2023-06-23 17:20:39,824 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.710e+02 1.841e+02 2.087e+02 3.007e+02, threshold=3.682e+02, percent-clipped=0.0 2023-06-23 17:21:13,556 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.74 vs. limit=15.0 2023-06-23 17:21:17,795 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.53 vs. limit=15.0 2023-06-23 17:21:38,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=128380.0, ans=0.0 2023-06-23 17:21:53,144 INFO [train.py:1008] (1/4) Epoch 37, batch 100, loss[loss=0.192, simple_loss=0.2687, pruned_loss=0.05761, over 18942.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2811, pruned_loss=0.06673, over 1504707.45 frames. ], batch size: 86, lr: 8.94e-03, grad_scale: 16.0 2023-06-23 17:22:18,366 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:22:50,953 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.52 vs. 
limit=15.0 2023-06-23 17:22:58,260 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:23:16,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.29 vs. limit=15.0 2023-06-23 17:23:17,075 INFO [train.py:1008] (1/4) Epoch 37, batch 150, loss[loss=0.2008, simple_loss=0.2919, pruned_loss=0.05482, over 15454.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2828, pruned_loss=0.06686, over 2012567.54 frames. ], batch size: 44, lr: 8.93e-03, grad_scale: 16.0 2023-06-23 17:23:19,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=128780.0, ans=0.0 2023-06-23 17:23:23,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=128780.0, ans=0.0 2023-06-23 17:23:25,119 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.708e+02 1.911e+02 2.291e+02 3.817e+02, threshold=3.822e+02, percent-clipped=2.0 2023-06-23 17:24:21,427 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=11.90 vs. limit=15.0 2023-06-23 17:24:26,591 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-23 17:24:31,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=129046.66666666667, ans=0.0 2023-06-23 17:24:31,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=129046.66666666667, ans=0.5 2023-06-23 17:24:40,626 INFO [train.py:1008] (1/4) Epoch 37, batch 200, loss[loss=0.1931, simple_loss=0.2734, pruned_loss=0.05642, over 19870.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2824, pruned_loss=0.06675, over 2407040.64 frames. ], batch size: 120, lr: 8.92e-03, grad_scale: 16.0 2023-06-23 17:24:49,589 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=22.5 2023-06-23 17:24:50,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=129113.33333333333, ans=0.0 2023-06-23 17:24:55,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=129180.0, ans=0.1 2023-06-23 17:25:08,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=129180.0, ans=0.125 2023-06-23 17:26:04,569 INFO [train.py:1008] (1/4) Epoch 37, batch 250, loss[loss=0.1989, simple_loss=0.277, pruned_loss=0.06045, over 18476.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2818, pruned_loss=0.0666, over 2718211.17 frames. 
], batch size: 77, lr: 8.91e-03, grad_scale: 16.0 2023-06-23 17:26:12,472 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.920e+02 2.271e+02 2.637e+02 3.680e+02, threshold=4.543e+02, percent-clipped=0.0 2023-06-23 17:26:19,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=129513.33333333333, ans=0.015 2023-06-23 17:26:29,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=129513.33333333333, ans=0.125 2023-06-23 17:26:30,740 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.27 vs. limit=15.0 2023-06-23 17:26:42,544 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.01 vs. limit=15.0 2023-06-23 17:26:46,969 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=9.08 vs. limit=15.0 2023-06-23 17:27:11,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=129713.33333333333, ans=0.0 2023-06-23 17:27:28,478 INFO [train.py:1008] (1/4) Epoch 37, batch 300, loss[loss=0.192, simple_loss=0.2722, pruned_loss=0.05588, over 19829.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2803, pruned_loss=0.06688, over 2972295.97 frames. ], batch size: 115, lr: 8.90e-03, grad_scale: 16.0 2023-06-23 17:27:38,371 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:28:13,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=129913.33333333333, ans=0.125 2023-06-23 17:28:53,094 INFO [train.py:1008] (1/4) Epoch 37, batch 350, loss[loss=0.2219, simple_loss=0.2833, pruned_loss=0.08025, over 20228.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2802, pruned_loss=0.06691, over 3163090.93 frames. ], batch size: 239, lr: 8.89e-03, grad_scale: 16.0 2023-06-23 17:29:01,118 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.765e+02 1.983e+02 2.245e+02 4.016e+02, threshold=3.966e+02, percent-clipped=0.0 2023-06-23 17:29:07,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=130113.33333333333, ans=0.0 2023-06-23 17:29:13,278 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.20 vs. 
limit=15.0 2023-06-23 17:29:20,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=130180.0, ans=0.125 2023-06-23 17:29:31,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=130246.66666666667, ans=0.125 2023-06-23 17:29:51,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=130313.33333333333, ans=0.125 2023-06-23 17:30:01,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=130380.0, ans=0.125 2023-06-23 17:30:13,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=130380.0, ans=15.0 2023-06-23 17:30:17,137 INFO [train.py:1008] (1/4) Epoch 37, batch 400, loss[loss=0.1812, simple_loss=0.261, pruned_loss=0.05073, over 18617.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2791, pruned_loss=0.06634, over 3310019.54 frames. ], batch size: 80, lr: 8.88e-03, grad_scale: 32.0 2023-06-23 17:30:37,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=130513.33333333333, ans=0.125 2023-06-23 17:30:43,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=130513.33333333333, ans=0.0 2023-06-23 17:31:19,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=130646.66666666667, ans=0.125 2023-06-23 17:31:40,935 INFO [train.py:1008] (1/4) Epoch 37, batch 450, loss[loss=0.2067, simple_loss=0.2814, pruned_loss=0.066, over 19339.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2798, pruned_loss=0.06641, over 3406308.77 frames. ], batch size: 98, lr: 8.87e-03, grad_scale: 32.0 2023-06-23 17:31:45,182 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.06 vs. limit=10.0 2023-06-23 17:31:49,230 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.360e+02 1.712e+02 1.928e+02 2.270e+02 2.975e+02, threshold=3.856e+02, percent-clipped=0.0 2023-06-23 17:31:59,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=130846.66666666667, ans=0.0 2023-06-23 17:32:15,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=130913.33333333333, ans=0.0 2023-06-23 17:32:51,689 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.86 vs. limit=12.0 2023-06-23 17:32:53,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=131046.66666666667, ans=0.125 2023-06-23 17:33:01,924 INFO [train.py:1008] (1/4) Epoch 37, batch 500, loss[loss=0.2057, simple_loss=0.273, pruned_loss=0.06918, over 20803.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2796, pruned_loss=0.0661, over 3495088.56 frames. 
], batch size: 211, lr: 8.86e-03, grad_scale: 32.0 2023-06-23 17:33:16,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=131180.0, ans=0.0 2023-06-23 17:33:19,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=131180.0, ans=0.125 2023-06-23 17:33:38,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=131246.66666666666, ans=0.0 2023-06-23 17:34:13,569 INFO [train.py:1008] (1/4) Epoch 38, batch 0, loss[loss=0.2074, simple_loss=0.2844, pruned_loss=0.0652, over 19097.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2844, pruned_loss=0.0652, over 19097.00 frames. ], batch size: 89, lr: 8.73e-03, grad_scale: 32.0 2023-06-23 17:34:13,570 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 17:34:19,209 INFO [train.py:1040] (1/4) Epoch 38, validation: loss=0.1953, simple_loss=0.2909, pruned_loss=0.04986, over 143649.00 frames. 2023-06-23 17:34:19,210 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 17:34:26,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=131326.66666666666, ans=0.0 2023-06-23 17:34:32,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=131326.66666666666, ans=0.125 2023-06-23 17:34:34,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=131393.33333333334, ans=0.0 2023-06-23 17:34:43,622 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.53 vs. limit=6.0 2023-06-23 17:34:56,689 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.327e+02 1.790e+02 1.941e+02 2.313e+02 3.641e+02, threshold=3.881e+02, percent-clipped=0.0 2023-06-23 17:35:01,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=131460.0, ans=0.125 2023-06-23 17:35:02,990 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=4.14 vs. limit=15.0 2023-06-23 17:35:42,566 INFO [train.py:1008] (1/4) Epoch 38, batch 50, loss[loss=0.2133, simple_loss=0.2895, pruned_loss=0.06854, over 18778.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2803, pruned_loss=0.0643, over 858540.63 frames. 
], batch size: 83, lr: 8.72e-03, grad_scale: 32.0 2023-06-23 17:35:56,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=131660.0, ans=0.125 2023-06-23 17:36:03,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=131726.66666666666, ans=0.0 2023-06-23 17:36:06,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=131726.66666666666, ans=0.0 2023-06-23 17:36:14,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=131793.33333333334, ans=0.07 2023-06-23 17:36:33,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=131860.0, ans=0.0 2023-06-23 17:36:55,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=131926.66666666666, ans=0.125 2023-06-23 17:37:04,498 INFO [train.py:1008] (1/4) Epoch 38, batch 100, loss[loss=0.2023, simple_loss=0.2767, pruned_loss=0.06391, over 20272.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2796, pruned_loss=0.06456, over 1527325.57 frames. ], batch size: 141, lr: 8.71e-03, grad_scale: 32.0 2023-06-23 17:37:12,045 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:37:19,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=132060.0, ans=0.0 2023-06-23 17:37:43,089 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.740e+02 1.898e+02 2.194e+02 3.363e+02, threshold=3.797e+02, percent-clipped=0.0 2023-06-23 17:37:59,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=132193.33333333334, ans=0.0 2023-06-23 17:38:00,108 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.59 vs. limit=12.0 2023-06-23 17:38:28,457 INFO [train.py:1008] (1/4) Epoch 38, batch 150, loss[loss=0.2041, simple_loss=0.2759, pruned_loss=0.06614, over 20079.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2796, pruned_loss=0.0643, over 2025398.38 frames. 
], batch size: 133, lr: 8.70e-03, grad_scale: 32.0 2023-06-23 17:38:30,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=132326.66666666666, ans=0.05 2023-06-23 17:38:35,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=132326.66666666666, ans=0.125 2023-06-23 17:38:37,248 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:38:37,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=132326.66666666666, ans=0.125 2023-06-23 17:39:10,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=132460.0, ans=0.025 2023-06-23 17:39:19,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=132526.66666666666, ans=0.125 2023-06-23 17:39:51,925 INFO [train.py:1008] (1/4) Epoch 38, batch 200, loss[loss=0.2067, simple_loss=0.2855, pruned_loss=0.06398, over 19112.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.28, pruned_loss=0.06493, over 2435227.29 frames. ], batch size: 94, lr: 8.69e-03, grad_scale: 32.0 2023-06-23 17:40:04,600 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=22.5 2023-06-23 17:40:14,705 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.82 vs. limit=22.5 2023-06-23 17:40:24,820 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.87 vs. limit=10.0 2023-06-23 17:40:30,916 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.758e+02 1.987e+02 2.155e+02 3.076e+02, threshold=3.973e+02, percent-clipped=0.0 2023-06-23 17:40:38,497 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.62 vs. limit=6.0 2023-06-23 17:40:41,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=132860.0, ans=0.125 2023-06-23 17:40:41,534 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.92 vs. limit=22.5 2023-06-23 17:40:45,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=132860.0, ans=0.125 2023-06-23 17:40:48,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=132860.0, ans=0.0 2023-06-23 17:41:04,791 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-23 17:41:16,073 INFO [train.py:1008] (1/4) Epoch 38, batch 250, loss[loss=0.1787, simple_loss=0.2588, pruned_loss=0.0493, over 19846.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2804, pruned_loss=0.06474, over 2731898.26 frames. 
], batch size: 115, lr: 8.68e-03, grad_scale: 32.0 2023-06-23 17:41:56,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=133126.66666666666, ans=0.125 2023-06-23 17:41:56,838 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-23 17:42:05,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=133126.66666666666, ans=0.2 2023-06-23 17:42:08,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=133193.33333333334, ans=0.0 2023-06-23 17:42:10,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=133193.33333333334, ans=0.02 2023-06-23 17:42:23,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=133260.0, ans=0.125 2023-06-23 17:42:41,929 INFO [train.py:1008] (1/4) Epoch 38, batch 300, loss[loss=0.207, simple_loss=0.2848, pruned_loss=0.06455, over 19888.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2795, pruned_loss=0.065, over 2987425.29 frames. ], batch size: 120, lr: 8.67e-03, grad_scale: 32.0 2023-06-23 17:43:17,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=133460.0, ans=0.0 2023-06-23 17:43:23,649 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.801e+02 1.994e+02 2.321e+02 3.312e+02, threshold=3.988e+02, percent-clipped=0.0 2023-06-23 17:44:08,563 INFO [train.py:1008] (1/4) Epoch 38, batch 350, loss[loss=0.1979, simple_loss=0.279, pruned_loss=0.0584, over 18275.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2801, pruned_loss=0.0652, over 3148146.81 frames. ], batch size: 74, lr: 8.66e-03, grad_scale: 32.0 2023-06-23 17:44:36,414 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-06-23 17:44:42,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=133793.33333333334, ans=0.2 2023-06-23 17:45:09,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=133860.0, ans=0.0 2023-06-23 17:45:32,591 INFO [train.py:1008] (1/4) Epoch 38, batch 400, loss[loss=0.2024, simple_loss=0.284, pruned_loss=0.06043, over 20145.00 frames. ], tot_loss[loss=0.205, simple_loss=0.28, pruned_loss=0.06506, over 3276007.86 frames. ], batch size: 133, lr: 8.65e-03, grad_scale: 32.0 2023-06-23 17:45:42,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=133993.33333333334, ans=0.125 2023-06-23 17:45:47,220 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.88 vs. 
limit=15.0 2023-06-23 17:45:47,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=134060.0, ans=0.1 2023-06-23 17:45:58,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=134060.0, ans=0.125 2023-06-23 17:46:01,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=134060.0, ans=0.0 2023-06-23 17:46:11,383 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.765e+02 1.961e+02 2.163e+02 2.989e+02, threshold=3.921e+02, percent-clipped=0.0 2023-06-23 17:46:32,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=134193.33333333334, ans=0.125 2023-06-23 17:46:34,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=134193.33333333334, ans=0.1 2023-06-23 17:46:40,482 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.55 vs. limit=10.0 2023-06-23 17:46:42,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=134260.0, ans=0.0 2023-06-23 17:46:56,496 INFO [train.py:1008] (1/4) Epoch 38, batch 450, loss[loss=0.1869, simple_loss=0.2682, pruned_loss=0.05282, over 19867.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2801, pruned_loss=0.06535, over 3403494.61 frames. ], batch size: 120, lr: 8.65e-03, grad_scale: 32.0 2023-06-23 17:47:23,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=134393.33333333334, ans=0.02 2023-06-23 17:47:55,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=134526.66666666666, ans=0.125 2023-06-23 17:48:02,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=134593.33333333334, ans=0.0 2023-06-23 17:48:05,706 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.71 vs. limit=15.0 2023-06-23 17:48:09,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=134593.33333333334, ans=0.1 2023-06-23 17:48:09,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=134593.33333333334, ans=0.125 2023-06-23 17:48:17,216 INFO [train.py:1008] (1/4) Epoch 38, batch 500, loss[loss=0.2211, simple_loss=0.3028, pruned_loss=0.0697, over 17615.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2802, pruned_loss=0.06494, over 3491496.81 frames. 
], batch size: 67, lr: 8.64e-03, grad_scale: 32.0 2023-06-23 17:48:53,615 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.690e+02 1.875e+02 2.333e+02 3.146e+02, threshold=3.750e+02, percent-clipped=0.0 2023-06-23 17:48:57,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=134793.33333333334, ans=0.2 2023-06-23 17:49:04,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=134860.0, ans=0.04949747468305833 2023-06-23 17:49:28,567 INFO [train.py:1008] (1/4) Epoch 39, batch 0, loss[loss=0.1727, simple_loss=0.2548, pruned_loss=0.0453, over 19705.00 frames. ], tot_loss[loss=0.1727, simple_loss=0.2548, pruned_loss=0.0453, over 19705.00 frames. ], batch size: 110, lr: 8.52e-03, grad_scale: 32.0 2023-06-23 17:49:28,568 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 17:49:34,265 INFO [train.py:1040] (1/4) Epoch 39, validation: loss=0.1961, simple_loss=0.2911, pruned_loss=0.05057, over 143649.00 frames. 2023-06-23 17:49:34,266 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 17:49:55,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=134940.0, ans=0.1 2023-06-23 17:50:20,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=135006.66666666666, ans=0.125 2023-06-23 17:50:32,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=135073.33333333334, ans=0.125 2023-06-23 17:50:35,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=135073.33333333334, ans=0.0 2023-06-23 17:50:57,631 INFO [train.py:1008] (1/4) Epoch 39, batch 50, loss[loss=0.1945, simple_loss=0.2777, pruned_loss=0.05563, over 18781.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2778, pruned_loss=0.06454, over 868590.24 frames. ], batch size: 83, lr: 8.51e-03, grad_scale: 32.0 2023-06-23 17:51:41,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=135340.0, ans=0.125 2023-06-23 17:52:05,157 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.783e+02 2.011e+02 2.337e+02 3.361e+02, threshold=4.022e+02, percent-clipped=0.0 2023-06-23 17:52:12,813 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:52:19,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=135540.0, ans=0.0 2023-06-23 17:52:21,076 INFO [train.py:1008] (1/4) Epoch 39, batch 100, loss[loss=0.1877, simple_loss=0.2667, pruned_loss=0.05436, over 19461.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2793, pruned_loss=0.06513, over 1518668.78 frames. 
], batch size: 105, lr: 8.50e-03, grad_scale: 32.0 2023-06-23 17:52:34,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=135540.0, ans=0.5 2023-06-23 17:53:06,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=135673.33333333334, ans=0.125 2023-06-23 17:53:31,355 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.56 vs. limit=15.0 2023-06-23 17:53:38,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=135806.66666666666, ans=0.2 2023-06-23 17:53:43,314 INFO [train.py:1008] (1/4) Epoch 39, batch 150, loss[loss=0.2357, simple_loss=0.3224, pruned_loss=0.07451, over 10907.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2807, pruned_loss=0.06483, over 2012252.62 frames. ], batch size: 31, lr: 8.49e-03, grad_scale: 32.0 2023-06-23 17:53:52,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=135873.33333333334, ans=0.1 2023-06-23 17:54:21,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=136006.66666666666, ans=0.0 2023-06-23 17:54:40,764 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.21 vs. limit=15.0 2023-06-23 17:54:52,779 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.284e+02 1.711e+02 1.880e+02 2.080e+02 3.230e+02, threshold=3.760e+02, percent-clipped=0.0 2023-06-23 17:54:53,766 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.45 vs. limit=6.0 2023-06-23 17:55:07,137 INFO [train.py:1008] (1/4) Epoch 39, batch 200, loss[loss=0.1978, simple_loss=0.2477, pruned_loss=0.07398, over 16924.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2799, pruned_loss=0.06423, over 2404404.28 frames. ], batch size: 391, lr: 8.48e-03, grad_scale: 32.0 2023-06-23 17:55:57,341 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.74 vs. limit=12.0 2023-06-23 17:56:07,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=136406.66666666666, ans=0.125 2023-06-23 17:56:31,274 INFO [train.py:1008] (1/4) Epoch 39, batch 250, loss[loss=0.2, simple_loss=0.2783, pruned_loss=0.06088, over 18762.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.28, pruned_loss=0.06434, over 2713993.94 frames. ], batch size: 83, lr: 8.47e-03, grad_scale: 32.0 2023-06-23 17:56:35,775 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.18 vs. 
limit=6.0 2023-06-23 17:56:47,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=136606.66666666666, ans=0.125 2023-06-23 17:57:00,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=136606.66666666666, ans=0.1 2023-06-23 17:57:04,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=136673.33333333334, ans=0.0 2023-06-23 17:57:40,027 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.286e+02 1.691e+02 1.862e+02 2.348e+02 3.821e+02, threshold=3.724e+02, percent-clipped=1.0 2023-06-23 17:57:42,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=136806.66666666666, ans=0.125 2023-06-23 17:57:53,865 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.66 vs. limit=15.0 2023-06-23 17:57:54,449 INFO [train.py:1008] (1/4) Epoch 39, batch 300, loss[loss=0.1824, simple_loss=0.2666, pruned_loss=0.04909, over 19682.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2796, pruned_loss=0.06387, over 2945369.53 frames. ], batch size: 110, lr: 8.46e-03, grad_scale: 32.0 2023-06-23 17:58:01,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=136873.33333333334, ans=0.0 2023-06-23 17:58:06,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=136873.33333333334, ans=0.125 2023-06-23 17:58:29,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137006.66666666666, ans=0.1 2023-06-23 17:58:54,675 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-23 17:58:55,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=137073.33333333334, ans=0.1 2023-06-23 17:58:57,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=137073.33333333334, ans=0.125 2023-06-23 17:59:05,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=137140.0, ans=0.125 2023-06-23 17:59:16,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=137206.66666666666, ans=0.125 2023-06-23 17:59:18,316 INFO [train.py:1008] (1/4) Epoch 39, batch 350, loss[loss=0.194, simple_loss=0.2735, pruned_loss=0.05729, over 19772.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2792, pruned_loss=0.06399, over 3120472.46 frames. ], batch size: 115, lr: 8.45e-03, grad_scale: 32.0 2023-06-23 17:59:23,895 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.36 vs. 
limit=6.0 2023-06-23 17:59:25,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=137206.66666666666, ans=0.07 2023-06-23 17:59:32,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=137206.66666666666, ans=0.125 2023-06-23 17:59:39,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=137273.33333333334, ans=0.1 2023-06-23 17:59:40,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=137273.33333333334, ans=0.0 2023-06-23 17:59:53,916 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.48 vs. limit=6.0 2023-06-23 17:59:59,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=137340.0, ans=0.125 2023-06-23 18:00:11,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=137406.66666666666, ans=0.2 2023-06-23 18:00:27,896 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.736e+02 1.973e+02 2.243e+02 3.453e+02, threshold=3.946e+02, percent-clipped=0.0 2023-06-23 18:00:30,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137473.33333333334, ans=0.1 2023-06-23 18:00:40,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137473.33333333334, ans=0.1 2023-06-23 18:00:43,644 INFO [train.py:1008] (1/4) Epoch 39, batch 400, loss[loss=0.2183, simple_loss=0.2801, pruned_loss=0.07822, over 20207.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2789, pruned_loss=0.06399, over 3264418.07 frames. ], batch size: 239, lr: 8.44e-03, grad_scale: 32.0 2023-06-23 18:00:57,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=137540.0, ans=0.2 2023-06-23 18:01:00,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=137606.66666666666, ans=0.125 2023-06-23 18:01:16,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=137673.33333333334, ans=0.0 2023-06-23 18:01:22,380 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.73 vs. limit=22.5 2023-06-23 18:01:28,671 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.07 vs. limit=15.0 2023-06-23 18:01:41,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=137740.0, ans=0.0 2023-06-23 18:01:57,699 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.32 vs. limit=15.0 2023-06-23 18:02:08,801 INFO [train.py:1008] (1/4) Epoch 39, batch 450, loss[loss=0.1803, simple_loss=0.2662, pruned_loss=0.04723, over 19686.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.279, pruned_loss=0.06379, over 3385579.52 frames. 
], batch size: 110, lr: 8.44e-03, grad_scale: 32.0 2023-06-23 18:02:09,473 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2023-06-23 18:02:12,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=137873.33333333334, ans=0.125 2023-06-23 18:02:12,752 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.96 vs. limit=12.0 2023-06-23 18:02:30,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=137940.0, ans=0.2 2023-06-23 18:02:50,978 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.21 vs. limit=10.0 2023-06-23 18:02:59,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=138073.33333333334, ans=0.125 2023-06-23 18:03:02,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=138073.33333333334, ans=0.0 2023-06-23 18:03:07,438 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:03:14,783 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.813e+02 2.153e+02 2.467e+02 3.224e+02, threshold=4.306e+02, percent-clipped=0.0 2023-06-23 18:03:17,907 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.58 vs. limit=6.0 2023-06-23 18:03:29,428 INFO [train.py:1008] (1/4) Epoch 39, batch 500, loss[loss=0.1903, simple_loss=0.2712, pruned_loss=0.05473, over 18797.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2794, pruned_loss=0.06396, over 3472450.44 frames. ], batch size: 83, lr: 8.43e-03, grad_scale: 32.0 2023-06-23 18:03:58,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=138273.33333333334, ans=0.125 2023-06-23 18:04:10,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=138340.0, ans=0.125 2023-06-23 18:04:11,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=138340.0, ans=0.1 2023-06-23 18:04:43,031 INFO [train.py:1008] (1/4) Epoch 40, batch 0, loss[loss=0.2131, simple_loss=0.2876, pruned_loss=0.06929, over 20278.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2876, pruned_loss=0.06929, over 20278.00 frames. ], batch size: 141, lr: 8.31e-03, grad_scale: 32.0 2023-06-23 18:04:43,031 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 18:04:48,660 INFO [train.py:1040] (1/4) Epoch 40, validation: loss=0.198, simple_loss=0.2915, pruned_loss=0.05224, over 143649.00 frames. 
2023-06-23 18:04:48,661 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 18:04:53,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=138420.0, ans=0.0 2023-06-23 18:05:33,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=138553.33333333334, ans=0.125 2023-06-23 18:05:40,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=138620.0, ans=15.0 2023-06-23 18:05:57,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=138686.66666666666, ans=0.125 2023-06-23 18:06:12,702 INFO [train.py:1008] (1/4) Epoch 40, batch 50, loss[loss=0.1955, simple_loss=0.2756, pruned_loss=0.05767, over 19113.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2792, pruned_loss=0.06517, over 848220.58 frames. ], batch size: 94, lr: 8.31e-03, grad_scale: 32.0 2023-06-23 18:06:24,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=138753.33333333334, ans=0.0 2023-06-23 18:06:26,973 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.678e+02 1.899e+02 2.125e+02 3.076e+02, threshold=3.798e+02, percent-clipped=0.0 2023-06-23 18:06:31,455 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.04 vs. limit=15.0 2023-06-23 18:06:46,819 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.47 vs. limit=15.0 2023-06-23 18:07:02,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=138953.33333333334, ans=0.1 2023-06-23 18:07:36,477 INFO [train.py:1008] (1/4) Epoch 40, batch 100, loss[loss=0.2124, simple_loss=0.2711, pruned_loss=0.07689, over 19983.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.278, pruned_loss=0.06457, over 1520721.15 frames. ], batch size: 293, lr: 8.30e-03, grad_scale: 16.0 2023-06-23 18:08:04,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=139153.33333333334, ans=0.0 2023-06-23 18:08:15,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=139220.0, ans=0.1 2023-06-23 18:08:41,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=139353.33333333334, ans=15.0 2023-06-23 18:08:45,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=139353.33333333334, ans=0.1 2023-06-23 18:08:50,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=139353.33333333334, ans=0.2 2023-06-23 18:08:55,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=139353.33333333334, ans=0.1 2023-06-23 18:08:59,327 INFO [train.py:1008] (1/4) Epoch 40, batch 150, loss[loss=0.2032, simple_loss=0.278, pruned_loss=0.06423, over 18938.00 frames. 
], tot_loss[loss=0.2034, simple_loss=0.2782, pruned_loss=0.06435, over 2025581.20 frames. ], batch size: 86, lr: 8.29e-03, grad_scale: 16.0 2023-06-23 18:08:59,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=139420.0, ans=0.2 2023-06-23 18:08:59,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=139420.0, ans=0.125 2023-06-23 18:09:15,496 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.764e+02 1.962e+02 2.283e+02 3.805e+02, threshold=3.924e+02, percent-clipped=1.0 2023-06-23 18:09:53,343 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.65 vs. limit=22.5 2023-06-23 18:09:56,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=139620.0, ans=0.0 2023-06-23 18:10:21,920 INFO [train.py:1008] (1/4) Epoch 40, batch 200, loss[loss=0.1899, simple_loss=0.273, pruned_loss=0.05336, over 18652.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2767, pruned_loss=0.06379, over 2435811.94 frames. ], batch size: 80, lr: 8.28e-03, grad_scale: 16.0 2023-06-23 18:10:33,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=139753.33333333334, ans=0.0 2023-06-23 18:10:43,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=139820.0, ans=0.125 2023-06-23 18:10:53,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=139886.66666666666, ans=0.125 2023-06-23 18:11:34,993 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.42 vs. limit=15.0 2023-06-23 18:11:42,158 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.08 vs. limit=22.5 2023-06-23 18:11:42,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=140086.66666666666, ans=0.125 2023-06-23 18:11:44,623 INFO [train.py:1008] (1/4) Epoch 40, batch 250, loss[loss=0.2025, simple_loss=0.2736, pruned_loss=0.0657, over 19977.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2776, pruned_loss=0.06324, over 2746910.24 frames. ], batch size: 126, lr: 8.27e-03, grad_scale: 16.0 2023-06-23 18:11:50,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=140086.66666666666, ans=0.0 2023-06-23 18:11:57,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=140086.66666666666, ans=0.0 2023-06-23 18:12:01,722 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.803e+02 1.945e+02 2.158e+02 3.592e+02, threshold=3.889e+02, percent-clipped=0.0 2023-06-23 18:12:16,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=140220.0, ans=0.025 2023-06-23 18:12:16,914 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.69 vs. 
limit=15.0 2023-06-23 18:12:20,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=140220.0, ans=0.05 2023-06-23 18:12:38,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=140286.66666666666, ans=0.025 2023-06-23 18:12:51,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=140353.33333333334, ans=0.1 2023-06-23 18:13:08,087 INFO [train.py:1008] (1/4) Epoch 40, batch 300, loss[loss=0.1991, simple_loss=0.2839, pruned_loss=0.05715, over 16795.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2781, pruned_loss=0.06286, over 2975975.11 frames. ], batch size: 59, lr: 8.26e-03, grad_scale: 16.0 2023-06-23 18:13:19,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=140420.0, ans=0.0 2023-06-23 18:13:23,281 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.55 vs. limit=15.0 2023-06-23 18:13:29,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=140486.66666666666, ans=0.0 2023-06-23 18:13:42,344 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.53 vs. limit=15.0 2023-06-23 18:13:45,694 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.80 vs. limit=10.0 2023-06-23 18:14:07,837 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.53 vs. limit=15.0 2023-06-23 18:14:22,786 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:14:30,702 INFO [train.py:1008] (1/4) Epoch 40, batch 350, loss[loss=0.1871, simple_loss=0.2655, pruned_loss=0.05432, over 19311.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2786, pruned_loss=0.0628, over 3164152.68 frames. ], batch size: 98, lr: 8.25e-03, grad_scale: 16.0 2023-06-23 18:14:48,204 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.677e+02 1.868e+02 2.086e+02 3.212e+02, threshold=3.736e+02, percent-clipped=0.0 2023-06-23 18:14:56,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=140820.0, ans=0.1 2023-06-23 18:15:14,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=140886.66666666666, ans=0.04949747468305833 2023-06-23 18:15:53,144 INFO [train.py:1008] (1/4) Epoch 40, batch 400, loss[loss=0.2211, simple_loss=0.3003, pruned_loss=0.07098, over 18256.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2785, pruned_loss=0.06285, over 3305371.18 frames. 
], batch size: 74, lr: 8.24e-03, grad_scale: 32.0 2023-06-23 18:15:56,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=141086.66666666666, ans=0.125 2023-06-23 18:16:02,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=141086.66666666666, ans=0.125 2023-06-23 18:16:09,692 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.07 vs. limit=15.0 2023-06-23 18:16:29,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=141220.0, ans=0.125 2023-06-23 18:16:48,873 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.04 vs. limit=22.5 2023-06-23 18:17:15,520 INFO [train.py:1008] (1/4) Epoch 40, batch 450, loss[loss=0.219, simple_loss=0.3027, pruned_loss=0.06764, over 15548.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.278, pruned_loss=0.06254, over 3419480.92 frames. ], batch size: 44, lr: 8.24e-03, grad_scale: 32.0 2023-06-23 18:17:19,838 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.09 vs. limit=15.0 2023-06-23 18:17:32,351 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.729e+02 1.897e+02 2.097e+02 3.976e+02, threshold=3.793e+02, percent-clipped=1.0 2023-06-23 18:17:59,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=141553.33333333334, ans=0.05 2023-06-23 18:18:04,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=141620.0, ans=0.2 2023-06-23 18:18:05,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=141620.0, ans=10.0 2023-06-23 18:18:13,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=141620.0, ans=0.125 2023-06-23 18:18:35,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=141753.33333333334, ans=0.1 2023-06-23 18:18:36,747 INFO [train.py:1008] (1/4) Epoch 40, batch 500, loss[loss=0.2043, simple_loss=0.2767, pruned_loss=0.06593, over 20426.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2774, pruned_loss=0.06271, over 3505741.39 frames. 
], batch size: 160, lr: 8.23e-03, grad_scale: 32.0 2023-06-23 18:18:39,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=141753.33333333334, ans=0.2 2023-06-23 18:18:53,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=141820.0, ans=0.125 2023-06-23 18:18:57,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=141820.0, ans=0.125 2023-06-23 18:19:08,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=141886.66666666666, ans=0.125 2023-06-23 18:19:10,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=141886.66666666666, ans=0.0 2023-06-23 18:19:13,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=141886.66666666666, ans=0.1 2023-06-23 18:19:20,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=141886.66666666666, ans=0.125 2023-06-23 18:19:23,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=141953.33333333334, ans=0.125 2023-06-23 18:19:24,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=141953.33333333334, ans=0.0 2023-06-23 18:19:45,973 INFO [train.py:1008] (1/4) Epoch 41, batch 0, loss[loss=0.1927, simple_loss=0.272, pruned_loss=0.05671, over 19195.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.272, pruned_loss=0.05671, over 19195.00 frames. ], batch size: 92, lr: 8.12e-03, grad_scale: 32.0 2023-06-23 18:19:45,973 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 18:19:51,587 INFO [train.py:1040] (1/4) Epoch 41, validation: loss=0.1955, simple_loss=0.2897, pruned_loss=0.05062, over 143649.00 frames. 2023-06-23 18:19:51,588 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 18:20:09,353 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.21 vs. limit=15.0 2023-06-23 18:20:11,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=142040.0, ans=0.125 2023-06-23 18:20:26,577 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2023-06-23 18:20:36,137 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.747e+02 2.012e+02 2.296e+02 3.578e+02, threshold=4.024e+02, percent-clipped=0.0 2023-06-23 18:20:38,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=142106.66666666666, ans=0.2 2023-06-23 18:20:43,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=142173.33333333334, ans=0.125 2023-06-23 18:20:48,656 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.36 vs. 
limit=6.0 2023-06-23 18:21:13,871 INFO [train.py:1008] (1/4) Epoch 41, batch 50, loss[loss=0.1947, simple_loss=0.2766, pruned_loss=0.05641, over 18646.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2761, pruned_loss=0.06212, over 855755.96 frames. ], batch size: 80, lr: 8.11e-03, grad_scale: 32.0 2023-06-23 18:21:16,511 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=7.39 vs. limit=12.0 2023-06-23 18:21:39,348 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.06 vs. limit=12.0 2023-06-23 18:22:18,893 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:22:21,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=142573.33333333334, ans=0.125 2023-06-23 18:22:36,804 INFO [train.py:1008] (1/4) Epoch 41, batch 100, loss[loss=0.219, simple_loss=0.2745, pruned_loss=0.08179, over 19834.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.278, pruned_loss=0.06332, over 1495432.79 frames. ], batch size: 293, lr: 8.10e-03, grad_scale: 32.0 2023-06-23 18:22:41,995 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:23:03,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=142706.66666666666, ans=0.1 2023-06-23 18:23:13,316 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.33 vs. limit=12.0 2023-06-23 18:23:16,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=142773.33333333334, ans=0.1 2023-06-23 18:23:18,421 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.76 vs. limit=15.0 2023-06-23 18:23:22,012 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.755e+02 2.017e+02 2.293e+02 3.854e+02, threshold=4.034e+02, percent-clipped=0.0 2023-06-23 18:24:00,511 INFO [train.py:1008] (1/4) Epoch 41, batch 150, loss[loss=0.2009, simple_loss=0.2957, pruned_loss=0.05308, over 15623.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.278, pruned_loss=0.06288, over 2014635.23 frames. ], batch size: 44, lr: 8.09e-03, grad_scale: 32.0 2023-06-23 18:24:16,717 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=22.5 2023-06-23 18:24:51,054 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=15.02 vs. limit=22.5 2023-06-23 18:24:52,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=143173.33333333334, ans=0.125 2023-06-23 18:25:22,961 INFO [train.py:1008] (1/4) Epoch 41, batch 200, loss[loss=0.1922, simple_loss=0.2742, pruned_loss=0.05512, over 19344.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2776, pruned_loss=0.0628, over 2413422.08 frames. 
], batch size: 98, lr: 8.09e-03, grad_scale: 32.0 2023-06-23 18:25:34,296 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:25:55,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=143440.0, ans=0.0 2023-06-23 18:26:07,055 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.697e+02 1.861e+02 2.042e+02 2.751e+02, threshold=3.722e+02, percent-clipped=0.0 2023-06-23 18:26:12,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=143506.66666666666, ans=0.2 2023-06-23 18:26:32,037 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.86 vs. limit=22.5 2023-06-23 18:26:38,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=143573.33333333334, ans=0.0 2023-06-23 18:26:44,765 INFO [train.py:1008] (1/4) Epoch 41, batch 250, loss[loss=0.1964, simple_loss=0.2745, pruned_loss=0.05914, over 19390.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2794, pruned_loss=0.06243, over 2711407.42 frames. ], batch size: 98, lr: 8.08e-03, grad_scale: 32.0 2023-06-23 18:27:13,546 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.21 vs. limit=15.0 2023-06-23 18:27:41,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=143840.0, ans=0.2 2023-06-23 18:28:08,510 INFO [train.py:1008] (1/4) Epoch 41, batch 300, loss[loss=0.2004, simple_loss=0.2746, pruned_loss=0.06303, over 18608.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2791, pruned_loss=0.06239, over 2956793.45 frames. ], batch size: 80, lr: 8.07e-03, grad_scale: 32.0 2023-06-23 18:28:10,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=143973.33333333334, ans=0.125 2023-06-23 18:28:26,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=144040.0, ans=0.07 2023-06-23 18:28:53,824 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.647e+02 1.872e+02 2.166e+02 2.965e+02, threshold=3.744e+02, percent-clipped=0.0 2023-06-23 18:29:20,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=144240.0, ans=0.125 2023-06-23 18:29:24,486 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.73 vs. limit=10.0 2023-06-23 18:29:31,889 INFO [train.py:1008] (1/4) Epoch 41, batch 350, loss[loss=0.1967, simple_loss=0.2729, pruned_loss=0.06024, over 20289.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2789, pruned_loss=0.06232, over 3145463.91 frames. 
], batch size: 141, lr: 8.06e-03, grad_scale: 32.0 2023-06-23 18:29:33,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=144306.66666666666, ans=0.0 2023-06-23 18:29:37,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=144306.66666666666, ans=0.125 2023-06-23 18:30:18,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=144440.0, ans=0.0 2023-06-23 18:30:25,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=144506.66666666666, ans=0.0 2023-06-23 18:30:41,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=144573.33333333334, ans=0.125 2023-06-23 18:30:56,082 INFO [train.py:1008] (1/4) Epoch 41, batch 400, loss[loss=0.2214, simple_loss=0.2613, pruned_loss=0.09069, over 16911.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2782, pruned_loss=0.06205, over 3298348.06 frames. ], batch size: 391, lr: 8.05e-03, grad_scale: 32.0 2023-06-23 18:30:56,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=144640.0, ans=0.125 2023-06-23 18:31:24,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=144706.66666666666, ans=0.125 2023-06-23 18:31:28,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=144773.33333333334, ans=6.0 2023-06-23 18:31:41,115 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.737e+02 1.969e+02 2.223e+02 3.144e+02, threshold=3.939e+02, percent-clipped=0.0 2023-06-23 18:32:00,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=144840.0, ans=0.125 2023-06-23 18:32:13,624 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=12.0 2023-06-23 18:32:19,914 INFO [train.py:1008] (1/4) Epoch 41, batch 450, loss[loss=0.2005, simple_loss=0.276, pruned_loss=0.06254, over 20104.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2779, pruned_loss=0.06197, over 3404149.05 frames. ], batch size: 133, lr: 8.04e-03, grad_scale: 32.0 2023-06-23 18:32:38,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=145040.0, ans=0.125 2023-06-23 18:32:54,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=145106.66666666666, ans=0.125 2023-06-23 18:33:26,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=145240.0, ans=0.0 2023-06-23 18:33:40,018 INFO [train.py:1008] (1/4) Epoch 41, batch 500, loss[loss=0.2078, simple_loss=0.2758, pruned_loss=0.06985, over 20569.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2769, pruned_loss=0.06207, over 3472501.22 frames. 
], batch size: 173, lr: 8.04e-03, grad_scale: 32.0 2023-06-23 18:33:41,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=145306.66666666666, ans=0.0 2023-06-23 18:33:46,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=145306.66666666666, ans=0.125 2023-06-23 18:34:06,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=145373.33333333334, ans=0.125 2023-06-23 18:34:21,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=145440.0, ans=0.0 2023-06-23 18:34:22,592 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.373e+02 1.720e+02 1.909e+02 2.113e+02 2.699e+02, threshold=3.817e+02, percent-clipped=0.0 2023-06-23 18:34:27,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=145506.66666666666, ans=0.2 2023-06-23 18:34:51,162 INFO [train.py:1008] (1/4) Epoch 42, batch 0, loss[loss=0.211, simple_loss=0.2838, pruned_loss=0.06914, over 19965.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2838, pruned_loss=0.06914, over 19965.00 frames. ], batch size: 126, lr: 7.93e-03, grad_scale: 32.0 2023-06-23 18:34:51,163 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 18:34:56,871 INFO [train.py:1040] (1/4) Epoch 42, validation: loss=0.1951, simple_loss=0.2892, pruned_loss=0.05045, over 143649.00 frames. 2023-06-23 18:34:56,872 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 18:35:09,510 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.36 vs. limit=15.0 2023-06-23 18:35:22,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=145586.66666666666, ans=0.125 2023-06-23 18:35:39,457 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=12.0 2023-06-23 18:35:43,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=145653.33333333334, ans=0.125 2023-06-23 18:35:45,470 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.99 vs. limit=15.0 2023-06-23 18:36:18,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=145853.33333333334, ans=0.125 2023-06-23 18:36:19,545 INFO [train.py:1008] (1/4) Epoch 42, batch 50, loss[loss=0.2138, simple_loss=0.2685, pruned_loss=0.07959, over 20004.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2762, pruned_loss=0.06183, over 842366.14 frames. ], batch size: 293, lr: 7.93e-03, grad_scale: 32.0 2023-06-23 18:36:41,949 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:37:16,198 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.29 vs. 
limit=15.0 2023-06-23 18:37:21,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=146053.33333333334, ans=0.0 2023-06-23 18:37:32,828 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.693e+02 1.862e+02 2.191e+02 3.305e+02, threshold=3.724e+02, percent-clipped=0.0 2023-06-23 18:37:41,600 INFO [train.py:1008] (1/4) Epoch 42, batch 100, loss[loss=0.2113, simple_loss=0.2928, pruned_loss=0.06495, over 18275.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2782, pruned_loss=0.06159, over 1488243.59 frames. ], batch size: 74, lr: 7.92e-03, grad_scale: 32.0 2023-06-23 18:38:31,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=146386.66666666666, ans=0.1 2023-06-23 18:38:47,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=146453.33333333334, ans=0.1 2023-06-23 18:39:02,339 INFO [train.py:1008] (1/4) Epoch 42, batch 150, loss[loss=0.2087, simple_loss=0.2954, pruned_loss=0.06106, over 17666.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2775, pruned_loss=0.061, over 2006151.99 frames. ], batch size: 67, lr: 7.91e-03, grad_scale: 16.0 2023-06-23 18:39:21,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=146586.66666666666, ans=0.2 2023-06-23 18:39:33,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=146653.33333333334, ans=0.0 2023-06-23 18:39:46,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=146653.33333333334, ans=0.0 2023-06-23 18:39:59,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=146720.0, ans=0.0 2023-06-23 18:40:10,298 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-06-23 18:40:10,696 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.55 vs. limit=12.0 2023-06-23 18:40:12,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=146786.66666666666, ans=0.125 2023-06-23 18:40:17,251 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.884e+02 2.149e+02 2.581e+02 4.173e+02, threshold=4.298e+02, percent-clipped=2.0 2023-06-23 18:40:24,561 INFO [train.py:1008] (1/4) Epoch 42, batch 200, loss[loss=0.2304, simple_loss=0.3147, pruned_loss=0.0731, over 17618.00 frames. ], tot_loss[loss=0.2, simple_loss=0.278, pruned_loss=0.06094, over 2387972.78 frames. ], batch size: 67, lr: 7.90e-03, grad_scale: 16.0 2023-06-23 18:40:36,836 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.22 vs. 
limit=15.0 2023-06-23 18:41:10,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=146986.66666666666, ans=0.125 2023-06-23 18:41:10,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=146986.66666666666, ans=0.2 2023-06-23 18:41:30,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=147120.0, ans=0.1 2023-06-23 18:41:36,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=147120.0, ans=0.0 2023-06-23 18:41:46,142 INFO [train.py:1008] (1/4) Epoch 42, batch 250, loss[loss=0.2186, simple_loss=0.2996, pruned_loss=0.06875, over 18434.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.279, pruned_loss=0.0616, over 2700230.46 frames. ], batch size: 77, lr: 7.89e-03, grad_scale: 16.0 2023-06-23 18:42:18,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=147320.0, ans=0.125 2023-06-23 18:42:18,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=147320.0, ans=0.0 2023-06-23 18:42:28,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=147320.0, ans=0.125 2023-06-23 18:42:35,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=147386.66666666666, ans=0.0 2023-06-23 18:42:47,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=147386.66666666666, ans=0.125 2023-06-23 18:42:47,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=147386.66666666666, ans=0.0 2023-06-23 18:42:55,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=147453.33333333334, ans=0.2 2023-06-23 18:43:04,699 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.767e+02 1.922e+02 2.149e+02 5.047e+02, threshold=3.844e+02, percent-clipped=1.0 2023-06-23 18:43:09,463 INFO [train.py:1008] (1/4) Epoch 42, batch 300, loss[loss=0.2049, simple_loss=0.2799, pruned_loss=0.06493, over 18943.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2779, pruned_loss=0.06205, over 2946939.47 frames. ], batch size: 86, lr: 7.88e-03, grad_scale: 8.0 2023-06-23 18:43:14,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=147520.0, ans=0.125 2023-06-23 18:43:19,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.74 vs. limit=22.5 2023-06-23 18:43:35,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=147586.66666666666, ans=0.125 2023-06-23 18:43:48,556 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.51 vs. limit=22.5 2023-06-23 18:44:09,365 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.13 vs. 
limit=22.5 2023-06-23 18:44:32,324 INFO [train.py:1008] (1/4) Epoch 42, batch 350, loss[loss=0.1891, simple_loss=0.2725, pruned_loss=0.05281, over 19468.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2779, pruned_loss=0.06167, over 3113538.29 frames. ], batch size: 105, lr: 7.88e-03, grad_scale: 8.0 2023-06-23 18:44:34,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=147853.33333333334, ans=0.1 2023-06-23 18:44:43,263 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-23 18:45:02,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=147920.0, ans=0.1 2023-06-23 18:45:05,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=147986.66666666666, ans=0.0 2023-06-23 18:45:51,459 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.757e+02 1.899e+02 2.123e+02 3.456e+02, threshold=3.798e+02, percent-clipped=0.0 2023-06-23 18:45:56,313 INFO [train.py:1008] (1/4) Epoch 42, batch 400, loss[loss=0.2189, simple_loss=0.2869, pruned_loss=0.07541, over 20292.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2772, pruned_loss=0.06176, over 3256828.56 frames. ], batch size: 149, lr: 7.87e-03, grad_scale: 16.0 2023-06-23 18:46:15,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148253.33333333334, ans=0.1 2023-06-23 18:46:23,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=148253.33333333334, ans=0.125 2023-06-23 18:46:37,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=148320.0, ans=0.125 2023-06-23 18:46:55,137 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.28 vs. limit=22.5 2023-06-23 18:47:20,147 INFO [train.py:1008] (1/4) Epoch 42, batch 450, loss[loss=0.212, simple_loss=0.2561, pruned_loss=0.08394, over 16919.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2768, pruned_loss=0.06165, over 3369055.79 frames. ], batch size: 391, lr: 7.86e-03, grad_scale: 16.0 2023-06-23 18:47:31,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=148520.0, ans=0.0 2023-06-23 18:47:43,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148586.66666666666, ans=0.1 2023-06-23 18:47:52,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=148653.33333333334, ans=0.0 2023-06-23 18:47:53,408 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.02 vs. 
limit=12.0 2023-06-23 18:47:54,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148653.33333333334, ans=0.1 2023-06-23 18:48:19,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=148720.0, ans=0.125 2023-06-23 18:48:36,015 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.803e+02 1.999e+02 2.413e+02 3.336e+02, threshold=3.998e+02, percent-clipped=0.0 2023-06-23 18:48:40,750 INFO [train.py:1008] (1/4) Epoch 42, batch 500, loss[loss=0.2147, simple_loss=0.3071, pruned_loss=0.0612, over 15208.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2762, pruned_loss=0.0617, over 3467618.41 frames. ], batch size: 43, lr: 7.85e-03, grad_scale: 16.0 2023-06-23 18:48:45,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=148853.33333333334, ans=0.0 2023-06-23 18:48:47,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148853.33333333334, ans=0.1 2023-06-23 18:48:51,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=148853.33333333334, ans=0.125 2023-06-23 18:49:03,914 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.87 vs. limit=22.5 2023-06-23 18:49:07,635 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:49:12,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=148986.66666666666, ans=0.125 2023-06-23 18:49:53,213 INFO [train.py:1008] (1/4) Epoch 43, batch 0, loss[loss=0.1971, simple_loss=0.271, pruned_loss=0.06156, over 20660.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.271, pruned_loss=0.06156, over 20660.00 frames. ], batch size: 211, lr: 7.76e-03, grad_scale: 32.0 2023-06-23 18:49:53,213 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 18:49:58,905 INFO [train.py:1040] (1/4) Epoch 43, validation: loss=0.1949, simple_loss=0.29, pruned_loss=0.04993, over 143649.00 frames. 2023-06-23 18:49:58,906 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 18:50:13,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=149140.0, ans=0.025 2023-06-23 18:50:16,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=149140.0, ans=10.0 2023-06-23 18:50:27,561 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.74 vs. limit=15.0 2023-06-23 18:50:41,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=149206.66666666666, ans=0.125 2023-06-23 18:51:00,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=149273.33333333334, ans=0.125 2023-06-23 18:51:21,830 INFO [train.py:1008] (1/4) Epoch 43, batch 50, loss[loss=0.1846, simple_loss=0.2701, pruned_loss=0.04952, over 19071.00 frames. 
], tot_loss[loss=0.1981, simple_loss=0.276, pruned_loss=0.06007, over 848048.39 frames. ], batch size: 89, lr: 7.75e-03, grad_scale: 32.0 2023-06-23 18:51:43,797 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.648e+02 1.897e+02 2.082e+02 3.369e+02, threshold=3.795e+02, percent-clipped=0.0 2023-06-23 18:51:53,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.67 vs. limit=15.0 2023-06-23 18:52:03,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=149540.0, ans=0.125 2023-06-23 18:52:07,455 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-23 18:52:09,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=149606.66666666666, ans=0.1 2023-06-23 18:52:16,668 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.77 vs. limit=12.0 2023-06-23 18:52:37,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=149673.33333333334, ans=0.125 2023-06-23 18:52:43,349 INFO [train.py:1008] (1/4) Epoch 43, batch 100, loss[loss=0.1956, simple_loss=0.2701, pruned_loss=0.0606, over 20519.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2752, pruned_loss=0.06052, over 1505903.40 frames. ], batch size: 189, lr: 7.74e-03, grad_scale: 32.0 2023-06-23 18:53:23,155 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.95 vs. limit=10.0 2023-06-23 18:53:32,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=149940.0, ans=0.0 2023-06-23 18:53:40,218 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.73 vs. limit=22.5 2023-06-23 18:53:56,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150006.66666666666, ans=0.1 2023-06-23 18:54:03,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=150073.33333333334, ans=0.125 2023-06-23 18:54:04,396 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2023-06-23 18:54:04,763 INFO [train.py:1008] (1/4) Epoch 43, batch 150, loss[loss=0.1972, simple_loss=0.2819, pruned_loss=0.0563, over 18937.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2756, pruned_loss=0.06013, over 1985857.55 frames. 
], batch size: 86, lr: 7.73e-03, grad_scale: 32.0 2023-06-23 18:54:13,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=150073.33333333334, ans=0.015 2023-06-23 18:54:16,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=150073.33333333334, ans=0.2 2023-06-23 18:54:28,200 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.713e+02 1.940e+02 2.500e+02 3.729e+02, threshold=3.879e+02, percent-clipped=0.0 2023-06-23 18:54:56,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=150273.33333333334, ans=10.0 2023-06-23 18:55:24,642 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. limit=6.0 2023-06-23 18:55:26,934 INFO [train.py:1008] (1/4) Epoch 43, batch 200, loss[loss=0.1945, simple_loss=0.2687, pruned_loss=0.06013, over 19769.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2749, pruned_loss=0.06124, over 2382200.70 frames. ], batch size: 115, lr: 7.72e-03, grad_scale: 32.0 2023-06-23 18:55:30,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=150406.66666666666, ans=0.1 2023-06-23 18:55:40,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=150406.66666666666, ans=0.2 2023-06-23 18:55:41,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=150473.33333333334, ans=0.0 2023-06-23 18:55:46,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=150473.33333333334, ans=0.0 2023-06-23 18:56:11,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=150540.0, ans=0.125 2023-06-23 18:56:19,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=150606.66666666666, ans=0.125 2023-06-23 18:56:21,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=150606.66666666666, ans=0.0 2023-06-23 18:56:48,751 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=12.0 2023-06-23 18:56:49,159 INFO [train.py:1008] (1/4) Epoch 43, batch 250, loss[loss=0.2184, simple_loss=0.2999, pruned_loss=0.06851, over 16267.00 frames. ], tot_loss[loss=0.198, simple_loss=0.275, pruned_loss=0.06047, over 2692022.53 frames. ], batch size: 52, lr: 7.72e-03, grad_scale: 32.0 2023-06-23 18:56:53,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=150740.0, ans=0.1 2023-06-23 18:57:00,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. 
limit=15.0 2023-06-23 18:57:12,510 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.729e+02 1.933e+02 2.195e+02 2.886e+02, threshold=3.865e+02, percent-clipped=0.0 2023-06-23 18:57:54,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=151006.66666666666, ans=0.125 2023-06-23 18:57:56,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=151006.66666666666, ans=0.0 2023-06-23 18:58:12,324 INFO [train.py:1008] (1/4) Epoch 43, batch 300, loss[loss=0.189, simple_loss=0.269, pruned_loss=0.05448, over 19100.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2748, pruned_loss=0.06023, over 2931626.36 frames. ], batch size: 94, lr: 7.71e-03, grad_scale: 32.0 2023-06-23 18:58:14,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=151073.33333333334, ans=0.125 2023-06-23 18:58:18,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=151073.33333333334, ans=0.125 2023-06-23 18:58:54,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=151206.66666666666, ans=0.1 2023-06-23 18:59:23,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=151340.0, ans=0.0 2023-06-23 18:59:34,813 INFO [train.py:1008] (1/4) Epoch 43, batch 350, loss[loss=0.2017, simple_loss=0.2798, pruned_loss=0.06175, over 19353.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2751, pruned_loss=0.06071, over 3123864.42 frames. ], batch size: 98, lr: 7.70e-03, grad_scale: 32.0 2023-06-23 18:59:49,070 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.57 vs. limit=6.0 2023-06-23 18:59:57,696 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.705e+02 1.912e+02 2.088e+02 2.578e+02, threshold=3.825e+02, percent-clipped=0.0 2023-06-23 19:00:19,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=151540.0, ans=0.125 2023-06-23 19:00:41,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=151673.33333333334, ans=0.2 2023-06-23 19:00:47,795 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=15.0 2023-06-23 19:00:56,372 INFO [train.py:1008] (1/4) Epoch 43, batch 400, loss[loss=0.2099, simple_loss=0.2869, pruned_loss=0.06651, over 18798.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.2759, pruned_loss=0.06148, over 3270235.80 frames. ], batch size: 83, lr: 7.69e-03, grad_scale: 32.0 2023-06-23 19:01:08,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=151740.0, ans=0.0 2023-06-23 19:01:48,761 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.06 vs. 
limit=15.0 2023-06-23 19:01:52,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=151940.0, ans=0.125 2023-06-23 19:01:56,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=151940.0, ans=0.5 2023-06-23 19:02:18,541 INFO [train.py:1008] (1/4) Epoch 43, batch 450, loss[loss=0.1737, simple_loss=0.255, pruned_loss=0.04615, over 19812.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2757, pruned_loss=0.06111, over 3361916.46 frames. ], batch size: 115, lr: 7.69e-03, grad_scale: 32.0 2023-06-23 19:02:24,687 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=6.51 vs. limit=15.0 2023-06-23 19:02:41,708 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.753e+02 2.020e+02 2.242e+02 3.524e+02, threshold=4.040e+02, percent-clipped=0.0 2023-06-23 19:03:18,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=152273.33333333334, ans=0.125 2023-06-23 19:03:33,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=152340.0, ans=0.2 2023-06-23 19:03:37,433 INFO [train.py:1008] (1/4) Epoch 43, batch 500, loss[loss=0.196, simple_loss=0.2781, pruned_loss=0.05692, over 19109.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.276, pruned_loss=0.06094, over 3443872.94 frames. ], batch size: 94, lr: 7.68e-03, grad_scale: 32.0 2023-06-23 19:03:48,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=152406.66666666666, ans=0.2 2023-06-23 19:04:03,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=152473.33333333334, ans=0.0 2023-06-23 19:04:05,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=152473.33333333334, ans=0.125 2023-06-23 19:04:16,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=152540.0, ans=0.125 2023-06-23 19:04:51,188 INFO [train.py:1008] (1/4) Epoch 44, batch 0, loss[loss=0.1954, simple_loss=0.2703, pruned_loss=0.06025, over 20142.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2703, pruned_loss=0.06025, over 20142.00 frames. ], batch size: 133, lr: 7.58e-03, grad_scale: 32.0 2023-06-23 19:04:51,189 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 19:04:56,833 INFO [train.py:1040] (1/4) Epoch 44, validation: loss=0.1969, simple_loss=0.2906, pruned_loss=0.05158, over 143649.00 frames. 
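The per-batch entries from train.py:1008 above all share one fixed layout: a current-batch loss[...] block, a tot_loss[...] block accumulated over the frames processed so far in the epoch, then the batch size, the learning rate, and the current grad_scale. The following is a minimal stand-alone sketch for pulling those numbers out of a log like this one; it is not part of icefall, and the regex, the function names, and the train-log.txt path are illustrative assumptions, not anything the training code itself provides.

import re

# Hypothetical helper: extract the "Epoch N, batch B, loss[...] tot_loss[...]"
# records emitted by train.py:1008 so the running loss and lr curves can be
# tabulated or plotted. The pattern mirrors the entry format visible above.
BATCH_RE = re.compile(
    r"Epoch (?P<epoch>\d+), batch (?P<batch>\d+), "
    r"loss\[(?P<cur>[^\]]*)\], tot_loss\[(?P<tot>[^\]]*)\], "
    r"batch size: (?P<bs>\d+), lr: (?P<lr>[\d.eE+-]+), grad_scale: (?P<gs>[\d.]+)"
)

def parse_fields(blob: str) -> dict:
    """Turn 'loss=0.2048, simple_loss=0.2792, ... over 848220.58 frames.' into a dict."""
    out = {key: float(val) for key, val in re.findall(r"(\w+)=([\d.eE+-]+)", blob)}
    m = re.search(r"over ([\d.]+) frames", blob)
    if m:
        out["frames"] = float(m.group(1))
    return out

def parse_log(path: str):
    """Yield one record per training-batch entry found in the log file."""
    with open(path) as f:
        text = f.read()
    for m in BATCH_RE.finditer(text):
        yield {
            "epoch": int(m.group("epoch")),
            "batch": int(m.group("batch")),
            "batch_size": int(m.group("bs")),
            "lr": float(m.group("lr")),
            "grad_scale": float(m.group("gs")),
            "cur": parse_fields(m.group("cur")),   # this batch only
            "tot": parse_fields(m.group("tot")),   # running average over the epoch so far
        }

if __name__ == "__main__":
    # "train-log.txt" is a placeholder path for a saved copy of this console log.
    for rec in parse_log("train-log.txt"):
        print(rec["epoch"], rec["batch"], rec["tot"].get("loss"), rec["lr"])

Run over a saved copy of this log, such a parser would recover, for example, the tot_loss trajectory from 0.2048 at epoch 40, batch 50 down to about 0.197 by epoch 44, batch 500, without needing the TensorBoard event files.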
2023-06-23 19:04:56,834 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 19:05:25,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=152693.33333333334, ans=0.2 2023-06-23 19:05:25,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=152693.33333333334, ans=10.0 2023-06-23 19:05:44,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=152826.66666666666, ans=0.125 2023-06-23 19:05:47,385 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.798e+02 2.006e+02 2.189e+02 3.546e+02, threshold=4.013e+02, percent-clipped=0.0 2023-06-23 19:05:53,985 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.86 vs. limit=15.0 2023-06-23 19:06:14,833 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=10.02 vs. limit=15.0 2023-06-23 19:06:18,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=152960.0, ans=0.0 2023-06-23 19:06:19,934 INFO [train.py:1008] (1/4) Epoch 44, batch 50, loss[loss=0.1903, simple_loss=0.2723, pruned_loss=0.05417, over 19213.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2758, pruned_loss=0.05919, over 851133.32 frames. ], batch size: 92, lr: 7.58e-03, grad_scale: 32.0 2023-06-23 19:06:21,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=152960.0, ans=0.125 2023-06-23 19:06:32,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=152960.0, ans=0.1 2023-06-23 19:06:43,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=153026.66666666666, ans=0.125 2023-06-23 19:06:59,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=153093.33333333334, ans=0.0 2023-06-23 19:07:16,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=153160.0, ans=0.125 2023-06-23 19:07:41,187 INFO [train.py:1008] (1/4) Epoch 44, batch 100, loss[loss=0.1996, simple_loss=0.2669, pruned_loss=0.06614, over 20256.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2765, pruned_loss=0.06137, over 1487725.77 frames. 
], batch size: 239, lr: 7.57e-03, grad_scale: 32.0 2023-06-23 19:07:47,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=153293.33333333334, ans=0.95 2023-06-23 19:07:58,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=153360.0, ans=0.04949747468305833 2023-06-23 19:08:08,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=153360.0, ans=0.125 2023-06-23 19:08:19,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=153426.66666666666, ans=0.0 2023-06-23 19:08:32,163 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.721e+02 1.906e+02 2.161e+02 3.199e+02, threshold=3.812e+02, percent-clipped=0.0 2023-06-23 19:08:34,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=153493.33333333334, ans=0.0 2023-06-23 19:08:34,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=153493.33333333334, ans=0.1 2023-06-23 19:08:37,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=153493.33333333334, ans=0.0 2023-06-23 19:08:52,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=153560.0, ans=0.125 2023-06-23 19:08:55,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=153560.0, ans=0.1 2023-06-23 19:08:55,845 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.14 vs. limit=10.0 2023-06-23 19:09:03,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=153626.66666666666, ans=0.2 2023-06-23 19:09:04,504 INFO [train.py:1008] (1/4) Epoch 44, batch 150, loss[loss=0.2003, simple_loss=0.2793, pruned_loss=0.06062, over 18311.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2751, pruned_loss=0.06081, over 1999499.89 frames. ], batch size: 74, lr: 7.56e-03, grad_scale: 32.0 2023-06-23 19:09:13,068 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.86 vs. limit=22.5 2023-06-23 19:09:48,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=153760.0, ans=0.0 2023-06-23 19:10:09,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=153893.33333333334, ans=0.125 2023-06-23 19:10:26,214 INFO [train.py:1008] (1/4) Epoch 44, batch 200, loss[loss=0.1944, simple_loss=0.2704, pruned_loss=0.05925, over 19968.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2755, pruned_loss=0.06095, over 2367681.95 frames. 
], batch size: 126, lr: 7.56e-03, grad_scale: 32.0 2023-06-23 19:10:28,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=153960.0, ans=0.125 2023-06-23 19:10:42,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=154026.66666666666, ans=0.1 2023-06-23 19:11:00,588 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-23 19:11:02,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=154093.33333333334, ans=0.09899494936611666 2023-06-23 19:11:16,798 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.708e+02 1.966e+02 2.304e+02 3.229e+02, threshold=3.932e+02, percent-clipped=0.0 2023-06-23 19:11:21,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=154160.0, ans=0.125 2023-06-23 19:11:49,417 INFO [train.py:1008] (1/4) Epoch 44, batch 250, loss[loss=0.1903, simple_loss=0.2695, pruned_loss=0.05554, over 18925.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.2736, pruned_loss=0.06008, over 2685247.02 frames. ], batch size: 86, lr: 7.55e-03, grad_scale: 32.0 2023-06-23 19:11:50,623 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.47 vs. limit=15.0 2023-06-23 19:11:57,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=154293.33333333334, ans=0.5 2023-06-23 19:12:11,142 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.53 vs. limit=22.5 2023-06-23 19:12:35,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.min_positive, batch_count=154426.66666666666, ans=0.05 2023-06-23 19:13:11,231 INFO [train.py:1008] (1/4) Epoch 44, batch 300, loss[loss=0.1864, simple_loss=0.2696, pruned_loss=0.05159, over 18793.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2745, pruned_loss=0.06001, over 2920629.59 frames. 
], batch size: 83, lr: 7.54e-03, grad_scale: 32.0 2023-06-23 19:13:39,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=154693.33333333334, ans=0.1 2023-06-23 19:13:42,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=154760.0, ans=0.0 2023-06-23 19:13:47,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=154760.0, ans=0.0 2023-06-23 19:14:00,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=154826.66666666666, ans=0.2 2023-06-23 19:14:02,956 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.753e+02 1.948e+02 2.277e+02 3.619e+02, threshold=3.896e+02, percent-clipped=0.0 2023-06-23 19:14:21,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=154893.33333333334, ans=0.125 2023-06-23 19:14:33,599 INFO [train.py:1008] (1/4) Epoch 44, batch 350, loss[loss=0.2048, simple_loss=0.2474, pruned_loss=0.08109, over 16945.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2745, pruned_loss=0.05946, over 3119942.21 frames. ], batch size: 392, lr: 7.53e-03, grad_scale: 32.0 2023-06-23 19:14:47,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=154960.0, ans=0.1 2023-06-23 19:15:01,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=155026.66666666666, ans=0.0 2023-06-23 19:15:07,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=155093.33333333334, ans=0.0 2023-06-23 19:15:13,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=155093.33333333334, ans=0.2 2023-06-23 19:15:21,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=155093.33333333334, ans=0.125 2023-06-23 19:15:25,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=155160.0, ans=0.125 2023-06-23 19:15:35,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=155160.0, ans=0.0 2023-06-23 19:15:57,316 INFO [train.py:1008] (1/4) Epoch 44, batch 400, loss[loss=0.2006, simple_loss=0.2742, pruned_loss=0.06349, over 20277.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2753, pruned_loss=0.06004, over 3243048.40 frames. ], batch size: 141, lr: 7.53e-03, grad_scale: 32.0 2023-06-23 19:16:06,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=155293.33333333334, ans=0.0 2023-06-23 19:16:24,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=155360.0, ans=0.0 2023-06-23 19:16:37,861 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.68 vs. 
limit=12.0 2023-06-23 19:16:48,289 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.681e+02 1.920e+02 2.248e+02 3.618e+02, threshold=3.840e+02, percent-clipped=0.0 2023-06-23 19:17:19,733 INFO [train.py:1008] (1/4) Epoch 44, batch 450, loss[loss=0.1984, simple_loss=0.2788, pruned_loss=0.05904, over 18292.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2755, pruned_loss=0.0603, over 3359052.48 frames. ], batch size: 74, lr: 7.52e-03, grad_scale: 32.0 2023-06-23 19:17:33,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=155626.66666666666, ans=0.1 2023-06-23 19:17:50,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=155693.33333333334, ans=0.125 2023-06-23 19:17:57,678 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-23 19:18:06,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.whiten.whitening_limit, batch_count=155760.0, ans=12.0 2023-06-23 19:18:21,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=155826.66666666666, ans=0.125 2023-06-23 19:18:40,433 INFO [train.py:1008] (1/4) Epoch 44, batch 500, loss[loss=0.1977, simple_loss=0.2876, pruned_loss=0.05391, over 17661.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2751, pruned_loss=0.0597, over 3454683.28 frames. ], batch size: 67, lr: 7.51e-03, grad_scale: 32.0 2023-06-23 19:18:56,108 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.12 vs. limit=15.0 2023-06-23 19:19:29,816 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.329e+02 1.714e+02 1.879e+02 2.067e+02 4.059e+02, threshold=3.757e+02, percent-clipped=1.0 2023-06-23 19:19:48,883 INFO [train.py:1008] (1/4) Epoch 45, batch 0, loss[loss=0.1895, simple_loss=0.2694, pruned_loss=0.05478, over 19336.00 frames. ], tot_loss[loss=0.1895, simple_loss=0.2694, pruned_loss=0.05478, over 19336.00 frames. ], batch size: 98, lr: 7.42e-03, grad_scale: 32.0 2023-06-23 19:19:48,884 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 19:19:54,596 INFO [train.py:1040] (1/4) Epoch 45, validation: loss=0.1941, simple_loss=0.2881, pruned_loss=0.05006, over 143649.00 frames. 2023-06-23 19:19:54,597 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 19:20:07,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=156173.33333333334, ans=0.1 2023-06-23 19:20:15,672 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.52 vs. limit=22.5 2023-06-23 19:20:26,402 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.33 vs. 
limit=15.0 2023-06-23 19:20:40,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=156306.66666666666, ans=0.0 2023-06-23 19:20:45,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=156373.33333333334, ans=0.1 2023-06-23 19:20:48,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=156373.33333333334, ans=0.125 2023-06-23 19:20:51,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=156373.33333333334, ans=0.125 2023-06-23 19:20:54,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=156373.33333333334, ans=0.0 2023-06-23 19:21:17,119 INFO [train.py:1008] (1/4) Epoch 45, batch 50, loss[loss=0.1862, simple_loss=0.2717, pruned_loss=0.0503, over 18774.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.271, pruned_loss=0.05704, over 872209.51 frames. ], batch size: 83, lr: 7.41e-03, grad_scale: 32.0 2023-06-23 19:21:29,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=156506.66666666666, ans=0.125 2023-06-23 19:21:41,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=156573.33333333334, ans=0.95 2023-06-23 19:22:03,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=156640.0, ans=0.125 2023-06-23 19:22:31,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=156773.33333333334, ans=0.125 2023-06-23 19:22:38,918 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.738e+02 1.900e+02 2.147e+02 3.061e+02, threshold=3.801e+02, percent-clipped=0.0 2023-06-23 19:22:40,583 INFO [train.py:1008] (1/4) Epoch 45, batch 100, loss[loss=0.1946, simple_loss=0.2777, pruned_loss=0.05575, over 18588.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2723, pruned_loss=0.05985, over 1516869.05 frames. ], batch size: 80, lr: 7.41e-03, grad_scale: 32.0 2023-06-23 19:23:14,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=156973.33333333334, ans=0.07 2023-06-23 19:23:31,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=157040.0, ans=0.125 2023-06-23 19:23:31,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=157040.0, ans=0.0 2023-06-23 19:23:31,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=157040.0, ans=0.0 2023-06-23 19:23:47,267 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:23:52,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=157106.66666666666, ans=0.07 2023-06-23 19:23:54,982 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. 
limit=6.0 2023-06-23 19:24:04,612 INFO [train.py:1008] (1/4) Epoch 45, batch 150, loss[loss=0.1838, simple_loss=0.2682, pruned_loss=0.04975, over 19693.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.2733, pruned_loss=0.05973, over 2020036.15 frames. ], batch size: 110, lr: 7.40e-03, grad_scale: 32.0 2023-06-23 19:24:18,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=157173.33333333334, ans=0.125 2023-06-23 19:24:31,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=157240.0, ans=0.0 2023-06-23 19:24:55,440 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:25:00,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=157373.33333333334, ans=0.05 2023-06-23 19:25:08,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=157373.33333333334, ans=0.0 2023-06-23 19:25:26,500 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.703e+02 1.879e+02 2.064e+02 3.143e+02, threshold=3.758e+02, percent-clipped=0.0 2023-06-23 19:25:28,175 INFO [train.py:1008] (1/4) Epoch 45, batch 200, loss[loss=0.1768, simple_loss=0.2618, pruned_loss=0.04597, over 19498.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2737, pruned_loss=0.05938, over 2408511.77 frames. ], batch size: 102, lr: 7.39e-03, grad_scale: 32.0 2023-06-23 19:25:36,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=157506.66666666666, ans=0.1 2023-06-23 19:26:17,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=157706.66666666666, ans=0.125 2023-06-23 19:26:41,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=157773.33333333334, ans=0.125 2023-06-23 19:26:53,047 INFO [train.py:1008] (1/4) Epoch 45, batch 250, loss[loss=0.1905, simple_loss=0.2664, pruned_loss=0.05735, over 20217.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2731, pruned_loss=0.05914, over 2716675.76 frames. ], batch size: 141, lr: 7.39e-03, grad_scale: 32.0 2023-06-23 19:27:11,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=157906.66666666666, ans=0.125 2023-06-23 19:27:25,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=157973.33333333334, ans=0.0 2023-06-23 19:28:03,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=158106.66666666666, ans=0.1 2023-06-23 19:28:16,265 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.803e+02 2.062e+02 2.505e+02 3.858e+02, threshold=4.125e+02, percent-clipped=1.0 2023-06-23 19:28:17,879 INFO [train.py:1008] (1/4) Epoch 45, batch 300, loss[loss=0.1842, simple_loss=0.2675, pruned_loss=0.0504, over 19801.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2731, pruned_loss=0.05879, over 2964698.80 frames. 
], batch size: 115, lr: 7.38e-03, grad_scale: 32.0 2023-06-23 19:28:29,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=158173.33333333334, ans=0.1 2023-06-23 19:28:41,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=158240.0, ans=0.125 2023-06-23 19:28:56,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=158306.66666666666, ans=0.125 2023-06-23 19:28:59,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=158306.66666666666, ans=10.0 2023-06-23 19:29:05,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=158306.66666666666, ans=0.125 2023-06-23 19:29:42,254 INFO [train.py:1008] (1/4) Epoch 45, batch 350, loss[loss=0.1862, simple_loss=0.2721, pruned_loss=0.05016, over 19220.00 frames. ], tot_loss[loss=0.1947, simple_loss=0.2723, pruned_loss=0.0586, over 3150368.92 frames. ], batch size: 92, lr: 7.37e-03, grad_scale: 32.0 2023-06-23 19:29:51,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=158506.66666666666, ans=0.125 2023-06-23 19:30:11,263 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-23 19:30:24,493 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-23 19:30:28,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=158640.0, ans=0.2 2023-06-23 19:30:32,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=158706.66666666666, ans=0.125 2023-06-23 19:30:44,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=158706.66666666666, ans=0.125 2023-06-23 19:30:50,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=158773.33333333334, ans=0.125 2023-06-23 19:30:57,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=158773.33333333334, ans=0.0 2023-06-23 19:31:07,426 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.753e+02 1.983e+02 2.244e+02 3.226e+02, threshold=3.966e+02, percent-clipped=0.0 2023-06-23 19:31:07,472 INFO [train.py:1008] (1/4) Epoch 45, batch 400, loss[loss=0.196, simple_loss=0.2686, pruned_loss=0.0617, over 20583.00 frames. ], tot_loss[loss=0.1947, simple_loss=0.2725, pruned_loss=0.05849, over 3291919.90 frames. ], batch size: 189, lr: 7.36e-03, grad_scale: 32.0 2023-06-23 19:31:17,316 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.43 vs. 
limit=12.0 2023-06-23 19:31:44,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=158973.33333333334, ans=0.125 2023-06-23 19:32:07,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=159040.0, ans=0.1 2023-06-23 19:32:24,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=159106.66666666666, ans=0.1 2023-06-23 19:32:30,811 INFO [train.py:1008] (1/4) Epoch 45, batch 450, loss[loss=0.2016, simple_loss=0.2761, pruned_loss=0.06354, over 19776.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2735, pruned_loss=0.05849, over 3384364.45 frames. ], batch size: 115, lr: 7.36e-03, grad_scale: 32.0 2023-06-23 19:33:00,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=159240.0, ans=0.0 2023-06-23 19:33:19,537 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:33:33,791 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:33:41,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=159440.0, ans=0.125 2023-06-23 19:33:44,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=159440.0, ans=0.0 2023-06-23 19:33:52,060 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.694e+02 1.864e+02 2.179e+02 3.076e+02, threshold=3.729e+02, percent-clipped=0.0 2023-06-23 19:33:52,106 INFO [train.py:1008] (1/4) Epoch 45, batch 500, loss[loss=0.205, simple_loss=0.2966, pruned_loss=0.05664, over 15136.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2736, pruned_loss=0.05903, over 3447142.77 frames. ], batch size: 43, lr: 7.35e-03, grad_scale: 32.0 2023-06-23 19:34:59,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=159726.66666666666, ans=0.0 2023-06-23 19:35:08,887 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.42 vs. limit=15.0 2023-06-23 19:35:09,617 INFO [train.py:1008] (1/4) Epoch 46, batch 0, loss[loss=0.1865, simple_loss=0.2676, pruned_loss=0.0527, over 19673.00 frames. ], tot_loss[loss=0.1865, simple_loss=0.2676, pruned_loss=0.0527, over 19673.00 frames. ], batch size: 110, lr: 7.27e-03, grad_scale: 32.0 2023-06-23 19:35:09,617 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 19:35:15,389 INFO [train.py:1040] (1/4) Epoch 46, validation: loss=0.1946, simple_loss=0.2884, pruned_loss=0.05047, over 143649.00 frames. 
2023-06-23 19:35:15,390 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 19:35:32,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=159793.33333333334, ans=0.125 2023-06-23 19:35:32,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=159793.33333333334, ans=0.125 2023-06-23 19:35:35,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=159793.33333333334, ans=0.0 2023-06-23 19:35:38,184 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-23 19:35:39,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=159793.33333333334, ans=0.125 2023-06-23 19:36:03,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=159926.66666666666, ans=0.0 2023-06-23 19:36:40,393 INFO [train.py:1008] (1/4) Epoch 46, batch 50, loss[loss=0.1977, simple_loss=0.2827, pruned_loss=0.05634, over 16302.00 frames. ], tot_loss[loss=0.194, simple_loss=0.2737, pruned_loss=0.05719, over 860357.40 frames. ], batch size: 52, lr: 7.26e-03, grad_scale: 32.0 2023-06-23 19:37:07,793 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.754e+02 1.890e+02 2.079e+02 2.977e+02, threshold=3.779e+02, percent-clipped=0.0 2023-06-23 19:37:40,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=160260.0, ans=0.125 2023-06-23 19:38:02,591 INFO [train.py:1008] (1/4) Epoch 46, batch 100, loss[loss=0.1901, simple_loss=0.2663, pruned_loss=0.05692, over 20612.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2745, pruned_loss=0.05821, over 1508305.63 frames. ], batch size: 173, lr: 7.25e-03, grad_scale: 32.0 2023-06-23 19:38:30,901 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.37 vs. limit=15.0 2023-06-23 19:38:44,484 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.54 vs. limit=15.0 2023-06-23 19:39:01,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=160593.33333333334, ans=0.0 2023-06-23 19:39:13,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=160660.0, ans=0.125 2023-06-23 19:39:22,616 INFO [train.py:1008] (1/4) Epoch 46, batch 150, loss[loss=0.2025, simple_loss=0.2824, pruned_loss=0.06133, over 11095.00 frames. ], tot_loss[loss=0.1937, simple_loss=0.2728, pruned_loss=0.05736, over 2024665.73 frames. 
], batch size: 31, lr: 7.24e-03, grad_scale: 32.0 2023-06-23 19:39:42,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=160793.33333333334, ans=0.0 2023-06-23 19:39:50,369 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.683e+02 1.848e+02 2.036e+02 2.686e+02, threshold=3.697e+02, percent-clipped=0.0 2023-06-23 19:39:54,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=160860.0, ans=0.0 2023-06-23 19:40:29,190 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=14.34 vs. limit=22.5 2023-06-23 19:40:44,462 INFO [train.py:1008] (1/4) Epoch 46, batch 200, loss[loss=0.1776, simple_loss=0.2623, pruned_loss=0.04641, over 19053.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.2739, pruned_loss=0.0576, over 2392897.36 frames. ], batch size: 89, lr: 7.24e-03, grad_scale: 32.0 2023-06-23 19:41:06,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=161126.66666666666, ans=0.125 2023-06-23 19:41:25,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=161193.33333333334, ans=0.125 2023-06-23 19:41:45,736 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.80 vs. limit=15.0 2023-06-23 19:42:05,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=161326.66666666666, ans=0.125 2023-06-23 19:42:06,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=161393.33333333334, ans=0.125 2023-06-23 19:42:08,719 INFO [train.py:1008] (1/4) Epoch 46, batch 250, loss[loss=0.2013, simple_loss=0.2783, pruned_loss=0.06212, over 20483.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2737, pruned_loss=0.05815, over 2698551.69 frames. ], batch size: 160, lr: 7.23e-03, grad_scale: 32.0 2023-06-23 19:42:12,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=161393.33333333334, ans=0.1 2023-06-23 19:42:15,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=161393.33333333334, ans=0.0 2023-06-23 19:42:28,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=161460.0, ans=0.0 2023-06-23 19:42:32,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=161460.0, ans=0.2 2023-06-23 19:42:36,530 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.820e+02 2.032e+02 2.554e+02 4.119e+02, threshold=4.065e+02, percent-clipped=4.0 2023-06-23 19:42:54,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=161526.66666666666, ans=0.0 2023-06-23 19:42:59,187 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=7.37 vs. 
limit=15.0 2023-06-23 19:43:05,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=161593.33333333334, ans=0.1 2023-06-23 19:43:09,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=161593.33333333334, ans=0.1 2023-06-23 19:43:21,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=161660.0, ans=0.0 2023-06-23 19:43:29,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=161726.66666666666, ans=0.125 2023-06-23 19:43:30,573 INFO [train.py:1008] (1/4) Epoch 46, batch 300, loss[loss=0.1972, simple_loss=0.2797, pruned_loss=0.05734, over 18302.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2736, pruned_loss=0.0585, over 2935824.73 frames. ], batch size: 74, lr: 7.22e-03, grad_scale: 32.0 2023-06-23 19:43:32,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=161726.66666666666, ans=0.125 2023-06-23 19:43:39,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=161726.66666666666, ans=0.0 2023-06-23 19:43:48,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=161793.33333333334, ans=0.0 2023-06-23 19:44:04,042 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.25 vs. limit=10.0 2023-06-23 19:44:12,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=161860.0, ans=0.125 2023-06-23 19:44:27,960 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.98 vs. limit=10.0 2023-06-23 19:44:38,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=161993.33333333334, ans=0.125 2023-06-23 19:44:48,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=161993.33333333334, ans=0.0 2023-06-23 19:44:51,367 INFO [train.py:1008] (1/4) Epoch 46, batch 350, loss[loss=0.2027, simple_loss=0.2768, pruned_loss=0.06432, over 19073.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2733, pruned_loss=0.05846, over 3130001.28 frames. ], batch size: 94, lr: 7.22e-03, grad_scale: 32.0 2023-06-23 19:45:04,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=162060.0, ans=0.125 2023-06-23 19:45:09,460 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.75 vs. 
limit=15.0 2023-06-23 19:45:19,776 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.692e+02 1.874e+02 2.105e+02 2.686e+02, threshold=3.748e+02, percent-clipped=0.0 2023-06-23 19:45:30,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=162193.33333333334, ans=0.1 2023-06-23 19:45:35,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=162193.33333333334, ans=0.2 2023-06-23 19:45:42,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=162260.0, ans=0.125 2023-06-23 19:46:14,893 INFO [train.py:1008] (1/4) Epoch 46, batch 400, loss[loss=0.1824, simple_loss=0.2657, pruned_loss=0.04952, over 18609.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2735, pruned_loss=0.05838, over 3276304.87 frames. ], batch size: 80, lr: 7.21e-03, grad_scale: 32.0 2023-06-23 19:46:24,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=162393.33333333334, ans=0.0 2023-06-23 19:46:34,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=162460.0, ans=0.0 2023-06-23 19:46:49,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=162526.66666666666, ans=0.1 2023-06-23 19:47:36,237 INFO [train.py:1008] (1/4) Epoch 46, batch 450, loss[loss=0.1833, simple_loss=0.2677, pruned_loss=0.04945, over 19527.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2738, pruned_loss=0.05825, over 3380388.87 frames. ], batch size: 102, lr: 7.20e-03, grad_scale: 32.0 2023-06-23 19:47:47,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=162726.66666666666, ans=0.125 2023-06-23 19:47:48,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=162726.66666666666, ans=0.0 2023-06-23 19:47:59,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=162793.33333333334, ans=0.125 2023-06-23 19:48:04,461 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.727e+02 2.003e+02 2.275e+02 3.093e+02, threshold=4.006e+02, percent-clipped=0.0 2023-06-23 19:48:05,332 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.55 vs. limit=15.0 2023-06-23 19:48:11,358 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.19 vs. 
limit=22.5 2023-06-23 19:48:20,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=162860.0, ans=0.1 2023-06-23 19:48:40,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=162993.33333333334, ans=0.0 2023-06-23 19:48:41,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=162993.33333333334, ans=0.2 2023-06-23 19:48:43,090 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:48:55,566 INFO [train.py:1008] (1/4) Epoch 46, batch 500, loss[loss=0.1828, simple_loss=0.2505, pruned_loss=0.05761, over 20366.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2739, pruned_loss=0.05843, over 3457149.49 frames. ], batch size: 239, lr: 7.20e-03, grad_scale: 32.0 2023-06-23 19:49:40,508 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=6.0 2023-06-23 19:50:03,331 INFO [train.py:1008] (1/4) Epoch 47, batch 0, loss[loss=0.1949, simple_loss=0.2687, pruned_loss=0.06056, over 20192.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2687, pruned_loss=0.06056, over 20192.00 frames. ], batch size: 239, lr: 7.11e-03, grad_scale: 32.0 2023-06-23 19:50:03,331 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 19:50:08,999 INFO [train.py:1040] (1/4) Epoch 47, validation: loss=0.1939, simple_loss=0.288, pruned_loss=0.04994, over 143649.00 frames. 2023-06-23 19:50:08,999 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 19:50:14,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=163280.0, ans=0.125 2023-06-23 19:50:25,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=163346.66666666666, ans=0.0 2023-06-23 19:50:31,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=163346.66666666666, ans=0.125 2023-06-23 19:50:43,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=163413.33333333334, ans=0.2 2023-06-23 19:51:04,070 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-23 19:51:04,884 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.745e+02 1.917e+02 2.227e+02 3.172e+02, threshold=3.834e+02, percent-clipped=0.0 2023-06-23 19:51:06,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=163480.0, ans=0.09899494936611666 2023-06-23 19:51:06,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=163480.0, ans=0.125 2023-06-23 19:51:24,546 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.06 vs. limit=15.0 2023-06-23 19:51:31,074 INFO [train.py:1008] (1/4) Epoch 47, batch 50, loss[loss=0.2033, simple_loss=0.2749, pruned_loss=0.06581, over 20307.00 frames. 
], tot_loss[loss=0.1937, simple_loss=0.2721, pruned_loss=0.05766, over 876259.42 frames. ], batch size: 141, lr: 7.11e-03, grad_scale: 32.0 2023-06-23 19:51:36,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=163613.33333333334, ans=0.05 2023-06-23 19:51:47,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=163680.0, ans=0.125 2023-06-23 19:51:55,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=163680.0, ans=0.125 2023-06-23 19:52:27,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=163813.33333333334, ans=0.125 2023-06-23 19:52:27,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=163813.33333333334, ans=0.1 2023-06-23 19:52:53,248 INFO [train.py:1008] (1/4) Epoch 47, batch 100, loss[loss=0.2049, simple_loss=0.2746, pruned_loss=0.06764, over 20281.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2726, pruned_loss=0.05813, over 1529185.72 frames. ], batch size: 149, lr: 7.10e-03, grad_scale: 32.0 2023-06-23 19:53:33,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=164080.0, ans=0.5 2023-06-23 19:53:45,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=164146.66666666666, ans=0.125 2023-06-23 19:53:48,246 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.729e+02 1.911e+02 2.192e+02 3.082e+02, threshold=3.821e+02, percent-clipped=0.0 2023-06-23 19:53:55,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=164146.66666666666, ans=0.0 2023-06-23 19:53:58,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=164213.33333333334, ans=0.0 2023-06-23 19:54:11,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=164213.33333333334, ans=0.125 2023-06-23 19:54:14,610 INFO [train.py:1008] (1/4) Epoch 47, batch 150, loss[loss=0.1847, simple_loss=0.2591, pruned_loss=0.05521, over 20646.00 frames. ], tot_loss[loss=0.1935, simple_loss=0.2718, pruned_loss=0.05759, over 2030169.04 frames. 
], batch size: 211, lr: 7.10e-03, grad_scale: 32.0 2023-06-23 19:54:17,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=164280.0, ans=0.2 2023-06-23 19:54:35,085 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:54:50,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=164413.33333333334, ans=0.05 2023-06-23 19:55:03,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff3.min_abs, batch_count=164480.0, ans=0.2 2023-06-23 19:55:06,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=164480.0, ans=0.2 2023-06-23 19:55:10,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=164480.0, ans=0.1 2023-06-23 19:55:38,243 INFO [train.py:1008] (1/4) Epoch 47, batch 200, loss[loss=0.1804, simple_loss=0.2656, pruned_loss=0.04761, over 18608.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2721, pruned_loss=0.05798, over 2417173.12 frames. ], batch size: 80, lr: 7.09e-03, grad_scale: 32.0 2023-06-23 19:55:58,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=164680.0, ans=0.2 2023-06-23 19:56:09,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=164746.66666666666, ans=0.125 2023-06-23 19:56:34,203 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.753e+02 1.915e+02 2.109e+02 2.762e+02, threshold=3.830e+02, percent-clipped=0.0 2023-06-23 19:56:39,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=164813.33333333334, ans=0.2 2023-06-23 19:56:58,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=164880.0, ans=0.1 2023-06-23 19:57:01,083 INFO [train.py:1008] (1/4) Epoch 47, batch 250, loss[loss=0.1727, simple_loss=0.2551, pruned_loss=0.04515, over 19877.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2716, pruned_loss=0.05748, over 2731810.88 frames. ], batch size: 120, lr: 7.08e-03, grad_scale: 32.0 2023-06-23 19:57:10,353 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.03 vs. limit=15.0 2023-06-23 19:57:13,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=164946.66666666666, ans=0.1 2023-06-23 19:57:47,834 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. 
limit=15.0 2023-06-23 19:57:59,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=165146.66666666666, ans=0.125 2023-06-23 19:58:04,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.min_positive, batch_count=165146.66666666666, ans=0.025 2023-06-23 19:58:14,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=165213.33333333334, ans=0.07 2023-06-23 19:58:23,437 INFO [train.py:1008] (1/4) Epoch 47, batch 300, loss[loss=0.1886, simple_loss=0.2607, pruned_loss=0.05823, over 20509.00 frames. ], tot_loss[loss=0.1937, simple_loss=0.2717, pruned_loss=0.05787, over 2949412.69 frames. ], batch size: 189, lr: 7.08e-03, grad_scale: 16.0 2023-06-23 19:58:36,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=165280.0, ans=0.125 2023-06-23 19:58:54,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=165346.66666666666, ans=0.125 2023-06-23 19:59:22,069 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.773e+02 1.990e+02 2.235e+02 3.211e+02, threshold=3.980e+02, percent-clipped=0.0 2023-06-23 19:59:26,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=165480.0, ans=0.125 2023-06-23 19:59:45,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=165613.33333333334, ans=0.125 2023-06-23 19:59:46,572 INFO [train.py:1008] (1/4) Epoch 47, batch 350, loss[loss=0.1907, simple_loss=0.2644, pruned_loss=0.05855, over 20531.00 frames. ], tot_loss[loss=0.1927, simple_loss=0.2707, pruned_loss=0.05734, over 3143180.51 frames. ], batch size: 173, lr: 7.07e-03, grad_scale: 16.0 2023-06-23 20:00:37,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=165813.33333333334, ans=0.125 2023-06-23 20:00:44,910 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.73 vs. limit=22.5 2023-06-23 20:00:45,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=165813.33333333334, ans=0.125 2023-06-23 20:01:04,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=165880.0, ans=0.1 2023-06-23 20:01:10,413 INFO [train.py:1008] (1/4) Epoch 47, batch 400, loss[loss=0.1625, simple_loss=0.2457, pruned_loss=0.03971, over 19851.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2706, pruned_loss=0.05779, over 3277635.52 frames. 
], batch size: 120, lr: 7.06e-03, grad_scale: 32.0 2023-06-23 20:01:29,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=166013.33333333334, ans=0.0 2023-06-23 20:01:32,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=166013.33333333334, ans=0.0 2023-06-23 20:01:38,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=166013.33333333334, ans=0.2 2023-06-23 20:01:50,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=166080.0, ans=0.125 2023-06-23 20:02:09,663 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.715e+02 1.880e+02 2.205e+02 3.538e+02, threshold=3.760e+02, percent-clipped=0.0 2023-06-23 20:02:20,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=166213.33333333334, ans=0.2 2023-06-23 20:02:34,895 INFO [train.py:1008] (1/4) Epoch 47, batch 450, loss[loss=0.2041, simple_loss=0.2803, pruned_loss=0.06392, over 18487.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2713, pruned_loss=0.05736, over 3383173.08 frames. ], batch size: 77, lr: 7.06e-03, grad_scale: 32.0 2023-06-23 20:02:36,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=166280.0, ans=0.0 2023-06-23 20:02:51,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=166346.66666666666, ans=0.125 2023-06-23 20:02:55,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=166346.66666666666, ans=0.1 2023-06-23 20:03:06,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=166413.33333333334, ans=0.035 2023-06-23 20:03:37,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=166480.0, ans=0.125 2023-06-23 20:03:43,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=166546.66666666666, ans=0.05 2023-06-23 20:03:45,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=166546.66666666666, ans=0.2 2023-06-23 20:03:55,642 INFO [train.py:1008] (1/4) Epoch 47, batch 500, loss[loss=0.2063, simple_loss=0.2968, pruned_loss=0.05786, over 18319.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2718, pruned_loss=0.05727, over 3462027.92 frames. 
], batch size: 72, lr: 7.05e-03, grad_scale: 32.0 2023-06-23 20:04:05,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=166613.33333333334, ans=0.0 2023-06-23 20:04:12,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=166680.0, ans=6.0 2023-06-23 20:04:38,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=166746.66666666666, ans=0.0 2023-06-23 20:04:39,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=166746.66666666666, ans=0.125 2023-06-23 20:05:01,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=166826.66666666666, ans=0.2 2023-06-23 20:05:06,512 INFO [train.py:1008] (1/4) Epoch 48, batch 0, loss[loss=0.1876, simple_loss=0.2634, pruned_loss=0.05589, over 20308.00 frames. ], tot_loss[loss=0.1876, simple_loss=0.2634, pruned_loss=0.05589, over 20308.00 frames. ], batch size: 239, lr: 6.97e-03, grad_scale: 32.0 2023-06-23 20:05:06,512 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 20:05:12,223 INFO [train.py:1040] (1/4) Epoch 48, validation: loss=0.196, simple_loss=0.2888, pruned_loss=0.05157, over 143649.00 frames. 2023-06-23 20:05:12,224 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 20:05:16,871 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.645e+02 1.899e+02 2.115e+02 3.302e+02, threshold=3.798e+02, percent-clipped=0.0 2023-06-23 20:05:40,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=166893.33333333334, ans=0.2 2023-06-23 20:06:00,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=167026.66666666666, ans=0.09899494936611666 2023-06-23 20:06:28,957 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=15.0 2023-06-23 20:06:34,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=167160.0, ans=0.95 2023-06-23 20:06:35,224 INFO [train.py:1008] (1/4) Epoch 48, batch 50, loss[loss=0.1893, simple_loss=0.2622, pruned_loss=0.05818, over 20104.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2706, pruned_loss=0.05653, over 857837.42 frames. ], batch size: 239, lr: 6.96e-03, grad_scale: 32.0 2023-06-23 20:07:07,549 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=3.14 vs. limit=12.0 2023-06-23 20:07:58,675 INFO [train.py:1008] (1/4) Epoch 48, batch 100, loss[loss=0.1981, simple_loss=0.2802, pruned_loss=0.05799, over 16778.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2707, pruned_loss=0.05698, over 1499586.98 frames. ], batch size: 59, lr: 6.96e-03, grad_scale: 32.0 2023-06-23 20:07:59,849 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.14 vs. 
limit=15.0 2023-06-23 20:08:03,600 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.804e+02 2.047e+02 2.284e+02 3.996e+02, threshold=4.095e+02, percent-clipped=1.0 2023-06-23 20:08:13,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=167560.0, ans=0.125 2023-06-23 20:08:18,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=167560.0, ans=0.0 2023-06-23 20:08:23,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=167560.0, ans=0.125 2023-06-23 20:08:32,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=167626.66666666666, ans=0.125 2023-06-23 20:08:42,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=167626.66666666666, ans=0.5 2023-06-23 20:08:50,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=167693.33333333334, ans=0.125 2023-06-23 20:08:53,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=167693.33333333334, ans=0.125 2023-06-23 20:09:00,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=167693.33333333334, ans=0.125 2023-06-23 20:09:21,626 INFO [train.py:1008] (1/4) Epoch 48, batch 150, loss[loss=0.209, simple_loss=0.2961, pruned_loss=0.06092, over 16281.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.2713, pruned_loss=0.05724, over 2006036.70 frames. ], batch size: 52, lr: 6.95e-03, grad_scale: 32.0 2023-06-23 20:09:30,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=167826.66666666666, ans=0.2 2023-06-23 20:09:50,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=167893.33333333334, ans=0.2 2023-06-23 20:09:55,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=167960.0, ans=0.125 2023-06-23 20:10:02,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=167960.0, ans=0.035 2023-06-23 20:10:08,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=167960.0, ans=0.1 2023-06-23 20:10:12,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=168026.66666666666, ans=0.0 2023-06-23 20:10:37,457 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.28 vs. limit=15.0 2023-06-23 20:10:43,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=168160.0, ans=0.125 2023-06-23 20:10:44,617 INFO [train.py:1008] (1/4) Epoch 48, batch 200, loss[loss=0.1877, simple_loss=0.2739, pruned_loss=0.05075, over 19461.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.271, pruned_loss=0.05738, over 2379637.71 frames. 
], batch size: 105, lr: 6.95e-03, grad_scale: 16.0 2023-06-23 20:10:51,403 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.772e+02 1.937e+02 2.210e+02 3.460e+02, threshold=3.874e+02, percent-clipped=0.0 2023-06-23 20:10:51,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=168160.0, ans=0.0 2023-06-23 20:11:05,960 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:12:07,140 INFO [train.py:1008] (1/4) Epoch 48, batch 250, loss[loss=0.2054, simple_loss=0.2657, pruned_loss=0.07258, over 19923.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2724, pruned_loss=0.05724, over 2690497.62 frames. ], batch size: 293, lr: 6.94e-03, grad_scale: 16.0 2023-06-23 20:12:20,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=168493.33333333334, ans=0.0 2023-06-23 20:12:20,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=168493.33333333334, ans=0.125 2023-06-23 20:12:21,076 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=15.0 2023-06-23 20:12:34,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=168560.0, ans=0.0 2023-06-23 20:13:05,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=168693.33333333334, ans=0.1 2023-06-23 20:13:29,092 INFO [train.py:1008] (1/4) Epoch 48, batch 300, loss[loss=0.1996, simple_loss=0.2719, pruned_loss=0.06365, over 20451.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.2713, pruned_loss=0.0568, over 2940624.27 frames. ], batch size: 160, lr: 6.93e-03, grad_scale: 16.0 2023-06-23 20:13:34,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=168826.66666666666, ans=0.0 2023-06-23 20:13:35,925 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.680e+02 1.833e+02 2.048e+02 2.749e+02, threshold=3.665e+02, percent-clipped=0.0 2023-06-23 20:13:47,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=168893.33333333334, ans=0.0 2023-06-23 20:14:52,078 INFO [train.py:1008] (1/4) Epoch 48, batch 350, loss[loss=0.1908, simple_loss=0.2687, pruned_loss=0.05644, over 19531.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.271, pruned_loss=0.05661, over 3113865.29 frames. ], batch size: 102, lr: 6.93e-03, grad_scale: 16.0 2023-06-23 20:15:18,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=169226.66666666666, ans=0.125 2023-06-23 20:16:12,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=169426.66666666666, ans=0.0 2023-06-23 20:16:15,498 INFO [train.py:1008] (1/4) Epoch 48, batch 400, loss[loss=0.1993, simple_loss=0.2735, pruned_loss=0.06249, over 20311.00 frames. ], tot_loss[loss=0.1928, simple_loss=0.2715, pruned_loss=0.05699, over 3263970.93 frames. 
], batch size: 149, lr: 6.92e-03, grad_scale: 32.0 2023-06-23 20:16:22,151 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.719e+02 1.891e+02 2.067e+02 3.319e+02, threshold=3.781e+02, percent-clipped=0.0 2023-06-23 20:16:29,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=169493.33333333334, ans=0.125 2023-06-23 20:16:35,122 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=14.76 vs. limit=22.5 2023-06-23 20:17:12,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=169693.33333333334, ans=0.0 2023-06-23 20:17:37,642 INFO [train.py:1008] (1/4) Epoch 48, batch 450, loss[loss=0.2228, simple_loss=0.3104, pruned_loss=0.06758, over 16981.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.2711, pruned_loss=0.05694, over 3394440.54 frames. ], batch size: 60, lr: 6.91e-03, grad_scale: 32.0 2023-06-23 20:17:40,864 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.23 vs. limit=10.0 2023-06-23 20:17:49,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=169826.66666666666, ans=0.025 2023-06-23 20:18:00,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=169893.33333333334, ans=0.125 2023-06-23 20:18:07,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=169893.33333333334, ans=0.0 2023-06-23 20:18:24,881 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=3.50 vs. limit=12.0 2023-06-23 20:18:26,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=170026.66666666666, ans=0.1 2023-06-23 20:18:30,188 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.72 vs. limit=12.0 2023-06-23 20:18:45,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=170093.33333333334, ans=0.1 2023-06-23 20:18:57,027 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.29 vs. limit=15.0 2023-06-23 20:18:59,411 INFO [train.py:1008] (1/4) Epoch 48, batch 500, loss[loss=0.1908, simple_loss=0.2609, pruned_loss=0.06035, over 20348.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.271, pruned_loss=0.05687, over 3490478.99 frames. 
], batch size: 239, lr: 6.91e-03, grad_scale: 32.0 2023-06-23 20:19:06,600 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.730e+02 1.897e+02 2.100e+02 3.393e+02, threshold=3.794e+02, percent-clipped=0.0 2023-06-23 20:19:14,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=170226.66666666666, ans=0.125 2023-06-23 20:19:28,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=170226.66666666666, ans=0.07 2023-06-23 20:19:36,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=170293.33333333334, ans=0.125 2023-06-23 20:19:36,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=170293.33333333334, ans=0.1 2023-06-23 20:20:11,719 INFO [train.py:1008] (1/4) Epoch 49, batch 0, loss[loss=0.2061, simple_loss=0.2779, pruned_loss=0.0672, over 19947.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2779, pruned_loss=0.0672, over 19947.00 frames. ], batch size: 126, lr: 6.83e-03, grad_scale: 32.0 2023-06-23 20:20:11,720 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 20:20:17,873 INFO [train.py:1040] (1/4) Epoch 49, validation: loss=0.1978, simple_loss=0.2901, pruned_loss=0.05269, over 143649.00 frames. 2023-06-23 20:20:17,874 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 20:20:37,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=170440.0, ans=0.125 2023-06-23 20:20:38,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=170440.0, ans=0.125 2023-06-23 20:20:39,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=170440.0, ans=0.0 2023-06-23 20:20:57,268 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.25 vs. limit=10.0 2023-06-23 20:20:58,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=170506.66666666666, ans=0.0 2023-06-23 20:21:36,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=170640.0, ans=0.0 2023-06-23 20:21:41,449 INFO [train.py:1008] (1/4) Epoch 49, batch 50, loss[loss=0.1789, simple_loss=0.2639, pruned_loss=0.04696, over 18643.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2706, pruned_loss=0.05778, over 858243.03 frames. 
], batch size: 80, lr: 6.83e-03, grad_scale: 32.0 2023-06-23 20:21:58,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=170773.33333333334, ans=0.125 2023-06-23 20:22:11,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=170773.33333333334, ans=0.0 2023-06-23 20:22:13,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=170840.0, ans=0.125 2023-06-23 20:22:19,630 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.847e+02 2.066e+02 2.446e+02 3.977e+02, threshold=4.132e+02, percent-clipped=1.0 2023-06-23 20:22:22,267 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.95 vs. limit=15.0 2023-06-23 20:23:03,337 INFO [train.py:1008] (1/4) Epoch 49, batch 100, loss[loss=0.1831, simple_loss=0.2693, pruned_loss=0.0485, over 18315.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2704, pruned_loss=0.05678, over 1506683.14 frames. ], batch size: 74, lr: 6.82e-03, grad_scale: 32.0 2023-06-23 20:23:08,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=171040.0, ans=0.05 2023-06-23 20:23:21,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=171106.66666666666, ans=0.0 2023-06-23 20:23:49,354 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.82 vs. limit=15.0 2023-06-23 20:24:08,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=171306.66666666666, ans=0.0 2023-06-23 20:24:26,710 INFO [train.py:1008] (1/4) Epoch 49, batch 150, loss[loss=0.1898, simple_loss=0.274, pruned_loss=0.05284, over 19122.00 frames. ], tot_loss[loss=0.1922, simple_loss=0.2702, pruned_loss=0.05706, over 2020217.57 frames. ], batch size: 94, lr: 6.81e-03, grad_scale: 32.0 2023-06-23 20:24:26,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=171373.33333333334, ans=0.125 2023-06-23 20:24:48,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=171440.0, ans=10.0 2023-06-23 20:25:04,228 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.656e+02 1.789e+02 2.179e+02 3.725e+02, threshold=3.578e+02, percent-clipped=0.0 2023-06-23 20:25:48,577 INFO [train.py:1008] (1/4) Epoch 49, batch 200, loss[loss=0.1982, simple_loss=0.2906, pruned_loss=0.05289, over 17635.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.2712, pruned_loss=0.05697, over 2405783.26 frames. 
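
Editor's note: the grad-norm quartile lines can be read as the min, 25%, median, 75% and max of recent gradient norms; in every such line above the reported threshold equals Clipping_scale times the median (e.g. 2.0 * 1.789e+02 = 3.578e+02), and percent-clipped is the share of steps whose norm exceeded that threshold. A sketch that reproduces just the printed quantities; the optimizer's real bookkeeping in optim.py may differ:

import torch

def clipping_report(recent_grad_norms, clipping_scale=2.0):
    norms = torch.tensor(recent_grad_norms, dtype=torch.float32)
    # min, 25%, median, 75%, max of the recent gradient norms
    q = torch.quantile(norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2].item()          # scale * median
    pct_clipped = 100.0 * (norms > threshold).float().mean().item()
    print("grad-norm quartiles "
          + " ".join(f"{v:.3e}" for v in q.tolist())
          + f", threshold={threshold:.3e}, percent-clipped={pct_clipped:.1f}")
    return threshold

# illustrative window of recent norms, chosen to match the first line above
clipping_report([135.3, 171.9, 189.1, 206.7, 331.9])
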
], batch size: 67, lr: 6.81e-03, grad_scale: 32.0 2023-06-23 20:26:30,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=171840.0, ans=0.1 2023-06-23 20:26:48,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=171906.66666666666, ans=0.125 2023-06-23 20:27:11,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=172040.0, ans=0.0 2023-06-23 20:27:12,561 INFO [train.py:1008] (1/4) Epoch 49, batch 250, loss[loss=0.1892, simple_loss=0.2713, pruned_loss=0.05353, over 19848.00 frames. ], tot_loss[loss=0.1925, simple_loss=0.2711, pruned_loss=0.05696, over 2719227.32 frames. ], batch size: 115, lr: 6.80e-03, grad_scale: 32.0 2023-06-23 20:27:16,719 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-23 20:27:29,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=172106.66666666666, ans=0.0 2023-06-23 20:27:48,875 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.691e+02 1.914e+02 2.245e+02 3.185e+02, threshold=3.827e+02, percent-clipped=0.0 2023-06-23 20:27:55,795 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.76 vs. limit=15.0 2023-06-23 20:28:12,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=172240.0, ans=0.1 2023-06-23 20:28:16,978 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.17 vs. limit=10.0 2023-06-23 20:28:23,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=172306.66666666666, ans=0.1 2023-06-23 20:28:34,556 INFO [train.py:1008] (1/4) Epoch 49, batch 300, loss[loss=0.2189, simple_loss=0.3072, pruned_loss=0.06529, over 17836.00 frames. ], tot_loss[loss=0.1922, simple_loss=0.2711, pruned_loss=0.05662, over 2959779.39 frames. ], batch size: 68, lr: 6.80e-03, grad_scale: 16.0 2023-06-23 20:28:58,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=172440.0, ans=0.125 2023-06-23 20:29:14,938 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.85 vs. limit=22.5 2023-06-23 20:29:45,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=172640.0, ans=0.0 2023-06-23 20:29:47,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=172640.0, ans=0.125 2023-06-23 20:29:56,732 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.89 vs. limit=15.0 2023-06-23 20:29:57,312 INFO [train.py:1008] (1/4) Epoch 49, batch 350, loss[loss=0.1883, simple_loss=0.2636, pruned_loss=0.05647, over 18599.00 frames. ], tot_loss[loss=0.1923, simple_loss=0.2712, pruned_loss=0.05667, over 3144249.46 frames. 
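
Editor's note: the many ScheduledFloat entries record scalar hyper-parameters (skip rates, dropout probabilities, balancer probabilities) whose value "ans" is looked up from a schedule keyed on the global batch_count. A sketch of piecewise-linear scheduling under that assumption; the breakpoints below are invented for illustration, and the real schedules live in zipformer's scaling.py:

def scheduled_float(batch_count, points):
    """points: list of (batch_count, value) pairs, sorted by batch_count."""
    if batch_count <= points[0][0]:
        return points[0][1]
    if batch_count >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)

# e.g. a skip rate that anneals from 0.5 to 0.0 over the first 20k batches
print(scheduled_float(170026.67, [(0.0, 0.5), (20000.0, 0.0)]))  # -> 0.0
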
], batch size: 80, lr: 6.79e-03, grad_scale: 16.0 2023-06-23 20:29:57,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=172706.66666666666, ans=0.125 2023-06-23 20:30:15,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=172773.33333333334, ans=0.035 2023-06-23 20:30:17,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_positive, batch_count=172773.33333333334, ans=0.05 2023-06-23 20:30:37,375 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.714e+02 1.918e+02 2.192e+02 2.856e+02, threshold=3.837e+02, percent-clipped=0.0 2023-06-23 20:31:02,759 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=12.0 2023-06-23 20:31:18,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=173040.0, ans=0.0 2023-06-23 20:31:19,824 INFO [train.py:1008] (1/4) Epoch 49, batch 400, loss[loss=0.2056, simple_loss=0.2917, pruned_loss=0.0597, over 16792.00 frames. ], tot_loss[loss=0.1922, simple_loss=0.2706, pruned_loss=0.05689, over 3278929.70 frames. ], batch size: 59, lr: 6.78e-03, grad_scale: 32.0 2023-06-23 20:31:26,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-23 20:31:53,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=173173.33333333334, ans=0.125 2023-06-23 20:31:54,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=173173.33333333334, ans=0.0 2023-06-23 20:32:04,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=173173.33333333334, ans=0.1 2023-06-23 20:32:05,509 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=7.78 vs. limit=15.0 2023-06-23 20:32:37,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=173306.66666666666, ans=0.125 2023-06-23 20:32:42,992 INFO [train.py:1008] (1/4) Epoch 49, batch 450, loss[loss=0.1615, simple_loss=0.2442, pruned_loss=0.03945, over 18772.00 frames. ], tot_loss[loss=0.1921, simple_loss=0.2706, pruned_loss=0.05677, over 3377194.86 frames. ], batch size: 83, lr: 6.78e-03, grad_scale: 32.0 2023-06-23 20:32:59,201 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-23 20:33:05,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=173440.0, ans=0.0 2023-06-23 20:33:12,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=173440.0, ans=0.125 2023-06-23 20:33:19,898 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.01 vs. 
limit=15.0 2023-06-23 20:33:22,125 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.706e+02 1.898e+02 2.100e+02 2.777e+02, threshold=3.796e+02, percent-clipped=0.0 2023-06-23 20:33:49,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=173640.0, ans=0.1 2023-06-23 20:34:04,005 INFO [train.py:1008] (1/4) Epoch 49, batch 500, loss[loss=0.1915, simple_loss=0.269, pruned_loss=0.05697, over 20128.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2706, pruned_loss=0.05669, over 3472418.14 frames. ], batch size: 133, lr: 6.77e-03, grad_scale: 32.0 2023-06-23 20:34:35,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=173840.0, ans=0.0 2023-06-23 20:34:40,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=173840.0, ans=0.125 2023-06-23 20:35:12,129 INFO [train.py:1008] (1/4) Epoch 50, batch 0, loss[loss=0.1808, simple_loss=0.2646, pruned_loss=0.04853, over 19875.00 frames. ], tot_loss[loss=0.1808, simple_loss=0.2646, pruned_loss=0.04853, over 19875.00 frames. ], batch size: 120, lr: 6.70e-03, grad_scale: 32.0 2023-06-23 20:35:12,130 INFO [train.py:1031] (1/4) Computing validation loss 2023-06-23 20:35:17,757 INFO [train.py:1040] (1/4) Epoch 50, validation: loss=0.1967, simple_loss=0.2888, pruned_loss=0.05235, over 143649.00 frames. 2023-06-23 20:35:17,757 INFO [train.py:1041] (1/4) Maximum memory allocated so far is 13712MB 2023-06-23 20:35:58,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=174060.0, ans=0.125 2023-06-23 20:36:00,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=174060.0, ans=0.125 2023-06-23 20:36:23,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=174193.33333333334, ans=0.0 2023-06-23 20:36:26,512 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.756e+02 1.957e+02 2.326e+02 3.234e+02, threshold=3.914e+02, percent-clipped=0.0 2023-06-23 20:36:40,760 INFO [train.py:1008] (1/4) Epoch 50, batch 50, loss[loss=0.1781, simple_loss=0.2635, pruned_loss=0.04634, over 19100.00 frames. ], tot_loss[loss=0.1905, simple_loss=0.2707, pruned_loss=0.0551, over 848561.02 frames. ], batch size: 94, lr: 6.69e-03, grad_scale: 32.0 2023-06-23 20:36:41,640 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.08 vs. limit=22.5 2023-06-23 20:37:08,633 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.31 vs. limit=15.0 2023-06-23 20:37:30,182 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.36 vs. limit=15.0 2023-06-23 20:37:32,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=174460.0, ans=0.2 2023-06-23 20:37:34,870 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.44 vs. 
limit=10.0 2023-06-23 20:37:58,737 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:38:03,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=174593.33333333334, ans=0.125 2023-06-23 20:38:04,570 INFO [train.py:1008] (1/4) Epoch 50, batch 100, loss[loss=0.1946, simple_loss=0.2701, pruned_loss=0.05955, over 19474.00 frames. ], tot_loss[loss=0.1892, simple_loss=0.2679, pruned_loss=0.05521, over 1529640.96 frames. ], batch size: 105, lr: 6.69e-03, grad_scale: 32.0 2023-06-23 20:38:25,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=174660.0, ans=0.125 2023-06-23 20:38:40,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=174726.66666666666, ans=0.0 2023-06-23 20:38:59,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=174793.33333333334, ans=0.2 2023-06-23 20:39:05,542 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-06-23 20:39:13,762 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.691e+02 1.852e+02 2.068e+02 3.727e+02, threshold=3.704e+02, percent-clipped=0.0 2023-06-23 20:39:18,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=174860.0, ans=0.07 2023-06-23 20:39:23,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=174860.0, ans=0.125 2023-06-23 20:39:28,333 INFO [train.py:1008] (1/4) Epoch 50, batch 150, loss[loss=0.1995, simple_loss=0.2792, pruned_loss=0.05989, over 18295.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2686, pruned_loss=0.05609, over 2038458.14 frames. ], batch size: 74, lr: 6.68e-03, grad_scale: 32.0 2023-06-23 20:39:29,269 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.58 vs. limit=10.0 2023-06-23 20:39:38,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=174926.66666666666, ans=0.125 2023-06-23 20:39:42,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=174926.66666666666, ans=0.0 2023-06-23 20:39:56,453 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.55 vs. limit=12.0 2023-06-23 20:40:01,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=175060.0, ans=0.0 2023-06-23 20:40:18,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=175126.66666666666, ans=0.125 2023-06-23 20:40:45,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=175193.33333333334, ans=0.0 2023-06-23 20:40:51,248 INFO [train.py:1008] (1/4) Epoch 50, batch 200, loss[loss=0.1746, simple_loss=0.2567, pruned_loss=0.04631, over 18941.00 frames. ], tot_loss[loss=0.1906, simple_loss=0.2694, pruned_loss=0.05593, over 2417892.17 frames. 
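
Editor's note: the Whitening lines compare a per-module statistic against a limit; a natural reading is a score of how far the (grouped) channel covariance is from a multiple of the identity, equal to 1.0 when the activations are perfectly white. The formula below is an assumed stand-in, not necessarily the one in scaling.py, and the penalty gradient the real Whiten module applies when the metric exceeds its limit is omitted:

import torch

def whitening_metric(x, num_groups=1):
    # x: (num_frames, num_channels); returns 1.0 when each group's channel
    # covariance is a multiple of the identity, larger the more unbalanced
    # its eigenvalue spectrum becomes.
    num_frames, num_channels = x.shape
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    x = x - x.mean(dim=0, keepdim=True)
    metrics = []
    for g in range(num_groups):
        cov = x[:, g, :].T @ x[:, g, :] / num_frames     # (c, c) covariance
        c = cov.shape[0]
        metrics.append(c * (cov ** 2).sum() / cov.trace() ** 2)
    return torch.stack(metrics).mean()

m = whitening_metric(torch.randn(200, 256), num_groups=4)
print(f"metric={float(m):.2f} vs. limit=6.0")
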
], batch size: 86, lr: 6.68e-03, grad_scale: 16.0 2023-06-23 20:41:30,751 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.27 vs. limit=15.0 2023-06-23 20:42:01,416 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.719e+02 1.914e+02 2.210e+02 3.241e+02, threshold=3.828e+02, percent-clipped=0.0 2023-06-23 20:42:03,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=175526.66666666666, ans=0.0 2023-06-23 20:42:14,856 INFO [train.py:1008] (1/4) Epoch 50, batch 250, loss[loss=0.2039, simple_loss=0.2789, pruned_loss=0.06442, over 20261.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2693, pruned_loss=0.05564, over 2731179.15 frames. ], batch size: 141, lr: 6.67e-03, grad_scale: 16.0 2023-06-23 20:42:32,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=175660.0, ans=0.0 2023-06-23 20:42:34,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=175660.0, ans=0.1 2023-06-23 20:42:34,921 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.21 vs. limit=6.0 2023-06-23 20:43:19,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=175860.0, ans=0.2 2023-06-23 20:43:20,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=175860.0, ans=0.1 2023-06-23 20:43:38,672 INFO [train.py:1008] (1/4) Epoch 50, batch 300, loss[loss=0.1814, simple_loss=0.2626, pruned_loss=0.05008, over 18310.00 frames. ], tot_loss[loss=0.1908, simple_loss=0.2698, pruned_loss=0.05584, over 2952663.63 frames. ], batch size: 74, lr: 6.66e-03, grad_scale: 16.0 2023-06-23 20:43:40,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=175926.66666666666, ans=0.2 2023-06-23 20:43:53,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=175993.33333333334, ans=0.125 2023-06-23 20:43:56,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=175993.33333333334, ans=0.125 2023-06-23 20:44:06,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=175993.33333333334, ans=0.05 2023-06-23 20:44:49,238 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.741e+02 1.939e+02 2.314e+02 3.083e+02, threshold=3.877e+02, percent-clipped=0.0 2023-06-23 20:45:00,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=176193.33333333334, ans=0.1 2023-06-23 20:45:03,829 INFO [train.py:1008] (1/4) Epoch 50, batch 350, loss[loss=0.1973, simple_loss=0.2753, pruned_loss=0.05965, over 20460.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2688, pruned_loss=0.05533, over 3151202.35 frames. 
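
Editor's note: grad_scale dropping from 32.0 to 16.0 and later recovering is the signature of dynamic loss scaling under fp16 training: the scale is halved when a step produces inf/NaN gradients and grown back after a run of clean steps. A hedged sketch with torch.cuda.amp; model, optimizer and batch_to_loss are placeholders rather than the training script's actual objects:

import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler(init_scale=32.0)

def fp16_step(model, optimizer, batch_to_loss, batch):
    optimizer.zero_grad()
    with autocast():
        loss = batch_to_loss(model, batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skipped internally if the gradients overflowed
    scaler.update()          # may halve or grow the scale at this point
    return loss.detach(), scaler.get_scale()
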
], batch size: 160, lr: 6.66e-03, grad_scale: 16.0 2023-06-23 20:45:44,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=176393.33333333334, ans=0.125 2023-06-23 20:45:52,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=176460.0, ans=0.125 2023-06-23 20:46:25,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=176526.66666666666, ans=0.125 2023-06-23 20:46:27,861 INFO [train.py:1008] (1/4) Epoch 50, batch 400, loss[loss=0.1975, simple_loss=0.2827, pruned_loss=0.05612, over 19201.00 frames. ], tot_loss[loss=0.1903, simple_loss=0.2693, pruned_loss=0.05562, over 3288517.57 frames. ], batch size: 92, lr: 6.65e-03, grad_scale: 32.0 2023-06-23 20:47:36,793 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.753e+02 1.915e+02 2.152e+02 3.496e+02, threshold=3.829e+02, percent-clipped=0.0 2023-06-23 20:47:51,023 INFO [train.py:1008] (1/4) Epoch 50, batch 450, loss[loss=0.1725, simple_loss=0.2576, pruned_loss=0.04372, over 19107.00 frames. ], tot_loss[loss=0.1904, simple_loss=0.2699, pruned_loss=0.05542, over 3386533.76 frames. ], batch size: 94, lr: 6.65e-03, grad_scale: 32.0 2023-06-23 20:48:06,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=176993.33333333334, ans=0.125 2023-06-23 20:48:13,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=176993.33333333334, ans=0.0 2023-06-23 20:48:16,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=176993.33333333334, ans=0.09899494936611666 2023-06-23 20:48:26,208 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:48:36,642 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.45 vs. limit=15.0 2023-06-23 20:49:10,606 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=6.97 vs. limit=15.0 2023-06-23 20:49:11,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=177260.0, ans=0.125 2023-06-23 20:49:12,808 INFO [train.py:1008] (1/4) Epoch 50, batch 500, loss[loss=0.1919, simple_loss=0.271, pruned_loss=0.05636, over 19945.00 frames. ], tot_loss[loss=0.1915, simple_loss=0.2705, pruned_loss=0.0562, over 3475540.96 frames. ], batch size: 126, lr: 6.64e-03, grad_scale: 32.0 2023-06-23 20:49:30,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=177326.66666666666, ans=0.125 2023-06-23 20:49:35,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=177326.66666666666, ans=0.0 2023-06-23 20:49:51,233 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.53 vs. limit=15.0 2023-06-23 20:50:03,949 INFO [train.py:1221] (1/4) Done!
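
Editor's note: with training finished, the loss trajectory can be recovered directly from log lines in the format above. A small parser for the "Epoch E, batch B, ... tot_loss[loss=...]" pattern; the log path in the final comment is a placeholder:

import re

PATTERN = re.compile(r"Epoch (\d+), batch (\d+), .*?tot_loss\[loss=([0-9.]+)")

def parse_tot_loss(log_text):
    """Return a list of (epoch, batch, tot_loss) tuples from the raw log text."""
    return [(int(m.group(1)), int(m.group(2)), float(m.group(3)))
            for m in PATTERN.finditer(log_text)]

# e.g. parse_tot_loss(open("train.log").read())
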