2023-06-23 08:06:25,987 INFO [train.py:1076] (2/4) Training started 2023-06-23 08:06:25,987 INFO [train.py:1086] (2/4) Device: cuda:2 2023-06-23 08:06:26,032 INFO [train.py:1095] (2/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Debug', 'k2-with-cuda': True, 'k2-git-sha1': '38211604d6a24b15f320578a1a38f6c12d7a711c', 'k2-git-date': 'Mon Jun 12 10:59:44 2023', 'lhotse-version': '1.15.0.dev+git.f1fd23d.clean', 'torch-version': '2.0.0+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.8', 'icefall-git-branch': 'jsalt_ted', 'icefall-git-sha1': '5e817e8-dirty', 'icefall-git-date': 'Thu Jun 22 03:25:04 2023', 'icefall-path': '/exp/draj/jsalt2023/icefall', 'k2-path': '/exp/draj/jsalt2023/k2/k2/python/k2/__init__.py', 'lhotse-path': '/exp/draj/jsalt2023/lhotse/lhotse/__init__.py', 'hostname': 'r3n06', 'IP address': '10.1.3.6'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 50, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp/v7'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.04, 'lr_batches': 7500, 'lr_epochs': 5.0, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'rnnt_type': 'modified', 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 1, 'average_period': 200, 'use_fp16': True, 'delay_penalty': 0.0, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/manifests'), 'max_duration': 1000, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'blank_id': 0, 'vocab_size': 500} 2023-06-23 08:06:26,033 INFO [train.py:1097] (2/4) About to create model 2023-06-23 08:06:26,781 INFO [train.py:1101] (2/4) Number of model parameters: 65549011 2023-06-23 08:06:35,204 INFO [train.py:1116] (2/4) Using DDP 2023-06-23 08:06:35,517 INFO [asr_datamodule.py:406] (2/4) About to get train cuts 2023-06-23 08:06:35,575 INFO [asr_datamodule.py:232] (2/4) Enable SpecAugment 2023-06-23 08:06:35,575 INFO [asr_datamodule.py:233] (2/4) Time warp factor: 80 2023-06-23 08:06:35,576 INFO [asr_datamodule.py:249] (2/4) About to get Musan cuts 2023-06-23 08:06:35,576 INFO [asr_datamodule.py:252] (2/4) Enable MUSAN 2023-06-23 08:06:37,407 INFO [asr_datamodule.py:274] (2/4) About to create train dataset 2023-06-23 08:06:37,407 INFO [asr_datamodule.py:300] (2/4) Using DynamicBucketingSampler. 
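For readers following the datamodule setup above, here is a minimal sketch of how the logged options ('max_duration': 1000, 'num_buckets': 30, 'shuffle': True, 'return_cuts': True, 'num_workers': 2) typically map onto lhotse's DynamicBucketingSampler and K2SpeechRecognitionDataset. The manifest path is a hypothetical placeholder and the keyword names follow lhotse 1.15-era APIs; this is not a transcript of asr_datamodule.py, which additionally wires in SpecAugment and MUSAN noise mixing.

from torch.utils.data import DataLoader
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler, K2SpeechRecognitionDataset

# Hypothetical manifest path; the real recipe obtains its train cuts via the
# asr_datamodule helper ("About to get train cuts" above), under data/manifests.
train_cuts = CutSet.from_file("data/manifests/train_cuts.jsonl.gz")

# Batches are formed by total audio duration (seconds) rather than a fixed
# number of utterances, matching 'max_duration': 1000 and 'num_buckets': 30.
sampler = DynamicBucketingSampler(
    train_cuts,
    max_duration=1000,
    num_buckets=30,
    shuffle=True,
)

# SpecAugment (time warp factor 80) and MUSAN mixing are applied through cut /
# input transforms in the actual recipe; omitted here to keep the sketch short.
dataset = K2SpeechRecognitionDataset(return_cuts=True)

# lhotse samplers yield whole batches, so the DataLoader batch_size must be None.
train_dl = DataLoader(
    dataset,
    sampler=sampler,
    batch_size=None,
    num_workers=2,
)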
2023-06-23 08:06:39,661 INFO [asr_datamodule.py:321] (2/4) About to create train dataloader 2023-06-23 08:06:39,661 INFO [asr_datamodule.py:411] (2/4) About to get dev cuts 2023-06-23 08:06:39,663 INFO [asr_datamodule.py:342] (2/4) About to create dev dataset 2023-06-23 08:06:39,681 INFO [asr_datamodule.py:361] (2/4) About to create dev dataloader 2023-06-23 08:06:39,681 INFO [train.py:1269] (2/4) Sanity check -- see if any of the batches in epoch 1 would cause OOM. 2023-06-23 08:07:25,977 INFO [scaling.py:962] (2/4) Whitening: name=None, num_groups=4, num_channels=128, metric=13.47 vs. limit=3.0 2023-06-23 08:07:26,290 INFO [scaling.py:962] (2/4) Whitening: name=None, num_groups=1, num_channels=256, metric=44.35 vs. limit=5.0 2023-06-23 08:07:26,639 INFO [train.py:1297] (2/4) Maximum memory allocated so far is 8736MB 2023-06-23 08:07:29,311 INFO [train.py:1297] (2/4) Maximum memory allocated so far is 8863MB 2023-06-23 08:07:40,820 INFO [train.py:1297] (2/4) Maximum memory allocated so far is 11507MB 2023-06-23 08:07:47,179 INFO [train.py:1297] (2/4) Maximum memory allocated so far is 11837MB 2023-06-23 08:08:05,856 INFO [train.py:1297] (2/4) Maximum memory allocated so far is 11837MB 2023-06-23 08:08:15,175 INFO [train.py:1297] (2/4) Maximum memory allocated so far is 11988MB 2023-06-23 08:08:35,406 INFO [train.py:1008] (2/4) Epoch 1, batch 0, loss[loss=6.221, simple_loss=5.668, pruned_loss=5.524, over 18273.00 frames. ], tot_loss[loss=6.221, simple_loss=5.668, pruned_loss=5.524, over 18273.00 frames. ], batch size: 74, lr: 2.00e-02, grad_scale: 1.0 2023-06-23 08:08:35,407 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 08:08:41,599 INFO [train.py:1040] (2/4) Epoch 1, validation: loss=6.238, simple_loss=5.687, pruned_loss=5.495, over 143649.00 frames. 2023-06-23 08:08:41,600 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 11988MB 2023-06-23 08:08:45,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.06 vs. limit=7.5 2023-06-23 08:08:50,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys.whitening_limit, batch_count=0.0, ans=3.0 2023-06-23 08:08:53,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=0.0, ans=0.5 2023-06-23 08:09:10,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=66.66666666666667, ans=0.496875 2023-06-23 08:09:28,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=66.66666666666667, ans=0.29933333333333334 2023-06-23 08:09:39,055 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=22.21 vs. limit=4.053333333333334 2023-06-23 08:09:48,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=133.33333333333334, ans=0.49375 2023-06-23 08:09:49,834 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=27.57 vs. limit=7.6 2023-06-23 08:10:06,178 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=346.59 vs. 
limit=7.575 2023-06-23 08:10:09,771 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=23.11 vs. limit=4.08 2023-06-23 08:10:22,430 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=309.04 vs. limit=7.575 2023-06-23 08:10:33,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=266.6666666666667, ans=5.166666666666667 2023-06-23 08:10:42,464 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=103.77 vs. limit=7.6 2023-06-23 08:10:44,070 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=89.22 vs. limit=7.6 2023-06-23 08:10:47,196 INFO [train.py:1008] (2/4) Epoch 1, batch 50, loss[loss=1.43, simple_loss=1.285, pruned_loss=1.314, over 19971.00 frames. ], tot_loss[loss=2.843, simple_loss=2.616, pruned_loss=2.218, over 871859.79 frames. ], batch size: 126, lr: 2.20e-02, grad_scale: 0.5 2023-06-23 08:10:54,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=25.97 vs. limit=5.083333333333333 2023-06-23 08:10:55,162 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=228.29 vs. limit=7.625 2023-06-23 08:11:06,591 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=35.35 vs. limit=7.8 2023-06-23 08:11:06,730 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=286.84 vs. limit=7.65 2023-06-23 08:11:07,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=498.25 vs. limit=7.65 2023-06-23 08:11:18,296 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=13.34 vs. limit=5.1 2023-06-23 08:11:31,122 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=41.36 vs. limit=7.675 2023-06-23 08:11:32,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=466.6666666666667, ans=0.478125 2023-06-23 08:11:36,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=466.6666666666667, ans=0.478125 2023-06-23 08:11:47,388 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=51.96 vs. limit=7.675 2023-06-23 08:11:48,681 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=58.87 vs. 
limit=7.7 2023-06-23 08:11:52,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=533.3333333333334, ans=0.475 2023-06-23 08:11:53,058 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=93.47 vs. limit=7.7 2023-06-23 08:11:55,199 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=229.31 vs. limit=5.266666666666667 2023-06-23 08:12:01,402 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=14.89 vs. limit=7.7 2023-06-23 08:12:05,175 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=13.66 vs. limit=7.7 2023-06-23 08:12:17,061 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=180.35 vs. limit=5.3 2023-06-23 08:12:18,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=600.0, ans=0.471875 2023-06-23 08:12:20,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=600.0, ans=0.879 2023-06-23 08:12:20,843 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=18.25 vs. limit=7.725 2023-06-23 08:12:28,967 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=74.34 vs. limit=5.3 2023-06-23 08:12:31,910 INFO [train.py:1008] (2/4) Epoch 1, batch 100, loss[loss=1.166, simple_loss=1.018, pruned_loss=1.199, over 19207.00 frames. ], tot_loss[loss=1.99, simple_loss=1.805, pruned_loss=1.697, over 1525555.96 frames. ], batch size: 92, lr: 2.40e-02, grad_scale: 1.0 2023-06-23 08:12:32,993 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=58.21 vs. limit=7.75 2023-06-23 08:12:34,785 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=10.50 vs. limit=7.75 2023-06-23 08:12:35,815 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.225e+02 1.675e+02 4.751e+02 2.973e+03 1.162e+04, threshold=9.501e+02, percent-clipped=0.0 2023-06-23 08:12:52,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=733.3333333333334, ans=0.17250000000000001 2023-06-23 08:12:56,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=733.3333333333334, ans=0.5 2023-06-23 08:12:56,620 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=17.80 vs. limit=7.775 2023-06-23 08:13:17,145 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=237.92 vs. 
limit=7.8 2023-06-23 08:13:21,102 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=34.19 vs. limit=7.8 2023-06-23 08:13:22,826 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=23.48 vs. limit=8.1 2023-06-23 08:13:44,003 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.005e-03 2023-06-23 08:13:55,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=933.3333333333334, ans=0.04708333333333334 2023-06-23 08:13:56,350 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=21.12 vs. limit=8.2 2023-06-23 08:13:57,623 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=18.21 vs. limit=7.85 2023-06-23 08:13:59,676 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=13.47 vs. limit=7.85 2023-06-23 08:14:05,590 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.43 vs. limit=5.233333333333333 2023-06-23 08:14:14,449 INFO [train.py:1008] (2/4) Epoch 1, batch 150, loss[loss=1.08, simple_loss=0.9346, pruned_loss=1.076, over 18622.00 frames. ], tot_loss[loss=1.617, simple_loss=1.449, pruned_loss=1.454, over 2031428.21 frames. ], batch size: 80, lr: 2.60e-02, grad_scale: 1.0 2023-06-23 08:14:17,524 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.73 vs. limit=8.25 2023-06-23 08:14:21,232 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=490.74 vs. limit=7.875 2023-06-23 08:14:26,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1000.0, ans=0.09375 2023-06-23 08:14:33,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.25 vs. limit=8.3 2023-06-23 08:14:33,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=7.51 vs. limit=7.9 2023-06-23 08:14:51,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=251.83 vs. limit=7.9 2023-06-23 08:15:00,815 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=35.81 vs. limit=7.925 2023-06-23 08:15:11,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1133.3333333333333, ans=0.09291666666666668 2023-06-23 08:15:19,953 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=101.59 vs. 
limit=7.95 2023-06-23 08:15:22,220 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=196.91 vs. limit=7.95 2023-06-23 08:15:36,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=1266.6666666666667, ans=0.07150000000000001 2023-06-23 08:15:36,923 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=20.02 vs. limit=7.975 2023-06-23 08:15:40,339 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.29 vs. limit=7.975 2023-06-23 08:15:48,188 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=16.92 vs. limit=7.975 2023-06-23 08:15:49,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1266.6666666666667, ans=0.440625 2023-06-23 08:15:58,199 INFO [train.py:1008] (2/4) Epoch 1, batch 200, loss[loss=1, simple_loss=0.863, pruned_loss=0.9471, over 19507.00 frames. ], tot_loss[loss=1.4, simple_loss=1.244, pruned_loss=1.284, over 2411863.73 frames. ], batch size: 102, lr: 2.80e-02, grad_scale: 2.0 2023-06-23 08:16:02,122 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 9.170e+01 1.130e+02 1.229e+02 1.420e+02 8.859e+02, threshold=2.457e+02, percent-clipped=0.0 2023-06-23 08:16:09,068 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=206.25 vs. limit=8.0 2023-06-23 08:16:19,324 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.75 vs. limit=8.55 2023-06-23 08:16:22,450 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=21.33 vs. limit=8.025 2023-06-23 08:16:25,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=65.79 vs. limit=8.025 2023-06-23 08:16:26,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1400.0, ans=0.851 2023-06-23 08:16:32,585 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=9.02 vs. limit=5.35 2023-06-23 08:16:41,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=1466.6666666666667, ans=0.43125 2023-06-23 08:16:42,367 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.65 vs. limit=4.586666666666667 2023-06-23 08:16:50,166 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=38.64 vs. limit=8.05 2023-06-23 08:17:09,539 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=18.98 vs. 
limit=8.075 2023-06-23 08:17:19,689 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=43.47 vs. limit=8.1 2023-06-23 08:17:27,495 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=61.32 vs. limit=8.1 2023-06-23 08:17:30,985 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.05 vs. limit=5.8 2023-06-23 08:17:32,767 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=17.52 vs. limit=8.1 2023-06-23 08:17:34,465 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.88 vs. limit=8.7 2023-06-23 08:17:39,813 INFO [train.py:1008] (2/4) Epoch 1, batch 250, loss[loss=0.9334, simple_loss=0.8009, pruned_loss=0.8582, over 19959.00 frames. ], tot_loss[loss=1.265, simple_loss=1.115, pruned_loss=1.165, over 2713944.37 frames. ], batch size: 126, lr: 3.00e-02, grad_scale: 2.0 2023-06-23 08:17:42,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1666.6666666666667, ans=0.29166666666666663 2023-06-23 08:17:58,395 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=79.09 vs. limit=8.15 2023-06-23 08:17:58,449 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.69 vs. limit=8.8 2023-06-23 08:17:59,989 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=43.53 vs. limit=8.15 2023-06-23 08:18:04,949 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=11.96 vs. limit=5.433333333333334 2023-06-23 08:18:09,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1733.3333333333333, ans=0.061000000000000006 2023-06-23 08:18:30,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=1800.0, ans=0.415625 2023-06-23 08:18:36,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1800.0, ans=0.837 2023-06-23 08:18:36,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1800.0, ans=0.1325 2023-06-23 08:18:37,076 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=108.43 vs. limit=5.9 2023-06-23 08:18:43,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1866.6666666666667, ans=0.4125 2023-06-23 08:18:57,343 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=15.51 vs. 
limit=5.0 2023-06-23 08:19:07,663 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=26.49 vs. limit=8.225 2023-06-23 08:19:09,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1933.3333333333333, ans=0.8323333333333334 2023-06-23 08:19:13,029 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.40 vs. limit=8.225 2023-06-23 08:19:19,500 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=7.116e+00 2023-06-23 08:19:21,078 INFO [train.py:1008] (2/4) Epoch 1, batch 300, loss[loss=0.9313, simple_loss=0.7934, pruned_loss=0.8393, over 18618.00 frames. ], tot_loss[loss=1.172, simple_loss=1.026, pruned_loss=1.075, over 2955442.91 frames. ], batch size: 80, lr: 3.20e-02, grad_scale: 4.0 2023-06-23 08:19:24,878 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 8.030e+01 1.048e+02 1.320e+02 1.714e+02 2.522e+02, threshold=2.641e+02, percent-clipped=2.0 2023-06-23 08:19:28,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2000.0, ans=0.125 2023-06-23 08:19:32,075 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=18.58 vs. limit=8.25 2023-06-23 08:19:56,136 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=25.74 vs. limit=8.275 2023-06-23 08:19:58,279 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.68 vs. limit=9.05 2023-06-23 08:20:06,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2133.3333333333335, ans=0.12 2023-06-23 08:20:11,266 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.85 vs. limit=9.1 2023-06-23 08:20:22,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2200.0, ans=0.1175 2023-06-23 08:20:25,874 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=2.037e+01 2023-06-23 08:20:28,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=6.71 vs. limit=4.88 2023-06-23 08:20:48,088 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=9.2 2023-06-23 08:21:03,117 INFO [train.py:1008] (2/4) Epoch 1, batch 350, loss[loss=0.884, simple_loss=0.7492, pruned_loss=0.7774, over 19061.00 frames. ], tot_loss[loss=1.105, simple_loss=0.9614, pruned_loss=1.006, over 3127421.45 frames. ], batch size: 89, lr: 3.40e-02, grad_scale: 4.0 2023-06-23 08:21:09,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2333.3333333333335, ans=0.0475 2023-06-23 08:21:12,299 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=14.65 vs. 
limit=8.375 2023-06-23 08:21:20,204 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=184.66 vs. limit=8.375 2023-06-23 08:21:21,879 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=14.67 vs. limit=8.4 2023-06-23 08:21:28,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2400.0, ans=0.27599999999999997 2023-06-23 08:21:30,837 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=11.06 vs. limit=8.4 2023-06-23 08:21:31,544 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=29.62 vs. limit=8.4 2023-06-23 08:21:51,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=2466.6666666666665, ans=0.2753333333333333 2023-06-23 08:22:00,121 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.24 vs. limit=9.35 2023-06-23 08:22:03,547 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=17.86 vs. limit=8.45 2023-06-23 08:22:18,893 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=14.56 vs. limit=8.45 2023-06-23 08:22:20,576 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=13.12 vs. limit=9.45 2023-06-23 08:22:22,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=2600.0, ans=0.27399999999999997 2023-06-23 08:22:22,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=2600.0, ans=0.378125 2023-06-23 08:22:36,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=2600.0, ans=8.475 2023-06-23 08:22:44,000 INFO [train.py:1008] (2/4) Epoch 1, batch 400, loss[loss=0.9713, simple_loss=0.8248, pruned_loss=0.8173, over 15956.00 frames. ], tot_loss[loss=1.051, simple_loss=0.91, pruned_loss=0.9441, over 3271381.61 frames. ], batch size: 51, lr: 3.60e-02, grad_scale: 8.0 2023-06-23 08:22:46,812 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.90 vs. limit=8.5 2023-06-23 08:22:47,919 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 8.625e+01 1.356e+02 1.811e+02 2.301e+02 5.147e+02, threshold=3.622e+02, percent-clipped=12.0 2023-06-23 08:22:57,118 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.93 vs. 
limit=6.333333333333333 2023-06-23 08:23:07,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2733.3333333333335, ans=0.0385 2023-06-23 08:23:32,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=2800.0, ans=8.55 2023-06-23 08:23:35,626 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.55 vs. limit=8.55 2023-06-23 08:23:37,192 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.74 vs. limit=9.6 2023-06-23 08:23:44,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.57 vs. limit=8.575 2023-06-23 08:23:48,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=2866.6666666666665, ans=0.365625 2023-06-23 08:24:04,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=11.24 vs. limit=8.6 2023-06-23 08:24:05,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=2933.3333333333335, ans=0.7973333333333333 2023-06-23 08:24:06,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=5.173333333333334 2023-06-23 08:24:18,556 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=9.7 2023-06-23 08:24:19,735 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.92 vs. limit=8.6 2023-06-23 08:24:24,170 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=27.46 vs. limit=8.625 2023-06-23 08:24:24,833 INFO [train.py:1008] (2/4) Epoch 1, batch 450, loss[loss=0.8633, simple_loss=0.7348, pruned_loss=0.6975, over 19209.00 frames. ], tot_loss[loss=1.008, simple_loss=0.8693, pruned_loss=0.8882, over 3389999.77 frames. ], batch size: 92, lr: 3.80e-02, grad_scale: 8.0 2023-06-23 08:24:25,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=3000.0, ans=0.125 2023-06-23 08:24:29,777 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=5.38 vs. limit=5.2 2023-06-23 08:24:36,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3000.0, ans=0.27 2023-06-23 08:24:42,896 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.53 vs. limit=8.65 2023-06-23 08:24:45,233 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.42 vs. 
limit=8.65 2023-06-23 08:24:49,377 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=9.80 vs. limit=8.65 2023-06-23 08:24:52,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=3066.6666666666665, ans=0.35625 2023-06-23 08:25:02,628 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=18.86 vs. limit=8.675 2023-06-23 08:25:12,716 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=11.15 vs. limit=9.85 2023-06-23 08:25:21,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=3200.0, ans=0.35 2023-06-23 08:25:32,068 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.41 vs. limit=8.7 2023-06-23 08:25:49,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=3266.6666666666665, ans=0.346875 2023-06-23 08:25:59,975 INFO [train.py:1008] (2/4) Epoch 1, batch 500, loss[loss=0.7783, simple_loss=0.6695, pruned_loss=0.592, over 20261.00 frames. ], tot_loss[loss=0.9586, simple_loss=0.8257, pruned_loss=0.8251, over 3485113.68 frames. ], batch size: 141, lr: 4.00e-02, grad_scale: 8.0 2023-06-23 08:26:01,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.16 vs. limit=8.75 2023-06-23 08:26:03,486 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.223e+02 1.809e+02 2.343e+02 3.380e+02 8.520e+02, threshold=4.686e+02, percent-clipped=21.0 2023-06-23 08:26:03,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=3333.3333333333335, ans=0.09899494936611666 2023-06-23 08:26:13,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=3333.3333333333335, ans=0.34375 2023-06-23 08:26:20,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=3400.0, ans=0.07875 2023-06-23 08:26:22,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=3400.0, ans=0.340625 2023-06-23 08:26:30,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=3400.0, ans=0.340625 2023-06-23 08:26:32,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=3400.0, ans=7.125 2023-06-23 08:26:41,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=3466.6666666666665, ans=0.07 2023-06-23 08:27:22,041 INFO [train.py:1008] (2/4) Epoch 2, batch 0, loss[loss=0.8332, simple_loss=0.7214, pruned_loss=0.6117, over 16789.00 frames. ], tot_loss[loss=0.8332, simple_loss=0.7214, pruned_loss=0.6117, over 16789.00 frames. 
], batch size: 59, lr: 3.96e-02, grad_scale: 16.0 2023-06-23 08:27:22,042 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 08:27:27,598 INFO [train.py:1040] (2/4) Epoch 2, validation: loss=0.696, simple_loss=0.6181, pruned_loss=0.4715, over 143649.00 frames. 2023-06-23 08:27:27,598 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 08:27:35,644 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=8.8325 2023-06-23 08:27:46,405 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=14.05 vs. limit=8.8575 2023-06-23 08:28:05,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=3686.6666666666665, ans=0.32718749999999996 2023-06-23 08:28:07,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=3686.6666666666665, ans=0.07695833333333334 2023-06-23 08:28:16,286 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.66 vs. limit=8.8825 2023-06-23 08:28:30,972 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=5.73 vs. limit=5.501333333333333 2023-06-23 08:28:46,197 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=10.365 2023-06-23 08:28:48,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=3820.0, ans=0.056749999999999995 2023-06-23 08:28:56,035 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.05 vs. limit=8.932500000000001 2023-06-23 08:29:02,412 INFO [train.py:1008] (2/4) Epoch 2, batch 50, loss[loss=0.6817, simple_loss=0.5958, pruned_loss=0.4765, over 20311.00 frames. ], tot_loss[loss=0.7162, simple_loss=0.6224, pruned_loss=0.5143, over 851335.20 frames. ], batch size: 149, lr: 3.95e-02, grad_scale: 16.0 2023-06-23 08:29:03,079 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=28.05 vs. limit=8.9575 2023-06-23 08:29:04,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3886.6666666666665, ans=0.26113333333333333 2023-06-23 08:29:08,350 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=3.214e+01 2023-06-23 08:29:16,724 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=3.583 2023-06-23 08:29:26,089 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=6.10 vs. 
limit=5.581333333333333 2023-06-23 08:29:38,270 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 2.585e+02 4.168e+02 5.706e+02 1.238e+03, threshold=8.337e+02, percent-clipped=41.0 2023-06-23 08:29:50,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=4020.0, ans=0.2598 2023-06-23 08:29:52,877 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.54 vs. limit=9.0075 2023-06-23 08:30:09,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=4086.6666666666665, ans=0.30843750000000003 2023-06-23 08:30:22,818 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.88 vs. limit=9.0575 2023-06-23 08:30:29,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=4153.333333333333, ans=0.3053125 2023-06-23 08:30:37,511 INFO [train.py:1008] (2/4) Epoch 2, batch 100, loss[loss=0.6785, simple_loss=0.5975, pruned_loss=0.4561, over 19104.00 frames. ], tot_loss[loss=0.6942, simple_loss=0.6063, pruned_loss=0.4861, over 1502903.21 frames. ], batch size: 94, lr: 3.95e-02, grad_scale: 16.0 2023-06-23 08:30:39,932 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. limit=7.109999999999999 2023-06-23 08:30:44,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=4220.0, ans=9.0825 2023-06-23 08:30:45,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4220.0, ans=0.2578 2023-06-23 08:31:30,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=4353.333333333333, ans=0.29593749999999996 2023-06-23 08:31:35,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=4420.0, ans=0.7453000000000001 2023-06-23 08:32:06,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.56 vs. limit=6.121666666666667 2023-06-23 08:32:10,565 INFO [train.py:1008] (2/4) Epoch 2, batch 150, loss[loss=0.6243, simple_loss=0.5556, pruned_loss=0.4014, over 19196.00 frames. ], tot_loss[loss=0.6697, simple_loss=0.5882, pruned_loss=0.4572, over 2006342.73 frames. ], batch size: 92, lr: 3.95e-02, grad_scale: 16.0 2023-06-23 08:32:11,836 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=10.915 2023-06-23 08:32:38,855 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.79 vs. 
limit=6.155 2023-06-23 08:32:41,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=4620.0, ans=0.7383 2023-06-23 08:32:47,155 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.685e+02 3.263e+02 5.019e+02 7.884e+02 1.752e+03, threshold=1.004e+03, percent-clipped=21.0 2023-06-23 08:32:49,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=4686.666666666667, ans=0.28031249999999996 2023-06-23 08:32:55,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=4686.666666666667, ans=0.28031249999999996 2023-06-23 08:33:01,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=4686.666666666667, ans=0.28031249999999996 2023-06-23 08:33:02,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=4686.666666666667, ans=0.7359666666666667 2023-06-23 08:33:28,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=4820.0, ans=8.0125 2023-06-23 08:33:42,509 INFO [train.py:1008] (2/4) Epoch 2, batch 200, loss[loss=0.623, simple_loss=0.5562, pruned_loss=0.3924, over 19702.00 frames. ], tot_loss[loss=0.6523, simple_loss=0.5757, pruned_loss=0.4355, over 2392840.48 frames. ], batch size: 110, lr: 3.95e-02, grad_scale: 8.0 2023-06-23 08:33:55,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=4886.666666666667, ans=0.2709375 2023-06-23 08:34:06,678 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.75 vs. limit=9.3575 2023-06-23 08:34:46,751 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.64 vs. limit=6.2716666666666665 2023-06-23 08:34:46,837 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.59 vs. limit=6.2716666666666665 2023-06-23 08:35:13,991 INFO [train.py:1008] (2/4) Epoch 2, batch 250, loss[loss=0.5467, simple_loss=0.4935, pruned_loss=0.3312, over 20566.00 frames. ], tot_loss[loss=0.6341, simple_loss=0.5623, pruned_loss=0.4142, over 2696988.25 frames. ], batch size: 173, lr: 3.95e-02, grad_scale: 8.0 2023-06-23 08:35:16,852 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.05 vs. limit=9.4575 2023-06-23 08:35:34,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=5286.666666666667, ans=0.24713333333333332 2023-06-23 08:35:38,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=5286.666666666667, ans=0.2521875 2023-06-23 08:35:45,410 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=1.036e-01 2023-06-23 08:35:45,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.51 vs. 
limit=9.4825 2023-06-23 08:35:48,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=5353.333333333333, ans=0.044361111111111115 2023-06-23 08:35:50,024 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 3.360e+02 5.707e+02 8.306e+02 1.976e+03, threshold=1.141e+03, percent-clipped=19.0 2023-06-23 08:36:13,103 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 08:36:20,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=5420.0, ans=0.09899494936611666 2023-06-23 08:36:33,104 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.54 vs. limit=9.557500000000001 2023-06-23 08:36:33,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=5486.666666666667, ans=0.2428125 2023-06-23 08:36:44,285 INFO [train.py:1008] (2/4) Epoch 2, batch 300, loss[loss=0.5411, simple_loss=0.4857, pruned_loss=0.3296, over 20299.00 frames. ], tot_loss[loss=0.6167, simple_loss=0.5493, pruned_loss=0.3951, over 2942099.93 frames. ], batch size: 239, lr: 3.95e-02, grad_scale: 8.0 2023-06-23 08:36:46,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=5553.333333333333, ans=0.7056333333333333 2023-06-23 08:36:51,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=5553.333333333333, ans=0.24446666666666667 2023-06-23 08:37:25,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=5686.666666666667, ans=0.23343750000000002 2023-06-23 08:37:43,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=5753.333333333333, ans=0.23031249999999998 2023-06-23 08:38:11,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=5886.666666666667, ans=0.2240625 2023-06-23 08:38:11,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=5886.666666666667, ans=0.6939666666666666 2023-06-23 08:38:12,485 INFO [train.py:1008] (2/4) Epoch 2, batch 350, loss[loss=0.5287, simple_loss=0.4737, pruned_loss=0.3207, over 20278.00 frames. ], tot_loss[loss=0.5996, simple_loss=0.5365, pruned_loss=0.3769, over 3111909.97 frames. ], batch size: 239, lr: 3.95e-02, grad_scale: 8.0 2023-06-23 08:38:44,622 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=8.23 vs. limit=7.976666666666667 2023-06-23 08:38:48,692 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 3.515e+02 5.337e+02 9.069e+02 1.620e+03, threshold=1.067e+03, percent-clipped=11.0 2023-06-23 08:38:54,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=6020.0, ans=0.21781250000000002 2023-06-23 08:39:21,405 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.84 vs. 
limit=6.5216666666666665 2023-06-23 08:39:43,022 INFO [train.py:1008] (2/4) Epoch 2, batch 400, loss[loss=0.5251, simple_loss=0.4715, pruned_loss=0.3147, over 20231.00 frames. ], tot_loss[loss=0.5838, simple_loss=0.5251, pruned_loss=0.3596, over 3264752.35 frames. ], batch size: 239, lr: 3.95e-02, grad_scale: 16.0 2023-06-23 08:39:48,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=6220.0, ans=0.6823 2023-06-23 08:39:58,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=6220.0, ans=0.009517391304347827 2023-06-23 08:40:11,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=6286.666666666667, ans=0.2053125 2023-06-23 08:40:16,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=6286.666666666667, ans=0.009502898550724639 2023-06-23 08:40:16,455 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.85 vs. limit=9.8575 2023-06-23 08:40:21,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=6353.333333333333, ans=0.00948840579710145 2023-06-23 08:40:33,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=6353.333333333333, ans=0.20218750000000002 2023-06-23 08:41:09,207 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.65 vs. limit=8.243333333333334 2023-06-23 08:41:11,419 INFO [train.py:1008] (2/4) Epoch 2, batch 450, loss[loss=0.5321, simple_loss=0.4941, pruned_loss=0.294, over 19227.00 frames. ], tot_loss[loss=0.5714, simple_loss=0.5158, pruned_loss=0.3465, over 3374980.93 frames. ], batch size: 92, lr: 3.94e-02, grad_scale: 4.0 2023-06-23 08:41:19,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=6553.333333333333, ans=0.1928125 2023-06-23 08:41:48,458 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.85 vs. limit=8.343333333333334 2023-06-23 08:41:50,627 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 2.476e+02 3.499e+02 6.035e+02 3.649e+03, threshold=6.998e+02, percent-clipped=11.0 2023-06-23 08:41:50,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=6686.666666666667, ans=0.18656250000000002 2023-06-23 08:41:51,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=6686.666666666667, ans=0.18656250000000002 2023-06-23 08:41:55,870 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 08:42:37,681 INFO [train.py:1008] (2/4) Epoch 2, batch 500, loss[loss=0.5015, simple_loss=0.4689, pruned_loss=0.2722, over 19546.00 frames. ], tot_loss[loss=0.558, simple_loss=0.5065, pruned_loss=0.3323, over 3461863.90 frames. 
], batch size: 102, lr: 3.94e-02, grad_scale: 8.0 2023-06-23 08:42:37,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=6886.666666666667, ans=0.23113333333333333 2023-06-23 08:42:39,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=6886.666666666667, ans=0.23113333333333333 2023-06-23 08:43:09,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=7020.0, ans=0.03741666666666667 2023-06-23 08:43:46,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=7100.0, ans=0.00932608695652174 2023-06-23 08:43:52,960 INFO [train.py:1008] (2/4) Epoch 3, batch 0, loss[loss=0.4973, simple_loss=0.4482, pruned_loss=0.2907, over 19767.00 frames. ], tot_loss[loss=0.4973, simple_loss=0.4482, pruned_loss=0.2907, over 19767.00 frames. ], batch size: 293, lr: 3.84e-02, grad_scale: 16.0 2023-06-23 08:43:52,960 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 08:43:58,458 INFO [train.py:1040] (2/4) Epoch 3, validation: loss=0.4015, simple_loss=0.4171, pruned_loss=0.1648, over 143649.00 frames. 2023-06-23 08:43:58,459 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 08:44:27,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=7166.666666666667, ans=0.1640625 2023-06-23 08:44:32,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=7233.333333333333, ans=0.22766666666666668 2023-06-23 08:44:46,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.42 vs. limit=10.2125 2023-06-23 08:44:55,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=7300.0, ans=0.22699999999999998 2023-06-23 08:44:59,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=12.64 vs. limit=12.975 2023-06-23 08:45:00,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=7300.0, ans=0.15781250000000002 2023-06-23 08:45:09,078 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.685e+02 2.749e+02 5.394e+02 7.566e+02 2.513e+03, threshold=1.079e+03, percent-clipped=34.0 2023-06-23 08:45:16,001 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=12.73 vs. limit=13.025 2023-06-23 08:45:24,883 INFO [train.py:1008] (2/4) Epoch 3, batch 50, loss[loss=0.5154, simple_loss=0.4806, pruned_loss=0.2804, over 19862.00 frames. ], tot_loss[loss=0.501, simple_loss=0.4666, pruned_loss=0.2737, over 843424.79 frames. 
], batch size: 120, lr: 3.83e-02, grad_scale: 8.0 2023-06-23 08:45:48,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=7500.0, ans=0.1484375 2023-06-23 08:45:52,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=7500.0, ans=0.035416666666666666 2023-06-23 08:45:54,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=7500.0, ans=0.3125 2023-06-23 08:45:56,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=7500.0, ans=0.00923913043478261 2023-06-23 08:46:31,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=7700.0, ans=0.13906249999999998 2023-06-23 08:46:50,091 INFO [train.py:1008] (2/4) Epoch 3, batch 100, loss[loss=0.457, simple_loss=0.4291, pruned_loss=0.245, over 20555.00 frames. ], tot_loss[loss=0.4929, simple_loss=0.4613, pruned_loss=0.2663, over 1502690.79 frames. ], batch size: 173, lr: 3.83e-02, grad_scale: 8.0 2023-06-23 08:47:07,136 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.35 vs. limit=7.133333333333333 2023-06-23 08:47:08,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=7833.333333333333, ans=0.0 2023-06-23 08:47:16,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=7833.333333333333, ans=0.03402777777777778 2023-06-23 08:47:32,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=7900.0, ans=0.6234999999999999 2023-06-23 08:47:57,778 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.817e+02 3.094e+02 5.025e+02 7.016e+02 1.761e+03, threshold=1.005e+03, percent-clipped=8.0 2023-06-23 08:48:14,411 INFO [train.py:1008] (2/4) Epoch 3, batch 150, loss[loss=0.5147, simple_loss=0.4909, pruned_loss=0.2674, over 18323.00 frames. ], tot_loss[loss=0.4871, simple_loss=0.458, pruned_loss=0.2606, over 2020783.28 frames. ], batch size: 72, lr: 3.83e-02, grad_scale: 8.0 2023-06-23 08:48:26,233 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=2.871e-03 2023-06-23 08:48:29,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=8166.666666666667, ans=0.21833333333333332 2023-06-23 08:48:47,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=8233.333333333334, ans=10.0 2023-06-23 08:49:37,833 INFO [train.py:1008] (2/4) Epoch 3, batch 200, loss[loss=0.4325, simple_loss=0.4168, pruned_loss=0.2204, over 20533.00 frames. ], tot_loss[loss=0.4799, simple_loss=0.4528, pruned_loss=0.2548, over 2424245.59 frames. 
], batch size: 160, lr: 3.83e-02, grad_scale: 8.0 2023-06-23 08:49:48,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=8433.333333333334, ans=0.009036231884057971 2023-06-23 08:50:45,438 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 3.128e+02 5.211e+02 8.719e+02 2.248e+03, threshold=1.042e+03, percent-clipped=14.0 2023-06-23 08:50:45,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=8700.0, ans=0.030416666666666668 2023-06-23 08:51:00,825 INFO [train.py:1008] (2/4) Epoch 3, batch 250, loss[loss=0.4608, simple_loss=0.4365, pruned_loss=0.2426, over 20601.00 frames. ], tot_loss[loss=0.4741, simple_loss=0.4496, pruned_loss=0.2494, over 2713892.02 frames. ], batch size: 189, lr: 3.83e-02, grad_scale: 8.0 2023-06-23 08:51:11,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=8766.666666666666, ans=0.125 2023-06-23 08:51:20,224 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.42 vs. limit=4.325 2023-06-23 08:51:23,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=8833.333333333334, ans=0.02986111111111111 2023-06-23 08:51:25,068 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.24 vs. limit=14.125 2023-06-23 08:51:57,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.22 vs. limit=7.241666666666666 2023-06-23 08:52:08,886 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=5.99 vs. limit=10.8875 2023-06-23 08:52:21,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=9033.333333333334, ans=0.125 2023-06-23 08:52:25,326 INFO [train.py:1008] (2/4) Epoch 3, batch 300, loss[loss=0.4257, simple_loss=0.4167, pruned_loss=0.2117, over 18796.00 frames. ], tot_loss[loss=0.4698, simple_loss=0.4474, pruned_loss=0.2453, over 2942429.20 frames. ], batch size: 83, lr: 3.82e-02, grad_scale: 8.0 2023-06-23 08:52:35,432 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.47 vs. 
limit=7.275 2023-06-23 08:52:46,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=9166.666666666666, ans=0.125 2023-06-23 08:52:46,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=9166.666666666666, ans=0.125 2023-06-23 08:52:59,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=9233.333333333334, ans=0.125 2023-06-23 08:53:12,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=9300.0, ans=0.125 2023-06-23 08:53:14,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=9300.0, ans=0.125 2023-06-23 08:53:33,153 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 3.201e+02 4.697e+02 7.108e+02 1.461e+03, threshold=9.395e+02, percent-clipped=6.0 2023-06-23 08:53:35,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=9366.666666666666, ans=10.0 2023-06-23 08:53:47,522 INFO [train.py:1008] (2/4) Epoch 3, batch 350, loss[loss=0.4311, simple_loss=0.4139, pruned_loss=0.2221, over 20286.00 frames. ], tot_loss[loss=0.463, simple_loss=0.4424, pruned_loss=0.2403, over 3143063.49 frames. ], batch size: 149, lr: 3.82e-02, grad_scale: 8.0 2023-06-23 08:54:50,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=9633.333333333334, ans=0.008775362318840579 2023-06-23 08:55:01,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=9700.0, ans=0.026250000000000002 2023-06-23 08:55:06,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=9700.0, ans=0.203 2023-06-23 08:55:11,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=9766.666666666666, ans=0.025972222222222226 2023-06-23 08:55:12,713 INFO [train.py:1008] (2/4) Epoch 3, batch 400, loss[loss=0.4207, simple_loss=0.4175, pruned_loss=0.2056, over 19810.00 frames. ], tot_loss[loss=0.457, simple_loss=0.4389, pruned_loss=0.2354, over 3293488.47 frames. ], batch size: 115, lr: 3.82e-02, grad_scale: 16.0 2023-06-23 08:55:59,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=9900.0, ans=0.07 2023-06-23 08:56:14,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=9966.666666666666, ans=0.008702898550724638 2023-06-23 08:56:22,512 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.727e+02 2.338e+02 3.683e+02 5.824e+02 1.434e+03, threshold=7.366e+02, percent-clipped=14.0 2023-06-23 08:56:23,462 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.98 vs. limit=15.025 2023-06-23 08:56:24,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=10033.333333333334, ans=0.125 2023-06-23 08:56:31,829 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.01 vs. 
limit=11.2625 2023-06-23 08:56:37,655 INFO [train.py:1008] (2/4) Epoch 3, batch 450, loss[loss=0.4263, simple_loss=0.4198, pruned_loss=0.2117, over 18943.00 frames. ], tot_loss[loss=0.4533, simple_loss=0.4369, pruned_loss=0.2323, over 3403655.07 frames. ], batch size: 86, lr: 3.82e-02, grad_scale: 16.0 2023-06-23 08:56:47,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=10100.0, ans=0.125 2023-06-23 08:57:16,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=10233.333333333334, ans=0.5418333333333334 2023-06-23 08:57:24,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=10233.333333333334, ans=0.125 2023-06-23 08:57:48,613 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=7.54 vs. limit=7.591666666666667 2023-06-23 08:58:00,675 INFO [train.py:1008] (2/4) Epoch 3, batch 500, loss[loss=0.453, simple_loss=0.4546, pruned_loss=0.2191, over 16615.00 frames. ], tot_loss[loss=0.4473, simple_loss=0.4327, pruned_loss=0.2281, over 3502035.86 frames. ], batch size: 59, lr: 3.81e-02, grad_scale: 16.0 2023-06-23 08:58:23,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=10500.0, ans=0.125 2023-06-23 08:58:41,226 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=11.4625 2023-06-23 08:59:14,634 INFO [train.py:1008] (2/4) Epoch 4, batch 0, loss[loss=0.4476, simple_loss=0.4405, pruned_loss=0.2235, over 19111.00 frames. ], tot_loss[loss=0.4476, simple_loss=0.4405, pruned_loss=0.2235, over 19111.00 frames. ], batch size: 94, lr: 3.66e-02, grad_scale: 32.0 2023-06-23 08:59:14,635 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 08:59:20,232 INFO [train.py:1040] (2/4) Epoch 4, validation: loss=0.3166, simple_loss=0.3753, pruned_loss=0.1114, over 143649.00 frames. 2023-06-23 08:59:20,232 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 08:59:36,444 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.697e+02 2.731e+02 4.204e+02 6.792e+02 1.679e+03, threshold=8.407e+02, percent-clipped=23.0 2023-06-23 08:59:42,105 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 08:59:47,860 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.70 vs. limit=4.607 2023-06-23 09:00:24,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=10846.666666666666, ans=0.021472222222222226 2023-06-23 09:00:43,459 INFO [train.py:1008] (2/4) Epoch 4, batch 50, loss[loss=0.4045, simple_loss=0.4121, pruned_loss=0.1927, over 12570.00 frames. ], tot_loss[loss=0.4228, simple_loss=0.4186, pruned_loss=0.2096, over 838396.20 frames. ], batch size: 35, lr: 3.66e-02, grad_scale: 16.0 2023-06-23 09:01:28,121 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.74 vs. 
limit=11.6675 2023-06-23 09:01:35,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=11180.0, ans=0.125 2023-06-23 09:01:43,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=11180.0, ans=0.18819999999999998 2023-06-23 09:01:45,084 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.00 vs. limit=15.885 2023-06-23 09:01:52,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=11246.666666666666, ans=0.125 2023-06-23 09:02:07,135 INFO [train.py:1008] (2/4) Epoch 4, batch 100, loss[loss=0.4093, simple_loss=0.4144, pruned_loss=0.1977, over 19679.00 frames. ], tot_loss[loss=0.4189, simple_loss=0.4155, pruned_loss=0.2076, over 1513486.30 frames. ], batch size: 110, lr: 3.66e-02, grad_scale: 16.0 2023-06-23 09:02:07,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=11313.333333333334, ans=0.019527777777777776 2023-06-23 09:02:20,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=11313.333333333334, ans=0.8631333333333333 2023-06-23 09:02:23,927 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.696e+02 2.593e+02 4.190e+02 6.338e+02 1.351e+03, threshold=8.380e+02, percent-clipped=12.0 2023-06-23 09:02:36,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=11380.0, ans=0.125 2023-06-23 09:02:47,308 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:03:12,956 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.00 vs. limit=16.185000000000002 2023-06-23 09:03:30,201 INFO [train.py:1008] (2/4) Epoch 4, batch 150, loss[loss=0.4161, simple_loss=0.4368, pruned_loss=0.192, over 18327.00 frames. ], tot_loss[loss=0.4171, simple_loss=0.4171, pruned_loss=0.2048, over 2011815.39 frames. ], batch size: 72, lr: 3.66e-02, grad_scale: 16.0 2023-06-23 09:03:41,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=11646.666666666666, ans=0.8664666666666666 2023-06-23 09:03:46,486 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.82 vs. 
limit=11.8925 2023-06-23 09:03:57,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=11713.333333333334, ans=0.125 2023-06-23 09:03:57,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11713.333333333334, ans=0.18286666666666668 2023-06-23 09:04:07,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=11780.0, ans=0.125 2023-06-23 09:04:14,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=11780.0, ans=0.017583333333333333 2023-06-23 09:04:21,446 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.62 vs. limit=4.777 2023-06-23 09:04:25,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=11846.666666666666, ans=0.008294202898550724 2023-06-23 09:04:46,366 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:04:51,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=11913.333333333334, ans=0.4830333333333333 2023-06-23 09:04:51,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=11913.333333333334, ans=0.008279710144927536 2023-06-23 09:04:54,287 INFO [train.py:1008] (2/4) Epoch 4, batch 200, loss[loss=0.4245, simple_loss=0.4265, pruned_loss=0.2087, over 19331.00 frames. ], tot_loss[loss=0.4139, simple_loss=0.4151, pruned_loss=0.2028, over 2405828.55 frames. ], batch size: 98, lr: 3.65e-02, grad_scale: 16.0 2023-06-23 09:05:11,074 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.595e+02 3.697e+02 5.580e+02 1.285e+03, threshold=7.394e+02, percent-clipped=7.0 2023-06-23 09:05:13,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=12046.666666666666, ans=0.01647222222222223 2023-06-23 09:05:40,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=12113.333333333334, ans=0.125 2023-06-23 09:05:51,397 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.58 vs. limit=16.634999999999998 2023-06-23 09:05:53,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=12180.0, ans=0.17819999999999997 2023-06-23 09:06:12,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=12246.666666666666, ans=0.008207246376811594 2023-06-23 09:06:19,935 INFO [train.py:1008] (2/4) Epoch 4, batch 250, loss[loss=0.4258, simple_loss=0.4313, pruned_loss=0.208, over 18801.00 frames. ], tot_loss[loss=0.4122, simple_loss=0.4136, pruned_loss=0.2024, over 2718963.08 frames. ], batch size: 83, lr: 3.65e-02, grad_scale: 16.0 2023-06-23 09:06:32,551 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.50 vs. 
limit=4.8469999999999995 2023-06-23 09:06:58,370 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.66 vs. limit=12.1675 2023-06-23 09:07:02,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=12446.666666666666, ans=0.008163768115942029 2023-06-23 09:07:27,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=12580.0, ans=0.1742 2023-06-23 09:07:44,453 INFO [train.py:1008] (2/4) Epoch 4, batch 300, loss[loss=0.4031, simple_loss=0.4124, pruned_loss=0.1954, over 19441.00 frames. ], tot_loss[loss=0.4088, simple_loss=0.4124, pruned_loss=0.1998, over 2952144.04 frames. ], batch size: 105, lr: 3.65e-02, grad_scale: 16.0 2023-06-23 09:08:00,652 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.988e+02 4.215e+02 5.995e+02 1.282e+03, threshold=8.430e+02, percent-clipped=15.0 2023-06-23 09:08:01,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=12713.333333333334, ans=0.125 2023-06-23 09:08:28,059 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.06 vs. limit=17.085 2023-06-23 09:09:09,839 INFO [train.py:1008] (2/4) Epoch 4, batch 350, loss[loss=0.405, simple_loss=0.4036, pruned_loss=0.2027, over 20276.00 frames. ], tot_loss[loss=0.4032, simple_loss=0.4091, pruned_loss=0.1963, over 3162504.20 frames. ], batch size: 239, lr: 3.64e-02, grad_scale: 16.0 2023-06-23 09:09:42,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=13113.333333333334, ans=0.16886666666666666 2023-06-23 09:09:58,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=13180.0, ans=0.011750000000000003 2023-06-23 09:10:34,564 INFO [train.py:1008] (2/4) Epoch 4, batch 400, loss[loss=0.3867, simple_loss=0.3903, pruned_loss=0.1915, over 20204.00 frames. ], tot_loss[loss=0.3995, simple_loss=0.4074, pruned_loss=0.194, over 3297361.35 frames. ], batch size: 239, lr: 3.64e-02, grad_scale: 32.0 2023-06-23 09:10:36,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=13313.333333333334, ans=10.0 2023-06-23 09:10:51,309 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.848e+02 2.547e+02 3.557e+02 5.280e+02 1.006e+03, threshold=7.113e+02, percent-clipped=4.0 2023-06-23 09:10:53,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.59 vs. limit=12.5175 2023-06-23 09:11:22,200 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=9.95 vs. limit=11.723333333333333 2023-06-23 09:11:49,851 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=13.15 vs. limit=12.592500000000001 2023-06-23 09:11:59,714 INFO [train.py:1008] (2/4) Epoch 4, batch 450, loss[loss=0.3862, simple_loss=0.4002, pruned_loss=0.1861, over 20261.00 frames. 
], tot_loss[loss=0.3962, simple_loss=0.4055, pruned_loss=0.1921, over 3409477.45 frames. ], batch size: 141, lr: 3.64e-02, grad_scale: 32.0 2023-06-23 09:12:05,557 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.54 vs. limit=17.735 2023-06-23 09:12:12,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=13646.666666666666, ans=0.007902898550724638 2023-06-23 09:12:42,729 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.25 vs. limit=12.6675 2023-06-23 09:12:52,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=13846.666666666666, ans=0.4153666666666667 2023-06-23 09:13:21,458 INFO [train.py:1008] (2/4) Epoch 4, batch 500, loss[loss=0.3991, simple_loss=0.4165, pruned_loss=0.1908, over 10689.00 frames. ], tot_loss[loss=0.3932, simple_loss=0.4052, pruned_loss=0.1896, over 3478310.05 frames. ], batch size: 30, lr: 3.63e-02, grad_scale: 32.0 2023-06-23 09:13:21,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=13980.0, ans=0.4107 2023-06-23 09:13:37,512 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 3.041e+02 4.268e+02 6.398e+02 1.193e+03, threshold=8.536e+02, percent-clipped=18.0 2023-06-23 09:13:38,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=14046.666666666666, ans=0.125 2023-06-23 09:13:42,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=14046.666666666666, ans=0.00813888888888889 2023-06-23 09:13:44,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=14046.666666666666, ans=0.125 2023-06-23 09:13:49,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=14046.666666666666, ans=0.125 2023-06-23 09:14:38,036 INFO [train.py:1008] (2/4) Epoch 5, batch 0, loss[loss=0.3686, simple_loss=0.3834, pruned_loss=0.1769, over 20695.00 frames. ], tot_loss[loss=0.3686, simple_loss=0.3834, pruned_loss=0.1769, over 20695.00 frames. ], batch size: 211, lr: 3.47e-02, grad_scale: 32.0 2023-06-23 09:14:38,037 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 09:14:43,505 INFO [train.py:1040] (2/4) Epoch 5, validation: loss=0.2696, simple_loss=0.3565, pruned_loss=0.09131, over 143649.00 frames. 2023-06-23 09:14:43,505 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 09:14:58,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=14193.333333333334, ans=0.007784057971014493 2023-06-23 09:15:13,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.02 vs. limit=12.8475 2023-06-23 09:16:07,691 INFO [train.py:1008] (2/4) Epoch 5, batch 50, loss[loss=0.3727, simple_loss=0.3915, pruned_loss=0.177, over 19965.00 frames. ], tot_loss[loss=0.3733, simple_loss=0.3947, pruned_loss=0.1759, over 864097.39 frames. 
], batch size: 126, lr: 3.46e-02, grad_scale: 32.0 2023-06-23 09:16:55,806 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.912e+02 2.663e+02 3.585e+02 4.767e+02 1.366e+03, threshold=7.170e+02, percent-clipped=3.0 2023-06-23 09:17:05,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=14726.666666666666, ans=0.125 2023-06-23 09:17:14,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=14726.666666666666, ans=0.007668115942028986 2023-06-23 09:17:17,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=14793.333333333334, ans=0.007653623188405797 2023-06-23 09:17:21,088 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.74 vs. limit=13.0475 2023-06-23 09:17:27,961 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.01 vs. limit=13.0475 2023-06-23 09:17:33,170 INFO [train.py:1008] (2/4) Epoch 5, batch 100, loss[loss=0.3987, simple_loss=0.4301, pruned_loss=0.1837, over 17641.00 frames. ], tot_loss[loss=0.3723, simple_loss=0.393, pruned_loss=0.1758, over 1508320.31 frames. ], batch size: 67, lr: 3.46e-02, grad_scale: 32.0 2023-06-23 09:17:56,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=14926.666666666666, ans=0.3775666666666667 2023-06-23 09:17:56,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=14926.666666666666, ans=0.125 2023-06-23 09:18:07,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.15 vs. limit=13.122499999999999 2023-06-23 09:18:08,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=14993.333333333334, ans=0.125 2023-06-23 09:18:16,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=14993.333333333334, ans=0.125 2023-06-23 09:18:28,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=15060.0, ans=0.125 2023-06-23 09:18:37,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=15060.0, ans=0.1494 2023-06-23 09:18:41,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=15126.666666666666, ans=0.003638888888888886 2023-06-23 09:18:54,996 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=13.1725 2023-06-23 09:18:59,177 INFO [train.py:1008] (2/4) Epoch 5, batch 150, loss[loss=0.384, simple_loss=0.4165, pruned_loss=0.1758, over 16847.00 frames. ], tot_loss[loss=0.3733, simple_loss=0.3937, pruned_loss=0.1765, over 1996337.10 frames. 
], batch size: 59, lr: 3.46e-02, grad_scale: 32.0 2023-06-23 09:19:11,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=15193.333333333334, ans=0.125 2023-06-23 09:19:30,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=15260.0, ans=0.125 2023-06-23 09:19:32,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=15326.666666666666, ans=0.125 2023-06-23 09:19:33,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=15326.666666666666, ans=0.125 2023-06-23 09:19:46,748 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.580e+02 3.352e+02 4.730e+02 1.164e+03, threshold=6.705e+02, percent-clipped=8.0 2023-06-23 09:20:00,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=15393.333333333334, ans=0.025 2023-06-23 09:20:24,362 INFO [train.py:1008] (2/4) Epoch 5, batch 200, loss[loss=0.3736, simple_loss=0.3834, pruned_loss=0.1819, over 20680.00 frames. ], tot_loss[loss=0.371, simple_loss=0.3917, pruned_loss=0.1752, over 2390364.45 frames. ], batch size: 211, lr: 3.45e-02, grad_scale: 32.0 2023-06-23 09:20:31,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=15526.666666666666, ans=0.09899494936611666 2023-06-23 09:20:57,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=15660.0, ans=0.0014166666666666702 2023-06-23 09:21:04,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=15660.0, ans=0.3519 2023-06-23 09:21:28,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=15726.666666666666, ans=19.295 2023-06-23 09:21:50,768 INFO [train.py:1008] (2/4) Epoch 5, batch 250, loss[loss=0.3844, simple_loss=0.3905, pruned_loss=0.1891, over 20252.00 frames. ], tot_loss[loss=0.3692, simple_loss=0.3901, pruned_loss=0.1742, over 2708756.64 frames. ], batch size: 239, lr: 3.45e-02, grad_scale: 32.0 2023-06-23 09:22:39,216 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.807e+02 3.851e+02 5.722e+02 1.154e+03, threshold=7.701e+02, percent-clipped=13.0 2023-06-23 09:22:51,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=16060.0, ans=0.3379 2023-06-23 09:23:11,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=16126.666666666666, ans=0.125 2023-06-23 09:23:15,785 INFO [train.py:1008] (2/4) Epoch 5, batch 300, loss[loss=0.3674, simple_loss=0.3831, pruned_loss=0.1759, over 20723.00 frames. ], tot_loss[loss=0.3668, simple_loss=0.389, pruned_loss=0.1723, over 2954922.78 frames. ], batch size: 211, lr: 3.45e-02, grad_scale: 32.0 2023-06-23 09:23:21,115 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.66 vs. 
limit=10.477333333333334 2023-06-23 09:23:55,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=16326.666666666666, ans=0.125 2023-06-23 09:24:30,645 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=11.23 vs. limit=13.23 2023-06-23 09:24:33,827 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.34 vs. limit=5.468999999999999 2023-06-23 09:24:41,032 INFO [train.py:1008] (2/4) Epoch 5, batch 350, loss[loss=0.347, simple_loss=0.3672, pruned_loss=0.1635, over 20661.00 frames. ], tot_loss[loss=0.3639, simple_loss=0.3869, pruned_loss=0.1705, over 3156568.89 frames. ], batch size: 211, lr: 3.44e-02, grad_scale: 32.0 2023-06-23 09:24:53,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=16526.666666666668, ans=0.125 2023-06-23 09:25:14,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=16660.0, ans=0.125 2023-06-23 09:25:23,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=16660.0, ans=0.125 2023-06-23 09:25:23,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=16660.0, ans=0.31690000000000007 2023-06-23 09:25:28,094 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.487e+02 3.376e+02 4.312e+02 1.022e+03, threshold=6.752e+02, percent-clipped=3.0 2023-06-23 09:25:35,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=16726.666666666668, ans=0.125 2023-06-23 09:25:51,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=16793.333333333332, ans=0.007218840579710145 2023-06-23 09:26:06,045 INFO [train.py:1008] (2/4) Epoch 5, batch 400, loss[loss=0.3304, simple_loss=0.3667, pruned_loss=0.1471, over 19362.00 frames. ], tot_loss[loss=0.365, simple_loss=0.3889, pruned_loss=0.1705, over 3274772.54 frames. ], batch size: 98, lr: 3.44e-02, grad_scale: 32.0 2023-06-23 09:26:47,052 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.38 vs. limit=13.872499999999999 2023-06-23 09:26:47,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=16993.333333333332, ans=0.00717536231884058 2023-06-23 09:26:52,385 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:27:32,076 INFO [train.py:1008] (2/4) Epoch 5, batch 450, loss[loss=0.3669, simple_loss=0.3894, pruned_loss=0.1722, over 19514.00 frames. ], tot_loss[loss=0.3629, simple_loss=0.3876, pruned_loss=0.1691, over 3399630.04 frames. 
], batch size: 102, lr: 3.44e-02, grad_scale: 32.0 2023-06-23 09:27:37,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=17193.333333333332, ans=0.29823333333333346 2023-06-23 09:28:00,166 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.60 vs. limit=9.315000000000001 2023-06-23 09:28:08,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=17326.666666666668, ans=0.0 2023-06-23 09:28:19,069 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.791e+02 2.771e+02 3.568e+02 4.658e+02 8.532e+02, threshold=7.136e+02, percent-clipped=9.0 2023-06-23 09:28:19,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=17326.666666666668, ans=0.125 2023-06-23 09:28:33,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=17393.333333333332, ans=0.29123333333333346 2023-06-23 09:28:41,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=17460.0, ans=0.125 2023-06-23 09:28:53,954 INFO [train.py:1008] (2/4) Epoch 5, batch 500, loss[loss=0.3663, simple_loss=0.39, pruned_loss=0.1713, over 19817.00 frames. ], tot_loss[loss=0.3611, simple_loss=0.3868, pruned_loss=0.1677, over 3490612.20 frames. ], batch size: 115, lr: 3.43e-02, grad_scale: 32.0 2023-06-23 09:29:10,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=17593.333333333332, ans=0.12406666666666669 2023-06-23 09:29:25,645 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=27.75 vs. limit=20.744999999999997 2023-06-23 09:29:27,440 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=13.18 vs. limit=14.122499999999999 2023-06-23 09:30:09,887 INFO [train.py:1008] (2/4) Epoch 6, batch 0, loss[loss=0.3357, simple_loss=0.3764, pruned_loss=0.1475, over 18481.00 frames. ], tot_loss[loss=0.3357, simple_loss=0.3764, pruned_loss=0.1475, over 18481.00 frames. ], batch size: 77, lr: 3.27e-02, grad_scale: 32.0 2023-06-23 09:30:09,887 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 09:30:15,553 INFO [train.py:1040] (2/4) Epoch 6, validation: loss=0.257, simple_loss=0.3485, pruned_loss=0.08271, over 143649.00 frames. 
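Note on the [optim.py:471] entries scattered through this log: each reports Clipping_scale, five grad-norm quartiles, a clipping threshold, and percent-clipped. In every such entry the printed threshold matches Clipping_scale times the middle quartile, i.e. twice the median of the recent per-batch gradient norms, up to display rounding (2.0 x 3.499e+02 = 6.998e+02, 2.0 x 5.394e+02 = 1.079e+03, 2.0 x 5.025e+02 = 1.005e+03), and percent-clipped is presumably the share of recent batches whose gradient norm exceeded that threshold. The minimal Python sketch below only reproduces that arithmetic from the logged numbers; the function and variable names are hypothetical and are not taken from the recipe's optim.py.

import numpy as np

def summarize_grad_norms(recent_grad_norms, clipping_scale=2.0):
    # Hypothetical helper: quartile summary of recent per-batch gradient norms,
    # with a clipping threshold set to clipping_scale * median. This is the
    # relationship the logged "grad-norm quartiles ... threshold=..." lines
    # are consistent with, not a copy of the recipe's actual implementation.
    quartiles = np.percentile(recent_grad_norms, [0, 25, 50, 75, 100])
    threshold = clipping_scale * quartiles[2]  # 2 * median for the logged runs
    return quartiles, threshold

# Consistency check against one logged entry
# (quartiles 1.562e+02 2.476e+02 3.499e+02 6.035e+02 3.649e+03, threshold=6.998e+02):
assert abs(2.0 * 3.499e+02 - 6.998e+02) < 1e-3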
2023-06-23 09:30:15,553 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 09:30:35,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=17813.333333333332, ans=0.2765333333333334 2023-06-23 09:30:37,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=17813.333333333332, ans=0.125 2023-06-23 09:30:44,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=17813.333333333332, ans=0.006997101449275362 2023-06-23 09:31:00,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=17880.0, ans=0.2742 2023-06-23 09:31:11,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=17946.666666666668, ans=0.12053333333333333 2023-06-23 09:31:26,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=18013.333333333332, ans=0.11986666666666668 2023-06-23 09:31:30,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18013.333333333332, ans=0.11986666666666668 2023-06-23 09:31:31,229 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.890e+02 3.718e+02 5.042e+02 9.661e+02, threshold=7.435e+02, percent-clipped=9.0 2023-06-23 09:31:31,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=18013.333333333332, ans=0.125 2023-06-23 09:31:40,608 INFO [train.py:1008] (2/4) Epoch 6, batch 50, loss[loss=0.37, simple_loss=0.3828, pruned_loss=0.1786, over 20243.00 frames. ], tot_loss[loss=0.3511, simple_loss=0.3776, pruned_loss=0.1623, over 864146.06 frames. ], batch size: 239, lr: 3.26e-02, grad_scale: 32.0 2023-06-23 09:32:03,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=18146.666666666668, ans=0.125 2023-06-23 09:32:15,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=18213.333333333332, ans=0.125 2023-06-23 09:32:15,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=18213.333333333332, ans=0.125 2023-06-23 09:32:23,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=18213.333333333332, ans=0.006910144927536232 2023-06-23 09:32:29,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18280.0, ans=0.1172 2023-06-23 09:32:40,639 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.89 vs. 
limit=21.21 2023-06-23 09:32:46,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=18346.666666666668, ans=0.125 2023-06-23 09:32:52,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=18346.666666666668, ans=0.4752 2023-06-23 09:33:03,836 INFO [train.py:1008] (2/4) Epoch 6, batch 100, loss[loss=0.3435, simple_loss=0.3819, pruned_loss=0.1525, over 19208.00 frames. ], tot_loss[loss=0.3497, simple_loss=0.3795, pruned_loss=0.16, over 1514744.23 frames. ], batch size: 92, lr: 3.26e-02, grad_scale: 32.0 2023-06-23 09:33:07,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=18413.333333333332, ans=0.125 2023-06-23 09:33:36,758 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.27 vs. limit=21.41 2023-06-23 09:34:20,197 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.573e+02 3.085e+02 4.128e+02 9.767e+02, threshold=6.169e+02, percent-clipped=1.0 2023-06-23 09:34:28,052 INFO [train.py:1008] (2/4) Epoch 6, batch 150, loss[loss=0.418, simple_loss=0.4325, pruned_loss=0.2017, over 16794.00 frames. ], tot_loss[loss=0.3495, simple_loss=0.3792, pruned_loss=0.1599, over 2018065.94 frames. ], batch size: 59, lr: 3.25e-02, grad_scale: 32.0 2023-06-23 09:34:30,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=18746.666666666668, ans=0.07 2023-06-23 09:34:40,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=18746.666666666668, ans=0.006794202898550724 2023-06-23 09:35:13,880 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:35:50,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=19013.333333333332, ans=0.125 2023-06-23 09:35:52,853 INFO [train.py:1008] (2/4) Epoch 6, batch 200, loss[loss=0.3387, simple_loss=0.3641, pruned_loss=0.1566, over 20526.00 frames. ], tot_loss[loss=0.3472, simple_loss=0.3779, pruned_loss=0.1582, over 2428993.68 frames. ], batch size: 173, lr: 3.25e-02, grad_scale: 32.0 2023-06-23 09:35:56,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=19080.0, ans=0.125 2023-06-23 09:35:58,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=19080.0, ans=0.125 2023-06-23 09:36:12,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=19146.666666666668, ans=0.006707246376811594 2023-06-23 09:36:27,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=19213.333333333332, ans=0.0 2023-06-23 09:36:39,416 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.72 vs. 
limit=9.803333333333333 2023-06-23 09:36:41,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=19280.0, ans=0.125 2023-06-23 09:36:50,163 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.59 vs. limit=21.96 2023-06-23 09:36:52,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer_ff2.min_abs, batch_count=19280.0, ans=0.1 2023-06-23 09:36:54,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=19280.0, ans=0.4892 2023-06-23 09:37:08,623 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.678e+02 3.588e+02 4.915e+02 9.806e+02, threshold=7.176e+02, percent-clipped=9.0 2023-06-23 09:37:15,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=19413.333333333332, ans=0.125 2023-06-23 09:37:17,030 INFO [train.py:1008] (2/4) Epoch 6, batch 250, loss[loss=0.344, simple_loss=0.3917, pruned_loss=0.1481, over 16017.00 frames. ], tot_loss[loss=0.3456, simple_loss=0.3773, pruned_loss=0.1569, over 2732089.98 frames. ], batch size: 51, lr: 3.25e-02, grad_scale: 32.0 2023-06-23 09:37:31,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19413.333333333332, ans=0.10586666666666669 2023-06-23 09:37:44,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=19480.0, ans=0.125 2023-06-23 09:38:01,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=19546.666666666668, ans=0.125 2023-06-23 09:38:03,267 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.67 vs. limit=14.83 2023-06-23 09:38:10,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=19613.333333333332, ans=0.0 2023-06-23 09:38:14,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=19613.333333333332, ans=0.006605797101449276 2023-06-23 09:38:21,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=19613.333333333332, ans=0.125 2023-06-23 09:38:31,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=19680.0, ans=0.10320000000000001 2023-06-23 09:38:32,262 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.18 vs. limit=5.952 2023-06-23 09:38:41,165 INFO [train.py:1008] (2/4) Epoch 6, batch 300, loss[loss=0.3387, simple_loss=0.3804, pruned_loss=0.1485, over 18441.00 frames. ], tot_loss[loss=0.3463, simple_loss=0.3781, pruned_loss=0.1573, over 2950381.58 frames. 
], batch size: 77, lr: 3.24e-02, grad_scale: 32.0 2023-06-23 09:38:55,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=19813.333333333332, ans=0.0 2023-06-23 09:39:18,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=19880.0, ans=0.10120000000000001 2023-06-23 09:39:47,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=20013.333333333332, ans=0.1 2023-06-23 09:39:56,399 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.604e+02 3.378e+02 4.786e+02 8.678e+02, threshold=6.756e+02, percent-clipped=6.0 2023-06-23 09:39:57,766 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.59 vs. limit=6.0 2023-06-23 09:40:05,039 INFO [train.py:1008] (2/4) Epoch 6, batch 350, loss[loss=0.3449, simple_loss=0.3813, pruned_loss=0.1543, over 19827.00 frames. ], tot_loss[loss=0.3452, simple_loss=0.3764, pruned_loss=0.157, over 3131483.55 frames. ], batch size: 115, lr: 3.24e-02, grad_scale: 32.0 2023-06-23 09:40:10,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=20080.0, ans=0.125 2023-06-23 09:40:38,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=20213.333333333332, ans=0.125 2023-06-23 09:41:20,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=20346.666666666668, ans=0.0 2023-06-23 09:41:28,287 INFO [train.py:1008] (2/4) Epoch 6, batch 400, loss[loss=0.3387, simple_loss=0.3698, pruned_loss=0.1538, over 20298.00 frames. ], tot_loss[loss=0.345, simple_loss=0.3765, pruned_loss=0.1567, over 3275188.42 frames. ], batch size: 149, lr: 3.24e-02, grad_scale: 32.0 2023-06-23 09:41:33,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=20413.333333333332, ans=0.125 2023-06-23 09:41:43,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=20480.0, ans=0.0 2023-06-23 09:41:48,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=20480.0, ans=0.1 2023-06-23 09:42:02,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=20546.666666666668, ans=0.125 2023-06-23 09:42:42,562 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.754e+02 2.860e+02 3.612e+02 4.527e+02 7.434e+02, threshold=7.224e+02, percent-clipped=4.0 2023-06-23 09:42:52,422 INFO [train.py:1008] (2/4) Epoch 6, batch 450, loss[loss=0.3388, simple_loss=0.3633, pruned_loss=0.1572, over 20747.00 frames. ], tot_loss[loss=0.3432, simple_loss=0.3757, pruned_loss=0.1553, over 3401080.61 frames. 
], batch size: 211, lr: 3.23e-02, grad_scale: 32.0 2023-06-23 09:43:46,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=20946.666666666668, ans=0.05 2023-06-23 09:44:05,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21013.333333333332, ans=0.1 2023-06-23 09:44:12,760 INFO [train.py:1008] (2/4) Epoch 6, batch 500, loss[loss=0.3509, simple_loss=0.3616, pruned_loss=0.1701, over 19842.00 frames. ], tot_loss[loss=0.3421, simple_loss=0.3751, pruned_loss=0.1546, over 3484843.34 frames. ], batch size: 294, lr: 3.23e-02, grad_scale: 32.0 2023-06-23 09:44:17,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=21080.0, ans=0.125 2023-06-23 09:44:20,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=21080.0, ans=0.125 2023-06-23 09:44:27,286 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.17 vs. limit=6.0 2023-06-23 09:44:36,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=21146.666666666668, ans=0.125 2023-06-23 09:44:59,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=21280.0, ans=0.125 2023-06-23 09:45:26,857 INFO [train.py:1008] (2/4) Epoch 7, batch 0, loss[loss=0.3495, simple_loss=0.3943, pruned_loss=0.1523, over 17875.00 frames. ], tot_loss[loss=0.3495, simple_loss=0.3943, pruned_loss=0.1523, over 17875.00 frames. ], batch size: 68, lr: 3.07e-02, grad_scale: 32.0 2023-06-23 09:45:26,858 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 09:45:32,635 INFO [train.py:1040] (2/4) Epoch 7, validation: loss=0.2465, simple_loss=0.3396, pruned_loss=0.07665, over 143649.00 frames. 2023-06-23 09:45:32,635 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 09:45:38,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21300.0, ans=0.1 2023-06-23 09:45:53,222 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.639e+02 3.413e+02 4.875e+02 7.305e+02, threshold=6.826e+02, percent-clipped=1.0 2023-06-23 09:45:55,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=21366.666666666668, ans=0.125 2023-06-23 09:46:20,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=21433.333333333332, ans=0.125 2023-06-23 09:46:22,392 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:46:56,420 INFO [train.py:1008] (2/4) Epoch 7, batch 50, loss[loss=0.3227, simple_loss=0.3634, pruned_loss=0.141, over 19233.00 frames. ], tot_loss[loss=0.3314, simple_loss=0.3667, pruned_loss=0.1481, over 871998.90 frames. 
], batch size: 92, lr: 3.07e-02, grad_scale: 32.0 2023-06-23 09:47:04,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=21633.333333333332, ans=0.125 2023-06-23 09:47:45,906 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-06-23 09:47:52,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=21833.333333333332, ans=0.006123188405797102 2023-06-23 09:47:58,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=21833.333333333332, ans=0.09899494936611666 2023-06-23 09:48:06,693 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:48:17,469 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=6.25 vs. limit=12.0 2023-06-23 09:48:19,720 INFO [train.py:1008] (2/4) Epoch 7, batch 100, loss[loss=0.33, simple_loss=0.3628, pruned_loss=0.1486, over 20518.00 frames. ], tot_loss[loss=0.3333, simple_loss=0.3686, pruned_loss=0.1489, over 1525864.79 frames. ], batch size: 173, lr: 3.06e-02, grad_scale: 32.0 2023-06-23 09:48:40,163 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.721e+02 3.392e+02 4.507e+02 8.334e+02, threshold=6.785e+02, percent-clipped=4.0 2023-06-23 09:48:48,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=22033.333333333332, ans=0.2 2023-06-23 09:49:18,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=22166.666666666668, ans=0.1 2023-06-23 09:49:25,015 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.69 vs. limit=15.0 2023-06-23 09:49:27,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=22233.333333333332, ans=0.125 2023-06-23 09:49:42,093 INFO [train.py:1008] (2/4) Epoch 7, batch 150, loss[loss=0.357, simple_loss=0.3835, pruned_loss=0.1653, over 20272.00 frames. ], tot_loss[loss=0.3305, simple_loss=0.3676, pruned_loss=0.1467, over 2028380.19 frames. ], batch size: 141, lr: 3.06e-02, grad_scale: 32.0 2023-06-23 09:49:42,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=22300.0, ans=0.125 2023-06-23 09:49:46,154 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.99 vs. limit=12.0 2023-06-23 09:49:47,724 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.40 vs. limit=15.0 2023-06-23 09:49:55,173 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.49 vs. 
limit=15.0 2023-06-23 09:49:56,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=22300.0, ans=0.125 2023-06-23 09:50:16,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=22433.333333333332, ans=0.125 2023-06-23 09:50:36,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=22500.0, ans=0.125 2023-06-23 09:50:47,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=22566.666666666668, ans=0.1 2023-06-23 09:50:59,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=22566.666666666668, ans=0.005963768115942029 2023-06-23 09:51:02,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=22566.666666666668, ans=0.1 2023-06-23 09:51:05,080 INFO [train.py:1008] (2/4) Epoch 7, batch 200, loss[loss=0.324, simple_loss=0.3525, pruned_loss=0.1477, over 20521.00 frames. ], tot_loss[loss=0.3312, simple_loss=0.3682, pruned_loss=0.1471, over 2409148.29 frames. ], batch size: 189, lr: 3.05e-02, grad_scale: 32.0 2023-06-23 09:51:18,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=22633.333333333332, ans=0.125 2023-06-23 09:51:26,005 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.559e+02 3.231e+02 3.874e+02 6.394e+02, threshold=6.463e+02, percent-clipped=0.0 2023-06-23 09:51:31,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=22700.0, ans=0.2 2023-06-23 09:51:43,360 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=22.5 2023-06-23 09:51:46,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-06-23 09:51:59,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.max_abs, batch_count=22833.333333333332, ans=10.0 2023-06-23 09:52:07,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=22833.333333333332, ans=0.2 2023-06-23 09:52:12,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=22900.0, ans=0.125 2023-06-23 09:52:27,995 INFO [train.py:1008] (2/4) Epoch 7, batch 250, loss[loss=0.3168, simple_loss=0.3579, pruned_loss=0.1379, over 19892.00 frames. ], tot_loss[loss=0.3297, simple_loss=0.3669, pruned_loss=0.1463, over 2737579.60 frames. ], batch size: 120, lr: 3.05e-02, grad_scale: 32.0 2023-06-23 09:52:54,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=23033.333333333332, ans=0.125 2023-06-23 09:53:08,408 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=29.44 vs. 
limit=22.5 2023-06-23 09:53:21,171 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.50 vs. limit=15.0 2023-06-23 09:53:27,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=23166.666666666668, ans=0.125 2023-06-23 09:53:29,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=23166.666666666668, ans=0.0 2023-06-23 09:53:32,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=23233.333333333332, ans=0.125 2023-06-23 09:53:38,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=23233.333333333332, ans=0.125 2023-06-23 09:53:50,825 INFO [train.py:1008] (2/4) Epoch 7, batch 300, loss[loss=0.3699, simple_loss=0.3918, pruned_loss=0.174, over 20480.00 frames. ], tot_loss[loss=0.3294, simple_loss=0.3667, pruned_loss=0.146, over 2975502.92 frames. ], batch size: 160, lr: 3.05e-02, grad_scale: 32.0 2023-06-23 09:54:07,175 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.93 vs. limit=15.0 2023-06-23 09:54:09,416 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.661e+02 2.566e+02 3.013e+02 4.099e+02 6.159e+02, threshold=6.026e+02, percent-clipped=0.0 2023-06-23 09:54:35,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=23433.333333333332, ans=0.04949747468305833 2023-06-23 09:54:49,830 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2023-06-23 09:54:54,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=23566.666666666668, ans=0.07 2023-06-23 09:55:10,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=23566.666666666668, ans=0.125 2023-06-23 09:55:13,432 INFO [train.py:1008] (2/4) Epoch 7, batch 350, loss[loss=0.3717, simple_loss=0.4017, pruned_loss=0.1709, over 18463.00 frames. ], tot_loss[loss=0.3274, simple_loss=0.3654, pruned_loss=0.1447, over 3154573.32 frames. ], batch size: 77, lr: 3.04e-02, grad_scale: 32.0 2023-06-23 09:55:18,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=23633.333333333332, ans=0.2 2023-06-23 09:56:22,464 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.97 vs. limit=22.5 2023-06-23 09:56:34,423 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.69 vs. limit=15.0 2023-06-23 09:56:36,502 INFO [train.py:1008] (2/4) Epoch 7, batch 400, loss[loss=0.3074, simple_loss=0.3579, pruned_loss=0.1284, over 18453.00 frames. ], tot_loss[loss=0.3266, simple_loss=0.3647, pruned_loss=0.1443, over 3299304.99 frames. 
], batch size: 77, lr: 3.04e-02, grad_scale: 32.0 2023-06-23 09:56:51,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=24033.333333333332, ans=0.05 2023-06-23 09:56:55,721 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.878e+02 2.370e+02 2.886e+02 3.873e+02 7.478e+02, threshold=5.773e+02, percent-clipped=7.0 2023-06-23 09:57:11,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=24100.0, ans=0.005630434782608696 2023-06-23 09:57:25,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=24166.666666666668, ans=0.125 2023-06-23 09:57:33,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=24166.666666666668, ans=0.125 2023-06-23 09:57:46,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=24233.333333333332, ans=0.125 2023-06-23 09:57:48,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=24233.333333333332, ans=0.1 2023-06-23 09:57:58,899 INFO [train.py:1008] (2/4) Epoch 7, batch 450, loss[loss=0.3109, simple_loss=0.3459, pruned_loss=0.138, over 20087.00 frames. ], tot_loss[loss=0.3257, simple_loss=0.3644, pruned_loss=0.1435, over 3396664.63 frames. ], batch size: 133, lr: 3.04e-02, grad_scale: 64.0 2023-06-23 09:58:00,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=24300.0, ans=0.0 2023-06-23 09:58:14,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=24366.666666666668, ans=10.0 2023-06-23 09:58:31,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=24433.333333333332, ans=0.125 2023-06-23 09:58:32,789 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:58:41,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=24433.333333333332, ans=0.125 2023-06-23 09:58:52,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=24500.0, ans=0.125 2023-06-23 09:58:54,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=24500.0, ans=0.2 2023-06-23 09:58:56,423 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.96 vs. limit=10.0 2023-06-23 09:59:13,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24566.666666666668, ans=0.1 2023-06-23 09:59:17,629 INFO [train.py:1008] (2/4) Epoch 7, batch 500, loss[loss=0.3233, simple_loss=0.3556, pruned_loss=0.1455, over 20498.00 frames. ], tot_loss[loss=0.3243, simple_loss=0.3635, pruned_loss=0.1426, over 3489325.41 frames. 
], batch size: 160, lr: 3.03e-02, grad_scale: 64.0 2023-06-23 09:59:28,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=24633.333333333332, ans=0.125 2023-06-23 09:59:31,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=24700.0, ans=0.125 2023-06-23 09:59:36,073 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.772e+02 2.330e+02 2.795e+02 3.451e+02 4.911e+02, threshold=5.589e+02, percent-clipped=0.0 2023-06-23 09:59:39,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24700.0, ans=0.1 2023-06-23 09:59:56,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=24766.666666666668, ans=0.0 2023-06-23 10:00:00,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=24766.666666666668, ans=0.0 2023-06-23 10:00:30,074 INFO [train.py:1008] (2/4) Epoch 8, batch 0, loss[loss=0.3146, simple_loss=0.3595, pruned_loss=0.1348, over 19770.00 frames. ], tot_loss[loss=0.3146, simple_loss=0.3595, pruned_loss=0.1348, over 19770.00 frames. ], batch size: 115, lr: 2.89e-02, grad_scale: 64.0 2023-06-23 10:00:30,074 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 10:00:35,752 INFO [train.py:1040] (2/4) Epoch 8, validation: loss=0.2397, simple_loss=0.334, pruned_loss=0.07275, over 143649.00 frames. 2023-06-23 10:00:35,753 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 10:01:24,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=25046.666666666668, ans=0.125 2023-06-23 10:01:42,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=25113.333333333332, ans=0.0 2023-06-23 10:01:55,644 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:01:58,318 INFO [train.py:1008] (2/4) Epoch 8, batch 50, loss[loss=0.2966, simple_loss=0.3412, pruned_loss=0.126, over 19469.00 frames. ], tot_loss[loss=0.3146, simple_loss=0.3573, pruned_loss=0.136, over 854244.43 frames. ], batch size: 105, lr: 2.88e-02, grad_scale: 64.0 2023-06-23 10:02:21,815 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2023-06-23 10:02:49,146 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.860e+02 2.335e+02 2.730e+02 3.502e+02 6.500e+02, threshold=5.461e+02, percent-clipped=1.0 2023-06-23 10:02:56,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=25380.0, ans=0.0 2023-06-23 10:03:22,485 INFO [train.py:1008] (2/4) Epoch 8, batch 100, loss[loss=0.3292, simple_loss=0.3704, pruned_loss=0.144, over 18644.00 frames. ], tot_loss[loss=0.318, simple_loss=0.3597, pruned_loss=0.1381, over 1504643.00 frames. ], batch size: 80, lr: 2.88e-02, grad_scale: 64.0 2023-06-23 10:03:38,277 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.75 vs. 
limit=15.0 2023-06-23 10:03:41,801 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.09 vs. limit=22.5 2023-06-23 10:03:49,259 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.93 vs. limit=15.0 2023-06-23 10:04:01,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=25646.666666666668, ans=0.125 2023-06-23 10:04:15,119 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=9.24 vs. limit=15.0 2023-06-23 10:04:29,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=25780.0, ans=0.0 2023-06-23 10:04:34,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=25780.0, ans=0.125 2023-06-23 10:04:43,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=25780.0, ans=0.1 2023-06-23 10:04:45,896 INFO [train.py:1008] (2/4) Epoch 8, batch 150, loss[loss=0.3207, simple_loss=0.3739, pruned_loss=0.1338, over 17047.00 frames. ], tot_loss[loss=0.3194, simple_loss=0.3602, pruned_loss=0.1393, over 2004021.90 frames. ], batch size: 60, lr: 2.87e-02, grad_scale: 64.0 2023-06-23 10:04:53,482 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-06-23 10:05:01,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=25913.333333333332, ans=0.125 2023-06-23 10:05:03,971 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.96 vs. limit=10.0 2023-06-23 10:05:07,014 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.97 vs. limit=6.0 2023-06-23 10:05:11,856 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.45 vs. limit=15.0 2023-06-23 10:05:12,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=25913.333333333332, ans=0.125 2023-06-23 10:05:29,085 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.73 vs. limit=15.0 2023-06-23 10:05:33,908 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.31 vs. 
limit=8.0 2023-06-23 10:05:36,162 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.682e+02 2.398e+02 2.849e+02 3.834e+02 8.685e+02, threshold=5.698e+02, percent-clipped=3.0 2023-06-23 10:05:57,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=26113.333333333332, ans=0.125 2023-06-23 10:06:02,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=26113.333333333332, ans=0.125 2023-06-23 10:06:09,736 INFO [train.py:1008] (2/4) Epoch 8, batch 200, loss[loss=0.2836, simple_loss=0.3406, pruned_loss=0.1133, over 19665.00 frames. ], tot_loss[loss=0.3183, simple_loss=0.3589, pruned_loss=0.1389, over 2403316.98 frames. ], batch size: 110, lr: 2.87e-02, grad_scale: 64.0 2023-06-23 10:06:30,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=26246.666666666668, ans=0.1 2023-06-23 10:07:15,328 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=15.0 2023-06-23 10:07:31,958 INFO [train.py:1008] (2/4) Epoch 8, batch 250, loss[loss=0.3381, simple_loss=0.3551, pruned_loss=0.1606, over 20028.00 frames. ], tot_loss[loss=0.3169, simple_loss=0.3586, pruned_loss=0.1376, over 2712264.07 frames. ], batch size: 293, lr: 2.87e-02, grad_scale: 64.0 2023-06-23 10:08:23,644 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.719e+02 2.405e+02 3.162e+02 4.078e+02 7.691e+02, threshold=6.324e+02, percent-clipped=7.0 2023-06-23 10:08:48,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=26780.0, ans=0.1 2023-06-23 10:08:56,867 INFO [train.py:1008] (2/4) Epoch 8, batch 300, loss[loss=0.3088, simple_loss=0.3552, pruned_loss=0.1312, over 19516.00 frames. ], tot_loss[loss=0.3175, simple_loss=0.3587, pruned_loss=0.1382, over 2922775.04 frames. ], batch size: 102, lr: 2.86e-02, grad_scale: 64.0 2023-06-23 10:09:10,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=26846.666666666668, ans=0.025 2023-06-23 10:09:47,180 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.70 vs. limit=15.0 2023-06-23 10:09:53,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=27046.666666666668, ans=0.125 2023-06-23 10:10:00,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=27046.666666666668, ans=0.125 2023-06-23 10:10:00,762 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.48 vs. limit=22.5 2023-06-23 10:10:01,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=27113.333333333332, ans=0.1 2023-06-23 10:10:07,151 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=21.77 vs. limit=22.5 2023-06-23 10:10:18,661 INFO [train.py:1008] (2/4) Epoch 8, batch 350, loss[loss=0.3121, simple_loss=0.3494, pruned_loss=0.1374, over 20289.00 frames. 
], tot_loss[loss=0.3157, simple_loss=0.3575, pruned_loss=0.137, over 3112173.88 frames. ], batch size: 149, lr: 2.86e-02, grad_scale: 64.0 2023-06-23 10:10:19,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=27180.0, ans=0.125 2023-06-23 10:11:01,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=27313.333333333332, ans=10.0 2023-06-23 10:11:01,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=27313.333333333332, ans=0.0 2023-06-23 10:11:08,211 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.700e+02 2.345e+02 2.694e+02 3.348e+02 7.255e+02, threshold=5.388e+02, percent-clipped=3.0 2023-06-23 10:11:12,425 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.41 vs. limit=22.5 2023-06-23 10:11:24,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27446.666666666668, ans=0.1 2023-06-23 10:11:41,434 INFO [train.py:1008] (2/4) Epoch 8, batch 400, loss[loss=0.3165, simple_loss=0.3185, pruned_loss=0.1572, over 17151.00 frames. ], tot_loss[loss=0.3155, simple_loss=0.3577, pruned_loss=0.1367, over 3247544.84 frames. ], batch size: 391, lr: 2.85e-02, grad_scale: 64.0 2023-06-23 10:12:18,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=27646.666666666668, ans=0.125 2023-06-23 10:12:39,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=27713.333333333332, ans=0.07 2023-06-23 10:12:52,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=23.09 vs. limit=22.5 2023-06-23 10:12:58,580 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.14 vs. limit=12.0 2023-06-23 10:13:03,989 INFO [train.py:1008] (2/4) Epoch 8, batch 450, loss[loss=0.3167, simple_loss=0.3544, pruned_loss=0.1395, over 19955.00 frames. ], tot_loss[loss=0.316, simple_loss=0.3577, pruned_loss=0.1371, over 3357218.83 frames. ], batch size: 126, lr: 2.85e-02, grad_scale: 64.0 2023-06-23 10:13:13,169 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=14.83 vs. limit=15.0 2023-06-23 10:13:35,226 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.12 vs. 
limit=15.0 2023-06-23 10:13:36,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=27980.0, ans=0.2 2023-06-23 10:13:52,899 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.385e+02 2.830e+02 3.601e+02 6.779e+02, threshold=5.659e+02, percent-clipped=6.0 2023-06-23 10:13:57,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=28046.666666666668, ans=0.1 2023-06-23 10:14:14,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=28113.333333333332, ans=0.125 2023-06-23 10:14:21,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.17 vs. limit=22.5 2023-06-23 10:14:23,519 INFO [train.py:1008] (2/4) Epoch 8, batch 500, loss[loss=0.3026, simple_loss=0.352, pruned_loss=0.1266, over 18293.00 frames. ], tot_loss[loss=0.3154, simple_loss=0.3576, pruned_loss=0.1366, over 3432438.51 frames. ], batch size: 74, lr: 2.85e-02, grad_scale: 64.0 2023-06-23 10:14:26,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=28180.0, ans=0.125 2023-06-23 10:14:35,541 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.66 vs. limit=22.5 2023-06-23 10:15:02,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=28313.333333333332, ans=0.2 2023-06-23 10:15:06,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28313.333333333332, ans=0.1 2023-06-23 10:15:36,365 INFO [train.py:1008] (2/4) Epoch 9, batch 0, loss[loss=0.3174, simple_loss=0.3634, pruned_loss=0.1357, over 19785.00 frames. ], tot_loss[loss=0.3174, simple_loss=0.3634, pruned_loss=0.1357, over 19785.00 frames. ], batch size: 115, lr: 2.72e-02, grad_scale: 64.0 2023-06-23 10:15:36,365 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 10:15:42,521 INFO [train.py:1040] (2/4) Epoch 9, validation: loss=0.2321, simple_loss=0.3284, pruned_loss=0.06788, over 143649.00 frames. 2023-06-23 10:15:42,522 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 10:16:13,046 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:16:28,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=28526.666666666668, ans=0.125 2023-06-23 10:16:41,789 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.81 vs. 
limit=12.0 2023-06-23 10:16:47,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=28660.0, ans=0.125 2023-06-23 10:17:01,820 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.740e+02 2.235e+02 2.795e+02 3.608e+02 5.356e+02, threshold=5.591e+02, percent-clipped=0.0 2023-06-23 10:17:02,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=28660.0, ans=0.125 2023-06-23 10:17:02,630 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.85 vs. limit=22.5 2023-06-23 10:17:05,078 INFO [train.py:1008] (2/4) Epoch 9, batch 50, loss[loss=0.3131, simple_loss=0.3576, pruned_loss=0.1343, over 20013.00 frames. ], tot_loss[loss=0.3109, simple_loss=0.3549, pruned_loss=0.1335, over 865128.17 frames. ], batch size: 126, lr: 2.71e-02, grad_scale: 64.0 2023-06-23 10:17:20,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28793.333333333332, ans=0.1 2023-06-23 10:17:23,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=28793.333333333332, ans=0.125 2023-06-23 10:17:31,852 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-06-23 10:17:58,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=28926.666666666668, ans=0.125 2023-06-23 10:18:26,264 INFO [train.py:1008] (2/4) Epoch 9, batch 100, loss[loss=0.3018, simple_loss=0.3483, pruned_loss=0.1277, over 19801.00 frames. ], tot_loss[loss=0.3106, simple_loss=0.3547, pruned_loss=0.1333, over 1498741.18 frames. ], batch size: 115, lr: 2.71e-02, grad_scale: 64.0 2023-06-23 10:18:32,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=29060.0, ans=0.004552173913043479 2023-06-23 10:18:33,447 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.60 vs. limit=15.0 2023-06-23 10:18:44,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=29126.666666666668, ans=0.2 2023-06-23 10:18:56,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=29193.333333333332, ans=0.125 2023-06-23 10:19:09,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=29193.333333333332, ans=0.1 2023-06-23 10:19:23,300 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.63 vs. 
limit=15.0 2023-06-23 10:19:38,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29326.666666666668, ans=0.1 2023-06-23 10:19:41,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=29326.666666666668, ans=0.1 2023-06-23 10:19:45,723 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.343e+02 2.904e+02 3.644e+02 6.921e+02, threshold=5.807e+02, percent-clipped=3.0 2023-06-23 10:19:46,409 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.50 vs. limit=15.0 2023-06-23 10:19:47,372 INFO [train.py:1008] (2/4) Epoch 9, batch 150, loss[loss=0.3042, simple_loss=0.3509, pruned_loss=0.1287, over 19109.00 frames. ], tot_loss[loss=0.3115, simple_loss=0.3555, pruned_loss=0.1338, over 1997663.42 frames. ], batch size: 94, lr: 2.70e-02, grad_scale: 32.0 2023-06-23 10:19:58,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=29393.333333333332, ans=0.125 2023-06-23 10:20:05,655 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.49 vs. limit=6.0 2023-06-23 10:20:09,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=29460.0, ans=0.95 2023-06-23 10:20:25,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=29526.666666666668, ans=0.5 2023-06-23 10:20:45,239 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.61 vs. limit=6.0 2023-06-23 10:21:07,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=29660.0, ans=0.2 2023-06-23 10:21:09,947 INFO [train.py:1008] (2/4) Epoch 9, batch 200, loss[loss=0.2913, simple_loss=0.3395, pruned_loss=0.1215, over 19093.00 frames. ], tot_loss[loss=0.3105, simple_loss=0.3541, pruned_loss=0.1334, over 2399363.57 frames. ], batch size: 89, lr: 2.70e-02, grad_scale: 32.0 2023-06-23 10:21:22,029 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.75 vs. limit=22.5 2023-06-23 10:21:29,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=29793.333333333332, ans=0.125 2023-06-23 10:21:36,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=29793.333333333332, ans=0.125 2023-06-23 10:21:57,390 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.36 vs. 
limit=22.5 2023-06-23 10:22:08,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29926.666666666668, ans=0.1 2023-06-23 10:22:15,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=29993.333333333332, ans=0.125 2023-06-23 10:22:22,747 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.34 vs. limit=22.5 2023-06-23 10:22:31,027 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.762e+02 2.339e+02 2.716e+02 3.327e+02 5.463e+02, threshold=5.433e+02, percent-clipped=0.0 2023-06-23 10:22:32,598 INFO [train.py:1008] (2/4) Epoch 9, batch 250, loss[loss=0.2964, simple_loss=0.3453, pruned_loss=0.1238, over 19533.00 frames. ], tot_loss[loss=0.3084, simple_loss=0.352, pruned_loss=0.1324, over 2709004.73 frames. ], batch size: 102, lr: 2.70e-02, grad_scale: 32.0 2023-06-23 10:22:51,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=30126.666666666668, ans=0.0 2023-06-23 10:22:52,709 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=15.0 2023-06-23 10:22:55,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=30126.666666666668, ans=0.2 2023-06-23 10:23:14,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=30193.333333333332, ans=0.0 2023-06-23 10:23:14,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=30193.333333333332, ans=0.1 2023-06-23 10:23:46,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=30326.666666666668, ans=0.1 2023-06-23 10:23:55,982 INFO [train.py:1008] (2/4) Epoch 9, batch 300, loss[loss=0.3237, simple_loss=0.3786, pruned_loss=0.1344, over 18302.00 frames. ], tot_loss[loss=0.3071, simple_loss=0.3515, pruned_loss=0.1313, over 2956089.65 frames. ], batch size: 72, lr: 2.69e-02, grad_scale: 32.0 2023-06-23 10:24:01,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=30393.333333333332, ans=0.1 2023-06-23 10:24:45,748 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=8.03 vs. 
limit=10.0 2023-06-23 10:24:46,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=30593.333333333332, ans=0.1 2023-06-23 10:25:00,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=30593.333333333332, ans=0.05 2023-06-23 10:25:01,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=30660.0, ans=0.0 2023-06-23 10:25:11,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=30660.0, ans=0.125 2023-06-23 10:25:11,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=30660.0, ans=0.125 2023-06-23 10:25:13,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=30660.0, ans=0.0 2023-06-23 10:25:17,833 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.462e+02 2.891e+02 3.800e+02 6.993e+02, threshold=5.782e+02, percent-clipped=6.0 2023-06-23 10:25:20,045 INFO [train.py:1008] (2/4) Epoch 9, batch 350, loss[loss=0.3014, simple_loss=0.3365, pruned_loss=0.1331, over 20667.00 frames. ], tot_loss[loss=0.3063, simple_loss=0.3502, pruned_loss=0.1312, over 3154884.59 frames. ], batch size: 211, lr: 2.69e-02, grad_scale: 32.0 2023-06-23 10:25:20,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=30726.666666666668, ans=0.2 2023-06-23 10:25:30,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=30726.666666666668, ans=0.0 2023-06-23 10:25:55,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=30860.0, ans=0.125 2023-06-23 10:26:07,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=30860.0, ans=0.004160869565217391 2023-06-23 10:26:43,141 INFO [train.py:1008] (2/4) Epoch 9, batch 400, loss[loss=0.2758, simple_loss=0.3252, pruned_loss=0.1132, over 19499.00 frames. ], tot_loss[loss=0.3052, simple_loss=0.3497, pruned_loss=0.1303, over 3303476.27 frames. ], batch size: 105, lr: 2.68e-02, grad_scale: 32.0 2023-06-23 10:26:55,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=31060.0, ans=0.125 2023-06-23 10:27:08,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=31126.666666666668, ans=0.07 2023-06-23 10:27:25,048 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.25 vs. 
limit=15.0 2023-06-23 10:27:27,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=31193.333333333332, ans=0.125 2023-06-23 10:27:52,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=31326.666666666668, ans=0.125 2023-06-23 10:28:03,162 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.693e+02 2.233e+02 2.638e+02 3.310e+02 5.527e+02, threshold=5.277e+02, percent-clipped=0.0 2023-06-23 10:28:04,767 INFO [train.py:1008] (2/4) Epoch 9, batch 450, loss[loss=0.319, simple_loss=0.3795, pruned_loss=0.1292, over 16994.00 frames. ], tot_loss[loss=0.3053, simple_loss=0.3505, pruned_loss=0.1301, over 3387961.08 frames. ], batch size: 60, lr: 2.68e-02, grad_scale: 32.0 2023-06-23 10:28:32,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=31460.0, ans=0.125 2023-06-23 10:28:32,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=31460.0, ans=0.125 2023-06-23 10:28:33,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten.whitening_limit, batch_count=31460.0, ans=15.0 2023-06-23 10:29:10,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=31660.0, ans=0.00398695652173913 2023-06-23 10:29:25,432 INFO [train.py:1008] (2/4) Epoch 9, batch 500, loss[loss=0.2819, simple_loss=0.3336, pruned_loss=0.1151, over 19814.00 frames. ], tot_loss[loss=0.3046, simple_loss=0.35, pruned_loss=0.1295, over 3466630.27 frames. ], batch size: 115, lr: 2.68e-02, grad_scale: 32.0 2023-06-23 10:29:27,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.77 vs. limit=6.0 2023-06-23 10:29:28,702 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:29:38,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=31726.666666666668, ans=0.125 2023-06-23 10:29:39,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=31793.333333333332, ans=0.125 2023-06-23 10:29:55,320 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:30:41,979 INFO [train.py:1008] (2/4) Epoch 10, batch 0, loss[loss=0.3144, simple_loss=0.3605, pruned_loss=0.1341, over 19808.00 frames. ], tot_loss[loss=0.3144, simple_loss=0.3605, pruned_loss=0.1341, over 19808.00 frames. ], batch size: 120, lr: 2.56e-02, grad_scale: 32.0 2023-06-23 10:30:41,979 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 10:30:47,641 INFO [train.py:1040] (2/4) Epoch 10, validation: loss=0.2252, simple_loss=0.3233, pruned_loss=0.06351, over 143649.00 frames. 
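The Clipping_scale entries above print five grad-norm quartiles (min, 25%, median, 75%, max) followed by a threshold and a percent-clipped figure, and in every such entry shown here the threshold works out, to the displayed precision, to Clipping_scale times the median quartile (for example 2.0 x 2.638e+02 = 5.277e+02 in the entry logged at 10:28:03). The short Python sketch below reproduces that bookkeeping from a window of recent gradient norms; it is an illustration inferred from the logged numbers, with hypothetical names and made-up data, not the actual optim.py code, which likely maintains and updates its norm history differently.

import statistics

def grad_norm_quartiles(norms):
    # The five values printed as "grad-norm quartiles": min, 25%, median, 75%, max
    # of a window of recent gradient norms (the window contents here are hypothetical).
    s = sorted(norms)
    n = len(s) - 1
    return [s[0], s[n // 4], s[n // 2], s[(3 * n) // 4], s[-1]]

def clip_threshold(norms, clipping_scale=2.0):
    # Relationship observed in the log entries: threshold = clipping_scale * median grad-norm.
    return clipping_scale * statistics.median(norms)

# Hypothetical recent gradient norms, roughly on the scale seen in the log.
recent_norms = [169.3, 223.3, 263.8, 331.0, 552.7]

print(grad_norm_quartiles(recent_norms))   # -> [169.3, 223.3, 263.8, 331.0, 552.7]
threshold = clip_threshold(recent_norms)   # -> 2.0 * 263.8 = 527.6
percent_clipped = 100.0 * sum(g > threshold for g in recent_norms) / len(recent_norms)
print(threshold, percent_clipped)          # norms above the threshold would be scaled down before the step

Under this reading, the percent-clipped figure on each Clipping_scale line is the share of recent batches whose gradient norm exceeded that threshold; this interpretation is an assumption based on the printed numbers rather than something stated in the log itself.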
2023-06-23 10:30:47,642 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 10:31:06,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=32006.666666666668, ans=0.1 2023-06-23 10:31:14,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=32006.666666666668, ans=0.125 2023-06-23 10:31:15,955 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.771e+02 2.285e+02 2.652e+02 3.209e+02 5.793e+02, threshold=5.303e+02, percent-clipped=2.0 2023-06-23 10:31:17,526 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=16.68 vs. limit=15.0 2023-06-23 10:31:34,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=32073.333333333332, ans=0.125 2023-06-23 10:31:56,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=32206.666666666668, ans=0.0 2023-06-23 10:32:10,770 INFO [train.py:1008] (2/4) Epoch 10, batch 50, loss[loss=0.2855, simple_loss=0.3359, pruned_loss=0.1175, over 19862.00 frames. ], tot_loss[loss=0.302, simple_loss=0.3477, pruned_loss=0.1281, over 854467.96 frames. ], batch size: 120, lr: 2.56e-02, grad_scale: 32.0 2023-06-23 10:32:19,452 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.64 vs. limit=22.5 2023-06-23 10:32:27,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=32340.0, ans=0.09899494936611666 2023-06-23 10:32:36,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=32340.0, ans=0.015 2023-06-23 10:32:41,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=32406.666666666668, ans=0.125 2023-06-23 10:32:41,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=32406.666666666668, ans=0.125 2023-06-23 10:33:04,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=32473.333333333332, ans=0.0 2023-06-23 10:33:26,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=32540.0, ans=0.003795652173913043 2023-06-23 10:33:33,230 INFO [train.py:1008] (2/4) Epoch 10, batch 100, loss[loss=0.309, simple_loss=0.3433, pruned_loss=0.1374, over 20697.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.3477, pruned_loss=0.1275, over 1515871.20 frames. ], batch size: 211, lr: 2.55e-02, grad_scale: 32.0 2023-06-23 10:33:34,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.32 vs. 
limit=22.5 2023-06-23 10:34:00,488 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 2.231e+02 2.647e+02 3.330e+02 6.227e+02, threshold=5.294e+02, percent-clipped=3.0 2023-06-23 10:34:06,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=32740.0, ans=0.125 2023-06-23 10:34:54,570 INFO [train.py:1008] (2/4) Epoch 10, batch 150, loss[loss=0.2753, simple_loss=0.3311, pruned_loss=0.1097, over 10630.00 frames. ], tot_loss[loss=0.3012, simple_loss=0.3472, pruned_loss=0.1276, over 2008791.87 frames. ], batch size: 30, lr: 2.55e-02, grad_scale: 32.0 2023-06-23 10:35:08,507 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.68 vs. limit=15.0 2023-06-23 10:35:51,444 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.80 vs. limit=22.5 2023-06-23 10:36:16,917 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0 2023-06-23 10:36:17,386 INFO [train.py:1008] (2/4) Epoch 10, batch 200, loss[loss=0.3142, simple_loss=0.3622, pruned_loss=0.1331, over 18945.00 frames. ], tot_loss[loss=0.301, simple_loss=0.3478, pruned_loss=0.1271, over 2400717.52 frames. ], batch size: 86, lr: 2.54e-02, grad_scale: 32.0 2023-06-23 10:36:19,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=33273.333333333336, ans=0.1 2023-06-23 10:36:20,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=33273.333333333336, ans=0.125 2023-06-23 10:36:20,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=33273.333333333336, ans=0.5 2023-06-23 10:36:35,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=33340.0, ans=0.0036217391304347825 2023-06-23 10:36:44,643 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.786e+02 2.278e+02 2.899e+02 3.563e+02 6.245e+02, threshold=5.798e+02, percent-clipped=3.0 2023-06-23 10:36:57,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33406.666666666664, ans=0.1 2023-06-23 10:36:59,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=33406.666666666664, ans=0.0 2023-06-23 10:37:02,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=33406.666666666664, ans=0.0 2023-06-23 10:37:04,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=33406.666666666664, ans=0.07 2023-06-23 10:37:15,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=33473.333333333336, ans=0.04949747468305833 2023-06-23 10:37:28,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=33540.0, ans=0.125 2023-06-23 10:37:40,531 INFO [train.py:1008] (2/4) Epoch 10, batch 250, loss[loss=0.2937, simple_loss=0.345, pruned_loss=0.1212, over 19347.00 frames. 
], tot_loss[loss=0.3003, simple_loss=0.3477, pruned_loss=0.1264, over 2712449.55 frames. ], batch size: 98, lr: 2.54e-02, grad_scale: 32.0 2023-06-23 10:37:44,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=33606.666666666664, ans=0.04949747468305833 2023-06-23 10:37:48,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=33606.666666666664, ans=0.2 2023-06-23 10:37:52,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=33606.666666666664, ans=0.125 2023-06-23 10:37:54,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=33606.666666666664, ans=0.0 2023-06-23 10:37:58,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=33673.333333333336, ans=0.1 2023-06-23 10:38:14,279 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=14.36 vs. limit=15.0 2023-06-23 10:38:14,495 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.25 vs. limit=15.0 2023-06-23 10:38:15,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=33740.0, ans=0.125 2023-06-23 10:38:23,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=33740.0, ans=0.125 2023-06-23 10:38:51,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=33873.333333333336, ans=0.0 2023-06-23 10:38:54,000 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. limit=6.0 2023-06-23 10:39:02,818 INFO [train.py:1008] (2/4) Epoch 10, batch 300, loss[loss=0.2754, simple_loss=0.3333, pruned_loss=0.1087, over 19807.00 frames. ], tot_loss[loss=0.2991, simple_loss=0.3471, pruned_loss=0.1256, over 2963039.21 frames. ], batch size: 115, lr: 2.54e-02, grad_scale: 32.0 2023-06-23 10:39:03,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=33940.0, ans=0.025 2023-06-23 10:39:05,413 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.31 vs. 
limit=22.5 2023-06-23 10:39:29,953 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.768e+02 2.326e+02 2.732e+02 3.307e+02 6.159e+02, threshold=5.464e+02, percent-clipped=1.0 2023-06-23 10:39:33,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=34073.333333333336, ans=0.2 2023-06-23 10:39:33,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=34073.333333333336, ans=0.0 2023-06-23 10:39:43,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=34073.333333333336, ans=0.125 2023-06-23 10:40:00,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=34140.0, ans=0.035 2023-06-23 10:40:01,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=34140.0, ans=0.2 2023-06-23 10:40:09,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=34206.666666666664, ans=0.0 2023-06-23 10:40:24,789 INFO [train.py:1008] (2/4) Epoch 10, batch 350, loss[loss=0.277, simple_loss=0.3326, pruned_loss=0.1107, over 19701.00 frames. ], tot_loss[loss=0.2986, simple_loss=0.3461, pruned_loss=0.1256, over 3151602.21 frames. ], batch size: 110, lr: 2.53e-02, grad_scale: 32.0 2023-06-23 10:41:05,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=34406.666666666664, ans=0.2 2023-06-23 10:41:15,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=34473.333333333336, ans=0.125 2023-06-23 10:41:36,579 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=10.58 vs. limit=15.0 2023-06-23 10:41:47,454 INFO [train.py:1008] (2/4) Epoch 10, batch 400, loss[loss=0.321, simple_loss=0.374, pruned_loss=0.134, over 16360.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.3455, pruned_loss=0.1255, over 3301177.43 frames. 
], batch size: 52, lr: 2.53e-02, grad_scale: 32.0 2023-06-23 10:42:05,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=34673.333333333336, ans=0.0033318840579710145 2023-06-23 10:42:13,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=34673.333333333336, ans=0.0033318840579710145 2023-06-23 10:42:13,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=34673.333333333336, ans=0.125 2023-06-23 10:42:15,842 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.734e+02 2.198e+02 2.659e+02 3.246e+02 4.895e+02, threshold=5.319e+02, percent-clipped=0.0 2023-06-23 10:42:19,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=34740.0, ans=0.125 2023-06-23 10:42:22,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=34740.0, ans=0.035 2023-06-23 10:42:58,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=34873.333333333336, ans=0.125 2023-06-23 10:43:09,654 INFO [train.py:1008] (2/4) Epoch 10, batch 450, loss[loss=0.2924, simple_loss=0.3418, pruned_loss=0.1215, over 19839.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.3454, pruned_loss=0.1245, over 3409797.70 frames. ], batch size: 120, lr: 2.52e-02, grad_scale: 32.0 2023-06-23 10:43:40,277 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.33 vs. limit=10.0 2023-06-23 10:43:55,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=35073.333333333336, ans=0.125 2023-06-23 10:44:04,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=35140.0, ans=0.0 2023-06-23 10:44:29,451 INFO [train.py:1008] (2/4) Epoch 10, batch 500, loss[loss=0.2754, simple_loss=0.3269, pruned_loss=0.112, over 20382.00 frames. ], tot_loss[loss=0.2973, simple_loss=0.3459, pruned_loss=0.1244, over 3489998.57 frames. ], batch size: 149, lr: 2.52e-02, grad_scale: 32.0 2023-06-23 10:44:34,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=35273.333333333336, ans=0.125 2023-06-23 10:44:54,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=35340.0, ans=0.0 2023-06-23 10:44:57,139 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.221e+02 2.542e+02 2.887e+02 4.179e+02, threshold=5.084e+02, percent-clipped=0.0 2023-06-23 10:44:59,473 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.08 vs. limit=15.0 2023-06-23 10:45:46,046 INFO [train.py:1008] (2/4) Epoch 11, batch 0, loss[loss=0.2769, simple_loss=0.3311, pruned_loss=0.1114, over 19655.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3311, pruned_loss=0.1114, over 19655.00 frames. 
], batch size: 110, lr: 2.42e-02, grad_scale: 32.0 2023-06-23 10:45:46,047 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 10:45:51,643 INFO [train.py:1040] (2/4) Epoch 11, validation: loss=0.2248, simple_loss=0.3217, pruned_loss=0.06391, over 143649.00 frames. 2023-06-23 10:45:51,643 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 10:45:52,860 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.52 vs. limit=22.5 2023-06-23 10:45:59,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=35493.333333333336, ans=0.125 2023-06-23 10:46:50,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=35693.333333333336, ans=0.0 2023-06-23 10:46:57,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=35760.0, ans=0.025 2023-06-23 10:47:14,640 INFO [train.py:1008] (2/4) Epoch 11, batch 50, loss[loss=0.2775, simple_loss=0.335, pruned_loss=0.1099, over 19081.00 frames. ], tot_loss[loss=0.2947, simple_loss=0.3417, pruned_loss=0.1239, over 871998.28 frames. ], batch size: 89, lr: 2.41e-02, grad_scale: 32.0 2023-06-23 10:47:23,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=35826.666666666664, ans=0.125 2023-06-23 10:47:26,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=35826.666666666664, ans=0.125 2023-06-23 10:47:27,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=35826.666666666664, ans=0.125 2023-06-23 10:47:43,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=35893.333333333336, ans=0.125 2023-06-23 10:47:59,903 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.13 vs. limit=22.5 2023-06-23 10:48:03,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36026.666666666664, ans=0.1 2023-06-23 10:48:09,998 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 2.233e+02 2.520e+02 2.847e+02 4.196e+02, threshold=5.041e+02, percent-clipped=0.0 2023-06-23 10:48:21,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=36093.333333333336, ans=0.125 2023-06-23 10:48:36,754 INFO [train.py:1008] (2/4) Epoch 11, batch 100, loss[loss=0.3059, simple_loss=0.3689, pruned_loss=0.1214, over 16384.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3421, pruned_loss=0.1215, over 1508385.04 frames. ], batch size: 52, lr: 2.41e-02, grad_scale: 32.0 2023-06-23 10:48:42,117 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.31 vs. 
limit=22.5 2023-06-23 10:48:43,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=36160.0, ans=0.2 2023-06-23 10:49:04,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=36226.666666666664, ans=0.1 2023-06-23 10:49:34,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=36360.0, ans=0.125 2023-06-23 10:50:00,207 INFO [train.py:1008] (2/4) Epoch 11, batch 150, loss[loss=0.2877, simple_loss=0.3293, pruned_loss=0.1231, over 20567.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3416, pruned_loss=0.1219, over 2023251.13 frames. ], batch size: 189, lr: 2.40e-02, grad_scale: 32.0 2023-06-23 10:50:03,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=36493.333333333336, ans=0.1 2023-06-23 10:50:45,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=36626.666666666664, ans=15.0 2023-06-23 10:50:55,661 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.693e+02 2.157e+02 2.625e+02 2.983e+02 4.556e+02, threshold=5.251e+02, percent-clipped=0.0 2023-06-23 10:51:20,053 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=12.0 2023-06-23 10:51:20,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=36826.666666666664, ans=0.2 2023-06-23 10:51:22,150 INFO [train.py:1008] (2/4) Epoch 11, batch 200, loss[loss=0.2723, simple_loss=0.3308, pruned_loss=0.1069, over 18631.00 frames. ], tot_loss[loss=0.2903, simple_loss=0.3399, pruned_loss=0.1204, over 2413365.01 frames. ], batch size: 80, lr: 2.40e-02, grad_scale: 32.0 2023-06-23 10:51:22,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=36826.666666666664, ans=0.125 2023-06-23 10:51:55,447 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.73 vs. limit=15.0 2023-06-23 10:52:03,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=36960.0, ans=0.125 2023-06-23 10:52:14,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=37026.666666666664, ans=0.2 2023-06-23 10:52:24,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=37026.666666666664, ans=0.125 2023-06-23 10:52:25,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=37093.333333333336, ans=0.0 2023-06-23 10:52:37,954 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-23 10:52:43,214 INFO [train.py:1008] (2/4) Epoch 11, batch 250, loss[loss=0.31, simple_loss=0.3689, pruned_loss=0.1255, over 16942.00 frames. ], tot_loss[loss=0.2901, simple_loss=0.3413, pruned_loss=0.1194, over 2704838.46 frames. 
], batch size: 60, lr: 2.40e-02, grad_scale: 32.0 2023-06-23 10:53:08,362 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.33 vs. limit=15.0 2023-06-23 10:53:14,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=37226.666666666664, ans=0.125 2023-06-23 10:53:19,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=37293.333333333336, ans=0.1 2023-06-23 10:53:24,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=37293.333333333336, ans=0.2 2023-06-23 10:53:29,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=37293.333333333336, ans=0.125 2023-06-23 10:53:41,581 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 2.088e+02 2.350e+02 3.005e+02 5.122e+02, threshold=4.700e+02, percent-clipped=0.0 2023-06-23 10:53:58,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=37426.666666666664, ans=0.125 2023-06-23 10:54:07,909 INFO [train.py:1008] (2/4) Epoch 11, batch 300, loss[loss=0.2841, simple_loss=0.3346, pruned_loss=0.1168, over 20483.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.3407, pruned_loss=0.1194, over 2931877.22 frames. ], batch size: 160, lr: 2.39e-02, grad_scale: 16.0 2023-06-23 10:54:13,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=37493.333333333336, ans=0.0 2023-06-23 10:54:30,009 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-23 10:54:37,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=37560.0, ans=0.2 2023-06-23 10:55:32,499 INFO [train.py:1008] (2/4) Epoch 11, batch 350, loss[loss=0.266, simple_loss=0.3219, pruned_loss=0.1051, over 19897.00 frames. ], tot_loss[loss=0.2886, simple_loss=0.3397, pruned_loss=0.1188, over 3119674.21 frames. ], batch size: 120, lr: 2.39e-02, grad_scale: 16.0 2023-06-23 10:55:56,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=37893.333333333336, ans=0.125 2023-06-23 10:56:04,692 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.46 vs. 
limit=10.0 2023-06-23 10:56:10,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=37960.0, ans=0.0 2023-06-23 10:56:19,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=37960.0, ans=0.0026173913043478257 2023-06-23 10:56:25,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=38026.666666666664, ans=0.0 2023-06-23 10:56:32,926 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.815e+02 2.281e+02 2.772e+02 3.917e+02 6.266e+02, threshold=5.543e+02, percent-clipped=13.0 2023-06-23 10:56:56,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=38160.0, ans=0.125 2023-06-23 10:56:57,390 INFO [train.py:1008] (2/4) Epoch 11, batch 400, loss[loss=0.2932, simple_loss=0.3419, pruned_loss=0.1223, over 20098.00 frames. ], tot_loss[loss=0.289, simple_loss=0.3396, pruned_loss=0.1192, over 3284739.60 frames. ], batch size: 133, lr: 2.38e-02, grad_scale: 32.0 2023-06-23 10:57:06,454 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:57:18,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=38226.666666666664, ans=0.07 2023-06-23 10:57:27,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=38226.666666666664, ans=0.035 2023-06-23 10:57:47,341 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.29 vs. limit=22.5 2023-06-23 10:58:11,236 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:58:15,196 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.54 vs. limit=22.5 2023-06-23 10:58:22,122 INFO [train.py:1008] (2/4) Epoch 11, batch 450, loss[loss=0.3005, simple_loss=0.3461, pruned_loss=0.1274, over 20269.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3393, pruned_loss=0.1195, over 3399639.97 frames. ], batch size: 141, lr: 2.38e-02, grad_scale: 32.0 2023-06-23 10:58:39,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.15 vs. limit=6.0 2023-06-23 10:58:40,241 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:58:45,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=38560.0, ans=0.1 2023-06-23 10:59:17,305 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. limit=10.0 2023-06-23 10:59:20,874 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.759e+02 2.099e+02 2.478e+02 2.998e+02 4.300e+02, threshold=4.956e+02, percent-clipped=0.0 2023-06-23 10:59:44,937 INFO [train.py:1008] (2/4) Epoch 11, batch 500, loss[loss=0.2598, simple_loss=0.3303, pruned_loss=0.09467, over 16777.00 frames. ], tot_loss[loss=0.2874, simple_loss=0.3381, pruned_loss=0.1183, over 3478255.70 frames. 
], batch size: 59, lr: 2.38e-02, grad_scale: 32.0 2023-06-23 10:59:47,718 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.45 vs. limit=15.0 2023-06-23 10:59:48,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=38826.666666666664, ans=0.1 2023-06-23 11:00:08,484 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.09 vs. limit=15.0 2023-06-23 11:00:09,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=38893.333333333336, ans=0.1 2023-06-23 11:00:20,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=38960.0, ans=0.09899494936611666 2023-06-23 11:00:24,062 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-23 11:00:25,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=38960.0, ans=0.125 2023-06-23 11:01:00,459 INFO [train.py:1008] (2/4) Epoch 12, batch 0, loss[loss=0.2784, simple_loss=0.3381, pruned_loss=0.1094, over 19079.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3381, pruned_loss=0.1094, over 19079.00 frames. ], batch size: 94, lr: 2.28e-02, grad_scale: 32.0 2023-06-23 11:01:00,460 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 11:01:06,092 INFO [train.py:1040] (2/4) Epoch 12, validation: loss=0.2211, simple_loss=0.3184, pruned_loss=0.06189, over 143649.00 frames. 2023-06-23 11:01:06,093 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 11:01:06,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=39040.0, ans=0.125 2023-06-23 11:01:30,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=39106.666666666664, ans=0.0023681159420289857 2023-06-23 11:02:11,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=39306.666666666664, ans=0.0 2023-06-23 11:02:19,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=39306.666666666664, ans=0.125 2023-06-23 11:02:29,865 INFO [train.py:1008] (2/4) Epoch 12, batch 50, loss[loss=0.2957, simple_loss=0.3451, pruned_loss=0.1231, over 19341.00 frames. ], tot_loss[loss=0.2839, simple_loss=0.3368, pruned_loss=0.1155, over 842789.87 frames. ], batch size: 98, lr: 2.28e-02, grad_scale: 32.0 2023-06-23 11:02:34,463 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.442e+02 2.173e+02 2.414e+02 2.819e+02 3.920e+02, threshold=4.827e+02, percent-clipped=0.0 2023-06-23 11:02:41,545 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.38 vs. limit=6.0 2023-06-23 11:02:50,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.31 vs. 
limit=15.0 2023-06-23 11:03:26,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=39573.333333333336, ans=0.125 2023-06-23 11:03:52,873 INFO [train.py:1008] (2/4) Epoch 12, batch 100, loss[loss=0.2939, simple_loss=0.3491, pruned_loss=0.1194, over 16211.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.3371, pruned_loss=0.1169, over 1500492.40 frames. ], batch size: 52, lr: 2.28e-02, grad_scale: 32.0 2023-06-23 11:04:03,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=39706.666666666664, ans=0.5 2023-06-23 11:04:21,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=39773.333333333336, ans=0.1 2023-06-23 11:04:28,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=39840.0, ans=0.002208695652173912 2023-06-23 11:04:30,560 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.33 vs. limit=6.0 2023-06-23 11:04:47,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=39906.666666666664, ans=0.05 2023-06-23 11:05:09,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=39973.333333333336, ans=0.0 2023-06-23 11:05:15,884 INFO [train.py:1008] (2/4) Epoch 12, batch 150, loss[loss=0.3245, simple_loss=0.369, pruned_loss=0.14, over 19216.00 frames. ], tot_loss[loss=0.2859, simple_loss=0.337, pruned_loss=0.1174, over 2014440.32 frames. ], batch size: 92, lr: 2.27e-02, grad_scale: 32.0 2023-06-23 11:05:20,572 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.671e+02 2.197e+02 2.613e+02 3.167e+02 4.551e+02, threshold=5.226e+02, percent-clipped=0.0 2023-06-23 11:05:38,137 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.73 vs. limit=15.0 2023-06-23 11:06:20,076 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0 2023-06-23 11:06:22,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=40306.666666666664, ans=0.1 2023-06-23 11:06:38,935 INFO [train.py:1008] (2/4) Epoch 12, batch 200, loss[loss=0.2843, simple_loss=0.348, pruned_loss=0.1104, over 18284.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3368, pruned_loss=0.1173, over 2405573.85 frames. ], batch size: 74, lr: 2.27e-02, grad_scale: 32.0 2023-06-23 11:06:41,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=40373.333333333336, ans=0.125 2023-06-23 11:07:00,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=40440.0, ans=0.1 2023-06-23 11:07:53,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=40640.0, ans=0.2 2023-06-23 11:08:03,148 INFO [train.py:1008] (2/4) Epoch 12, batch 250, loss[loss=0.2651, simple_loss=0.3214, pruned_loss=0.1043, over 19213.00 frames. 
], tot_loss[loss=0.2844, simple_loss=0.3354, pruned_loss=0.1167, over 2715108.50 frames. ], batch size: 92, lr: 2.27e-02, grad_scale: 32.0 2023-06-23 11:08:07,860 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.708e+02 2.162e+02 2.502e+02 3.086e+02 4.353e+02, threshold=5.005e+02, percent-clipped=0.0 2023-06-23 11:08:08,867 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.10 vs. limit=10.0 2023-06-23 11:09:04,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=40906.666666666664, ans=0.0 2023-06-23 11:09:05,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=40906.666666666664, ans=0.125 2023-06-23 11:09:13,345 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:09:27,216 INFO [train.py:1008] (2/4) Epoch 12, batch 300, loss[loss=0.2752, simple_loss=0.3275, pruned_loss=0.1115, over 20495.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3362, pruned_loss=0.1172, over 2946067.18 frames. ], batch size: 160, lr: 2.26e-02, grad_scale: 32.0 2023-06-23 11:09:37,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=41040.0, ans=0.0 2023-06-23 11:09:44,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=41106.666666666664, ans=0.125 2023-06-23 11:09:52,348 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.24 vs. limit=22.5 2023-06-23 11:09:58,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=41173.333333333336, ans=0.125 2023-06-23 11:10:38,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=41306.666666666664, ans=0.2 2023-06-23 11:10:49,879 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.28 vs. limit=10.0 2023-06-23 11:10:50,367 INFO [train.py:1008] (2/4) Epoch 12, batch 350, loss[loss=0.2643, simple_loss=0.3373, pruned_loss=0.09563, over 18308.00 frames. ], tot_loss[loss=0.2851, simple_loss=0.336, pruned_loss=0.117, over 3124717.84 frames. ], batch size: 72, lr: 2.26e-02, grad_scale: 32.0 2023-06-23 11:10:52,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=41373.333333333336, ans=0.125 2023-06-23 11:10:54,992 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.720e+02 2.164e+02 2.476e+02 2.969e+02 4.432e+02, threshold=4.951e+02, percent-clipped=0.0 2023-06-23 11:11:10,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=41440.0, ans=0.125 2023-06-23 11:11:26,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten.whitening_limit, batch_count=41506.666666666664, ans=22.5 2023-06-23 11:12:00,135 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=24.10 vs. 
limit=22.5 2023-06-23 11:12:13,273 INFO [train.py:1008] (2/4) Epoch 12, batch 400, loss[loss=0.2769, simple_loss=0.3282, pruned_loss=0.1128, over 19477.00 frames. ], tot_loss[loss=0.2848, simple_loss=0.3355, pruned_loss=0.1171, over 3267628.43 frames. ], batch size: 105, lr: 2.25e-02, grad_scale: 32.0 2023-06-23 11:12:14,412 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=15.0 2023-06-23 11:12:23,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=41706.666666666664, ans=0.0 2023-06-23 11:12:25,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=41706.666666666664, ans=0.125 2023-06-23 11:12:25,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-23 11:12:58,583 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-23 11:13:35,904 INFO [train.py:1008] (2/4) Epoch 12, batch 450, loss[loss=0.2949, simple_loss=0.3417, pruned_loss=0.124, over 19476.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.3356, pruned_loss=0.1165, over 3390462.00 frames. ], batch size: 105, lr: 2.25e-02, grad_scale: 32.0 2023-06-23 11:13:41,745 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.635e+02 2.136e+02 2.493e+02 3.022e+02 5.993e+02, threshold=4.985e+02, percent-clipped=1.0 2023-06-23 11:13:42,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=42040.0, ans=0.0017304347826086943 2023-06-23 11:13:43,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=42040.0, ans=0.0 2023-06-23 11:14:13,631 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.98 vs. limit=15.0 2023-06-23 11:14:18,704 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:14:22,698 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-23 11:14:38,867 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.46 vs. limit=12.0 2023-06-23 11:14:57,114 INFO [train.py:1008] (2/4) Epoch 12, batch 500, loss[loss=0.3046, simple_loss=0.361, pruned_loss=0.1241, over 17640.00 frames. ], tot_loss[loss=0.2831, simple_loss=0.3354, pruned_loss=0.1154, over 3484149.47 frames. ], batch size: 67, lr: 2.25e-02, grad_scale: 32.0 2023-06-23 11:15:13,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=10.32 vs. 
limit=15.0 2023-06-23 11:15:25,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=42440.0, ans=0.125 2023-06-23 11:15:33,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=42506.666666666664, ans=0.0 2023-06-23 11:15:34,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=42506.666666666664, ans=0.125 2023-06-23 11:15:34,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=42506.666666666664, ans=0.125 2023-06-23 11:16:12,190 INFO [train.py:1008] (2/4) Epoch 13, batch 0, loss[loss=0.2828, simple_loss=0.3372, pruned_loss=0.1142, over 18805.00 frames. ], tot_loss[loss=0.2828, simple_loss=0.3372, pruned_loss=0.1142, over 18805.00 frames. ], batch size: 83, lr: 2.16e-02, grad_scale: 32.0 2023-06-23 11:16:12,191 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 11:16:18,217 INFO [train.py:1040] (2/4) Epoch 13, validation: loss=0.2164, simple_loss=0.3138, pruned_loss=0.05947, over 143649.00 frames. 2023-06-23 11:16:18,218 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 11:16:22,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=42593.333333333336, ans=0.125 2023-06-23 11:16:24,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=42593.333333333336, ans=0.125 2023-06-23 11:16:27,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=42593.333333333336, ans=0.0 2023-06-23 11:16:40,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=42660.0, ans=0.125 2023-06-23 11:16:50,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=42726.666666666664, ans=0.1 2023-06-23 11:16:51,931 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.682e+02 2.193e+02 2.511e+02 3.488e+02 6.127e+02, threshold=5.022e+02, percent-clipped=8.0 2023-06-23 11:16:54,048 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=9.645e-02 2023-06-23 11:17:04,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=42726.666666666664, ans=0.07 2023-06-23 11:17:20,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=42793.333333333336, ans=0.05 2023-06-23 11:17:41,400 INFO [train.py:1008] (2/4) Epoch 13, batch 50, loss[loss=0.2657, simple_loss=0.3317, pruned_loss=0.09984, over 18298.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3313, pruned_loss=0.1105, over 852243.71 frames. 
], batch size: 74, lr: 2.16e-02, grad_scale: 32.0 2023-06-23 11:17:47,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=42926.666666666664, ans=0.0015376811594202903 2023-06-23 11:18:10,554 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=3.335e-02 2023-06-23 11:18:13,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=43060.0, ans=0.0 2023-06-23 11:18:19,402 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.29 vs. limit=15.0 2023-06-23 11:18:25,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=43060.0, ans=0.001508695652173913 2023-06-23 11:18:54,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=43193.333333333336, ans=0.125 2023-06-23 11:19:04,636 INFO [train.py:1008] (2/4) Epoch 13, batch 100, loss[loss=0.2606, simple_loss=0.3182, pruned_loss=0.1015, over 19340.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3329, pruned_loss=0.1111, over 1504551.81 frames. ], batch size: 98, lr: 2.16e-02, grad_scale: 32.0 2023-06-23 11:19:16,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=43260.0, ans=0.125 2023-06-23 11:19:18,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=43260.0, ans=0.125 2023-06-23 11:19:37,890 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.683e+02 2.152e+02 2.569e+02 2.957e+02 4.748e+02, threshold=5.138e+02, percent-clipped=0.0 2023-06-23 11:19:59,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=43460.0, ans=0.125 2023-06-23 11:20:09,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=43526.666666666664, ans=0.2 2023-06-23 11:20:20,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=43526.666666666664, ans=0.001407246376811595 2023-06-23 11:20:27,030 INFO [train.py:1008] (2/4) Epoch 13, batch 150, loss[loss=0.2687, simple_loss=0.3316, pruned_loss=0.1029, over 19620.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.3331, pruned_loss=0.1118, over 2001700.92 frames. 
], batch size: 110, lr: 2.15e-02, grad_scale: 32.0 2023-06-23 11:20:27,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=43593.333333333336, ans=0.125 2023-06-23 11:20:29,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=43593.333333333336, ans=0.125 2023-06-23 11:20:33,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=43593.333333333336, ans=0.125 2023-06-23 11:20:35,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=43593.333333333336, ans=0.1 2023-06-23 11:20:43,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=43660.0, ans=0.1 2023-06-23 11:21:18,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=43793.333333333336, ans=0.125 2023-06-23 11:21:50,233 INFO [train.py:1008] (2/4) Epoch 13, batch 200, loss[loss=0.269, simple_loss=0.3317, pruned_loss=0.1032, over 19207.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.332, pruned_loss=0.1116, over 2393129.35 frames. ], batch size: 92, lr: 2.15e-02, grad_scale: 32.0 2023-06-23 11:22:04,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=43926.666666666664, ans=0.125 2023-06-23 11:22:17,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=43993.333333333336, ans=0.125 2023-06-23 11:22:23,034 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.640e+02 2.129e+02 2.385e+02 2.873e+02 5.055e+02, threshold=4.770e+02, percent-clipped=0.0 2023-06-23 11:22:37,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=44060.0, ans=0.0 2023-06-23 11:23:13,613 INFO [train.py:1008] (2/4) Epoch 13, batch 250, loss[loss=0.2774, simple_loss=0.3257, pruned_loss=0.1145, over 19926.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.3317, pruned_loss=0.1125, over 2701246.71 frames. ], batch size: 126, lr: 2.15e-02, grad_scale: 32.0 2023-06-23 11:23:45,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=44393.333333333336, ans=0.09899494936611666 2023-06-23 11:23:52,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=44393.333333333336, ans=0.125 2023-06-23 11:23:55,004 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.02 vs. 
limit=15.0 2023-06-23 11:23:59,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=44393.333333333336, ans=0.0012188405797101433 2023-06-23 11:24:25,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=44526.666666666664, ans=0.125 2023-06-23 11:24:29,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=44526.666666666664, ans=0.125 2023-06-23 11:24:34,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=44526.666666666664, ans=0.1 2023-06-23 11:24:37,086 INFO [train.py:1008] (2/4) Epoch 13, batch 300, loss[loss=0.2759, simple_loss=0.3316, pruned_loss=0.1101, over 19218.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.3314, pruned_loss=0.1124, over 2942113.68 frames. ], batch size: 92, lr: 2.14e-02, grad_scale: 32.0 2023-06-23 11:24:38,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=44593.333333333336, ans=0.1 2023-06-23 11:24:41,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=44593.333333333336, ans=0.125 2023-06-23 11:24:45,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.71 vs. limit=22.5 2023-06-23 11:24:54,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=44660.0, ans=0.5 2023-06-23 11:25:07,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=44660.0, ans=0.125 2023-06-23 11:25:10,126 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.524e+02 2.051e+02 2.313e+02 2.745e+02 4.176e+02, threshold=4.625e+02, percent-clipped=1.0 2023-06-23 11:26:00,772 INFO [train.py:1008] (2/4) Epoch 13, batch 350, loss[loss=0.2937, simple_loss=0.3554, pruned_loss=0.116, over 15342.00 frames. ], tot_loss[loss=0.2786, simple_loss=0.3324, pruned_loss=0.1124, over 3119771.18 frames. ], batch size: 43, lr: 2.14e-02, grad_scale: 32.0 2023-06-23 11:26:42,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=45060.0, ans=0.125 2023-06-23 11:26:47,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=45060.0, ans=0.2 2023-06-23 11:26:48,038 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.45 vs. limit=15.0 2023-06-23 11:27:23,849 INFO [train.py:1008] (2/4) Epoch 13, batch 400, loss[loss=0.252, simple_loss=0.3228, pruned_loss=0.09065, over 17106.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3322, pruned_loss=0.1119, over 3264503.50 frames. ], batch size: 60, lr: 2.14e-02, grad_scale: 32.0 2023-06-23 11:27:35,062 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.93 vs. 
limit=15.0 2023-06-23 11:27:41,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=45326.666666666664, ans=0.2 2023-06-23 11:27:58,228 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 2.091e+02 2.460e+02 2.937e+02 5.771e+02, threshold=4.919e+02, percent-clipped=2.0 2023-06-23 11:28:03,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=45393.333333333336, ans=0.0 2023-06-23 11:28:10,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=45393.333333333336, ans=0.125 2023-06-23 11:28:13,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=45460.0, ans=0.0 2023-06-23 11:28:16,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=45460.0, ans=0.125 2023-06-23 11:28:48,323 INFO [train.py:1008] (2/4) Epoch 13, batch 450, loss[loss=0.2618, simple_loss=0.3132, pruned_loss=0.1052, over 20306.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3306, pruned_loss=0.1119, over 3387596.62 frames. ], batch size: 141, lr: 2.13e-02, grad_scale: 32.0 2023-06-23 11:28:52,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.min_abs, batch_count=45593.333333333336, ans=0.5 2023-06-23 11:29:17,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=45660.0, ans=0.1 2023-06-23 11:29:50,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.prob, batch_count=45793.333333333336, ans=0.125 2023-06-23 11:29:53,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=45860.0, ans=0.0 2023-06-23 11:30:02,002 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=22.55 vs. limit=22.5 2023-06-23 11:30:06,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=45860.0, ans=0.0008999999999999998 2023-06-23 11:30:08,815 INFO [train.py:1008] (2/4) Epoch 13, batch 500, loss[loss=0.2714, simple_loss=0.3285, pruned_loss=0.1071, over 19858.00 frames. ], tot_loss[loss=0.277, simple_loss=0.3306, pruned_loss=0.1117, over 3481221.76 frames. ], batch size: 115, lr: 2.13e-02, grad_scale: 32.0 2023-06-23 11:30:29,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=45993.333333333336, ans=0.1 2023-06-23 11:30:40,868 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.702e+02 2.058e+02 2.521e+02 2.960e+02 4.342e+02, threshold=5.042e+02, percent-clipped=0.0 2023-06-23 11:31:22,270 INFO [train.py:1008] (2/4) Epoch 14, batch 0, loss[loss=0.2654, simple_loss=0.3287, pruned_loss=0.101, over 18756.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3287, pruned_loss=0.101, over 18756.00 frames. ], batch size: 83, lr: 2.05e-02, grad_scale: 32.0 2023-06-23 11:31:22,270 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 11:31:28,277 INFO [train.py:1040] (2/4) Epoch 14, validation: loss=0.2154, simple_loss=0.3133, pruned_loss=0.05873, over 143649.00 frames. 
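The [optim.py:471] entries above report grad-norm quartiles (min, 25%, median, 75%, max) together with a clipping threshold and a percent-clipped figure; in every entry logged so far the threshold equals Clipping_scale times the median quartile (for example 2.0 * 2.520e+02 ~ 5.041e+02). The short sketch below illustrates one way such a median-based clipping rule could be computed from a sliding window of recent gradient norms. It is an assumption for illustration only, not the actual icefall optim.py code; the function name clip_by_median_norm, the recent_norms buffer and the window argument are invented for this example.

import torch

def clip_by_median_norm(params, recent_norms, clipping_scale=2.0, window=100):
    # Total gradient norm over all parameters that currently have gradients.
    total_norm = torch.norm(
        torch.stack([p.grad.detach().norm() for p in params if p.grad is not None])
    )
    # Keep a sliding window of recent norms, as a stand-in for whatever history
    # the real optimizer tracks.
    recent_norms.append(total_norm.item())
    del recent_norms[:-window]
    # Quartiles in the same order as the log: min, 25%, median, 75%, max.
    q = torch.quantile(
        torch.tensor(recent_norms),
        torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]),
    )
    # The logged threshold is consistent with clipping_scale * median.
    threshold = clipping_scale * q[2]
    clipped = bool(total_norm > threshold)
    if clipped:
        # Rescale gradients in place so their total norm equals the threshold.
        for p in params:
            if p.grad is not None:
                p.grad.mul_(threshold / total_norm)
    return q, threshold, clipped

Under this reading, percent-clipped=0.0 in a log entry would simply mean that no batch in the reporting interval had a total gradient norm above clipping_scale times the running median.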
2023-06-23 11:31:28,278 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 11:31:44,956 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=3.49 vs. limit=12.0 2023-06-23 11:32:46,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=46406.666666666664, ans=0.125 2023-06-23 11:32:49,803 INFO [train.py:1008] (2/4) Epoch 14, batch 50, loss[loss=0.2608, simple_loss=0.3119, pruned_loss=0.1048, over 20114.00 frames. ], tot_loss[loss=0.2786, simple_loss=0.3307, pruned_loss=0.1133, over 876254.64 frames. ], batch size: 133, lr: 2.05e-02, grad_scale: 32.0 2023-06-23 11:33:00,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=46473.333333333336, ans=0.1 2023-06-23 11:33:10,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=46540.0, ans=0.125 2023-06-23 11:33:18,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=46540.0, ans=0.0 2023-06-23 11:33:29,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=46606.666666666664, ans=0.0 2023-06-23 11:33:52,220 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.58 vs. limit=12.0 2023-06-23 11:33:52,839 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.682e+02 2.120e+02 2.475e+02 2.928e+02 4.576e+02, threshold=4.949e+02, percent-clipped=0.0 2023-06-23 11:34:00,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=46740.0, ans=0.1 2023-06-23 11:34:12,904 INFO [train.py:1008] (2/4) Epoch 14, batch 100, loss[loss=0.2622, simple_loss=0.3265, pruned_loss=0.09895, over 19530.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.328, pruned_loss=0.1092, over 1534007.96 frames. ], batch size: 102, lr: 2.05e-02, grad_scale: 32.0 2023-06-23 11:34:27,143 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.73 vs. limit=15.0 2023-06-23 11:34:44,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=46940.0, ans=0.125 2023-06-23 11:35:35,490 INFO [train.py:1008] (2/4) Epoch 14, batch 150, loss[loss=0.2771, simple_loss=0.3243, pruned_loss=0.1149, over 20641.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3289, pruned_loss=0.1097, over 2033990.61 frames. 
], batch size: 173, lr: 2.04e-02, grad_scale: 32.0 2023-06-23 11:35:45,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=47140.0, ans=0.125 2023-06-23 11:36:06,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=47273.333333333336, ans=0.2 2023-06-23 11:36:37,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=47340.0, ans=15.0 2023-06-23 11:36:37,862 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.488e+02 2.008e+02 2.294e+02 2.617e+02 4.085e+02, threshold=4.588e+02, percent-clipped=0.0 2023-06-23 11:36:38,248 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:36:46,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=47406.666666666664, ans=0.2 2023-06-23 11:36:51,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=47406.666666666664, ans=0.0 2023-06-23 11:36:54,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=47406.666666666664, ans=0.125 2023-06-23 11:36:56,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=47473.333333333336, ans=22.5 2023-06-23 11:36:57,114 INFO [train.py:1008] (2/4) Epoch 14, batch 200, loss[loss=0.2636, simple_loss=0.3231, pruned_loss=0.102, over 19691.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3285, pruned_loss=0.1089, over 2419315.48 frames. ], batch size: 110, lr: 2.04e-02, grad_scale: 32.0 2023-06-23 11:37:02,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=47473.333333333336, ans=0.0 2023-06-23 11:38:04,751 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.41 vs. limit=6.0 2023-06-23 11:38:16,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=47740.0, ans=0.125 2023-06-23 11:38:19,877 INFO [train.py:1008] (2/4) Epoch 14, batch 250, loss[loss=0.2931, simple_loss=0.332, pruned_loss=0.1271, over 20295.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3283, pruned_loss=0.108, over 2729042.68 frames. ], batch size: 239, lr: 2.04e-02, grad_scale: 32.0 2023-06-23 11:38:23,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=47806.666666666664, ans=0.125 2023-06-23 11:38:28,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=47806.666666666664, ans=0.125 2023-06-23 11:38:35,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=47873.333333333336, ans=0.125 2023-06-23 11:38:37,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.16 vs. 
limit=15.0 2023-06-23 11:38:45,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=47873.333333333336, ans=0.125 2023-06-23 11:39:00,754 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0 2023-06-23 11:39:11,085 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.17 vs. limit=15.0 2023-06-23 11:39:12,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=48006.666666666664, ans=0.125 2023-06-23 11:39:12,407 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.24 vs. limit=12.0 2023-06-23 11:39:21,805 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:39:23,141 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.616e+02 2.088e+02 2.363e+02 2.839e+02 5.052e+02, threshold=4.726e+02, percent-clipped=1.0 2023-06-23 11:39:28,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=48073.333333333336, ans=0.0 2023-06-23 11:39:43,353 INFO [train.py:1008] (2/4) Epoch 14, batch 300, loss[loss=0.2971, simple_loss=0.3101, pruned_loss=0.1421, over 17017.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.3283, pruned_loss=0.1082, over 2970304.21 frames. ], batch size: 392, lr: 2.03e-02, grad_scale: 32.0 2023-06-23 11:39:57,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=48140.0, ans=0.0 2023-06-23 11:40:05,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=48206.666666666664, ans=0.0003898550724637691 2023-06-23 11:40:09,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=48206.666666666664, ans=0.5 2023-06-23 11:40:41,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=48340.0, ans=0.1 2023-06-23 11:40:48,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=48406.666666666664, ans=0.125 2023-06-23 11:41:06,350 INFO [train.py:1008] (2/4) Epoch 14, batch 350, loss[loss=0.2726, simple_loss=0.3367, pruned_loss=0.1042, over 16938.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.327, pruned_loss=0.1079, over 3153276.08 frames. ], batch size: 60, lr: 2.03e-02, grad_scale: 32.0 2023-06-23 11:41:09,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=48473.333333333336, ans=0.125 2023-06-23 11:41:19,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=48473.333333333336, ans=0.125 2023-06-23 11:42:07,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=48673.333333333336, ans=0.00028840579710144934 2023-06-23 11:42:07,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.38 vs. 
limit=12.0 2023-06-23 11:42:08,172 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.701e+02 2.151e+02 2.394e+02 2.779e+02 4.644e+02, threshold=4.788e+02, percent-clipped=0.0 2023-06-23 11:42:14,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=48740.0, ans=0.0 2023-06-23 11:42:22,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=48740.0, ans=0.125 2023-06-23 11:42:28,597 INFO [train.py:1008] (2/4) Epoch 14, batch 400, loss[loss=0.2645, simple_loss=0.3251, pruned_loss=0.1019, over 18632.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3273, pruned_loss=0.108, over 3298743.63 frames. ], batch size: 80, lr: 2.03e-02, grad_scale: 32.0 2023-06-23 11:42:36,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer_ff2.min_abs, batch_count=48806.666666666664, ans=0.1 2023-06-23 11:42:44,408 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.56 vs. limit=6.0 2023-06-23 11:43:05,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=48940.0, ans=0.1 2023-06-23 11:43:19,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=49006.666666666664, ans=0.125 2023-06-23 11:43:37,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=49073.333333333336, ans=0.0 2023-06-23 11:43:51,076 INFO [train.py:1008] (2/4) Epoch 14, batch 450, loss[loss=0.2742, simple_loss=0.3316, pruned_loss=0.1083, over 18799.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3272, pruned_loss=0.1076, over 3406804.92 frames. ], batch size: 83, lr: 2.02e-02, grad_scale: 32.0 2023-06-23 11:43:58,408 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=12.0 2023-06-23 11:44:13,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=49206.666666666664, ans=10.0 2023-06-23 11:44:13,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=49206.666666666664, ans=0.125 2023-06-23 11:44:40,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=49340.0, ans=0.0 2023-06-23 11:44:44,271 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.56 vs. limit=6.0 2023-06-23 11:44:51,422 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 2.043e+02 2.389e+02 2.856e+02 3.989e+02, threshold=4.778e+02, percent-clipped=0.0 2023-06-23 11:45:10,421 INFO [train.py:1008] (2/4) Epoch 14, batch 500, loss[loss=0.2789, simple_loss=0.3366, pruned_loss=0.1106, over 19087.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3271, pruned_loss=0.107, over 3498158.51 frames. 
], batch size: 89, lr: 2.02e-02, grad_scale: 32.0 2023-06-23 11:45:28,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=49540.0, ans=9.99999999999994e-05 2023-06-23 11:45:31,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=49540.0, ans=12.0 2023-06-23 11:45:44,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=49606.666666666664, ans=0.125 2023-06-23 11:45:55,749 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.68 vs. limit=6.0 2023-06-23 11:45:56,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=49673.333333333336, ans=0.1 2023-06-23 11:46:24,388 INFO [train.py:1008] (2/4) Epoch 15, batch 0, loss[loss=0.2618, simple_loss=0.3166, pruned_loss=0.1035, over 18648.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3166, pruned_loss=0.1035, over 18648.00 frames. ], batch size: 80, lr: 1.95e-02, grad_scale: 32.0 2023-06-23 11:46:24,388 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 11:46:31,361 INFO [train.py:1040] (2/4) Epoch 15, validation: loss=0.2123, simple_loss=0.3108, pruned_loss=0.05692, over 143649.00 frames. 2023-06-23 11:46:31,362 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 11:46:33,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=49693.333333333336, ans=0.0 2023-06-23 11:46:36,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=49693.333333333336, ans=0.125 2023-06-23 11:46:47,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=49760.0, ans=0.2 2023-06-23 11:47:33,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=49893.333333333336, ans=0.1 2023-06-23 11:47:51,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=49960.0, ans=8.695652173913368e-06 2023-06-23 11:47:54,015 INFO [train.py:1008] (2/4) Epoch 15, batch 50, loss[loss=0.2916, simple_loss=0.3425, pruned_loss=0.1204, over 20136.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3224, pruned_loss=0.1028, over 856357.08 frames. ], batch size: 133, lr: 1.95e-02, grad_scale: 32.0 2023-06-23 11:47:56,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=50026.666666666664, ans=0.07 2023-06-23 11:48:01,752 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.603e+02 2.176e+02 2.428e+02 2.954e+02 4.942e+02, threshold=4.857e+02, percent-clipped=1.0 2023-06-23 11:48:03,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=50026.666666666664, ans=0.125 2023-06-23 11:48:53,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=50226.666666666664, ans=0.2 2023-06-23 11:49:08,407 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.85 vs. 
limit=15.0 2023-06-23 11:49:15,449 INFO [train.py:1008] (2/4) Epoch 15, batch 100, loss[loss=0.2876, simple_loss=0.3505, pruned_loss=0.1124, over 16967.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3239, pruned_loss=0.104, over 1502632.60 frames. ], batch size: 60, lr: 1.95e-02, grad_scale: 32.0 2023-06-23 11:49:45,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=50426.666666666664, ans=0.125 2023-06-23 11:50:09,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50560.0, ans=0.1 2023-06-23 11:50:38,458 INFO [train.py:1008] (2/4) Epoch 15, batch 150, loss[loss=0.2489, simple_loss=0.3157, pruned_loss=0.09109, over 18309.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3232, pruned_loss=0.1034, over 2011814.65 frames. ], batch size: 74, lr: 1.94e-02, grad_scale: 32.0 2023-06-23 11:50:46,281 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.598e+02 1.892e+02 2.103e+02 2.343e+02 3.831e+02, threshold=4.207e+02, percent-clipped=0.0 2023-06-23 11:51:16,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=50826.666666666664, ans=0.125 2023-06-23 11:51:34,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=50893.333333333336, ans=0.1 2023-06-23 11:51:49,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=50960.0, ans=0.0 2023-06-23 11:52:01,031 INFO [train.py:1008] (2/4) Epoch 15, batch 200, loss[loss=0.2476, simple_loss=0.3135, pruned_loss=0.09083, over 19097.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3233, pruned_loss=0.1036, over 2415568.82 frames. ], batch size: 94, lr: 1.94e-02, grad_scale: 64.0 2023-06-23 11:52:23,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=51093.333333333336, ans=0.125 2023-06-23 11:52:32,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=51160.0, ans=0.0 2023-06-23 11:52:48,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=51226.666666666664, ans=0.0 2023-06-23 11:53:17,595 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.79 vs. limit=6.0 2023-06-23 11:53:21,014 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.99 vs. limit=15.0 2023-06-23 11:53:23,363 INFO [train.py:1008] (2/4) Epoch 15, batch 250, loss[loss=0.2647, simple_loss=0.3236, pruned_loss=0.1029, over 18939.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.323, pruned_loss=0.1031, over 2720471.76 frames. 
], batch size: 86, lr: 1.94e-02, grad_scale: 64.0 2023-06-23 11:53:31,297 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 2.086e+02 2.413e+02 2.905e+02 4.106e+02, threshold=4.826e+02, percent-clipped=0.0 2023-06-23 11:53:39,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=51426.666666666664, ans=0.125 2023-06-23 11:54:24,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=51560.0, ans=0.125 2023-06-23 11:54:40,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=51626.666666666664, ans=0.1 2023-06-23 11:54:45,795 INFO [train.py:1008] (2/4) Epoch 15, batch 300, loss[loss=0.2645, simple_loss=0.3196, pruned_loss=0.1047, over 20300.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3237, pruned_loss=0.1044, over 2964867.71 frames. ], batch size: 149, lr: 1.93e-02, grad_scale: 64.0 2023-06-23 11:55:00,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=51760.0, ans=0.1 2023-06-23 11:55:12,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=51760.0, ans=0.0 2023-06-23 11:55:20,909 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=23.26 vs. limit=22.5 2023-06-23 11:55:25,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=51826.666666666664, ans=0.125 2023-06-23 11:55:32,903 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.71 vs. limit=22.5 2023-06-23 11:55:34,347 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.68 vs. limit=6.0 2023-06-23 11:55:45,202 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.99 vs. limit=15.0 2023-06-23 11:55:49,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=51960.0, ans=0.2 2023-06-23 11:55:51,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=51960.0, ans=0.07 2023-06-23 11:56:07,184 INFO [train.py:1008] (2/4) Epoch 15, batch 350, loss[loss=0.2601, simple_loss=0.325, pruned_loss=0.09761, over 19689.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3232, pruned_loss=0.1042, over 3140170.73 frames. 
], batch size: 110, lr: 1.93e-02, grad_scale: 64.0 2023-06-23 11:56:07,602 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:56:16,016 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.692e+02 2.082e+02 2.457e+02 3.003e+02 4.853e+02, threshold=4.914e+02, percent-clipped=1.0 2023-06-23 11:57:06,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=52226.666666666664, ans=0.125 2023-06-23 11:57:08,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=52226.666666666664, ans=0.125 2023-06-23 11:57:08,624 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=512, metric=4.12 vs. limit=12.0 2023-06-23 11:57:19,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=52293.333333333336, ans=0.125 2023-06-23 11:57:29,261 INFO [train.py:1008] (2/4) Epoch 15, batch 400, loss[loss=0.2565, simple_loss=0.3106, pruned_loss=0.1012, over 20316.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3228, pruned_loss=0.1043, over 3279009.80 frames. ], batch size: 149, lr: 1.93e-02, grad_scale: 64.0 2023-06-23 11:57:57,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_na.min_abs, batch_count=52426.666666666664, ans=0.02 2023-06-23 11:58:04,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=52493.333333333336, ans=0.0 2023-06-23 11:58:13,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=52493.333333333336, ans=0.0 2023-06-23 11:58:43,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=52626.666666666664, ans=0.09899494936611666 2023-06-23 11:58:51,026 INFO [train.py:1008] (2/4) Epoch 15, batch 450, loss[loss=0.271, simple_loss=0.3144, pruned_loss=0.1138, over 20561.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3226, pruned_loss=0.104, over 3378155.56 frames. ], batch size: 189, lr: 1.92e-02, grad_scale: 64.0 2023-06-23 11:58:58,828 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.634e+02 1.980e+02 2.151e+02 2.530e+02 3.585e+02, threshold=4.302e+02, percent-clipped=0.0 2023-06-23 11:59:20,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=52760.0, ans=0.125 2023-06-23 11:59:33,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=52826.666666666664, ans=0.0 2023-06-23 11:59:41,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=52893.333333333336, ans=0.1 2023-06-23 12:00:08,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=52960.0, ans=0.07 2023-06-23 12:00:11,071 INFO [train.py:1008] (2/4) Epoch 15, batch 500, loss[loss=0.263, simple_loss=0.3063, pruned_loss=0.1098, over 20265.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3218, pruned_loss=0.1043, over 3467952.93 frames. 
], batch size: 239, lr: 1.92e-02, grad_scale: 64.0 2023-06-23 12:01:23,409 INFO [train.py:1008] (2/4) Epoch 16, batch 0, loss[loss=0.2589, simple_loss=0.3197, pruned_loss=0.09907, over 19320.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3197, pruned_loss=0.09907, over 19320.00 frames. ], batch size: 98, lr: 1.86e-02, grad_scale: 64.0 2023-06-23 12:01:23,409 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 12:01:28,950 INFO [train.py:1040] (2/4) Epoch 16, validation: loss=0.2067, simple_loss=0.3069, pruned_loss=0.05322, over 143649.00 frames. 2023-06-23 12:01:28,951 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 12:01:39,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=53240.0, ans=0.125 2023-06-23 12:01:44,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=53306.666666666664, ans=0.0 2023-06-23 12:01:49,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=53306.666666666664, ans=0.125 2023-06-23 12:02:10,837 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.559e+02 2.015e+02 2.320e+02 2.731e+02 3.993e+02, threshold=4.641e+02, percent-clipped=0.0 2023-06-23 12:02:13,578 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.39 vs. limit=15.0 2023-06-23 12:02:33,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=53440.0, ans=0.125 2023-06-23 12:02:51,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=53506.666666666664, ans=0.2 2023-06-23 12:02:54,597 INFO [train.py:1008] (2/4) Epoch 16, batch 50, loss[loss=0.248, simple_loss=0.3185, pruned_loss=0.08875, over 18282.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3201, pruned_loss=0.101, over 861810.90 frames. ], batch size: 74, lr: 1.86e-02, grad_scale: 32.0 2023-06-23 12:03:03,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=53573.333333333336, ans=0.125 2023-06-23 12:03:43,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=53773.333333333336, ans=0.125 2023-06-23 12:04:06,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=53840.0, ans=0.1 2023-06-23 12:04:16,525 INFO [train.py:1008] (2/4) Epoch 16, batch 100, loss[loss=0.2674, simple_loss=0.3252, pruned_loss=0.1049, over 19520.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3195, pruned_loss=0.1003, over 1505461.62 frames. ], batch size: 102, lr: 1.85e-02, grad_scale: 32.0 2023-06-23 12:04:17,889 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.52 vs. limit=12.0 2023-06-23 12:04:49,640 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.88 vs. 
limit=15.0 2023-06-23 12:04:53,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=54040.0, ans=0.0 2023-06-23 12:04:55,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=54040.0, ans=0.0 2023-06-23 12:04:56,725 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.578e+02 2.011e+02 2.286e+02 2.547e+02 3.194e+02, threshold=4.572e+02, percent-clipped=0.0 2023-06-23 12:05:11,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=54106.666666666664, ans=0.0 2023-06-23 12:05:38,404 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=15.0 2023-06-23 12:05:40,384 INFO [train.py:1008] (2/4) Epoch 16, batch 150, loss[loss=0.2507, simple_loss=0.3105, pruned_loss=0.09547, over 19119.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3193, pruned_loss=0.1007, over 2014652.07 frames. ], batch size: 94, lr: 1.85e-02, grad_scale: 32.0 2023-06-23 12:05:44,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=54240.0, ans=0.0 2023-06-23 12:06:13,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=54373.333333333336, ans=0.125 2023-06-23 12:06:17,475 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.85 vs. limit=6.0 2023-06-23 12:06:32,085 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=12.0 2023-06-23 12:06:35,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=54440.0, ans=0.0 2023-06-23 12:06:56,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=54506.666666666664, ans=0.125 2023-06-23 12:07:05,368 INFO [train.py:1008] (2/4) Epoch 16, batch 200, loss[loss=0.2993, simple_loss=0.3649, pruned_loss=0.1169, over 17615.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3197, pruned_loss=0.1023, over 2403125.57 frames. 
], batch size: 67, lr: 1.85e-02, grad_scale: 32.0 2023-06-23 12:07:23,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=54640.0, ans=0.125 2023-06-23 12:07:31,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=54640.0, ans=0.1 2023-06-23 12:07:44,311 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:07:45,467 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.605e+02 1.985e+02 2.333e+02 2.775e+02 4.718e+02, threshold=4.665e+02, percent-clipped=2.0 2023-06-23 12:08:12,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=54840.0, ans=0.125 2023-06-23 12:08:25,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=54840.0, ans=0.1 2023-06-23 12:08:29,509 INFO [train.py:1008] (2/4) Epoch 16, batch 250, loss[loss=0.2462, simple_loss=0.3089, pruned_loss=0.09169, over 19069.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3199, pruned_loss=0.1018, over 2707542.91 frames. ], batch size: 89, lr: 1.85e-02, grad_scale: 32.0 2023-06-23 12:08:51,330 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.05 vs. limit=15.0 2023-06-23 12:08:52,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=54973.333333333336, ans=0.0 2023-06-23 12:09:18,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=55106.666666666664, ans=0.125 2023-06-23 12:09:20,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=55106.666666666664, ans=0.125 2023-06-23 12:09:22,802 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2023-06-23 12:09:32,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=55106.666666666664, ans=0.2 2023-06-23 12:09:34,845 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.49 vs. limit=22.5 2023-06-23 12:09:35,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=55173.333333333336, ans=0.125 2023-06-23 12:09:46,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=55173.333333333336, ans=0.0 2023-06-23 12:09:53,099 INFO [train.py:1008] (2/4) Epoch 16, batch 300, loss[loss=0.2781, simple_loss=0.3397, pruned_loss=0.1082, over 16723.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3202, pruned_loss=0.1012, over 2931101.39 frames. 
], batch size: 59, lr: 1.84e-02, grad_scale: 32.0 2023-06-23 12:09:53,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=55240.0, ans=0.125 2023-06-23 12:09:55,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=55240.0, ans=0.07 2023-06-23 12:10:00,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=55240.0, ans=0.1 2023-06-23 12:10:00,953 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.62 vs. limit=12.0 2023-06-23 12:10:07,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=55240.0, ans=0.0 2023-06-23 12:10:32,329 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.89 vs. limit=15.0 2023-06-23 12:10:34,234 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 2.012e+02 2.441e+02 2.908e+02 3.801e+02, threshold=4.883e+02, percent-clipped=0.0 2023-06-23 12:11:14,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=55506.666666666664, ans=0.1 2023-06-23 12:11:18,543 INFO [train.py:1008] (2/4) Epoch 16, batch 350, loss[loss=0.2601, simple_loss=0.3182, pruned_loss=0.101, over 20650.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3194, pruned_loss=0.1014, over 3132526.00 frames. ], batch size: 211, lr: 1.84e-02, grad_scale: 32.0 2023-06-23 12:11:39,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=55640.0, ans=0.0 2023-06-23 12:11:49,686 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:11:52,291 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=13.02 vs. limit=15.0 2023-06-23 12:12:40,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=55840.0, ans=0.1 2023-06-23 12:12:43,449 INFO [train.py:1008] (2/4) Epoch 16, batch 400, loss[loss=0.2468, simple_loss=0.3092, pruned_loss=0.09222, over 19695.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3201, pruned_loss=0.1012, over 3259941.58 frames. ], batch size: 110, lr: 1.84e-02, grad_scale: 32.0 2023-06-23 12:13:04,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=55973.333333333336, ans=0.2 2023-06-23 12:13:04,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=55973.333333333336, ans=0.1 2023-06-23 12:13:23,934 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-23 12:13:24,803 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 2.066e+02 2.371e+02 2.737e+02 3.800e+02, threshold=4.742e+02, percent-clipped=0.0 2023-06-23 12:13:37,780 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.25 vs. 
limit=12.0 2023-06-23 12:13:51,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=56173.333333333336, ans=0.0 2023-06-23 12:13:54,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=56173.333333333336, ans=0.125 2023-06-23 12:14:09,377 INFO [train.py:1008] (2/4) Epoch 16, batch 450, loss[loss=0.2327, simple_loss=0.2997, pruned_loss=0.08285, over 19924.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3197, pruned_loss=0.1006, over 3382371.04 frames. ], batch size: 120, lr: 1.83e-02, grad_scale: 32.0 2023-06-23 12:14:33,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=56306.666666666664, ans=0.0 2023-06-23 12:14:47,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.35 vs. limit=10.0 2023-06-23 12:15:11,545 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.59 vs. limit=15.0 2023-06-23 12:15:31,261 INFO [train.py:1008] (2/4) Epoch 16, batch 500, loss[loss=0.2455, simple_loss=0.3142, pruned_loss=0.08833, over 19343.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3198, pruned_loss=0.1009, over 3476071.82 frames. ], batch size: 98, lr: 1.83e-02, grad_scale: 32.0 2023-06-23 12:16:10,045 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.938e+02 2.207e+02 2.461e+02 4.032e+02, threshold=4.415e+02, percent-clipped=0.0 2023-06-23 12:16:44,471 INFO [train.py:1008] (2/4) Epoch 17, batch 0, loss[loss=0.2414, simple_loss=0.3094, pruned_loss=0.08674, over 19537.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3094, pruned_loss=0.08674, over 19537.00 frames. ], batch size: 102, lr: 1.78e-02, grad_scale: 32.0 2023-06-23 12:16:44,472 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 12:16:50,202 INFO [train.py:1040] (2/4) Epoch 17, validation: loss=0.2079, simple_loss=0.3059, pruned_loss=0.0549, over 143649.00 frames. 2023-06-23 12:16:50,203 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 12:17:08,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=56853.333333333336, ans=0.125 2023-06-23 12:17:25,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=56920.0, ans=0.05 2023-06-23 12:17:26,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=56920.0, ans=0.0 2023-06-23 12:17:28,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=56920.0, ans=0.125 2023-06-23 12:18:13,398 INFO [train.py:1008] (2/4) Epoch 17, batch 50, loss[loss=0.2564, simple_loss=0.314, pruned_loss=0.09937, over 20106.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3181, pruned_loss=0.1001, over 873593.85 frames. 
], batch size: 133, lr: 1.77e-02, grad_scale: 32.0 2023-06-23 12:18:40,013 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:18:45,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=57253.333333333336, ans=0.0 2023-06-23 12:18:53,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=57253.333333333336, ans=0.125 2023-06-23 12:19:01,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=57320.0, ans=0.1 2023-06-23 12:19:01,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=57320.0, ans=0.0 2023-06-23 12:19:10,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.39 vs. limit=15.0 2023-06-23 12:19:23,001 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.496e+02 1.944e+02 2.187e+02 2.439e+02 3.665e+02, threshold=4.374e+02, percent-clipped=0.0 2023-06-23 12:19:32,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=57386.666666666664, ans=0.2 2023-06-23 12:19:33,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.64 vs. limit=12.0 2023-06-23 12:19:36,908 INFO [train.py:1008] (2/4) Epoch 17, batch 100, loss[loss=0.261, simple_loss=0.3385, pruned_loss=0.09173, over 17633.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.317, pruned_loss=0.1008, over 1521631.70 frames. ], batch size: 67, lr: 1.77e-02, grad_scale: 32.0 2023-06-23 12:19:59,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=57520.0, ans=0.125 2023-06-23 12:20:03,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=57520.0, ans=0.125 2023-06-23 12:21:00,458 INFO [train.py:1008] (2/4) Epoch 17, batch 150, loss[loss=0.2586, simple_loss=0.3151, pruned_loss=0.1011, over 19980.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3162, pruned_loss=0.1001, over 2011222.34 frames. ], batch size: 126, lr: 1.77e-02, grad_scale: 32.0 2023-06-23 12:21:05,710 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:22:08,706 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.366e+02 2.008e+02 2.186e+02 2.450e+02 4.680e+02, threshold=4.373e+02, percent-clipped=1.0 2023-06-23 12:22:12,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=58053.333333333336, ans=0.125 2023-06-23 12:22:21,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=58120.0, ans=0.125 2023-06-23 12:22:22,857 INFO [train.py:1008] (2/4) Epoch 17, batch 200, loss[loss=0.2481, simple_loss=0.3122, pruned_loss=0.09199, over 18485.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3164, pruned_loss=0.09901, over 2395603.59 frames. 
], batch size: 77, lr: 1.76e-02, grad_scale: 32.0 2023-06-23 12:22:24,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=58120.0, ans=0.2 2023-06-23 12:22:34,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=58120.0, ans=0.2 2023-06-23 12:22:35,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=58120.0, ans=0.125 2023-06-23 12:22:39,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=58186.666666666664, ans=0.125 2023-06-23 12:22:50,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=58186.666666666664, ans=0.0 2023-06-23 12:22:52,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.max_positive, batch_count=58186.666666666664, ans=0.95 2023-06-23 12:22:55,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=58253.333333333336, ans=0.125 2023-06-23 12:22:59,596 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:23:11,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=58320.0, ans=0.09899494936611666 2023-06-23 12:23:33,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=58386.666666666664, ans=0.0 2023-06-23 12:23:33,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=58386.666666666664, ans=0.2 2023-06-23 12:23:45,482 INFO [train.py:1008] (2/4) Epoch 17, batch 250, loss[loss=0.2818, simple_loss=0.3314, pruned_loss=0.1162, over 20294.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3167, pruned_loss=0.09919, over 2707144.49 frames. ], batch size: 141, lr: 1.76e-02, grad_scale: 32.0 2023-06-23 12:23:53,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=58453.333333333336, ans=0.125 2023-06-23 12:23:55,310 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.80 vs. limit=15.0 2023-06-23 12:24:02,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=58520.0, ans=0.1 2023-06-23 12:24:04,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.79 vs. limit=22.5 2023-06-23 12:24:29,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.56 vs. 
limit=12.0 2023-06-23 12:24:39,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer_ff3.min_abs, batch_count=58653.333333333336, ans=0.2 2023-06-23 12:24:54,506 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.622e+02 1.998e+02 2.239e+02 2.590e+02 4.394e+02, threshold=4.478e+02, percent-clipped=1.0 2023-06-23 12:25:08,894 INFO [train.py:1008] (2/4) Epoch 17, batch 300, loss[loss=0.256, simple_loss=0.3136, pruned_loss=0.09926, over 20113.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3166, pruned_loss=0.09873, over 2945903.31 frames. ], batch size: 133, lr: 1.76e-02, grad_scale: 32.0 2023-06-23 12:25:30,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=58853.333333333336, ans=0.2 2023-06-23 12:25:46,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=58920.0, ans=0.0 2023-06-23 12:26:27,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=59053.333333333336, ans=0.0 2023-06-23 12:26:30,773 INFO [train.py:1008] (2/4) Epoch 17, batch 350, loss[loss=0.2653, simple_loss=0.3178, pruned_loss=0.1064, over 20591.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3162, pruned_loss=0.09807, over 3138459.51 frames. ], batch size: 173, lr: 1.76e-02, grad_scale: 32.0 2023-06-23 12:26:45,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=59120.0, ans=0.125 2023-06-23 12:26:55,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=59186.666666666664, ans=0.125 2023-06-23 12:27:00,368 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=23.73 vs. limit=22.5 2023-06-23 12:27:17,351 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=12.0 2023-06-23 12:27:41,358 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 1.927e+02 2.178e+02 2.510e+02 4.895e+02, threshold=4.356e+02, percent-clipped=2.0 2023-06-23 12:27:49,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=59386.666666666664, ans=0.125 2023-06-23 12:27:54,102 INFO [train.py:1008] (2/4) Epoch 17, batch 400, loss[loss=0.2597, simple_loss=0.3053, pruned_loss=0.1071, over 20204.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3162, pruned_loss=0.09825, over 3274540.67 frames. 
], batch size: 239, lr: 1.75e-02, grad_scale: 32.0 2023-06-23 12:28:00,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=59453.333333333336, ans=0.125 2023-06-23 12:28:02,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=59453.333333333336, ans=0.125 2023-06-23 12:28:10,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=59520.0, ans=0.2 2023-06-23 12:28:11,917 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:28:26,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=59586.666666666664, ans=0.02 2023-06-23 12:28:54,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=59653.333333333336, ans=0.0 2023-06-23 12:29:06,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=59720.0, ans=0.0 2023-06-23 12:29:16,914 INFO [train.py:1008] (2/4) Epoch 17, batch 450, loss[loss=0.254, simple_loss=0.3095, pruned_loss=0.09923, over 20535.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3171, pruned_loss=0.0986, over 3370041.92 frames. ], batch size: 173, lr: 1.75e-02, grad_scale: 32.0 2023-06-23 12:29:23,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=59786.666666666664, ans=0.125 2023-06-23 12:29:33,189 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.44 vs. limit=6.0 2023-06-23 12:30:07,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=59986.666666666664, ans=0.0 2023-06-23 12:30:24,544 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 2.084e+02 2.367e+02 2.867e+02 4.445e+02, threshold=4.735e+02, percent-clipped=1.0 2023-06-23 12:30:37,439 INFO [train.py:1008] (2/4) Epoch 17, batch 500, loss[loss=0.2601, simple_loss=0.3289, pruned_loss=0.09565, over 18464.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3173, pruned_loss=0.09818, over 3464778.61 frames. ], batch size: 77, lr: 1.75e-02, grad_scale: 32.0 2023-06-23 12:30:58,484 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:30:58,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=60186.666666666664, ans=0.1 2023-06-23 12:31:02,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=60186.666666666664, ans=0.125 2023-06-23 12:31:08,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=60253.333333333336, ans=0.0 2023-06-23 12:31:16,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=60253.333333333336, ans=0.125 2023-06-23 12:31:49,843 INFO [train.py:1008] (2/4) Epoch 18, batch 0, loss[loss=0.2846, simple_loss=0.3488, pruned_loss=0.1103, over 18318.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.3488, pruned_loss=0.1103, over 18318.00 frames. 
], batch size: 72, lr: 1.70e-02, grad_scale: 32.0 2023-06-23 12:31:49,844 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 12:31:55,527 INFO [train.py:1040] (2/4) Epoch 18, validation: loss=0.2057, simple_loss=0.3034, pruned_loss=0.05401, over 143649.00 frames. 2023-06-23 12:31:55,527 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 12:32:02,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=60333.333333333336, ans=0.125 2023-06-23 12:32:04,128 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.58 vs. limit=22.5 2023-06-23 12:32:13,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=60400.0, ans=0.0 2023-06-23 12:32:29,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=60466.666666666664, ans=0.0 2023-06-23 12:33:07,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=60600.0, ans=0.1 2023-06-23 12:33:14,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=60600.0, ans=0.2 2023-06-23 12:33:14,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=60600.0, ans=0.0 2023-06-23 12:33:17,517 INFO [train.py:1008] (2/4) Epoch 18, batch 50, loss[loss=0.2514, simple_loss=0.331, pruned_loss=0.08591, over 16385.00 frames. ], tot_loss[loss=0.25, simple_loss=0.313, pruned_loss=0.09349, over 837911.13 frames. ], batch size: 52, lr: 1.69e-02, grad_scale: 32.0 2023-06-23 12:33:33,564 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.571e+02 1.949e+02 2.212e+02 2.514e+02 4.263e+02, threshold=4.424e+02, percent-clipped=0.0 2023-06-23 12:33:43,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=60733.333333333336, ans=0.125 2023-06-23 12:34:02,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=60800.0, ans=0.125 2023-06-23 12:34:10,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=60866.666666666664, ans=0.125 2023-06-23 12:34:14,574 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=5.66 vs. limit=15.0 2023-06-23 12:34:15,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=60866.666666666664, ans=0.125 2023-06-23 12:34:35,139 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:34:39,499 INFO [train.py:1008] (2/4) Epoch 18, batch 100, loss[loss=0.2715, simple_loss=0.3217, pruned_loss=0.1106, over 19988.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3149, pruned_loss=0.09626, over 1503205.69 frames. 
], batch size: 126, lr: 1.69e-02, grad_scale: 32.0 2023-06-23 12:34:47,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_positive, batch_count=61000.0, ans=0.05 2023-06-23 12:34:58,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=61066.666666666664, ans=0.5 2023-06-23 12:35:05,480 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=15.0 2023-06-23 12:35:09,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=61066.666666666664, ans=0.125 2023-06-23 12:35:28,506 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.632e-01 2023-06-23 12:35:29,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=61200.0, ans=0.2 2023-06-23 12:35:56,027 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.23 vs. limit=22.5 2023-06-23 12:36:02,344 INFO [train.py:1008] (2/4) Epoch 18, batch 150, loss[loss=0.2574, simple_loss=0.3148, pruned_loss=0.1, over 20294.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3141, pruned_loss=0.09476, over 2018970.54 frames. ], batch size: 141, lr: 1.69e-02, grad_scale: 32.0 2023-06-23 12:36:05,534 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.05 vs. limit=22.5 2023-06-23 12:36:07,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=61333.333333333336, ans=0.125 2023-06-23 12:36:08,512 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.11 vs. limit=15.0 2023-06-23 12:36:18,968 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.492e+02 1.917e+02 2.135e+02 2.463e+02 3.823e+02, threshold=4.269e+02, percent-clipped=0.0 2023-06-23 12:36:25,252 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.98 vs. limit=15.0 2023-06-23 12:36:34,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=61466.666666666664, ans=6.0 2023-06-23 12:36:45,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.28 vs. limit=10.0 2023-06-23 12:36:50,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=61533.333333333336, ans=0.0 2023-06-23 12:36:55,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=61533.333333333336, ans=0.0 2023-06-23 12:37:21,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=61600.0, ans=0.0 2023-06-23 12:37:24,471 INFO [train.py:1008] (2/4) Epoch 18, batch 200, loss[loss=0.2642, simple_loss=0.3188, pruned_loss=0.1048, over 20476.00 frames. 
], tot_loss[loss=0.2531, simple_loss=0.3145, pruned_loss=0.09586, over 2398611.15 frames. ], batch size: 160, lr: 1.69e-02, grad_scale: 32.0 2023-06-23 12:37:25,166 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.13 vs. limit=15.0 2023-06-23 12:37:51,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=61733.333333333336, ans=0.125 2023-06-23 12:38:15,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=61866.666666666664, ans=0.125 2023-06-23 12:38:20,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=61866.666666666664, ans=0.125 2023-06-23 12:38:44,624 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:38:45,970 INFO [train.py:1008] (2/4) Epoch 18, batch 250, loss[loss=0.2333, simple_loss=0.3091, pruned_loss=0.07879, over 17601.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3149, pruned_loss=0.09606, over 2701373.30 frames. ], batch size: 67, lr: 1.68e-02, grad_scale: 32.0 2023-06-23 12:39:02,740 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.868e+02 2.092e+02 2.337e+02 4.828e+02, threshold=4.183e+02, percent-clipped=1.0 2023-06-23 12:39:15,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=62066.666666666664, ans=0.125 2023-06-23 12:39:42,166 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.07 vs. limit=12.0 2023-06-23 12:39:47,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=62200.0, ans=0.125 2023-06-23 12:39:59,392 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.89 vs. limit=15.0 2023-06-23 12:40:04,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=62266.666666666664, ans=15.0 2023-06-23 12:40:04,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=62266.666666666664, ans=0.0 2023-06-23 12:40:09,903 INFO [train.py:1008] (2/4) Epoch 18, batch 300, loss[loss=0.26, simple_loss=0.3206, pruned_loss=0.09975, over 19684.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3142, pruned_loss=0.09612, over 2941112.06 frames. ], batch size: 110, lr: 1.68e-02, grad_scale: 32.0 2023-06-23 12:40:49,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=62466.666666666664, ans=0.125 2023-06-23 12:41:28,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=62600.0, ans=0.125 2023-06-23 12:41:31,701 INFO [train.py:1008] (2/4) Epoch 18, batch 350, loss[loss=0.2342, simple_loss=0.3078, pruned_loss=0.08029, over 18302.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3141, pruned_loss=0.09601, over 3120973.18 frames. 
], batch size: 74, lr: 1.68e-02, grad_scale: 32.0 2023-06-23 12:41:47,676 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.885e+02 2.140e+02 2.544e+02 4.064e+02, threshold=4.281e+02, percent-clipped=0.0 2023-06-23 12:41:53,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=62733.333333333336, ans=0.2 2023-06-23 12:42:51,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=62933.333333333336, ans=0.1 2023-06-23 12:42:54,373 INFO [train.py:1008] (2/4) Epoch 18, batch 400, loss[loss=0.2638, simple_loss=0.3372, pruned_loss=0.09519, over 16358.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3143, pruned_loss=0.09571, over 3264464.92 frames. ], batch size: 52, lr: 1.68e-02, grad_scale: 32.0 2023-06-23 12:42:56,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=63000.0, ans=0.125 2023-06-23 12:43:00,196 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:43:09,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=63066.666666666664, ans=0.125 2023-06-23 12:43:35,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=63133.333333333336, ans=0.1 2023-06-23 12:43:48,578 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.80 vs. limit=15.0 2023-06-23 12:44:17,929 INFO [train.py:1008] (2/4) Epoch 18, batch 450, loss[loss=0.2455, simple_loss=0.3036, pruned_loss=0.09372, over 20543.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3138, pruned_loss=0.09547, over 3384215.25 frames. ], batch size: 189, lr: 1.67e-02, grad_scale: 32.0 2023-06-23 12:44:33,787 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.630e+02 1.966e+02 2.259e+02 2.581e+02 5.300e+02, threshold=4.517e+02, percent-clipped=2.0 2023-06-23 12:45:19,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=63533.333333333336, ans=0.04949747468305833 2023-06-23 12:45:36,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=63666.666666666664, ans=0.125 2023-06-23 12:45:38,025 INFO [train.py:1008] (2/4) Epoch 18, batch 500, loss[loss=0.2254, simple_loss=0.2981, pruned_loss=0.0764, over 11472.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3137, pruned_loss=0.0949, over 3447084.82 frames. ], batch size: 32, lr: 1.67e-02, grad_scale: 32.0 2023-06-23 12:45:41,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=63666.666666666664, ans=0.2 2023-06-23 12:46:13,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=63800.0, ans=0.125 2023-06-23 12:46:50,138 INFO [train.py:1008] (2/4) Epoch 19, batch 0, loss[loss=0.2369, simple_loss=0.3045, pruned_loss=0.08467, over 18766.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3045, pruned_loss=0.08467, over 18766.00 frames. 
], batch size: 83, lr: 1.62e-02, grad_scale: 32.0 2023-06-23 12:46:50,138 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 12:46:55,823 INFO [train.py:1040] (2/4) Epoch 19, validation: loss=0.2047, simple_loss=0.3033, pruned_loss=0.05308, over 143649.00 frames. 2023-06-23 12:46:55,824 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 12:46:59,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=63880.0, ans=0.125 2023-06-23 12:47:14,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=63946.666666666664, ans=0.0 2023-06-23 12:47:25,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=63946.666666666664, ans=15.0 2023-06-23 12:47:41,320 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.842e+02 2.018e+02 2.417e+02 3.791e+02, threshold=4.036e+02, percent-clipped=0.0 2023-06-23 12:47:54,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=64080.0, ans=0.0 2023-06-23 12:48:15,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=64146.666666666664, ans=0.125 2023-06-23 12:48:17,897 INFO [train.py:1008] (2/4) Epoch 19, batch 50, loss[loss=0.2598, simple_loss=0.3303, pruned_loss=0.09467, over 16462.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3102, pruned_loss=0.09321, over 857864.37 frames. ], batch size: 52, lr: 1.62e-02, grad_scale: 32.0 2023-06-23 12:48:29,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=64213.333333333336, ans=0.1 2023-06-23 12:48:32,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=64280.0, ans=0.125 2023-06-23 12:48:45,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=64280.0, ans=0.09899494936611666 2023-06-23 12:48:59,537 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=22.98 vs. limit=22.5 2023-06-23 12:49:06,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=64413.333333333336, ans=0.2 2023-06-23 12:49:10,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=64413.333333333336, ans=0.0 2023-06-23 12:49:28,643 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.26 vs. limit=6.0 2023-06-23 12:49:39,863 INFO [train.py:1008] (2/4) Epoch 19, batch 100, loss[loss=0.2388, simple_loss=0.307, pruned_loss=0.08528, over 19324.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3109, pruned_loss=0.09382, over 1515900.07 frames. 
], batch size: 98, lr: 1.62e-02, grad_scale: 32.0 2023-06-23 12:49:44,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=64546.666666666664, ans=0.125 2023-06-23 12:49:50,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=64546.666666666664, ans=0.125 2023-06-23 12:49:59,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=64613.333333333336, ans=0.125 2023-06-23 12:50:07,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=64613.333333333336, ans=0.0 2023-06-23 12:50:14,891 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.08 vs. limit=22.5 2023-06-23 12:50:17,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=64680.0, ans=0.125 2023-06-23 12:50:25,075 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.480e+02 1.991e+02 2.213e+02 2.609e+02 4.659e+02, threshold=4.425e+02, percent-clipped=3.0 2023-06-23 12:50:38,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=64746.666666666664, ans=0.2 2023-06-23 12:50:38,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=64746.666666666664, ans=0.07 2023-06-23 12:50:53,966 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=7.09 vs. limit=15.0 2023-06-23 12:51:01,325 INFO [train.py:1008] (2/4) Epoch 19, batch 150, loss[loss=0.2347, simple_loss=0.2909, pruned_loss=0.0893, over 20565.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3111, pruned_loss=0.09373, over 2019272.00 frames. ], batch size: 173, lr: 1.62e-02, grad_scale: 32.0 2023-06-23 12:51:11,910 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=7.09 vs. limit=15.0 2023-06-23 12:51:54,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=65080.0, ans=0.2 2023-06-23 12:52:02,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=65080.0, ans=0.0 2023-06-23 12:52:02,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=65080.0, ans=0.1 2023-06-23 12:52:15,193 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.94 vs. limit=15.0 2023-06-23 12:52:24,639 INFO [train.py:1008] (2/4) Epoch 19, batch 200, loss[loss=0.2495, simple_loss=0.2992, pruned_loss=0.09991, over 20568.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3115, pruned_loss=0.09425, over 2416845.97 frames. 
], batch size: 189, lr: 1.61e-02, grad_scale: 32.0 2023-06-23 12:52:31,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=65213.333333333336, ans=0.125 2023-06-23 12:52:32,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=65213.333333333336, ans=0.0 2023-06-23 12:52:44,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=65280.0, ans=0.125 2023-06-23 12:52:45,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=65280.0, ans=0.0 2023-06-23 12:53:10,920 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.572e+02 1.917e+02 2.134e+02 2.466e+02 4.511e+02, threshold=4.268e+02, percent-clipped=1.0 2023-06-23 12:53:14,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=65413.333333333336, ans=0.125 2023-06-23 12:53:20,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=65413.333333333336, ans=0.025 2023-06-23 12:53:47,872 INFO [train.py:1008] (2/4) Epoch 19, batch 250, loss[loss=0.2659, simple_loss=0.3176, pruned_loss=0.1071, over 20607.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3112, pruned_loss=0.09456, over 2725665.96 frames. ], batch size: 189, lr: 1.61e-02, grad_scale: 32.0 2023-06-23 12:53:53,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=65546.66666666667, ans=0.0 2023-06-23 12:54:15,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=65613.33333333333, ans=0.07 2023-06-23 12:55:06,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=65813.33333333333, ans=0.1 2023-06-23 12:55:10,894 INFO [train.py:1008] (2/4) Epoch 19, batch 300, loss[loss=0.2644, simple_loss=0.3338, pruned_loss=0.09747, over 16746.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3115, pruned_loss=0.09473, over 2952914.34 frames. ], batch size: 59, lr: 1.61e-02, grad_scale: 32.0 2023-06-23 12:55:14,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=65880.0, ans=0.125 2023-06-23 12:55:17,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=65880.0, ans=0.125 2023-06-23 12:55:23,297 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.77 vs. limit=6.0 2023-06-23 12:55:28,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=65946.66666666667, ans=0.5 2023-06-23 12:55:31,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=65946.66666666667, ans=0.05 2023-06-23 12:55:56,122 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=13.11 vs. 
limit=15.0 2023-06-23 12:55:56,540 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.473e+02 1.851e+02 2.103e+02 2.314e+02 3.495e+02, threshold=4.207e+02, percent-clipped=0.0 2023-06-23 12:56:29,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=66146.66666666667, ans=0.125 2023-06-23 12:56:31,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=66213.33333333333, ans=0.0 2023-06-23 12:56:32,561 INFO [train.py:1008] (2/4) Epoch 19, batch 350, loss[loss=0.2572, simple_loss=0.3206, pruned_loss=0.09689, over 19114.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3117, pruned_loss=0.09526, over 3109081.52 frames. ], batch size: 94, lr: 1.61e-02, grad_scale: 32.0 2023-06-23 12:57:29,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=66413.33333333333, ans=0.125 2023-06-23 12:57:44,298 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.63 vs. limit=15.0 2023-06-23 12:57:51,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=66480.0, ans=0.125 2023-06-23 12:57:54,124 INFO [train.py:1008] (2/4) Epoch 19, batch 400, loss[loss=0.2522, simple_loss=0.3107, pruned_loss=0.0968, over 20490.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3129, pruned_loss=0.09466, over 3247687.77 frames. ], batch size: 160, lr: 1.60e-02, grad_scale: 32.0 2023-06-23 12:58:17,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=66613.33333333333, ans=0.0 2023-06-23 12:58:38,066 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=512, metric=2.88 vs. limit=15.0 2023-06-23 12:58:40,253 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.465e+02 1.858e+02 2.159e+02 2.467e+02 4.399e+02, threshold=4.318e+02, percent-clipped=1.0 2023-06-23 12:58:56,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=66746.66666666667, ans=0.125 2023-06-23 12:59:16,971 INFO [train.py:1008] (2/4) Epoch 19, batch 450, loss[loss=0.2499, simple_loss=0.3076, pruned_loss=0.09616, over 20353.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3125, pruned_loss=0.09414, over 3353330.84 frames. ], batch size: 149, lr: 1.60e-02, grad_scale: 64.0 2023-06-23 12:59:17,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=66880.0, ans=0.0 2023-06-23 12:59:38,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=66946.66666666667, ans=0.07 2023-06-23 13:00:14,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=67080.0, ans=0.0 2023-06-23 13:00:16,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=67080.0, ans=0.1 2023-06-23 13:00:19,876 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.88 vs. 
limit=15.0 2023-06-23 13:00:21,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=67146.66666666667, ans=0.2 2023-06-23 13:00:38,143 INFO [train.py:1008] (2/4) Epoch 19, batch 500, loss[loss=0.2709, simple_loss=0.3435, pruned_loss=0.09914, over 16724.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3115, pruned_loss=0.09433, over 3447144.87 frames. ], batch size: 59, lr: 1.60e-02, grad_scale: 64.0 2023-06-23 13:00:46,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=67213.33333333333, ans=0.04949747468305833 2023-06-23 13:01:22,696 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.865e+02 2.077e+02 2.519e+02 3.781e+02, threshold=4.153e+02, percent-clipped=0.0 2023-06-23 13:01:27,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=67413.33333333333, ans=0.0 2023-06-23 13:01:52,657 INFO [train.py:1008] (2/4) Epoch 20, batch 0, loss[loss=0.2587, simple_loss=0.3311, pruned_loss=0.09316, over 18322.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3311, pruned_loss=0.09316, over 18322.00 frames. ], batch size: 72, lr: 1.56e-02, grad_scale: 64.0 2023-06-23 13:01:52,657 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 13:01:58,311 INFO [train.py:1040] (2/4) Epoch 20, validation: loss=0.2036, simple_loss=0.3021, pruned_loss=0.05255, over 143649.00 frames. 2023-06-23 13:01:58,311 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 13:02:01,808 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:02:08,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.balancer1.prob, batch_count=67433.33333333333, ans=0.125 2023-06-23 13:02:55,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=67633.33333333333, ans=0.125 2023-06-23 13:03:03,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=67700.0, ans=0.125 2023-06-23 13:03:09,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=67700.0, ans=0.07 2023-06-23 13:03:12,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=67700.0, ans=0.1 2023-06-23 13:03:21,097 INFO [train.py:1008] (2/4) Epoch 20, batch 50, loss[loss=0.2514, simple_loss=0.3063, pruned_loss=0.09829, over 20648.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3084, pruned_loss=0.09119, over 857316.17 frames. 
], batch size: 211, lr: 1.55e-02, grad_scale: 64.0 2023-06-23 13:03:35,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=67833.33333333333, ans=0.2 2023-06-23 13:04:07,547 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:04:09,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.min_abs, batch_count=67966.66666666667, ans=0.5 2023-06-23 13:04:34,889 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.553e+02 1.894e+02 2.119e+02 2.506e+02 3.679e+02, threshold=4.238e+02, percent-clipped=0.0 2023-06-23 13:04:43,803 INFO [train.py:1008] (2/4) Epoch 20, batch 100, loss[loss=0.2507, simple_loss=0.3317, pruned_loss=0.08485, over 17620.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.311, pruned_loss=0.09134, over 1505886.02 frames. ], batch size: 67, lr: 1.55e-02, grad_scale: 64.0 2023-06-23 13:05:21,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=68233.33333333333, ans=0.1 2023-06-23 13:05:36,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=68300.0, ans=0.125 2023-06-23 13:05:50,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=68366.66666666667, ans=0.125 2023-06-23 13:05:50,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=68366.66666666667, ans=0.1 2023-06-23 13:05:58,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=68366.66666666667, ans=0.125 2023-06-23 13:06:04,708 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.35 vs. limit=15.0 2023-06-23 13:06:05,381 INFO [train.py:1008] (2/4) Epoch 20, batch 150, loss[loss=0.2501, simple_loss=0.3211, pruned_loss=0.08951, over 18769.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3098, pruned_loss=0.09037, over 2014980.96 frames. ], batch size: 83, lr: 1.55e-02, grad_scale: 64.0 2023-06-23 13:06:16,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=68433.33333333333, ans=0.125 2023-06-23 13:06:19,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=68433.33333333333, ans=0.125 2023-06-23 13:06:46,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=68566.66666666667, ans=0.125 2023-06-23 13:07:18,941 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.872e+02 2.112e+02 2.327e+02 3.463e+02, threshold=4.224e+02, percent-clipped=0.0 2023-06-23 13:07:26,783 INFO [train.py:1008] (2/4) Epoch 20, batch 200, loss[loss=0.2435, simple_loss=0.2988, pruned_loss=0.09413, over 20603.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3097, pruned_loss=0.09101, over 2390910.85 frames. ], batch size: 189, lr: 1.55e-02, grad_scale: 64.0 2023-06-23 13:08:04,363 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.11 vs. 
limit=12.0 2023-06-23 13:08:29,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=68966.66666666667, ans=0.1 2023-06-23 13:08:29,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=68966.66666666667, ans=0.1 2023-06-23 13:08:36,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=69033.33333333333, ans=0.0 2023-06-23 13:08:43,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=69033.33333333333, ans=15.0 2023-06-23 13:08:48,885 INFO [train.py:1008] (2/4) Epoch 20, batch 250, loss[loss=0.2222, simple_loss=0.2975, pruned_loss=0.07348, over 19107.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3096, pruned_loss=0.09015, over 2708232.68 frames. ], batch size: 89, lr: 1.54e-02, grad_scale: 32.0 2023-06-23 13:09:06,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=69166.66666666667, ans=0.0 2023-06-23 13:09:23,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=69233.33333333333, ans=0.1 2023-06-23 13:09:34,339 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=512, metric=20.79 vs. limit=22.5 2023-06-23 13:09:35,989 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.48 vs. limit=15.0 2023-06-23 13:09:37,728 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.90 vs. limit=15.0 2023-06-23 13:09:42,520 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:09:49,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=69300.0, ans=0.0 2023-06-23 13:10:05,267 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 2.000e+02 2.200e+02 2.591e+02 4.331e+02, threshold=4.400e+02, percent-clipped=1.0 2023-06-23 13:10:12,161 INFO [train.py:1008] (2/4) Epoch 20, batch 300, loss[loss=0.2313, simple_loss=0.2991, pruned_loss=0.08178, over 18930.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3092, pruned_loss=0.09006, over 2956401.98 frames. ], batch size: 86, lr: 1.54e-02, grad_scale: 32.0 2023-06-23 13:10:25,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=69433.33333333333, ans=0.125 2023-06-23 13:10:54,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=69566.66666666667, ans=0.125 2023-06-23 13:11:34,762 INFO [train.py:1008] (2/4) Epoch 20, batch 350, loss[loss=0.2476, simple_loss=0.3178, pruned_loss=0.08871, over 10599.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3082, pruned_loss=0.09029, over 3118492.69 frames. 
], batch size: 30, lr: 1.54e-02, grad_scale: 32.0 2023-06-23 13:11:56,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=69833.33333333333, ans=0.0 2023-06-23 13:12:01,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=69833.33333333333, ans=0.0 2023-06-23 13:12:26,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=69966.66666666667, ans=0.125 2023-06-23 13:12:40,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=70033.33333333333, ans=0.125 2023-06-23 13:12:50,074 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.533e+02 1.960e+02 2.265e+02 2.644e+02 4.141e+02, threshold=4.531e+02, percent-clipped=0.0 2023-06-23 13:12:56,229 INFO [train.py:1008] (2/4) Epoch 20, batch 400, loss[loss=0.2554, simple_loss=0.3216, pruned_loss=0.0946, over 18911.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.308, pruned_loss=0.09023, over 3260785.09 frames. ], batch size: 86, lr: 1.54e-02, grad_scale: 32.0 2023-06-23 13:13:14,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=70166.66666666667, ans=0.125 2023-06-23 13:13:39,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=70233.33333333333, ans=0.125 2023-06-23 13:13:44,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=70300.0, ans=0.125 2023-06-23 13:13:44,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=70300.0, ans=0.125 2023-06-23 13:13:44,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=70300.0, ans=0.2 2023-06-23 13:14:15,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=70366.66666666667, ans=0.125 2023-06-23 13:14:18,937 INFO [train.py:1008] (2/4) Epoch 20, batch 450, loss[loss=0.2687, simple_loss=0.3416, pruned_loss=0.09786, over 16245.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3086, pruned_loss=0.09108, over 3371300.88 frames. ], batch size: 52, lr: 1.54e-02, grad_scale: 32.0 2023-06-23 13:14:19,949 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.27 vs. 
limit=22.5 2023-06-23 13:14:45,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=70500.0, ans=10.0 2023-06-23 13:14:57,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70566.66666666667, ans=0.1 2023-06-23 13:15:10,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70633.33333333333, ans=0.1 2023-06-23 13:15:30,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=70700.0, ans=0.125 2023-06-23 13:15:32,554 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.641e+02 1.927e+02 2.175e+02 2.531e+02 3.833e+02, threshold=4.351e+02, percent-clipped=0.0 2023-06-23 13:15:38,741 INFO [train.py:1008] (2/4) Epoch 20, batch 500, loss[loss=0.2356, simple_loss=0.3151, pruned_loss=0.07801, over 17648.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3088, pruned_loss=0.09074, over 3464451.10 frames. ], batch size: 67, lr: 1.53e-02, grad_scale: 32.0 2023-06-23 13:15:48,976 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0 2023-06-23 13:16:21,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.12 vs. limit=15.0 2023-06-23 13:16:49,190 INFO [train.py:1008] (2/4) Epoch 21, batch 0, loss[loss=0.2312, simple_loss=0.2976, pruned_loss=0.08238, over 18786.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.2976, pruned_loss=0.08238, over 18786.00 frames. ], batch size: 83, lr: 1.49e-02, grad_scale: 32.0 2023-06-23 13:16:49,190 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 13:16:54,882 INFO [train.py:1040] (2/4) Epoch 21, validation: loss=0.2034, simple_loss=0.3003, pruned_loss=0.05328, over 143649.00 frames. 2023-06-23 13:16:54,884 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 13:17:14,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=71046.66666666667, ans=0.125 2023-06-23 13:17:52,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=71180.0, ans=0.125 2023-06-23 13:18:04,090 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:18:07,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=71246.66666666667, ans=0.1 2023-06-23 13:18:07,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=71246.66666666667, ans=0.1 2023-06-23 13:18:16,793 INFO [train.py:1008] (2/4) Epoch 21, batch 50, loss[loss=0.2368, simple_loss=0.3065, pruned_loss=0.08357, over 19544.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3079, pruned_loss=0.09252, over 862867.06 frames. 
], batch size: 102, lr: 1.49e-02, grad_scale: 32.0 2023-06-23 13:18:24,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=71313.33333333333, ans=0.0 2023-06-23 13:18:40,342 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 1.956e+02 2.297e+02 2.608e+02 3.935e+02, threshold=4.595e+02, percent-clipped=0.0 2023-06-23 13:19:05,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=71513.33333333333, ans=0.125 2023-06-23 13:19:38,756 INFO [train.py:1008] (2/4) Epoch 21, batch 100, loss[loss=0.2443, simple_loss=0.2983, pruned_loss=0.09512, over 20736.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3065, pruned_loss=0.09087, over 1518654.41 frames. ], batch size: 211, lr: 1.49e-02, grad_scale: 32.0 2023-06-23 13:20:33,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=71846.66666666667, ans=0.125 2023-06-23 13:20:36,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71846.66666666667, ans=0.1 2023-06-23 13:20:45,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.60 vs. limit=15.0 2023-06-23 13:21:01,237 INFO [train.py:1008] (2/4) Epoch 21, batch 150, loss[loss=0.2345, simple_loss=0.3087, pruned_loss=0.08019, over 19831.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3066, pruned_loss=0.08926, over 2016261.25 frames. ], batch size: 115, lr: 1.49e-02, grad_scale: 32.0 2023-06-23 13:21:25,287 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.847e+02 2.064e+02 2.318e+02 4.387e+02, threshold=4.128e+02, percent-clipped=0.0 2023-06-23 13:21:34,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=72113.33333333333, ans=0.125 2023-06-23 13:21:36,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=72113.33333333333, ans=0.0 2023-06-23 13:22:00,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=72180.0, ans=0.125 2023-06-23 13:22:05,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=72180.0, ans=0.0 2023-06-23 13:22:24,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=72313.33333333333, ans=0.0 2023-06-23 13:22:25,791 INFO [train.py:1008] (2/4) Epoch 21, batch 200, loss[loss=0.2307, simple_loss=0.2976, pruned_loss=0.08189, over 19081.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3069, pruned_loss=0.08936, over 2403327.16 frames. ], batch size: 89, lr: 1.49e-02, grad_scale: 32.0 2023-06-23 13:23:39,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=72580.0, ans=0.05 2023-06-23 13:23:41,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=72580.0, ans=0.1 2023-06-23 13:23:47,619 INFO [train.py:1008] (2/4) Epoch 21, batch 250, loss[loss=0.2562, simple_loss=0.3184, pruned_loss=0.09702, over 19120.00 frames. 
], tot_loss[loss=0.2425, simple_loss=0.3074, pruned_loss=0.08886, over 2682544.03 frames. ], batch size: 94, lr: 1.48e-02, grad_scale: 32.0 2023-06-23 13:23:54,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=72646.66666666667, ans=0.0 2023-06-23 13:23:57,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=72646.66666666667, ans=0.125 2023-06-23 13:24:05,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=72713.33333333333, ans=0.1 2023-06-23 13:24:10,333 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.903e+02 2.120e+02 2.539e+02 3.451e+02, threshold=4.241e+02, percent-clipped=0.0 2023-06-23 13:24:17,598 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:24:23,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=72780.0, ans=0.0 2023-06-23 13:24:58,483 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-06-23 13:25:10,682 INFO [train.py:1008] (2/4) Epoch 21, batch 300, loss[loss=0.2691, simple_loss=0.2906, pruned_loss=0.1238, over 17047.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3074, pruned_loss=0.08837, over 2918827.20 frames. ], batch size: 392, lr: 1.48e-02, grad_scale: 32.0 2023-06-23 13:25:41,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73046.66666666667, ans=0.1 2023-06-23 13:25:49,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=73113.33333333333, ans=0.1 2023-06-23 13:26:02,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73180.0, ans=0.1 2023-06-23 13:26:03,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73180.0, ans=0.1 2023-06-23 13:26:22,111 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:26:23,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=73246.66666666667, ans=0.035 2023-06-23 13:26:32,893 INFO [train.py:1008] (2/4) Epoch 21, batch 350, loss[loss=0.2387, simple_loss=0.3075, pruned_loss=0.085, over 19784.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3085, pruned_loss=0.08866, over 3101415.50 frames. 
], batch size: 115, lr: 1.48e-02, grad_scale: 32.0 2023-06-23 13:26:33,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=73313.33333333333, ans=0.125 2023-06-23 13:26:35,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=73313.33333333333, ans=0.125 2023-06-23 13:26:55,869 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.792e+02 1.983e+02 2.184e+02 3.913e+02, threshold=3.966e+02, percent-clipped=0.0 2023-06-23 13:27:29,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=73513.33333333333, ans=0.125 2023-06-23 13:27:31,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=73513.33333333333, ans=0.125 2023-06-23 13:27:33,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=73513.33333333333, ans=0.125 2023-06-23 13:27:55,446 INFO [train.py:1008] (2/4) Epoch 21, batch 400, loss[loss=0.2486, simple_loss=0.3088, pruned_loss=0.09418, over 19975.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3074, pruned_loss=0.08811, over 3264611.22 frames. ], batch size: 126, lr: 1.48e-02, grad_scale: 32.0 2023-06-23 13:27:55,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=73646.66666666667, ans=0.2 2023-06-23 13:28:05,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=73646.66666666667, ans=0.2 2023-06-23 13:28:09,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=73646.66666666667, ans=10.0 2023-06-23 13:28:25,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=73713.33333333333, ans=0.0 2023-06-23 13:28:28,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.92 vs. limit=15.0 2023-06-23 13:29:09,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=73913.33333333333, ans=0.0 2023-06-23 13:29:18,727 INFO [train.py:1008] (2/4) Epoch 21, batch 450, loss[loss=0.2504, simple_loss=0.3308, pruned_loss=0.085, over 17627.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3077, pruned_loss=0.08804, over 3372049.52 frames. 
], batch size: 67, lr: 1.47e-02, grad_scale: 32.0 2023-06-23 13:29:34,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=74046.66666666667, ans=0.125 2023-06-23 13:29:41,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=74046.66666666667, ans=0.2 2023-06-23 13:29:42,752 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.482e+02 1.982e+02 2.281e+02 2.665e+02 3.713e+02, threshold=4.563e+02, percent-clipped=0.0 2023-06-23 13:29:43,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=74046.66666666667, ans=0.0 2023-06-23 13:30:03,055 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=4.09 vs. limit=12.0 2023-06-23 13:30:23,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=74246.66666666667, ans=0.125 2023-06-23 13:30:40,120 INFO [train.py:1008] (2/4) Epoch 21, batch 500, loss[loss=0.2528, simple_loss=0.3281, pruned_loss=0.08876, over 16318.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3079, pruned_loss=0.08845, over 3451973.01 frames. ], batch size: 52, lr: 1.47e-02, grad_scale: 32.0 2023-06-23 13:30:58,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=74380.0, ans=0.0 2023-06-23 13:31:27,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=74513.33333333333, ans=0.0 2023-06-23 13:31:51,770 INFO [train.py:1008] (2/4) Epoch 22, batch 0, loss[loss=0.2435, simple_loss=0.311, pruned_loss=0.08805, over 18627.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.311, pruned_loss=0.08805, over 18627.00 frames. ], batch size: 80, lr: 1.44e-02, grad_scale: 32.0 2023-06-23 13:31:51,771 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 13:31:57,325 INFO [train.py:1040] (2/4) Epoch 22, validation: loss=0.2, simple_loss=0.2979, pruned_loss=0.05103, over 143649.00 frames. 2023-06-23 13:31:57,326 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 13:32:02,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=74533.33333333333, ans=0.125 2023-06-23 13:32:07,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=74533.33333333333, ans=0.2 2023-06-23 13:32:32,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=74666.66666666667, ans=0.125 2023-06-23 13:32:34,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=74666.66666666667, ans=0.125 2023-06-23 13:32:48,443 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.573e+02 1.893e+02 2.081e+02 2.479e+02 4.096e+02, threshold=4.162e+02, percent-clipped=0.0 2023-06-23 13:32:51,134 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.449e-01 2023-06-23 13:32:53,055 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.38 vs. 
limit=15.0 2023-06-23 13:32:54,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=74733.33333333333, ans=0.2 2023-06-23 13:33:12,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=74800.0, ans=0.2 2023-06-23 13:33:17,268 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:33:20,235 INFO [train.py:1008] (2/4) Epoch 22, batch 50, loss[loss=0.2263, simple_loss=0.2962, pruned_loss=0.07824, over 18811.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3021, pruned_loss=0.08691, over 865769.32 frames. ], batch size: 83, lr: 1.43e-02, grad_scale: 32.0 2023-06-23 13:33:37,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=74933.33333333333, ans=0.035 2023-06-23 13:34:17,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=75066.66666666667, ans=0.0 2023-06-23 13:34:42,059 INFO [train.py:1008] (2/4) Epoch 22, batch 100, loss[loss=0.2278, simple_loss=0.2903, pruned_loss=0.08267, over 20527.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3025, pruned_loss=0.08759, over 1509941.96 frames. ], batch size: 189, lr: 1.43e-02, grad_scale: 32.0 2023-06-23 13:35:01,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=75266.66666666667, ans=0.0 2023-06-23 13:35:16,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=75333.33333333333, ans=10.0 2023-06-23 13:35:24,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=75333.33333333333, ans=0.025 2023-06-23 13:35:28,258 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.93 vs. limit=6.0 2023-06-23 13:35:32,042 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.896e+02 2.104e+02 2.417e+02 4.108e+02, threshold=4.207e+02, percent-clipped=0.0 2023-06-23 13:35:51,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=75466.66666666667, ans=0.125 2023-06-23 13:35:59,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=75466.66666666667, ans=0.125 2023-06-23 13:36:04,950 INFO [train.py:1008] (2/4) Epoch 22, batch 150, loss[loss=0.2374, simple_loss=0.3161, pruned_loss=0.07934, over 16389.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3027, pruned_loss=0.08617, over 2021827.71 frames. ], batch size: 52, lr: 1.43e-02, grad_scale: 32.0 2023-06-23 13:36:14,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=75533.33333333333, ans=0.07 2023-06-23 13:36:53,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=75733.33333333333, ans=0.1 2023-06-23 13:36:57,095 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.25 vs. 
limit=22.5 2023-06-23 13:37:14,260 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.41 vs. limit=15.0 2023-06-23 13:37:27,971 INFO [train.py:1008] (2/4) Epoch 22, batch 200, loss[loss=0.2232, simple_loss=0.2983, pruned_loss=0.07408, over 19504.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3042, pruned_loss=0.08664, over 2393159.26 frames. ], batch size: 105, lr: 1.43e-02, grad_scale: 32.0 2023-06-23 13:37:34,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=75866.66666666667, ans=0.2 2023-06-23 13:37:56,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=75933.33333333333, ans=0.125 2023-06-23 13:38:15,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=76000.0, ans=0.125 2023-06-23 13:38:21,021 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.520e+02 1.823e+02 2.064e+02 2.218e+02 4.834e+02, threshold=4.128e+02, percent-clipped=1.0 2023-06-23 13:38:38,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=76133.33333333333, ans=0.0 2023-06-23 13:38:48,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=76133.33333333333, ans=0.0 2023-06-23 13:38:51,860 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-23 13:38:52,336 INFO [train.py:1008] (2/4) Epoch 22, batch 250, loss[loss=0.2638, simple_loss=0.3364, pruned_loss=0.09561, over 16764.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3044, pruned_loss=0.08688, over 2697224.50 frames. ], batch size: 59, lr: 1.43e-02, grad_scale: 32.0 2023-06-23 13:38:54,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=76200.0, ans=0.0 2023-06-23 13:39:15,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=76266.66666666667, ans=0.1 2023-06-23 13:39:16,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=76266.66666666667, ans=0.125 2023-06-23 13:39:23,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76266.66666666667, ans=0.1 2023-06-23 13:39:40,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=76333.33333333333, ans=0.0 2023-06-23 13:40:08,814 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.48 vs. limit=15.0 2023-06-23 13:40:16,053 INFO [train.py:1008] (2/4) Epoch 22, batch 300, loss[loss=0.2476, simple_loss=0.3128, pruned_loss=0.09122, over 18293.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.304, pruned_loss=0.08719, over 2949029.38 frames. ], batch size: 74, lr: 1.42e-02, grad_scale: 32.0 2023-06-23 13:40:24,379 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=9.54 vs. 
limit=15.0 2023-06-23 13:40:30,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=76533.33333333333, ans=0.0 2023-06-23 13:40:38,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=76600.0, ans=0.125 2023-06-23 13:40:41,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=76600.0, ans=0.0 2023-06-23 13:40:42,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=76600.0, ans=0.0 2023-06-23 13:41:00,534 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=4.16 vs. limit=15.0 2023-06-23 13:41:07,033 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.581e+02 1.877e+02 2.058e+02 2.358e+02 3.896e+02, threshold=4.117e+02, percent-clipped=0.0 2023-06-23 13:41:10,037 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=15.0 2023-06-23 13:41:36,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=76800.0, ans=0.125 2023-06-23 13:41:39,119 INFO [train.py:1008] (2/4) Epoch 22, batch 350, loss[loss=0.2199, simple_loss=0.2875, pruned_loss=0.07619, over 20008.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3037, pruned_loss=0.0867, over 3130949.76 frames. ], batch size: 126, lr: 1.42e-02, grad_scale: 32.0 2023-06-23 13:41:50,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=76866.66666666667, ans=0.09899494936611666 2023-06-23 13:41:54,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=76933.33333333333, ans=0.0 2023-06-23 13:42:47,093 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.25 vs. limit=10.0 2023-06-23 13:43:01,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=77200.0, ans=0.0 2023-06-23 13:43:02,750 INFO [train.py:1008] (2/4) Epoch 22, batch 400, loss[loss=0.2157, simple_loss=0.2844, pruned_loss=0.0735, over 19465.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.303, pruned_loss=0.08611, over 3273585.22 frames. ], batch size: 105, lr: 1.42e-02, grad_scale: 32.0 2023-06-23 13:43:20,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=77266.66666666667, ans=0.125 2023-06-23 13:43:21,598 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.19 vs. 
limit=22.5 2023-06-23 13:43:43,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=77333.33333333333, ans=0.0 2023-06-23 13:43:54,275 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 1.898e+02 2.176e+02 2.564e+02 3.831e+02, threshold=4.352e+02, percent-clipped=0.0 2023-06-23 13:44:01,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=77400.0, ans=0.2 2023-06-23 13:44:20,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=77466.66666666667, ans=0.0 2023-06-23 13:44:26,719 INFO [train.py:1008] (2/4) Epoch 22, batch 450, loss[loss=0.2381, simple_loss=0.3162, pruned_loss=0.08, over 16309.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3042, pruned_loss=0.08671, over 3383169.52 frames. ], batch size: 52, lr: 1.42e-02, grad_scale: 32.0 2023-06-23 13:44:39,820 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:45:05,567 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.05 vs. limit=15.0 2023-06-23 13:45:35,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=77800.0, ans=0.0 2023-06-23 13:45:47,960 INFO [train.py:1008] (2/4) Epoch 22, batch 500, loss[loss=0.232, simple_loss=0.2927, pruned_loss=0.08559, over 20575.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3037, pruned_loss=0.08663, over 3480574.46 frames. ], batch size: 189, lr: 1.42e-02, grad_scale: 32.0 2023-06-23 13:46:27,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=78000.0, ans=0.04949747468305833 2023-06-23 13:46:31,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=78000.0, ans=0.0 2023-06-23 13:46:36,457 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.596e+02 1.861e+02 2.045e+02 2.383e+02 4.014e+02, threshold=4.089e+02, percent-clipped=0.0 2023-06-23 13:47:00,144 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.15 vs. limit=15.0 2023-06-23 13:47:00,585 INFO [train.py:1008] (2/4) Epoch 23, batch 0, loss[loss=0.2248, simple_loss=0.2901, pruned_loss=0.07972, over 20336.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2901, pruned_loss=0.07972, over 20336.00 frames. ], batch size: 149, lr: 1.38e-02, grad_scale: 32.0 2023-06-23 13:47:00,585 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 13:47:06,607 INFO [train.py:1040] (2/4) Epoch 23, validation: loss=0.1991, simple_loss=0.2976, pruned_loss=0.05029, over 143649.00 frames. 2023-06-23 13:47:06,607 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 13:47:10,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.74 vs. limit=15.0 2023-06-23 13:47:44,671 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.42 vs. 
limit=22.5 2023-06-23 13:48:03,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=78280.0, ans=0.125 2023-06-23 13:48:03,645 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.52 vs. limit=15.0 2023-06-23 13:48:28,462 INFO [train.py:1008] (2/4) Epoch 23, batch 50, loss[loss=0.2376, simple_loss=0.3088, pruned_loss=0.08322, over 18768.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3025, pruned_loss=0.0862, over 855489.08 frames. ], batch size: 83, lr: 1.38e-02, grad_scale: 32.0 2023-06-23 13:48:34,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=78413.33333333333, ans=0.1 2023-06-23 13:48:44,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=78480.0, ans=0.1 2023-06-23 13:48:47,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=78480.0, ans=0.2 2023-06-23 13:48:56,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=78480.0, ans=0.0 2023-06-23 13:49:41,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=78680.0, ans=0.1 2023-06-23 13:49:50,568 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.849e+02 2.027e+02 2.431e+02 4.295e+02, threshold=4.054e+02, percent-clipped=1.0 2023-06-23 13:49:51,575 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.69 vs. limit=15.0 2023-06-23 13:49:52,203 INFO [train.py:1008] (2/4) Epoch 23, batch 100, loss[loss=0.237, simple_loss=0.2997, pruned_loss=0.08715, over 20473.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3027, pruned_loss=0.08453, over 1496320.23 frames. ], batch size: 160, lr: 1.38e-02, grad_scale: 32.0 2023-06-23 13:49:59,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=78746.66666666667, ans=0.025 2023-06-23 13:50:10,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff2_skip_rate, batch_count=78813.33333333333, ans=0.0 2023-06-23 13:50:11,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=78813.33333333333, ans=10.0 2023-06-23 13:51:04,013 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.39 vs. limit=6.0 2023-06-23 13:51:05,675 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.23 vs. limit=15.0 2023-06-23 13:51:14,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=79080.0, ans=0.0 2023-06-23 13:51:15,956 INFO [train.py:1008] (2/4) Epoch 23, batch 150, loss[loss=0.2219, simple_loss=0.294, pruned_loss=0.07486, over 19531.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3014, pruned_loss=0.08368, over 2005761.08 frames. 
], batch size: 102, lr: 1.38e-02, grad_scale: 32.0 2023-06-23 13:51:39,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=79146.66666666667, ans=0.95 2023-06-23 13:52:19,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=79280.0, ans=0.125 2023-06-23 13:52:38,821 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.604e+02 1.946e+02 2.168e+02 2.387e+02 3.422e+02, threshold=4.335e+02, percent-clipped=0.0 2023-06-23 13:52:40,457 INFO [train.py:1008] (2/4) Epoch 23, batch 200, loss[loss=0.2335, simple_loss=0.306, pruned_loss=0.08051, over 19804.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3008, pruned_loss=0.0843, over 2412593.60 frames. ], batch size: 115, lr: 1.37e-02, grad_scale: 32.0 2023-06-23 13:53:16,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=79546.66666666667, ans=0.125 2023-06-23 13:53:29,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=79613.33333333333, ans=0.125 2023-06-23 13:53:30,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=79613.33333333333, ans=0.0 2023-06-23 13:54:00,340 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-06-23 13:54:01,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=79680.0, ans=0.125 2023-06-23 13:54:03,905 INFO [train.py:1008] (2/4) Epoch 23, batch 250, loss[loss=0.2161, simple_loss=0.2914, pruned_loss=0.07034, over 18913.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3014, pruned_loss=0.08442, over 2724280.19 frames. ], batch size: 86, lr: 1.37e-02, grad_scale: 16.0 2023-06-23 13:54:18,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=79813.33333333333, ans=0.125 2023-06-23 13:54:30,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=79813.33333333333, ans=0.2 2023-06-23 13:54:33,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=79813.33333333333, ans=0.04949747468305833 2023-06-23 13:54:54,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=79946.66666666667, ans=0.125 2023-06-23 13:55:03,321 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.56 vs. limit=15.0 2023-06-23 13:55:16,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=80013.33333333333, ans=0.2 2023-06-23 13:55:17,415 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.33 vs. 
limit=5.0 2023-06-23 13:55:29,289 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.558e+02 1.905e+02 2.118e+02 2.432e+02 3.559e+02, threshold=4.237e+02, percent-clipped=0.0 2023-06-23 13:55:29,337 INFO [train.py:1008] (2/4) Epoch 23, batch 300, loss[loss=0.2401, simple_loss=0.2927, pruned_loss=0.09374, over 20592.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3023, pruned_loss=0.08458, over 2941363.77 frames. ], batch size: 173, lr: 1.37e-02, grad_scale: 16.0 2023-06-23 13:56:06,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=80213.33333333333, ans=0.0 2023-06-23 13:56:21,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=80280.0, ans=0.1 2023-06-23 13:56:21,818 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.58 vs. limit=6.0 2023-06-23 13:56:24,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=80280.0, ans=0.1 2023-06-23 13:56:36,857 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:56:42,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=80346.66666666667, ans=0.0 2023-06-23 13:56:53,742 INFO [train.py:1008] (2/4) Epoch 23, batch 350, loss[loss=0.2753, simple_loss=0.2974, pruned_loss=0.1266, over 16959.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3026, pruned_loss=0.08527, over 3127592.10 frames. ], batch size: 392, lr: 1.37e-02, grad_scale: 16.0 2023-06-23 13:56:55,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=80413.33333333333, ans=0.125 2023-06-23 13:57:27,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=80546.66666666667, ans=0.125 2023-06-23 13:57:35,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=80546.66666666667, ans=0.125 2023-06-23 13:57:40,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=80546.66666666667, ans=0.025 2023-06-23 13:57:53,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=80613.33333333333, ans=0.125 2023-06-23 13:57:55,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=80613.33333333333, ans=0.025 2023-06-23 13:57:56,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=80613.33333333333, ans=0.05 2023-06-23 13:58:02,644 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=17.94 vs. 
limit=22.5 2023-06-23 13:58:04,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=80680.0, ans=0.125 2023-06-23 13:58:17,442 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.818e+02 2.008e+02 2.403e+02 5.012e+02, threshold=4.015e+02, percent-clipped=2.0 2023-06-23 13:58:17,499 INFO [train.py:1008] (2/4) Epoch 23, batch 400, loss[loss=0.226, simple_loss=0.3009, pruned_loss=0.07553, over 18323.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3027, pruned_loss=0.08518, over 3272376.16 frames. ], batch size: 74, lr: 1.37e-02, grad_scale: 32.0 2023-06-23 13:58:23,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=80746.66666666667, ans=0.125 2023-06-23 13:58:29,464 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-23 13:58:30,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=80746.66666666667, ans=0.0 2023-06-23 13:58:49,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=80880.0, ans=0.125 2023-06-23 13:58:50,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=80880.0, ans=0.2 2023-06-23 13:59:15,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-06-23 13:59:25,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=81013.33333333333, ans=0.1 2023-06-23 13:59:28,279 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=9.49 vs. limit=15.0 2023-06-23 13:59:39,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=81080.0, ans=0.125 2023-06-23 13:59:40,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=81080.0, ans=0.125 2023-06-23 13:59:41,189 INFO [train.py:1008] (2/4) Epoch 23, batch 450, loss[loss=0.2311, simple_loss=0.3096, pruned_loss=0.07626, over 17644.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3024, pruned_loss=0.08467, over 3382798.16 frames. ], batch size: 67, lr: 1.36e-02, grad_scale: 32.0 2023-06-23 13:59:42,253 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.80 vs. limit=6.0 2023-06-23 14:00:00,258 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.15 vs. limit=10.0 2023-06-23 14:00:12,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81213.33333333333, ans=0.1 2023-06-23 14:00:18,444 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.35 vs. 
limit=15.0 2023-06-23 14:00:26,861 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0 2023-06-23 14:00:35,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=81280.0, ans=0.1 2023-06-23 14:00:35,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=81280.0, ans=0.0 2023-06-23 14:00:55,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=81346.66666666667, ans=0.1 2023-06-23 14:01:02,844 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.592e+02 1.844e+02 2.145e+02 2.439e+02 3.625e+02, threshold=4.291e+02, percent-clipped=0.0 2023-06-23 14:01:02,890 INFO [train.py:1008] (2/4) Epoch 23, batch 500, loss[loss=0.2466, simple_loss=0.2983, pruned_loss=0.09745, over 20154.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3019, pruned_loss=0.08478, over 3476901.88 frames. ], batch size: 239, lr: 1.36e-02, grad_scale: 32.0 2023-06-23 14:01:31,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=81480.0, ans=0.125 2023-06-23 14:01:40,219 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.69 vs. limit=22.5 2023-06-23 14:02:13,112 INFO [train.py:1008] (2/4) Epoch 24, batch 0, loss[loss=0.233, simple_loss=0.2935, pruned_loss=0.08624, over 20070.00 frames. ], tot_loss[loss=0.233, simple_loss=0.2935, pruned_loss=0.08624, over 20070.00 frames. ], batch size: 133, lr: 1.33e-02, grad_scale: 32.0 2023-06-23 14:02:13,113 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 14:02:19,039 INFO [train.py:1040] (2/4) Epoch 24, validation: loss=0.2009, simple_loss=0.2983, pruned_loss=0.05178, over 143649.00 frames. 2023-06-23 14:02:19,040 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 14:02:27,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=81626.66666666667, ans=0.125 2023-06-23 14:02:36,390 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2023-06-23 14:02:48,496 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.49 vs. limit=15.0 2023-06-23 14:02:51,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=81760.0, ans=0.0 2023-06-23 14:03:05,273 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=16.19 vs. limit=22.5 2023-06-23 14:03:21,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=81826.66666666667, ans=0.1 2023-06-23 14:03:41,731 INFO [train.py:1008] (2/4) Epoch 24, batch 50, loss[loss=0.2174, simple_loss=0.2859, pruned_loss=0.07449, over 18614.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3013, pruned_loss=0.08672, over 837439.17 frames. 
], batch size: 80, lr: 1.33e-02, grad_scale: 32.0 2023-06-23 14:03:48,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81960.0, ans=0.1 2023-06-23 14:04:02,236 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=19.92 vs. limit=22.5 2023-06-23 14:04:06,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=82026.66666666667, ans=0.0 2023-06-23 14:04:11,072 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.514e+02 1.845e+02 2.024e+02 2.275e+02 3.151e+02, threshold=4.048e+02, percent-clipped=0.0 2023-06-23 14:04:38,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=82160.0, ans=0.125 2023-06-23 14:04:44,918 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:04:49,274 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.72 vs. limit=15.0 2023-06-23 14:05:04,394 INFO [train.py:1008] (2/4) Epoch 24, batch 100, loss[loss=0.2263, simple_loss=0.292, pruned_loss=0.08031, over 19227.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3019, pruned_loss=0.08534, over 1492475.44 frames. ], batch size: 92, lr: 1.33e-02, grad_scale: 32.0 2023-06-23 14:05:14,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=82293.33333333333, ans=0.125 2023-06-23 14:05:25,306 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.08 vs. limit=15.0 2023-06-23 14:05:32,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=82360.0, ans=0.2 2023-06-23 14:05:51,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=82426.66666666667, ans=0.0 2023-06-23 14:06:27,416 INFO [train.py:1008] (2/4) Epoch 24, batch 150, loss[loss=0.2293, simple_loss=0.2982, pruned_loss=0.08017, over 19481.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3026, pruned_loss=0.08513, over 2015663.73 frames. 
], batch size: 105, lr: 1.33e-02, grad_scale: 32.0 2023-06-23 14:06:57,411 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.611e+02 1.876e+02 2.097e+02 2.518e+02 3.387e+02, threshold=4.194e+02, percent-clipped=0.0 2023-06-23 14:07:22,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=82826.66666666667, ans=0.125 2023-06-23 14:07:39,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=82893.33333333333, ans=0.0 2023-06-23 14:07:40,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=82893.33333333333, ans=0.125 2023-06-23 14:07:42,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=82893.33333333333, ans=0.125 2023-06-23 14:07:43,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=82893.33333333333, ans=0.125 2023-06-23 14:07:50,551 INFO [train.py:1008] (2/4) Epoch 24, batch 200, loss[loss=0.2449, simple_loss=0.3016, pruned_loss=0.09411, over 20583.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3026, pruned_loss=0.08376, over 2390484.73 frames. ], batch size: 173, lr: 1.32e-02, grad_scale: 32.0 2023-06-23 14:07:50,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=82960.0, ans=0.0 2023-06-23 14:07:59,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82960.0, ans=0.1 2023-06-23 14:08:00,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82960.0, ans=0.1 2023-06-23 14:08:01,214 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.43 vs. limit=15.0 2023-06-23 14:08:01,503 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=7.25 vs. limit=15.0 2023-06-23 14:08:05,350 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.71 vs. limit=22.5 2023-06-23 14:08:18,856 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=4.54 vs. limit=12.0 2023-06-23 14:08:45,537 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=19.23 vs. limit=22.5 2023-06-23 14:08:48,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=83160.0, ans=0.125 2023-06-23 14:09:14,275 INFO [train.py:1008] (2/4) Epoch 24, batch 250, loss[loss=0.226, simple_loss=0.2917, pruned_loss=0.08014, over 20091.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3007, pruned_loss=0.0834, over 2709134.20 frames. 
], batch size: 133, lr: 1.32e-02, grad_scale: 32.0 2023-06-23 14:09:22,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=83293.33333333333, ans=0.125 2023-06-23 14:09:28,805 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=16.93 vs. limit=22.5 2023-06-23 14:09:41,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=83360.0, ans=0.125 2023-06-23 14:09:45,995 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.854e+02 2.071e+02 2.345e+02 4.579e+02, threshold=4.143e+02, percent-clipped=1.0 2023-06-23 14:09:54,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=83426.66666666667, ans=0.0 2023-06-23 14:10:38,026 INFO [train.py:1008] (2/4) Epoch 24, batch 300, loss[loss=0.228, simple_loss=0.2974, pruned_loss=0.07931, over 19211.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3011, pruned_loss=0.08353, over 2931805.57 frames. ], batch size: 92, lr: 1.32e-02, grad_scale: 16.0 2023-06-23 14:11:06,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=83693.33333333333, ans=0.0 2023-06-23 14:11:08,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=83693.33333333333, ans=0.09899494936611666 2023-06-23 14:11:11,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=83760.0, ans=15.0 2023-06-23 14:11:30,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=83826.66666666667, ans=0.025 2023-06-23 14:11:37,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=83826.66666666667, ans=0.125 2023-06-23 14:11:42,367 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.07 vs. limit=15.0 2023-06-23 14:11:46,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=83893.33333333333, ans=0.125 2023-06-23 14:11:48,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=83893.33333333333, ans=0.0 2023-06-23 14:11:59,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=83893.33333333333, ans=0.2 2023-06-23 14:12:02,739 INFO [train.py:1008] (2/4) Epoch 24, batch 350, loss[loss=0.2297, simple_loss=0.2967, pruned_loss=0.0814, over 20471.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3006, pruned_loss=0.08303, over 3132649.55 frames. ], batch size: 160, lr: 1.32e-02, grad_scale: 16.0 2023-06-23 14:12:25,903 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.33 vs. 
limit=15.0 2023-06-23 14:12:35,430 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.880e+02 2.120e+02 2.567e+02 4.330e+02, threshold=4.240e+02, percent-clipped=2.0 2023-06-23 14:13:00,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=84160.0, ans=0.0 2023-06-23 14:13:10,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=84226.66666666667, ans=0.0 2023-06-23 14:13:26,826 INFO [train.py:1008] (2/4) Epoch 24, batch 400, loss[loss=0.2465, simple_loss=0.3198, pruned_loss=0.08657, over 15474.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3001, pruned_loss=0.08256, over 3282122.24 frames. ], batch size: 44, lr: 1.32e-02, grad_scale: 32.0 2023-06-23 14:14:25,884 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:14:39,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=84560.0, ans=0.2 2023-06-23 14:14:50,368 INFO [train.py:1008] (2/4) Epoch 24, batch 450, loss[loss=0.2526, simple_loss=0.3311, pruned_loss=0.08706, over 17606.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.2998, pruned_loss=0.08228, over 3382974.12 frames. ], batch size: 67, lr: 1.31e-02, grad_scale: 32.0 2023-06-23 14:14:50,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=84626.66666666667, ans=0.125 2023-06-23 14:15:22,868 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.867e+02 2.068e+02 2.284e+02 2.784e+02, threshold=4.137e+02, percent-clipped=0.0 2023-06-23 14:15:40,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84826.66666666667, ans=0.1 2023-06-23 14:15:42,906 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.08 vs. limit=15.0 2023-06-23 14:16:12,063 INFO [train.py:1008] (2/4) Epoch 24, batch 500, loss[loss=0.233, simple_loss=0.2887, pruned_loss=0.08869, over 20546.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.2995, pruned_loss=0.08165, over 3467568.64 frames. ], batch size: 189, lr: 1.31e-02, grad_scale: 32.0 2023-06-23 14:16:30,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=85026.66666666667, ans=0.0 2023-06-23 14:16:55,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=85093.33333333333, ans=0.125 2023-06-23 14:17:24,934 INFO [train.py:1008] (2/4) Epoch 25, batch 0, loss[loss=0.2459, simple_loss=0.3229, pruned_loss=0.08446, over 17638.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3229, pruned_loss=0.08446, over 17638.00 frames. ], batch size: 67, lr: 1.29e-02, grad_scale: 32.0 2023-06-23 14:17:24,934 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 14:17:30,730 INFO [train.py:1040] (2/4) Epoch 25, validation: loss=0.2009, simple_loss=0.2979, pruned_loss=0.05193, over 143649.00 frames. 
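Note on the loss[...] / tot_loss[...] entries around here: each batch reports its own loss together with a running figure "over N frames", where N accumulates through the epoch (compare "loss=0.2401 ... over 20592.00 frames" at batch 300 with "tot_loss ... over 2941363.77 frames"). A minimal sketch of that frame-weighted bookkeeping follows; RunningLoss and its field names are illustrative assumptions, not the actual icefall tracker.

# Hypothetical helper mirroring the "tot_loss[... over N frames]" figures in
# this log: per-batch losses are averaged per frame, so the running value is a
# frame-weighted average. Sketch only, not the icefall implementation.
class RunningLoss:
    def __init__(self):
        self.frames = 0.0
        self.sums = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0}

    def update(self, batch_frames: float, **losses: float) -> None:
        # Weight each per-batch average by its frame count before accumulating.
        self.frames += batch_frames
        for name, value in losses.items():
            self.sums[name] += value * batch_frames

    def averages(self) -> dict:
        return {name: s / self.frames for name, s in self.sums.items()}

# Example with two batch-level entries taken from this log section:
tracker = RunningLoss()
tracker.update(20592.0, loss=0.2401, simple_loss=0.2927, pruned_loss=0.09374)
tracker.update(18323.0, loss=0.2260, simple_loss=0.3009, pruned_loss=0.07553)
print(tracker.averages())  # running averages "over" 38915 frames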
2023-06-23 14:17:30,731 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 14:18:32,978 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.781e+02 2.037e+02 2.242e+02 3.347e+02, threshold=4.075e+02, percent-clipped=0.0 2023-06-23 14:18:54,957 INFO [train.py:1008] (2/4) Epoch 25, batch 50, loss[loss=0.2293, simple_loss=0.3052, pruned_loss=0.07667, over 18928.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.2993, pruned_loss=0.08197, over 868320.42 frames. ], batch size: 86, lr: 1.28e-02, grad_scale: 32.0 2023-06-23 14:19:12,953 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-06-23 14:19:36,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=85640.0, ans=0.2 2023-06-23 14:20:16,945 INFO [train.py:1008] (2/4) Epoch 25, batch 100, loss[loss=0.2372, simple_loss=0.294, pruned_loss=0.09023, over 20202.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.2997, pruned_loss=0.08206, over 1519155.25 frames. ], batch size: 239, lr: 1.28e-02, grad_scale: 32.0 2023-06-23 14:21:05,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=85973.33333333333, ans=15.0 2023-06-23 14:21:18,851 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.390e+02 1.789e+02 1.946e+02 2.136e+02 3.859e+02, threshold=3.893e+02, percent-clipped=0.0 2023-06-23 14:21:23,688 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=15.0 2023-06-23 14:21:40,857 INFO [train.py:1008] (2/4) Epoch 25, batch 150, loss[loss=0.2378, simple_loss=0.3141, pruned_loss=0.08072, over 17003.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.2987, pruned_loss=0.08169, over 2012830.65 frames. ], batch size: 60, lr: 1.28e-02, grad_scale: 32.0 2023-06-23 14:22:11,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=86240.0, ans=0.125 2023-06-23 14:22:17,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=86306.66666666667, ans=0.125 2023-06-23 14:22:35,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=86373.33333333333, ans=0.0 2023-06-23 14:22:47,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=86440.0, ans=0.0 2023-06-23 14:22:55,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=86440.0, ans=0.0 2023-06-23 14:23:04,974 INFO [train.py:1008] (2/4) Epoch 25, batch 200, loss[loss=0.2351, simple_loss=0.3168, pruned_loss=0.07676, over 18333.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.2981, pruned_loss=0.08119, over 2412746.70 frames. 
], batch size: 72, lr: 1.28e-02, grad_scale: 32.0 2023-06-23 14:23:43,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=86640.0, ans=0.0 2023-06-23 14:23:45,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=86640.0, ans=0.125 2023-06-23 14:23:53,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=86706.66666666667, ans=0.0 2023-06-23 14:23:58,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=86706.66666666667, ans=0.02 2023-06-23 14:24:01,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass_mid.scale_min, batch_count=86706.66666666667, ans=0.2 2023-06-23 14:24:05,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=86706.66666666667, ans=0.125 2023-06-23 14:24:06,850 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.790e+02 2.060e+02 2.451e+02 3.773e+02, threshold=4.120e+02, percent-clipped=0.0 2023-06-23 14:24:09,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=86706.66666666667, ans=0.2 2023-06-23 14:24:16,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=86773.33333333333, ans=0.125 2023-06-23 14:24:29,161 INFO [train.py:1008] (2/4) Epoch 25, batch 250, loss[loss=0.2456, simple_loss=0.3014, pruned_loss=0.09485, over 20361.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.2979, pruned_loss=0.08136, over 2709536.97 frames. ], batch size: 149, lr: 1.28e-02, grad_scale: 32.0 2023-06-23 14:24:29,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=86840.0, ans=0.125 2023-06-23 14:24:46,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=86906.66666666667, ans=0.2 2023-06-23 14:25:01,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=86973.33333333333, ans=0.5 2023-06-23 14:25:14,381 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.55 vs. limit=10.0 2023-06-23 14:25:20,544 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.71 vs. limit=22.5 2023-06-23 14:25:35,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=87106.66666666667, ans=0.2 2023-06-23 14:25:38,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=87106.66666666667, ans=0.2 2023-06-23 14:25:42,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=87106.66666666667, ans=0.125 2023-06-23 14:25:53,867 INFO [train.py:1008] (2/4) Epoch 25, batch 300, loss[loss=0.2411, simple_loss=0.3099, pruned_loss=0.08614, over 19128.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.2977, pruned_loss=0.08188, over 2964584.31 frames. 
], batch size: 94, lr: 1.27e-02, grad_scale: 32.0 2023-06-23 14:25:57,794 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=20.46 vs. limit=15.0 2023-06-23 14:26:08,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=87240.0, ans=0.0 2023-06-23 14:26:25,409 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=512, metric=21.99 vs. limit=22.5 2023-06-23 14:26:55,932 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.827e+02 2.003e+02 2.228e+02 3.545e+02, threshold=4.005e+02, percent-clipped=0.0 2023-06-23 14:26:56,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=87373.33333333333, ans=0.05 2023-06-23 14:27:09,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=87440.0, ans=0.0 2023-06-23 14:27:17,116 INFO [train.py:1008] (2/4) Epoch 25, batch 350, loss[loss=0.209, simple_loss=0.2978, pruned_loss=0.06013, over 16746.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.2974, pruned_loss=0.08195, over 3128335.17 frames. ], batch size: 59, lr: 1.27e-02, grad_scale: 32.0 2023-06-23 14:27:50,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=87640.0, ans=0.1 2023-06-23 14:28:39,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=87773.33333333333, ans=0.04949747468305833 2023-06-23 14:28:41,717 INFO [train.py:1008] (2/4) Epoch 25, batch 400, loss[loss=0.2442, simple_loss=0.3213, pruned_loss=0.0836, over 16383.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.2985, pruned_loss=0.08199, over 3264771.68 frames. 
], batch size: 52, lr: 1.27e-02, grad_scale: 32.0 2023-06-23 14:28:55,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=87840.0, ans=0.125 2023-06-23 14:29:26,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=87973.33333333333, ans=0.125 2023-06-23 14:29:33,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=88040.0, ans=0.0 2023-06-23 14:29:33,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=88040.0, ans=0.125 2023-06-23 14:29:38,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88040.0, ans=0.1 2023-06-23 14:29:43,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=88040.0, ans=0.125 2023-06-23 14:29:44,288 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 1.862e+02 1.997e+02 2.222e+02 3.084e+02, threshold=3.994e+02, percent-clipped=0.0 2023-06-23 14:29:46,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88040.0, ans=0.1 2023-06-23 14:30:01,186 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.99 vs. limit=10.0 2023-06-23 14:30:06,725 INFO [train.py:1008] (2/4) Epoch 25, batch 450, loss[loss=0.2572, simple_loss=0.2829, pruned_loss=0.1157, over 16877.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.2979, pruned_loss=0.08178, over 3387692.86 frames. ], batch size: 391, lr: 1.27e-02, grad_scale: 32.0 2023-06-23 14:30:17,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=88173.33333333333, ans=10.0 2023-06-23 14:30:24,559 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.48 vs. limit=5.0 2023-06-23 14:31:04,784 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=18.32 vs. limit=22.5 2023-06-23 14:31:08,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=88373.33333333333, ans=0.015 2023-06-23 14:31:13,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=88440.0, ans=0.125 2023-06-23 14:31:27,987 INFO [train.py:1008] (2/4) Epoch 25, batch 500, loss[loss=0.2108, simple_loss=0.289, pruned_loss=0.06627, over 18794.00 frames. ], tot_loss[loss=0.23, simple_loss=0.2974, pruned_loss=0.08127, over 3475069.48 frames. ], batch size: 83, lr: 1.27e-02, grad_scale: 32.0 2023-06-23 14:31:29,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=88506.66666666667, ans=0.125 2023-06-23 14:31:34,805 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. 
limit=15.0 2023-06-23 14:31:48,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88573.33333333333, ans=0.1 2023-06-23 14:32:12,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=88706.66666666667, ans=0.2 2023-06-23 14:32:38,447 INFO [train.py:1008] (2/4) Epoch 26, batch 0, loss[loss=0.2258, simple_loss=0.2919, pruned_loss=0.07988, over 10876.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2919, pruned_loss=0.07988, over 10876.00 frames. ], batch size: 30, lr: 1.24e-02, grad_scale: 32.0 2023-06-23 14:32:38,447 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 14:32:44,189 INFO [train.py:1040] (2/4) Epoch 26, validation: loss=0.1977, simple_loss=0.2958, pruned_loss=0.04978, over 143649.00 frames. 2023-06-23 14:32:44,190 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 14:32:51,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=88720.0, ans=0.125 2023-06-23 14:32:52,811 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.487e+02 1.812e+02 2.007e+02 2.182e+02 2.880e+02, threshold=4.014e+02, percent-clipped=0.0 2023-06-23 14:32:59,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=88786.66666666667, ans=0.05 2023-06-23 14:33:03,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=88786.66666666667, ans=0.05 2023-06-23 14:33:54,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=88986.66666666667, ans=0.125 2023-06-23 14:34:02,992 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-23 14:34:07,086 INFO [train.py:1008] (2/4) Epoch 26, batch 50, loss[loss=0.2398, simple_loss=0.3226, pruned_loss=0.0785, over 17651.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2963, pruned_loss=0.07956, over 836937.46 frames. ], batch size: 67, lr: 1.24e-02, grad_scale: 32.0 2023-06-23 14:34:22,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten.whitening_limit, batch_count=89053.33333333333, ans=15.0 2023-06-23 14:34:23,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=89120.0, ans=0.125 2023-06-23 14:34:58,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=89253.33333333333, ans=0.0 2023-06-23 14:35:21,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=89320.0, ans=0.125 2023-06-23 14:35:30,374 INFO [train.py:1008] (2/4) Epoch 26, batch 100, loss[loss=0.2245, simple_loss=0.291, pruned_loss=0.07904, over 20098.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2961, pruned_loss=0.08045, over 1500285.39 frames. ], batch size: 133, lr: 1.24e-02, grad_scale: 32.0 2023-06-23 14:35:38,426 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.25 vs. 
limit=6.0 2023-06-23 14:35:39,097 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 1.887e+02 2.093e+02 2.327e+02 3.480e+02, threshold=4.186e+02, percent-clipped=0.0 2023-06-23 14:36:02,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=89520.0, ans=0.1 2023-06-23 14:36:37,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=89653.33333333333, ans=0.125 2023-06-23 14:36:46,546 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.19 vs. limit=15.0 2023-06-23 14:36:51,909 INFO [train.py:1008] (2/4) Epoch 26, batch 150, loss[loss=0.2398, simple_loss=0.2954, pruned_loss=0.0921, over 20691.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.2969, pruned_loss=0.08028, over 1994368.46 frames. ], batch size: 211, lr: 1.24e-02, grad_scale: 32.0 2023-06-23 14:37:00,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=89720.0, ans=0.0 2023-06-23 14:37:29,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=89853.33333333333, ans=0.125 2023-06-23 14:37:39,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=89920.0, ans=0.125 2023-06-23 14:37:46,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=89920.0, ans=0.1 2023-06-23 14:38:07,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=89986.66666666667, ans=0.0 2023-06-23 14:38:13,806 INFO [train.py:1008] (2/4) Epoch 26, batch 200, loss[loss=0.2348, simple_loss=0.2705, pruned_loss=0.09958, over 17101.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.297, pruned_loss=0.08036, over 2389860.49 frames. ], batch size: 392, lr: 1.23e-02, grad_scale: 32.0 2023-06-23 14:38:22,158 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.871e+02 2.185e+02 2.799e+02 4.409e+02, threshold=4.369e+02, percent-clipped=2.0 2023-06-23 14:38:36,703 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-23 14:38:47,757 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.whiten, num_groups=1, num_channels=512, metric=4.56 vs. limit=12.0 2023-06-23 14:38:50,982 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.77 vs. limit=10.0 2023-06-23 14:38:55,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.98 vs. limit=10.0 2023-06-23 14:39:06,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=90253.33333333333, ans=0.0 2023-06-23 14:39:13,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer1.prob, batch_count=90253.33333333333, ans=0.125 2023-06-23 14:39:35,907 INFO [train.py:1008] (2/4) Epoch 26, batch 250, loss[loss=0.2122, simple_loss=0.289, pruned_loss=0.06768, over 19671.00 frames. 
], tot_loss[loss=0.2281, simple_loss=0.2964, pruned_loss=0.07992, over 2705733.36 frames. ], batch size: 110, lr: 1.23e-02, grad_scale: 32.0 2023-06-23 14:39:36,662 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.46 vs. limit=15.0 2023-06-23 14:39:44,667 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:40:05,352 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.03 vs. limit=15.0 2023-06-23 14:40:27,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=90586.66666666667, ans=0.125 2023-06-23 14:40:30,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=90586.66666666667, ans=0.0 2023-06-23 14:40:59,304 INFO [train.py:1008] (2/4) Epoch 26, batch 300, loss[loss=0.2319, simple_loss=0.2987, pruned_loss=0.0825, over 19229.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2963, pruned_loss=0.07959, over 2958855.00 frames. ], batch size: 92, lr: 1.23e-02, grad_scale: 32.0 2023-06-23 14:40:59,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=90720.0, ans=0.125 2023-06-23 14:41:07,099 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.789e+02 2.016e+02 2.294e+02 4.058e+02, threshold=4.033e+02, percent-clipped=0.0 2023-06-23 14:41:23,341 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=512, metric=3.96 vs. limit=12.0 2023-06-23 14:42:04,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.15 vs. limit=22.5 2023-06-23 14:42:20,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.88 vs. limit=15.0 2023-06-23 14:42:21,045 INFO [train.py:1008] (2/4) Epoch 26, batch 350, loss[loss=0.224, simple_loss=0.291, pruned_loss=0.07847, over 20599.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2974, pruned_loss=0.07991, over 3133784.95 frames. ], batch size: 173, lr: 1.23e-02, grad_scale: 8.0 2023-06-23 14:42:21,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=91053.33333333333, ans=0.125 2023-06-23 14:43:03,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=91186.66666666667, ans=0.1 2023-06-23 14:43:35,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=91320.0, ans=0.125 2023-06-23 14:43:42,026 INFO [train.py:1008] (2/4) Epoch 26, batch 400, loss[loss=0.2375, simple_loss=0.308, pruned_loss=0.08348, over 18278.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2973, pruned_loss=0.07966, over 3255054.22 frames. 
], batch size: 74, lr: 1.23e-02, grad_scale: 16.0 2023-06-23 14:43:52,736 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.810e+02 2.065e+02 2.376e+02 3.330e+02, threshold=4.130e+02, percent-clipped=0.0 2023-06-23 14:44:29,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=91586.66666666667, ans=0.125 2023-06-23 14:44:39,055 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.19 vs. limit=22.5 2023-06-23 14:44:43,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=91586.66666666667, ans=10.0 2023-06-23 14:44:52,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=91653.33333333333, ans=0.0 2023-06-23 14:45:01,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=91720.0, ans=0.1 2023-06-23 14:45:03,412 INFO [train.py:1008] (2/4) Epoch 26, batch 450, loss[loss=0.2168, simple_loss=0.2944, pruned_loss=0.06958, over 18658.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2971, pruned_loss=0.07971, over 3363971.42 frames. ], batch size: 80, lr: 1.23e-02, grad_scale: 16.0 2023-06-23 14:45:49,760 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.92 vs. limit=10.0 2023-06-23 14:46:15,144 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=12.56 vs. limit=15.0 2023-06-23 14:46:16,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=91986.66666666667, ans=0.125 2023-06-23 14:46:23,504 INFO [train.py:1008] (2/4) Epoch 26, batch 500, loss[loss=0.2174, simple_loss=0.2867, pruned_loss=0.07408, over 19804.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.2968, pruned_loss=0.07964, over 3469709.26 frames. ], batch size: 115, lr: 1.22e-02, grad_scale: 16.0 2023-06-23 14:46:34,081 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.741e+02 1.935e+02 2.246e+02 4.045e+02, threshold=3.869e+02, percent-clipped=0.0 2023-06-23 14:46:34,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=92053.33333333333, ans=0.125 2023-06-23 14:47:36,841 INFO [train.py:1008] (2/4) Epoch 27, batch 0, loss[loss=0.2429, simple_loss=0.3026, pruned_loss=0.09156, over 20483.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3026, pruned_loss=0.09156, over 20483.00 frames. ], batch size: 160, lr: 1.20e-02, grad_scale: 32.0 2023-06-23 14:47:36,841 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 14:47:42,438 INFO [train.py:1040] (2/4) Epoch 27, validation: loss=0.1977, simple_loss=0.2951, pruned_loss=0.05021, over 143649.00 frames. 
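In the optim.py entries such as "Clipping_scale=2.0, grad-norm quartiles 1.419e+02 1.810e+02 2.065e+02 2.376e+02 3.330e+02, threshold=4.130e+02, percent-clipped=0.0", the reported threshold equals clipping_scale times the middle quartile (2.0 × 2.065e+02 = 4.130e+02), which suggests the clipping threshold is derived from the median gradient norm over the recent logging window. The sketch below reproduces that diagnostic under this assumption; clipping_report is a hypothetical helper, not the optimizer's actual code.

import numpy as np

# Sketch of the clipping diagnostics printed by optim.py, assuming the
# threshold is clipping_scale times the median of recently observed gradient
# norms. Illustrative only.
def clipping_report(grad_norms, clipping_scale=2.0):
    norms = np.asarray(grad_norms, dtype=float)
    quartiles = np.percentile(norms, [0, 25, 50, 75, 100])
    threshold = clipping_scale * quartiles[2]           # scale * median
    percent_clipped = 100.0 * float(np.mean(norms > threshold))
    return quartiles, threshold, percent_clipped

# Example with synthetic norms in the same range as the entries above:
rng = np.random.default_rng(0)
q, thr, pct = clipping_report(rng.normal(200.0, 40.0, size=500))
print("grad-norm quartiles", q, "threshold", thr, "percent-clipped", pct)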
2023-06-23 14:47:42,439 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 14:47:47,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=92273.33333333333, ans=0.125 2023-06-23 14:48:27,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.98 vs. limit=22.5 2023-06-23 14:48:32,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=92473.33333333333, ans=0.125 2023-06-23 14:48:32,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=92473.33333333333, ans=15.0 2023-06-23 14:48:39,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=92473.33333333333, ans=0.125 2023-06-23 14:48:39,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=92473.33333333333, ans=0.125 2023-06-23 14:49:08,348 INFO [train.py:1008] (2/4) Epoch 27, batch 50, loss[loss=0.2377, simple_loss=0.3153, pruned_loss=0.08001, over 17877.00 frames. ], tot_loss[loss=0.23, simple_loss=0.2972, pruned_loss=0.08143, over 861945.84 frames. ], batch size: 68, lr: 1.20e-02, grad_scale: 32.0 2023-06-23 14:49:13,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=92606.66666666667, ans=0.0 2023-06-23 14:49:23,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=92673.33333333333, ans=0.0 2023-06-23 14:49:37,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=92673.33333333333, ans=0.0 2023-06-23 14:49:38,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=92673.33333333333, ans=0.125 2023-06-23 14:49:48,383 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.403e+02 1.807e+02 2.048e+02 2.353e+02 3.369e+02, threshold=4.097e+02, percent-clipped=0.0 2023-06-23 14:50:32,183 INFO [train.py:1008] (2/4) Epoch 27, batch 100, loss[loss=0.2168, simple_loss=0.2889, pruned_loss=0.07236, over 19339.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.297, pruned_loss=0.08016, over 1505805.02 frames. ], batch size: 98, lr: 1.20e-02, grad_scale: 32.0 2023-06-23 14:51:05,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=93073.33333333333, ans=0.0 2023-06-23 14:51:37,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=93140.0, ans=0.2 2023-06-23 14:51:56,224 INFO [train.py:1008] (2/4) Epoch 27, batch 150, loss[loss=0.2769, simple_loss=0.296, pruned_loss=0.1289, over 17190.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2956, pruned_loss=0.07945, over 2022208.44 frames. 
], batch size: 392, lr: 1.19e-02, grad_scale: 32.0 2023-06-23 14:51:58,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=93273.33333333333, ans=0.09899494936611666 2023-06-23 14:52:17,906 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-06-23 14:52:38,088 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.808e+02 2.052e+02 2.276e+02 3.582e+02, threshold=4.103e+02, percent-clipped=0.0 2023-06-23 14:53:09,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=93540.0, ans=0.1 2023-06-23 14:53:12,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=93540.0, ans=0.125 2023-06-23 14:53:16,738 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.44 vs. limit=22.5 2023-06-23 14:53:22,641 INFO [train.py:1008] (2/4) Epoch 27, batch 200, loss[loss=0.2523, simple_loss=0.3252, pruned_loss=0.08974, over 16740.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2951, pruned_loss=0.07841, over 2417488.23 frames. ], batch size: 59, lr: 1.19e-02, grad_scale: 32.0 2023-06-23 14:53:28,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=93606.66666666667, ans=0.125 2023-06-23 14:53:58,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=93740.0, ans=0.0 2023-06-23 14:54:45,931 INFO [train.py:1008] (2/4) Epoch 27, batch 250, loss[loss=0.2006, simple_loss=0.274, pruned_loss=0.06354, over 19099.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2957, pruned_loss=0.07823, over 2715216.27 frames. ], batch size: 94, lr: 1.19e-02, grad_scale: 16.0 2023-06-23 14:55:06,153 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=8.04 vs. limit=15.0 2023-06-23 14:55:10,132 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:55:14,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=94006.66666666667, ans=0.125 2023-06-23 14:55:27,789 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.519e+02 1.850e+02 2.043e+02 2.344e+02 3.255e+02, threshold=4.087e+02, percent-clipped=0.0 2023-06-23 14:55:42,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=94140.0, ans=0.0 2023-06-23 14:55:44,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=94140.0, ans=0.125 2023-06-23 14:55:56,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=94206.66666666667, ans=0.125 2023-06-23 14:56:09,901 INFO [train.py:1008] (2/4) Epoch 27, batch 300, loss[loss=0.2243, simple_loss=0.2936, pruned_loss=0.07747, over 20223.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2953, pruned_loss=0.07854, over 2955634.27 frames. 
], batch size: 141, lr: 1.19e-02, grad_scale: 16.0 2023-06-23 14:56:50,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94406.66666666667, ans=0.1 2023-06-23 14:57:20,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=94540.0, ans=0.125 2023-06-23 14:57:33,339 INFO [train.py:1008] (2/4) Epoch 27, batch 350, loss[loss=0.2354, simple_loss=0.2867, pruned_loss=0.09208, over 20280.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2959, pruned_loss=0.07837, over 3122783.98 frames. ], batch size: 239, lr: 1.19e-02, grad_scale: 16.0 2023-06-23 14:57:38,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94606.66666666667, ans=0.1 2023-06-23 14:57:52,977 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.95 vs. limit=15.0 2023-06-23 14:58:12,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=94740.0, ans=0.0 2023-06-23 14:58:15,015 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.806e+02 2.000e+02 2.279e+02 3.758e+02, threshold=4.000e+02, percent-clipped=0.0 2023-06-23 14:58:21,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=94806.66666666667, ans=0.0 2023-06-23 14:58:21,705 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:58:28,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=94806.66666666667, ans=0.125 2023-06-23 14:58:40,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94873.33333333333, ans=0.1 2023-06-23 14:58:51,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=94873.33333333333, ans=0.1 2023-06-23 14:58:56,653 INFO [train.py:1008] (2/4) Epoch 27, batch 400, loss[loss=0.2105, simple_loss=0.2903, pruned_loss=0.06535, over 19207.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2953, pruned_loss=0.07822, over 3270640.49 frames. ], batch size: 92, lr: 1.19e-02, grad_scale: 32.0 2023-06-23 14:58:57,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.96 vs. limit=22.5 2023-06-23 14:59:01,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=94940.0, ans=0.5 2023-06-23 14:59:17,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=95006.66666666667, ans=0.125 2023-06-23 14:59:17,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=95006.66666666667, ans=0.0 2023-06-23 14:59:21,229 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.74 vs. 
limit=15.0 2023-06-23 14:59:26,338 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.96 vs. limit=15.0 2023-06-23 14:59:31,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=95073.33333333333, ans=0.0 2023-06-23 14:59:46,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=95140.0, ans=0.1 2023-06-23 14:59:47,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=95140.0, ans=0.1 2023-06-23 14:59:57,059 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.32 vs. limit=15.0 2023-06-23 15:00:20,247 INFO [train.py:1008] (2/4) Epoch 27, batch 450, loss[loss=0.2483, simple_loss=0.3207, pruned_loss=0.08793, over 16263.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2947, pruned_loss=0.07779, over 3384562.47 frames. ], batch size: 52, lr: 1.18e-02, grad_scale: 32.0 2023-06-23 15:00:31,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=95273.33333333333, ans=0.125 2023-06-23 15:00:43,293 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.95 vs. limit=15.0 2023-06-23 15:00:58,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=95406.66666666667, ans=0.125 2023-06-23 15:01:02,077 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.418e+02 1.792e+02 1.984e+02 2.286e+02 3.799e+02, threshold=3.968e+02, percent-clipped=0.0 2023-06-23 15:01:22,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=95473.33333333333, ans=0.5 2023-06-23 15:01:26,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=95540.0, ans=0.125 2023-06-23 15:01:33,226 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=3.06 vs. limit=15.0 2023-06-23 15:01:34,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=95540.0, ans=0.1 2023-06-23 15:01:34,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=95540.0, ans=0.1 2023-06-23 15:01:40,197 INFO [train.py:1008] (2/4) Epoch 27, batch 500, loss[loss=0.1917, simple_loss=0.2659, pruned_loss=0.05873, over 19332.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2948, pruned_loss=0.07791, over 3477447.74 frames. 
], batch size: 98, lr: 1.18e-02, grad_scale: 32.0 2023-06-23 15:02:12,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=95740.0, ans=0.2 2023-06-23 15:02:15,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=95740.0, ans=0.125 2023-06-23 15:02:20,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=95740.0, ans=0.125 2023-06-23 15:02:54,448 INFO [train.py:1008] (2/4) Epoch 28, batch 0, loss[loss=0.2013, simple_loss=0.2813, pruned_loss=0.06065, over 18605.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2813, pruned_loss=0.06065, over 18605.00 frames. ], batch size: 80, lr: 1.16e-02, grad_scale: 32.0 2023-06-23 15:02:54,449 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 15:03:00,354 INFO [train.py:1040] (2/4) Epoch 28, validation: loss=0.1967, simple_loss=0.2958, pruned_loss=0.0488, over 143649.00 frames. 2023-06-23 15:03:00,355 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 15:03:40,521 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.84 vs. limit=6.0 2023-06-23 15:03:41,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=95960.0, ans=0.125 2023-06-23 15:04:09,224 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.389e+02 1.841e+02 2.095e+02 2.528e+02 5.157e+02, threshold=4.189e+02, percent-clipped=2.0 2023-06-23 15:04:09,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=96093.33333333333, ans=0.0 2023-06-23 15:04:23,616 INFO [train.py:1008] (2/4) Epoch 28, batch 50, loss[loss=0.243, simple_loss=0.3194, pruned_loss=0.08326, over 16437.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2918, pruned_loss=0.07833, over 851836.38 frames. ], batch size: 52, lr: 1.16e-02, grad_scale: 32.0 2023-06-23 15:04:33,015 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.53 vs. limit=6.0 2023-06-23 15:04:52,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-23 15:05:01,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=96293.33333333333, ans=0.125 2023-06-23 15:05:45,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=96493.33333333333, ans=0.125 2023-06-23 15:05:46,201 INFO [train.py:1008] (2/4) Epoch 28, batch 100, loss[loss=0.2171, simple_loss=0.3046, pruned_loss=0.06478, over 11809.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2927, pruned_loss=0.07677, over 1500155.81 frames. 
], batch size: 33, lr: 1.16e-02, grad_scale: 32.0 2023-06-23 15:05:46,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=96493.33333333333, ans=0.07 2023-06-23 15:06:51,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=96760.0, ans=0.125 2023-06-23 15:06:51,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=96760.0, ans=0.1 2023-06-23 15:06:52,559 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.83 vs. limit=6.0 2023-06-23 15:06:53,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=96760.0, ans=0.125 2023-06-23 15:06:54,817 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.744e+02 1.940e+02 2.256e+02 4.424e+02, threshold=3.879e+02, percent-clipped=2.0 2023-06-23 15:07:08,089 INFO [train.py:1008] (2/4) Epoch 28, batch 150, loss[loss=0.2218, simple_loss=0.2962, pruned_loss=0.07368, over 18628.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2935, pruned_loss=0.07741, over 2020720.62 frames. ], batch size: 80, lr: 1.16e-02, grad_scale: 32.0 2023-06-23 15:07:08,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=96826.66666666667, ans=0.0 2023-06-23 15:07:44,552 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.04 vs. limit=15.0 2023-06-23 15:07:48,525 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0 2023-06-23 15:08:08,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=97026.66666666667, ans=0.125 2023-06-23 15:08:31,637 INFO [train.py:1008] (2/4) Epoch 28, batch 200, loss[loss=0.2367, simple_loss=0.2979, pruned_loss=0.08774, over 20000.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.293, pruned_loss=0.0778, over 2408937.02 frames. ], batch size: 126, lr: 1.15e-02, grad_scale: 32.0 2023-06-23 15:08:32,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=97160.0, ans=0.125 2023-06-23 15:08:35,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=97160.0, ans=0.125 2023-06-23 15:09:07,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=97293.33333333333, ans=0.1 2023-06-23 15:09:37,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=97426.66666666667, ans=0.0 2023-06-23 15:09:41,907 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.508e+02 1.794e+02 1.961e+02 2.215e+02 3.143e+02, threshold=3.923e+02, percent-clipped=0.0 2023-06-23 15:09:54,712 INFO [train.py:1008] (2/4) Epoch 28, batch 250, loss[loss=0.2194, simple_loss=0.2917, pruned_loss=0.07349, over 19667.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2928, pruned_loss=0.0772, over 2726492.23 frames. 
], batch size: 110, lr: 1.15e-02, grad_scale: 32.0 2023-06-23 15:10:11,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=97560.0, ans=0.125 2023-06-23 15:10:23,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=97560.0, ans=0.2 2023-06-23 15:11:10,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=97760.0, ans=0.125 2023-06-23 15:11:10,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=97760.0, ans=0.0 2023-06-23 15:11:20,013 INFO [train.py:1008] (2/4) Epoch 28, batch 300, loss[loss=0.2174, simple_loss=0.2946, pruned_loss=0.07007, over 19833.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2924, pruned_loss=0.07744, over 2961903.75 frames. ], batch size: 115, lr: 1.15e-02, grad_scale: 32.0 2023-06-23 15:11:22,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=97826.66666666667, ans=0.125 2023-06-23 15:11:27,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=97826.66666666667, ans=0.1 2023-06-23 15:11:43,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=97893.33333333333, ans=0.2 2023-06-23 15:11:53,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=97960.0, ans=0.02 2023-06-23 15:11:54,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=97960.0, ans=0.125 2023-06-23 15:12:04,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=97960.0, ans=0.125 2023-06-23 15:12:09,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=98026.66666666667, ans=0.125 2023-06-23 15:12:33,699 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.834e+02 2.088e+02 2.410e+02 3.645e+02, threshold=4.176e+02, percent-clipped=0.0 2023-06-23 15:12:44,767 INFO [train.py:1008] (2/4) Epoch 28, batch 350, loss[loss=0.218, simple_loss=0.2899, pruned_loss=0.07309, over 19433.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2922, pruned_loss=0.07678, over 3133788.64 frames. ], batch size: 105, lr: 1.15e-02, grad_scale: 16.0 2023-06-23 15:13:06,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=98226.66666666667, ans=0.2 2023-06-23 15:14:03,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=98426.66666666667, ans=0.125 2023-06-23 15:14:08,945 INFO [train.py:1008] (2/4) Epoch 28, batch 400, loss[loss=0.2251, simple_loss=0.2999, pruned_loss=0.07517, over 16842.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2929, pruned_loss=0.0763, over 3268487.54 frames. 
], batch size: 59, lr: 1.15e-02, grad_scale: 32.0 2023-06-23 15:14:09,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=98493.33333333333, ans=0.125 2023-06-23 15:14:15,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=98493.33333333333, ans=0.035 2023-06-23 15:14:26,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=98560.0, ans=0.125 2023-06-23 15:14:28,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=98560.0, ans=0.1 2023-06-23 15:14:38,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=98560.0, ans=0.0 2023-06-23 15:14:54,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=98626.66666666667, ans=0.125 2023-06-23 15:14:59,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=98693.33333333333, ans=0.2 2023-06-23 15:15:15,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=98760.0, ans=0.0 2023-06-23 15:15:15,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=98760.0, ans=0.0 2023-06-23 15:15:21,703 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.409e+02 1.819e+02 2.049e+02 2.416e+02 3.292e+02, threshold=4.098e+02, percent-clipped=0.0 2023-06-23 15:15:33,613 INFO [train.py:1008] (2/4) Epoch 28, batch 450, loss[loss=0.2207, simple_loss=0.285, pruned_loss=0.07818, over 20632.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2932, pruned_loss=0.07627, over 3382617.39 frames. 
], batch size: 189, lr: 1.15e-02, grad_scale: 32.0 2023-06-23 15:15:47,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=98826.66666666667, ans=0.0 2023-06-23 15:15:50,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=98893.33333333333, ans=0.0 2023-06-23 15:15:54,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=98893.33333333333, ans=0.0 2023-06-23 15:15:55,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=98893.33333333333, ans=0.2 2023-06-23 15:16:17,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=98960.0, ans=0.0 2023-06-23 15:16:33,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=99026.66666666667, ans=0.125 2023-06-23 15:16:40,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=99093.33333333333, ans=0.1 2023-06-23 15:16:49,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=99093.33333333333, ans=0.2 2023-06-23 15:16:50,189 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.58 vs. limit=15.0 2023-06-23 15:16:55,943 INFO [train.py:1008] (2/4) Epoch 28, batch 500, loss[loss=0.2446, simple_loss=0.2825, pruned_loss=0.1033, over 16874.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2925, pruned_loss=0.07664, over 3484192.77 frames. ], batch size: 391, lr: 1.15e-02, grad_scale: 32.0 2023-06-23 15:17:05,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=99160.0, ans=0.1 2023-06-23 15:17:07,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=99160.0, ans=0.125 2023-06-23 15:17:12,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=99226.66666666667, ans=0.0 2023-06-23 15:17:24,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=99226.66666666667, ans=0.2 2023-06-23 15:17:39,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=99293.33333333333, ans=0.125 2023-06-23 15:18:09,968 INFO [train.py:1008] (2/4) Epoch 29, batch 0, loss[loss=0.216, simple_loss=0.2778, pruned_loss=0.07715, over 20574.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2778, pruned_loss=0.07715, over 20574.00 frames. ], batch size: 189, lr: 1.12e-02, grad_scale: 32.0 2023-06-23 15:18:09,968 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 15:18:15,767 INFO [train.py:1040] (2/4) Epoch 29, validation: loss=0.1936, simple_loss=0.2924, pruned_loss=0.04736, over 143649.00 frames. 
2023-06-23 15:18:15,768 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 15:18:18,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer2.prob, batch_count=99380.0, ans=0.125 2023-06-23 15:18:32,890 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.428e+02 1.762e+02 1.895e+02 2.191e+02 3.587e+02, threshold=3.791e+02, percent-clipped=0.0 2023-06-23 15:19:39,088 INFO [train.py:1008] (2/4) Epoch 29, batch 50, loss[loss=0.2005, simple_loss=0.2751, pruned_loss=0.0629, over 19798.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2908, pruned_loss=0.07509, over 862713.95 frames. ], batch size: 115, lr: 1.12e-02, grad_scale: 32.0 2023-06-23 15:19:49,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=99713.33333333333, ans=0.1 2023-06-23 15:20:13,181 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:20:24,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=99846.66666666667, ans=0.125 2023-06-23 15:20:24,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=99846.66666666667, ans=0.125 2023-06-23 15:20:52,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=99980.0, ans=0.125 2023-06-23 15:20:53,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=99980.0, ans=0.125 2023-06-23 15:21:02,846 INFO [train.py:1008] (2/4) Epoch 29, batch 100, loss[loss=0.2081, simple_loss=0.2696, pruned_loss=0.07333, over 20661.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2911, pruned_loss=0.07506, over 1513245.58 frames. ], batch size: 211, lr: 1.12e-02, grad_scale: 32.0 2023-06-23 15:21:19,917 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 1.800e+02 2.046e+02 2.472e+02 3.527e+02, threshold=4.093e+02, percent-clipped=0.0 2023-06-23 15:21:53,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=100246.66666666667, ans=0.1 2023-06-23 15:22:00,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=100246.66666666667, ans=0.125 2023-06-23 15:22:11,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=100313.33333333333, ans=0.125 2023-06-23 15:22:12,237 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=22.69 vs. limit=22.5 2023-06-23 15:22:27,020 INFO [train.py:1008] (2/4) Epoch 29, batch 150, loss[loss=0.2285, simple_loss=0.2998, pruned_loss=0.07862, over 19128.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2913, pruned_loss=0.07554, over 2002468.83 frames. 
], batch size: 94, lr: 1.12e-02, grad_scale: 32.0 2023-06-23 15:23:01,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=100513.33333333333, ans=0.125 2023-06-23 15:23:23,163 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.39 vs. limit=22.5 2023-06-23 15:23:29,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=100580.0, ans=0.125 2023-06-23 15:23:50,814 INFO [train.py:1008] (2/4) Epoch 29, batch 200, loss[loss=0.2096, simple_loss=0.2885, pruned_loss=0.06536, over 19215.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2907, pruned_loss=0.07605, over 2393406.19 frames. ], batch size: 92, lr: 1.12e-02, grad_scale: 32.0 2023-06-23 15:24:06,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=100780.0, ans=0.0 2023-06-23 15:24:07,773 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.845e+02 2.075e+02 2.500e+02 4.166e+02, threshold=4.151e+02, percent-clipped=1.0 2023-06-23 15:24:30,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=100846.66666666667, ans=0.2 2023-06-23 15:24:46,359 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.53 vs. limit=22.5 2023-06-23 15:25:12,631 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:25:15,534 INFO [train.py:1008] (2/4) Epoch 29, batch 250, loss[loss=0.236, simple_loss=0.3164, pruned_loss=0.07778, over 16335.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2911, pruned_loss=0.07574, over 2685305.17 frames. ], batch size: 52, lr: 1.12e-02, grad_scale: 32.0 2023-06-23 15:25:20,475 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.87 vs. limit=15.0 2023-06-23 15:25:22,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=101046.66666666667, ans=0.0 2023-06-23 15:25:27,204 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-23 15:25:32,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=101113.33333333333, ans=0.125 2023-06-23 15:25:45,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=101113.33333333333, ans=0.0 2023-06-23 15:25:51,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=101180.0, ans=0.1 2023-06-23 15:26:18,864 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.65 vs. limit=15.0 2023-06-23 15:26:40,100 INFO [train.py:1008] (2/4) Epoch 29, batch 300, loss[loss=0.2499, simple_loss=0.3244, pruned_loss=0.08766, over 16982.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2914, pruned_loss=0.07555, over 2917338.38 frames. 
], batch size: 60, lr: 1.12e-02, grad_scale: 32.0 2023-06-23 15:26:57,107 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.583e+02 1.868e+02 2.052e+02 2.316e+02 3.360e+02, threshold=4.104e+02, percent-clipped=0.0 2023-06-23 15:27:08,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=101446.66666666667, ans=0.125 2023-06-23 15:27:08,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=101446.66666666667, ans=0.0 2023-06-23 15:27:13,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=101513.33333333333, ans=0.125 2023-06-23 15:28:03,695 INFO [train.py:1008] (2/4) Epoch 29, batch 350, loss[loss=0.2108, simple_loss=0.2874, pruned_loss=0.06714, over 19706.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2915, pruned_loss=0.07547, over 3103024.35 frames. ], batch size: 110, lr: 1.11e-02, grad_scale: 32.0 2023-06-23 15:28:09,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=101713.33333333333, ans=0.1 2023-06-23 15:28:14,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=101713.33333333333, ans=0.0 2023-06-23 15:28:48,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.attention_skip_rate, batch_count=101846.66666666667, ans=0.0 2023-06-23 15:29:14,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=101980.0, ans=0.0 2023-06-23 15:29:27,158 INFO [train.py:1008] (2/4) Epoch 29, batch 400, loss[loss=0.2247, simple_loss=0.2912, pruned_loss=0.07911, over 19561.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2914, pruned_loss=0.07551, over 3261511.96 frames. ], batch size: 102, lr: 1.11e-02, grad_scale: 32.0 2023-06-23 15:29:34,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=102046.66666666667, ans=0.0 2023-06-23 15:29:45,532 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.374e+02 1.818e+02 2.034e+02 2.468e+02 3.455e+02, threshold=4.068e+02, percent-clipped=0.0 2023-06-23 15:29:49,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=102113.33333333333, ans=0.0 2023-06-23 15:29:54,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=102113.33333333333, ans=0.0 2023-06-23 15:29:57,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=102113.33333333333, ans=0.1 2023-06-23 15:30:02,874 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=5.53 vs. limit=10.0 2023-06-23 15:30:03,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=102180.0, ans=0.0 2023-06-23 15:30:12,509 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.75 vs. 
limit=15.0 2023-06-23 15:30:29,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=102246.66666666667, ans=0.0 2023-06-23 15:30:51,678 INFO [train.py:1008] (2/4) Epoch 29, batch 450, loss[loss=0.1857, simple_loss=0.2664, pruned_loss=0.05254, over 19843.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2917, pruned_loss=0.07522, over 3379513.78 frames. ], batch size: 120, lr: 1.11e-02, grad_scale: 32.0 2023-06-23 15:31:33,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=102513.33333333333, ans=0.0 2023-06-23 15:31:42,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=102580.0, ans=0.125 2023-06-23 15:32:12,052 INFO [train.py:1008] (2/4) Epoch 29, batch 500, loss[loss=0.2268, simple_loss=0.2923, pruned_loss=0.08062, over 20239.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2907, pruned_loss=0.07548, over 3487188.35 frames. ], batch size: 141, lr: 1.11e-02, grad_scale: 32.0 2023-06-23 15:32:28,492 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.532e+02 1.876e+02 2.021e+02 2.378e+02 4.301e+02, threshold=4.041e+02, percent-clipped=2.0 2023-06-23 15:32:35,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=102780.0, ans=10.0 2023-06-23 15:32:42,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=102846.66666666667, ans=0.04949747468305833 2023-06-23 15:32:57,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=102913.33333333333, ans=0.125 2023-06-23 15:33:18,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=102926.66666666667, ans=0.125 2023-06-23 15:33:24,635 INFO [train.py:1008] (2/4) Epoch 30, batch 0, loss[loss=0.2131, simple_loss=0.2872, pruned_loss=0.06946, over 19865.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2872, pruned_loss=0.06946, over 19865.00 frames. ], batch size: 120, lr: 1.09e-02, grad_scale: 32.0 2023-06-23 15:33:24,636 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 15:33:30,311 INFO [train.py:1040] (2/4) Epoch 30, validation: loss=0.1959, simple_loss=0.2936, pruned_loss=0.04913, over 143649.00 frames. 2023-06-23 15:33:30,311 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 15:33:39,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=102926.66666666667, ans=0.1 2023-06-23 15:33:59,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=102993.33333333333, ans=0.0 2023-06-23 15:33:59,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=102993.33333333333, ans=0.0 2023-06-23 15:34:29,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=103126.66666666667, ans=0.125 2023-06-23 15:34:53,416 INFO [train.py:1008] (2/4) Epoch 30, batch 50, loss[loss=0.2259, simple_loss=0.3034, pruned_loss=0.07416, over 18791.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2912, pruned_loss=0.07454, over 856730.01 frames. 
], batch size: 83, lr: 1.09e-02, grad_scale: 32.0 2023-06-23 15:34:55,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=103260.0, ans=0.1 2023-06-23 15:35:31,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=103393.33333333333, ans=0.1 2023-06-23 15:35:36,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=103393.33333333333, ans=0.125 2023-06-23 15:35:39,393 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.710e+02 1.932e+02 2.166e+02 3.409e+02, threshold=3.865e+02, percent-clipped=0.0 2023-06-23 15:35:46,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=103460.0, ans=0.125 2023-06-23 15:35:59,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=103526.66666666667, ans=0.125 2023-06-23 15:36:01,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=103526.66666666667, ans=0.125 2023-06-23 15:36:16,855 INFO [train.py:1008] (2/4) Epoch 30, batch 100, loss[loss=0.208, simple_loss=0.2839, pruned_loss=0.06608, over 19676.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2924, pruned_loss=0.07347, over 1478071.15 frames. ], batch size: 110, lr: 1.09e-02, grad_scale: 32.0 2023-06-23 15:36:31,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=103660.0, ans=0.0 2023-06-23 15:36:37,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=103660.0, ans=0.09899494936611666 2023-06-23 15:37:22,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=103860.0, ans=0.05 2023-06-23 15:37:24,434 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.98 vs. limit=15.0 2023-06-23 15:37:29,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=103860.0, ans=0.0 2023-06-23 15:37:39,577 INFO [train.py:1008] (2/4) Epoch 30, batch 150, loss[loss=0.2275, simple_loss=0.291, pruned_loss=0.08197, over 20091.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2915, pruned_loss=0.07377, over 1995562.14 frames. 
], batch size: 133, lr: 1.09e-02, grad_scale: 32.0 2023-06-23 15:38:28,108 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.472e+02 1.832e+02 2.048e+02 2.358e+02 3.520e+02, threshold=4.096e+02, percent-clipped=0.0 2023-06-23 15:38:39,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=104126.66666666667, ans=0.2 2023-06-23 15:38:43,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=104126.66666666667, ans=0.0 2023-06-23 15:38:47,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=104193.33333333333, ans=0.125 2023-06-23 15:38:56,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=104193.33333333333, ans=0.2 2023-06-23 15:38:59,573 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=2.79 vs. limit=12.0 2023-06-23 15:39:05,386 INFO [train.py:1008] (2/4) Epoch 30, batch 200, loss[loss=0.2198, simple_loss=0.2773, pruned_loss=0.0812, over 20178.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2908, pruned_loss=0.07422, over 2386020.29 frames. ], batch size: 239, lr: 1.08e-02, grad_scale: 32.0 2023-06-23 15:39:05,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=104260.0, ans=0.1 2023-06-23 15:40:23,838 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:40:28,823 INFO [train.py:1008] (2/4) Epoch 30, batch 250, loss[loss=0.2274, simple_loss=0.2949, pruned_loss=0.07989, over 20120.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2898, pruned_loss=0.07402, over 2698507.06 frames. ], batch size: 133, lr: 1.08e-02, grad_scale: 32.0 2023-06-23 15:40:32,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=104593.33333333333, ans=0.125 2023-06-23 15:41:16,340 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.612e+02 1.895e+02 2.101e+02 2.429e+02 4.312e+02, threshold=4.202e+02, percent-clipped=1.0 2023-06-23 15:41:32,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=104793.33333333333, ans=0.125 2023-06-23 15:41:38,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=104860.0, ans=0.0 2023-06-23 15:41:54,047 INFO [train.py:1008] (2/4) Epoch 30, batch 300, loss[loss=0.2254, simple_loss=0.2945, pruned_loss=0.07812, over 20290.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2896, pruned_loss=0.07389, over 2938922.06 frames. ], batch size: 149, lr: 1.08e-02, grad_scale: 16.0 2023-06-23 15:41:54,961 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.46 vs. 
limit=15.0 2023-06-23 15:42:12,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=104993.33333333333, ans=0.0 2023-06-23 15:42:13,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=104993.33333333333, ans=0.0 2023-06-23 15:42:17,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=104993.33333333333, ans=10.0 2023-06-23 15:42:17,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=104993.33333333333, ans=0.0 2023-06-23 15:43:00,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=105193.33333333333, ans=0.125 2023-06-23 15:43:01,874 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.39 vs. limit=15.0 2023-06-23 15:43:11,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=105193.33333333333, ans=0.125 2023-06-23 15:43:17,694 INFO [train.py:1008] (2/4) Epoch 30, batch 350, loss[loss=0.2108, simple_loss=0.2915, pruned_loss=0.06501, over 18648.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2895, pruned_loss=0.07345, over 3125849.31 frames. ], batch size: 80, lr: 1.08e-02, grad_scale: 16.0 2023-06-23 15:43:35,893 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.96 vs. limit=15.0 2023-06-23 15:43:37,412 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=9.44 vs. limit=15.0 2023-06-23 15:44:07,283 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.468e+02 1.775e+02 1.994e+02 2.405e+02 3.305e+02, threshold=3.989e+02, percent-clipped=0.0 2023-06-23 15:44:07,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=105460.0, ans=0.125 2023-06-23 15:44:37,196 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.42 vs. limit=15.0 2023-06-23 15:44:42,281 INFO [train.py:1008] (2/4) Epoch 30, batch 400, loss[loss=0.2181, simple_loss=0.2943, pruned_loss=0.07088, over 18819.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2899, pruned_loss=0.0731, over 3269925.70 frames. ], batch size: 83, lr: 1.08e-02, grad_scale: 32.0 2023-06-23 15:44:44,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=105593.33333333333, ans=0.1 2023-06-23 15:45:00,441 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0 2023-06-23 15:45:03,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=21.12 vs. 
limit=22.5 2023-06-23 15:45:36,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=105793.33333333333, ans=0.125 2023-06-23 15:45:48,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=105860.0, ans=0.1 2023-06-23 15:46:06,201 INFO [train.py:1008] (2/4) Epoch 30, batch 450, loss[loss=0.2013, simple_loss=0.2825, pruned_loss=0.06009, over 18950.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2893, pruned_loss=0.07289, over 3387512.87 frames. ], batch size: 86, lr: 1.08e-02, grad_scale: 32.0 2023-06-23 15:46:34,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=105993.33333333333, ans=0.0 2023-06-23 15:46:50,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=106060.0, ans=0.125 2023-06-23 15:46:55,354 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.841e+02 2.075e+02 2.342e+02 3.586e+02, threshold=4.150e+02, percent-clipped=0.0 2023-06-23 15:47:14,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=106193.33333333333, ans=0.125 2023-06-23 15:47:15,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=106193.33333333333, ans=0.2 2023-06-23 15:47:29,130 INFO [train.py:1008] (2/4) Epoch 30, batch 500, loss[loss=0.219, simple_loss=0.2917, pruned_loss=0.07314, over 19961.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2899, pruned_loss=0.07311, over 3462875.11 frames. ], batch size: 126, lr: 1.08e-02, grad_scale: 32.0 2023-06-23 15:47:35,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=106260.0, ans=0.125 2023-06-23 15:47:39,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=106260.0, ans=15.0 2023-06-23 15:47:59,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=106393.33333333333, ans=15.0 2023-06-23 15:48:10,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=106393.33333333333, ans=0.125 2023-06-23 15:48:42,472 INFO [train.py:1008] (2/4) Epoch 31, batch 0, loss[loss=0.2237, simple_loss=0.2894, pruned_loss=0.079, over 20105.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2894, pruned_loss=0.079, over 20105.00 frames. ], batch size: 133, lr: 1.06e-02, grad_scale: 32.0 2023-06-23 15:48:42,473 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 15:48:49,575 INFO [train.py:1040] (2/4) Epoch 31, validation: loss=0.1972, simple_loss=0.2938, pruned_loss=0.05034, over 143649.00 frames. 
2023-06-23 15:48:49,576 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 15:48:54,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=106480.0, ans=0.125 2023-06-23 15:49:49,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=106680.0, ans=0.04949747468305833 2023-06-23 15:49:59,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=106746.66666666667, ans=0.125 2023-06-23 15:50:05,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=106746.66666666667, ans=0.125 2023-06-23 15:50:08,413 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.479e+02 1.811e+02 2.098e+02 2.397e+02 4.313e+02, threshold=4.196e+02, percent-clipped=1.0 2023-06-23 15:50:08,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=106746.66666666667, ans=0.1 2023-06-23 15:50:15,213 INFO [train.py:1008] (2/4) Epoch 31, batch 50, loss[loss=0.2159, simple_loss=0.2942, pruned_loss=0.06874, over 18908.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2906, pruned_loss=0.07235, over 853161.52 frames. ], batch size: 86, lr: 1.06e-02, grad_scale: 32.0 2023-06-23 15:50:15,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=106813.33333333333, ans=0.0 2023-06-23 15:50:17,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=106813.33333333333, ans=0.1 2023-06-23 15:50:28,028 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:50:57,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=106946.66666666667, ans=0.0 2023-06-23 15:51:01,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=106946.66666666667, ans=0.0 2023-06-23 15:51:02,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=106946.66666666667, ans=15.0 2023-06-23 15:51:09,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=107013.33333333333, ans=0.0 2023-06-23 15:51:25,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=107080.0, ans=0.2 2023-06-23 15:51:38,183 INFO [train.py:1008] (2/4) Epoch 31, batch 100, loss[loss=0.2077, simple_loss=0.2826, pruned_loss=0.06639, over 19823.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2884, pruned_loss=0.07192, over 1510776.99 frames. 
], batch size: 115, lr: 1.05e-02, grad_scale: 16.0 2023-06-23 15:52:15,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=107280.0, ans=0.125 2023-06-23 15:52:47,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=107413.33333333333, ans=0.125 2023-06-23 15:52:56,568 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.362e+02 1.706e+02 1.875e+02 2.242e+02 3.229e+02, threshold=3.751e+02, percent-clipped=0.0 2023-06-23 15:53:00,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=107480.0, ans=0.0 2023-06-23 15:53:01,955 INFO [train.py:1008] (2/4) Epoch 31, batch 150, loss[loss=0.233, simple_loss=0.292, pruned_loss=0.08701, over 20662.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2889, pruned_loss=0.07265, over 2022112.46 frames. ], batch size: 211, lr: 1.05e-02, grad_scale: 16.0 2023-06-23 15:53:03,124 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.34 vs. limit=15.0 2023-06-23 15:53:21,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=107546.66666666667, ans=0.09899494936611666 2023-06-23 15:53:26,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=107546.66666666667, ans=0.125 2023-06-23 15:54:17,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=107746.66666666667, ans=0.125 2023-06-23 15:54:22,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=107813.33333333333, ans=0.0 2023-06-23 15:54:23,771 INFO [train.py:1008] (2/4) Epoch 31, batch 200, loss[loss=0.2215, simple_loss=0.2831, pruned_loss=0.07991, over 20547.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.288, pruned_loss=0.07252, over 2416756.17 frames. ], batch size: 173, lr: 1.05e-02, grad_scale: 16.0 2023-06-23 15:55:07,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=107946.66666666667, ans=0.125 2023-06-23 15:55:29,691 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.13 vs. limit=15.0 2023-06-23 15:55:42,529 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.794e+02 2.048e+02 2.340e+02 3.476e+02, threshold=4.095e+02, percent-clipped=0.0 2023-06-23 15:55:47,882 INFO [train.py:1008] (2/4) Epoch 31, batch 250, loss[loss=0.2048, simple_loss=0.2804, pruned_loss=0.0646, over 19990.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2887, pruned_loss=0.07304, over 2721516.55 frames. ], batch size: 126, lr: 1.05e-02, grad_scale: 16.0 2023-06-23 15:56:03,657 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.33 vs. 
limit=15.0 2023-06-23 15:56:27,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=108280.0, ans=0.125 2023-06-23 15:56:42,949 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.78 vs. limit=6.0 2023-06-23 15:56:48,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=108346.66666666667, ans=0.0 2023-06-23 15:57:03,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=108413.33333333333, ans=0.1 2023-06-23 15:57:05,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=108413.33333333333, ans=0.125 2023-06-23 15:57:10,795 INFO [train.py:1008] (2/4) Epoch 31, batch 300, loss[loss=0.2145, simple_loss=0.2824, pruned_loss=0.07329, over 19641.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2886, pruned_loss=0.07286, over 2963458.62 frames. ], batch size: 110, lr: 1.05e-02, grad_scale: 16.0 2023-06-23 15:57:22,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=108480.0, ans=0.0 2023-06-23 15:57:52,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.83 vs. limit=22.5 2023-06-23 15:57:55,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=108613.33333333333, ans=0.125 2023-06-23 15:58:00,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=108680.0, ans=0.0 2023-06-23 15:58:08,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=108680.0, ans=0.2 2023-06-23 15:58:27,913 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.554e+02 1.826e+02 1.984e+02 2.389e+02 5.943e+02, threshold=3.969e+02, percent-clipped=1.0 2023-06-23 15:58:28,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=108746.66666666667, ans=0.2 2023-06-23 15:58:33,191 INFO [train.py:1008] (2/4) Epoch 31, batch 350, loss[loss=0.216, simple_loss=0.279, pruned_loss=0.07646, over 20564.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2883, pruned_loss=0.07224, over 3153316.35 frames. ], batch size: 189, lr: 1.05e-02, grad_scale: 16.0 2023-06-23 15:58:40,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=108813.33333333333, ans=0.0 2023-06-23 15:58:50,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=108880.0, ans=0.125 2023-06-23 15:59:26,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=109013.33333333333, ans=0.07 2023-06-23 15:59:55,482 INFO [train.py:1008] (2/4) Epoch 31, batch 400, loss[loss=0.2268, simple_loss=0.2987, pruned_loss=0.0775, over 19823.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2887, pruned_loss=0.07236, over 3299836.01 frames. 
], batch size: 115, lr: 1.05e-02, grad_scale: 32.0 2023-06-23 15:59:58,733 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.55 vs. limit=6.0 2023-06-23 16:00:50,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=109346.66666666667, ans=0.2 2023-06-23 16:00:57,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=109346.66666666667, ans=0.0 2023-06-23 16:01:00,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=109346.66666666667, ans=0.125 2023-06-23 16:01:04,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=109413.33333333333, ans=0.125 2023-06-23 16:01:06,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=109413.33333333333, ans=0.1 2023-06-23 16:01:13,834 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.820e+02 2.015e+02 2.494e+02 3.790e+02, threshold=4.031e+02, percent-clipped=0.0 2023-06-23 16:01:19,318 INFO [train.py:1008] (2/4) Epoch 31, batch 450, loss[loss=0.2079, simple_loss=0.2827, pruned_loss=0.06651, over 19661.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2884, pruned_loss=0.07199, over 3409090.94 frames. ], batch size: 110, lr: 1.05e-02, grad_scale: 32.0 2023-06-23 16:01:24,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=109480.0, ans=0.125 2023-06-23 16:01:27,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=109480.0, ans=0.05 2023-06-23 16:01:56,300 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:02:12,987 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=22.5 2023-06-23 16:02:22,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=109680.0, ans=15.0 2023-06-23 16:02:23,442 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:02:34,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=109746.66666666667, ans=0.0 2023-06-23 16:02:34,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=109746.66666666667, ans=0.125 2023-06-23 16:02:40,583 INFO [train.py:1008] (2/4) Epoch 31, batch 500, loss[loss=0.2417, simple_loss=0.3125, pruned_loss=0.08545, over 16400.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2881, pruned_loss=0.07154, over 3482290.61 frames. ], batch size: 52, lr: 1.04e-02, grad_scale: 32.0 2023-06-23 16:03:02,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=109880.0, ans=0.0 2023-06-23 16:03:51,638 INFO [train.py:1008] (2/4) Epoch 32, batch 0, loss[loss=0.2167, simple_loss=0.2813, pruned_loss=0.07606, over 20599.00 frames. 
], tot_loss[loss=0.2167, simple_loss=0.2813, pruned_loss=0.07606, over 20599.00 frames. ], batch size: 211, lr: 1.03e-02, grad_scale: 32.0 2023-06-23 16:03:51,638 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 16:03:57,282 INFO [train.py:1040] (2/4) Epoch 32, validation: loss=0.1948, simple_loss=0.2928, pruned_loss=0.04842, over 143649.00 frames. 2023-06-23 16:03:57,283 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 16:04:00,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=110026.66666666667, ans=0.2 2023-06-23 16:04:02,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=110026.66666666667, ans=0.0 2023-06-23 16:04:18,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=110093.33333333333, ans=0.0 2023-06-23 16:04:21,936 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.549e+02 1.764e+02 1.926e+02 2.188e+02 3.298e+02, threshold=3.851e+02, percent-clipped=0.0 2023-06-23 16:04:42,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=110160.0, ans=0.125 2023-06-23 16:04:42,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=110160.0, ans=0.125 2023-06-23 16:04:48,288 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=5.17 vs. limit=12.0 2023-06-23 16:04:52,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=110226.66666666667, ans=0.0 2023-06-23 16:05:00,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=110226.66666666667, ans=0.0 2023-06-23 16:05:12,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=110293.33333333333, ans=0.2 2023-06-23 16:05:15,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=110293.33333333333, ans=0.0 2023-06-23 16:05:20,969 INFO [train.py:1008] (2/4) Epoch 32, batch 50, loss[loss=0.2062, simple_loss=0.2814, pruned_loss=0.06555, over 19464.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2858, pruned_loss=0.07189, over 867216.10 frames. ], batch size: 105, lr: 1.03e-02, grad_scale: 32.0 2023-06-23 16:05:37,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=110426.66666666667, ans=0.125 2023-06-23 16:05:39,733 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.64 vs. limit=22.5 2023-06-23 16:05:46,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=110426.66666666667, ans=0.2 2023-06-23 16:06:43,338 INFO [train.py:1008] (2/4) Epoch 32, batch 100, loss[loss=0.2075, simple_loss=0.2843, pruned_loss=0.06535, over 19431.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2873, pruned_loss=0.07078, over 1507098.69 frames. 
], batch size: 105, lr: 1.02e-02, grad_scale: 32.0 2023-06-23 16:06:57,244 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.36 vs. limit=15.0 2023-06-23 16:07:07,343 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.436e+02 1.733e+02 1.900e+02 2.160e+02 3.250e+02, threshold=3.800e+02, percent-clipped=0.0 2023-06-23 16:07:16,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=110826.66666666667, ans=0.2 2023-06-23 16:07:20,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=110826.66666666667, ans=0.125 2023-06-23 16:07:38,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=110893.33333333333, ans=0.1 2023-06-23 16:07:47,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=110960.0, ans=0.125 2023-06-23 16:08:05,639 INFO [train.py:1008] (2/4) Epoch 32, batch 150, loss[loss=0.2353, simple_loss=0.3138, pruned_loss=0.07847, over 16870.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2889, pruned_loss=0.07093, over 2001515.26 frames. ], batch size: 59, lr: 1.02e-02, grad_scale: 32.0 2023-06-23 16:08:07,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=111026.66666666667, ans=0.125 2023-06-23 16:08:07,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=111026.66666666667, ans=0.125 2023-06-23 16:08:38,361 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.87 vs. limit=15.0 2023-06-23 16:09:26,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=111293.33333333333, ans=0.0 2023-06-23 16:09:29,356 INFO [train.py:1008] (2/4) Epoch 32, batch 200, loss[loss=0.2146, simple_loss=0.2889, pruned_loss=0.07016, over 19105.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2891, pruned_loss=0.07114, over 2380888.04 frames. ], batch size: 89, lr: 1.02e-02, grad_scale: 32.0 2023-06-23 16:09:30,345 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=10.63 vs. 
limit=15.0 2023-06-23 16:09:53,756 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.415e+02 1.794e+02 2.041e+02 2.395e+02 3.648e+02, threshold=4.083e+02, percent-clipped=0.0 2023-06-23 16:10:00,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=111493.33333333333, ans=0.125 2023-06-23 16:10:35,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=111626.66666666667, ans=0.0 2023-06-23 16:10:35,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=111626.66666666667, ans=0.2 2023-06-23 16:10:37,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=111626.66666666667, ans=0.0 2023-06-23 16:10:40,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=111626.66666666667, ans=0.2 2023-06-23 16:10:42,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=111626.66666666667, ans=0.0 2023-06-23 16:10:52,237 INFO [train.py:1008] (2/4) Epoch 32, batch 250, loss[loss=0.2141, simple_loss=0.2967, pruned_loss=0.06575, over 18291.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2891, pruned_loss=0.07126, over 2693758.81 frames. ], batch size: 74, lr: 1.02e-02, grad_scale: 32.0 2023-06-23 16:11:00,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=111693.33333333333, ans=0.0 2023-06-23 16:11:24,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=22.34 vs. limit=22.5 2023-06-23 16:11:43,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=111893.33333333333, ans=0.2 2023-06-23 16:11:44,336 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.83 vs. limit=22.5 2023-06-23 16:11:56,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=111893.33333333333, ans=0.125 2023-06-23 16:12:10,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=111960.0, ans=0.125 2023-06-23 16:12:16,964 INFO [train.py:1008] (2/4) Epoch 32, batch 300, loss[loss=0.1976, simple_loss=0.2801, pruned_loss=0.05757, over 19078.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2885, pruned_loss=0.07142, over 2935479.42 frames. 
], batch size: 89, lr: 1.02e-02, grad_scale: 32.0 2023-06-23 16:12:19,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=112026.66666666667, ans=0.125 2023-06-23 16:12:26,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=112026.66666666667, ans=0.05 2023-06-23 16:12:37,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=112093.33333333333, ans=0.125 2023-06-23 16:12:42,334 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.375e+02 1.747e+02 1.919e+02 2.144e+02 3.308e+02, threshold=3.838e+02, percent-clipped=0.0 2023-06-23 16:12:43,847 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.86 vs. limit=15.0 2023-06-23 16:13:27,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=112293.33333333333, ans=0.125 2023-06-23 16:13:40,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=112360.0, ans=0.125 2023-06-23 16:13:41,596 INFO [train.py:1008] (2/4) Epoch 32, batch 350, loss[loss=0.2389, simple_loss=0.2983, pruned_loss=0.08979, over 20703.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2866, pruned_loss=0.07133, over 3140161.63 frames. ], batch size: 211, lr: 1.02e-02, grad_scale: 32.0 2023-06-23 16:13:43,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=112360.0, ans=0.04949747468305833 2023-06-23 16:14:33,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=112560.0, ans=0.125 2023-06-23 16:14:52,580 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.15 vs. limit=15.0 2023-06-23 16:15:03,581 INFO [train.py:1008] (2/4) Epoch 32, batch 400, loss[loss=0.2093, simple_loss=0.2873, pruned_loss=0.06562, over 19667.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2869, pruned_loss=0.07187, over 3280145.88 frames. ], batch size: 110, lr: 1.02e-02, grad_scale: 32.0 2023-06-23 16:15:09,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=112693.33333333333, ans=0.025 2023-06-23 16:15:25,717 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.06 vs. limit=10.0 2023-06-23 16:15:28,200 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.955e+02 2.266e+02 2.752e+02 4.281e+02, threshold=4.532e+02, percent-clipped=1.0 2023-06-23 16:15:50,906 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-06-23 16:16:18,040 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:16:26,233 INFO [train.py:1008] (2/4) Epoch 32, batch 450, loss[loss=0.2211, simple_loss=0.3031, pruned_loss=0.06955, over 18267.00 frames. 
], tot_loss[loss=0.2152, simple_loss=0.2867, pruned_loss=0.0719, over 3394904.76 frames. ], batch size: 74, lr: 1.02e-02, grad_scale: 32.0 2023-06-23 16:16:26,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=113026.66666666667, ans=0.125 2023-06-23 16:16:45,717 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=8.66 vs. limit=15.0 2023-06-23 16:16:51,715 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:17:47,507 INFO [train.py:1008] (2/4) Epoch 32, batch 500, loss[loss=0.2073, simple_loss=0.2853, pruned_loss=0.06458, over 18784.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2865, pruned_loss=0.0722, over 3477080.15 frames. ], batch size: 83, lr: 1.01e-02, grad_scale: 32.0 2023-06-23 16:18:00,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=113360.0, ans=0.125 2023-06-23 16:18:10,895 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.460e+02 1.754e+02 1.901e+02 2.303e+02 3.486e+02, threshold=3.802e+02, percent-clipped=0.0 2023-06-23 16:18:54,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=113573.33333333333, ans=0.1 2023-06-23 16:18:59,125 INFO [train.py:1008] (2/4) Epoch 33, batch 0, loss[loss=0.2346, simple_loss=0.2958, pruned_loss=0.08668, over 20483.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.2958, pruned_loss=0.08668, over 20483.00 frames. ], batch size: 160, lr: 9.98e-03, grad_scale: 32.0 2023-06-23 16:18:59,125 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 16:19:05,316 INFO [train.py:1040] (2/4) Epoch 33, validation: loss=0.1977, simple_loss=0.2933, pruned_loss=0.05106, over 143649.00 frames. 2023-06-23 16:19:05,317 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 16:19:25,402 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=8.10 vs. limit=15.0 2023-06-23 16:19:38,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=113706.66666666667, ans=0.2 2023-06-23 16:19:43,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.98 vs. limit=22.5 2023-06-23 16:19:57,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=113773.33333333333, ans=10.0 2023-06-23 16:20:06,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=113773.33333333333, ans=0.125 2023-06-23 16:20:14,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=113840.0, ans=0.125 2023-06-23 16:20:14,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten.whitening_limit, batch_count=113840.0, ans=15.0 2023-06-23 16:20:24,506 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.36 vs. 
limit=15.0 2023-06-23 16:20:27,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=113906.66666666667, ans=0.0 2023-06-23 16:20:28,511 INFO [train.py:1008] (2/4) Epoch 33, batch 50, loss[loss=0.2538, simple_loss=0.2767, pruned_loss=0.1155, over 16934.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2884, pruned_loss=0.07376, over 865076.51 frames. ], batch size: 391, lr: 9.96e-03, grad_scale: 32.0 2023-06-23 16:20:30,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=113906.66666666667, ans=0.2 2023-06-23 16:20:31,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=113906.66666666667, ans=0.0 2023-06-23 16:20:38,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=113906.66666666667, ans=0.125 2023-06-23 16:21:00,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=114040.0, ans=0.0 2023-06-23 16:21:23,152 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.406e+02 1.777e+02 1.999e+02 2.231e+02 3.108e+02, threshold=3.997e+02, percent-clipped=0.0 2023-06-23 16:21:51,082 INFO [train.py:1008] (2/4) Epoch 33, batch 100, loss[loss=0.2283, simple_loss=0.2967, pruned_loss=0.07999, over 19098.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.289, pruned_loss=0.07285, over 1483755.03 frames. ], batch size: 94, lr: 9.95e-03, grad_scale: 32.0 2023-06-23 16:22:04,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=114240.0, ans=0.0 2023-06-23 16:22:12,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=114306.66666666667, ans=0.125 2023-06-23 16:22:14,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=114306.66666666667, ans=0.125 2023-06-23 16:22:27,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=114373.33333333333, ans=0.0 2023-06-23 16:22:45,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=114440.0, ans=0.125 2023-06-23 16:22:50,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=114440.0, ans=0.025 2023-06-23 16:22:54,790 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.45 vs. limit=15.0 2023-06-23 16:23:02,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=114506.66666666667, ans=0.0 2023-06-23 16:23:13,396 INFO [train.py:1008] (2/4) Epoch 33, batch 150, loss[loss=0.2014, simple_loss=0.2819, pruned_loss=0.06046, over 18260.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.288, pruned_loss=0.07188, over 1977768.83 frames. 
], batch size: 74, lr: 9.94e-03, grad_scale: 32.0 2023-06-23 16:23:24,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=114573.33333333333, ans=0.0 2023-06-23 16:23:33,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=114640.0, ans=0.0 2023-06-23 16:24:07,345 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.527e+02 1.778e+02 2.103e+02 2.536e+02 3.974e+02, threshold=4.205e+02, percent-clipped=0.0 2023-06-23 16:24:18,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=114840.0, ans=0.125 2023-06-23 16:24:21,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=114840.0, ans=0.125 2023-06-23 16:24:25,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=114840.0, ans=0.125 2023-06-23 16:24:34,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=114906.66666666667, ans=0.0 2023-06-23 16:24:35,500 INFO [train.py:1008] (2/4) Epoch 33, batch 200, loss[loss=0.2033, simple_loss=0.281, pruned_loss=0.06283, over 18620.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2881, pruned_loss=0.07129, over 2373202.23 frames. ], batch size: 80, lr: 9.93e-03, grad_scale: 32.0 2023-06-23 16:24:46,395 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.42 vs. limit=15.0 2023-06-23 16:24:47,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=114906.66666666667, ans=0.0 2023-06-23 16:24:52,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=114973.33333333333, ans=0.1 2023-06-23 16:25:29,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=115106.66666666667, ans=0.0 2023-06-23 16:25:57,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=115240.0, ans=0.1 2023-06-23 16:25:59,183 INFO [train.py:1008] (2/4) Epoch 33, batch 250, loss[loss=0.2051, simple_loss=0.2787, pruned_loss=0.06572, over 19324.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2869, pruned_loss=0.07053, over 2689651.08 frames. ], batch size: 98, lr: 9.92e-03, grad_scale: 32.0 2023-06-23 16:26:06,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=115240.0, ans=0.125 2023-06-23 16:26:15,188 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.94 vs. 
limit=10.0 2023-06-23 16:26:22,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=115306.66666666667, ans=0.125 2023-06-23 16:26:29,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=115306.66666666667, ans=0.125 2023-06-23 16:26:29,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=115306.66666666667, ans=0.125 2023-06-23 16:26:35,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=115373.33333333333, ans=0.125 2023-06-23 16:26:47,240 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.07 vs. limit=22.5 2023-06-23 16:26:54,118 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.507e+02 1.756e+02 1.892e+02 2.103e+02 3.317e+02, threshold=3.784e+02, percent-clipped=0.0 2023-06-23 16:26:56,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=115440.0, ans=0.1 2023-06-23 16:27:19,361 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-23 16:27:23,046 INFO [train.py:1008] (2/4) Epoch 33, batch 300, loss[loss=0.2181, simple_loss=0.281, pruned_loss=0.07761, over 20212.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2861, pruned_loss=0.07063, over 2943852.73 frames. ], batch size: 239, lr: 9.90e-03, grad_scale: 32.0 2023-06-23 16:27:39,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=115640.0, ans=0.1 2023-06-23 16:27:41,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=115640.0, ans=0.0 2023-06-23 16:27:44,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=115640.0, ans=0.2 2023-06-23 16:28:01,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=115706.66666666667, ans=0.1 2023-06-23 16:28:03,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=115706.66666666667, ans=0.04949747468305833 2023-06-23 16:28:03,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=115706.66666666667, ans=0.125 2023-06-23 16:28:10,714 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.38 vs. limit=22.5 2023-06-23 16:28:46,492 INFO [train.py:1008] (2/4) Epoch 33, batch 350, loss[loss=0.2209, simple_loss=0.2618, pruned_loss=0.08994, over 16717.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2856, pruned_loss=0.07047, over 3120673.75 frames. ], batch size: 391, lr: 9.89e-03, grad_scale: 32.0 2023-06-23 16:28:52,585 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=2.63 vs. 
limit=15.0 2023-06-23 16:29:22,784 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=15.16 vs. limit=22.5 2023-06-23 16:29:42,024 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.443e+02 1.796e+02 1.968e+02 2.174e+02 3.064e+02, threshold=3.936e+02, percent-clipped=0.0 2023-06-23 16:29:50,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=116106.66666666667, ans=0.0 2023-06-23 16:30:09,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=116240.0, ans=0.125 2023-06-23 16:30:10,440 INFO [train.py:1008] (2/4) Epoch 33, batch 400, loss[loss=0.2098, simple_loss=0.2855, pruned_loss=0.06711, over 19214.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2854, pruned_loss=0.06977, over 3268537.88 frames. ], batch size: 92, lr: 9.88e-03, grad_scale: 32.0 2023-06-23 16:30:13,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-23 16:30:41,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=116373.33333333333, ans=0.125 2023-06-23 16:30:51,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=116373.33333333333, ans=0.0 2023-06-23 16:31:19,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=116506.66666666667, ans=0.125 2023-06-23 16:31:32,698 INFO [train.py:1008] (2/4) Epoch 33, batch 450, loss[loss=0.2047, simple_loss=0.2931, pruned_loss=0.05816, over 17028.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2862, pruned_loss=0.06967, over 3380712.77 frames. ], batch size: 60, lr: 9.87e-03, grad_scale: 32.0 2023-06-23 16:31:38,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=116573.33333333333, ans=0.1 2023-06-23 16:31:58,250 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-06-23 16:32:27,201 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.736e+02 1.935e+02 2.162e+02 3.591e+02, threshold=3.870e+02, percent-clipped=0.0 2023-06-23 16:32:35,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=116773.33333333333, ans=0.0 2023-06-23 16:32:37,937 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.62 vs. limit=15.0 2023-06-23 16:32:43,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=116840.0, ans=0.125 2023-06-23 16:32:49,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=116840.0, ans=0.125 2023-06-23 16:32:54,586 INFO [train.py:1008] (2/4) Epoch 33, batch 500, loss[loss=0.2074, simple_loss=0.2861, pruned_loss=0.06436, over 19128.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2863, pruned_loss=0.06955, over 3470806.68 frames. 
], batch size: 94, lr: 9.86e-03, grad_scale: 32.0 2023-06-23 16:32:57,158 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.12 vs. limit=15.0 2023-06-23 16:33:02,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=116906.66666666667, ans=0.2 2023-06-23 16:33:13,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=116973.33333333333, ans=0.0 2023-06-23 16:33:16,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=116973.33333333333, ans=0.125 2023-06-23 16:34:05,559 INFO [train.py:1008] (2/4) Epoch 34, batch 0, loss[loss=0.2075, simple_loss=0.2864, pruned_loss=0.06433, over 18937.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2864, pruned_loss=0.06433, over 18937.00 frames. ], batch size: 86, lr: 9.70e-03, grad_scale: 32.0 2023-06-23 16:34:05,560 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 16:34:13,116 INFO [train.py:1040] (2/4) Epoch 34, validation: loss=0.1987, simple_loss=0.2934, pruned_loss=0.05199, over 143649.00 frames. 2023-06-23 16:34:13,116 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 16:34:32,929 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=12.43 vs. limit=15.0 2023-06-23 16:34:38,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=117193.33333333333, ans=0.0 2023-06-23 16:34:47,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=117260.0, ans=0.0 2023-06-23 16:35:17,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=117326.66666666667, ans=0.2 2023-06-23 16:35:37,119 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.307e+02 1.709e+02 1.865e+02 2.103e+02 3.251e+02, threshold=3.730e+02, percent-clipped=0.0 2023-06-23 16:35:37,165 INFO [train.py:1008] (2/4) Epoch 34, batch 50, loss[loss=0.2219, simple_loss=0.2533, pruned_loss=0.09524, over 17114.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2804, pruned_loss=0.069, over 866894.00 frames. ], batch size: 391, lr: 9.69e-03, grad_scale: 32.0 2023-06-23 16:36:10,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=117593.33333333333, ans=0.1 2023-06-23 16:36:21,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=117593.33333333333, ans=0.0 2023-06-23 16:36:31,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=117660.0, ans=0.125 2023-06-23 16:36:34,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=117660.0, ans=0.2 2023-06-23 16:36:36,514 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.36 vs. limit=15.0 2023-06-23 16:36:58,666 INFO [train.py:1008] (2/4) Epoch 34, batch 100, loss[loss=0.2124, simple_loss=0.2749, pruned_loss=0.07495, over 20699.00 frames. 
], tot_loss[loss=0.2118, simple_loss=0.2822, pruned_loss=0.07065, over 1528971.13 frames. ], batch size: 211, lr: 9.68e-03, grad_scale: 32.0 2023-06-23 16:37:16,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=117860.0, ans=0.1 2023-06-23 16:37:20,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=117860.0, ans=0.2 2023-06-23 16:37:24,037 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.99 vs. limit=22.5 2023-06-23 16:37:24,203 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.49 vs. limit=15.0 2023-06-23 16:37:50,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.62 vs. limit=15.0 2023-06-23 16:38:19,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.71 vs. limit=15.0 2023-06-23 16:38:21,411 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.484e+02 1.747e+02 1.970e+02 2.348e+02 3.776e+02, threshold=3.940e+02, percent-clipped=1.0 2023-06-23 16:38:21,458 INFO [train.py:1008] (2/4) Epoch 34, batch 150, loss[loss=0.2156, simple_loss=0.2881, pruned_loss=0.07152, over 19968.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2836, pruned_loss=0.07032, over 2031717.28 frames. ], batch size: 126, lr: 9.67e-03, grad_scale: 32.0 2023-06-23 16:38:42,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=118193.33333333333, ans=0.125 2023-06-23 16:39:02,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=118260.0, ans=0.2 2023-06-23 16:39:12,845 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=7.26 vs. limit=15.0 2023-06-23 16:39:12,939 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=9.43 vs. limit=15.0 2023-06-23 16:39:23,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=118326.66666666667, ans=0.2 2023-06-23 16:39:31,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer2.prob, batch_count=118393.33333333333, ans=0.125 2023-06-23 16:39:34,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff2_skip_rate, batch_count=118393.33333333333, ans=0.0 2023-06-23 16:39:39,987 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=6.05 vs. limit=15.0 2023-06-23 16:39:42,098 INFO [train.py:1008] (2/4) Epoch 34, batch 200, loss[loss=0.2142, simple_loss=0.3042, pruned_loss=0.06209, over 18326.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2848, pruned_loss=0.07019, over 2398033.04 frames. 
], batch size: 72, lr: 9.65e-03, grad_scale: 32.0 2023-06-23 16:39:44,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=118460.0, ans=0.125 2023-06-23 16:39:49,345 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:39:55,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=118460.0, ans=0.2 2023-06-23 16:40:03,372 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=15.0 2023-06-23 16:40:09,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=118526.66666666667, ans=0.0 2023-06-23 16:40:09,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=118526.66666666667, ans=0.2 2023-06-23 16:40:34,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=118660.0, ans=0.125 2023-06-23 16:41:03,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=118793.33333333333, ans=0.0 2023-06-23 16:41:05,109 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.498e+02 1.757e+02 2.010e+02 2.304e+02 3.025e+02, threshold=4.020e+02, percent-clipped=0.0 2023-06-23 16:41:05,157 INFO [train.py:1008] (2/4) Epoch 34, batch 250, loss[loss=0.2259, simple_loss=0.3115, pruned_loss=0.07016, over 15165.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2843, pruned_loss=0.06972, over 2699011.45 frames. ], batch size: 43, lr: 9.64e-03, grad_scale: 32.0 2023-06-23 16:41:08,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=118793.33333333333, ans=0.0 2023-06-23 16:42:00,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=118993.33333333333, ans=0.125 2023-06-23 16:42:13,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=119060.0, ans=0.0 2023-06-23 16:42:27,335 INFO [train.py:1008] (2/4) Epoch 34, batch 300, loss[loss=0.2386, simple_loss=0.3155, pruned_loss=0.08087, over 17637.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2849, pruned_loss=0.07027, over 2925611.28 frames. ], batch size: 67, lr: 9.63e-03, grad_scale: 32.0 2023-06-23 16:42:37,930 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.53 vs. 
limit=15.0 2023-06-23 16:43:07,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119260.0, ans=0.1 2023-06-23 16:43:27,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=119326.66666666667, ans=0.0 2023-06-23 16:43:35,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=119393.33333333333, ans=0.0 2023-06-23 16:43:44,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=119393.33333333333, ans=0.0 2023-06-23 16:43:49,844 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.773e+02 2.027e+02 2.280e+02 3.489e+02, threshold=4.054e+02, percent-clipped=0.0 2023-06-23 16:43:49,892 INFO [train.py:1008] (2/4) Epoch 34, batch 350, loss[loss=0.1991, simple_loss=0.2781, pruned_loss=0.06009, over 18633.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.284, pruned_loss=0.0696, over 3132779.85 frames. ], batch size: 80, lr: 9.62e-03, grad_scale: 32.0 2023-06-23 16:44:21,765 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=4.26 vs. limit=15.0 2023-06-23 16:44:30,402 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-06-23 16:44:59,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.63 vs. limit=10.0 2023-06-23 16:45:11,393 INFO [train.py:1008] (2/4) Epoch 34, batch 400, loss[loss=0.234, simple_loss=0.2987, pruned_loss=0.08462, over 20247.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2844, pruned_loss=0.06925, over 3286075.59 frames. ], batch size: 141, lr: 9.61e-03, grad_scale: 32.0 2023-06-23 16:45:19,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=119793.33333333333, ans=0.0 2023-06-23 16:45:24,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=119793.33333333333, ans=0.0 2023-06-23 16:45:25,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=119860.0, ans=0.05 2023-06-23 16:45:33,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=119860.0, ans=0.125 2023-06-23 16:45:48,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=119926.66666666667, ans=0.95 2023-06-23 16:45:48,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=119926.66666666667, ans=0.125 2023-06-23 16:45:55,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=119926.66666666667, ans=0.0 2023-06-23 16:46:04,452 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.22 vs. 
limit=22.5 2023-06-23 16:46:12,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=119993.33333333333, ans=0.1 2023-06-23 16:46:32,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=120126.66666666667, ans=0.0 2023-06-23 16:46:33,394 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.462e+02 1.782e+02 1.974e+02 2.224e+02 3.060e+02, threshold=3.949e+02, percent-clipped=0.0 2023-06-23 16:46:33,441 INFO [train.py:1008] (2/4) Epoch 34, batch 450, loss[loss=0.2145, simple_loss=0.2884, pruned_loss=0.07027, over 18764.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2843, pruned_loss=0.06927, over 3412023.40 frames. ], batch size: 83, lr: 9.60e-03, grad_scale: 32.0 2023-06-23 16:47:00,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=120193.33333333333, ans=0.0 2023-06-23 16:47:19,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=120260.0, ans=0.125 2023-06-23 16:47:53,645 INFO [train.py:1008] (2/4) Epoch 34, batch 500, loss[loss=0.2139, simple_loss=0.2861, pruned_loss=0.07089, over 19964.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2847, pruned_loss=0.06918, over 3494234.84 frames. ], batch size: 126, lr: 9.59e-03, grad_scale: 64.0 2023-06-23 16:48:08,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=120526.66666666667, ans=15.0 2023-06-23 16:48:11,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=120526.66666666667, ans=0.0 2023-06-23 16:48:24,124 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=2.98 vs. limit=12.0 2023-06-23 16:48:24,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=120593.33333333333, ans=0.125 2023-06-23 16:48:28,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=120593.33333333333, ans=0.02 2023-06-23 16:49:04,716 INFO [train.py:1008] (2/4) Epoch 35, batch 0, loss[loss=0.2237, simple_loss=0.2912, pruned_loss=0.07806, over 19203.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2912, pruned_loss=0.07806, over 19203.00 frames. ], batch size: 92, lr: 9.44e-03, grad_scale: 32.0 2023-06-23 16:49:04,717 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 16:49:10,318 INFO [train.py:1040] (2/4) Epoch 35, validation: loss=0.1946, simple_loss=0.2902, pruned_loss=0.04948, over 143649.00 frames. 2023-06-23 16:49:10,318 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 16:49:19,909 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.12 vs. 
limit=6.0 2023-06-23 16:49:31,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=120746.66666666667, ans=0.125 2023-06-23 16:49:39,554 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 1.767e+02 1.966e+02 2.184e+02 2.859e+02, threshold=3.932e+02, percent-clipped=0.0 2023-06-23 16:49:50,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=120813.33333333333, ans=0.0 2023-06-23 16:49:59,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=120880.0, ans=0.125 2023-06-23 16:50:13,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=6.13 vs. limit=15.0 2023-06-23 16:50:32,624 INFO [train.py:1008] (2/4) Epoch 35, batch 50, loss[loss=0.2034, simple_loss=0.2779, pruned_loss=0.06442, over 18647.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2838, pruned_loss=0.06735, over 867742.72 frames. ], batch size: 80, lr: 9.43e-03, grad_scale: 32.0 2023-06-23 16:50:53,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=121080.0, ans=0.125 2023-06-23 16:50:59,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=121080.0, ans=0.125 2023-06-23 16:51:01,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=121080.0, ans=0.125 2023-06-23 16:51:22,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1.whitening_limit, batch_count=121213.33333333333, ans=10.0 2023-06-23 16:51:54,141 INFO [train.py:1008] (2/4) Epoch 35, batch 100, loss[loss=0.2049, simple_loss=0.2862, pruned_loss=0.06179, over 18616.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2847, pruned_loss=0.06754, over 1519402.58 frames. ], batch size: 80, lr: 9.42e-03, grad_scale: 16.0 2023-06-23 16:51:58,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=121346.66666666667, ans=0.0 2023-06-23 16:52:11,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=121413.33333333333, ans=0.0 2023-06-23 16:52:25,016 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 1.884e+02 2.196e+02 2.729e+02 3.922e+02, threshold=4.391e+02, percent-clipped=0.0 2023-06-23 16:52:41,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=121480.0, ans=10.0 2023-06-23 16:53:02,011 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.21 vs. limit=22.5 2023-06-23 16:53:15,641 INFO [train.py:1008] (2/4) Epoch 35, batch 150, loss[loss=0.2234, simple_loss=0.2825, pruned_loss=0.08212, over 20337.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2856, pruned_loss=0.06883, over 2031940.88 frames. 
], batch size: 240, lr: 9.41e-03, grad_scale: 16.0 2023-06-23 16:53:27,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=121680.0, ans=0.2 2023-06-23 16:53:38,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=121746.66666666667, ans=0.1 2023-06-23 16:54:00,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=121813.33333333333, ans=0.125 2023-06-23 16:54:00,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=121813.33333333333, ans=0.0 2023-06-23 16:54:11,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=121880.0, ans=0.1 2023-06-23 16:54:12,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=121880.0, ans=0.1 2023-06-23 16:54:37,539 INFO [train.py:1008] (2/4) Epoch 35, batch 200, loss[loss=0.2166, simple_loss=0.2799, pruned_loss=0.07662, over 20701.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2835, pruned_loss=0.06912, over 2438735.91 frames. ], batch size: 211, lr: 9.40e-03, grad_scale: 16.0 2023-06-23 16:54:44,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=122013.33333333333, ans=0.2 2023-06-23 16:54:49,663 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.073e-01 2023-06-23 16:54:52,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=122080.0, ans=0.125 2023-06-23 16:55:08,349 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.752e+02 1.972e+02 2.179e+02 3.541e+02, threshold=3.944e+02, percent-clipped=0.0 2023-06-23 16:55:10,805 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.85 vs. limit=10.0 2023-06-23 16:55:56,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=122280.0, ans=0.1 2023-06-23 16:55:58,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=122346.66666666667, ans=0.2 2023-06-23 16:55:58,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=122346.66666666667, ans=0.125 2023-06-23 16:55:59,553 INFO [train.py:1008] (2/4) Epoch 35, batch 250, loss[loss=0.2167, simple_loss=0.2747, pruned_loss=0.07935, over 20410.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2839, pruned_loss=0.06906, over 2745681.15 frames. 
], batch size: 239, lr: 9.38e-03, grad_scale: 16.0 2023-06-23 16:56:18,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=122413.33333333333, ans=0.1 2023-06-23 16:56:24,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=122413.33333333333, ans=0.125 2023-06-23 16:56:50,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=122546.66666666667, ans=0.125 2023-06-23 16:56:52,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=122546.66666666667, ans=0.1 2023-06-23 16:56:57,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=122546.66666666667, ans=0.125 2023-06-23 16:57:21,546 INFO [train.py:1008] (2/4) Epoch 35, batch 300, loss[loss=0.2003, simple_loss=0.2738, pruned_loss=0.06342, over 18760.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2836, pruned_loss=0.06844, over 2973868.92 frames. ], batch size: 83, lr: 9.37e-03, grad_scale: 16.0 2023-06-23 16:57:52,639 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.728e+02 1.902e+02 2.188e+02 3.019e+02, threshold=3.803e+02, percent-clipped=0.0 2023-06-23 16:58:32,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=122946.66666666667, ans=0.1 2023-06-23 16:58:35,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=122946.66666666667, ans=0.125 2023-06-23 16:58:43,124 INFO [train.py:1008] (2/4) Epoch 35, batch 350, loss[loss=0.1988, simple_loss=0.2759, pruned_loss=0.06085, over 20325.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2836, pruned_loss=0.06843, over 3165916.28 frames. ], batch size: 149, lr: 9.36e-03, grad_scale: 16.0 2023-06-23 16:59:00,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=123080.0, ans=0.125 2023-06-23 16:59:18,441 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.88 vs. limit=15.0 2023-06-23 17:00:03,641 INFO [train.py:1008] (2/4) Epoch 35, batch 400, loss[loss=0.2128, simple_loss=0.2886, pruned_loss=0.06855, over 19674.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2842, pruned_loss=0.06829, over 3290140.06 frames. 
], batch size: 110, lr: 9.35e-03, grad_scale: 32.0 2023-06-23 17:00:16,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=123346.66666666667, ans=0.125 2023-06-23 17:00:24,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=123413.33333333333, ans=0.0 2023-06-23 17:00:32,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=123413.33333333333, ans=0.0 2023-06-23 17:00:35,362 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.350e+02 1.770e+02 1.936e+02 2.100e+02 3.299e+02, threshold=3.871e+02, percent-clipped=0.0 2023-06-23 17:00:59,989 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-23 17:01:25,555 INFO [train.py:1008] (2/4) Epoch 35, batch 450, loss[loss=0.2159, simple_loss=0.2822, pruned_loss=0.07478, over 20613.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2841, pruned_loss=0.06773, over 3411127.67 frames. ], batch size: 189, lr: 9.34e-03, grad_scale: 32.0 2023-06-23 17:01:34,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=123680.0, ans=0.125 2023-06-23 17:02:18,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=123880.0, ans=0.5 2023-06-23 17:02:44,860 INFO [train.py:1008] (2/4) Epoch 35, batch 500, loss[loss=0.2195, simple_loss=0.2823, pruned_loss=0.07835, over 20209.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2838, pruned_loss=0.06802, over 3492381.47 frames. ], batch size: 239, lr: 9.33e-03, grad_scale: 32.0 2023-06-23 17:03:15,057 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.683e+02 1.866e+02 2.092e+02 3.633e+02, threshold=3.733e+02, percent-clipped=0.0 2023-06-23 17:03:55,162 INFO [train.py:1008] (2/4) Epoch 36, batch 0, loss[loss=0.2059, simple_loss=0.2853, pruned_loss=0.06329, over 18287.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2853, pruned_loss=0.06329, over 18287.00 frames. ], batch size: 74, lr: 9.19e-03, grad_scale: 32.0 2023-06-23 17:03:55,162 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 17:04:00,894 INFO [train.py:1040] (2/4) Epoch 36, validation: loss=0.1946, simple_loss=0.2906, pruned_loss=0.04927, over 143649.00 frames. 2023-06-23 17:04:00,895 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 17:04:07,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=124226.66666666667, ans=0.0 2023-06-23 17:04:09,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=124226.66666666667, ans=0.0 2023-06-23 17:05:12,666 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.79 vs. limit=15.0 2023-06-23 17:05:23,502 INFO [train.py:1008] (2/4) Epoch 36, batch 50, loss[loss=0.215, simple_loss=0.2823, pruned_loss=0.07385, over 20124.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2836, pruned_loss=0.06848, over 852659.92 frames. 
], batch size: 133, lr: 9.18e-03, grad_scale: 32.0 2023-06-23 17:05:28,304 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:05:34,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=124560.0, ans=0.0 2023-06-23 17:05:36,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=124560.0, ans=0.0 2023-06-23 17:05:45,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=124626.66666666667, ans=0.2 2023-06-23 17:05:56,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=124693.33333333333, ans=0.125 2023-06-23 17:06:05,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=124693.33333333333, ans=0.125 2023-06-23 17:06:17,483 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:06:23,163 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.316e+02 1.772e+02 1.989e+02 2.325e+02 4.599e+02, threshold=3.978e+02, percent-clipped=6.0 2023-06-23 17:06:28,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=124826.66666666667, ans=0.125 2023-06-23 17:06:28,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.27 vs. limit=15.0 2023-06-23 17:06:29,046 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.03 vs. limit=22.5 2023-06-23 17:06:31,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=124826.66666666667, ans=0.1 2023-06-23 17:06:31,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=124826.66666666667, ans=0.1 2023-06-23 17:06:35,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=124826.66666666667, ans=0.2 2023-06-23 17:06:45,489 INFO [train.py:1008] (2/4) Epoch 36, batch 100, loss[loss=0.2107, simple_loss=0.288, pruned_loss=0.06671, over 18627.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2823, pruned_loss=0.06763, over 1515053.70 frames. ], batch size: 80, lr: 9.17e-03, grad_scale: 32.0 2023-06-23 17:07:07,494 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=512, metric=2.94 vs. limit=15.0 2023-06-23 17:07:11,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=124960.0, ans=0.0 2023-06-23 17:07:11,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.skip_rate, batch_count=124960.0, ans=0.07 2023-06-23 17:07:26,396 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.23 vs. 
limit=15.0 2023-06-23 17:07:29,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=125026.66666666667, ans=0.0 2023-06-23 17:07:32,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=125026.66666666667, ans=0.2 2023-06-23 17:07:33,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=125093.33333333333, ans=0.95 2023-06-23 17:07:48,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=125093.33333333333, ans=0.0 2023-06-23 17:07:59,335 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.06 vs. limit=12.0 2023-06-23 17:08:03,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=125160.0, ans=0.5 2023-06-23 17:08:08,827 INFO [train.py:1008] (2/4) Epoch 36, batch 150, loss[loss=0.1972, simple_loss=0.2739, pruned_loss=0.06025, over 19050.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2827, pruned_loss=0.06754, over 2023715.44 frames. ], batch size: 89, lr: 9.16e-03, grad_scale: 32.0 2023-06-23 17:08:20,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=125226.66666666667, ans=0.125 2023-06-23 17:08:21,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=125226.66666666667, ans=0.1 2023-06-23 17:08:23,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=125293.33333333333, ans=10.0 2023-06-23 17:08:25,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=125293.33333333333, ans=0.125 2023-06-23 17:08:28,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=125293.33333333333, ans=0.125 2023-06-23 17:08:49,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=125360.0, ans=0.125 2023-06-23 17:09:03,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=125426.66666666667, ans=0.125 2023-06-23 17:09:10,266 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 1.880e+02 2.081e+02 2.354e+02 3.612e+02, threshold=4.162e+02, percent-clipped=0.0 2023-06-23 17:09:32,140 INFO [train.py:1008] (2/4) Epoch 36, batch 200, loss[loss=0.1902, simple_loss=0.2701, pruned_loss=0.05511, over 19689.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2821, pruned_loss=0.06739, over 2420168.87 frames. ], batch size: 110, lr: 9.15e-03, grad_scale: 32.0 2023-06-23 17:09:42,751 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.32 vs. 
limit=22.5 2023-06-23 17:09:44,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=125560.0, ans=0.125 2023-06-23 17:09:56,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=125626.66666666667, ans=0.0 2023-06-23 17:10:22,182 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.40 vs. limit=15.0 2023-06-23 17:10:30,190 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.35 vs. limit=15.0 2023-06-23 17:10:55,755 INFO [train.py:1008] (2/4) Epoch 36, batch 250, loss[loss=0.2082, simple_loss=0.2952, pruned_loss=0.06057, over 18340.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2819, pruned_loss=0.06675, over 2722853.64 frames. ], batch size: 72, lr: 9.14e-03, grad_scale: 32.0 2023-06-23 17:11:21,305 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0 2023-06-23 17:11:25,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=125960.0, ans=0.125 2023-06-23 17:11:57,046 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.714e+02 1.828e+02 2.101e+02 3.725e+02, threshold=3.656e+02, percent-clipped=0.0 2023-06-23 17:12:20,018 INFO [train.py:1008] (2/4) Epoch 36, batch 300, loss[loss=0.2082, simple_loss=0.2829, pruned_loss=0.06676, over 19129.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2813, pruned_loss=0.06681, over 2973977.28 frames. ], batch size: 94, lr: 9.13e-03, grad_scale: 32.0 2023-06-23 17:12:26,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=126226.66666666667, ans=10.0 2023-06-23 17:12:28,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=126226.66666666667, ans=0.125 2023-06-23 17:12:45,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=126293.33333333333, ans=0.2 2023-06-23 17:13:43,220 INFO [train.py:1008] (2/4) Epoch 36, batch 350, loss[loss=0.2048, simple_loss=0.2713, pruned_loss=0.06918, over 20099.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2812, pruned_loss=0.06632, over 3163210.84 frames. ], batch size: 133, lr: 9.12e-03, grad_scale: 32.0 2023-06-23 17:13:57,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=126560.0, ans=0.0 2023-06-23 17:14:37,594 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.29 vs. limit=22.5 2023-06-23 17:14:44,138 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.718e+02 1.947e+02 2.162e+02 2.915e+02, threshold=3.893e+02, percent-clipped=1.0 2023-06-23 17:14:54,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=126826.66666666667, ans=0.125 2023-06-23 17:15:06,620 INFO [train.py:1008] (2/4) Epoch 36, batch 400, loss[loss=0.1876, simple_loss=0.2659, pruned_loss=0.05468, over 19697.00 frames. 
], tot_loss[loss=0.206, simple_loss=0.2804, pruned_loss=0.06585, over 3322765.78 frames. ], batch size: 110, lr: 9.11e-03, grad_scale: 32.0 2023-06-23 17:15:13,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=126893.33333333333, ans=0.125 2023-06-23 17:15:34,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=126960.0, ans=0.1 2023-06-23 17:15:44,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=127026.66666666667, ans=0.1 2023-06-23 17:15:53,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=127026.66666666667, ans=0.1 2023-06-23 17:16:07,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=127093.33333333333, ans=0.2 2023-06-23 17:16:28,376 INFO [train.py:1008] (2/4) Epoch 36, batch 450, loss[loss=0.2197, simple_loss=0.2931, pruned_loss=0.07312, over 18623.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2812, pruned_loss=0.06627, over 3435395.47 frames. ], batch size: 80, lr: 9.10e-03, grad_scale: 32.0 2023-06-23 17:16:31,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=127226.66666666667, ans=0.0 2023-06-23 17:17:03,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=127360.0, ans=0.05 2023-06-23 17:17:08,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=127360.0, ans=0.09899494936611666 2023-06-23 17:17:17,220 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.31 vs. limit=15.0 2023-06-23 17:17:20,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=127426.66666666667, ans=0.125 2023-06-23 17:17:28,947 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 1.767e+02 1.977e+02 2.245e+02 3.178e+02, threshold=3.955e+02, percent-clipped=0.0 2023-06-23 17:17:35,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=127493.33333333333, ans=0.2 2023-06-23 17:17:36,268 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=11.98 vs. limit=15.0 2023-06-23 17:17:50,286 INFO [train.py:1008] (2/4) Epoch 36, batch 500, loss[loss=0.1879, simple_loss=0.2683, pruned_loss=0.0538, over 18781.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2818, pruned_loss=0.06632, over 3515707.50 frames. 
], batch size: 83, lr: 9.09e-03, grad_scale: 32.0 2023-06-23 17:17:50,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=127560.0, ans=0.2 2023-06-23 17:18:04,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=127626.66666666667, ans=0.125 2023-06-23 17:18:20,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=127693.33333333333, ans=0.125 2023-06-23 17:19:03,433 INFO [train.py:1008] (2/4) Epoch 37, batch 0, loss[loss=0.2407, simple_loss=0.2717, pruned_loss=0.1048, over 16921.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.2717, pruned_loss=0.1048, over 16921.00 frames. ], batch size: 392, lr: 8.96e-03, grad_scale: 32.0 2023-06-23 17:19:03,433 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 17:19:09,065 INFO [train.py:1040] (2/4) Epoch 37, validation: loss=0.1945, simple_loss=0.2907, pruned_loss=0.04917, over 143649.00 frames. 2023-06-23 17:19:09,066 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 17:19:14,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=127780.0, ans=0.0 2023-06-23 17:19:14,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=127780.0, ans=0.0 2023-06-23 17:19:33,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=127846.66666666667, ans=0.0 2023-06-23 17:19:40,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=127913.33333333333, ans=0.125 2023-06-23 17:19:49,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=127913.33333333333, ans=0.125 2023-06-23 17:19:57,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=127980.0, ans=0.125 2023-06-23 17:20:28,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=128046.66666666667, ans=0.09899494936611666 2023-06-23 17:20:31,322 INFO [train.py:1008] (2/4) Epoch 37, batch 50, loss[loss=0.2031, simple_loss=0.2853, pruned_loss=0.06051, over 18266.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2844, pruned_loss=0.06606, over 852870.15 frames. ], batch size: 74, lr: 8.95e-03, grad_scale: 16.0 2023-06-23 17:20:39,826 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.347e+02 1.710e+02 1.841e+02 2.087e+02 3.007e+02, threshold=3.682e+02, percent-clipped=0.0 2023-06-23 17:20:44,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=128113.33333333333, ans=0.0 2023-06-23 17:20:51,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=128180.0, ans=0.125 2023-06-23 17:20:52,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=128180.0, ans=0.125 2023-06-23 17:20:58,829 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=4.34 vs. 
limit=15.0 2023-06-23 17:21:04,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=128246.66666666667, ans=0.1 2023-06-23 17:21:20,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=128313.33333333333, ans=0.0 2023-06-23 17:21:45,667 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=16.53 vs. limit=22.5 2023-06-23 17:21:53,145 INFO [train.py:1008] (2/4) Epoch 37, batch 100, loss[loss=0.2279, simple_loss=0.3083, pruned_loss=0.07377, over 16971.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.283, pruned_loss=0.06527, over 1495444.01 frames. ], batch size: 60, lr: 8.94e-03, grad_scale: 16.0 2023-06-23 17:21:58,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=128446.66666666667, ans=0.125 2023-06-23 17:22:25,507 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.42 vs. limit=15.0 2023-06-23 17:22:26,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=128580.0, ans=0.0 2023-06-23 17:22:38,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=128580.0, ans=0.125 2023-06-23 17:22:58,334 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=8.562e-02 2023-06-23 17:23:17,073 INFO [train.py:1008] (2/4) Epoch 37, batch 150, loss[loss=0.2041, simple_loss=0.271, pruned_loss=0.06864, over 20479.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2825, pruned_loss=0.06639, over 1997362.32 frames. ], batch size: 189, lr: 8.93e-03, grad_scale: 16.0 2023-06-23 17:23:19,598 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=6.26 vs. limit=10.0 2023-06-23 17:23:22,842 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=512, metric=3.56 vs. limit=15.0 2023-06-23 17:23:25,129 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 1.708e+02 1.911e+02 2.291e+02 3.817e+02, threshold=3.822e+02, percent-clipped=2.0 2023-06-23 17:24:21,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=128980.0, ans=0.0 2023-06-23 17:24:29,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=129046.66666666667, ans=0.0 2023-06-23 17:24:29,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=129046.66666666667, ans=0.125 2023-06-23 17:24:40,632 INFO [train.py:1008] (2/4) Epoch 37, batch 200, loss[loss=0.2155, simple_loss=0.2987, pruned_loss=0.06614, over 17630.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2831, pruned_loss=0.06676, over 2376374.68 frames. ], batch size: 67, lr: 8.92e-03, grad_scale: 16.0 2023-06-23 17:25:33,545 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.48 vs. 
limit=15.0 2023-06-23 17:25:37,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=129313.33333333333, ans=0.0 2023-06-23 17:25:44,926 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0 2023-06-23 17:25:49,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=129380.0, ans=0.09899494936611666 2023-06-23 17:26:00,274 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:26:04,546 INFO [train.py:1008] (2/4) Epoch 37, batch 250, loss[loss=0.2033, simple_loss=0.2825, pruned_loss=0.06209, over 19218.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2819, pruned_loss=0.06582, over 2699926.33 frames. ], batch size: 92, lr: 8.91e-03, grad_scale: 16.0 2023-06-23 17:26:08,087 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:26:12,467 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.534e+02 1.920e+02 2.271e+02 2.637e+02 3.680e+02, threshold=4.543e+02, percent-clipped=0.0 2023-06-23 17:26:14,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=129446.66666666667, ans=0.125 2023-06-23 17:26:57,340 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.08 vs. limit=6.0 2023-06-23 17:26:59,200 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.64 vs. limit=22.5 2023-06-23 17:27:28,480 INFO [train.py:1008] (2/4) Epoch 37, batch 300, loss[loss=0.2157, simple_loss=0.2857, pruned_loss=0.0729, over 20104.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2821, pruned_loss=0.06682, over 2944559.40 frames. ], batch size: 133, lr: 8.90e-03, grad_scale: 16.0 2023-06-23 17:28:05,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=129913.33333333333, ans=0.125 2023-06-23 17:28:15,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=129913.33333333333, ans=0.125 2023-06-23 17:28:19,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=129980.0, ans=0.035 2023-06-23 17:28:23,377 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.99 vs. limit=10.0 2023-06-23 17:28:52,304 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=8.41 vs. limit=15.0 2023-06-23 17:28:53,090 INFO [train.py:1008] (2/4) Epoch 37, batch 350, loss[loss=0.2133, simple_loss=0.2874, pruned_loss=0.06957, over 18615.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.282, pruned_loss=0.06673, over 3126747.82 frames. 
], batch size: 80, lr: 8.89e-03, grad_scale: 16.0 2023-06-23 17:28:55,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=130113.33333333333, ans=0.125 2023-06-23 17:28:58,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=130113.33333333333, ans=0.125 2023-06-23 17:29:01,117 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.765e+02 1.983e+02 2.245e+02 4.016e+02, threshold=3.966e+02, percent-clipped=0.0 2023-06-23 17:29:17,409 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:29:25,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=130246.66666666667, ans=0.125 2023-06-23 17:29:52,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=130313.33333333333, ans=0.125 2023-06-23 17:30:17,145 INFO [train.py:1008] (2/4) Epoch 37, batch 400, loss[loss=0.2074, simple_loss=0.2758, pruned_loss=0.06949, over 20648.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2825, pruned_loss=0.06656, over 3247329.31 frames. ], batch size: 211, lr: 8.88e-03, grad_scale: 32.0 2023-06-23 17:30:22,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=130446.66666666667, ans=0.125 2023-06-23 17:30:35,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=130513.33333333333, ans=0.1 2023-06-23 17:31:09,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130646.66666666667, ans=0.1 2023-06-23 17:31:10,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=130646.66666666667, ans=0.0 2023-06-23 17:31:12,847 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:31:40,930 INFO [train.py:1008] (2/4) Epoch 37, batch 450, loss[loss=0.2139, simple_loss=0.2687, pruned_loss=0.07957, over 19943.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2821, pruned_loss=0.0664, over 3366407.72 frames. ], batch size: 294, lr: 8.87e-03, grad_scale: 32.0 2023-06-23 17:31:41,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=130780.0, ans=0.1 2023-06-23 17:31:49,222 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.360e+02 1.712e+02 1.928e+02 2.270e+02 2.975e+02, threshold=3.856e+02, percent-clipped=0.0 2023-06-23 17:31:51,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=130780.0, ans=0.125 2023-06-23 17:31:52,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=130780.0, ans=0.0 2023-06-23 17:32:43,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=130980.0, ans=0.0 2023-06-23 17:33:01,920 INFO [train.py:1008] (2/4) Epoch 37, batch 500, loss[loss=0.1966, simple_loss=0.277, pruned_loss=0.05812, over 19787.00 frames. 
], tot_loss[loss=0.2071, simple_loss=0.2822, pruned_loss=0.06595, over 3451276.26 frames. ], batch size: 115, lr: 8.86e-03, grad_scale: 32.0 2023-06-23 17:33:06,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=131113.33333333334, ans=0.125 2023-06-23 17:33:09,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=131113.33333333334, ans=0.125 2023-06-23 17:33:10,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=131113.33333333334, ans=0.2 2023-06-23 17:33:30,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=131180.0, ans=0.0 2023-06-23 17:33:38,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=131246.66666666666, ans=0.04949747468305833 2023-06-23 17:34:13,570 INFO [train.py:1008] (2/4) Epoch 38, batch 0, loss[loss=0.2051, simple_loss=0.2669, pruned_loss=0.07166, over 20274.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2669, pruned_loss=0.07166, over 20274.00 frames. ], batch size: 239, lr: 8.73e-03, grad_scale: 32.0 2023-06-23 17:34:13,570 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 17:34:19,207 INFO [train.py:1040] (2/4) Epoch 38, validation: loss=0.1953, simple_loss=0.2909, pruned_loss=0.04986, over 143649.00 frames. 2023-06-23 17:34:19,208 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 17:34:27,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=131326.66666666666, ans=0.2 2023-06-23 17:34:33,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=131393.33333333334, ans=0.1 2023-06-23 17:34:46,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=131393.33333333334, ans=0.1 2023-06-23 17:34:52,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=131460.0, ans=0.125 2023-06-23 17:34:56,688 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.327e+02 1.790e+02 1.941e+02 2.313e+02 3.641e+02, threshold=3.881e+02, percent-clipped=0.0 2023-06-23 17:34:58,506 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.57 vs. limit=10.0 2023-06-23 17:35:42,559 INFO [train.py:1008] (2/4) Epoch 38, batch 50, loss[loss=0.2402, simple_loss=0.3185, pruned_loss=0.08097, over 16469.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2824, pruned_loss=0.06655, over 853401.67 frames. ], batch size: 52, lr: 8.72e-03, grad_scale: 32.0 2023-06-23 17:36:08,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=131726.66666666666, ans=0.0 2023-06-23 17:36:54,746 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-23 17:37:04,494 INFO [train.py:1008] (2/4) Epoch 38, batch 100, loss[loss=0.1827, simple_loss=0.2619, pruned_loss=0.05182, over 19546.00 frames. 
], tot_loss[loss=0.2049, simple_loss=0.28, pruned_loss=0.06487, over 1512621.17 frames. ], batch size: 102, lr: 8.71e-03, grad_scale: 32.0 2023-06-23 17:37:25,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=132060.0, ans=0.2 2023-06-23 17:37:26,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=132060.0, ans=0.1 2023-06-23 17:37:43,087 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.457e+02 1.740e+02 1.898e+02 2.194e+02 3.363e+02, threshold=3.797e+02, percent-clipped=0.0 2023-06-23 17:38:06,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=132193.33333333334, ans=0.125 2023-06-23 17:38:28,449 INFO [train.py:1008] (2/4) Epoch 38, batch 150, loss[loss=0.222, simple_loss=0.2896, pruned_loss=0.07723, over 20259.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2817, pruned_loss=0.06641, over 2013036.90 frames. ], batch size: 141, lr: 8.70e-03, grad_scale: 32.0 2023-06-23 17:38:36,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=132326.66666666666, ans=0.0 2023-06-23 17:39:10,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=132460.0, ans=0.0 2023-06-23 17:39:16,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=132526.66666666666, ans=0.025 2023-06-23 17:39:26,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=132526.66666666666, ans=0.125 2023-06-23 17:39:28,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=132526.66666666666, ans=0.125 2023-06-23 17:39:44,378 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=18.73 vs. limit=22.5 2023-06-23 17:39:51,931 INFO [train.py:1008] (2/4) Epoch 38, batch 200, loss[loss=0.1942, simple_loss=0.2783, pruned_loss=0.05506, over 18793.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2816, pruned_loss=0.06581, over 2391688.66 frames. ], batch size: 83, lr: 8.69e-03, grad_scale: 32.0 2023-06-23 17:40:11,443 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-23 17:40:19,088 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.19 vs. limit=15.0 2023-06-23 17:40:30,917 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.758e+02 1.987e+02 2.155e+02 3.076e+02, threshold=3.973e+02, percent-clipped=0.0 2023-06-23 17:40:55,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=132860.0, ans=0.1 2023-06-23 17:41:16,078 INFO [train.py:1008] (2/4) Epoch 38, batch 250, loss[loss=0.2066, simple_loss=0.2805, pruned_loss=0.0663, over 19530.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.28, pruned_loss=0.06516, over 2713498.37 frames. 
], batch size: 102, lr: 8.68e-03, grad_scale: 32.0 2023-06-23 17:42:15,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=133193.33333333334, ans=0.05 2023-06-23 17:42:30,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=133260.0, ans=0.1 2023-06-23 17:42:35,086 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=7.81 vs. limit=10.0 2023-06-23 17:42:41,907 INFO [train.py:1008] (2/4) Epoch 38, batch 300, loss[loss=0.1912, simple_loss=0.2748, pruned_loss=0.05387, over 19826.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2796, pruned_loss=0.06527, over 2949005.85 frames. ], batch size: 115, lr: 8.67e-03, grad_scale: 32.0 2023-06-23 17:42:56,086 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=384, metric=18.51 vs. limit=22.5 2023-06-23 17:43:04,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=133393.33333333334, ans=0.1 2023-06-23 17:43:23,644 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.417e+02 1.801e+02 1.994e+02 2.321e+02 3.312e+02, threshold=3.988e+02, percent-clipped=0.0 2023-06-23 17:43:24,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.96 vs. limit=15.0 2023-06-23 17:43:27,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=133460.0, ans=0.125 2023-06-23 17:43:29,032 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=3.359e-02 2023-06-23 17:43:55,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=133593.33333333334, ans=0.07 2023-06-23 17:44:05,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=133593.33333333334, ans=0.2 2023-06-23 17:44:08,564 INFO [train.py:1008] (2/4) Epoch 38, batch 350, loss[loss=0.214, simple_loss=0.2952, pruned_loss=0.06639, over 18468.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2794, pruned_loss=0.06553, over 3115626.04 frames. ], batch size: 77, lr: 8.66e-03, grad_scale: 32.0 2023-06-23 17:44:29,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=133726.66666666666, ans=0.125 2023-06-23 17:44:47,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=133793.33333333334, ans=0.0 2023-06-23 17:45:03,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=133860.0, ans=0.1 2023-06-23 17:45:15,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=133926.66666666666, ans=0.0 2023-06-23 17:45:32,591 INFO [train.py:1008] (2/4) Epoch 38, batch 400, loss[loss=0.1914, simple_loss=0.2718, pruned_loss=0.05554, over 19525.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2792, pruned_loss=0.06524, over 3275884.02 frames. 
], batch size: 102, lr: 8.65e-03, grad_scale: 32.0 2023-06-23 17:45:54,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=134060.0, ans=0.125 2023-06-23 17:46:08,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=134126.66666666666, ans=0.0 2023-06-23 17:46:11,386 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.397e+02 1.765e+02 1.961e+02 2.163e+02 2.989e+02, threshold=3.921e+02, percent-clipped=0.0 2023-06-23 17:46:26,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=134193.33333333334, ans=0.1 2023-06-23 17:46:40,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=134260.0, ans=22.5 2023-06-23 17:46:48,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=134260.0, ans=0.125 2023-06-23 17:46:50,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=134260.0, ans=0.0 2023-06-23 17:46:55,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=134326.66666666666, ans=0.0 2023-06-23 17:46:56,484 INFO [train.py:1008] (2/4) Epoch 38, batch 450, loss[loss=0.204, simple_loss=0.2836, pruned_loss=0.06214, over 18456.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2791, pruned_loss=0.06545, over 3393651.02 frames. ], batch size: 77, lr: 8.65e-03, grad_scale: 32.0 2023-06-23 17:46:59,061 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.96 vs. limit=15.0 2023-06-23 17:47:03,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=134326.66666666666, ans=0.125 2023-06-23 17:48:06,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=134593.33333333334, ans=0.2 2023-06-23 17:48:17,196 INFO [train.py:1008] (2/4) Epoch 38, batch 500, loss[loss=0.2138, simple_loss=0.2982, pruned_loss=0.06471, over 17572.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2786, pruned_loss=0.065, over 3489281.80 frames. ], batch size: 67, lr: 8.64e-03, grad_scale: 32.0 2023-06-23 17:48:33,983 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=8.45 vs. limit=15.0 2023-06-23 17:48:53,622 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.506e+02 1.690e+02 1.875e+02 2.333e+02 3.146e+02, threshold=3.750e+02, percent-clipped=0.0 2023-06-23 17:49:03,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=134860.0, ans=0.1 2023-06-23 17:49:28,570 INFO [train.py:1008] (2/4) Epoch 39, batch 0, loss[loss=0.209, simple_loss=0.2877, pruned_loss=0.06514, over 19647.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2877, pruned_loss=0.06514, over 19647.00 frames. 
], batch size: 110, lr: 8.52e-03, grad_scale: 32.0 2023-06-23 17:49:28,570 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 17:49:34,252 INFO [train.py:1040] (2/4) Epoch 39, validation: loss=0.1961, simple_loss=0.2911, pruned_loss=0.05057, over 143649.00 frames. 2023-06-23 17:49:34,252 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 17:50:07,762 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=18.90 vs. limit=22.5 2023-06-23 17:50:26,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=135073.33333333334, ans=0.1 2023-06-23 17:50:31,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=135073.33333333334, ans=0.125 2023-06-23 17:50:35,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=135073.33333333334, ans=0.2 2023-06-23 17:50:57,626 INFO [train.py:1008] (2/4) Epoch 39, batch 50, loss[loss=0.1978, simple_loss=0.2771, pruned_loss=0.05924, over 19470.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2787, pruned_loss=0.06383, over 873392.34 frames. ], batch size: 105, lr: 8.51e-03, grad_scale: 32.0 2023-06-23 17:51:07,369 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.53 vs. limit=15.0 2023-06-23 17:51:12,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=135273.33333333334, ans=0.0 2023-06-23 17:52:05,155 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.463e+02 1.783e+02 2.011e+02 2.337e+02 3.361e+02, threshold=4.022e+02, percent-clipped=0.0 2023-06-23 17:52:21,073 INFO [train.py:1008] (2/4) Epoch 39, batch 100, loss[loss=0.1982, simple_loss=0.2792, pruned_loss=0.05859, over 18288.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2783, pruned_loss=0.06461, over 1524853.37 frames. ], batch size: 74, lr: 8.50e-03, grad_scale: 32.0 2023-06-23 17:52:34,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=135540.0, ans=0.04949747468305833 2023-06-23 17:52:38,512 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.69 vs. limit=6.0 2023-06-23 17:52:39,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=135606.66666666666, ans=10.0 2023-06-23 17:53:39,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=135806.66666666666, ans=0.1 2023-06-23 17:53:43,306 INFO [train.py:1008] (2/4) Epoch 39, batch 150, loss[loss=0.2095, simple_loss=0.2941, pruned_loss=0.06244, over 16285.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2804, pruned_loss=0.06433, over 2004999.70 frames. 
], batch size: 52, lr: 8.49e-03, grad_scale: 32.0 2023-06-23 17:53:54,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=135873.33333333334, ans=0.0 2023-06-23 17:53:56,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=135873.33333333334, ans=0.125 2023-06-23 17:53:59,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=135940.0, ans=0.0 2023-06-23 17:54:45,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=136073.33333333334, ans=0.125 2023-06-23 17:54:52,780 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.284e+02 1.711e+02 1.880e+02 2.080e+02 3.230e+02, threshold=3.760e+02, percent-clipped=0.0 2023-06-23 17:55:01,855 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.32 vs. limit=15.0 2023-06-23 17:55:07,140 INFO [train.py:1008] (2/4) Epoch 39, batch 200, loss[loss=0.1964, simple_loss=0.2795, pruned_loss=0.05664, over 19850.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2789, pruned_loss=0.06323, over 2403219.27 frames. ], batch size: 115, lr: 8.48e-03, grad_scale: 32.0 2023-06-23 17:55:20,760 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.43 vs. limit=15.0 2023-06-23 17:55:55,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.71 vs. limit=12.0 2023-06-23 17:56:05,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=136406.66666666666, ans=0.125 2023-06-23 17:56:23,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=136473.33333333334, ans=0.125 2023-06-23 17:56:31,278 INFO [train.py:1008] (2/4) Epoch 39, batch 250, loss[loss=0.1919, simple_loss=0.2736, pruned_loss=0.05506, over 19463.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2798, pruned_loss=0.06327, over 2694138.32 frames. 
], batch size: 105, lr: 8.47e-03, grad_scale: 32.0 2023-06-23 17:56:50,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=136606.66666666666, ans=0.125 2023-06-23 17:56:50,332 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:57:11,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=136673.33333333334, ans=0.0 2023-06-23 17:57:11,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=136673.33333333334, ans=0.0 2023-06-23 17:57:15,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=136673.33333333334, ans=0.125 2023-06-23 17:57:19,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=136740.0, ans=0.5 2023-06-23 17:57:25,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=136740.0, ans=0.0 2023-06-23 17:57:40,022 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.286e+02 1.691e+02 1.862e+02 2.348e+02 3.821e+02, threshold=3.724e+02, percent-clipped=1.0 2023-06-23 17:57:54,445 INFO [train.py:1008] (2/4) Epoch 39, batch 300, loss[loss=0.2065, simple_loss=0.2481, pruned_loss=0.08246, over 16606.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2797, pruned_loss=0.06344, over 2940799.26 frames. ], batch size: 392, lr: 8.46e-03, grad_scale: 32.0 2023-06-23 17:58:17,283 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.12 vs. limit=12.0 2023-06-23 17:58:19,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=136940.0, ans=0.125 2023-06-23 17:58:40,669 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.98 vs. limit=15.0 2023-06-23 17:58:50,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=137073.33333333334, ans=0.125 2023-06-23 17:59:15,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=137140.0, ans=0.0 2023-06-23 17:59:18,322 INFO [train.py:1008] (2/4) Epoch 39, batch 350, loss[loss=0.207, simple_loss=0.2695, pruned_loss=0.07219, over 20234.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2801, pruned_loss=0.06355, over 3118959.90 frames. 
], batch size: 239, lr: 8.45e-03, grad_scale: 32.0 2023-06-23 17:59:20,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=137206.66666666666, ans=0.2 2023-06-23 17:59:32,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=137206.66666666666, ans=0.1 2023-06-23 17:59:39,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=137273.33333333334, ans=0.1 2023-06-23 17:59:40,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=137273.33333333334, ans=0.0 2023-06-23 17:59:44,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=512, metric=17.86 vs. limit=22.5 2023-06-23 17:59:58,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=137340.0, ans=0.125 2023-06-23 18:00:22,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=4.79 vs. limit=15.0 2023-06-23 18:00:24,267 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.71 vs. limit=12.0 2023-06-23 18:00:27,886 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.398e+02 1.736e+02 1.973e+02 2.243e+02 3.453e+02, threshold=3.946e+02, percent-clipped=0.0 2023-06-23 18:00:43,648 INFO [train.py:1008] (2/4) Epoch 39, batch 400, loss[loss=0.2053, simple_loss=0.2735, pruned_loss=0.06855, over 19941.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2795, pruned_loss=0.06382, over 3282593.31 frames. ], batch size: 126, lr: 8.44e-03, grad_scale: 32.0 2023-06-23 18:00:44,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.21 vs. limit=6.0 2023-06-23 18:01:00,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=137606.66666666666, ans=0.1 2023-06-23 18:01:44,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=137740.0, ans=0.2 2023-06-23 18:02:08,799 INFO [train.py:1008] (2/4) Epoch 39, batch 450, loss[loss=0.197, simple_loss=0.2652, pruned_loss=0.06442, over 20641.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2791, pruned_loss=0.06439, over 3403202.05 frames. 
], batch size: 211, lr: 8.44e-03, grad_scale: 32.0 2023-06-23 18:02:10,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=137873.33333333334, ans=0.2 2023-06-23 18:02:15,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=137873.33333333334, ans=0.125 2023-06-23 18:02:36,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=137940.0, ans=0.0 2023-06-23 18:03:05,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=138073.33333333334, ans=0.0 2023-06-23 18:03:14,782 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.813e+02 2.153e+02 2.467e+02 3.224e+02, threshold=4.306e+02, percent-clipped=0.0 2023-06-23 18:03:24,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=138140.0, ans=0.125 2023-06-23 18:03:28,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=138206.66666666666, ans=0.125 2023-06-23 18:03:29,410 INFO [train.py:1008] (2/4) Epoch 39, batch 500, loss[loss=0.1959, simple_loss=0.2735, pruned_loss=0.05919, over 18782.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2796, pruned_loss=0.0641, over 3488256.76 frames. ], batch size: 83, lr: 8.43e-03, grad_scale: 32.0 2023-06-23 18:03:34,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0 2023-06-23 18:03:49,263 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.23 vs. limit=10.0 2023-06-23 18:04:15,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=138406.66666666666, ans=0.1 2023-06-23 18:04:43,030 INFO [train.py:1008] (2/4) Epoch 40, batch 0, loss[loss=0.2138, simple_loss=0.2913, pruned_loss=0.06819, over 19821.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2913, pruned_loss=0.06819, over 19821.00 frames. ], batch size: 115, lr: 8.31e-03, grad_scale: 32.0 2023-06-23 18:04:43,031 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 18:04:48,658 INFO [train.py:1040] (2/4) Epoch 40, validation: loss=0.198, simple_loss=0.2915, pruned_loss=0.05224, over 143649.00 frames. 2023-06-23 18:04:48,659 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 18:05:33,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=138553.33333333334, ans=0.125 2023-06-23 18:05:40,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=138620.0, ans=0.1 2023-06-23 18:05:41,926 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.38 vs. limit=22.5 2023-06-23 18:05:52,474 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=10.68 vs. 
limit=15.0 2023-06-23 18:05:58,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=138686.66666666666, ans=0.125 2023-06-23 18:06:12,688 INFO [train.py:1008] (2/4) Epoch 40, batch 50, loss[loss=0.1953, simple_loss=0.2734, pruned_loss=0.05864, over 19695.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2787, pruned_loss=0.06389, over 851679.67 frames. ], batch size: 110, lr: 8.31e-03, grad_scale: 32.0 2023-06-23 18:06:20,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.prob, batch_count=138753.33333333334, ans=0.125 2023-06-23 18:06:26,971 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.446e+02 1.678e+02 1.899e+02 2.125e+02 3.076e+02, threshold=3.798e+02, percent-clipped=0.0 2023-06-23 18:06:59,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=138886.66666666666, ans=0.0 2023-06-23 18:07:01,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=138953.33333333334, ans=0.1 2023-06-23 18:07:36,458 INFO [train.py:1008] (2/4) Epoch 40, batch 100, loss[loss=0.2169, simple_loss=0.3059, pruned_loss=0.06391, over 18300.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2774, pruned_loss=0.0629, over 1502735.30 frames. ], batch size: 72, lr: 8.30e-03, grad_scale: 16.0 2023-06-23 18:07:49,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=139086.66666666666, ans=0.0 2023-06-23 18:08:02,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=139153.33333333334, ans=0.125 2023-06-23 18:08:20,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=139220.0, ans=0.0 2023-06-23 18:08:40,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=139353.33333333334, ans=0.0 2023-06-23 18:08:43,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=139353.33333333334, ans=0.125 2023-06-23 18:08:52,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=19.73 vs. limit=22.5 2023-06-23 18:08:55,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.scale_min, batch_count=139353.33333333334, ans=0.2 2023-06-23 18:08:59,323 INFO [train.py:1008] (2/4) Epoch 40, batch 150, loss[loss=0.2076, simple_loss=0.2871, pruned_loss=0.0641, over 16363.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2797, pruned_loss=0.06425, over 1998759.22 frames. 
], batch size: 52, lr: 8.29e-03, grad_scale: 16.0 2023-06-23 18:08:59,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=139420.0, ans=0.125 2023-06-23 18:08:59,647 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:08:59,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=139420.0, ans=0.0 2023-06-23 18:09:09,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=139420.0, ans=0.125 2023-06-23 18:09:15,501 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.551e+02 1.764e+02 1.962e+02 2.283e+02 3.805e+02, threshold=3.924e+02, percent-clipped=1.0 2023-06-23 18:09:15,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=139486.66666666666, ans=0.125 2023-06-23 18:09:15,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=139486.66666666666, ans=0.025 2023-06-23 18:09:54,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=139620.0, ans=0.0 2023-06-23 18:09:54,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=139620.0, ans=0.125 2023-06-23 18:10:21,916 INFO [train.py:1008] (2/4) Epoch 40, batch 200, loss[loss=0.1907, simple_loss=0.2726, pruned_loss=0.05445, over 18288.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2793, pruned_loss=0.06433, over 2406605.58 frames. ], batch size: 74, lr: 8.28e-03, grad_scale: 16.0 2023-06-23 18:10:28,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=139753.33333333334, ans=0.04949747468305833 2023-06-23 18:10:34,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=139753.33333333334, ans=0.125 2023-06-23 18:10:48,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=139820.0, ans=0.1 2023-06-23 18:11:26,735 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:11:44,628 INFO [train.py:1008] (2/4) Epoch 40, batch 250, loss[loss=0.2, simple_loss=0.2765, pruned_loss=0.06171, over 18614.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2786, pruned_loss=0.06477, over 2704730.08 frames. ], batch size: 80, lr: 8.27e-03, grad_scale: 16.0 2023-06-23 18:11:53,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=140086.66666666666, ans=0.0 2023-06-23 18:12:01,726 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.501e+02 1.803e+02 1.945e+02 2.158e+02 3.592e+02, threshold=3.889e+02, percent-clipped=0.0 2023-06-23 18:12:13,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=140153.33333333334, ans=0.025 2023-06-23 18:12:13,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=10.18 vs. 
limit=15.0 2023-06-23 18:12:16,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=140220.0, ans=0.125 2023-06-23 18:12:36,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=140286.66666666666, ans=0.0 2023-06-23 18:12:41,458 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=8.35 vs. limit=15.0 2023-06-23 18:12:45,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=140286.66666666666, ans=0.125 2023-06-23 18:12:48,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=140286.66666666666, ans=0.125 2023-06-23 18:13:02,219 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.66 vs. limit=22.5 2023-06-23 18:13:08,096 INFO [train.py:1008] (2/4) Epoch 40, batch 300, loss[loss=0.1878, simple_loss=0.267, pruned_loss=0.05433, over 19517.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2776, pruned_loss=0.06424, over 2958610.57 frames. ], batch size: 102, lr: 8.26e-03, grad_scale: 16.0 2023-06-23 18:13:15,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=140420.0, ans=0.04949747468305833 2023-06-23 18:13:24,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=140486.66666666666, ans=0.125 2023-06-23 18:13:38,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=140486.66666666666, ans=0.0 2023-06-23 18:13:44,137 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=6.24 vs. limit=15.0 2023-06-23 18:13:46,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=140553.33333333334, ans=0.2 2023-06-23 18:14:30,696 INFO [train.py:1008] (2/4) Epoch 40, batch 350, loss[loss=0.2098, simple_loss=0.2958, pruned_loss=0.06191, over 15586.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2767, pruned_loss=0.06403, over 3147040.05 frames. 
], batch size: 44, lr: 8.25e-03, grad_scale: 16.0 2023-06-23 18:14:48,205 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.452e+02 1.677e+02 1.868e+02 2.086e+02 3.212e+02, threshold=3.736e+02, percent-clipped=0.0 2023-06-23 18:14:55,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=140820.0, ans=0.1 2023-06-23 18:15:12,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=140886.66666666666, ans=0.125 2023-06-23 18:15:22,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=140953.33333333334, ans=0.2 2023-06-23 18:15:41,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=141020.0, ans=0.0 2023-06-23 18:15:44,079 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:15:53,143 INFO [train.py:1008] (2/4) Epoch 40, batch 400, loss[loss=0.2069, simple_loss=0.2918, pruned_loss=0.06094, over 16957.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2771, pruned_loss=0.0637, over 3301724.82 frames. ], batch size: 60, lr: 8.24e-03, grad_scale: 32.0 2023-06-23 18:15:54,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff2_skip_rate, batch_count=141086.66666666666, ans=0.0 2023-06-23 18:15:55,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=141086.66666666666, ans=0.2 2023-06-23 18:16:00,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=141086.66666666666, ans=0.2 2023-06-23 18:16:17,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=141153.33333333334, ans=0.0 2023-06-23 18:16:28,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=141220.0, ans=0.125 2023-06-23 18:17:15,517 INFO [train.py:1008] (2/4) Epoch 40, batch 450, loss[loss=0.2023, simple_loss=0.2728, pruned_loss=0.06596, over 19938.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2777, pruned_loss=0.06384, over 3403580.89 frames. ], batch size: 126, lr: 8.24e-03, grad_scale: 32.0 2023-06-23 18:17:22,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=141420.0, ans=0.1 2023-06-23 18:17:32,350 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.729e+02 1.897e+02 2.097e+02 3.976e+02, threshold=3.793e+02, percent-clipped=1.0 2023-06-23 18:17:55,256 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=11.72 vs. 
limit=15.0 2023-06-23 18:17:56,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=141553.33333333334, ans=0.0 2023-06-23 18:17:59,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=141553.33333333334, ans=0.125 2023-06-23 18:18:03,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=141620.0, ans=0.5 2023-06-23 18:18:33,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=141686.66666666666, ans=0.125 2023-06-23 18:18:35,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=141753.33333333334, ans=0.125 2023-06-23 18:18:36,744 INFO [train.py:1008] (2/4) Epoch 40, batch 500, loss[loss=0.2092, simple_loss=0.2642, pruned_loss=0.07709, over 19925.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2778, pruned_loss=0.06372, over 3490759.55 frames. ], batch size: 293, lr: 8.23e-03, grad_scale: 32.0 2023-06-23 18:18:40,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=141753.33333333334, ans=0.1 2023-06-23 18:18:46,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=141753.33333333334, ans=0.0 2023-06-23 18:19:05,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=141820.0, ans=0.125 2023-06-23 18:19:07,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=141886.66666666666, ans=0.025 2023-06-23 18:19:10,411 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.78 vs. limit=12.0 2023-06-23 18:19:19,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=141886.66666666666, ans=0.0 2023-06-23 18:19:20,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=141886.66666666666, ans=0.0 2023-06-23 18:19:45,996 INFO [train.py:1008] (2/4) Epoch 41, batch 0, loss[loss=0.2087, simple_loss=0.2911, pruned_loss=0.06311, over 16401.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2911, pruned_loss=0.06311, over 16401.00 frames. ], batch size: 52, lr: 8.12e-03, grad_scale: 32.0 2023-06-23 18:19:45,996 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 18:19:51,589 INFO [train.py:1040] (2/4) Epoch 41, validation: loss=0.1955, simple_loss=0.2897, pruned_loss=0.05062, over 143649.00 frames. 2023-06-23 18:19:51,590 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 18:20:13,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=142040.0, ans=0.0 2023-06-23 18:20:24,447 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.78 vs. 
limit=6.0 2023-06-23 18:20:35,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=142106.66666666666, ans=0.0 2023-06-23 18:20:36,136 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.433e+02 1.747e+02 2.012e+02 2.296e+02 3.578e+02, threshold=4.024e+02, percent-clipped=0.0 2023-06-23 18:20:39,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=142173.33333333334, ans=0.125 2023-06-23 18:21:01,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=142240.0, ans=0.125 2023-06-23 18:21:13,876 INFO [train.py:1008] (2/4) Epoch 41, batch 50, loss[loss=0.1933, simple_loss=0.2724, pruned_loss=0.05706, over 19470.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2764, pruned_loss=0.06273, over 859703.42 frames. ], batch size: 105, lr: 8.11e-03, grad_scale: 32.0 2023-06-23 18:21:39,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=142373.33333333334, ans=0.125 2023-06-23 18:21:59,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=142440.0, ans=0.0 2023-06-23 18:22:30,102 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=384, metric=5.54 vs. limit=15.0 2023-06-23 18:22:32,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=142573.33333333334, ans=0.125 2023-06-23 18:22:36,794 INFO [train.py:1008] (2/4) Epoch 41, batch 100, loss[loss=0.2058, simple_loss=0.2761, pruned_loss=0.06775, over 19943.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2763, pruned_loss=0.06227, over 1516003.58 frames. ], batch size: 126, lr: 8.10e-03, grad_scale: 32.0 2023-06-23 18:23:03,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=142706.66666666666, ans=0.025 2023-06-23 18:23:09,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=142773.33333333334, ans=0.0 2023-06-23 18:23:17,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=142773.33333333334, ans=0.1 2023-06-23 18:23:22,005 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.493e+02 1.755e+02 2.017e+02 2.293e+02 3.854e+02, threshold=4.034e+02, percent-clipped=0.0 2023-06-23 18:23:24,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=142773.33333333334, ans=0.0 2023-06-23 18:23:24,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.20 vs. 
limit=15.0 2023-06-23 18:23:33,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=142840.0, ans=0.125 2023-06-23 18:23:56,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=142906.66666666666, ans=0.125 2023-06-23 18:24:00,507 INFO [train.py:1008] (2/4) Epoch 41, batch 150, loss[loss=0.1864, simple_loss=0.2656, pruned_loss=0.05356, over 19066.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2774, pruned_loss=0.0625, over 2022197.48 frames. ], batch size: 89, lr: 8.09e-03, grad_scale: 32.0 2023-06-23 18:24:31,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=143106.66666666666, ans=0.0 2023-06-23 18:24:50,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=143173.33333333334, ans=0.0 2023-06-23 18:24:52,882 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-23 18:25:22,958 INFO [train.py:1008] (2/4) Epoch 41, batch 200, loss[loss=0.1901, simple_loss=0.2631, pruned_loss=0.05853, over 20571.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2772, pruned_loss=0.06272, over 2393384.80 frames. ], batch size: 189, lr: 8.09e-03, grad_scale: 32.0 2023-06-23 18:25:57,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=143440.0, ans=0.0 2023-06-23 18:26:04,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=143440.0, ans=0.0 2023-06-23 18:26:07,057 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.429e+02 1.697e+02 1.861e+02 2.042e+02 2.751e+02, threshold=3.722e+02, percent-clipped=0.0 2023-06-23 18:26:26,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=143573.33333333334, ans=0.125 2023-06-23 18:26:44,769 INFO [train.py:1008] (2/4) Epoch 41, batch 250, loss[loss=0.1992, simple_loss=0.2783, pruned_loss=0.06011, over 19102.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2772, pruned_loss=0.06223, over 2702083.19 frames. ], batch size: 89, lr: 8.08e-03, grad_scale: 32.0 2023-06-23 18:27:44,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=143840.0, ans=0.0 2023-06-23 18:28:08,509 INFO [train.py:1008] (2/4) Epoch 41, batch 300, loss[loss=0.2008, simple_loss=0.2747, pruned_loss=0.06344, over 20284.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2763, pruned_loss=0.06304, over 2951255.96 frames. ], batch size: 141, lr: 8.07e-03, grad_scale: 32.0 2023-06-23 18:28:30,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=144040.0, ans=0.0 2023-06-23 18:28:53,829 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.647e+02 1.872e+02 2.166e+02 2.965e+02, threshold=3.744e+02, percent-clipped=0.0 2023-06-23 18:29:04,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer2.prob, batch_count=144173.33333333334, ans=0.125 2023-06-23 18:29:31,884 INFO [train.py:1008] (2/4) Epoch 41, batch 350, loss[loss=0.2241, simple_loss=0.3094, pruned_loss=0.06936, over 16369.00 frames. 
], tot_loss[loss=0.2007, simple_loss=0.2764, pruned_loss=0.06246, over 3146356.09 frames. ], batch size: 52, lr: 8.06e-03, grad_scale: 32.0 2023-06-23 18:29:32,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=144306.66666666666, ans=0.125 2023-06-23 18:29:33,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=144306.66666666666, ans=0.0 2023-06-23 18:29:37,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=144306.66666666666, ans=0.125 2023-06-23 18:30:09,143 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.37 vs. limit=22.5 2023-06-23 18:30:15,521 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.10 vs. limit=15.0 2023-06-23 18:30:18,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=144440.0, ans=0.025 2023-06-23 18:30:24,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=144506.66666666666, ans=0.0 2023-06-23 18:30:25,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=144506.66666666666, ans=0.0 2023-06-23 18:30:38,446 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=7.27 vs. limit=15.0 2023-06-23 18:30:42,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=144573.33333333334, ans=0.1 2023-06-23 18:30:49,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=144573.33333333334, ans=0.125 2023-06-23 18:30:51,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=144573.33333333334, ans=0.125 2023-06-23 18:30:51,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=144573.33333333334, ans=0.1 2023-06-23 18:30:56,082 INFO [train.py:1008] (2/4) Epoch 41, batch 400, loss[loss=0.2189, simple_loss=0.2884, pruned_loss=0.07466, over 20257.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2762, pruned_loss=0.06238, over 3279117.68 frames. ], batch size: 141, lr: 8.05e-03, grad_scale: 32.0 2023-06-23 18:30:56,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module2.balancer1.prob, batch_count=144640.0, ans=0.125 2023-06-23 18:31:24,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=144706.66666666666, ans=0.05 2023-06-23 18:31:41,102 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.489e+02 1.737e+02 1.969e+02 2.223e+02 3.144e+02, threshold=3.939e+02, percent-clipped=0.0 2023-06-23 18:31:58,989 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.30 vs. 
limit=15.0 2023-06-23 18:32:19,918 INFO [train.py:1008] (2/4) Epoch 41, batch 450, loss[loss=0.2122, simple_loss=0.2895, pruned_loss=0.0675, over 18509.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.276, pruned_loss=0.06219, over 3393090.26 frames. ], batch size: 77, lr: 8.04e-03, grad_scale: 32.0 2023-06-23 18:32:45,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=145040.0, ans=0.125 2023-06-23 18:32:54,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=145106.66666666666, ans=0.125 2023-06-23 18:33:26,861 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=9.13 vs. limit=15.0 2023-06-23 18:33:40,003 INFO [train.py:1008] (2/4) Epoch 41, batch 500, loss[loss=0.208, simple_loss=0.2905, pruned_loss=0.0628, over 18292.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2777, pruned_loss=0.06248, over 3473641.25 frames. ], batch size: 74, lr: 8.04e-03, grad_scale: 32.0 2023-06-23 18:33:41,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=145306.66666666666, ans=0.0 2023-06-23 18:33:46,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=145306.66666666666, ans=0.1 2023-06-23 18:34:21,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=145440.0, ans=0.1 2023-06-23 18:34:22,580 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.373e+02 1.720e+02 1.909e+02 2.113e+02 2.699e+02, threshold=3.817e+02, percent-clipped=0.0 2023-06-23 18:34:51,173 INFO [train.py:1008] (2/4) Epoch 42, batch 0, loss[loss=0.1897, simple_loss=0.2652, pruned_loss=0.05712, over 20285.00 frames. ], tot_loss[loss=0.1897, simple_loss=0.2652, pruned_loss=0.05712, over 20285.00 frames. ], batch size: 149, lr: 7.93e-03, grad_scale: 32.0 2023-06-23 18:34:51,174 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 18:34:56,872 INFO [train.py:1040] (2/4) Epoch 42, validation: loss=0.1951, simple_loss=0.2892, pruned_loss=0.05045, over 143649.00 frames. 2023-06-23 18:34:56,872 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 18:35:08,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=145520.0, ans=0.125 2023-06-23 18:35:20,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=145586.66666666666, ans=10.0 2023-06-23 18:35:20,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=145586.66666666666, ans=10.0 2023-06-23 18:35:31,367 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=11.05 vs. limit=15.0 2023-06-23 18:36:19,524 INFO [train.py:1008] (2/4) Epoch 42, batch 50, loss[loss=0.1931, simple_loss=0.2726, pruned_loss=0.05677, over 19225.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2763, pruned_loss=0.06279, over 849021.14 frames. 
], batch size: 92, lr: 7.93e-03, grad_scale: 32.0 2023-06-23 18:36:29,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=145853.33333333334, ans=0.0 2023-06-23 18:36:38,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=145920.0, ans=0.125 2023-06-23 18:37:15,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=146053.33333333334, ans=0.125 2023-06-23 18:37:21,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=146053.33333333334, ans=0.125 2023-06-23 18:37:30,774 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=4.60 vs. limit=12.0 2023-06-23 18:37:32,829 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.363e+02 1.693e+02 1.862e+02 2.191e+02 3.305e+02, threshold=3.724e+02, percent-clipped=0.0 2023-06-23 18:37:41,589 INFO [train.py:1008] (2/4) Epoch 42, batch 100, loss[loss=0.1934, simple_loss=0.2766, pruned_loss=0.05507, over 19339.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2752, pruned_loss=0.06131, over 1514276.52 frames. ], batch size: 98, lr: 7.92e-03, grad_scale: 32.0 2023-06-23 18:37:56,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=146253.33333333334, ans=0.0 2023-06-23 18:38:30,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=146386.66666666666, ans=0.2 2023-06-23 18:39:02,333 INFO [train.py:1008] (2/4) Epoch 42, batch 150, loss[loss=0.178, simple_loss=0.2583, pruned_loss=0.04883, over 18934.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2755, pruned_loss=0.06125, over 2024373.92 frames. ], batch size: 86, lr: 7.91e-03, grad_scale: 16.0 2023-06-23 18:39:02,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=146520.0, ans=0.125 2023-06-23 18:39:21,986 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=7.56 vs. limit=15.0 2023-06-23 18:39:30,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=146586.66666666666, ans=0.125 2023-06-23 18:39:46,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=146653.33333333334, ans=0.1 2023-06-23 18:39:51,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=146720.0, ans=15.0 2023-06-23 18:40:17,244 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.530e+02 1.884e+02 2.149e+02 2.581e+02 4.173e+02, threshold=4.298e+02, percent-clipped=2.0 2023-06-23 18:40:24,538 INFO [train.py:1008] (2/4) Epoch 42, batch 200, loss[loss=0.2082, simple_loss=0.2772, pruned_loss=0.0696, over 18628.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2757, pruned_loss=0.06166, over 2424211.32 frames. 
], batch size: 80, lr: 7.90e-03, grad_scale: 16.0 2023-06-23 18:41:37,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=147120.0, ans=0.125 2023-06-23 18:41:46,142 INFO [train.py:1008] (2/4) Epoch 42, batch 250, loss[loss=0.1932, simple_loss=0.2666, pruned_loss=0.05993, over 20557.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2753, pruned_loss=0.06195, over 2734043.31 frames. ], batch size: 173, lr: 7.89e-03, grad_scale: 16.0 2023-06-23 18:41:49,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.min_positive, batch_count=147186.66666666666, ans=0.025 2023-06-23 18:41:59,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=147186.66666666666, ans=0.125 2023-06-23 18:42:13,777 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-23 18:42:18,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=147320.0, ans=15.0 2023-06-23 18:42:26,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=147320.0, ans=0.125 2023-06-23 18:42:57,947 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=384, metric=5.31 vs. limit=15.0 2023-06-23 18:43:04,695 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.392e+02 1.767e+02 1.922e+02 2.149e+02 5.047e+02, threshold=3.844e+02, percent-clipped=1.0 2023-06-23 18:43:09,461 INFO [train.py:1008] (2/4) Epoch 42, batch 300, loss[loss=0.1955, simple_loss=0.2573, pruned_loss=0.06685, over 20297.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2758, pruned_loss=0.06171, over 2960968.45 frames. ], batch size: 239, lr: 7.88e-03, grad_scale: 8.0 2023-06-23 18:43:16,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=147520.0, ans=0.2 2023-06-23 18:43:35,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=147586.66666666666, ans=0.125 2023-06-23 18:44:32,324 INFO [train.py:1008] (2/4) Epoch 42, batch 350, loss[loss=0.2209, simple_loss=0.2637, pruned_loss=0.08908, over 17030.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2755, pruned_loss=0.0621, over 3144997.65 frames. ], batch size: 391, lr: 7.88e-03, grad_scale: 8.0 2023-06-23 18:44:53,547 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:45:51,448 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.394e+02 1.757e+02 1.899e+02 2.123e+02 3.456e+02, threshold=3.798e+02, percent-clipped=0.0 2023-06-23 18:45:56,305 INFO [train.py:1008] (2/4) Epoch 42, batch 400, loss[loss=0.1926, simple_loss=0.2651, pruned_loss=0.06003, over 20311.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2758, pruned_loss=0.06155, over 3277649.51 frames. 
], batch size: 149, lr: 7.87e-03, grad_scale: 16.0 2023-06-23 18:45:58,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=148186.66666666666, ans=0.1 2023-06-23 18:46:15,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=148253.33333333334, ans=15.0 2023-06-23 18:46:25,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=148253.33333333334, ans=0.2 2023-06-23 18:47:17,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148453.33333333334, ans=0.1 2023-06-23 18:47:19,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.25 vs. limit=10.0 2023-06-23 18:47:20,125 INFO [train.py:1008] (2/4) Epoch 42, batch 450, loss[loss=0.2015, simple_loss=0.2747, pruned_loss=0.06416, over 20274.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2765, pruned_loss=0.06173, over 3391804.87 frames. ], batch size: 141, lr: 7.86e-03, grad_scale: 16.0 2023-06-23 18:47:22,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=148520.0, ans=0.125 2023-06-23 18:47:31,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=148520.0, ans=0.125 2023-06-23 18:47:39,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=148586.66666666666, ans=0.125 2023-06-23 18:47:42,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=148586.66666666666, ans=0.0 2023-06-23 18:47:43,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=148586.66666666666, ans=0.125 2023-06-23 18:47:49,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=148586.66666666666, ans=0.0 2023-06-23 18:47:52,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=148653.33333333334, ans=0.125 2023-06-23 18:47:53,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=13.66 vs. limit=15.0 2023-06-23 18:48:17,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=148720.0, ans=0.09899494936611666 2023-06-23 18:48:21,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=148720.0, ans=0.0 2023-06-23 18:48:36,016 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.491e+02 1.803e+02 1.999e+02 2.413e+02 3.336e+02, threshold=3.998e+02, percent-clipped=0.0 2023-06-23 18:48:40,745 INFO [train.py:1008] (2/4) Epoch 42, batch 500, loss[loss=0.2074, simple_loss=0.292, pruned_loss=0.06145, over 16336.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2765, pruned_loss=0.06213, over 3492302.42 frames. 
], batch size: 52, lr: 7.85e-03, grad_scale: 16.0 2023-06-23 18:48:44,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass_mid.scale_min, batch_count=148853.33333333334, ans=0.2 2023-06-23 18:48:47,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=148853.33333333334, ans=0.1 2023-06-23 18:49:07,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=148920.0, ans=0.125 2023-06-23 18:49:21,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=148986.66666666666, ans=0.0 2023-06-23 18:49:53,210 INFO [train.py:1008] (2/4) Epoch 43, batch 0, loss[loss=0.2268, simple_loss=0.2634, pruned_loss=0.09507, over 16769.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2634, pruned_loss=0.09507, over 16769.00 frames. ], batch size: 391, lr: 7.76e-03, grad_scale: 32.0 2023-06-23 18:49:53,211 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 18:49:58,908 INFO [train.py:1040] (2/4) Epoch 43, validation: loss=0.1949, simple_loss=0.29, pruned_loss=0.04993, over 143649.00 frames. 2023-06-23 18:49:58,909 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 18:50:00,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=149073.33333333334, ans=0.125 2023-06-23 18:50:08,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=149073.33333333334, ans=0.1 2023-06-23 18:50:16,086 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.89 vs. limit=6.0 2023-06-23 18:50:16,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass_mid.scale_min, batch_count=149140.0, ans=0.2 2023-06-23 18:50:19,022 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.67 vs. limit=12.0 2023-06-23 18:50:40,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=149206.66666666666, ans=0.125 2023-06-23 18:50:57,468 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:51:02,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=149340.0, ans=0.125 2023-06-23 18:51:21,834 INFO [train.py:1008] (2/4) Epoch 43, batch 50, loss[loss=0.2077, simple_loss=0.2484, pruned_loss=0.08349, over 16972.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2763, pruned_loss=0.0608, over 848777.72 frames. ], batch size: 391, lr: 7.75e-03, grad_scale: 32.0 2023-06-23 18:51:43,789 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.648e+02 1.897e+02 2.082e+02 3.369e+02, threshold=3.795e+02, percent-clipped=0.0 2023-06-23 18:52:00,338 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=6.30 vs. 
limit=15.0 2023-06-23 18:52:01,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=149540.0, ans=0.2 2023-06-23 18:52:08,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=149540.0, ans=0.0 2023-06-23 18:52:12,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=149606.66666666666, ans=0.125 2023-06-23 18:52:43,351 INFO [train.py:1008] (2/4) Epoch 43, batch 100, loss[loss=0.1941, simple_loss=0.2646, pruned_loss=0.06184, over 20705.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2758, pruned_loss=0.06215, over 1498597.18 frames. ], batch size: 211, lr: 7.74e-03, grad_scale: 32.0 2023-06-23 18:53:14,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=149873.33333333334, ans=0.125 2023-06-23 18:53:29,932 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.67 vs. limit=15.0 2023-06-23 18:53:38,306 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.20 vs. limit=6.0 2023-06-23 18:54:01,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=150006.66666666666, ans=0.0 2023-06-23 18:54:04,750 INFO [train.py:1008] (2/4) Epoch 43, batch 150, loss[loss=0.197, simple_loss=0.2753, pruned_loss=0.05933, over 19076.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2755, pruned_loss=0.06144, over 2006763.94 frames. ], batch size: 89, lr: 7.73e-03, grad_scale: 32.0 2023-06-23 18:54:16,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=150073.33333333334, ans=0.125 2023-06-23 18:54:28,198 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.456e+02 1.713e+02 1.940e+02 2.500e+02 3.729e+02, threshold=3.879e+02, percent-clipped=0.0 2023-06-23 18:54:50,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.skip_rate, batch_count=150206.66666666666, ans=0.09899494936611666 2023-06-23 18:54:51,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=150206.66666666666, ans=0.125 2023-06-23 18:55:22,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=150340.0, ans=0.0 2023-06-23 18:55:24,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=150340.0, ans=0.0 2023-06-23 18:55:25,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=150406.66666666666, ans=0.0 2023-06-23 18:55:26,936 INFO [train.py:1008] (2/4) Epoch 43, batch 200, loss[loss=0.1912, simple_loss=0.2849, pruned_loss=0.0488, over 15500.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.276, pruned_loss=0.06127, over 2404505.69 frames. 
], batch size: 44, lr: 7.72e-03, grad_scale: 32.0 2023-06-23 18:55:35,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=150406.66666666666, ans=0.125 2023-06-23 18:55:40,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer1.prob, batch_count=150406.66666666666, ans=0.125 2023-06-23 18:55:41,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=150473.33333333334, ans=0.125 2023-06-23 18:56:12,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=150540.0, ans=0.125 2023-06-23 18:56:19,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.pos_emb_skip_rate, batch_count=150606.66666666666, ans=0.0 2023-06-23 18:56:32,218 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.70 vs. limit=15.0 2023-06-23 18:56:49,136 INFO [train.py:1008] (2/4) Epoch 43, batch 250, loss[loss=0.187, simple_loss=0.2707, pruned_loss=0.05165, over 18644.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2754, pruned_loss=0.06083, over 2727102.49 frames. ], batch size: 80, lr: 7.72e-03, grad_scale: 32.0 2023-06-23 18:56:51,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=150740.0, ans=0.125 2023-06-23 18:57:03,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.prob, batch_count=150740.0, ans=0.125 2023-06-23 18:57:08,800 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=21.14 vs. limit=22.5 2023-06-23 18:57:08,851 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.91 vs. limit=10.0 2023-06-23 18:57:12,516 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.729e+02 1.933e+02 2.195e+02 2.886e+02, threshold=3.865e+02, percent-clipped=0.0 2023-06-23 18:57:12,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_skip_rate, batch_count=150806.66666666666, ans=0.0 2023-06-23 18:57:46,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=150940.0, ans=0.0 2023-06-23 18:57:49,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=150940.0, ans=0.0 2023-06-23 18:58:12,324 INFO [train.py:1008] (2/4) Epoch 43, batch 300, loss[loss=0.1858, simple_loss=0.2662, pruned_loss=0.05275, over 18616.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2758, pruned_loss=0.06126, over 2961985.65 frames. 
], batch size: 80, lr: 7.71e-03, grad_scale: 32.0 2023-06-23 18:58:16,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=151073.33333333334, ans=0.125 2023-06-23 18:58:43,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=151140.0, ans=0.1 2023-06-23 18:58:47,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=151206.66666666666, ans=0.1 2023-06-23 18:59:08,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=151273.33333333334, ans=0.0 2023-06-23 18:59:17,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=151340.0, ans=0.125 2023-06-23 18:59:28,238 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.74 vs. limit=12.0 2023-06-23 18:59:34,811 INFO [train.py:1008] (2/4) Epoch 43, batch 350, loss[loss=0.2005, simple_loss=0.2743, pruned_loss=0.06334, over 20525.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2758, pruned_loss=0.06057, over 3151496.55 frames. ], batch size: 160, lr: 7.70e-03, grad_scale: 32.0 2023-06-23 18:59:57,696 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.471e+02 1.705e+02 1.912e+02 2.088e+02 2.578e+02, threshold=3.825e+02, percent-clipped=0.0 2023-06-23 19:00:09,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=151540.0, ans=0.125 2023-06-23 19:00:14,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=151540.0, ans=0.0 2023-06-23 19:00:37,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=151606.66666666666, ans=0.125 2023-06-23 19:00:56,371 INFO [train.py:1008] (2/4) Epoch 43, batch 400, loss[loss=0.1981, simple_loss=0.2804, pruned_loss=0.05787, over 19253.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2758, pruned_loss=0.06068, over 3294018.77 frames. ], batch size: 92, lr: 7.69e-03, grad_scale: 32.0 2023-06-23 19:01:03,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=151740.0, ans=0.0 2023-06-23 19:01:46,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=151940.0, ans=0.2 2023-06-23 19:02:09,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=152006.66666666666, ans=0.0 2023-06-23 19:02:18,544 INFO [train.py:1008] (2/4) Epoch 43, batch 450, loss[loss=0.2046, simple_loss=0.2868, pruned_loss=0.06118, over 13460.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2755, pruned_loss=0.06057, over 3403997.46 frames. 
], batch size: 38, lr: 7.69e-03, grad_scale: 32.0 2023-06-23 19:02:41,708 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.517e+02 1.753e+02 2.020e+02 2.242e+02 3.524e+02, threshold=4.040e+02, percent-clipped=0.0 2023-06-23 19:02:57,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=152206.66666666666, ans=0.0 2023-06-23 19:03:14,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=152273.33333333334, ans=0.125 2023-06-23 19:03:37,433 INFO [train.py:1008] (2/4) Epoch 43, batch 500, loss[loss=0.2281, simple_loss=0.2632, pruned_loss=0.0965, over 17055.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2761, pruned_loss=0.06079, over 3478799.57 frames. ], batch size: 392, lr: 7.68e-03, grad_scale: 32.0 2023-06-23 19:03:49,165 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=22.5 2023-06-23 19:04:01,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=152473.33333333334, ans=0.125 2023-06-23 19:04:05,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=152473.33333333334, ans=0.05 2023-06-23 19:04:12,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=152540.0, ans=0.125 2023-06-23 19:04:14,180 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-23 19:04:51,191 INFO [train.py:1008] (2/4) Epoch 44, batch 0, loss[loss=0.1947, simple_loss=0.2736, pruned_loss=0.05788, over 18649.00 frames. ], tot_loss[loss=0.1947, simple_loss=0.2736, pruned_loss=0.05788, over 18649.00 frames. ], batch size: 80, lr: 7.58e-03, grad_scale: 32.0 2023-06-23 19:04:51,192 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 19:04:56,827 INFO [train.py:1040] (2/4) Epoch 44, validation: loss=0.1969, simple_loss=0.2906, pruned_loss=0.05158, over 143649.00 frames. 2023-06-23 19:04:56,828 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 19:05:08,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=152626.66666666666, ans=0.125 2023-06-23 19:05:25,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=152693.33333333334, ans=0.125 2023-06-23 19:05:44,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=152826.66666666666, ans=0.07 2023-06-23 19:05:47,390 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.338e+02 1.798e+02 2.006e+02 2.189e+02 3.546e+02, threshold=4.013e+02, percent-clipped=0.0 2023-06-23 19:06:13,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=11.38 vs. 
limit=15.0 2023-06-23 19:06:18,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=152960.0, ans=0.125 2023-06-23 19:06:19,935 INFO [train.py:1008] (2/4) Epoch 44, batch 50, loss[loss=0.191, simple_loss=0.2726, pruned_loss=0.05465, over 19302.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2766, pruned_loss=0.06222, over 866535.67 frames. ], batch size: 98, lr: 7.58e-03, grad_scale: 32.0 2023-06-23 19:06:28,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=152960.0, ans=0.125 2023-06-23 19:06:31,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=152960.0, ans=0.2 2023-06-23 19:07:02,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=153093.33333333334, ans=0.035 2023-06-23 19:07:13,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.ff3_skip_rate, batch_count=153160.0, ans=0.0 2023-06-23 19:07:41,185 INFO [train.py:1008] (2/4) Epoch 44, batch 100, loss[loss=0.2286, simple_loss=0.3108, pruned_loss=0.07314, over 18314.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2781, pruned_loss=0.06186, over 1521265.55 frames. ], batch size: 72, lr: 7.57e-03, grad_scale: 32.0 2023-06-23 19:07:47,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=153293.33333333334, ans=0.1 2023-06-23 19:08:02,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=153360.0, ans=0.125 2023-06-23 19:08:14,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=153426.66666666666, ans=0.2 2023-06-23 19:08:24,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=153426.66666666666, ans=0.125 2023-06-23 19:08:32,161 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.411e+02 1.721e+02 1.906e+02 2.161e+02 3.199e+02, threshold=3.812e+02, percent-clipped=0.0 2023-06-23 19:08:40,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=153493.33333333334, ans=0.2 2023-06-23 19:08:42,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.balancer2.prob, batch_count=153493.33333333334, ans=0.125 2023-06-23 19:08:55,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=153560.0, ans=0.0 2023-06-23 19:08:55,871 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=384, metric=20.12 vs. limit=22.5 2023-06-23 19:08:58,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=153560.0, ans=0.125 2023-06-23 19:09:04,503 INFO [train.py:1008] (2/4) Epoch 44, batch 150, loss[loss=0.1896, simple_loss=0.2748, pruned_loss=0.05223, over 18637.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2755, pruned_loss=0.0615, over 2029557.15 frames. 
], batch size: 80, lr: 7.56e-03, grad_scale: 32.0 2023-06-23 19:09:08,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=153626.66666666666, ans=0.125 2023-06-23 19:09:14,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=153626.66666666666, ans=0.125 2023-06-23 19:09:19,483 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=19.44 vs. limit=22.5 2023-06-23 19:09:54,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=153826.66666666666, ans=0.1 2023-06-23 19:10:15,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=153893.33333333334, ans=0.0 2023-06-23 19:10:26,214 INFO [train.py:1008] (2/4) Epoch 44, batch 200, loss[loss=0.1939, simple_loss=0.2677, pruned_loss=0.06001, over 20518.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.275, pruned_loss=0.06076, over 2423101.87 frames. ], batch size: 173, lr: 7.56e-03, grad_scale: 32.0 2023-06-23 19:10:31,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=153960.0, ans=0.1 2023-06-23 19:10:40,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=154026.66666666666, ans=0.125 2023-06-23 19:11:01,461 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=4.20 vs. limit=15.0 2023-06-23 19:11:06,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=154093.33333333334, ans=0.0 2023-06-23 19:11:16,795 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.441e+02 1.708e+02 1.966e+02 2.304e+02 3.229e+02, threshold=3.932e+02, percent-clipped=0.0 2023-06-23 19:11:49,421 INFO [train.py:1008] (2/4) Epoch 44, batch 250, loss[loss=0.2107, simple_loss=0.2742, pruned_loss=0.07366, over 20263.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2752, pruned_loss=0.06048, over 2717016.95 frames. ], batch size: 239, lr: 7.55e-03, grad_scale: 32.0 2023-06-23 19:11:52,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=154293.33333333334, ans=0.125 2023-06-23 19:12:35,754 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:12:57,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=154560.0, ans=22.5 2023-06-23 19:13:06,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=154560.0, ans=0.125 2023-06-23 19:13:11,226 INFO [train.py:1008] (2/4) Epoch 44, batch 300, loss[loss=0.2071, simple_loss=0.2717, pruned_loss=0.07128, over 20663.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2754, pruned_loss=0.06058, over 2946126.24 frames. 
], batch size: 211, lr: 7.54e-03, grad_scale: 32.0 2023-06-23 19:13:46,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=154760.0, ans=0.1 2023-06-23 19:14:02,952 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.525e+02 1.753e+02 1.948e+02 2.277e+02 3.619e+02, threshold=3.896e+02, percent-clipped=0.0 2023-06-23 19:14:18,950 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.28 vs. limit=6.0 2023-06-23 19:14:33,605 INFO [train.py:1008] (2/4) Epoch 44, batch 350, loss[loss=0.1957, simple_loss=0.2743, pruned_loss=0.05854, over 19812.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2752, pruned_loss=0.06031, over 3137568.27 frames. ], batch size: 115, lr: 7.53e-03, grad_scale: 32.0 2023-06-23 19:15:15,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=155093.33333333334, ans=0.0 2023-06-23 19:15:22,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer2.prob, batch_count=155160.0, ans=0.125 2023-06-23 19:15:25,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=155160.0, ans=0.1 2023-06-23 19:15:36,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=155160.0, ans=0.125 2023-06-23 19:15:57,317 INFO [train.py:1008] (2/4) Epoch 44, batch 400, loss[loss=0.1675, simple_loss=0.2488, pruned_loss=0.04314, over 19639.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2749, pruned_loss=0.0596, over 3280946.78 frames. ], batch size: 110, lr: 7.53e-03, grad_scale: 32.0 2023-06-23 19:15:57,961 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.96 vs. limit=10.0 2023-06-23 19:16:06,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=155293.33333333334, ans=0.125 2023-06-23 19:16:18,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward3.hidden_balancer.prob, batch_count=155360.0, ans=0.125 2023-06-23 19:16:21,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=155360.0, ans=0.125 2023-06-23 19:16:44,473 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.74 vs. limit=15.0 2023-06-23 19:16:48,285 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.368e+02 1.681e+02 1.920e+02 2.248e+02 3.618e+02, threshold=3.840e+02, percent-clipped=0.0 2023-06-23 19:16:55,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=155493.33333333334, ans=0.125 2023-06-23 19:17:12,892 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=5.97 vs. limit=15.0 2023-06-23 19:17:19,735 INFO [train.py:1008] (2/4) Epoch 44, batch 450, loss[loss=0.1836, simple_loss=0.2662, pruned_loss=0.05049, over 19539.00 frames. 
], tot_loss[loss=0.1976, simple_loss=0.2752, pruned_loss=0.05997, over 3388585.39 frames. ], batch size: 102, lr: 7.52e-03, grad_scale: 32.0 2023-06-23 19:17:43,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=155693.33333333334, ans=0.125 2023-06-23 19:17:44,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=155693.33333333334, ans=0.125 2023-06-23 19:17:49,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=155693.33333333334, ans=0.0 2023-06-23 19:17:58,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward2.hidden_balancer.prob, batch_count=155760.0, ans=0.125 2023-06-23 19:17:59,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=155760.0, ans=0.0 2023-06-23 19:18:00,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=155760.0, ans=0.2 2023-06-23 19:18:40,431 INFO [train.py:1008] (2/4) Epoch 44, batch 500, loss[loss=0.2015, simple_loss=0.272, pruned_loss=0.06547, over 20083.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2747, pruned_loss=0.06, over 3478519.57 frames. ], batch size: 133, lr: 7.51e-03, grad_scale: 32.0 2023-06-23 19:18:43,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=155960.0, ans=0.2 2023-06-23 19:19:19,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=156093.33333333334, ans=0.1 2023-06-23 19:19:24,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=156093.33333333334, ans=0.125 2023-06-23 19:19:29,814 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.329e+02 1.714e+02 1.879e+02 2.067e+02 4.059e+02, threshold=3.757e+02, percent-clipped=1.0 2023-06-23 19:19:48,881 INFO [train.py:1008] (2/4) Epoch 45, batch 0, loss[loss=0.1831, simple_loss=0.266, pruned_loss=0.05013, over 19106.00 frames. ], tot_loss[loss=0.1831, simple_loss=0.266, pruned_loss=0.05013, over 19106.00 frames. ], batch size: 94, lr: 7.42e-03, grad_scale: 32.0 2023-06-23 19:19:48,882 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 19:19:54,599 INFO [train.py:1040] (2/4) Epoch 45, validation: loss=0.1941, simple_loss=0.2881, pruned_loss=0.05006, over 143649.00 frames. 
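[editor's note, not part of the log] A reading aid for the recurring optim.py:471 entries above: each reports five order statistics (min, lower quartile, median, upper quartile, max) of recently observed gradient norms, a clipping threshold, and the percentage of clipped batches; in every instance in this log the threshold equals Clipping_scale (2.0) times the reported median. Below is a minimal, illustrative Python sketch of adaptive norm clipping that would produce lines in that spirit. It is NOT the icefall optim.py implementation; the function name, window size, and printing format are assumptions made for illustration only.

    import torch

    def adaptive_clip_(parameters, norm_history, clipping_scale=2.0, window=200):
        # Illustrative sketch only (assumed API, not icefall's optim.py):
        # clip gradients to clipping_scale * median of recent gradient norms
        # and report quartiles, threshold, and whether this batch was clipped.
        grads = [p.grad for p in parameters if p.grad is not None]
        total_norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
        norm_history.append(total_norm)
        if len(norm_history) > window:
            norm_history.pop(0)
        q = torch.quantile(torch.tensor(norm_history),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = clipping_scale * q[2].item()   # 2.0 x median, matching the log
        clipped = total_norm > threshold
        if clipped:
            for g in grads:
                g.mul_(threshold / total_norm)
        print(f"Clipping_scale={clipping_scale}, grad-norm quartiles "
              + " ".join(f"{v:.3e}" for v in q.tolist())
              + f", threshold={threshold:.3e}, clipped={clipped}")
        return total_norm

Under this sketch, a run of mostly-unclipped batches with a slowly rising median would reproduce the pattern seen in the log, where the threshold tracks twice the median and percent-clipped stays near zero.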
2023-06-23 19:19:54,599 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 19:20:45,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=156373.33333333334, ans=0.2 2023-06-23 19:20:50,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.bypass.scale_min, batch_count=156373.33333333334, ans=0.2 2023-06-23 19:20:51,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=156373.33333333334, ans=0.1 2023-06-23 19:21:03,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=156440.0, ans=0.0 2023-06-23 19:21:11,079 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=384, metric=3.12 vs. limit=12.0 2023-06-23 19:21:16,259 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.34 vs. limit=15.0 2023-06-23 19:21:17,117 INFO [train.py:1008] (2/4) Epoch 45, batch 50, loss[loss=0.1942, simple_loss=0.276, pruned_loss=0.05614, over 19466.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2746, pruned_loss=0.05952, over 860765.87 frames. ], batch size: 105, lr: 7.41e-03, grad_scale: 32.0 2023-06-23 19:21:26,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.prob, batch_count=156506.66666666666, ans=0.125 2023-06-23 19:21:29,169 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=22.5 2023-06-23 19:21:49,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=156640.0, ans=0.05 2023-06-23 19:21:58,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=156640.0, ans=0.125 2023-06-23 19:21:58,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=156640.0, ans=0.1 2023-06-23 19:22:08,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-23 19:22:25,586 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.53 vs. limit=10.0 2023-06-23 19:22:33,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=156773.33333333334, ans=15.0 2023-06-23 19:22:38,918 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.475e+02 1.738e+02 1.900e+02 2.147e+02 3.061e+02, threshold=3.801e+02, percent-clipped=0.0 2023-06-23 19:22:39,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=156840.0, ans=0.1 2023-06-23 19:22:40,563 INFO [train.py:1008] (2/4) Epoch 45, batch 100, loss[loss=0.1829, simple_loss=0.2662, pruned_loss=0.04983, over 19490.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2738, pruned_loss=0.0585, over 1514492.16 frames. 
], batch size: 102, lr: 7.41e-03, grad_scale: 32.0 2023-06-23 19:22:47,266 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:23:32,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=157040.0, ans=0.0 2023-06-23 19:23:32,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=157040.0, ans=0.0 2023-06-23 19:23:40,173 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.41 vs. limit=15.0 2023-06-23 19:23:59,779 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:24:04,602 INFO [train.py:1008] (2/4) Epoch 45, batch 150, loss[loss=0.1868, simple_loss=0.2708, pruned_loss=0.05135, over 19072.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2739, pruned_loss=0.05813, over 2020271.51 frames. ], batch size: 89, lr: 7.40e-03, grad_scale: 32.0 2023-06-23 19:25:01,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer2.min_positive, batch_count=157373.33333333334, ans=0.05 2023-06-23 19:25:11,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=157440.0, ans=0.2 2023-06-23 19:25:13,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=157440.0, ans=0.2 2023-06-23 19:25:25,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=157440.0, ans=0.0 2023-06-23 19:25:26,499 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.361e+02 1.703e+02 1.879e+02 2.064e+02 3.143e+02, threshold=3.758e+02, percent-clipped=0.0 2023-06-23 19:25:28,170 INFO [train.py:1008] (2/4) Epoch 45, batch 200, loss[loss=0.2093, simple_loss=0.303, pruned_loss=0.05775, over 17609.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2744, pruned_loss=0.05881, over 2416810.09 frames. ], batch size: 67, lr: 7.39e-03, grad_scale: 32.0 2023-06-23 19:26:12,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=157640.0, ans=0.125 2023-06-23 19:26:20,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=157706.66666666666, ans=0.125 2023-06-23 19:26:53,038 INFO [train.py:1008] (2/4) Epoch 45, batch 250, loss[loss=0.1899, simple_loss=0.2729, pruned_loss=0.05344, over 19098.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2744, pruned_loss=0.05942, over 2713592.36 frames. ], batch size: 94, lr: 7.39e-03, grad_scale: 32.0 2023-06-23 19:26:53,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=157840.0, ans=0.0 2023-06-23 19:26:58,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=157840.0, ans=0.125 2023-06-23 19:27:04,142 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=512, metric=17.88 vs. 
limit=22.5 2023-06-23 19:27:25,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=157973.33333333334, ans=0.0 2023-06-23 19:27:37,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=157973.33333333334, ans=0.2 2023-06-23 19:28:09,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.16 vs. limit=6.0 2023-06-23 19:28:16,265 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.503e+02 1.803e+02 2.062e+02 2.505e+02 3.858e+02, threshold=4.125e+02, percent-clipped=1.0 2023-06-23 19:28:17,884 INFO [train.py:1008] (2/4) Epoch 45, batch 300, loss[loss=0.1811, simple_loss=0.2653, pruned_loss=0.04839, over 19477.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.2744, pruned_loss=0.05969, over 2950380.08 frames. ], batch size: 105, lr: 7.38e-03, grad_scale: 32.0 2023-06-23 19:28:29,005 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=5.25 vs. limit=12.0 2023-06-23 19:28:34,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=158240.0, ans=0.2 2023-06-23 19:29:03,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=158306.66666666666, ans=0.2 2023-06-23 19:29:17,376 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:29:42,241 INFO [train.py:1008] (2/4) Epoch 45, batch 350, loss[loss=0.1962, simple_loss=0.2707, pruned_loss=0.06087, over 19964.00 frames. ], tot_loss[loss=0.1964, simple_loss=0.274, pruned_loss=0.05938, over 3140090.18 frames. ], batch size: 126, lr: 7.37e-03, grad_scale: 32.0 2023-06-23 19:29:42,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=158506.66666666666, ans=0.0 2023-06-23 19:30:39,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=158706.66666666666, ans=0.1 2023-06-23 19:30:45,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=158706.66666666666, ans=0.125 2023-06-23 19:31:07,425 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 1.753e+02 1.983e+02 2.244e+02 3.226e+02, threshold=3.966e+02, percent-clipped=0.0 2023-06-23 19:31:07,472 INFO [train.py:1008] (2/4) Epoch 45, batch 400, loss[loss=0.202, simple_loss=0.2803, pruned_loss=0.06183, over 17002.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2739, pruned_loss=0.05937, over 3282251.03 frames. ], batch size: 60, lr: 7.36e-03, grad_scale: 32.0 2023-06-23 19:31:21,826 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=6.90 vs. limit=15.0 2023-06-23 19:31:25,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=158906.66666666666, ans=0.0 2023-06-23 19:31:36,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=6.47 vs. 
limit=10.0 2023-06-23 19:31:43,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module2.whiten, num_groups=1, num_channels=384, metric=2.57 vs. limit=15.0 2023-06-23 19:31:48,552 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-06-23 19:31:57,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.33 vs. limit=22.5 2023-06-23 19:32:11,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=159040.0, ans=0.2 2023-06-23 19:32:20,148 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.30 vs. limit=15.0 2023-06-23 19:32:21,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=159106.66666666666, ans=0.125 2023-06-23 19:32:29,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=159173.33333333334, ans=0.025 2023-06-23 19:32:30,813 INFO [train.py:1008] (2/4) Epoch 45, batch 450, loss[loss=0.2162, simple_loss=0.3036, pruned_loss=0.06443, over 18344.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2744, pruned_loss=0.05941, over 3394939.81 frames. ], batch size: 72, lr: 7.36e-03, grad_scale: 32.0 2023-06-23 19:32:33,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=159173.33333333334, ans=0.0 2023-06-23 19:33:52,061 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.444e+02 1.694e+02 1.864e+02 2.179e+02 3.076e+02, threshold=3.729e+02, percent-clipped=0.0 2023-06-23 19:33:52,107 INFO [train.py:1008] (2/4) Epoch 45, batch 500, loss[loss=0.1998, simple_loss=0.2766, pruned_loss=0.06149, over 20248.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2743, pruned_loss=0.05967, over 3490879.46 frames. ], batch size: 141, lr: 7.35e-03, grad_scale: 32.0 2023-06-23 19:34:28,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=159640.0, ans=0.125 2023-06-23 19:34:34,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=159640.0, ans=0.0 2023-06-23 19:34:59,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=159726.66666666666, ans=0.0 2023-06-23 19:35:08,884 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=384, metric=3.42 vs. limit=15.0 2023-06-23 19:35:09,619 INFO [train.py:1008] (2/4) Epoch 46, batch 0, loss[loss=0.1791, simple_loss=0.2607, pruned_loss=0.04872, over 19548.00 frames. ], tot_loss[loss=0.1791, simple_loss=0.2607, pruned_loss=0.04872, over 19548.00 frames. ], batch size: 102, lr: 7.27e-03, grad_scale: 32.0 2023-06-23 19:35:09,619 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 19:35:15,388 INFO [train.py:1040] (2/4) Epoch 46, validation: loss=0.1946, simple_loss=0.2884, pruned_loss=0.05047, over 143649.00 frames. 
2023-06-23 19:35:15,389 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 19:35:27,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=159726.66666666666, ans=0.0 2023-06-23 19:35:32,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=159793.33333333334, ans=0.125 2023-06-23 19:35:34,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=159793.33333333334, ans=0.2 2023-06-23 19:35:36,707 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0 2023-06-23 19:35:38,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=159793.33333333334, ans=10.0 2023-06-23 19:35:53,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=159860.0, ans=0.1 2023-06-23 19:36:29,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=159993.33333333334, ans=0.1 2023-06-23 19:36:40,392 INFO [train.py:1008] (2/4) Epoch 46, batch 50, loss[loss=0.188, simple_loss=0.26, pruned_loss=0.05797, over 20131.00 frames. ], tot_loss[loss=0.1937, simple_loss=0.2728, pruned_loss=0.05727, over 836495.31 frames. ], batch size: 239, lr: 7.26e-03, grad_scale: 32.0 2023-06-23 19:37:07,793 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.430e+02 1.754e+02 1.890e+02 2.079e+02 2.977e+02, threshold=3.779e+02, percent-clipped=0.0 2023-06-23 19:38:02,592 INFO [train.py:1008] (2/4) Epoch 46, batch 100, loss[loss=0.1948, simple_loss=0.2685, pruned_loss=0.06059, over 20302.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2725, pruned_loss=0.05912, over 1497613.79 frames. ], batch size: 149, lr: 7.25e-03, grad_scale: 32.0 2023-06-23 19:38:50,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=160593.33333333334, ans=0.2 2023-06-23 19:39:10,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=160660.0, ans=0.0 2023-06-23 19:39:20,357 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=384, metric=3.87 vs. limit=15.0 2023-06-23 19:39:22,618 INFO [train.py:1008] (2/4) Epoch 46, batch 150, loss[loss=0.1976, simple_loss=0.24, pruned_loss=0.07762, over 16988.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2733, pruned_loss=0.05932, over 2003538.79 frames. 
], batch size: 391, lr: 7.24e-03, grad_scale: 32.0 2023-06-23 19:39:26,776 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:39:42,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=160793.33333333334, ans=0.95 2023-06-23 19:39:43,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=160793.33333333334, ans=0.0 2023-06-23 19:39:50,372 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.385e+02 1.683e+02 1.848e+02 2.036e+02 2.686e+02, threshold=3.697e+02, percent-clipped=0.0 2023-06-23 19:40:14,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=160926.66666666666, ans=0.0 2023-06-23 19:40:44,447 INFO [train.py:1008] (2/4) Epoch 46, batch 200, loss[loss=0.195, simple_loss=0.2746, pruned_loss=0.05774, over 19075.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2727, pruned_loss=0.05918, over 2411806.79 frames. ], batch size: 89, lr: 7.24e-03, grad_scale: 32.0 2023-06-23 19:41:02,749 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=16.92 vs. limit=22.5 2023-06-23 19:41:05,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=161126.66666666666, ans=0.0 2023-06-23 19:41:06,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=161126.66666666666, ans=0.125 2023-06-23 19:41:30,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.93 vs. limit=15.0 2023-06-23 19:41:41,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=161260.0, ans=0.2 2023-06-23 19:41:53,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=161326.66666666666, ans=0.125 2023-06-23 19:42:03,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=161326.66666666666, ans=0.125 2023-06-23 19:42:08,719 INFO [train.py:1008] (2/4) Epoch 46, batch 250, loss[loss=0.1933, simple_loss=0.2662, pruned_loss=0.06022, over 20299.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.2727, pruned_loss=0.05975, over 2723088.31 frames. 
], batch size: 141, lr: 7.23e-03, grad_scale: 32.0 2023-06-23 19:42:13,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=161393.33333333334, ans=0.0 2023-06-23 19:42:33,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=161460.0, ans=0.2 2023-06-23 19:42:35,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer2.prob, batch_count=161460.0, ans=0.125 2023-06-23 19:42:36,535 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.531e+02 1.820e+02 2.032e+02 2.554e+02 4.119e+02, threshold=4.065e+02, percent-clipped=4.0 2023-06-23 19:43:11,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=161593.33333333334, ans=0.125 2023-06-23 19:43:22,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=161660.0, ans=0.07 2023-06-23 19:43:30,575 INFO [train.py:1008] (2/4) Epoch 46, batch 300, loss[loss=0.2028, simple_loss=0.2745, pruned_loss=0.06553, over 20312.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.2738, pruned_loss=0.05993, over 2959404.16 frames. ], batch size: 149, lr: 7.22e-03, grad_scale: 32.0 2023-06-23 19:43:30,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module2.balancer1.prob, batch_count=161726.66666666666, ans=0.125 2023-06-23 19:43:32,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer1.max_abs, batch_count=161726.66666666666, ans=10.0 2023-06-23 19:44:03,778 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.18 vs. limit=15.0 2023-06-23 19:44:12,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=161860.0, ans=0.0 2023-06-23 19:44:48,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=161993.33333333334, ans=0.035 2023-06-23 19:44:51,368 INFO [train.py:1008] (2/4) Epoch 46, batch 350, loss[loss=0.1773, simple_loss=0.2554, pruned_loss=0.04958, over 19892.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2741, pruned_loss=0.05953, over 3131915.28 frames. ], batch size: 120, lr: 7.22e-03, grad_scale: 32.0 2023-06-23 19:45:19,765 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.401e+02 1.692e+02 1.874e+02 2.105e+02 2.686e+02, threshold=3.748e+02, percent-clipped=0.0 2023-06-23 19:45:35,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=162193.33333333334, ans=0.2 2023-06-23 19:45:42,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=162260.0, ans=0.125 2023-06-23 19:45:58,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=162326.66666666666, ans=0.125 2023-06-23 19:46:04,691 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.96 vs. limit=22.5 2023-06-23 19:46:14,898 INFO [train.py:1008] (2/4) Epoch 46, batch 400, loss[loss=0.2232, simple_loss=0.3038, pruned_loss=0.07134, over 18276.00 frames. 
], tot_loss[loss=0.1967, simple_loss=0.2742, pruned_loss=0.05966, over 3278565.98 frames. ], batch size: 74, lr: 7.21e-03, grad_scale: 32.0 2023-06-23 19:46:34,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff3_skip_rate, batch_count=162460.0, ans=0.0 2023-06-23 19:46:50,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=162526.66666666666, ans=0.035 2023-06-23 19:46:56,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=162526.66666666666, ans=0.0 2023-06-23 19:47:07,348 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:47:13,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=162593.33333333334, ans=0.0 2023-06-23 19:47:28,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer1.min_positive, batch_count=162660.0, ans=0.025 2023-06-23 19:47:34,105 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.77 vs. limit=15.0 2023-06-23 19:47:36,231 INFO [train.py:1008] (2/4) Epoch 46, batch 450, loss[loss=0.213, simple_loss=0.298, pruned_loss=0.06404, over 18320.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2736, pruned_loss=0.05895, over 3411872.74 frames. ], batch size: 72, lr: 7.20e-03, grad_scale: 32.0 2023-06-23 19:47:59,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=162793.33333333334, ans=0.1 2023-06-23 19:48:04,456 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.410e+02 1.727e+02 2.003e+02 2.275e+02 3.093e+02, threshold=4.006e+02, percent-clipped=0.0 2023-06-23 19:48:20,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=162860.0, ans=0.125 2023-06-23 19:48:55,551 INFO [train.py:1008] (2/4) Epoch 46, batch 500, loss[loss=0.192, simple_loss=0.2638, pruned_loss=0.0601, over 20709.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2738, pruned_loss=0.05858, over 3502644.47 frames. ], batch size: 211, lr: 7.20e-03, grad_scale: 32.0 2023-06-23 19:49:37,733 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=6.0 2023-06-23 19:50:03,329 INFO [train.py:1008] (2/4) Epoch 47, batch 0, loss[loss=0.2061, simple_loss=0.2919, pruned_loss=0.06017, over 19792.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2919, pruned_loss=0.06017, over 19792.00 frames. ], batch size: 115, lr: 7.11e-03, grad_scale: 32.0 2023-06-23 19:50:03,330 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 19:50:08,993 INFO [train.py:1040] (2/4) Epoch 47, validation: loss=0.1939, simple_loss=0.288, pruned_loss=0.04994, over 143649.00 frames. 
2023-06-23 19:50:08,994 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 19:50:13,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=163280.0, ans=0.1 2023-06-23 19:50:24,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=163346.66666666666, ans=0.0 2023-06-23 19:50:40,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass_mid.scale_min, batch_count=163413.33333333334, ans=0.2 2023-06-23 19:50:43,620 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.60 vs. limit=22.5 2023-06-23 19:51:04,874 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 1.745e+02 1.917e+02 2.227e+02 3.172e+02, threshold=3.834e+02, percent-clipped=0.0 2023-06-23 19:51:31,074 INFO [train.py:1008] (2/4) Epoch 47, batch 50, loss[loss=0.2194, simple_loss=0.2637, pruned_loss=0.08761, over 17014.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2751, pruned_loss=0.0612, over 839983.27 frames. ], batch size: 392, lr: 7.11e-03, grad_scale: 32.0 2023-06-23 19:51:44,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.pos_emb_skip_rate, batch_count=163613.33333333334, ans=0.0 2023-06-23 19:51:47,959 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.whiten, num_groups=1, num_channels=384, metric=5.72 vs. limit=12.0 2023-06-23 19:52:03,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=163746.66666666666, ans=0.0 2023-06-23 19:52:24,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=163813.33333333334, ans=0.2 2023-06-23 19:52:27,343 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:52:53,253 INFO [train.py:1008] (2/4) Epoch 47, batch 100, loss[loss=0.2225, simple_loss=0.2931, pruned_loss=0.07601, over 19983.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.2736, pruned_loss=0.0593, over 1488427.26 frames. 
], batch size: 126, lr: 7.10e-03, grad_scale: 32.0 2023-06-23 19:53:39,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=164080.0, ans=0.0 2023-06-23 19:53:45,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=164146.66666666666, ans=0.015 2023-06-23 19:53:48,240 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.432e+02 1.729e+02 1.911e+02 2.192e+02 3.082e+02, threshold=3.821e+02, percent-clipped=0.0 2023-06-23 19:53:48,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=164146.66666666666, ans=0.125 2023-06-23 19:54:03,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=164213.33333333334, ans=0.0 2023-06-23 19:54:11,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=164213.33333333334, ans=0.125 2023-06-23 19:54:11,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=164213.33333333334, ans=0.0 2023-06-23 19:54:14,615 INFO [train.py:1008] (2/4) Epoch 47, batch 150, loss[loss=0.2229, simple_loss=0.3057, pruned_loss=0.07, over 16393.00 frames. ], tot_loss[loss=0.1947, simple_loss=0.273, pruned_loss=0.05824, over 2015294.29 frames. ], batch size: 52, lr: 7.10e-03, grad_scale: 32.0 2023-06-23 19:54:14,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=164280.0, ans=0.125 2023-06-23 19:54:35,539 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:54:52,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=164413.33333333334, ans=0.125 2023-06-23 19:55:03,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=164480.0, ans=0.125 2023-06-23 19:55:05,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=164480.0, ans=0.0 2023-06-23 19:55:08,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=164480.0, ans=0.125 2023-06-23 19:55:38,220 INFO [train.py:1008] (2/4) Epoch 47, batch 200, loss[loss=0.1866, simple_loss=0.2583, pruned_loss=0.05742, over 20570.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2724, pruned_loss=0.05801, over 2411082.66 frames. 
], batch size: 189, lr: 7.09e-03, grad_scale: 32.0 2023-06-23 19:55:56,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=164680.0, ans=0.0 2023-06-23 19:56:08,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=164680.0, ans=0.1 2023-06-23 19:56:15,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=164746.66666666666, ans=0.125 2023-06-23 19:56:34,205 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.425e+02 1.753e+02 1.915e+02 2.109e+02 2.762e+02, threshold=3.830e+02, percent-clipped=0.0 2023-06-23 19:56:37,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=164813.33333333334, ans=0.125 2023-06-23 19:56:41,092 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-23 19:57:01,071 INFO [train.py:1008] (2/4) Epoch 47, batch 250, loss[loss=0.1795, simple_loss=0.2621, pruned_loss=0.04846, over 19695.00 frames. ], tot_loss[loss=0.1936, simple_loss=0.2718, pruned_loss=0.05772, over 2715565.34 frames. ], batch size: 110, lr: 7.08e-03, grad_scale: 32.0 2023-06-23 19:57:14,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=164946.66666666666, ans=0.125 2023-06-23 19:57:32,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=165080.0, ans=0.125 2023-06-23 19:57:57,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=165146.66666666666, ans=0.125 2023-06-23 19:58:02,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=165146.66666666666, ans=0.0 2023-06-23 19:58:12,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=165213.33333333334, ans=0.0 2023-06-23 19:58:23,410 INFO [train.py:1008] (2/4) Epoch 47, batch 300, loss[loss=0.1874, simple_loss=0.2636, pruned_loss=0.05561, over 19209.00 frames. ], tot_loss[loss=0.1936, simple_loss=0.2717, pruned_loss=0.05778, over 2970676.25 frames. 
], batch size: 92, lr: 7.08e-03, grad_scale: 16.0 2023-06-23 19:58:33,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=165280.0, ans=0.125 2023-06-23 19:58:35,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=165280.0, ans=0.0 2023-06-23 19:58:51,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=165346.66666666666, ans=0.2 2023-06-23 19:59:07,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=165413.33333333334, ans=0.0 2023-06-23 19:59:22,066 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.466e+02 1.773e+02 1.990e+02 2.235e+02 3.211e+02, threshold=3.980e+02, percent-clipped=0.0 2023-06-23 19:59:30,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=165546.66666666666, ans=0.125 2023-06-23 19:59:46,579 INFO [train.py:1008] (2/4) Epoch 47, batch 350, loss[loss=0.1876, simple_loss=0.2654, pruned_loss=0.05491, over 19076.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.272, pruned_loss=0.05789, over 3145142.62 frames. ], batch size: 89, lr: 7.07e-03, grad_scale: 16.0 2023-06-23 19:59:52,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=165613.33333333334, ans=0.0 2023-06-23 20:00:02,029 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:00:22,822 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-23 20:00:33,156 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=384, metric=4.50 vs. limit=15.0 2023-06-23 20:01:00,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=165880.0, ans=0.07 2023-06-23 20:01:02,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=165880.0, ans=0.125 2023-06-23 20:01:05,859 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:01:10,411 INFO [train.py:1008] (2/4) Epoch 47, batch 400, loss[loss=0.1732, simple_loss=0.259, pruned_loss=0.04366, over 19836.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.272, pruned_loss=0.05831, over 3292482.97 frames. ], batch size: 115, lr: 7.06e-03, grad_scale: 32.0 2023-06-23 20:01:12,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=165946.66666666666, ans=0.1 2023-06-23 20:01:13,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.35 vs. 
limit=22.5 2023-06-23 20:01:28,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=166013.33333333334, ans=0.2 2023-06-23 20:01:35,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166013.33333333334, ans=0.1 2023-06-23 20:01:47,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.nonlin_attention.whiten2.whitening_limit, batch_count=166080.0, ans=15.0 2023-06-23 20:02:09,662 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.405e+02 1.715e+02 1.880e+02 2.205e+02 3.538e+02, threshold=3.760e+02, percent-clipped=0.0 2023-06-23 20:02:10,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=166146.66666666666, ans=0.125 2023-06-23 20:02:34,897 INFO [train.py:1008] (2/4) Epoch 47, batch 450, loss[loss=0.1945, simple_loss=0.269, pruned_loss=0.06, over 20586.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2724, pruned_loss=0.05889, over 3391731.47 frames. ], batch size: 189, lr: 7.06e-03, grad_scale: 32.0 2023-06-23 20:02:53,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=166346.66666666666, ans=0.125 2023-06-23 20:03:03,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=166346.66666666666, ans=0.125 2023-06-23 20:03:13,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=166413.33333333334, ans=0.2 2023-06-23 20:03:16,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166413.33333333334, ans=0.1 2023-06-23 20:03:21,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=166413.33333333334, ans=0.1 2023-06-23 20:03:29,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=166480.0, ans=0.125 2023-06-23 20:03:31,083 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0 2023-06-23 20:03:35,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=166480.0, ans=0.0 2023-06-23 20:03:38,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_skip_rate, batch_count=166546.66666666666, ans=0.0 2023-06-23 20:03:41,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.attention_skip_rate, batch_count=166546.66666666666, ans=0.0 2023-06-23 20:03:44,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=166546.66666666666, ans=0.1 2023-06-23 20:03:55,633 INFO [train.py:1008] (2/4) Epoch 47, batch 500, loss[loss=0.1958, simple_loss=0.2743, pruned_loss=0.05867, over 20294.00 frames. ], tot_loss[loss=0.1948, simple_loss=0.2723, pruned_loss=0.05866, over 3492617.99 frames. 
], batch size: 141, lr: 7.05e-03, grad_scale: 32.0 2023-06-23 20:04:05,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=166613.33333333334, ans=0.0 2023-06-23 20:04:16,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=166680.0, ans=0.1 2023-06-23 20:04:25,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.prob, batch_count=166746.66666666666, ans=0.125 2023-06-23 20:04:39,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=166746.66666666666, ans=0.125 2023-06-23 20:05:01,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=166826.66666666666, ans=0.125 2023-06-23 20:05:06,526 INFO [train.py:1008] (2/4) Epoch 48, batch 0, loss[loss=0.1825, simple_loss=0.2677, pruned_loss=0.04864, over 19707.00 frames. ], tot_loss[loss=0.1825, simple_loss=0.2677, pruned_loss=0.04864, over 19707.00 frames. ], batch size: 110, lr: 6.97e-03, grad_scale: 32.0 2023-06-23 20:05:06,526 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 20:05:12,228 INFO [train.py:1040] (2/4) Epoch 48, validation: loss=0.196, simple_loss=0.2888, pruned_loss=0.05157, over 143649.00 frames. 2023-06-23 20:05:12,230 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 20:05:16,870 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.378e+02 1.645e+02 1.899e+02 2.115e+02 3.302e+02, threshold=3.798e+02, percent-clipped=0.0 2023-06-23 20:05:29,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=166893.33333333334, ans=0.2 2023-06-23 20:05:55,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module1.balancer1.prob, batch_count=166960.0, ans=0.125 2023-06-23 20:06:23,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=167093.33333333334, ans=0.0 2023-06-23 20:06:32,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=167093.33333333334, ans=0.0 2023-06-23 20:06:35,249 INFO [train.py:1008] (2/4) Epoch 48, batch 50, loss[loss=0.1754, simple_loss=0.2616, pruned_loss=0.04461, over 18948.00 frames. ], tot_loss[loss=0.192, simple_loss=0.27, pruned_loss=0.05702, over 854403.50 frames. ], batch size: 86, lr: 6.96e-03, grad_scale: 32.0 2023-06-23 20:06:43,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.scale_min, batch_count=167160.0, ans=0.2 2023-06-23 20:07:10,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=167293.33333333334, ans=0.125 2023-06-23 20:07:28,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=167360.0, ans=0.125 2023-06-23 20:07:35,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=167360.0, ans=0.125 2023-06-23 20:07:58,690 INFO [train.py:1008] (2/4) Epoch 48, batch 100, loss[loss=0.2025, simple_loss=0.2716, pruned_loss=0.06664, over 20602.00 frames. 
], tot_loss[loss=0.1933, simple_loss=0.2723, pruned_loss=0.05711, over 1504622.72 frames. ], batch size: 189, lr: 6.96e-03, grad_scale: 32.0 2023-06-23 20:08:03,595 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.423e+02 1.804e+02 2.047e+02 2.284e+02 3.996e+02, threshold=4.095e+02, percent-clipped=1.0 2023-06-23 20:08:11,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=167493.33333333334, ans=0.1 2023-06-23 20:08:17,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=167560.0, ans=0.125 2023-06-23 20:08:21,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=167560.0, ans=0.95 2023-06-23 20:08:30,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=167626.66666666666, ans=0.125 2023-06-23 20:08:58,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.ff2_skip_rate, batch_count=167693.33333333334, ans=0.0 2023-06-23 20:09:13,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=167760.0, ans=0.0 2023-06-23 20:09:21,645 INFO [train.py:1008] (2/4) Epoch 48, batch 150, loss[loss=0.1775, simple_loss=0.2594, pruned_loss=0.0478, over 19094.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2727, pruned_loss=0.05759, over 1991024.11 frames. ], batch size: 89, lr: 6.95e-03, grad_scale: 32.0 2023-06-23 20:09:29,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=167826.66666666666, ans=0.09899494936611666 2023-06-23 20:09:33,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=167826.66666666666, ans=0.1 2023-06-23 20:10:01,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=167960.0, ans=0.035 2023-06-23 20:10:06,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=167960.0, ans=0.125 2023-06-23 20:10:11,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=168026.66666666666, ans=0.2 2023-06-23 20:10:40,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=168093.33333333334, ans=0.125 2023-06-23 20:10:44,612 INFO [train.py:1008] (2/4) Epoch 48, batch 200, loss[loss=0.1697, simple_loss=0.2533, pruned_loss=0.0431, over 19436.00 frames. ], tot_loss[loss=0.193, simple_loss=0.2714, pruned_loss=0.05736, over 2401588.74 frames. ], batch size: 105, lr: 6.95e-03, grad_scale: 16.0 2023-06-23 20:10:51,406 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.331e+02 1.772e+02 1.937e+02 2.210e+02 3.460e+02, threshold=3.874e+02, percent-clipped=0.0 2023-06-23 20:11:04,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=168226.66666666666, ans=0.95 2023-06-23 20:11:49,567 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=4.13 vs. 
limit=12.0 2023-06-23 20:12:07,141 INFO [train.py:1008] (2/4) Epoch 48, batch 250, loss[loss=0.2, simple_loss=0.266, pruned_loss=0.06701, over 20228.00 frames. ], tot_loss[loss=0.1926, simple_loss=0.2715, pruned_loss=0.05682, over 2719743.84 frames. ], batch size: 239, lr: 6.94e-03, grad_scale: 16.0 2023-06-23 20:12:12,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=168493.33333333334, ans=0.0 2023-06-23 20:12:12,649 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=9.49 vs. limit=15.0 2023-06-23 20:12:19,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=168493.33333333334, ans=0.125 2023-06-23 20:12:19,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=168493.33333333334, ans=0.125 2023-06-23 20:12:33,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer2.prob, batch_count=168560.0, ans=0.125 2023-06-23 20:12:33,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.balancer2.prob, batch_count=168560.0, ans=0.125 2023-06-23 20:13:03,975 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=384, metric=15.37 vs. limit=22.5 2023-06-23 20:13:17,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=168760.0, ans=0.2 2023-06-23 20:13:28,474 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.77 vs. limit=6.0 2023-06-23 20:13:29,117 INFO [train.py:1008] (2/4) Epoch 48, batch 300, loss[loss=0.1995, simple_loss=0.2781, pruned_loss=0.06041, over 19830.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2719, pruned_loss=0.05716, over 2969144.14 frames. ], batch size: 115, lr: 6.93e-03, grad_scale: 16.0 2023-06-23 20:13:33,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=168826.66666666666, ans=0.125 2023-06-23 20:13:35,932 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.461e+02 1.680e+02 1.833e+02 2.048e+02 2.749e+02, threshold=3.665e+02, percent-clipped=0.0 2023-06-23 20:13:47,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=168893.33333333334, ans=0.125 2023-06-23 20:14:05,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=168960.0, ans=0.0 2023-06-23 20:14:32,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.15 vs. limit=10.0 2023-06-23 20:14:52,080 INFO [train.py:1008] (2/4) Epoch 48, batch 350, loss[loss=0.199, simple_loss=0.2817, pruned_loss=0.05815, over 19449.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2718, pruned_loss=0.05734, over 3155947.29 frames. 
], batch size: 105, lr: 6.93e-03, grad_scale: 16.0 2023-06-23 20:16:14,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=169493.33333333334, ans=0.0 2023-06-23 20:16:15,518 INFO [train.py:1008] (2/4) Epoch 48, batch 400, loss[loss=0.1785, simple_loss=0.2657, pruned_loss=0.04565, over 19212.00 frames. ], tot_loss[loss=0.1937, simple_loss=0.2725, pruned_loss=0.05743, over 3287543.70 frames. ], batch size: 92, lr: 6.92e-03, grad_scale: 32.0 2023-06-23 20:16:22,151 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.353e+02 1.719e+02 1.891e+02 2.067e+02 3.319e+02, threshold=3.781e+02, percent-clipped=0.0 2023-06-23 20:16:28,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=169493.33333333334, ans=0.125 2023-06-23 20:16:29,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=169493.33333333334, ans=0.0 2023-06-23 20:17:03,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=169693.33333333334, ans=0.07 2023-06-23 20:17:03,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=169693.33333333334, ans=0.125 2023-06-23 20:17:11,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.ff3_skip_rate, batch_count=169693.33333333334, ans=0.0 2023-06-23 20:17:37,644 INFO [train.py:1008] (2/4) Epoch 48, batch 450, loss[loss=0.195, simple_loss=0.2757, pruned_loss=0.05715, over 19051.00 frames. ], tot_loss[loss=0.1932, simple_loss=0.2719, pruned_loss=0.0572, over 3414639.15 frames. ], batch size: 89, lr: 6.91e-03, grad_scale: 32.0 2023-06-23 20:17:50,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=169826.66666666666, ans=15.0 2023-06-23 20:17:53,298 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.75 vs. limit=22.5 2023-06-23 20:17:58,199 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=13.31 vs. limit=22.5 2023-06-23 20:18:14,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=169960.0, ans=0.125 2023-06-23 20:18:31,730 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=4.22 vs. limit=15.0 2023-06-23 20:18:33,754 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.57 vs. limit=8.0 2023-06-23 20:18:45,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=170093.33333333334, ans=0.1 2023-06-23 20:18:55,369 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=5.38 vs. limit=10.0 2023-06-23 20:18:59,434 INFO [train.py:1008] (2/4) Epoch 48, batch 500, loss[loss=0.1831, simple_loss=0.254, pruned_loss=0.05604, over 20675.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.2718, pruned_loss=0.057, over 3501157.75 frames. 
], batch size: 211, lr: 6.91e-03, grad_scale: 32.0 2023-06-23 20:19:06,592 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.455e+02 1.730e+02 1.897e+02 2.100e+02 3.393e+02, threshold=3.794e+02, percent-clipped=0.0 2023-06-23 20:19:06,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=170160.0, ans=0.1 2023-06-23 20:19:19,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=170226.66666666666, ans=0.0 2023-06-23 20:19:29,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=170293.33333333334, ans=0.07 2023-06-23 20:19:36,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.hidden_balancer.prob, batch_count=170293.33333333334, ans=0.125 2023-06-23 20:19:37,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=170293.33333333334, ans=0.0 2023-06-23 20:19:44,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=170360.0, ans=0.95 2023-06-23 20:19:46,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.bypass.skip_rate, batch_count=170360.0, ans=0.07 2023-06-23 20:20:11,714 INFO [train.py:1008] (2/4) Epoch 49, batch 0, loss[loss=0.1801, simple_loss=0.256, pruned_loss=0.05216, over 20543.00 frames. ], tot_loss[loss=0.1801, simple_loss=0.256, pruned_loss=0.05216, over 20543.00 frames. ], batch size: 189, lr: 6.83e-03, grad_scale: 32.0 2023-06-23 20:20:11,714 INFO [train.py:1031] (2/4) Computing validation loss 2023-06-23 20:20:17,874 INFO [train.py:1040] (2/4) Epoch 49, validation: loss=0.1978, simple_loss=0.2901, pruned_loss=0.05269, over 143649.00 frames. 2023-06-23 20:20:17,875 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB 2023-06-23 20:20:31,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=170373.33333333334, ans=0.125 2023-06-23 20:20:34,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=170440.0, ans=0.1 2023-06-23 20:20:37,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=170440.0, ans=0.1 2023-06-23 20:20:39,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=170440.0, ans=0.0 2023-06-23 20:21:35,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.bypass.skip_rate, batch_count=170640.0, ans=0.09899494936611666 2023-06-23 20:21:41,446 INFO [train.py:1008] (2/4) Epoch 49, batch 50, loss[loss=0.1859, simple_loss=0.2736, pruned_loss=0.04913, over 18312.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2683, pruned_loss=0.05819, over 869648.83 frames. ], batch size: 74, lr: 6.83e-03, grad_scale: 32.0 2023-06-23 20:21:57,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=170773.33333333334, ans=0.1 2023-06-23 20:22:08,968 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=4.50 vs. 
limit=10.0 2023-06-23 20:22:11,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=170773.33333333334, ans=0.125 2023-06-23 20:22:13,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=170840.0, ans=0.2 2023-06-23 20:22:19,630 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.528e+02 1.847e+02 2.066e+02 2.446e+02 3.977e+02, threshold=4.132e+02, percent-clipped=1.0 2023-06-23 20:23:00,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=170973.33333333334, ans=0.0 2023-06-23 20:23:03,338 INFO [train.py:1008] (2/4) Epoch 49, batch 100, loss[loss=0.2034, simple_loss=0.2862, pruned_loss=0.06024, over 17036.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2711, pruned_loss=0.05755, over 1509065.61 frames. ], batch size: 60, lr: 6.82e-03, grad_scale: 32.0 2023-06-23 20:23:06,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=171040.0, ans=0.05 2023-06-23 20:23:06,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=171040.0, ans=0.125 2023-06-23 20:23:14,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=171040.0, ans=0.015 2023-06-23 20:23:40,635 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=4.12 vs. limit=10.0 2023-06-23 20:24:26,718 INFO [train.py:1008] (2/4) Epoch 49, batch 150, loss[loss=0.198, simple_loss=0.2818, pruned_loss=0.05711, over 14853.00 frames. ], tot_loss[loss=0.1916, simple_loss=0.2696, pruned_loss=0.05679, over 2001804.40 frames. ], batch size: 42, lr: 6.81e-03, grad_scale: 32.0 2023-06-23 20:24:35,924 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=8.59 vs. limit=15.0 2023-06-23 20:24:40,356 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.3.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:25:01,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=171506.66666666666, ans=0.1 2023-06-23 20:25:04,230 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.439e+02 1.656e+02 1.789e+02 2.179e+02 3.725e+02, threshold=3.578e+02, percent-clipped=0.0 2023-06-23 20:25:48,571 INFO [train.py:1008] (2/4) Epoch 49, batch 200, loss[loss=0.1827, simple_loss=0.2618, pruned_loss=0.05179, over 19725.00 frames. ], tot_loss[loss=0.1918, simple_loss=0.2693, pruned_loss=0.05718, over 2397361.48 frames. ], batch size: 110, lr: 6.81e-03, grad_scale: 32.0 2023-06-23 20:25:55,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=171706.66666666666, ans=0.125 2023-06-23 20:26:20,293 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=11.32 vs. 
limit=15.0 2023-06-23 20:26:32,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=171840.0, ans=0.0 2023-06-23 20:26:44,084 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=12.93 vs. limit=15.0 2023-06-23 20:26:53,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-23 20:27:07,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=171973.33333333334, ans=0.1 2023-06-23 20:27:12,550 INFO [train.py:1008] (2/4) Epoch 49, batch 250, loss[loss=0.1956, simple_loss=0.2756, pruned_loss=0.05773, over 18431.00 frames. ], tot_loss[loss=0.1919, simple_loss=0.2699, pruned_loss=0.05697, over 2712408.95 frames. ], batch size: 77, lr: 6.80e-03, grad_scale: 32.0 2023-06-23 20:27:28,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=172106.66666666666, ans=0.0 2023-06-23 20:27:39,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=172106.66666666666, ans=0.2 2023-06-23 20:27:48,873 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.346e+02 1.691e+02 1.914e+02 2.245e+02 3.185e+02, threshold=3.827e+02, percent-clipped=0.0 2023-06-23 20:27:49,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=172173.33333333334, ans=0.2 2023-06-23 20:27:58,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=172173.33333333334, ans=0.125 2023-06-23 20:28:09,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=172240.0, ans=0.0 2023-06-23 20:28:20,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=172306.66666666666, ans=0.1 2023-06-23 20:28:23,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=172306.66666666666, ans=0.125 2023-06-23 20:28:34,556 INFO [train.py:1008] (2/4) Epoch 49, batch 300, loss[loss=0.2034, simple_loss=0.271, pruned_loss=0.06792, over 20299.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2703, pruned_loss=0.05722, over 2936044.06 frames. ], batch size: 149, lr: 6.80e-03, grad_scale: 16.0 2023-06-23 20:29:28,420 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.62 vs. limit=22.5 2023-06-23 20:29:43,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=172640.0, ans=0.125 2023-06-23 20:29:57,316 INFO [train.py:1008] (2/4) Epoch 49, batch 350, loss[loss=0.1824, simple_loss=0.266, pruned_loss=0.04941, over 19230.00 frames. ], tot_loss[loss=0.1915, simple_loss=0.2701, pruned_loss=0.05649, over 3128170.38 frames. 
2023-06-23 20:30:03,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_module2.balancer2.min_abs, batch_count=172706.66666666666, ans=0.5
2023-06-23 20:30:10,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=172706.66666666666, ans=0.2
2023-06-23 20:30:37,377 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.450e+02 1.714e+02 1.918e+02 2.192e+02 2.856e+02, threshold=3.837e+02, percent-clipped=0.0
2023-06-23 20:31:11,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=172973.33333333334, ans=0.0
2023-06-23 20:31:19,821 INFO [train.py:1008] (2/4) Epoch 49, batch 400, loss[loss=0.1727, simple_loss=0.2564, pruned_loss=0.04452, over 18477.00 frames. ], tot_loss[loss=0.1911, simple_loss=0.2697, pruned_loss=0.05627, over 3282163.61 frames. ], batch size: 77, lr: 6.78e-03, grad_scale: 32.0
2023-06-23 20:31:35,545 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=22.5
2023-06-23 20:31:50,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=173173.33333333334, ans=0.125
2023-06-23 20:31:54,590 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.52 vs. limit=6.0
2023-06-23 20:31:59,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=173173.33333333334, ans=0.125
2023-06-23 20:32:32,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=173306.66666666666, ans=0.125
2023-06-23 20:32:42,986 INFO [train.py:1008] (2/4) Epoch 49, batch 450, loss[loss=0.1862, simple_loss=0.2662, pruned_loss=0.05307, over 19876.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2701, pruned_loss=0.05582, over 3382758.79 frames. ], batch size: 120, lr: 6.78e-03, grad_scale: 32.0
2023-06-23 20:32:52,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=173373.33333333334, ans=0.125
2023-06-23 20:33:08,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward3.hidden_balancer.prob, batch_count=173440.0, ans=0.125
2023-06-23 20:33:12,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=173440.0, ans=0.05
2023-06-23 20:33:16,879 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=384, metric=20.42 vs. limit=22.5
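The scaling.py:962 "Whitening" lines compare a per-module whiteness statistic against a limit (for example metric=20.42 vs. limit=22.5 just above). A standard way to measure how far activations are from having an identity-like channel covariance is the ratio C * tr(cov^2) / tr(cov)^2, which equals 1 for a perfectly white covariance and grows as the eigenvalue spectrum becomes uneven. The sketch below uses that proxy purely to illustrate the kind of statistic being logged; it is not the exact formula in scaling.py.

```python
# Illustrative whiteness statistic for activations x of shape (N, C):
# metric = C * trace(cov @ cov) / trace(cov)**2, equal to 1.0 when the channel
# covariance is a multiple of the identity and larger when it is not. This is a
# stand-in for the metric reported by scaling.py, not its actual implementation.
import torch

def whitening_metric(x: torch.Tensor) -> float:
    x = x.reshape(-1, x.shape[-1])          # (N, C)
    x = x - x.mean(dim=0, keepdim=True)     # remove the per-channel mean
    cov = (x.T @ x) / x.shape[0]            # (C, C) channel covariance
    c = cov.shape[0]
    return float(c * torch.trace(cov @ cov) / torch.trace(cov) ** 2)

white = torch.randn(20000, 128)
correlated = white @ torch.randn(128, 128)  # mixing channels makes the spectrum uneven
print(whitening_metric(white))       # close to 1
print(whitening_metric(correlated))  # clearly larger than 1
```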
2023-06-23 20:33:17,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=173506.66666666666, ans=0.0
2023-06-23 20:33:22,119 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.706e+02 1.898e+02 2.100e+02 2.777e+02, threshold=3.796e+02, percent-clipped=0.0
2023-06-23 20:33:44,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=173573.33333333334, ans=0.125
2023-06-23 20:34:04,002 INFO [train.py:1008] (2/4) Epoch 49, batch 500, loss[loss=0.1988, simple_loss=0.2727, pruned_loss=0.06247, over 20513.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2696, pruned_loss=0.05622, over 3471574.67 frames. ], batch size: 160, lr: 6.77e-03, grad_scale: 32.0
2023-06-23 20:34:31,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=173773.33333333334, ans=0.125
2023-06-23 20:34:32,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=173773.33333333334, ans=0.0
2023-06-23 20:35:12,127 INFO [train.py:1008] (2/4) Epoch 50, batch 0, loss[loss=0.1941, simple_loss=0.277, pruned_loss=0.05561, over 19677.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.277, pruned_loss=0.05561, over 19677.00 frames. ], batch size: 110, lr: 6.70e-03, grad_scale: 32.0
2023-06-23 20:35:12,128 INFO [train.py:1031] (2/4) Computing validation loss
2023-06-23 20:35:17,752 INFO [train.py:1040] (2/4) Epoch 50, validation: loss=0.1967, simple_loss=0.2888, pruned_loss=0.05235, over 143649.00 frames.
2023-06-23 20:35:17,753 INFO [train.py:1041] (2/4) Maximum memory allocated so far is 13783MB
2023-06-23 20:36:01,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.attention_skip_rate, batch_count=174060.0, ans=0.0
2023-06-23 20:36:13,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=174126.66666666666, ans=0.025
2023-06-23 20:36:25,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=174193.33333333334, ans=0.04949747468305833
2023-06-23 20:36:26,510 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.426e+02 1.756e+02 1.957e+02 2.326e+02 3.234e+02, threshold=3.914e+02, percent-clipped=0.0
2023-06-23 20:36:40,748 INFO [train.py:1008] (2/4) Epoch 50, batch 50, loss[loss=0.1998, simple_loss=0.2699, pruned_loss=0.06487, over 20119.00 frames. ], tot_loss[loss=0.1924, simple_loss=0.2726, pruned_loss=0.05612, over 862491.83 frames. ], batch size: 133, lr: 6.69e-03, grad_scale: 32.0
2023-06-23 20:36:43,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=12.56 vs. limit=22.5
2023-06-23 20:37:33,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=174460.0, ans=0.0
2023-06-23 20:38:00,321 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.2.self_attn_weights, loss-sum=0.000e+00
2023-06-23 20:38:04,574 INFO [train.py:1008] (2/4) Epoch 50, batch 100, loss[loss=0.1795, simple_loss=0.2669, pruned_loss=0.04602, over 19869.00 frames. ], tot_loss[loss=0.1913, simple_loss=0.271, pruned_loss=0.05582, over 1507340.79 frames. ], batch size: 120, lr: 6.69e-03, grad_scale: 32.0
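At the start of epoch 50 (batch 0 above) the script computes a validation loss over a fixed dev set, always the same 143649 frames, and reports the peak CUDA memory seen so far. The sketch below is a minimal, generic version of that bookkeeping: the model and dataloader objects are placeholders, and the frame-weighted averaging is an assumption based on the "over ... frames" wording in the log rather than the actual train.py code.

```python
# Hedged sketch of the validation pass implied by the log: average the loss over a
# fixed dev set, weighted by the number of frames in each batch, then report the
# peak CUDA memory. `model` and `dev_loader` are placeholders with an assumed
# interface; only the overall shape follows the log.
import torch

@torch.no_grad()
def compute_validation_loss(model, dev_loader) -> float:
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    for batch in dev_loader:
        loss, num_frames = model(batch)      # placeholder: per-batch loss and frame count
        tot_loss += float(loss) * num_frames
        tot_frames += num_frames
    model.train()
    return tot_loss / max(tot_frames, 1.0)

def log_peak_memory(device: torch.device) -> None:
    peak_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"Maximum memory allocated so far is {peak_mb}MB")
```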
2023-06-23 20:38:04,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=174593.33333333334, ans=0.125
2023-06-23 20:38:13,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.83 vs. limit=22.5
2023-06-23 20:38:25,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=174660.0, ans=0.05
2023-06-23 20:38:41,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=174726.66666666666, ans=0.125
2023-06-23 20:38:43,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=174726.66666666666, ans=0.125
2023-06-23 20:39:13,761 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.412e+02 1.691e+02 1.852e+02 2.068e+02 3.727e+02, threshold=3.704e+02, percent-clipped=0.0
2023-06-23 20:39:17,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=174860.0, ans=0.2
2023-06-23 20:39:17,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=174860.0, ans=0.125
2023-06-23 20:39:25,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.balancer1.prob, batch_count=174860.0, ans=0.125
2023-06-23 20:39:28,301 INFO [train.py:1008] (2/4) Epoch 50, batch 150, loss[loss=0.2066, simple_loss=0.2949, pruned_loss=0.05914, over 16688.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2713, pruned_loss=0.05535, over 2010501.93 frames. ], batch size: 59, lr: 6.68e-03, grad_scale: 32.0
2023-06-23 20:39:35,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=174926.66666666666, ans=0.2
2023-06-23 20:39:38,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=174926.66666666666, ans=0.125
2023-06-23 20:40:00,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=175060.0, ans=0.125
2023-06-23 20:40:25,985 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.35 vs. limit=22.5
2023-06-23 20:40:45,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=175193.33333333334, ans=0.0
2023-06-23 20:40:51,248 INFO [train.py:1008] (2/4) Epoch 50, batch 200, loss[loss=0.1934, simple_loss=0.2721, pruned_loss=0.05741, over 18599.00 frames. ], tot_loss[loss=0.1915, simple_loss=0.2711, pruned_loss=0.05596, over 2398853.01 frames. ], batch size: 80, lr: 6.68e-03, grad_scale: 16.0
2023-06-23 20:41:32,404 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.66 vs. limit=15.0
2023-06-23 20:41:56,899 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. limit=6.0
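The scaling.py:182 lines print the current value ("ans") of a ScheduledFloat: a hyper-parameter such as a skip rate or dropout probability whose value is a function of batch_count, so it changes as training progresses. The sketch below shows one simple way to implement such a schedule as piecewise-linear interpolation over breakpoints. The breakpoints used here are hypothetical and only demonstrate the mechanism; they are not the schedules of the modules named above.

```python
# Minimal piecewise-linear scheduled value: given (batch_count, value) breakpoints,
# interpolate between them and hold the endpoints outside the range. The breakpoints
# below are hypothetical and exist only to illustrate how a value can decay with
# batch_count, as the "ans=..." fields above do.
from bisect import bisect_right

class PiecewiseLinearSchedule:
    def __init__(self, points):
        # points: list of (batch_count, value) pairs, sorted by batch_count
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# Hypothetical example: a skip rate that starts at 0.5 and decays to 0.0.
skip_rate = PiecewiseLinearSchedule([(0, 0.5), (4000, 0.05), (16000, 0.0)])
print(skip_rate(0), skip_rate(2000), skip_rate(175000))  # 0.5 0.275 0.0
```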
2023-06-23 20:42:01,412 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.435e+02 1.719e+02 1.914e+02 2.210e+02 3.241e+02, threshold=3.828e+02, percent-clipped=0.0
2023-06-23 20:42:05,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=175526.66666666666, ans=0.0
2023-06-23 20:42:08,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=175526.66666666666, ans=0.1
2023-06-23 20:42:14,845 INFO [train.py:1008] (2/4) Epoch 50, batch 250, loss[loss=0.1951, simple_loss=0.2744, pruned_loss=0.05796, over 18444.00 frames. ], tot_loss[loss=0.1916, simple_loss=0.2717, pruned_loss=0.05572, over 2698047.20 frames. ], batch size: 77, lr: 6.67e-03, grad_scale: 16.0
2023-06-23 20:42:31,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=175660.0, ans=0.0
2023-06-23 20:42:39,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=175660.0, ans=0.1
2023-06-23 20:42:43,546 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.70 vs. limit=12.0
2023-06-23 20:42:46,063 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-23 20:43:01,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.feed_forward1.hidden_balancer.prob, batch_count=175726.66666666666, ans=0.125
2023-06-23 20:43:06,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=175793.33333333334, ans=0.125
2023-06-23 20:43:23,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=175860.0, ans=0.125
2023-06-23 20:43:38,653 INFO [train.py:1008] (2/4) Epoch 50, batch 300, loss[loss=0.1928, simple_loss=0.2639, pruned_loss=0.06081, over 20679.00 frames. ], tot_loss[loss=0.192, simple_loss=0.2715, pruned_loss=0.0563, over 2932655.60 frames. ], batch size: 211, lr: 6.66e-03, grad_scale: 16.0
2023-06-23 20:43:53,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=175993.33333333334, ans=0.1
2023-06-23 20:44:01,268 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=512, metric=3.60 vs. limit=15.0
2023-06-23 20:44:07,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=175993.33333333334, ans=0.0
2023-06-23 20:44:40,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.prob, batch_count=176126.66666666666, ans=0.125
2023-06-23 20:44:49,237 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.449e+02 1.741e+02 1.939e+02 2.314e+02 3.083e+02, threshold=3.877e+02, percent-clipped=0.0
2023-06-23 20:44:55,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=288, metric=10.78 vs. limit=10.0
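The grad_scale field in the per-batch lines moves between values such as 16.0 and 32.0, which is the behaviour of dynamic loss scaling in mixed-precision training: the scale is reduced when gradient overflows are detected and grows back after a run of successful steps. A minimal training step using PyTorch's torch.cuda.amp is sketched below; the model, optimizer and batch objects are placeholders, and the GradScaler is left at its default settings rather than whatever this run used.

```python
# Minimal fp16 training step with dynamic loss scaling (torch.cuda.amp). The
# alternating grad_scale values in the log (16.0 / 32.0) are what such a scaler
# reports as it halves on overflow and periodically grows again. `model`,
# `optimizer`, and `batch` are placeholders.
import torch

scaler = torch.cuda.amp.GradScaler(enabled=True)

def train_step(model, optimizer, batch) -> float:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch)                  # placeholder: returns a scalar loss
    scaler.scale(loss).backward()            # backprop through the scaled loss
    scaler.step(optimizer)                   # unscales grads; skips the step on inf/nan
    scaler.update()                          # adjust the scale for the next step
    return scaler.get_scale()                # the value reported as grad_scale
```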
2023-06-23 20:45:00,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.2.conv_module1.balancer1.prob, batch_count=176193.33333333334, ans=0.125
2023-06-23 20:45:03,835 INFO [train.py:1008] (2/4) Epoch 50, batch 350, loss[loss=0.1911, simple_loss=0.2765, pruned_loss=0.05282, over 19530.00 frames. ], tot_loss[loss=0.1915, simple_loss=0.2705, pruned_loss=0.05628, over 3101199.39 frames. ], batch size: 102, lr: 6.66e-03, grad_scale: 16.0
2023-06-23 20:45:19,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=176326.66666666666, ans=0.1
2023-06-23 20:45:54,760 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.99 vs. limit=10.0
2023-06-23 20:46:07,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=176460.0, ans=0.0
2023-06-23 20:46:18,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.bypass.scale_min, batch_count=176526.66666666666, ans=0.2
2023-06-23 20:46:25,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=176526.66666666666, ans=0.0
2023-06-23 20:46:27,884 INFO [train.py:1008] (2/4) Epoch 50, batch 400, loss[loss=0.1903, simple_loss=0.2629, pruned_loss=0.05883, over 20648.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.27, pruned_loss=0.05617, over 3272410.86 frames. ], batch size: 173, lr: 6.65e-03, grad_scale: 32.0
2023-06-23 20:46:33,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.nonlin_attention.balancer.min_positive, batch_count=176593.33333333334, ans=0.05
2023-06-23 20:46:41,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=176593.33333333334, ans=0.1
2023-06-23 20:47:29,886 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=384, metric=3.55 vs. limit=12.0
2023-06-23 20:47:36,789 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.404e+02 1.753e+02 1.915e+02 2.152e+02 3.496e+02, threshold=3.829e+02, percent-clipped=0.0
2023-06-23 20:47:47,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=176860.0, ans=0.0
2023-06-23 20:47:51,021 INFO [train.py:1008] (2/4) Epoch 50, batch 450, loss[loss=0.1949, simple_loss=0.2658, pruned_loss=0.06196, over 19971.00 frames. ], tot_loss[loss=0.1909, simple_loss=0.2699, pruned_loss=0.05594, over 3394734.77 frames. ], batch size: 126, lr: 6.65e-03, grad_scale: 32.0
2023-06-23 20:48:14,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=176993.33333333334, ans=0.125
2023-06-23 20:48:18,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=176993.33333333334, ans=0.125
2023-06-23 20:49:12,813 INFO [train.py:1008] (2/4) Epoch 50, batch 500, loss[loss=0.2018, simple_loss=0.2839, pruned_loss=0.05981, over 16313.00 frames. ], tot_loss[loss=0.1912, simple_loss=0.2703, pruned_loss=0.05601, over 3479927.76 frames. ], batch size: 52, lr: 6.64e-03, grad_scale: 32.0
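The tot_loss fields above are reported "over" a frame count that grows with every logged batch (roughly 3.1M frames at batch 350, 3.4M at batch 450, 3.5M at batch 500), i.e. the per-batch statistics are aggregated weighted by how many frames each batch contributed. A frame-weighted accumulator of that shape is sketched below using two per-batch values from the entries above; the real training script may additionally smooth or periodically reset these statistics, which this sketch does not attempt to reproduce.

```python
# Frame-weighted accumulator in the spirit of the "tot_loss[... over N frames]" lines:
# each batch contributes its loss weighted by its number of frames. The actual script
# may smooth or reset these statistics; only the basic shape is shown here.
class RunningLoss:
    def __init__(self):
        self.weighted_loss = 0.0
        self.frames = 0.0

    def update(self, loss: float, num_frames: float) -> None:
        self.weighted_loss += loss * num_frames
        self.frames += num_frames

    @property
    def value(self) -> float:
        return self.weighted_loss / max(self.frames, 1.0)

tracker = RunningLoss()
tracker.update(loss=0.1903, num_frames=20648.0)   # numbers from the batch 400 entry above
tracker.update(loss=0.1949, num_frames=19971.0)   # numbers from the batch 450 entry above
print(f"{tracker.value:.4f} over {tracker.frames:.2f} frames")
```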
2023-06-23 20:49:36,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_skip_rate, batch_count=177326.66666666666, ans=0.0
2023-06-23 20:50:03,815 INFO [train.py:1221] (2/4) Done!