/home/aiscuser/.local/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
2023/07/19 14:34:12 WARNING mlflow.utils.autologging_utils: You are using an unsupported version of transformers. If you encounter errors during autologging, try upgrading / downgrading transformers to a supported version, or try upgrading MLflow.
2023/07/19 14:34:13 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2023/07/19 14:34:13 INFO mlflow.tracking.fluent: Autologging successfully enabled for transformers.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging results (for instance --report_to none).
Downloading and preparing dataset glue/qnli to /home/aiscuser/.cache/huggingface/datasets/glue/qnli/1.0.0/a420f5e518f42454003587c47467370329f9fc0c6508d1ae0c45b58ea266a353...
Downloading data: 0%| | 0.00/10.6M [00:00
Training Arguments
TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=500,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=40,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/mnt/data/device-aware-bert/token_pruning/experiments/QNLI/reproduce1/s0.58_lr2e-05_reglr0.01_alpha0.0002_warmup10_bin50/runs/Jul19_14-34-14_node-0,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=100,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=40.0,
optim=OptimizerNames.ADAMW_HF,
output_dir=/mnt/data/device-aware-bert/token_pruning/experiments/QNLI/reproduce1/s0.58_lr2e-05_reglr0.01_alpha0.0002_warmup10_bin50,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=32,
per_device_train_batch_size=32,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
remove_unused_columns=True,
report_to=['mlflow'],
resume_from_checkpoint=None,
run_name=/mnt/data/device-aware-bert/token_pruning/experiments/QNLI/reproduce1/s0.58_lr2e-05_reglr0.01_alpha0.0002_warmup10_bin50,
save_on_each_node=False,
save_steps=0,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=57,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
Additional Arguments
AdditionalArguments(test=False, ex_name='s0.58_lr2e-05_reglr0.01_alpha0.0002_warmup10_bin50', pruning_type='token+pruner', reg_learning_rate=0.01, scheduler_type='linear', freeze_embeddings=True, pretrained_pruned_model=None, droprate_init=0.01, temperature=0.6666666666666666, prepruning_finetune_epochs=1, lagrangian_warmup_epochs=10, target_sparsity=0.58, sparsity_epsilon=0, distillation_path='/mnt/data/device-aware-bert/token_pruning/teachers/QNLI', do_distill=True, do_layer_distill=False, layer_distill_version=4, distill_loss_alpha=0.9, distill_ce_loss_alpha=0.0002, distill_temp=2.0, use_mac_l0=True, prune_location=[2, 3, 4, 5, 6, 7, 8, 9, 10, 11], bin_num=50, topk=20)
----------------------------------------------------------------------
time: 2023-07-19 14:35:57
Evaluating: accuracy: 0.9165, eval_loss: 0.2978, step: 0
lambda_1: 0.0000, lambda_2: 0.0000 lambda_3: 0.0000
Starting l0 regularization! using , temperature: 0.67, init drop rate: 0.01
token_loga shape: [10, 50]
prune location: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
NDCG TOPK= 20
loss: 0.151606, lagrangian_loss: -0.001826, attention_score_distillation_loss: 0.001971
----------------------------------------------------------------------
time: 2023-07-19 14:38:49
Evaluating: accuracy: 0.912, eval_loss: 0.3301, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.6069, target_sparsity: 0.0088, step: 500
lambda_1: 0.6712, lambda_2: 5.6444 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
loss: 0.249322, lagrangian_loss: -0.001798, attention_score_distillation_loss: 0.001956
loss: 0.021923, lagrangian_loss: 0.025116, attention_score_distillation_loss: 0.001941
----------------------------------------------------------------------
time: 2023-07-19 14:41:40
Evaluating: accuracy: 0.9112, eval_loss: 0.3331, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0136, expected_sparsity: 0.0136, expected_sequence_sparsity: 0.6122, target_sparsity: 0.0177, step: 1000
lambda_1: -5.9906, lambda_2: 14.0124 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1.
0.99 0.92] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 01111111111111111111111111111111111111110111110100 loss: 0.039104, lagrangian_loss: -0.028921, attention_score_distillation_loss: 0.001926 loss: 0.021610, lagrangian_loss: -0.007255, attention_score_distillation_loss: 0.001912 ---------------------------------------------------------------------- time: 2023-07-19 14:44:30 Evaluating: accuracy: 0.907, eval_loss: 0.3781, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0191, expected_sparsity: 0.019, expected_sequence_sparsity: 0.6144, target_sparsity: 0.0265, step: 1500 lambda_1: 1.4706, lambda_2: 22.5720 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 1. 1. 1. 0.99 0.88] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 01111111111111111111111111111111111011110111100100 loss: 0.021560, lagrangian_loss: -0.000941, attention_score_distillation_loss: 0.001896 loss: 0.228817, lagrangian_loss: 0.012241, attention_score_distillation_loss: 0.001879 ---------------------------------------------------------------------- time: 2023-07-19 14:47:24 Evaluating: accuracy: 0.9063, eval_loss: 0.3578, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0245, expected_sparsity: 0.0245, expected_sequence_sparsity: 0.6165, target_sparsity: 0.0354, step: 2000 lambda_1: -3.3169, lambda_2: 26.2438 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 1. 1. 1. 
0.99 0.84] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 01111111111111111111111111111111111011100111000100 loss: 0.135710, lagrangian_loss: 0.013382, attention_score_distillation_loss: 0.001867 loss: 0.136615, lagrangian_loss: -0.005482, attention_score_distillation_loss: 0.001851 ---------------------------------------------------------------------- time: 2023-07-19 14:50:16 Evaluating: accuracy: 0.909, eval_loss: 0.3409, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0327, expected_sparsity: 0.0326, expected_sequence_sparsity: 0.6197, target_sparsity: 0.0443, step: 2500 lambda_1: -1.6806, lambda_2: 28.0291 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.98 0.99 1. 1. 0.98 0.77] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 01111111111111111111111111111111001011100111000000 loss: 0.061354, lagrangian_loss: -0.011220, attention_score_distillation_loss: 0.001835 loss: 0.048383, lagrangian_loss: -0.000103, attention_score_distillation_loss: 0.001821 ---------------------------------------------------------------------- time: 2023-07-19 14:53:11 Evaluating: accuracy: 0.909, eval_loss: 0.3394, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0327, expected_sparsity: 0.0326, expected_sequence_sparsity: 0.6197, target_sparsity: 0.0531, step: 3000 lambda_1: -1.2986, lambda_2: 30.1448 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 1. 1. 1. 0.98 0.77] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 01111111111111111111111111111111001011100111000000 loss: 0.039288, lagrangian_loss: 0.010155, attention_score_distillation_loss: 0.001804 loss: 0.246656, lagrangian_loss: 0.010731, attention_score_distillation_loss: 0.001787 ETA: 12:04:03 | Epoch 0 finished. Took 1113.93 seconds. 
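The lagrangian_loss values above swing between small positive and negative numbers while lambda_1 and lambda_2 drift away from zero. This is consistent with the two-multiplier Lagrangian used in CoFi-style L0 pruning, which penalizes the gap between the expected sparsity and the warmed-up target. A minimal sketch of that term, assuming the standard form (class and argument names here are illustrative, not this repo's exact code):

```python
import torch

class SparsityLagrangian(torch.nn.Module):
    """Lagrangian penalty lambda_1 * gap + lambda_2 * gap**2 on the sparsity gap.

    The multipliers are parameters with their own optimizer (the log's
    reg_learning_rate=0.01) and act as a restoring force: once expected
    sparsity tracks the target, the gap is small and the printed
    lagrangian_loss hovers around zero, going negative whenever the signs
    of the gap and lambda_1 disagree.
    """

    def __init__(self):
        super().__init__()
        self.lambda_1 = torch.nn.Parameter(torch.tensor(0.0))
        self.lambda_2 = torch.nn.Parameter(torch.tensor(0.0))

    def forward(self, expected_sparsity: torch.Tensor,
                target_sparsity: float) -> torch.Tensor:
        gap = expected_sparsity - target_sparsity
        return self.lambda_1 * gap + self.lambda_2 * gap.pow(2)
```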
---------------------------------------------------------------------- time: 2023-07-19 14:56:04 Evaluating: accuracy: 0.9132, eval_loss: 0.3481, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0529, expected_sparsity: 0.0524, expected_sequence_sparsity: 0.6275, target_sparsity: 0.062, step: 3500 lambda_1: -2.9663, lambda_2: 32.3318 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.98 1. 1. 0.99 0.95 0.72] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.72] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.66] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111101111111111101111010 01111111111111111111110111111101001011100111000000 loss: 0.167372, lagrangian_loss: -0.016864, attention_score_distillation_loss: 0.001776 loss: 0.348476, lagrangian_loss: -0.000200, attention_score_distillation_loss: 0.001758 ---------------------------------------------------------------------- time: 2023-07-19 14:58:57 Evaluating: accuracy: 0.9116, eval_loss: 0.3552, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0529, expected_sparsity: 0.0524, expected_sequence_sparsity: 0.6275, target_sparsity: 0.0708, step: 4000 lambda_1: -1.6665, lambda_2: 35.0824 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.98 1. 1. 0.99 0.95 0.72] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.72] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.66] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111101111111111101111010 01111111111111111111110111111101001011100111000000 loss: 0.028537, lagrangian_loss: 0.007263, attention_score_distillation_loss: 0.001742 loss: 0.115705, lagrangian_loss: -0.003871, attention_score_distillation_loss: 0.001729 ---------------------------------------------------------------------- time: 2023-07-19 15:01:51 Evaluating: accuracy: 0.9088, eval_loss: 0.3366, token_prune_loc: [False, False, False, False, True, False, False, False, True, True], macs_sparsity: 0.0968, expected_sparsity: 0.0943, expected_sequence_sparsity: 0.6441, target_sparsity: 0.0797, step: 4500 lambda_1: -0.7962, lambda_2: 36.9589 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.96 1. 1. 
0.98 0.94 0.7 ] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 1.0, 1.0, 1.0, 0.92, 0.7] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.92, 0.92, 0.85, 0.59] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111111111110100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111101111111111101111010 01011111111111111111110111111101001011100111000000 loss: 0.149126, lagrangian_loss: -0.001737, attention_score_distillation_loss: 0.001714 loss: 0.121980, lagrangian_loss: 0.009807, attention_score_distillation_loss: 0.001698 ---------------------------------------------------------------------- time: 2023-07-19 15:04:42 Evaluating: accuracy: 0.9072, eval_loss: 0.3316, token_prune_loc: [False, False, False, False, True, False, False, False, True, True], macs_sparsity: 0.0995, expected_sparsity: 0.0966, expected_sequence_sparsity: 0.645, target_sparsity: 0.0885, step: 5000 lambda_1: -3.8493, lambda_2: 38.8414 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.96 1. 1. 0.98 0.93 0.69] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 1.0, 1.0, 1.0, 0.92, 0.68] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.92, 0.92, 0.85, 0.58] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111111111101100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111101111111111101111010 01011111111111111111110110111101001011100111000000 loss: 0.522274, lagrangian_loss: 0.000473, attention_score_distillation_loss: 0.001683 loss: 0.259186, lagrangian_loss: -0.000771, attention_score_distillation_loss: 0.001668 ---------------------------------------------------------------------- time: 2023-07-19 15:07:35 Evaluating: accuracy: 0.907, eval_loss: 0.3364, token_prune_loc: [False, False, False, False, True, False, False, False, True, True], macs_sparsity: 0.0995, expected_sparsity: 0.0966, expected_sequence_sparsity: 0.645, target_sparsity: 0.0974, step: 5500 lambda_1: 0.0348, lambda_2: 44.5078 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.95 1. 1. 
0.98 0.94 0.69] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 1.0, 1.0, 1.0, 0.92, 0.68] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.92, 0.92, 0.85, 0.58] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111111111110100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111101111111111101111010 01011111111111111111110111111101001011100011000000 loss: 0.122168, lagrangian_loss: 0.001977, attention_score_distillation_loss: 0.001651 loss: 0.342982, lagrangian_loss: 0.010217, attention_score_distillation_loss: 0.001638 ---------------------------------------------------------------------- time: 2023-07-19 15:10:27 Evaluating: accuracy: 0.9032, eval_loss: 0.3616, token_prune_loc: [False, False, False, False, True, False, False, True, True, True], macs_sparsity: 0.1257, expected_sparsity: 0.1245, expected_sequence_sparsity: 0.656, target_sparsity: 0.1063, step: 6000 lambda_1: -2.0497, lambda_2: 48.2584 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.92 0.99 1. 0.97 0.92 0.67] infer remain: [1.0, 1.0, 1.0, 1.0, 0.9, 1.0, 1.0, 0.94, 0.9, 0.66] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.9, 0.85, 0.76, 0.5] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111111111100100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111011111100 11111111111111111111111111111101111111111101110010 01011111111111111111110110111101001011100011000000 loss: 0.628217, lagrangian_loss: -0.010831, attention_score_distillation_loss: 0.001621 loss: 0.434961, lagrangian_loss: 0.000020, attention_score_distillation_loss: 0.001606 ---------------------------------------------------------------------- time: 2023-07-19 15:13:20 Evaluating: accuracy: 0.9085, eval_loss: 0.3411, token_prune_loc: [False, False, False, False, True, False, False, True, True, True], macs_sparsity: 0.1257, expected_sparsity: 0.1245, expected_sequence_sparsity: 0.656, target_sparsity: 0.1151, step: 6500 lambda_1: -1.6943, lambda_2: 50.6412 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.93 0.99 1. 0.97 0.92 0.67] infer remain: [1.0, 1.0, 1.0, 1.0, 0.9, 1.0, 1.0, 0.94, 0.9, 0.66] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.9, 0.85, 0.76, 0.5] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111111111100100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111011111100 11111111111111111111111111111101111111111101110010 01011111111111111111110110111101001011100011000000 loss: 0.028765, lagrangian_loss: 0.009957, attention_score_distillation_loss: 0.001592 ETA: 11:54:39 | Epoch 1 finished. Took 1142.86 seconds. 
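target_sparsity in the eval lines grows by roughly 0.0089 every 500 steps: it is warmed up linearly toward the configured target_sparsity=0.58 over lagrangian_warmup_epochs=10, which at QNLI's ~104,743 training examples and batch size 32 is about 3,274 steps per epoch. A sketch of that schedule (the function name and exact step count are inferred from the printed values, not taken from the code):

```python
def scheduled_target_sparsity(step: int,
                              final_sparsity: float = 0.58,
                              warmup_steps: int = 10 * 3274) -> float:
    """Linear sparsity warm-up: 0 at step 0, final_sparsity after warm-up.

    Reproduces the logged targets up to rounding, e.g. ~0.0089 at step 500,
    ~0.0620 at step 3500, and ~0.2923 at step 16500.
    """
    return final_sparsity * min(1.0, step / warmup_steps)
```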
loss: 0.329838, lagrangian_loss: -0.003458, attention_score_distillation_loss: 0.001577 ---------------------------------------------------------------------- time: 2023-07-19 15:16:12 Evaluating: accuracy: 0.9096, eval_loss: 0.3387, token_prune_loc: [False, False, False, False, True, False, False, True, True, True], macs_sparsity: 0.1257, expected_sparsity: 0.1245, expected_sequence_sparsity: 0.656, target_sparsity: 0.124, step: 7000 lambda_1: -1.1721, lambda_2: 52.2661 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.91 0.99 1. 0.95 0.9 0.66] infer remain: [1.0, 1.0, 1.0, 1.0, 0.9, 1.0, 1.0, 0.94, 0.9, 0.66] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.9, 0.85, 0.76, 0.5] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111111111100100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111011111100 11111111111111111111111111111101111111111101110010 01011111111111111111110110111101001011100011000000 loss: 0.370096, lagrangian_loss: -0.002987, attention_score_distillation_loss: 0.001562 loss: 0.029515, lagrangian_loss: 0.001664, attention_score_distillation_loss: 0.001547 ---------------------------------------------------------------------- time: 2023-07-19 15:19:09 Evaluating: accuracy: 0.9046, eval_loss: 0.3777, token_prune_loc: [False, False, False, False, True, False, False, True, True, True], macs_sparsity: 0.1443, expected_sparsity: 0.1427, expected_sequence_sparsity: 0.6631, target_sparsity: 0.1328, step: 7500 lambda_1: -2.2503, lambda_2: 53.1401 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.9 0.99 1. 0.95 0.9 0.65] infer remain: [1.0, 1.0, 1.0, 1.0, 0.88, 1.0, 1.0, 0.92, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.88, 0.81, 0.71, 0.46] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111111111100000 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111011111000 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.124168, lagrangian_loss: 0.001704, attention_score_distillation_loss: 0.001532 loss: 0.047143, lagrangian_loss: -0.002382, attention_score_distillation_loss: 0.001516 ---------------------------------------------------------------------- time: 2023-07-19 15:22:02 Evaluating: accuracy: 0.9063, eval_loss: 0.3526, token_prune_loc: [False, False, False, False, True, False, False, True, True, True], macs_sparsity: 0.1443, expected_sparsity: 0.1427, expected_sequence_sparsity: 0.6631, target_sparsity: 0.1417, step: 8000 lambda_1: -1.5352, lambda_2: 53.3740 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.89 0.99 1. 
0.94 0.89 0.65] infer remain: [1.0, 1.0, 1.0, 1.0, 0.88, 1.0, 1.0, 0.92, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.88, 0.81, 0.71, 0.46] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111111111100000 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111011111000 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.096316, lagrangian_loss: 0.000284, attention_score_distillation_loss: 0.001501 loss: 0.049778, lagrangian_loss: 0.003615, attention_score_distillation_loss: 0.001486 ---------------------------------------------------------------------- time: 2023-07-19 15:24:54 Evaluating: accuracy: 0.9052, eval_loss: 0.378, token_prune_loc: [False, False, False, False, True, False, False, True, True, True], macs_sparsity: 0.1536, expected_sparsity: 0.1519, expected_sequence_sparsity: 0.6668, target_sparsity: 0.1505, step: 8500 lambda_1: -1.1228, lambda_2: 55.3826 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.88 0.96 1. 0.93 0.89 0.64] infer remain: [1.0, 1.0, 1.0, 1.0, 0.86, 1.0, 1.0, 0.92, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.86, 0.86, 0.79, 0.7, 0.45] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101111111100000 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111011110100 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.244794, lagrangian_loss: -0.003341, attention_score_distillation_loss: 0.001469 loss: 0.225897, lagrangian_loss: 0.004802, attention_score_distillation_loss: 0.001457 ---------------------------------------------------------------------- time: 2023-07-19 15:27:46 Evaluating: accuracy: 0.9065, eval_loss: 0.3857, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.1886, expected_sparsity: 0.1842, expected_sequence_sparsity: 0.6795, target_sparsity: 0.1594, step: 9000 lambda_1: -2.3467, lambda_2: 57.3825 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.89 0.94 1. 
0.93 0.89 0.64] infer remain: [1.0, 1.0, 1.0, 1.0, 0.86, 0.9, 1.0, 0.92, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.77, 0.77, 0.71, 0.63, 0.4] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101111111100000 11111111111111111111111111110111110111101101111110 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111011110100 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.704547, lagrangian_loss: -0.005717, attention_score_distillation_loss: 0.001441 loss: 0.104996, lagrangian_loss: 0.001554, attention_score_distillation_loss: 0.001426 ---------------------------------------------------------------------- time: 2023-07-19 15:30:38 Evaluating: accuracy: 0.9076, eval_loss: 0.3862, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.1946, expected_sparsity: 0.1941, expected_sequence_sparsity: 0.6834, target_sparsity: 0.1683, step: 9500 lambda_1: -3.6773, lambda_2: 61.1385 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.89 0.93 1. 0.92 0.88 0.64] infer remain: [1.0, 1.0, 1.0, 1.0, 0.86, 0.88, 1.0, 0.9, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.76, 0.76, 0.68, 0.6, 0.38] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101111111100000 11111111111111111111111111110111110111101101101110 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.325564, lagrangian_loss: 0.002754, attention_score_distillation_loss: 0.001411 loss: 0.064415, lagrangian_loss: -0.000777, attention_score_distillation_loss: 0.001395 ETA: 11:33:46 | Epoch 2 finished. Took 1118.35 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:33:32 Evaluating: accuracy: 0.9026, eval_loss: 0.377, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.1946, expected_sparsity: 0.1941, expected_sequence_sparsity: 0.6834, target_sparsity: 0.1771, step: 10000 lambda_1: -2.8310, lambda_2: 65.7308 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.88 0.91 1. 
0.91 0.88 0.64] infer remain: [1.0, 1.0, 1.0, 1.0, 0.86, 0.88, 1.0, 0.9, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.76, 0.76, 0.68, 0.6, 0.38] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101111111100000 11111111111111111111111111110111110111101101101110 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.727551, lagrangian_loss: 0.002070, attention_score_distillation_loss: 0.001380 loss: 0.024805, lagrangian_loss: -0.003132, attention_score_distillation_loss: 0.001365 ---------------------------------------------------------------------- time: 2023-07-19 15:36:27 Evaluating: accuracy: 0.9063, eval_loss: 0.3844, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.1979, expected_sparsity: 0.1976, expected_sequence_sparsity: 0.6848, target_sparsity: 0.186, step: 10500 lambda_1: -2.0482, lambda_2: 68.1585 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.88 0.91 1. 0.9 0.88 0.64] infer remain: [1.0, 1.0, 1.0, 1.0, 0.86, 0.88, 1.0, 0.88, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.76, 0.76, 0.67, 0.59, 0.38] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101111111100000 11111111111111111111111111110111110111101101101110 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.017083, lagrangian_loss: 0.006027, attention_score_distillation_loss: 0.001350 loss: 0.179024, lagrangian_loss: -0.000477, attention_score_distillation_loss: 0.001334 ---------------------------------------------------------------------- time: 2023-07-19 15:39:18 Evaluating: accuracy: 0.8986, eval_loss: 0.4046, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.2072, expected_sparsity: 0.2039, expected_sequence_sparsity: 0.6873, target_sparsity: 0.1948, step: 11000 lambda_1: -0.9741, lambda_2: 70.7487 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.87 0.89 1. 
0.88 0.88 0.64] infer remain: [1.0, 1.0, 1.0, 1.0, 0.86, 0.86, 1.0, 0.88, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.74, 0.65, 0.57, 0.37] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101111111100000 11111111111111111111111111110111110110101101101110 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.043905, lagrangian_loss: -0.001082, attention_score_distillation_loss: 0.001319 loss: 0.537168, lagrangian_loss: 0.007126, attention_score_distillation_loss: 0.001306 ---------------------------------------------------------------------- time: 2023-07-19 15:42:13 Evaluating: accuracy: 0.9054, eval_loss: 0.3918, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.2246, expected_sparsity: 0.2212, expected_sequence_sparsity: 0.6941, target_sparsity: 0.2037, step: 11500 lambda_1: -3.1719, lambda_2: 74.2290 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.86 0.88 0.99 0.87 0.88 0.64] infer remain: [1.0, 1.0, 1.0, 1.0, 0.84, 0.84, 1.0, 0.86, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.71, 0.71, 0.61, 0.53, 0.34] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101101111100000 11111111111111111111111111110111110110101101101100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111001011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.176914, lagrangian_loss: -0.012776, attention_score_distillation_loss: 0.001289 loss: 0.117696, lagrangian_loss: 0.000195, attention_score_distillation_loss: 0.001274 ---------------------------------------------------------------------- time: 2023-07-19 15:45:04 Evaluating: accuracy: 0.9046, eval_loss: 0.4019, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.2246, expected_sparsity: 0.2212, expected_sequence_sparsity: 0.6941, target_sparsity: 0.2125, step: 12000 lambda_1: -3.5314, lambda_2: 80.4260 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.99 0.86 0.87 0.99 0.86 0.88 0.63] infer remain: [1.0, 1.0, 1.0, 1.0, 0.84, 0.84, 1.0, 0.86, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.71, 0.71, 0.61, 0.53, 0.34] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101011111100000 11111111111111111111111111110111110110101101101100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111001011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.043688, lagrangian_loss: 0.006887, attention_score_distillation_loss: 0.001259 loss: 0.171558, lagrangian_loss: -0.009203, attention_score_distillation_loss: 0.001244 ---------------------------------------------------------------------- time: 2023-07-19 15:47:59 Evaluating: accuracy: 0.9035, eval_loss: 0.4037, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.2273, expected_sparsity: 0.2226, expected_sequence_sparsity: 0.6947, target_sparsity: 0.2214, step: 12500 lambda_1: -0.7780, lambda_2: 83.1429 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.84 0.86 0.99 0.86 0.88 0.63] infer remain: [1.0, 1.0, 1.0, 1.0, 0.84, 0.84, 1.0, 0.86, 0.88, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.71, 0.71, 0.61, 0.53, 0.33] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101011111100000 11111111111111111111111111110111110110101101101100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111001011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110010111001001011100011000000 loss: 0.121763, lagrangian_loss: 0.001610, attention_score_distillation_loss: 0.001215 loss: 0.370005, lagrangian_loss: 0.000966, attention_score_distillation_loss: 0.001214 ---------------------------------------------------------------------- time: 2023-07-19 15:50:51 Evaluating: accuracy: 0.8993, eval_loss: 0.3841, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.2404, expected_sparsity: 0.2362, expected_sequence_sparsity: 0.7, target_sparsity: 0.2303, step: 13000 lambda_1: -2.3161, lambda_2: 84.2666 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.83 0.85 0.98 0.86 0.88 0.62] infer remain: [1.0, 1.0, 1.0, 1.0, 0.82, 0.82, 1.0, 0.86, 0.88, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.67, 0.67, 0.58, 0.51, 0.32] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101001111100000 11111111111111111111111111110111110110001101101100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111001011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110010111001001011100011000000 loss: 0.358073, lagrangian_loss: -0.002303, attention_score_distillation_loss: 0.001187 ETA: 11:18:16 | Epoch 3 finished. Took 1146.7 seconds. 
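In every block, layerwise remain is the running product of the infer remain ratios over the 12 encoder layers: pruning happens at prune_location layers 2-11, and a token dropped at one layer stays dropped in all later ones. A small check against the step-13000 line above, with illustrative names:

```python
def layerwise_remain(infer_remain, prune_location, num_layers=12):
    """Cumulative token-keep ratio per layer: per-layer ratios multiply
    along the depth because token pruning is irreversible."""
    ratio_at = dict(zip(prune_location, infer_remain))
    out, running = [], 1.0
    for layer in range(num_layers):
        running *= ratio_at.get(layer, 1.0)
        out.append(round(running, 2))
    return out

print(layerwise_remain(
    [1.0, 1.0, 1.0, 1.0, 0.82, 0.82, 1.0, 0.86, 0.88, 0.62],
    [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]))
# -> [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.67, 0.67, 0.58, 0.51, 0.32]
```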
loss: 0.053012, lagrangian_loss: -0.000268, attention_score_distillation_loss: 0.001184 ---------------------------------------------------------------------- time: 2023-07-19 15:53:42 Evaluating: accuracy: 0.8995, eval_loss: 0.4072, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.2404, expected_sparsity: 0.2362, expected_sequence_sparsity: 0.7, target_sparsity: 0.2391, step: 13500 lambda_1: -1.1716, lambda_2: 85.2143 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.83 0.84 0.98 0.86 0.88 0.62] infer remain: [1.0, 1.0, 1.0, 1.0, 0.82, 0.82, 1.0, 0.86, 0.88, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.67, 0.67, 0.58, 0.51, 0.32] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101001111100000 11111111111111111111111111110111110110001101101100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111001011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110010111001001011100011000000 loss: 0.137737, lagrangian_loss: 0.001275, attention_score_distillation_loss: 0.001169 loss: 0.155063, lagrangian_loss: -0.000272, attention_score_distillation_loss: 0.001152 ---------------------------------------------------------------------- time: 2023-07-19 15:56:35 Evaluating: accuracy: 0.9026, eval_loss: 0.4081, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.2513, expected_sparsity: 0.2495, expected_sequence_sparsity: 0.7052, target_sparsity: 0.248, step: 14000 lambda_1: -0.9965, lambda_2: 87.0314 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.82 0.83 0.97 0.86 0.88 0.62] infer remain: [1.0, 1.0, 1.0, 1.0, 0.8, 0.8, 1.0, 0.86, 0.88, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.64, 0.64, 0.55, 0.48, 0.3] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111010101001111100000 11111111111111111111111111110111110100001101101100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111001011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110010111001001011100011000000 loss: 0.075654, lagrangian_loss: -0.000343, attention_score_distillation_loss: 0.001137 loss: 0.389590, lagrangian_loss: 0.000534, attention_score_distillation_loss: 0.001122 ---------------------------------------------------------------------- time: 2023-07-19 15:59:29 Evaluating: accuracy: 0.9043, eval_loss: 0.3973, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2655, expected_sparsity: 0.2636, expected_sequence_sparsity: 0.7108, target_sparsity: 0.2568, step: 14500 lambda_1: -3.1236, lambda_2: 88.0091 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.99 0.81 0.81 0.96 0.86 0.88 0.61] infer remain: [1.0, 1.0, 1.0, 1.0, 0.8, 0.8, 0.92, 0.86, 0.88, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.64, 0.59, 0.51, 0.45, 0.28] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111010101001111100000 11111111111111111111111111110111110100001101101100 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110010111001001011100011000000 loss: 0.376011, lagrangian_loss: -0.002480, attention_score_distillation_loss: 0.001104 loss: 0.168905, lagrangian_loss: -0.000104, attention_score_distillation_loss: 0.001091 ---------------------------------------------------------------------- time: 2023-07-19 16:02:26 Evaluating: accuracy: 0.9001, eval_loss: 0.3743, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2747, expected_sparsity: 0.2702, expected_sequence_sparsity: 0.7134, target_sparsity: 0.2657, step: 15000 lambda_1: -2.7815, lambda_2: 91.4272 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.99 0.81 0.8 0.95 0.86 0.87 0.61] infer remain: [1.0, 1.0, 1.0, 1.0, 0.8, 0.78, 0.92, 0.86, 0.88, 0.6] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.62, 0.57, 0.49, 0.43, 0.26] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111010101001111100000 11111111111111111111111111110111010100001101101100 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011110000 01111111111111111111111111111101111111111101110010 01011111111111111110110010111001001011100011000000 loss: 0.047740, lagrangian_loss: 0.003428, attention_score_distillation_loss: 0.001077 loss: 0.263340, lagrangian_loss: -0.006191, attention_score_distillation_loss: 0.001062 ---------------------------------------------------------------------- time: 2023-07-19 16:05:19 Evaluating: accuracy: 0.8997, eval_loss: 0.4169, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2774, expected_sparsity: 0.2718, expected_sequence_sparsity: 0.714, target_sparsity: 0.2746, step: 15500 lambda_1: -0.5111, lambda_2: 94.1950 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.99 0.8 0.79 0.95 0.85 0.87 0.6 ] infer remain: [1.0, 1.0, 1.0, 1.0, 0.8, 0.78, 0.92, 0.86, 0.86, 0.6] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.62, 0.57, 0.49, 0.42, 0.25] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111010101001111100000 11111111111111111111111111110111010100001101101100 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011110000 00111111111111111111111111111101111111111101110010 01011111111111111110110010111001001011100011000000 loss: 0.133294, lagrangian_loss: 0.001181, attention_score_distillation_loss: 0.001046 loss: 0.044360, lagrangian_loss: -0.001412, attention_score_distillation_loss: 0.001031 ---------------------------------------------------------------------- time: 2023-07-19 16:08:12 Evaluating: accuracy: 0.8929, eval_loss: 0.403, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2889, expected_sparsity: 0.2862, expected_sequence_sparsity: 0.7197, target_sparsity: 0.2834, step: 16000 lambda_1: -1.3668, lambda_2: 98.4045 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.99 0.78 0.78 0.95 0.85 0.86 0.59] infer remain: [1.0, 1.0, 1.0, 1.0, 0.78, 0.76, 0.92, 0.84, 0.86, 0.6] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.59, 0.55, 0.46, 0.39, 0.24] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111010101000111100000 11111111111111111111111111110111010100001101100100 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111101110010 01011111111111111110110010111001001011100011000000 loss: 0.151958, lagrangian_loss: -0.001073, attention_score_distillation_loss: 0.001016 loss: 0.405352, lagrangian_loss: 0.003347, attention_score_distillation_loss: 0.001001 ETA: 10:58:21 | Epoch 4 finished. Took 1121.19 seconds. ---------------------------------------------------------------------- time: 2023-07-19 16:11:05 Evaluating: accuracy: 0.9008, eval_loss: 0.4311, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2916, expected_sparsity: 0.2872, expected_sequence_sparsity: 0.7201, target_sparsity: 0.2923, step: 16500 lambda_1: -2.1987, lambda_2: 99.9726 lambda_3: 0.0000 train remain: [1. 1. 
0.97 0.99 0.78 0.78 0.94 0.85 0.86 0.58] infer remain: [1.0, 1.0, 1.0, 1.0, 0.78, 0.76, 0.92, 0.84, 0.86, 0.58] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.59, 0.55, 0.46, 0.39, 0.23] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111010101000111100000 11111111111111111111111111110111010100001101100100 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111101110010 00011111111111111110110010111001001011100011000000 loss: 0.588161, lagrangian_loss: -0.005038, attention_score_distillation_loss: 0.000986 loss: 0.295328, lagrangian_loss: 0.004551, attention_score_distillation_loss: 0.000971 ---------------------------------------------------------------------- time: 2023-07-19 16:13:59 Evaluating: accuracy: 0.8986, eval_loss: 0.4349, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3014, expected_sparsity: 0.2988, expected_sequence_sparsity: 0.7247, target_sparsity: 0.3011, step: 17000 lambda_1: -2.7316, lambda_2: 102.9767 lambda_3: 0.0000 train remain: [1. 1. 0.97 0.99 0.77 0.76 0.94 0.85 0.85 0.58] infer remain: [1.0, 1.0, 1.0, 1.0, 0.76, 0.74, 0.92, 0.84, 0.86, 0.58] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.56, 0.52, 0.43, 0.37, 0.22] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111010101000110100000 11111111111111111111111111110111010100000101100100 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111101110010 00011111111111111110110010111001001011100011000000 loss: 0.020533, lagrangian_loss: -0.003328, attention_score_distillation_loss: 0.000954 loss: 0.196831, lagrangian_loss: 0.001611, attention_score_distillation_loss: 0.000940 ---------------------------------------------------------------------- time: 2023-07-19 16:16:53 Evaluating: accuracy: 0.8871, eval_loss: 0.4462, token_prune_loc: [False, False, True, False, True, True, True, True, True, True], macs_sparsity: 0.3287, expected_sparsity: 0.3243, expected_sequence_sparsity: 0.7348, target_sparsity: 0.31, step: 17500 lambda_1: -4.1269, lambda_2: 109.3017 lambda_3: 0.0000 train remain: [1. 1. 
0.97 0.98 0.77 0.75 0.94 0.84 0.85 0.57] infer remain: [1.0, 1.0, 0.94, 1.0, 0.76, 0.74, 0.92, 0.84, 0.86, 0.56] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.94, 0.94, 0.71, 0.53, 0.49, 0.41, 0.35, 0.2] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111111111011011111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111010101000110100000 11111111111111111111111111110111010100000101100100 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111101110010 00011111111111111110110010111001001011000011000000 loss: 0.395330, lagrangian_loss: -0.000525, attention_score_distillation_loss: 0.000925 loss: 0.143363, lagrangian_loss: -0.002671, attention_score_distillation_loss: 0.000910 ---------------------------------------------------------------------- time: 2023-07-19 16:19:45 Evaluating: accuracy: 0.8843, eval_loss: 0.4669, token_prune_loc: [False, False, True, False, True, True, True, True, True, True], macs_sparsity: 0.3314, expected_sparsity: 0.3256, expected_sequence_sparsity: 0.7353, target_sparsity: 0.3188, step: 18000 lambda_1: -1.2977, lambda_2: 112.8449 lambda_3: 0.0000 train remain: [1. 1. 0.96 0.98 0.77 0.75 0.94 0.84 0.85 0.57] infer remain: [1.0, 1.0, 0.94, 1.0, 0.76, 0.74, 0.92, 0.84, 0.84, 0.56] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.94, 0.94, 0.71, 0.53, 0.49, 0.41, 0.34, 0.19] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111111111011011111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111010101000110100000 11111111111111111111111111110111010100000101100100 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111111110110010111001001011000011000000 loss: 0.150135, lagrangian_loss: 0.006139, attention_score_distillation_loss: 0.000893 loss: 0.297260, lagrangian_loss: 0.001615, attention_score_distillation_loss: 0.000879 ---------------------------------------------------------------------- time: 2023-07-19 16:22:40 Evaluating: accuracy: 0.8827, eval_loss: 0.5073, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3527, expected_sparsity: 0.3482, expected_sequence_sparsity: 0.7442, target_sparsity: 0.3277, step: 18500 lambda_1: -2.0872, lambda_2: 116.5439 lambda_3: 0.0000 train remain: [1. 1. 
0.96 0.97 0.76 0.73 0.93 0.84 0.85 0.55] infer remain: [1.0, 1.0, 0.94, 0.94, 0.76, 0.72, 0.92, 0.84, 0.84, 0.56] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.94, 0.88, 0.67, 0.48, 0.44, 0.37, 0.31, 0.18] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111111111011011111 11111111111111111111111110111111111111111111111010 11111111111111111111111111111111010101000110100000 11111111111111111111111111110111010100000101100000 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111111110110010111001001011000011000000 loss: 0.147284, lagrangian_loss: -0.002430, attention_score_distillation_loss: 0.000865 loss: 0.316753, lagrangian_loss: -0.002380, attention_score_distillation_loss: 0.000850 ---------------------------------------------------------------------- time: 2023-07-19 16:25:34 Evaluating: accuracy: 0.8871, eval_loss: 0.4389, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3604, expected_sparsity: 0.3567, expected_sequence_sparsity: 0.7475, target_sparsity: 0.3366, step: 19000 lambda_1: -3.0211, lambda_2: 117.8724 lambda_3: 0.0000 train remain: [1. 1. 0.95 0.96 0.76 0.72 0.93 0.84 0.85 0.54] infer remain: [1.0, 1.0, 0.92, 0.94, 0.76, 0.72, 0.92, 0.84, 0.84, 0.54] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.86, 0.66, 0.47, 0.44, 0.37, 0.31, 0.17] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111110111011011111 11111111111111111111111111111111111111111111111000 11111111111111111111111111111111010101000110100000 11111111111111111111111111110111010100000101100000 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111111110110010111001001011000010000000 loss: 0.550799, lagrangian_loss: -0.003408, attention_score_distillation_loss: 0.000833 loss: 0.252318, lagrangian_loss: -0.001510, attention_score_distillation_loss: 0.000819 ---------------------------------------------------------------------- time: 2023-07-19 16:28:28 Evaluating: accuracy: 0.8929, eval_loss: 0.4493, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3604, expected_sparsity: 0.3567, expected_sequence_sparsity: 0.7475, target_sparsity: 0.3454, step: 19500 lambda_1: -3.1941, lambda_2: 120.0675 lambda_3: 0.0000 train remain: [1. 1. 0.94 0.96 0.76 0.72 0.93 0.84 0.85 0.53] infer remain: [1.0, 1.0, 0.92, 0.94, 0.76, 0.72, 0.92, 0.84, 0.84, 0.54] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.86, 0.66, 0.47, 0.44, 0.37, 0.31, 0.17] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111111111011011110 11111111111111111111111111111111111111111111111000 11111111111111111111111111111111010101000110100000 11111111111111111111111111110111010100000101100000 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111111110110010111001001011000010000000 loss: 0.170814, lagrangian_loss: 0.005152, attention_score_distillation_loss: 0.000803 ETA: 10:41:36 | Epoch 5 finished. Took 1150.5 seconds. 
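The 0/1 rows printed with each evaluation are the bin_num=50 token bins per pruned layer after deterministic binarization, while train remain reports the expected value of the stochastic gates during training (hence its slightly higher fractional values than infer remain). The run's temperature=2/3 and droprate_init=0.01 match the hard-concrete gates of Louizos et al. (2018); below is a sketch of such a gate over the [10, 50] token_loga tensor, assuming the usual stretch interval (-0.1, 1.1), which this repo may set differently:

```python
import torch

def hard_concrete_gate(token_loga: torch.Tensor,
                       temperature: float = 2 / 3,
                       stretch=(-0.1, 1.1),
                       training: bool = True) -> torch.Tensor:
    """Hard-concrete gate over per-layer token bins.

    Training: sample a stretched, clamped sigmoid relaxation (differentiable,
    giving the fractional "train remain" values). Inference: the deterministic
    stretched sigmoid saturates at 0 or 1 for decided bins, matching the 0/1
    rows in the log once thresholded.
    """
    lo, hi = stretch
    if training:
        u = torch.rand_like(token_loga).clamp_(1e-6, 1 - 1e-6)
        s = torch.sigmoid((u.log() - (-u).log1p() + token_loga) / temperature)
    else:
        s = torch.sigmoid(token_loga)
    return (s * (hi - lo) + lo).clamp(0.0, 1.0)
```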
loss: 0.098720, lagrangian_loss: -0.000152, attention_score_distillation_loss: 0.000788 ---------------------------------------------------------------------- time: 2023-07-19 16:31:24 Evaluating: accuracy: 0.8821, eval_loss: 0.4723, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.386, expected_sparsity: 0.3799, expected_sequence_sparsity: 0.7567, target_sparsity: 0.3543, step: 20000 lambda_1: -2.4006, lambda_2: 121.4914 lambda_3: 0.0000 train remain: [1. 1. 0.93 0.95 0.75 0.72 0.93 0.84 0.85 0.52] infer remain: [1.0, 1.0, 0.9, 0.92, 0.74, 0.7, 0.92, 0.84, 0.84, 0.52] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.9, 0.83, 0.61, 0.43, 0.39, 0.33, 0.28, 0.14] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111110111011011110 11111111111111111111111111111111101111111111111000 11111111111111111111111111111101010101000110100000 11111111111111111111111111110111010000000101100000 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111111110110010111001001010000010000000 loss: 0.077142, lagrangian_loss: 0.002063, attention_score_distillation_loss: 0.000773 loss: 0.189560, lagrangian_loss: 0.003385, attention_score_distillation_loss: 0.000757 ---------------------------------------------------------------------- time: 2023-07-19 16:34:17 Evaluating: accuracy: 0.8867, eval_loss: 0.4654, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.386, expected_sparsity: 0.3807, expected_sequence_sparsity: 0.757, target_sparsity: 0.3631, step: 20500 lambda_1: -3.9850, lambda_2: 122.5684 lambda_3: 0.0000 train remain: [1. 1. 0.92 0.93 0.75 0.71 0.93 0.84 0.85 0.51] infer remain: [1.0, 1.0, 0.9, 0.92, 0.74, 0.7, 0.92, 0.84, 0.84, 0.5] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.9, 0.83, 0.61, 0.43, 0.39, 0.33, 0.28, 0.14] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111110111011011110 11111111111111111111111111111111101111111111111000 11111111111111111111111111111101010101000110100000 11111111111111111111111111110111010000000101100000 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111110110110010111001001010000010000000 loss: 0.241112, lagrangian_loss: -0.005537, attention_score_distillation_loss: 0.000743 loss: 0.166410, lagrangian_loss: 0.000345, attention_score_distillation_loss: 0.000728 ---------------------------------------------------------------------- time: 2023-07-19 16:37:11 Evaluating: accuracy: 0.8905, eval_loss: 0.4697, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3876, expected_sparsity: 0.3828, expected_sequence_sparsity: 0.7578, target_sparsity: 0.372, step: 21000 lambda_1: -2.7639, lambda_2: 124.4564 lambda_3: 0.0000 train remain: [1. 1. 
0.92 0.93 0.74 0.7 0.93 0.84 0.85 0.5 ] infer remain: [1.0, 1.0, 0.9, 0.92, 0.74, 0.7, 0.9, 0.84, 0.84, 0.5] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.9, 0.83, 0.61, 0.43, 0.39, 0.32, 0.27, 0.14] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111110111011011110 11111111111111111111111111111111101111111111111000 11111111111111111111111111111101010101000110100000 11111111111111111111111111110111010000000101100000 11111111111111111111111111101111111011111011111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111110110110010111001001010000010000000 loss: 0.640742, lagrangian_loss: -0.001453, attention_score_distillation_loss: 0.000713 loss: 0.767744, lagrangian_loss: -0.003106, attention_score_distillation_loss: 0.000697 ---------------------------------------------------------------------- time: 2023-07-19 16:40:03 Evaluating: accuracy: 0.8898, eval_loss: 0.4671, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3919, expected_sparsity: 0.389, expected_sequence_sparsity: 0.7602, target_sparsity: 0.3808, step: 21500 lambda_1: -3.9626, lambda_2: 127.4678 lambda_3: 0.0000 train remain: [1. 1. 0.91 0.91 0.74 0.7 0.93 0.84 0.84 0.49] infer remain: [1.0, 1.0, 0.9, 0.9, 0.74, 0.7, 0.9, 0.84, 0.84, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.9, 0.81, 0.6, 0.42, 0.38, 0.32, 0.27, 0.13] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111110111011011110 11111111111111111111111111111111101111111011111000 11111111111111111111111111111101010101000110100000 11111111111111111111111111110111010000000101100000 11111111111111111111111111101111111011111011111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111110110100010111001001010000010000000 loss: 0.389449, lagrangian_loss: 0.008071, attention_score_distillation_loss: 0.000682 loss: 0.092713, lagrangian_loss: -0.002831, attention_score_distillation_loss: 0.000666 ---------------------------------------------------------------------- time: 2023-07-19 16:42:56 Evaluating: accuracy: 0.888, eval_loss: 0.4444, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4082, expected_sparsity: 0.4027, expected_sequence_sparsity: 0.7656, target_sparsity: 0.3897, step: 22000 lambda_1: -1.9608, lambda_2: 131.4487 lambda_3: 0.0000 train remain: [1. 1. 
0.91 0.9 0.73 0.69 0.93 0.84 0.84 0.49] infer remain: [1.0, 1.0, 0.9, 0.88, 0.72, 0.68, 0.9, 0.84, 0.84, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.9, 0.79, 0.57, 0.39, 0.35, 0.29, 0.25, 0.12] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111110111011011110 11111111111111111111111110111111101111111011111000 11111111111111111111111101111101010101000110100000 11111111111111111111111111110111010000000101000000 11111111111111111111111111101111111011111011111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111110110100010111001001010000010000000 loss: 0.087542, lagrangian_loss: 0.002828, attention_score_distillation_loss: 0.000652 loss: 0.278232, lagrangian_loss: -0.009213, attention_score_distillation_loss: 0.000637 ---------------------------------------------------------------------- time: 2023-07-19 16:45:49 Evaluating: accuracy: 0.8856, eval_loss: 0.4385, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4115, expected_sparsity: 0.4095, expected_sequence_sparsity: 0.7683, target_sparsity: 0.3986, step: 22500 lambda_1: -1.7196, lambda_2: 136.8272 lambda_3: 0.0000 train remain: [1. 1. 0.91 0.89 0.72 0.69 0.92 0.84 0.83 0.48] infer remain: [1.0, 1.0, 0.88, 0.88, 0.72, 0.68, 0.9, 0.84, 0.84, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.88, 0.77, 0.56, 0.38, 0.34, 0.29, 0.24, 0.12] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111110111010011110 11111111111111111111111110111111101111111011111000 11111111111111111111111101111101010101000110100000 11111111111111111111111111110111010000000101000000 11111111111111111111111111101111111011111011111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111110110100010111001001010000010000000 loss: 0.282216, lagrangian_loss: 0.006731, attention_score_distillation_loss: 0.000622 loss: 0.402815, lagrangian_loss: 0.001438, attention_score_distillation_loss: 0.000607 ETA: 10:21:59 | Epoch 6 finished. Took 1122.81 seconds. ---------------------------------------------------------------------- time: 2023-07-19 16:48:45 Evaluating: accuracy: 0.877, eval_loss: 0.5176, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4208, expected_sparsity: 0.4148, expected_sequence_sparsity: 0.7704, target_sparsity: 0.4074, step: 23000 lambda_1: -3.0206, lambda_2: 140.3611 lambda_3: 0.0000 train remain: [1. 1. 
0.89 0.88 0.7 0.68 0.92 0.83 0.82 0.48] infer remain: [1.0, 1.0, 0.88, 0.88, 0.7, 0.68, 0.9, 0.84, 0.82, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.88, 0.77, 0.54, 0.37, 0.33, 0.28, 0.23, 0.11] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111110111010011110 11111111111111111111111110111111101111111011111000 11111111111111111111111101111001010101000110100000 11111111111111111111111111110111010000000101000000 11111111111111111111111111101111111011111011111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110000 00011111111111110110100010111001001010000010000000 loss: 0.370469, lagrangian_loss: -0.006916, attention_score_distillation_loss: 0.000591 loss: 0.098022, lagrangian_loss: -0.000028, attention_score_distillation_loss: 0.000576 ---------------------------------------------------------------------- time: 2023-07-19 16:51:40 Evaluating: accuracy: 0.8827, eval_loss: 0.4909, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.424, expected_sparsity: 0.4211, expected_sequence_sparsity: 0.7729, target_sparsity: 0.4163, step: 23500 lambda_1: -2.9186, lambda_2: 142.3789 lambda_3: 0.0000 train remain: [1. 1. 0.89 0.87 0.7 0.68 0.92 0.83 0.81 0.48] infer remain: [1.0, 1.0, 0.88, 0.86, 0.7, 0.68, 0.9, 0.82, 0.82, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.88, 0.76, 0.53, 0.36, 0.32, 0.27, 0.22, 0.1] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110111011011110 11111111111111111111111110111111101111111011011000 11111111111111111111111101111001010101000110100000 11111111111111111111111111110111010000000101000000 11111111111111111111111111101111111011111011111010 11111111111111111111111111111111111111001010010000 00111111111111111111111111111101111111111100110000 00011111111111110110100010111001001010000010000000 loss: 0.305864, lagrangian_loss: 0.000868, attention_score_distillation_loss: 0.000561 loss: 0.111837, lagrangian_loss: -0.004155, attention_score_distillation_loss: 0.000546 ---------------------------------------------------------------------- time: 2023-07-19 16:54:34 Evaluating: accuracy: 0.8711, eval_loss: 0.5274, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4398, expected_sparsity: 0.4356, expected_sequence_sparsity: 0.7786, target_sparsity: 0.4251, step: 24000 lambda_1: -2.1886, lambda_2: 144.7647 lambda_3: 0.0000 train remain: [1. 1. 
0.88 0.86 0.69 0.67 0.91 0.83 0.81 0.47] infer remain: [1.0, 1.0, 0.86, 0.86, 0.68, 0.66, 0.9, 0.82, 0.8, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.5, 0.33, 0.3, 0.24, 0.2, 0.09] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110111010011110 11111111111111111111111110111111101111111011011000 11111111111111111111111101111001010101000110000000 11111111111111111111111111110111010000000100000000 11111111111111111111111111101111111011111011111010 11111111111111111111111111111111111111001010010000 00111111111111111111111110111101111111111100110000 00011111111111110110100010011101001010000010000000 loss: 0.527923, lagrangian_loss: 0.000995, attention_score_distillation_loss: 0.000531 loss: 0.388383, lagrangian_loss: -0.002323, attention_score_distillation_loss: 0.000515 ---------------------------------------------------------------------- time: 2023-07-19 16:57:24 Evaluating: accuracy: 0.8777, eval_loss: 0.4877, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4463, expected_sparsity: 0.4408, expected_sequence_sparsity: 0.7807, target_sparsity: 0.434, step: 24500 lambda_1: -3.9907, lambda_2: 147.1421 lambda_3: 0.0000 train remain: [1. 1. 0.87 0.85 0.68 0.66 0.92 0.82 0.81 0.47] infer remain: [1.0, 1.0, 0.86, 0.84, 0.68, 0.66, 0.9, 0.82, 0.8, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.86, 0.72, 0.49, 0.32, 0.29, 0.24, 0.19, 0.09] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110111010011110 11111111111111111111111110111111101111110011011000 11111111111111111111111101111001010101000110000000 11111111111111111111111111110111010000000100000000 11111111111111111111111111101111111011111011111010 11111111111111111111111111111111111111001010010000 00111111111111111111111110111101111111111100110000 00011111111111110110100010011001001010000010000000 loss: 0.105796, lagrangian_loss: -0.000029, attention_score_distillation_loss: 0.000500 loss: 0.151019, lagrangian_loss: -0.002177, attention_score_distillation_loss: 0.000485 ---------------------------------------------------------------------- time: 2023-07-19 17:00:21 Evaluating: accuracy: 0.8737, eval_loss: 0.5023, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4496, expected_sparsity: 0.4437, expected_sequence_sparsity: 0.7818, target_sparsity: 0.4428, step: 25000 lambda_1: -3.5187, lambda_2: 148.9342 lambda_3: 0.0000 train remain: [1. 1. 
0.87 0.84 0.67 0.64 0.91 0.82 0.81 0.47] infer remain: [1.0, 1.0, 0.86, 0.84, 0.68, 0.64, 0.9, 0.82, 0.8, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.86, 0.72, 0.49, 0.31, 0.28, 0.23, 0.19, 0.09] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110111010011110 11111111111111111111111110111111101111110011011000 11111111111111111111111101111001010101000110000000 11111111111111111111111111110101010000000100000000 11111111111111111111111111101111111011111011111010 11111111111111111111111111111111111111001010010000 00111111111111111111111110111101111111111100110000 00011111111111110110100010011001001010000010000000 loss: 0.346406, lagrangian_loss: 0.002880, attention_score_distillation_loss: 0.000470 loss: 0.484141, lagrangian_loss: -0.003893, attention_score_distillation_loss: 0.000455 ---------------------------------------------------------------------- time: 2023-07-19 17:03:13 Evaluating: accuracy: 0.8737, eval_loss: 0.5246, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.461, expected_sparsity: 0.4563, expected_sequence_sparsity: 0.7868, target_sparsity: 0.4517, step: 25500 lambda_1: -3.3258, lambda_2: 150.9814 lambda_3: 0.0000 train remain: [1. 1. 0.86 0.83 0.67 0.63 0.89 0.81 0.81 0.46] infer remain: [1.0, 1.0, 0.86, 0.82, 0.66, 0.62, 0.88, 0.82, 0.8, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.86, 0.71, 0.47, 0.29, 0.25, 0.21, 0.17, 0.08] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110111010011110 11111111111111111111111110111111101101110011011000 11111111111111111111111101111001010101000100000000 11111111111111111111111111110101010000000000000000 11111111111111111111111111101111111011111011111000 11111111111111111111111111111111111111001010010000 00111111111111111111111110111101111111111100110000 00011111111111110110100010011001001010000010000000 loss: 0.301119, lagrangian_loss: -0.000738, attention_score_distillation_loss: 0.000440 loss: 0.268379, lagrangian_loss: -0.005348, attention_score_distillation_loss: 0.000425 ---------------------------------------------------------------------- time: 2023-07-19 17:06:08 Evaluating: accuracy: 0.8647, eval_loss: 0.5535, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4751, expected_sparsity: 0.4687, expected_sequence_sparsity: 0.7917, target_sparsity: 0.4606, step: 26000 lambda_1: -3.1859, lambda_2: 153.2161 lambda_3: 0.0000 train remain: [1. 1. 0.85 0.81 0.66 0.62 0.88 0.81 0.81 0.46] infer remain: [1.0, 1.0, 0.84, 0.8, 0.66, 0.62, 0.86, 0.8, 0.8, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.84, 0.67, 0.44, 0.27, 0.24, 0.19, 0.15, 0.07] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110101010011110 11111111111111111111111110111111001101110011011000 11111111111111111111111101111001010101000100000000 11111111111111111111111111110101010000000000000000 11111111111111111111111111101111111011111011011000 11111111111111111111111111111111111111001010000000 00111111111111111111111110111101111111111100110000 00011111111111100110100010011101001010000010000000 loss: 0.357663, lagrangian_loss: 0.000254, attention_score_distillation_loss: 0.000409 ETA: 10:04:37 | Epoch 7 finished. Took 1152.98 seconds. 
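Each evaluation block prints one 50-character 0/1 row per pruned location: a 1 marks a token bin that location keeps at inference. "infer remain" is simply the kept fraction of each row, and "layerwise remain" is the running product of those fractions (a bin dropped at one location stays dropped at all later ones), with two leading 1.0 entries for the layers outside the pruned range. A small reconstruction from the step-26000 block above; the mask list is abbreviated and the variable names are illustrative:

    from itertools import accumulate
    from operator import mul

    masks = [
        "1" * 50,  # first two locations fully open (token_prune_loc False)
        "1" * 50,
        "11111111111110111111111101111111111110101010011110",  # 42 ones -> 0.84
        # ... the remaining seven rows of the step-26000 block
    ]

    # Kept fraction per location -> the printed "infer remain" values.
    infer_remain = [m.count("1") / len(m) for m in masks]

    # Cumulative product -> "layerwise remain" (printed rounded to 2 decimals):
    # 0.84 * 0.8 * 0.66 * 0.62 * 0.86 * 0.8 * 0.8 * 0.46 ~= 0.07 at the end.
    layerwise_remain = [1.0, 1.0] + list(accumulate(infer_remain, mul))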
loss: 0.422970, lagrangian_loss: 0.008728, attention_score_distillation_loss: 0.000393 ---------------------------------------------------------------------- time: 2023-07-19 17:09:05 Evaluating: accuracy: 0.8669, eval_loss: 0.577, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4767, expected_sparsity: 0.4712, expected_sequence_sparsity: 0.7926, target_sparsity: 0.4694, step: 26500 lambda_1: -2.5899, lambda_2: 155.5544 lambda_3: 0.0000 train remain: [1. 1. 0.85 0.8 0.65 0.61 0.88 0.81 0.81 0.46] infer remain: [1.0, 1.0, 0.84, 0.8, 0.66, 0.6, 0.86, 0.8, 0.8, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.84, 0.67, 0.44, 0.27, 0.23, 0.18, 0.15, 0.07] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110101010011110 11111111111111111111111110111111101101110010011000 11111111111111111111111101111001010101000100000000 11111111111111111111111111110001010000000000000000 11111111111111111111111111101111111011111011011000 11111111111111111111111111111111111111001010000000 00111111111111111111111110111101111111111100110000 00011111111111100111100010011001001010000010000000 loss: 0.278651, lagrangian_loss: -0.002803, attention_score_distillation_loss: 0.000379 loss: 0.507491, lagrangian_loss: 0.002914, attention_score_distillation_loss: 0.000364 ---------------------------------------------------------------------- time: 2023-07-19 17:12:01 Evaluating: accuracy: 0.8536, eval_loss: 0.5614, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4832, expected_sparsity: 0.4803, expected_sequence_sparsity: 0.7962, target_sparsity: 0.4783, step: 27000 lambda_1: -3.6136, lambda_2: 157.7836 lambda_3: 0.0000 train remain: [1. 1. 0.84 0.79 0.64 0.6 0.86 0.81 0.81 0.45] infer remain: [1.0, 1.0, 0.84, 0.78, 0.64, 0.6, 0.84, 0.8, 0.8, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.84, 0.66, 0.42, 0.25, 0.21, 0.17, 0.14, 0.06] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110101010011110 11111111111111111111111110111111001101110010011000 11111111111111111111111101111001010001000100000000 11111111111111111111111111110001010000000000000000 11111111111111111111111110101111111011111011011000 11111111111111111111111111111111111111001010000000 00111111111111111111111110111101111111111100110000 00011111111111100110100010011001001010000010000000 loss: 0.378651, lagrangian_loss: -0.003923, attention_score_distillation_loss: 0.000349 loss: 0.051046, lagrangian_loss: 0.004717, attention_score_distillation_loss: 0.000333 ---------------------------------------------------------------------- time: 2023-07-19 17:14:56 Evaluating: accuracy: 0.8675, eval_loss: 0.525, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4914, expected_sparsity: 0.4908, expected_sequence_sparsity: 0.8003, target_sparsity: 0.4871, step: 27500 lambda_1: -5.3874, lambda_2: 160.2762 lambda_3: 0.0000 train remain: [1. 
0.99 0.83 0.77 0.64 0.59 0.84 0.8 0.81 0.45] infer remain: [1.0, 1.0, 0.82, 0.76, 0.64, 0.6, 0.82, 0.8, 0.8, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.82, 0.62, 0.4, 0.24, 0.2, 0.16, 0.13, 0.06] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110101010001110 11111111111111111111111110111111001101010010011000 11111111111111111111111101111001010001000100000000 11111111111111111111111111110001010000000000000000 10111111111111111111111110101111111011111011011000 11111111111111111111111111111111111111001010000000 00111111111111111111111110111101111111111100110000 00011111111111100110100010011001001010000010000000 loss: 0.316514, lagrangian_loss: 0.004224, attention_score_distillation_loss: 0.000318 loss: 0.214739, lagrangian_loss: -0.000967, attention_score_distillation_loss: 0.000303 ---------------------------------------------------------------------- time: 2023-07-19 17:17:48 Evaluating: accuracy: 0.858, eval_loss: 0.5666, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.499, expected_sparsity: 0.495, expected_sequence_sparsity: 0.802, target_sparsity: 0.496, step: 28000 lambda_1: -6.6857, lambda_2: 162.8883 lambda_3: 0.0000 train remain: [1. 0.99 0.82 0.76 0.63 0.58 0.82 0.79 0.81 0.43] infer remain: [1.0, 1.0, 0.82, 0.76, 0.64, 0.58, 0.8, 0.78, 0.8, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.82, 0.62, 0.4, 0.23, 0.19, 0.14, 0.12, 0.05] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110101010001110 11111111111111111111111110111111001101010010011000 11111111111111111111111101111001010001000100000000 11111111111111111111111110110001010000000000000000 10011111111111111111111110101111111011111011011000 11111111111111111111111110111111111111001010000000 00111111111111111111111110111101111111111100110000 00011111111111100110100010011001001010000000000000 loss: 0.257862, lagrangian_loss: 0.002369, attention_score_distillation_loss: 0.000288 loss: 0.274288, lagrangian_loss: -0.005006, attention_score_distillation_loss: 0.000273 ---------------------------------------------------------------------- time: 2023-07-19 17:20:43 Evaluating: accuracy: 0.8528, eval_loss: 0.5108, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5136, expected_sparsity: 0.5074, expected_sequence_sparsity: 0.8069, target_sparsity: 0.5049, step: 28500 lambda_1: -5.1503, lambda_2: 166.8255 lambda_3: 0.0000 train remain: [1. 
0.99 0.8 0.75 0.63 0.57 0.81 0.76 0.81 0.42] infer remain: [1.0, 1.0, 0.8, 0.74, 0.62, 0.58, 0.8, 0.76, 0.8, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.8, 0.59, 0.37, 0.21, 0.17, 0.13, 0.1, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101101111111110101010001110 11111111111111111111111110111111001101010000011000 11111111111111111111111101111001010000000100000000 11111111111111111111111110110001010000000000000000 10011111111111111111111110101111111011111011011000 10111111111111111111111110111111111111001010000000 00111111111111111111111110111101111111111100110000 00011111110111100110100010011101001010000000000000 loss: 0.348947, lagrangian_loss: -0.006852, attention_score_distillation_loss: 0.000258 loss: 0.232811, lagrangian_loss: 0.006092, attention_score_distillation_loss: 0.000242 ---------------------------------------------------------------------- time: 2023-07-19 17:23:39 Evaluating: accuracy: 0.8642, eval_loss: 0.4739, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5136, expected_sparsity: 0.5079, expected_sequence_sparsity: 0.8071, target_sparsity: 0.5137, step: 29000 lambda_1: -8.2196, lambda_2: 173.1122 lambda_3: 0.0000 train remain: [1. 0.99 0.79 0.74 0.62 0.57 0.8 0.75 0.81 0.39] infer remain: [1.0, 1.0, 0.8, 0.74, 0.62, 0.58, 0.8, 0.76, 0.8, 0.38] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.8, 0.59, 0.37, 0.21, 0.17, 0.13, 0.1, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101101111111110101010001110 11111111111111111111111110111111001101010000011000 11111111111111111111111101111001010000000100000000 11111111111111111111111110110100010000000000000000 10011111111111111111111110101111111011111011011000 10111111111111111111111110111111111111001010000000 00111111111111111111111110111101111111111100110000 00011111110111100010100010011001001010000000000000 loss: 0.377321, lagrangian_loss: 0.018673, attention_score_distillation_loss: 0.000227 loss: 0.348316, lagrangian_loss: -0.021290, attention_score_distillation_loss: 0.000212 ETA: 9:45:28 | Epoch 8 finished. Took 1129.39 seconds. ---------------------------------------------------------------------- time: 2023-07-19 17:26:33 Evaluating: accuracy: 0.8719, eval_loss: 0.4985, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5218, expected_sparsity: 0.5188, expected_sequence_sparsity: 0.8114, target_sparsity: 0.5226, step: 29500 lambda_1: -5.2242, lambda_2: 179.2311 lambda_3: 0.0000 train remain: [1. 
0.99 0.78 0.71 0.61 0.57 0.8 0.75 0.81 0.38] infer remain: [1.0, 1.0, 0.78, 0.72, 0.62, 0.56, 0.8, 0.74, 0.8, 0.38] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.78, 0.56, 0.35, 0.19, 0.16, 0.12, 0.09, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011111101101111111110101010001110 11111111111111111111111110111111001001010000011000 11111111111111111111111101111001010000000100000000 11111111111111111111111110110000010000000000000000 10011111111111111111111110101111111011111011011000 10111111111111111111111110111111111111000010000000 00111111111111111111111110111101111111111100110000 10011111110111100010100010010001001010000000000000 loss: 0.610753, lagrangian_loss: -0.012237, attention_score_distillation_loss: 0.000197 loss: 0.527285, lagrangian_loss: 0.001909, attention_score_distillation_loss: 0.000182 ---------------------------------------------------------------------- time: 2023-07-19 17:29:29 Evaluating: accuracy: 0.8521, eval_loss: 0.5629, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5343, expected_sparsity: 0.5301, expected_sequence_sparsity: 0.8158, target_sparsity: 0.5314, step: 30000 lambda_1: -5.1625, lambda_2: 182.8359 lambda_3: 0.0000 train remain: [1. 0.99 0.77 0.7 0.6 0.57 0.8 0.75 0.81 0.37] infer remain: [1.0, 1.0, 0.76, 0.7, 0.6, 0.56, 0.78, 0.74, 0.8, 0.38] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.76, 0.53, 0.32, 0.18, 0.14, 0.1, 0.08, 0.03] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011111101100111111110101010001110 11111111111111111111101110111111001001010000011000 11111111111111111111111101011001010000000100000000 11111111111111111111111110110000010000000000000000 10011111111111111111111110101111111011110011011000 10111111111111111111111110111111111111000010000000 00111111111111111111111110111101111111111100110000 00011111110111100011100010010001001010000000000000 loss: 0.346217, lagrangian_loss: 0.020005, attention_score_distillation_loss: 0.000167 loss: 0.515052, lagrangian_loss: 0.004408, attention_score_distillation_loss: 0.000151 ---------------------------------------------------------------------- time: 2023-07-19 17:32:23 Evaluating: accuracy: 0.8349, eval_loss: 0.507, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5474, expected_sparsity: 0.543, expected_sequence_sparsity: 0.8209, target_sparsity: 0.5403, step: 30500 lambda_1: -6.9070, lambda_2: 184.1763 lambda_3: 0.0000 train remain: [0.99 0.98 0.76 0.69 0.58 0.57 0.8 0.75 0.81 0.37] infer remain: [1.0, 0.96, 0.76, 0.7, 0.58, 0.56, 0.78, 0.74, 0.8, 0.36] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.73, 0.51, 0.3, 0.17, 0.13, 0.1, 0.08, 0.03] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011111101100111111110101010001110 11111111111111111111101110111111001001010000011000 11111111111111111111111100011001010000000100000000 11111111111111111111111110110000010000000000000000 10011111111111111111111110101111111011110011011000 10111111111111111111111110111111111111000010000000 00111111111111111111111110111101111111111100110000 00011111110111100010100010010001001010000000000000 loss: 0.512406, lagrangian_loss: 0.009849, attention_score_distillation_loss: 0.000136 loss: 0.502054, lagrangian_loss: 0.003590, attention_score_distillation_loss: 0.000121 
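The token_prune_loc flags track the same masks: a location is reported True as soon as its deterministic inference mask keeps fewer than all of its bins, which is exactly what happens to the second location between steps 30000 and 30500 above (its ratio drops from 1.0 to 0.96). A one-line reconstruction from the printed "infer remain" vector (illustrative helper, not the repo's code):

    def prune_flags(infer_remain):
        # True once a location keeps < 100% of its token bins.
        return [r < 1.0 for r in infer_remain]

    # Step-30500 block above:
    assert prune_flags([1.0, 0.96, 0.76, 0.7, 0.58, 0.56, 0.78, 0.74, 0.8, 0.36]) == [
        False, True, True, True, True, True, True, True, True, True]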
---------------------------------------------------------------------- time: 2023-07-19 17:35:18 Evaluating: accuracy: 0.8224, eval_loss: 0.5701, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5555, expected_sparsity: 0.5505, expected_sequence_sparsity: 0.8239, target_sparsity: 0.5491, step: 31000 lambda_1: -6.6193, lambda_2: 185.4964 lambda_3: 0.0000 train remain: [0.99 0.98 0.73 0.68 0.57 0.56 0.79 0.75 0.8 0.35] infer remain: [1.0, 0.96, 0.74, 0.68, 0.58, 0.56, 0.78, 0.74, 0.8, 0.34] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.71, 0.48, 0.28, 0.16, 0.12, 0.09, 0.07, 0.02] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011111101100111111010101010001110 11111111111111111111101110011111001001010000011000 11111111111111111111111100111001010000000000000000 11111111111111111111111110110000010000000000000000 10011111111111111111111110101111111011110011011000 10111111111111111111111110111111111111000010000000 00111111111111111111111110111101111111111100110000 00011111110111100010100010010001000010000000000000 loss: 0.317155, lagrangian_loss: 0.003263, attention_score_distillation_loss: 0.000106 loss: 0.637762, lagrangian_loss: -0.005628, attention_score_distillation_loss: 0.000091 ---------------------------------------------------------------------- time: 2023-07-19 17:38:16 Evaluating: accuracy: 0.845, eval_loss: 0.5345, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5636, expected_sparsity: 0.5597, expected_sequence_sparsity: 0.8275, target_sparsity: 0.558, step: 31500 lambda_1: -3.5193, lambda_2: 187.7870 lambda_3: 0.0000 train remain: [0.99 0.97 0.72 0.67 0.56 0.55 0.79 0.75 0.8 0.34] infer remain: [1.0, 0.96, 0.72, 0.66, 0.56, 0.56, 0.78, 0.74, 0.8, 0.34] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.69, 0.46, 0.26, 0.14, 0.11, 0.08, 0.07, 0.02] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011101101100111111010101010001110 11111111111111111111101110011111001001010000010000 11111111111111111111111100011001010000000000000000 11111111111111111111111110110000010000000000000000 10011111111111111111111110101111111011110011011000 10111111111111111111111110111111111111000010000000 00111111111111111111111110111101111111111100110000 00011111110111100010000010010001010010000000000000 loss: 0.926725, lagrangian_loss: -0.000676, attention_score_distillation_loss: 0.000075 loss: 0.258766, lagrangian_loss: 0.002009, attention_score_distillation_loss: 0.000060 ---------------------------------------------------------------------- time: 2023-07-19 17:41:10 Evaluating: accuracy: 0.8267, eval_loss: 0.5846, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5702, expected_sparsity: 0.5651, expected_sequence_sparsity: 0.8296, target_sparsity: 0.5669, step: 32000 lambda_1: -6.5978, lambda_2: 191.7556 lambda_3: 0.0000 train remain: [0.99 0.97 0.7 0.66 0.56 0.54 0.79 0.75 0.8 0.34] infer remain: [1.0, 0.96, 0.7, 0.66, 0.56, 0.54, 0.78, 0.74, 0.8, 0.34] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.67, 0.44, 0.25, 0.13, 0.1, 0.08, 0.06, 0.02] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011010101100111111010101010001110 11111111111111111111101110011111001001010000010000 11111111111111111111111100011001010000000000000000 
10111111111111111111111110110000010000000000000000 10011111111111111111111110101111111011110011011000 10111111111111111111111110111111111111000010000000 00111111111111111111111110111101111111111100110000 00011111110111100011000010010001000010000000000000 loss: 0.488409, lagrangian_loss: -0.000363, attention_score_distillation_loss: 0.000045 loss: 0.232396, lagrangian_loss: -0.003830, attention_score_distillation_loss: 0.000030 ---------------------------------------------------------------------- time: 2023-07-19 17:44:07 Evaluating: accuracy: 0.8444, eval_loss: 0.5212, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5815, expected_sparsity: 0.5765, expected_sequence_sparsity: 0.8341, target_sparsity: 0.5757, step: 32500 lambda_1: -7.1471, lambda_2: 194.2055 lambda_3: 0.0000 train remain: [0.99 0.97 0.66 0.65 0.55 0.53 0.79 0.75 0.79 0.33] infer remain: [1.0, 0.96, 0.66, 0.66, 0.54, 0.52, 0.78, 0.74, 0.78, 0.34] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.63, 0.42, 0.23, 0.12, 0.09, 0.07, 0.05, 0.02] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111111010100010001110 11111111111111111111101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10111111111111111111011110110000010000000000000000 10011111111111111111111110101111111011110011011000 10111111111111111111111110111111111111000010000000 00111111111111111111111110111101011111111100110000 00011111110111100010000010010001010010000000000000 loss: 0.309577, lagrangian_loss: -0.004971, attention_score_distillation_loss: 0.000020 ETA: 9:27:56 | Epoch 9 finished. Took 1160.01 seconds. loss: 0.264256, lagrangian_loss: -0.007043, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 17:47:02 Evaluating: accuracy: 0.8422, eval_loss: 0.6016, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5848, expected_sparsity: 0.5796, expected_sequence_sparsity: 0.8353, target_sparsity: 0.58, step: 33000 lambda_1: -2.7647, lambda_2: 197.9735 lambda_3: 0.0000 train remain: [0.99 0.97 0.65 0.65 0.54 0.52 0.79 0.72 0.77 0.32] infer remain: [1.0, 0.96, 0.66, 0.64, 0.54, 0.52, 0.78, 0.72, 0.76, 0.32] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.63, 0.41, 0.22, 0.11, 0.09, 0.06, 0.05, 0.02] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111111010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10111111111111111111011110110000010000000000000000 10011111111111111111111110101111111011110011011000 10111111111111111111111110111111011111000010000000 00011111111111111111111110111101011111111100110000 10011110110111100010000010010001000010000000000000 loss: 0.353784, lagrangian_loss: -0.006780, attention_score_distillation_loss: 0.000020 loss: 0.238050, lagrangian_loss: 0.000078, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 17:49:55 Evaluating: accuracy: 0.8398, eval_loss: 0.618, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5848, expected_sparsity: 0.5796, expected_sequence_sparsity: 0.8353, target_sparsity: 0.58, step: 33500 lambda_1: -0.4443, lambda_2: 201.4185 lambda_3: 0.0000 train remain: [0.99 0.97 0.65 0.65 
0.55 0.52 0.79 0.72 0.77 0.32] infer remain: [1.0, 0.96, 0.66, 0.64, 0.54, 0.52, 0.78, 0.72, 0.76, 0.32] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.63, 0.41, 0.22, 0.11, 0.09, 0.06, 0.05, 0.02] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111111010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10111111111111111111011110110000010000000000000000 10011111111111111111111110101111111011110011011000 10111111111111111111111110111111011111000010000000 00011111111111111111111110111101011111111100110000 10011110110111100010000010010001000010000000000000 loss: 0.460482, lagrangian_loss: 0.000811, attention_score_distillation_loss: 0.000020 loss: 0.726788, lagrangian_loss: -0.000822, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 17:52:50 Evaluating: accuracy: 0.8567, eval_loss: 0.564, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5848, expected_sparsity: 0.5799, expected_sequence_sparsity: 0.8355, target_sparsity: 0.58, step: 34000 lambda_1: -2.8366, lambda_2: 203.4159 lambda_3: 0.0000 train remain: [0.99 0.97 0.65 0.65 0.54 0.52 0.79 0.71 0.77 0.32] infer remain: [1.0, 0.96, 0.66, 0.64, 0.54, 0.52, 0.78, 0.7, 0.76, 0.32] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.63, 0.41, 0.22, 0.11, 0.09, 0.06, 0.05, 0.02] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111111010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10111111111111111111011110110000010000000000000000 10011111111111111111111110101111111011110011011000 10011111111111111111111110111111011111000010000000 00011111111111111111111110111101011111111100110000 10001110110111100011000010010001000010000000000000 loss: 0.169909, lagrangian_loss: -0.000294, attention_score_distillation_loss: 0.000020 loss: 0.483647, lagrangian_loss: -0.001945, attention_score_distillation_loss: 0.000020 Starting saving the best from epoch 10 and step 34500 ---------------------------------------------------------------------- time: 2023-07-19 17:55:45 Evaluating: accuracy: 0.8486, eval_loss: 0.614, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5924, expected_sparsity: 0.5841, expected_sequence_sparsity: 0.8371, target_sparsity: 0.58, step: 34500 lambda_1: -0.5275, lambda_2: 205.3826 lambda_3: 0.0000 train remain: [0.99 0.97 0.65 0.65 0.54 0.52 0.79 0.69 0.77 0.32] infer remain: [1.0, 0.96, 0.64, 0.64, 0.54, 0.52, 0.78, 0.68, 0.76, 0.32] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.61, 0.39, 0.21, 0.11, 0.09, 0.06, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10111111111111111111011110110000010000000000000000 10011111111111111111111110101111111011110011011000 10011111111111111011111110111111011111000010000000 00011111111111111111111110111101011111111100110000 10001110110111100010000010010001000010010000000000 Saving the best model so far: [Epoch 10 | Step: 34500 | MACs sparsity: 0.5924 | Score: 0.8486 | Loss: 0.614] loss: 0.228054, lagrangian_loss: -0.000191, 
attention_score_distillation_loss: 0.000020 loss: 0.320713, lagrangian_loss: 0.002661, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 17:59:09 Evaluating: accuracy: 0.8603, eval_loss: 0.5008, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5924, expected_sparsity: 0.5841, expected_sequence_sparsity: 0.8371, target_sparsity: 0.58, step: 35000 lambda_1: -2.2603, lambda_2: 207.2545 lambda_3: 0.0000 train remain: [0.99 0.97 0.65 0.65 0.54 0.51 0.79 0.68 0.77 0.32] infer remain: [1.0, 0.96, 0.64, 0.64, 0.54, 0.52, 0.78, 0.68, 0.76, 0.32] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.61, 0.39, 0.21, 0.11, 0.09, 0.06, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10111111111111111111011110110000010000000000000000 10011111111111111111111110101111111011110011011000 10011111111111111011111110111111011111000010000000 00011111111111111111111110111101011111111100110000 10001110110111100010000010010001000010010000000000 Best eval score so far: 0.8486 @ step 34500 epoch 10.54 Saving the best model so far: [Epoch 10 | Step: 35000 | MACs sparsity: 0.5924 | Score: 0.8603 | Loss: 0.5008] loss: 0.792879, lagrangian_loss: -0.000407, attention_score_distillation_loss: 0.000020 loss: 0.256392, lagrangian_loss: 0.002997, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:02:25 Evaluating: accuracy: 0.8497, eval_loss: 0.5694, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5924, expected_sparsity: 0.5841, expected_sequence_sparsity: 0.8371, target_sparsity: 0.58, step: 35500 lambda_1: -1.2645, lambda_2: 209.4315 lambda_3: 0.0000 train remain: [0.99 0.98 0.65 0.65 0.54 0.52 0.79 0.68 0.77 0.32] infer remain: [1.0, 0.96, 0.64, 0.64, 0.54, 0.52, 0.78, 0.68, 0.76, 0.32] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.61, 0.39, 0.21, 0.11, 0.09, 0.06, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10111111111111111111011110110000010000000000000000 10011111111111111111111110101111111011110011011000 10011111110111111011111110111111011111010010000000 00011111111111111111111110111101011111111100110000 10001110110111100010000010010001010010000000000000 Best eval score so far: 0.8603 @ step 35000 epoch 10.69 loss: 0.739461, lagrangian_loss: 0.002644, attention_score_distillation_loss: 0.000020 loss: 0.184049, lagrangian_loss: 0.002549, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:05:19 Evaluating: accuracy: 0.8422, eval_loss: 0.5984, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5924, expected_sparsity: 0.5851, expected_sequence_sparsity: 0.8375, target_sparsity: 0.58, step: 36000 lambda_1: -2.1346, lambda_2: 211.9169 lambda_3: 0.0000 train remain: [0.99 0.98 0.65 0.65 0.54 0.51 0.78 0.68 0.77 0.32] infer remain: [1.0, 0.96, 0.64, 0.64, 0.54, 0.5, 0.78, 0.68, 0.76, 0.32] layerwise remain: [1.0, 1.0, 
1.0, 0.96, 0.61, 0.39, 0.21, 0.11, 0.08, 0.06, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111111111111111111110101111111011110011011000 10011111110111111011111110111111011111010010000000 00011111111111111111111110111101011111111100110000 10001110110111100010000010010001010010000000000000 Best eval score so far: 0.8603 @ step 35000 epoch 10.69 loss: 0.325972, lagrangian_loss: -0.001897, attention_score_distillation_loss: 0.000020 ETA: 9:12:11 | Epoch 10 finished. Took 1208.5 seconds. loss: 0.209630, lagrangian_loss: 0.000010, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:08:15 Evaluating: accuracy: 0.8537, eval_loss: 0.5697, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5891, expected_sparsity: 0.5832, expected_sequence_sparsity: 0.8368, target_sparsity: 0.58, step: 36500 lambda_1: -0.7173, lambda_2: 215.0196 lambda_3: 0.0000 train remain: [0.99 0.97 0.65 0.65 0.54 0.51 0.78 0.68 0.77 0.32] infer remain: [1.0, 0.96, 0.64, 0.66, 0.54, 0.5, 0.76, 0.68, 0.76, 0.32] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.61, 0.41, 0.22, 0.11, 0.08, 0.06, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111111101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111111111111111111110101101111011110011011000 10011111110111111011111110111111011111010010000000 00011111111111111111111110111101011111111100110000 10001110110111100010000010010001000011000000000000 Best eval score so far: 0.8603 @ step 35000 epoch 10.69 loss: 0.317730, lagrangian_loss: -0.000117, attention_score_distillation_loss: 0.000020 loss: 0.233008, lagrangian_loss: 0.000033, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:11:10 Evaluating: accuracy: 0.8386, eval_loss: 0.6306, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.594, expected_sparsity: 0.5856, expected_sequence_sparsity: 0.8377, target_sparsity: 0.58, step: 37000 lambda_1: -0.8456, lambda_2: 217.3490 lambda_3: 0.0000 train remain: [0.99 0.98 0.65 0.65 0.54 0.51 0.77 0.68 0.76 0.31] infer remain: [1.0, 0.96, 0.64, 0.64, 0.54, 0.5, 0.76, 0.68, 0.76, 0.32] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.61, 0.39, 0.21, 0.11, 0.08, 0.05, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111111111111111111110101101111011110011011000 10011111110111111011111110111111011111010010000000 00011111111111111111111110111101011111111100110000 10001110110111100010000010010001000011000000000000 Best eval score so far: 0.8603 @ step 35000 epoch 10.69 loss: 0.078040, lagrangian_loss: -0.000279, attention_score_distillation_loss: 0.000020 loss: 0.136115, lagrangian_loss: 0.000143, 
attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:14:05 Evaluating: accuracy: 0.8517, eval_loss: 0.6361, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.594, expected_sparsity: 0.5858, expected_sequence_sparsity: 0.8378, target_sparsity: 0.58, step: 37500 lambda_1: -1.5742, lambda_2: 219.3876 lambda_3: 0.0000 train remain: [0.99 0.98 0.65 0.65 0.54 0.51 0.76 0.67 0.75 0.3 ] infer remain: [1.0, 0.96, 0.64, 0.64, 0.54, 0.5, 0.76, 0.68, 0.74, 0.3] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.61, 0.39, 0.21, 0.11, 0.08, 0.05, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111111111111111111110101101111011110011011000 10011111110111111011111110111111011111010010000000 00011111110111111111111110111101011111111100110000 10001110110111100010000010010001000010000000000000 Best eval score so far: 0.8603 @ step 35000 epoch 10.69 loss: 0.263593, lagrangian_loss: 0.000968, attention_score_distillation_loss: 0.000020 loss: 0.544657, lagrangian_loss: 0.000075, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:16:59 Evaluating: accuracy: 0.8539, eval_loss: 0.6166, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.594, expected_sparsity: 0.5862, expected_sequence_sparsity: 0.8379, target_sparsity: 0.58, step: 38000 lambda_1: -0.7117, lambda_2: 221.9220 lambda_3: 0.0000 train remain: [1. 0.98 0.65 0.65 0.54 0.51 0.76 0.68 0.74 0.3 ] infer remain: [1.0, 0.96, 0.64, 0.64, 0.54, 0.5, 0.74, 0.68, 0.74, 0.3] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.61, 0.39, 0.21, 0.11, 0.08, 0.05, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101111011110011011000 10011111110111111011111110111111011111010010000000 00011111110111111111111110111101011111111100110000 10001110110111101010000010010001000000000000000000 Best eval score so far: 0.8603 @ step 35000 epoch 10.69 loss: 0.066888, lagrangian_loss: 0.002472, attention_score_distillation_loss: 0.000020 loss: 0.182690, lagrangian_loss: -0.000801, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:19:55 Evaluating: accuracy: 0.8645, eval_loss: 0.5175, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.594, expected_sparsity: 0.5862, expected_sequence_sparsity: 0.8379, target_sparsity: 0.58, step: 38500 lambda_1: -1.0725, lambda_2: 224.8803 lambda_3: 0.0000 train remain: [1. 
0.98 0.65 0.65 0.54 0.51 0.75 0.67 0.74 0.3 ] infer remain: [1.0, 0.96, 0.64, 0.64, 0.54, 0.5, 0.74, 0.68, 0.74, 0.3] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.61, 0.39, 0.21, 0.11, 0.08, 0.05, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101111011110011011000 10011111110111111011111110111111011111010010000000 00011111110111111111111110111101011111111100110000 10001110110111100010000010010101000000000000000000 Best eval score so far: 0.8603 @ step 35000 epoch 10.69 Saving the best model so far: [Epoch 11 | Step: 38500 | MACs sparsity: 0.594 | Score: 0.8645 | Loss: 0.5175] loss: 0.222608, lagrangian_loss: -0.000724, attention_score_distillation_loss: 0.000020 loss: 0.337309, lagrangian_loss: 0.000538, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:23:13 Evaluating: accuracy: 0.8534, eval_loss: 0.4947, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.594, expected_sparsity: 0.5862, expected_sequence_sparsity: 0.8379, target_sparsity: 0.58, step: 39000 lambda_1: -0.6924, lambda_2: 227.3504 lambda_3: 0.0000 train remain: [1. 0.98 0.65 0.64 0.54 0.51 0.74 0.67 0.73 0.3 ] infer remain: [1.0, 0.96, 0.64, 0.64, 0.54, 0.5, 0.74, 0.68, 0.74, 0.3] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.61, 0.39, 0.21, 0.11, 0.08, 0.05, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011111110011011000 10011111110111111011111110111111011111010010000000 00011111110111111111111110111101011111111100110000 00001110110111100010000010010101000000010000000000 Best eval score so far: 0.8645 @ step 38500 epoch 11.76 loss: 0.250423, lagrangian_loss: 0.000850, attention_score_distillation_loss: 0.000020 loss: 0.126676, lagrangian_loss: -0.000397, attention_score_distillation_loss: 0.000020 ETA: 8:53:39 | Epoch 11 finished. Took 1155.55 seconds. ---------------------------------------------------------------------- time: 2023-07-19 18:26:10 Evaluating: accuracy: 0.8713, eval_loss: 0.5835, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5826, expected_sparsity: 0.5779, expected_sequence_sparsity: 0.8347, target_sparsity: 0.58, step: 39500 lambda_1: -0.2773, lambda_2: 229.9755 lambda_3: 0.0000 train remain: [1. 
0.98 0.65 0.64 0.54 0.51 0.75 0.67 0.73 0.29] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.74, 0.68, 0.72, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.06, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011111110011011000 10011111110111111011111110111111011111010010000000 00011111110111111111111110011101011111111100110000 10001110110111100010000010010001000000010000000000 Best eval score so far: 0.8645 @ step 38500 epoch 11.76 Saving the best model so far: [Epoch 12 | Step: 39500 | MACs sparsity: 0.5826 | Score: 0.8713 | Loss: 0.5835] loss: 0.288629, lagrangian_loss: 0.002659, attention_score_distillation_loss: 0.000020 loss: 0.233647, lagrangian_loss: 0.002637, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:29:18 Evaluating: accuracy: 0.8737, eval_loss: 0.5163, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5788, expected_sequence_sparsity: 0.835, target_sparsity: 0.58, step: 40000 lambda_1: -0.9066, lambda_2: 233.1449 lambda_3: 0.0000 train remain: [1. 0.98 0.65 0.64 0.54 0.51 0.74 0.66 0.7 0.28] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.7, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111010010000000 00011111110111111011111110011101011111111100110000 10001110110111100010000010010001000000000000000000 Best eval score so far: 0.8713 @ step 39500 epoch 12.06 Saving the best model so far: [Epoch 12 | Step: 40000 | MACs sparsity: 0.5842 | Score: 0.8737 | Loss: 0.5163] loss: 0.081747, lagrangian_loss: -0.000359, attention_score_distillation_loss: 0.000020 loss: 0.485934, lagrangian_loss: -0.000194, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:32:32 Evaluating: accuracy: 0.8731, eval_loss: 0.5205, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.579, expected_sequence_sparsity: 0.8351, target_sparsity: 0.58, step: 40500 lambda_1: -1.6143, lambda_2: 235.9640 lambda_3: 0.0000 train remain: [1. 
0.99 0.65 0.64 0.54 0.51 0.75 0.66 0.69 0.28] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.68, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111010010000000 00011111110111111011111110011101011111111100100000 10001110110111100010000010010001000000000000000000 Best eval score so far: 0.8737 @ step 40000 epoch 12.22 loss: 0.287362, lagrangian_loss: 0.007187, attention_score_distillation_loss: 0.000020 loss: 0.299514, lagrangian_loss: -0.001372, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:35:29 Evaluating: accuracy: 0.8706, eval_loss: 0.5592, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.579, expected_sequence_sparsity: 0.8351, target_sparsity: 0.58, step: 41000 lambda_1: -0.1922, lambda_2: 238.8374 lambda_3: 0.0000 train remain: [1. 0.99 0.65 0.64 0.54 0.5 0.75 0.66 0.68 0.28] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.68, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111010010000000 00011111110111111011111110011101011111111100100000 10001110110110100010000010010101000000000000000000 Best eval score so far: 0.8737 @ step 40000 epoch 12.22 loss: 0.325905, lagrangian_loss: -0.000034, attention_score_distillation_loss: 0.000020 loss: 0.054371, lagrangian_loss: 0.001815, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:38:26 Evaluating: accuracy: 0.8689, eval_loss: 0.5074, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5791, expected_sequence_sparsity: 0.8351, target_sparsity: 0.58, step: 41500 lambda_1: -0.3139, lambda_2: 241.9311 lambda_3: 0.0000 train remain: [1. 
0.99 0.65 0.64 0.54 0.5 0.74 0.66 0.66 0.29] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.66, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100010001000000000010000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111010010000000 00011111110111101011111110011101011111111100100000 10001110110110100010000010010101000000000000000000 Best eval score so far: 0.8737 @ step 40000 epoch 12.22 loss: 0.440043, lagrangian_loss: -0.000070, attention_score_distillation_loss: 0.000020 loss: 0.436326, lagrangian_loss: -0.000101, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:41:21 Evaluating: accuracy: 0.8737, eval_loss: 0.5319, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5791, expected_sequence_sparsity: 0.8351, target_sparsity: 0.58, step: 42000 lambda_1: -0.5984, lambda_2: 244.5685 lambda_3: 0.0000 train remain: [1. 0.99 0.65 0.64 0.54 0.51 0.75 0.66 0.66 0.29] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.66, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100010001000010000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111010010000000 00011111110111101011111010011101011111111110100000 10001110110110100010000010010101000000000000000000 Best eval score so far: 0.8737 @ step 40000 epoch 12.22 loss: 0.062840, lagrangian_loss: 0.000442, attention_score_distillation_loss: 0.000020 loss: 0.190441, lagrangian_loss: 0.000324, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:44:15 Evaluating: accuracy: 0.8457, eval_loss: 0.5385, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5791, expected_sequence_sparsity: 0.8351, target_sparsity: 0.58, step: 42500 lambda_1: -0.6183, lambda_2: 247.5235 lambda_3: 0.0000 train remain: [1. 
0.99 0.65 0.64 0.54 0.51 0.75 0.66 0.65 0.31] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.66, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100010001000010000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111000110000000 00011111110111101011111010011101011111111110100000 00001110110110100011000010010101000000000000000000 Best eval score so far: 0.8737 @ step 40000 epoch 12.22 loss: 0.480059, lagrangian_loss: 0.000144, attention_score_distillation_loss: 0.000020 ETA: 8:36:20 | Epoch 12 finished. Took 1193.76 seconds. loss: 0.170166, lagrangian_loss: 0.000170, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:47:09 Evaluating: accuracy: 0.8788, eval_loss: 0.5086, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5791, expected_sequence_sparsity: 0.8351, target_sparsity: 0.58, step: 43000 lambda_1: -0.1994, lambda_2: 250.0612 lambda_3: 0.0000 train remain: [1. 0.99 0.65 0.64 0.54 0.51 0.75 0.66 0.65 0.37] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.66, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100010001000000000010000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111000110000000 00011111110111101011111010011101011111111110100000 10001110110110100011000010010001000000000000000000 Best eval score so far: 0.8737 @ step 40000 epoch 12.22 Saving the best model so far: [Epoch 13 | Step: 43000 | MACs sparsity: 0.5858 | Score: 0.8788 | Loss: 0.5086] loss: 0.040995, lagrangian_loss: 0.000592, attention_score_distillation_loss: 0.000020 loss: 0.251927, lagrangian_loss: 0.000451, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:50:28 Evaluating: accuracy: 0.8616, eval_loss: 0.5874, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5791, expected_sequence_sparsity: 0.8351, target_sparsity: 0.58, step: 43500 lambda_1: -0.5277, lambda_2: 253.0344 lambda_3: 0.0000 train remain: [1. 
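The "Saving the best model so far" message above fires when an evaluation beats the running best (here 0.8788 at step 43000 supersedes 0.8737 from step 40000). A plausible reconstruction of that bookkeeping, with hypothetical names:

class BestModelTracker:
    # Hypothetical helper mirroring the log: remember the best eval score
    # and write a checkpoint only when it improves.
    def __init__(self):
        self.best_score = float("-inf")
        self.best_step = None

    def update(self, score, step, epoch, macs_sparsity, loss, save_fn):
        if score > self.best_score:
            self.best_score, self.best_step = score, step
            print(f"Saving the best model so far: [Epoch {epoch} | Step: {step} | "
                  f"MACs sparsity: {macs_sparsity} | Score: {score} | Loss: {loss}]")
            save_fn()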
0.99 0.65 0.64 0.54 0.51 0.75 0.66 0.65 0.42] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.66, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100010001000000000000000100 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111010010000000 00011111110111101011111010011101011111111110100000 00001110110110100011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.253333, lagrangian_loss: -0.000115, attention_score_distillation_loss: 0.000020 loss: 0.036641, lagrangian_loss: 0.000462, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:53:25 Evaluating: accuracy: 0.8766, eval_loss: 0.5508, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5791, expected_sequence_sparsity: 0.8351, target_sparsity: 0.58, step: 44000 lambda_1: -0.1536, lambda_2: 255.9957 lambda_3: 0.0000 train remain: [1. 0.99 0.64 0.64 0.54 0.51 0.75 0.66 0.65 0.38] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.66, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111010010000000 00011111110111101011111010011101011111111110100000 10001110110110100010000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.310916, lagrangian_loss: 0.000239, attention_score_distillation_loss: 0.000020 loss: 0.180653, lagrangian_loss: 0.000746, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:56:20 Evaluating: accuracy: 0.8662, eval_loss: 0.537, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5792, expected_sequence_sparsity: 0.8352, target_sparsity: 0.58, step: 44500 lambda_1: -0.3324, lambda_2: 258.6468 lambda_3: 0.0000 train remain: [1. 
0.99 0.64 0.64 0.54 0.51 0.74 0.66 0.64 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.64, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111011011110110100010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111010010000000 00011111110111101011111010011101011111111100100000 00001110110110100010000010010101000001000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.184043, lagrangian_loss: 0.000094, attention_score_distillation_loss: 0.000020 loss: 0.409894, lagrangian_loss: 0.001817, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:59:17 Evaluating: accuracy: 0.8656, eval_loss: 0.5098, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5792, expected_sequence_sparsity: 0.8352, target_sparsity: 0.58, step: 45000 lambda_1: -0.1056, lambda_2: 261.8033 lambda_3: 0.0000 train remain: [1. 0.99 0.64 0.64 0.54 0.51 0.74 0.65 0.63 0.45] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.64, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111011011110110100010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111010010000000 00011111110111101011111010011101011101111110100000 00001110110110100011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.092437, lagrangian_loss: 0.000001, attention_score_distillation_loss: 0.000020 loss: 0.053508, lagrangian_loss: 0.000077, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:02:12 Evaluating: accuracy: 0.877, eval_loss: 0.5153, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5786, expected_sequence_sparsity: 0.835, target_sparsity: 0.58, step: 45500 lambda_1: -0.4928, lambda_2: 264.7666 lambda_3: 0.0000 train remain: [1. 
0.99 0.64 0.64 0.54 0.52 0.74 0.64 0.63 0.43] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.72, 0.64, 0.62, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111000010000000 00011111110111101011111010011101011101111100100000 10001110110110100010000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.382647, lagrangian_loss: -0.000054, attention_score_distillation_loss: 0.000020 loss: 0.238657, lagrangian_loss: 0.001680, attention_score_distillation_loss: 0.000020 ETA: 8:17:35 | Epoch 13 finished. Took 1159.43 seconds. ---------------------------------------------------------------------- time: 2023-07-19 19:05:09 Evaluating: accuracy: 0.8633, eval_loss: 0.4926, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5786, expected_sequence_sparsity: 0.835, target_sparsity: 0.58, step: 46000 lambda_1: -0.9284, lambda_2: 267.9405 lambda_3: 0.0000 train remain: [1. 0.99 0.64 0.64 0.55 0.52 0.73 0.64 0.61 0.45] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.72, 0.64, 0.62, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111000010000000 00011111110111101011111010011101011101111100100000 00001110110110100011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.109442, lagrangian_loss: -0.000188, attention_score_distillation_loss: 0.000020 loss: 0.059597, lagrangian_loss: -0.000030, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:08:05 Evaluating: accuracy: 0.8733, eval_loss: 0.5329, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5786, expected_sequence_sparsity: 0.835, target_sparsity: 0.58, step: 46500 lambda_1: -0.3270, lambda_2: 270.8869 lambda_3: 0.0000 train remain: [1. 
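A useful invariant for reading these reports: layerwise remain is the running product of the per-layer infer remain ratios, prefixed with 1.0 entries for the unpruned front of the network, since a token dropped at one layer stays dropped at every later layer. Checking this against the ratios printed for step 45500 above:

import numpy as np

# "infer remain" per prunable layer, as logged at step 45500.
infer_remain = [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.72, 0.64, 0.62, 0.28]
layerwise = [1.0, 1.0] + [float(round(v, 2)) for v in np.cumprod(infer_remain)]
print(layerwise)
# [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01]
# which matches the logged "layerwise remain" for that step.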
0.99 0.64 0.64 0.55 0.52 0.73 0.64 0.62 0.49] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.72, 0.64, 0.62, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111000010000000 00011111110110101011111010011101011101111110100000 00001110110110101010000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.071506, lagrangian_loss: 0.000188, attention_score_distillation_loss: 0.000020 loss: 0.331254, lagrangian_loss: 0.000519, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:11:00 Evaluating: accuracy: 0.875, eval_loss: 0.515, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5788, expected_sequence_sparsity: 0.835, target_sparsity: 0.58, step: 47000 lambda_1: -0.4275, lambda_2: 273.4706 lambda_3: 0.0000 train remain: [1. 0.99 0.64 0.64 0.55 0.52 0.73 0.64 0.6 0.44] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.72, 0.64, 0.6, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011001101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111000010000000 00011111110110101011111010011101011101111100100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.367759, lagrangian_loss: 0.002024, attention_score_distillation_loss: 0.000020 loss: 0.027904, lagrangian_loss: 0.001765, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:13:56 Evaluating: accuracy: 0.8693, eval_loss: 0.5222, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5788, expected_sequence_sparsity: 0.835, target_sparsity: 0.58, step: 47500 lambda_1: -0.8592, lambda_2: 276.1123 lambda_3: 0.0000 train remain: [1. 
0.99 0.64 0.63 0.54 0.52 0.73 0.64 0.6 0.36] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.72, 0.64, 0.6, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010101010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111000010000000 00011111110110101011111010011101011101110110100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.039616, lagrangian_loss: -0.000269, attention_score_distillation_loss: 0.000020 loss: 0.116070, lagrangian_loss: -0.000028, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:16:52 Evaluating: accuracy: 0.873, eval_loss: 0.5689, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5788, expected_sequence_sparsity: 0.835, target_sparsity: 0.58, step: 48000 lambda_1: -0.3408, lambda_2: 278.9204 lambda_3: 0.0000 train remain: [1. 0.99 0.64 0.64 0.55 0.52 0.73 0.64 0.6 0.41] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.72, 0.64, 0.6, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010101010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111101011111110111101011111000010100000 00011111110110101011111010011101011101110110100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.164379, lagrangian_loss: 0.000214, attention_score_distillation_loss: 0.000020 loss: 0.077955, lagrangian_loss: 0.000890, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:19:47 Evaluating: accuracy: 0.8691, eval_loss: 0.5247, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5788, expected_sequence_sparsity: 0.835, target_sparsity: 0.58, step: 48500 lambda_1: -0.7698, lambda_2: 281.7541 lambda_3: 0.0000 train remain: [1. 
0.99 0.64 0.63 0.55 0.52 0.73 0.64 0.6 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.72, 0.64, 0.6, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111101011111110111101011111000011000000 00011111110110101011111010011101011101110110100000 00001110110110101010000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.130931, lagrangian_loss: 0.000093, attention_score_distillation_loss: 0.000020 loss: 0.027282, lagrangian_loss: -0.000181, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:22:41 Evaluating: accuracy: 0.8737, eval_loss: 0.5521, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5792, expected_sequence_sparsity: 0.8352, target_sparsity: 0.58, step: 49000 lambda_1: -0.4425, lambda_2: 284.4858 lambda_3: 0.0000 train remain: [1. 0.99 0.64 0.64 0.55 0.52 0.72 0.64 0.6 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.7, 0.64, 0.6, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10011111110111111111111110101101011001110011011000 10011111110111101011111110111101011111010010000000 00011111110110101011111010011101011101110110100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.134201, lagrangian_loss: 0.000294, attention_score_distillation_loss: 0.000020 ETA: 7:58:47 | Epoch 14 finished. Took 1160.45 seconds. loss: 0.148353, lagrangian_loss: -0.000181, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:25:36 Evaluating: accuracy: 0.8627, eval_loss: 0.5654, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5792, expected_sequence_sparsity: 0.8352, target_sparsity: 0.58, step: 49500 lambda_1: -0.2822, lambda_2: 287.3346 lambda_3: 0.0000 train remain: [1. 
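Each block of ten 0/1 rows printed with an evaluation is the deterministic keep/drop mask over the 50 token bins at one prunable layer, and the corresponding infer remain entry is simply that row's mean. A sketch of the eval-time binarization, assuming hard-concrete-style gate logits (the exact deterministic gate in L0 implementations also involves the stretch interval and temperature, omitted here):

import torch

def binarize_token_gates(token_loga, threshold=0.5):
    # token_loga: [num_prunable_layers, num_bins] gate log-alphas.
    # At evaluation the stochastic L0 gates are replaced by a hard 0/1
    # mask; each printed row of ones and zeros is one layer's mask, and
    # its mean is that layer's "infer remain" ratio.
    keep_prob = torch.sigmoid(token_loga)
    masks = (keep_prob > threshold).float()
    return masks, masks.mean(dim=1)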
0.99 0.64 0.63 0.55 0.52 0.72 0.64 0.59 0.43] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.7, 0.64, 0.6, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10011111110111111111111110101101011001110011011000 10011111110111101011111110111101011111010010000000 00011111110110101011111010011101011101110110100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.042877, lagrangian_loss: 0.000351, attention_score_distillation_loss: 0.000020 loss: 0.086501, lagrangian_loss: 0.000065, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:28:32 Evaluating: accuracy: 0.8677, eval_loss: 0.5378, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5793, expected_sequence_sparsity: 0.8352, target_sparsity: 0.58, step: 50000 lambda_1: -0.6061, lambda_2: 290.3694 lambda_3: 0.0000 train remain: [1. 0.99 0.64 0.63 0.55 0.52 0.72 0.63 0.59 0.43] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.7, 0.64, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010101110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10011111110111111111111110101101011001110011011000 10011111110111101011111110111101011111010010000000 00011111110110101011111010011101011101110100100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.040072, lagrangian_loss: 0.000249, attention_score_distillation_loss: 0.000020 loss: 0.287236, lagrangian_loss: 0.000915, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:31:26 Evaluating: accuracy: 0.8733, eval_loss: 0.5234, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5793, expected_sequence_sparsity: 0.8352, target_sparsity: 0.58, step: 50500 lambda_1: -0.1086, lambda_2: 293.2389 lambda_3: 0.0000 train remain: [1. 
0.99 0.64 0.63 0.55 0.52 0.72 0.63 0.58 0.45] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.7, 0.64, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010110010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10011111110111111111111110101101011001110011011000 10011111110111101011111110111101011111010010000000 00011111110110101011111010011101011101110100100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.112167, lagrangian_loss: 0.000069, attention_score_distillation_loss: 0.000020 loss: 0.482560, lagrangian_loss: 0.000126, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:34:21 Evaluating: accuracy: 0.8589, eval_loss: 0.5786, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5793, expected_sequence_sparsity: 0.8352, target_sparsity: 0.58, step: 51000 lambda_1: -0.6258, lambda_2: 295.7832 lambda_3: 0.0000 train remain: [1. 0.99 0.64 0.63 0.55 0.52 0.71 0.64 0.58 0.41] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.7, 0.64, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010110010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10011111110111111111111110101101011001110011011000 10011111110111101011111110111101011111010010000000 00011111110110101011111010011101011101110100100000 00001110110110101010000010010001000000010000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.273637, lagrangian_loss: 0.000390, attention_score_distillation_loss: 0.000020 loss: 0.045290, lagrangian_loss: -0.000121, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:37:14 Evaluating: accuracy: 0.8689, eval_loss: 0.564, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5793, expected_sequence_sparsity: 0.8352, target_sparsity: 0.58, step: 51500 lambda_1: -0.0940, lambda_2: 298.5451 lambda_3: 0.0000 train remain: [1. 
0.99 0.64 0.63 0.55 0.52 0.72 0.63 0.58 0.41] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.7, 0.64, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10011111110111111111111110101101011001110011011000 10011111110111101011111110111101011111000110000000 00011111110110101011111010011101011101110100100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.042610, lagrangian_loss: 0.000492, attention_score_distillation_loss: 0.000020 loss: 0.136418, lagrangian_loss: 0.000698, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:40:10 Evaluating: accuracy: 0.8759, eval_loss: 0.5208, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5793, expected_sequence_sparsity: 0.8352, target_sparsity: 0.58, step: 52000 lambda_1: -0.8201, lambda_2: 301.1749 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.63 0.55 0.52 0.71 0.63 0.58 0.37] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.7, 0.64, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10011111110111111111111110101101011001110011011000 10011111110111101011111110111101011111010010000000 00011111110110101011011010011101011101110110100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.198329, lagrangian_loss: -0.000504, attention_score_distillation_loss: 0.000020 loss: 0.285631, lagrangian_loss: -0.000127, attention_score_distillation_loss: 0.000020 ETA: 7:39:13 | Epoch 15 finished. Took 1132.47 seconds. ---------------------------------------------------------------------- time: 2023-07-19 19:43:08 Evaluating: accuracy: 0.8711, eval_loss: 0.5429, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5797, expected_sequence_sparsity: 0.8354, target_sparsity: 0.58, step: 52500 lambda_1: 0.0207, lambda_2: 304.0379 lambda_3: 0.0000 train remain: [1. 
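The per-epoch footers ("ETA: ... | Epoch N finished. Took ... seconds.") project the remaining wall-clock time of the run. A simplified reconstruction (the logged estimate is evidently smoothed over recent step timings rather than a single epoch, so the numbers will not match exactly):

import datetime

def epoch_footer(epoch, num_epochs, epoch_seconds):
    # Naive projection: remaining epochs times the last epoch's duration.
    remaining = (num_epochs - epoch - 1) * epoch_seconds
    eta = str(datetime.timedelta(seconds=int(remaining)))
    return f"ETA: {eta} | Epoch {epoch} finished. Took {epoch_seconds:.2f} seconds."

print(epoch_footer(15, 40, 1132.47))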
0.99 0.64 0.63 0.55 0.52 0.71 0.63 0.58 0.38] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.64, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10011111110111111111111110101101011001110011010000 10011111110111101011111110111101011111010010000000 00011111110110101011011010011101011101110110100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.023148, lagrangian_loss: 0.000203, attention_score_distillation_loss: 0.000020 loss: 0.197529, lagrangian_loss: 0.000068, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:46:04 Evaluating: accuracy: 0.8614, eval_loss: 0.5646, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5797, expected_sequence_sparsity: 0.8354, target_sparsity: 0.58, step: 53000 lambda_1: -0.7220, lambda_2: 306.8342 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.63 0.55 0.52 0.71 0.63 0.58 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.64, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10011111110111111111111110101101011001110011010000 10011111110111101011111110111101011111010010000000 00011111110110101011011010011101011101110110100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.249287, lagrangian_loss: 0.000696, attention_score_distillation_loss: 0.000020 loss: 0.030555, lagrangian_loss: 0.000652, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:49:01 Evaluating: accuracy: 0.87, eval_loss: 0.5721, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5799, expected_sequence_sparsity: 0.8355, target_sparsity: 0.58, step: 53500 lambda_1: -0.1421, lambda_2: 309.9918 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.71 0.63 0.58 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.62, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010101010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10001111110111111111111110101101011001110011011000 10011111110111101011111110111101011111000010000000 00011111110110101011011010011101011101110110100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.018115, lagrangian_loss: 0.000054, attention_score_distillation_loss: 0.000020 loss: 0.121729, lagrangian_loss: 0.000089, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:51:55 Evaluating: accuracy: 0.875, eval_loss: 0.5168, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5799, expected_sequence_sparsity: 0.8355, target_sparsity: 0.58, step: 54000 lambda_1: -0.7594, lambda_2: 312.8035 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.63 0.55 0.52 0.71 0.62 0.58 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.62, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010101010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111111111111110101101011011110011010000 10011111110111101011111110111101011111000010000000 00011111110110101011011010011101011101110110100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.076535, lagrangian_loss: 0.000040, attention_score_distillation_loss: 0.000020 loss: 0.019201, lagrangian_loss: 0.000153, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:54:47 Evaluating: accuracy: 0.873, eval_loss: 0.5762, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5802, expected_sequence_sparsity: 0.8356, target_sparsity: 0.58, step: 54500 lambda_1: -0.4042, lambda_2: 315.5988 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.71 0.61 0.58 0.36] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.6, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111111111111110101101011011110011010000 10011111110111101011111110011101011111000010000000 10011111110110101011011010011101011101110100100000 00001110110110101010000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.458262, lagrangian_loss: 0.000433, attention_score_distillation_loss: 0.000020 loss: 0.019395, lagrangian_loss: 0.000112, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:57:43 Evaluating: accuracy: 0.8675, eval_loss: 0.5985, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5802, expected_sequence_sparsity: 0.8356, target_sparsity: 0.58, step: 55000 lambda_1: -0.0856, lambda_2: 318.5368 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.72 0.61 0.58 0.35] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.6, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011010101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111111111111110101101011011110011010000 10011111110111101011111110011101011111000010000000 00011111110110101011011010011101011101110110100000 00001110110110101011000010010001000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.030892, lagrangian_loss: 0.001352, attention_score_distillation_loss: 0.000020 loss: 0.043613, lagrangian_loss: 0.000026, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:00:39 Evaluating: accuracy: 0.877, eval_loss: 0.5395, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5799, expected_sequence_sparsity: 0.8355, target_sparsity: 0.58, step: 55500 lambda_1: -0.6350, lambda_2: 321.4322 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.71 0.61 0.58 0.36] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.62, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011010101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111111111111110101101011011110011010000 10011111110111101011111110011101011111000011000000 00011111110110101011011010011101011101110110100000 00001110110110101011000010010001000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.056100, lagrangian_loss: -0.000230, attention_score_distillation_loss: 0.000020 ETA: 7:20:20 | Epoch 16 finished. Took 1159.59 seconds. loss: 0.020278, lagrangian_loss: -0.000281, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:03:34 Evaluating: accuracy: 0.8717, eval_loss: 0.5872, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5799, expected_sequence_sparsity: 0.8355, target_sparsity: 0.58, step: 56000 lambda_1: -0.3550, lambda_2: 324.4251 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.71 0.61 0.58 0.32] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.62, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101101110011111001001011000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011111110011010000 10011111110111101011111110011101011111000011000000 10011111110110101011011010011101011101110100100000 00001110110110101011000010010001000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.101723, lagrangian_loss: -0.000083, attention_score_distillation_loss: 0.000020 loss: 0.064691, lagrangian_loss: 0.000446, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:06:29 Evaluating: accuracy: 0.8753, eval_loss: 0.557, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5799, expected_sequence_sparsity: 0.8355, target_sparsity: 0.58, step: 56500 lambda_1: -0.1721, lambda_2: 327.0652 lambda_3: 0.0000 train remain: [1. 1. 
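Throughout this stretch attention_score_distillation_loss holds steady at 0.000020, suggesting the student's attention patterns have already converged to the teacher's at the pruned layers. A generic sketch of such a term, assuming an MSE formulation (the exact loss used by this codebase may differ):

import torch.nn.functional as F

def attention_score_distillation_loss(student_attn, teacher_attn, weight=1.0):
    # student_attn / teacher_attn: attention maps of matching shape, e.g.
    # [batch, heads, seq_len, seq_len]. Pull the pruned student's attention
    # distributions toward the dense teacher's.
    return weight * F.mse_loss(student_attn, teacher_attn)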
0.64 0.64 0.55 0.52 0.7 0.61 0.57 0.34] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.62, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101101110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011111110011010000 10011111110111101011111110011101011111010010000000 00011111110110101011011010011101011101110110100000 00001110110110101010000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.122554, lagrangian_loss: 0.001092, attention_score_distillation_loss: 0.000020 loss: 0.325221, lagrangian_loss: -0.000051, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:09:27 Evaluating: accuracy: 0.8715, eval_loss: 0.5532, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5801, expected_sequence_sparsity: 0.8355, target_sparsity: 0.58, step: 57000 lambda_1: -0.2316, lambda_2: 329.9371 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.69 0.61 0.57 0.34] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.62, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010101110 11111111111111111101101110011111001001011000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011111110011010000 10011111110111101011111110011101011111010010000000 00011111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.163286, lagrangian_loss: -0.000038, attention_score_distillation_loss: 0.000020 loss: 0.038924, lagrangian_loss: 0.000436, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:12:22 Evaluating: accuracy: 0.8735, eval_loss: 0.5578, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 57500 lambda_1: -0.6205, lambda_2: 332.8638 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.68 0.61 0.56 0.35] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.62, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101011100010001110 11111111111111111101101110011111001001010010000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011011110011010000 10011111110111101011111110011101011111010010000000 00011111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.187014, lagrangian_loss: -0.000031, attention_score_distillation_loss: 0.000020 loss: 0.040996, lagrangian_loss: 0.000667, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:15:16 Evaluating: accuracy: 0.8737, eval_loss: 0.5757, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5807, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 58000 lambda_1: -0.4421, lambda_2: 335.7651 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.68 0.61 0.56 0.35] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.6, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101011100010001110 11111111111111111101101110011111001001011000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011011110011010000 10011111110111101011111110011101011111000010000000 10001111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.265504, lagrangian_loss: -0.000059, attention_score_distillation_loss: 0.000020 loss: 0.283869, lagrangian_loss: -0.000070, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:18:12 Evaluating: accuracy: 0.8742, eval_loss: 0.5676, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5807, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 58500 lambda_1: -0.6451, lambda_2: 338.5070 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.68 0.61 0.56 0.34] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.6, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000111100110101010100010001110 11111111111111111101101110011111001001011000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011011110011010000 10011111110111101011111110011101011111000010000000 10001111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.068650, lagrangian_loss: -0.000297, attention_score_distillation_loss: 0.000020 loss: 0.026540, lagrangian_loss: 0.000359, attention_score_distillation_loss: 0.000020 ETA: 7:00:53 | Epoch 17 finished. Took 1133.54 seconds. ---------------------------------------------------------------------- time: 2023-07-19 20:21:07 Evaluating: accuracy: 0.8755, eval_loss: 0.5649, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5807, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 59000 lambda_1: -0.1968, lambda_2: 341.6069 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.67 0.61 0.56 0.32] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.6, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000111100110101010100010001110 11111111111111111101101110011111101001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011011110011010000 10011111110111101011111110011101011111000010000000 10001111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.083371, lagrangian_loss: 0.000198, attention_score_distillation_loss: 0.000020 loss: 0.118603, lagrangian_loss: -0.000006, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:24:02 Evaluating: accuracy: 0.8742, eval_loss: 0.5648, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5807, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 59500 lambda_1: -0.6989, lambda_2: 344.4115 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.68 0.61 0.56 0.34] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.6, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011010101100110101010100010001110 11111111111111111101101110011111101001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011011110011010000 10011111110111101011111110011101011111000010000000 10001111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.194317, lagrangian_loss: 0.000329, attention_score_distillation_loss: 0.000020 loss: 0.164261, lagrangian_loss: -0.000186, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:26:56 Evaluating: accuracy: 0.8781, eval_loss: 0.5391, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5807, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 60000 lambda_1: -0.3548, lambda_2: 347.2031 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.68 0.6 0.56 0.34] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.6, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011010101100110101010100010001110 11111111111111111101101110011111001011010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011001110111010000 10011111110111101011111110011101011111000010000000 10001111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.031535, lagrangian_loss: 0.000339, attention_score_distillation_loss: 0.000020 loss: 0.132691, lagrangian_loss: 0.000609, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:29:51 Evaluating: accuracy: 0.8748, eval_loss: 0.556, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5809, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 60500 lambda_1: -0.6470, lambda_2: 350.1119 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.7 0.59 0.56 0.36] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011010101100110101010100010001110 11111111111111111101101110011111101001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011001110111010000 10001111110111101011111010011101011111000110000000 10001111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.281685, lagrangian_loss: -0.000284, attention_score_distillation_loss: 0.000020 loss: 0.038755, lagrangian_loss: -0.000014, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:32:47 Evaluating: accuracy: 0.87, eval_loss: 0.5432, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5809, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 61000 lambda_1: -0.3024, lambda_2: 352.7885 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.7 0.59 0.56 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010110010001110 11111111111111111101101110111111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011001110111010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.018308, lagrangian_loss: 0.000454, attention_score_distillation_loss: 0.000020 loss: 0.132226, lagrangian_loss: -0.000008, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:35:43 Evaluating: accuracy: 0.879, eval_loss: 0.5427, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5809, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 61500 lambda_1: -0.5045, lambda_2: 355.7060 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.7 0.59 0.56 0.38] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010110010001110 11111111111111111101101110011111001001010000000010 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011001110111010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 10000110110110101011000010010001000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 Saving the best model so far: [Epoch 18 | Step: 61500 | MACs sparsity: 0.5842 | Score: 0.879 | Loss: 0.5427] loss: 0.027696, lagrangian_loss: -0.000166, attention_score_distillation_loss: 0.000020 loss: 0.105174, lagrangian_loss: 0.000082, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:38:53 Evaluating: accuracy: 0.8746, eval_loss: 0.5542, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5809, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 62000 lambda_1: 0.0505, lambda_2: 358.6501 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.7 0.59 0.56 0.41] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011001101100110101010100010001110 11111111111111111101101110011111001001010000000010 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011001110111010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 10000110110110101011000010010001000000000000000000 Best eval score so far: 0.8790 @ step 61500 epoch 18.78 loss: 0.024550, lagrangian_loss: 0.000187, attention_score_distillation_loss: 0.000020 ETA: 6:42:16 | Epoch 18 finished. Took 1175.83 seconds. loss: 0.037825, lagrangian_loss: 0.000171, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:41:50 Evaluating: accuracy: 0.8781, eval_loss: 0.605, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5809, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 62500 lambda_1: -0.3468, lambda_2: 361.5862 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.7 0.59 0.56 0.45] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011001101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011001110111010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 10000110110110101011000010010001000000000000000000 Best eval score so far: 0.8790 @ step 61500 epoch 18.78 loss: 0.079975, lagrangian_loss: -0.000049, attention_score_distillation_loss: 0.000020 loss: 0.018173, lagrangian_loss: 0.003154, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:44:46 Evaluating: accuracy: 0.8792, eval_loss: 0.5629, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 63000 lambda_1: -0.4614, lambda_2: 364.2297 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.54 0.52 0.69 0.59 0.56 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.64, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101101110011111001001010000000010 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011001110111010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 10000110110110101011000010010001000000000000000000 Best eval score so far: 0.8790 @ step 61500 epoch 18.78 Saving the best model so far: [Epoch 19 | Step: 63000 | MACs sparsity: 0.5858 | Score: 0.8792 | Loss: 0.5629] loss: 0.012827, lagrangian_loss: -0.000131, attention_score_distillation_loss: 0.000020 loss: 0.020887, lagrangian_loss: 0.000083, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:48:05 Evaluating: accuracy: 0.8779, eval_loss: 0.5378, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 63500 lambda_1: -0.2574, lambda_2: 366.9497 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.69 0.59 0.56 0.38] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.64, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101101110011111001001010000000010 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011001110111010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 10000110110110101011000010010001000000000000000000 Best eval score so far: 0.8792 @ step 63000 epoch 19.24 loss: 0.163272, lagrangian_loss: -0.000041, attention_score_distillation_loss: 0.000020 loss: 0.290646, lagrangian_loss: 0.000570, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:50:59 Evaluating: accuracy: 0.8689, eval_loss: 0.5094, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 64000 lambda_1: -0.3177, lambda_2: 369.8412 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.68 0.59 0.56 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.64, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101101110011111001001010000000010 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011001110111010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 00000110110110101011000010010001010000000000000000 Best eval score so far: 0.8792 @ step 63000 epoch 19.24 loss: 0.051100, lagrangian_loss: 0.000511, attention_score_distillation_loss: 0.000020 loss: 0.338850, lagrangian_loss: 0.000601, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:53:55 Evaluating: accuracy: 0.8797, eval_loss: 0.5075, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 64500 lambda_1: -0.3798, lambda_2: 372.7704 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.69 0.59 0.56 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.64, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011001110111010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 00000110110110101011001010010001000000000000000000 Best eval score so far: 0.8792 @ step 63000 epoch 19.24 Saving the best model so far: [Epoch 19 | Step: 64500 | MACs sparsity: 0.5858 | Score: 0.8797 | Loss: 0.5075] loss: 0.276395, lagrangian_loss: 0.000011, attention_score_distillation_loss: 0.000020 loss: 0.260566, lagrangian_loss: -0.000048, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:57:12 Evaluating: accuracy: 0.8781, eval_loss: 0.5454, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 65000 lambda_1: -0.4760, lambda_2: 375.3930 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.69 0.59 0.56 0.38] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.64, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011001110111010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 10000110110110101011000010010001000000000000000000 Best eval score so far: 0.8797 @ step 64500 epoch 19.70 loss: 0.072845, lagrangian_loss: -0.000035, attention_score_distillation_loss: 0.000020 loss: 0.072607, lagrangian_loss: 0.000275, attention_score_distillation_loss: 0.000020 ETA: 6:23:35 | Epoch 19 finished. Took 1177.69 seconds. ---------------------------------------------------------------------- time: 2023-07-19 21:00:06 Evaluating: accuracy: 0.881, eval_loss: 0.5374, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 65500 lambda_1: -0.4677, lambda_2: 378.2476 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.7 0.59 0.56 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.64, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011101110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 10000110110110101011000010010001000000000000000000 Best eval score so far: 0.8797 @ step 64500 epoch 19.70 Saving the best model so far: [Epoch 20 | Step: 65500 | MACs sparsity: 0.5858 | Score: 0.881 | Loss: 0.5374] loss: 0.023048, lagrangian_loss: -0.000142, attention_score_distillation_loss: 0.000020 loss: 0.017388, lagrangian_loss: -0.000085, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:03:21 Evaluating: accuracy: 0.8788, eval_loss: 0.538, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 66000 lambda_1: -0.6193, lambda_2: 381.0638 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.69 0.59 0.56 0.36] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.64, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011101110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 10000110110110101011000010010001000000000000000000 Best eval score so far: 0.8810 @ step 65500 epoch 20.01 loss: 0.095806, lagrangian_loss: 0.001000, attention_score_distillation_loss: 0.000020 loss: 0.056584, lagrangian_loss: 0.001137, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:06:16 Evaluating: accuracy: 0.8731, eval_loss: 0.5281, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 66500 lambda_1: -0.4764, lambda_2: 383.5851 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.53 0.68 0.59 0.56 0.34] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.64, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011101110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 00000110110110101011000010010001000000010000000000 Best eval score so far: 0.8810 @ step 65500 epoch 20.01 loss: 0.018263, lagrangian_loss: -0.000061, attention_score_distillation_loss: 0.000020 loss: 0.110658, lagrangian_loss: -0.000056, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:09:11 Evaluating: accuracy: 0.8741, eval_loss: 0.5167, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5818, expected_sequence_sparsity: 0.8362, target_sparsity: 0.58, step: 67000 lambda_1: -0.4481, lambda_2: 386.4257 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.53 0.67 0.59 0.55 0.35] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.62, 0.58, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011001110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110100100000 00000110110110101011000010010001010000000000000000 Best eval score so far: 0.8810 @ step 65500 epoch 20.01 loss: 0.021097, lagrangian_loss: -0.000038, attention_score_distillation_loss: 0.000020 loss: 0.038061, lagrangian_loss: 0.000271, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:12:07 Evaluating: accuracy: 0.8774, eval_loss: 0.5126, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5803, expected_sequence_sparsity: 0.8356, target_sparsity: 0.58, step: 67500 lambda_1: -0.3271, lambda_2: 389.7024 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.53 0.66 0.58 0.55 0.37] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.58, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000010000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011001110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110100100000 00000110110110101011000010010001010000000000000000 Best eval score so far: 0.8810 @ step 65500 epoch 20.01 loss: 0.016362, lagrangian_loss: 0.001655, attention_score_distillation_loss: 0.000020 loss: 0.063967, lagrangian_loss: 0.000203, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:15:02 Evaluating: accuracy: 0.8807, eval_loss: 0.5286, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5812, expected_sequence_sparsity: 0.8359, target_sparsity: 0.58, step: 68000 lambda_1: -0.1287, lambda_2: 392.2574 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.66 0.58 0.55 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.62, 0.58, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001001000000000000000 10011111111111111011011110110100010000000000000000 10001111110111101111111010101101011001110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110100100000 00000110110110101011000010010001000000010000000000 Best eval score so far: 0.8810 @ step 65500 epoch 20.01 loss: 0.252197, lagrangian_loss: 0.000016, attention_score_distillation_loss: 0.000020 loss: 0.258947, lagrangian_loss: -0.000000, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:17:57 Evaluating: accuracy: 0.8843, eval_loss: 0.5017, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5826, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 68500 lambda_1: -0.2580, lambda_2: 395.0177 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.65 0.58 0.55 0.45] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.62, 0.58, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101101110011111001001010001000000 11111111111111111111111110010001000000000000000000 10011111111111111011011110110100010000000000000000 10001111110111101111111010101101011001110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110100100000 00000110110110101011000010010001000000010000000000 Best eval score so far: 0.8810 @ step 65500 epoch 20.01 Saving the best model so far: [Epoch 20 | Step: 68500 | MACs sparsity: 0.5874 | Score: 0.8843 | Loss: 0.5017] loss: 0.463440, lagrangian_loss: -0.000034, attention_score_distillation_loss: 0.000020 loss: 0.032587, lagrangian_loss: 0.000091, attention_score_distillation_loss: 0.000020 ETA: 6:05:07 | Epoch 20 finished. Took 1197.46 seconds. ---------------------------------------------------------------------- time: 2023-07-19 21:21:12 Evaluating: accuracy: 0.8849, eval_loss: 0.4964, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5812, expected_sequence_sparsity: 0.8359, target_sparsity: 0.58, step: 69000 lambda_1: -0.1886, lambda_2: 398.0453 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.65 0.58 0.55 0.44] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.62, 0.58, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101110110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000100000000000 10011111111111111011011110110100010000000000000000 10001111110111101111111010101101011001110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8843 @ step 68500 epoch 20.92 Saving the best model so far: [Epoch 21 | Step: 69000 | MACs sparsity: 0.5874 | Score: 0.8849 | Loss: 0.4964] loss: 0.070037, lagrangian_loss: 0.000229, attention_score_distillation_loss: 0.000020 loss: 0.015016, lagrangian_loss: 0.000159, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:24:22 Evaluating: accuracy: 0.8814, eval_loss: 0.53, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5803, expected_sequence_sparsity: 0.8356, target_sparsity: 0.58, step: 69500 lambda_1: -0.0890, lambda_2: 400.6562 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.65 0.58 0.55 0.46] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.58, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000100000000000 10011111111111111011011110110100011000000000000000 10001111110111101111111010101101011001110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.045294, lagrangian_loss: 0.000109, attention_score_distillation_loss: 0.000020 loss: 0.010873, lagrangian_loss: 0.000007, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:27:17 Evaluating: accuracy: 0.8785, eval_loss: 0.5248, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5803, expected_sequence_sparsity: 0.8356, target_sparsity: 0.58, step: 70000 lambda_1: -0.2052, lambda_2: 403.3384 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.65 0.58 0.55 0.46] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.58, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000100000000000 10011111111111111011011110110100011000000000000000 10001111110111101111111010101101011001110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.023941, lagrangian_loss: -0.000002, attention_score_distillation_loss: 0.000020 loss: 0.042479, lagrangian_loss: 0.000123, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:30:11 Evaluating: accuracy: 0.8794, eval_loss: 0.5079, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5803, expected_sequence_sparsity: 0.8356, target_sparsity: 0.58, step: 70500 lambda_1: -0.1424, lambda_2: 406.1184 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.64 0.58 0.55 0.45] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.58, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111011011110110100011000000000000000 10001111110111101111111010101101011001110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.335359, lagrangian_loss: -0.000012, attention_score_distillation_loss: 0.000020 loss: 0.114362, lagrangian_loss: 0.000204, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:33:05 Evaluating: accuracy: 0.8825, eval_loss: 0.5149, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 71000 lambda_1: -0.0808, lambda_2: 408.9560 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.56 0.52 0.65 0.57 0.55 0.47] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.56, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111011011110110100011000000000000000 10001111110111101011111010101101011101110011010000 10001111110111101011111010011101011111000010000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.165304, lagrangian_loss: 0.000107, attention_score_distillation_loss: 0.000020 loss: 0.013163, lagrangian_loss: -0.000017, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:36:02 Evaluating: accuracy: 0.8836, eval_loss: 0.5286, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 71500 lambda_1: -0.7439, lambda_2: 411.7583 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.56 0.52 0.65 0.57 0.55 0.44] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.56, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111011011110110100011000000000000000 10001111110111101011111010101101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010001000000010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.043684, lagrangian_loss: 0.000143, attention_score_distillation_loss: 0.000020 loss: 0.099347, lagrangian_loss: 0.000058, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:38:56 Evaluating: accuracy: 0.8797, eval_loss: 0.5235, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 72000 lambda_1: -0.2411, lambda_2: 414.7701 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.56 0.52 0.66 0.57 0.54 0.41] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.56, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101101110011111011001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111011011110110100011000000000000000 10001111110111101011111010101101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010001000000010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.301586, lagrangian_loss: 0.000069, attention_score_distillation_loss: 0.000020 ETA: 5:46:11 | Epoch 21 finished. Took 1174.74 seconds. loss: 0.066749, lagrangian_loss: -0.000078, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:41:53 Evaluating: accuracy: 0.8781, eval_loss: 0.5336, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 72500 lambda_1: -0.1597, lambda_2: 417.6783 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.56 0.52 0.65 0.57 0.55 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.56, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101101110011111011001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111011011110110100011000000000000000 10001111110111101011111010101101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010001000000010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.070522, lagrangian_loss: 0.000030, attention_score_distillation_loss: 0.000020 loss: 0.036869, lagrangian_loss: 0.000113, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:44:49 Evaluating: accuracy: 0.881, eval_loss: 0.5276, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 73000 lambda_1: -0.3256, lambda_2: 420.6231 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.66 0.57 0.55 0.44] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.56, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101101110011111011001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111011011110110100011000000000000000 10001111110111101011111010101101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010001000000010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.067634, lagrangian_loss: 0.000154, attention_score_distillation_loss: 0.000020 loss: 0.216381, lagrangian_loss: -0.000009, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:47:46 Evaluating: accuracy: 0.8816, eval_loss: 0.5287, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 73500 lambda_1: -0.0074, lambda_2: 423.1298 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.56 0.52 0.66 0.57 0.55 0.45] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.56, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101101111011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111011011110110100011000000000000000 10001111110111101011111010101101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010001000000010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.031040, lagrangian_loss: 0.000128, attention_score_distillation_loss: 0.000020 loss: 0.033999, lagrangian_loss: 0.000281, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:50:41 Evaluating: accuracy: 0.8797, eval_loss: 0.511, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.581, expected_sequence_sparsity: 0.8359, target_sparsity: 0.58, step: 74000 lambda_1: -0.2138, lambda_2: 426.0714 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.66 0.57 0.55 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.6, 0.56, 0.54, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101101111011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111011011110110100011000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010001000000000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.033948, lagrangian_loss: 0.001736, attention_score_distillation_loss: 0.000020 loss: 0.339679, lagrangian_loss: 0.000277, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:53:35 Evaluating: accuracy: 0.8816, eval_loss: 0.5339, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5809, expected_sequence_sparsity: 0.8359, target_sparsity: 0.58, step: 74500 lambda_1: -0.1441, lambda_2: 428.8959 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.56 0.53 0.66 0.57 0.55 0.42] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.6, 0.56, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101101110011111001001010100000000 11111111111111111111111110011001000000000000000000 10011111111111111011011110110100011000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010001000000010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.016333, lagrangian_loss: 0.000049, attention_score_distillation_loss: 0.000020 loss: 0.029728, lagrangian_loss: 0.000155, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:56:29 Evaluating: accuracy: 0.8763, eval_loss: 0.5386, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.581, expected_sequence_sparsity: 0.8359, target_sparsity: 0.58, step: 75000 lambda_1: -0.2069, lambda_2: 431.5603 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.66 0.56 0.55 0.44] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.6, 0.56, 0.54, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101101110111111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111011111110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010000001000000010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.024295, lagrangian_loss: 0.000959, attention_score_distillation_loss: 0.000020 loss: 0.130474, lagrangian_loss: -0.000029, attention_score_distillation_loss: 0.000020 ETA: 5:26:41 | Epoch 22 finished. Took 1132.14 seconds. ---------------------------------------------------------------------- time: 2023-07-19 21:59:24 Evaluating: accuracy: 0.877, eval_loss: 0.5532, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5824, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 75500 lambda_1: -0.2969, lambda_2: 434.6068 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.68 0.56 0.55 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.56, 0.54, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111011111110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010000001000000010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.117379, lagrangian_loss: 0.000013, attention_score_distillation_loss: 0.000020 loss: 0.012942, lagrangian_loss: 0.000444, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:02:19 Evaluating: accuracy: 0.8735, eval_loss: 0.535, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5824, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 76000 lambda_1: -0.4099, lambda_2: 437.4709 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.53 0.68 0.56 0.55 0.42] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.56, 0.54, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101101110111111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111011111110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110010101011000010000001000001010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.015356, lagrangian_loss: 0.000100, attention_score_distillation_loss: 0.000020 loss: 0.022805, lagrangian_loss: 0.000466, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:05:11 Evaluating: accuracy: 0.877, eval_loss: 0.5574, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5825, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 76500 lambda_1: -0.2786, lambda_2: 440.0410 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.53 0.67 0.56 0.54 0.42] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.56, 0.54, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111011111110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110010101011000010000001000001000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.060316, lagrangian_loss: -0.000032, attention_score_distillation_loss: 0.000020 loss: 0.036453, lagrangian_loss: 0.000028, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:08:06 Evaluating: accuracy: 0.8821, eval_loss: 0.506, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5825, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 77000 lambda_1: -0.2779, lambda_2: 443.1506 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.54 0.67 0.56 0.54 0.44] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.56, 0.54, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101101110011111001001010000000100 11111111111111111111111110010001000000000000000000 10011111111111111011111110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110010101011000010000001000000010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.035918, lagrangian_loss: -0.000030, attention_score_distillation_loss: 0.000020 loss: 0.066926, lagrangian_loss: 0.000072, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:11:02 Evaluating: accuracy: 0.8819, eval_loss: 0.5381, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5824, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 77500 lambda_1: -0.5802, lambda_2: 446.2108 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.54 0.69 0.56 0.54 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.56, 0.54, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111101101110011111001001110000000000 11111111111111111111111110010001000000000000000000 10011111111111111011111110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110010101011000010000101000001000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.010055, lagrangian_loss: 0.000035, attention_score_distillation_loss: 0.000020 loss: 0.041910, lagrangian_loss: 0.001313, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:13:57 Evaluating: accuracy: 0.881, eval_loss: 0.525, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5824, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 78000 lambda_1: -0.0576, lambda_2: 449.4054 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.54 0.69 0.56 0.54 0.34] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.56, 0.54, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000111100110101010100010001110 11111111111111111101101110011111001001010000000001 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110010101011000010000001000001010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.083640, lagrangian_loss: 0.001204, attention_score_distillation_loss: 0.000020 loss: 0.039007, lagrangian_loss: 0.000224, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:16:55 Evaluating: accuracy: 0.8797, eval_loss: 0.5243, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5825, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 78500 lambda_1: -0.1890, lambda_2: 452.1937 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.54 0.68 0.55 0.53 0.38] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.56, 0.52, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100011001110 11111111111111111101101110011111001001010000001000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110100100000 00000110110010101011000010000001000001000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.036491, lagrangian_loss: 0.000262, attention_score_distillation_loss: 0.000020 ETA: 5:07:32 | Epoch 23 finished. Took 1159.49 seconds. loss: 0.086234, lagrangian_loss: -0.000001, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:19:50 Evaluating: accuracy: 0.8752, eval_loss: 0.5335, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5825, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 79000 lambda_1: -0.1797, lambda_2: 454.9331 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.54 0.68 0.55 0.53 0.42] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.56, 0.52, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101110110101010100010001110 11111111111111111101101110011111001001010000001000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110100100000 00000110110010101011000010000001000001000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.037827, lagrangian_loss: -0.000013, attention_score_distillation_loss: 0.000020 loss: 0.024940, lagrangian_loss: 0.000085, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:22:45 Evaluating: accuracy: 0.8814, eval_loss: 0.5207, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5827, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 79500 lambda_1: -0.0871, lambda_2: 457.6575 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.53 0.67 0.55 0.53 0.41] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.54, 0.52, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101101110011111011001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000010000000 00001111110110101011011010001101010101110110100000 00000110110010101011000010000001000001000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.041184, lagrangian_loss: 0.001178, attention_score_distillation_loss: 0.000020 loss: 0.016891, lagrangian_loss: 0.000692, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:25:39 Evaluating: accuracy: 0.8796, eval_loss: 0.5212, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5827, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 80000 lambda_1: -0.0412, lambda_2: 460.7020 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.53 0.66 0.56 0.53 0.43] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.54, 0.52, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101101110011111001101010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000010000000 00001111110110101011011010001101010101110110100000 00000110110010101011000010000101000000000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.028016, lagrangian_loss: 0.000206, attention_score_distillation_loss: 0.000020 loss: 0.014005, lagrangian_loss: -0.000021, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:28:34 Evaluating: accuracy: 0.8839, eval_loss: 0.5331, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 80500 lambda_1: -0.0560, lambda_2: 463.2805 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.53 0.65 0.56 0.53 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.6, 0.54, 0.52, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101101110011111001101010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000010000000 00001111110110101011011010001101010101110110100000 00000110110010101011000010000101000000000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.031058, lagrangian_loss: 0.001869, attention_score_distillation_loss: 0.000020 loss: 0.035282, lagrangian_loss: 0.000179, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:31:30 Evaluating: accuracy: 0.8808, eval_loss: 0.5228, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 81000 lambda_1: -0.0867, lambda_2: 466.2064 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.53 0.65 0.56 0.53 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.6, 0.54, 0.52, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101101110011111001101010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000010000000 00001111110110101011011010001101010101110110100000 00000110110010101011000010000101000000000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.020326, lagrangian_loss: 0.000087, attention_score_distillation_loss: 0.000020 loss: 0.237867, lagrangian_loss: 0.000121, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:34:23 Evaluating: accuracy: 0.885, eval_loss: 0.498, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5814, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 81500 lambda_1: -0.1589, lambda_2: 469.1951 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.53 0.66 0.55 0.53 0.38] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.6, 0.54, 0.52, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011010101100110101010100010001110 11111111111111111101101110011111001101010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000010000000 00001111110110101011011010001101010101110110100000 00000110110010101011000010000001000000000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 Saving the best model so far: [Epoch 24 | Step: 81500 | MACs sparsity: 0.5858 | Score: 0.885 | Loss: 0.498] loss: 0.133779, lagrangian_loss: 0.000331, attention_score_distillation_loss: 0.000020 loss: 0.082751, lagrangian_loss: 0.000102, attention_score_distillation_loss: 0.000020 ETA: 4:48:26 | Epoch 24 finished. Took 1165.38 seconds. ---------------------------------------------------------------------- time: 2023-07-19 22:37:54 Evaluating: accuracy: 0.8838, eval_loss: 0.5325, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5814, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 82000 lambda_1: -0.0668, lambda_2: 471.9885 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.53 0.67 0.55 0.53 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.6, 0.54, 0.52, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011010101100110101010100010001110 11111111111111111101101110011111001101010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101000110000000 00001111110110101011011010001101010101110110100000 00000110110010101011000010000001000000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.078437, lagrangian_loss: 0.000281, attention_score_distillation_loss: 0.000020 loss: 0.097536, lagrangian_loss: 0.000040, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:40:51 Evaluating: accuracy: 0.8828, eval_loss: 0.5407, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5814, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 82500 lambda_1: -0.1629, lambda_2: 474.8652 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.53 0.68 0.55 0.53 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.6, 0.54, 0.52, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101101110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101000110000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000001000000010000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.230822, lagrangian_loss: 0.000075, attention_score_distillation_loss: 0.000020 loss: 0.017206, lagrangian_loss: 0.000369, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:43:46 Evaluating: accuracy: 0.885, eval_loss: 0.5044, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 83000 lambda_1: -0.2070, lambda_2: 477.7310 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.56 0.52 0.68 0.55 0.53 0.33] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101000110000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000001000000010000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.071927, lagrangian_loss: -0.000022, attention_score_distillation_loss: 0.000020 loss: 0.026830, lagrangian_loss: 0.001024, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:46:41 Evaluating: accuracy: 0.8817, eval_loss: 0.5089, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 83500 lambda_1: 0.0075, lambda_2: 480.3314 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.56 0.52 0.68 0.55 0.53 0.37] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101000110000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000100000000010000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.019109, lagrangian_loss: 0.000081, attention_score_distillation_loss: 0.000020 loss: 0.018641, lagrangian_loss: 0.000064, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:49:35 Evaluating: accuracy: 0.8803, eval_loss: 0.5375, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 84000 lambda_1: -0.1322, lambda_2: 483.6827 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.56 0.52 0.68 0.54 0.53 0.37] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101010010000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000100000000000100000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.209876, lagrangian_loss: 0.000086, attention_score_distillation_loss: 0.000020 loss: 0.057056, lagrangian_loss: 0.000157, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:52:33 Evaluating: accuracy: 0.8845, eval_loss: 0.507, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 84500 lambda_1: -0.1916, lambda_2: 486.3956 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.56 0.52 0.66 0.54 0.53 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101010010000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000100000000000100000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.322628, lagrangian_loss: 0.000392, attention_score_distillation_loss: 0.000020 loss: 0.166063, lagrangian_loss: 0.000245, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:55:26 Evaluating: accuracy: 0.8779, eval_loss: 0.5346, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 85000 lambda_1: -0.2417, lambda_2: 488.8941 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.56 0.52 0.66 0.54 0.53 0.37] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101010010000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000100000000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.091727, lagrangian_loss: -0.000030, attention_score_distillation_loss: 0.000020 ETA: 4:29:17 | Epoch 25 finished. Took 1161.18 seconds. loss: 0.012089, lagrangian_loss: 0.000028, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:58:24 Evaluating: accuracy: 0.8841, eval_loss: 0.5069, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 85500 lambda_1: -0.1114, lambda_2: 491.7215 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.65 0.56 0.52 0.65 0.54 0.53 0.37] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101010010000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000100000000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.030648, lagrangian_loss: 0.000231, attention_score_distillation_loss: 0.000020 loss: 0.012096, lagrangian_loss: 0.000040, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:01:18 Evaluating: accuracy: 0.8828, eval_loss: 0.5292, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 86000 lambda_1: -0.1465, lambda_2: 494.5087 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.52 0.65 0.54 0.53 0.36] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000100000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101010010000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000100000000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.016542, lagrangian_loss: -0.000008, attention_score_distillation_loss: 0.000020 loss: 0.016405, lagrangian_loss: 0.000194, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:04:14 Evaluating: accuracy: 0.8819, eval_loss: 0.5187, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 86500 lambda_1: -0.1348, lambda_2: 497.3129 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.65 0.56 0.52 0.66 0.54 0.53 0.33] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000100000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101010010000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000000000000000100000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.067991, lagrangian_loss: 0.000220, attention_score_distillation_loss: 0.000020 loss: 0.016550, lagrangian_loss: 0.000009, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:07:08 Evaluating: accuracy: 0.8843, eval_loss: 0.5264, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 87000 lambda_1: -0.3611, lambda_2: 500.1249 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.52 0.65 0.54 0.53 0.31] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000010000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101010010000000 00001111110110101011011010001101010101110110100000 00000010110010101001000010000100000000000100000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.016051, lagrangian_loss: -0.000007, attention_score_distillation_loss: 0.000020 loss: 0.027265, lagrangian_loss: -0.000026, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:10:02 Evaluating: accuracy: 0.8797, eval_loss: 0.5297, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 87500 lambda_1: -0.0226, lambda_2: 502.9646 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.65 0.56 0.52 0.64 0.54 0.53 0.31] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000100000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101010010000000 00001111110110101011011010001101010101110110100000 00000010110010101001000010000100000000000010000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.060761, lagrangian_loss: 0.000470, attention_score_distillation_loss: 0.000020 loss: 0.014509, lagrangian_loss: 0.000508, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:12:57 Evaluating: accuracy: 0.8817, eval_loss: 0.5085, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5824, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 88000 lambda_1: -0.2795, lambda_2: 505.8651 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.52 0.65 0.54 0.53 0.33] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.52, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010011000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101000010000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000000000000000100000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.022134, lagrangian_loss: -0.000032, attention_score_distillation_loss: 0.000020 loss: 0.012158, lagrangian_loss: 0.001050, attention_score_distillation_loss: 0.000020 ETA: 4:09:52 | Epoch 26 finished. Took 1131.58 seconds. ---------------------------------------------------------------------- time: 2023-07-19 23:15:52 Evaluating: accuracy: 0.8805, eval_loss: 0.5221, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5824, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 88500 lambda_1: -0.0967, lambda_2: 508.6771 lambda_3: 0.0000 train remain: [1. 1. 
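Each 50-digit row printed under an evaluation is one pruned layer's inferred keep-mask over 50 token-position bins (1 = bin kept), and the `infer remain` entries are simply the row means, which is why they move in steps of 0.02. For instance, the last row of the step-88000 record above has ten 1s, matching its 0.2 entry:

```python
def remain_ratio(mask_row: str) -> float:
    # Fraction of the 50 position bins this layer keeps.
    return mask_row.count("1") / len(mask_row)

row = "00000010110010101011000010000000000000000100000000"
assert remain_ratio(row) == 0.2  # last 'infer remain' entry at step 88000
```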
0.64 0.65 0.56 0.52 0.64 0.53 0.53 0.34] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.52, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000001000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101000010000000 00001111110110101011011010001101010101110110100000 00000011110010101001000010000000010000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.013911, lagrangian_loss: 0.000187, attention_score_distillation_loss: 0.000020 loss: 0.028696, lagrangian_loss: 0.000877, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:18:48 Evaluating: accuracy: 0.8836, eval_loss: 0.5095, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5825, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 89000 lambda_1: -0.0785, lambda_2: 511.6286 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.52 0.64 0.53 0.52 0.32] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001001000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101000010000000 00001111110110101011011010001101010101110110000000 00000010110010101001000010000000010000000100000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.030722, lagrangian_loss: 0.000403, attention_score_distillation_loss: 0.000020 loss: 0.038689, lagrangian_loss: 0.000497, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:21:40 Evaluating: accuracy: 0.8816, eval_loss: 0.5278, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5825, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 89500 lambda_1: -0.1155, lambda_2: 514.1971 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.65 0.56 0.52 0.65 0.53 0.52 0.31] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001001000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101000010000000 00001111110110101011011010001101010101110110000000 00000010110010101001000010000000010100000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.019358, lagrangian_loss: 0.000019, attention_score_distillation_loss: 0.000020 loss: 0.016133, lagrangian_loss: 0.000004, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:24:34 Evaluating: accuracy: 0.8819, eval_loss: 0.5351, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5825, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 90000 lambda_1: -0.1443, lambda_2: 517.0613 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.52 0.65 0.53 0.52 0.31] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000001000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101110110000000 10000010110010101001000010000000010000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.038661, lagrangian_loss: 0.000410, attention_score_distillation_loss: 0.000020 loss: 0.014877, lagrangian_loss: -0.000050, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:27:30 Evaluating: accuracy: 0.8812, eval_loss: 0.5484, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5828, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 90500 lambda_1: -0.1692, lambda_2: 519.9286 lambda_3: 0.0000 train remain: [1. 1. 
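Alongside the task loss, every step logs an attention_score_distillation_loss that has flattened out at 0.000020 by this stage of training. Since the run distills from a fixed teacher, a plausible shape for this term (an assumption about its general form, not the repository's actual implementation) is a regression between student and teacher attention scores at the pruned layers:

```python
import torch.nn.functional as F

def attention_score_distillation(student_attn, teacher_attn, mask=None):
    """Hypothetical sketch: pull student attention maps toward the teacher's.

    student_attn / teacher_attn: [batch, heads, seq, seq] attention scores.
    """
    loss = F.mse_loss(student_attn, teacher_attn, reduction="none")
    if mask is not None:  # ignore padded positions if a mask is supplied
        loss = loss * mask
        return loss.sum() / mask.sum().clamp(min=1)
    return loss.mean()
```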
0.64 0.65 0.56 0.51 0.68 0.53 0.51 0.3 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000001000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101110110000000 10000010110010101001000010000000010000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.060792, lagrangian_loss: 0.000433, attention_score_distillation_loss: 0.000020 loss: 0.017730, lagrangian_loss: 0.002628, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:30:26 Evaluating: accuracy: 0.8836, eval_loss: 0.5471, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5828, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 91000 lambda_1: -0.2392, lambda_2: 522.8953 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.51 0.67 0.53 0.51 0.33] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000100000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101110110000000 10000010110010101001000010000000010000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.010168, lagrangian_loss: 0.000062, attention_score_distillation_loss: 0.000020 loss: 0.026259, lagrangian_loss: 0.000977, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:33:22 Evaluating: accuracy: 0.8832, eval_loss: 0.5449, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5828, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 91500 lambda_1: -0.0913, lambda_2: 525.4966 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.65 0.56 0.51 0.65 0.53 0.51 0.34] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000100000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101110110000000 10000010110010101001000010000000010000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.009967, lagrangian_loss: 0.000270, attention_score_distillation_loss: 0.000020 ETA: 3:50:41 | Epoch 27 finished. Took 1159.53 seconds. loss: 0.041729, lagrangian_loss: 0.000000, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:36:17 Evaluating: accuracy: 0.8796, eval_loss: 0.5416, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5828, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 92000 lambda_1: -0.1914, lambda_2: 528.3075 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.5 0.67 0.53 0.51 0.33] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010011000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101110110000000 10000010110010101001000010000000010000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.034962, lagrangian_loss: 0.000348, attention_score_distillation_loss: 0.000020 loss: 0.009019, lagrangian_loss: 0.000027, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:39:11 Evaluating: accuracy: 0.8863, eval_loss: 0.5116, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5828, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 92500 lambda_1: -0.0542, lambda_2: 531.2956 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.65 0.56 0.5 0.67 0.53 0.51 0.33] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101110110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000100000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 10000010110010101001000010000000010000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 Saving the best model so far: [Epoch 28 | Step: 92500 | MACs sparsity: 0.589 | Score: 0.8863 | Loss: 0.5116] loss: 0.024062, lagrangian_loss: 0.000332, attention_score_distillation_loss: 0.000020 loss: 0.021907, lagrangian_loss: 0.000013, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:42:24 Evaluating: accuracy: 0.8832, eval_loss: 0.5365, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5828, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 93000 lambda_1: -0.0981, lambda_2: 534.1141 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.51 0.67 0.53 0.51 0.31] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000100000000000 10011111111111111111011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 10000010110010101001000010000000010000000000000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.030566, lagrangian_loss: 0.000105, attention_score_distillation_loss: 0.000020 loss: 0.168717, lagrangian_loss: 0.000140, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:45:16 Evaluating: accuracy: 0.8823, eval_loss: 0.5358, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5828, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 93500 lambda_1: -0.0281, lambda_2: 536.7365 lambda_3: 0.0000 train remain: [1. 1. 
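Checkpointing here is best-score-only: the step-92500 evaluation (0.8863) displaces the step-81500 best (0.8850) and triggers a save, while evaluations from before the sparsity warm-up, when accuracy was higher but sparsity near zero, never did. That suggests the save condition is gated on having roughly reached the target sparsity as well as on the score. An illustrative sketch of such logic, with all names hypothetical:

```python
best_score = float("-inf")

def maybe_save_best(score, macs_sparsity, target_sparsity, save_fn, eps=0.01):
    """Save only when sparsity is (about) on target and the score improves."""
    global best_score
    if macs_sparsity >= target_sparsity - eps and score > best_score:
        best_score = score
        save_fn()  # e.g. trainer.save_model(output_dir)
```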
0.64 0.65 0.56 0.51 0.65 0.53 0.51 0.31] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000100000000000 10011111111111111111011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 10000010110010101001000010000000010000000000000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.021299, lagrangian_loss: 0.000387, attention_score_distillation_loss: 0.000020 loss: 0.054373, lagrangian_loss: 0.000078, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:48:14 Evaluating: accuracy: 0.8816, eval_loss: 0.5277, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5828, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 94000 lambda_1: -0.1772, lambda_2: 539.5947 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.51 0.65 0.53 0.51 0.32] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000100000 10011111111111111111011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 10000010110010101001000010000000010000000000000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.025179, lagrangian_loss: 0.000466, attention_score_distillation_loss: 0.000020 loss: 0.025246, lagrangian_loss: 0.000568, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:51:11 Evaluating: accuracy: 0.8836, eval_loss: 0.5171, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5828, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 94500 lambda_1: -0.1541, lambda_2: 542.3299 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.65 0.56 0.51 0.66 0.53 0.51 0.3 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000100000 10011111111111111111011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000010110010101001000010000000010000000100000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.021549, lagrangian_loss: 0.000965, attention_score_distillation_loss: 0.000020 loss: 0.088531, lagrangian_loss: 0.000408, attention_score_distillation_loss: 0.000020 ETA: 3:31:26 | Epoch 28 finished. Took 1148.81 seconds. ---------------------------------------------------------------------- time: 2023-07-19 23:54:07 Evaluating: accuracy: 0.8832, eval_loss: 0.5069, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5829, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 95000 lambda_1: -0.0965, lambda_2: 545.4095 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.51 0.66 0.54 0.51 0.27] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000010110010101001000010000000010000000000000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.028516, lagrangian_loss: 0.001028, attention_score_distillation_loss: 0.000020 loss: 0.071162, lagrangian_loss: 0.000041, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:57:01 Evaluating: accuracy: 0.8814, eval_loss: 0.5242, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5829, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 95500 lambda_1: -0.1236, lambda_2: 548.5181 lambda_3: 0.0000 train remain: [1. 1. 
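The epoch footers make the printed ETA easy to sanity-check: with 40 training epochs in total and epoch 28 just completed (29 done, counting from zero), 11 epochs at roughly 1149 s each come to about 3 h 31 m, matching the logged 3:31:26. A reconstruction under the assumption that the estimate is remaining epochs times the recent epoch duration:

```python
import datetime

def eta_after_epoch(finished_epoch, epoch_seconds, total_epochs=40):
    # Epochs appear to be zero-indexed in this log, so epoch 28 finishing
    # means 29 are done. Assumes future epochs run at the same speed.
    remaining = total_epochs - (finished_epoch + 1)
    return datetime.timedelta(seconds=round(remaining * epoch_seconds))

print(eta_after_epoch(28, 1148.81))  # 3:30:37, close to the logged 3:31:26
```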
0.64 0.65 0.56 0.51 0.66 0.54 0.51 0.24] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101110100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000010110010101001000010000000010000000000000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.026110, lagrangian_loss: 0.000573, attention_score_distillation_loss: 0.000020 loss: 0.018340, lagrangian_loss: 0.000261, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:59:56 Evaluating: accuracy: 0.8838, eval_loss: 0.529, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 96000 lambda_1: -0.0759, lambda_2: 551.2899 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.66 0.56 0.51 0.64 0.55 0.51 0.25] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.58, 0.54, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101110100010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101110010000000 00001111110110111011011010001101010101010110000000 00000000110010101001000010000000010000000100000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.026097, lagrangian_loss: 0.000302, attention_score_distillation_loss: 0.000020 loss: 0.010432, lagrangian_loss: 0.000542, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:02:51 Evaluating: accuracy: 0.8845, eval_loss: 0.5358, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5807, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 96500 lambda_1: -0.1349, lambda_2: 554.2318 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.66 0.56 0.51 0.64 0.54 0.51 0.26] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010101010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010001000000000000100000 10011111111111111111011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101010111000000 00000000110010101001000010000000010000000100000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.025863, lagrangian_loss: -0.000008, attention_score_distillation_loss: 0.000020 loss: 0.024088, lagrangian_loss: 0.000186, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:05:45 Evaluating: accuracy: 0.8814, eval_loss: 0.5225, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 97000 lambda_1: -0.0319, lambda_2: 557.1033 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.51 0.64 0.55 0.52 0.27] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.58, 0.54, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010101010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010001000000000000100000 10011111111111111111011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010100000 00001111110110101011011010001101010101010111000000 00000000110010101001000010000000010000000100000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.034251, lagrangian_loss: 0.000234, attention_score_distillation_loss: 0.000020 loss: 0.019184, lagrangian_loss: 0.000692, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:08:38 Evaluating: accuracy: 0.8832, eval_loss: 0.5355, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 97500 lambda_1: -0.3556, lambda_2: 559.8083 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.66 0.56 0.5 0.63 0.54 0.52 0.25] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011001101100110101010100010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111111011110010000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000000110010101001010010000000010000000000000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.027194, lagrangian_loss: 0.000046, attention_score_distillation_loss: 0.000020 loss: 0.042669, lagrangian_loss: 0.000126, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:11:32 Evaluating: accuracy: 0.8838, eval_loss: 0.5163, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 98000 lambda_1: -0.3674, lambda_2: 562.6284 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.66 0.56 0.51 0.63 0.54 0.52 0.23] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011001101100110101010100010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111111011110010000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000000110010101001000010000000010000000100000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.009455, lagrangian_loss: 0.000018, attention_score_distillation_loss: 0.000020 ETA: 3:12:13 | Epoch 29 finished. Took 1154.38 seconds. loss: 0.013330, lagrangian_loss: -0.000005, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:14:26 Evaluating: accuracy: 0.8856, eval_loss: 0.5263, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 98500 lambda_1: -0.2245, lambda_2: 565.5886 lambda_3: 0.0000 train remain: [1. 1. 
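The gap between `train remain` and `infer remain` in the same record (the seventh entries read 0.63 versus 0.58 here, for example) reflects two views of the same gates: during training each bin's stochastic gate contributes its expected open probability, while evaluation binarizes the gates, which is also why infer remain only takes values in multiples of 1/50. A hedged sketch of the eval-time binarization (the sign threshold and logit parameterization are assumptions in the spirit of hard-concrete L0 gates, not the exact code):

```python
import torch

def infer_masks(token_logits, threshold=0.0):
    """Deterministic eval-time keep-masks from per-bin gate logits.

    token_logits: [num_pruned_layers, 50] tensor of gate logits.
    """
    masks = (token_logits > threshold).float()  # one keep-bit per bin
    infer_remain = masks.mean(dim=1)            # multiples of 1/50
    return masks, infer_remain
```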
0.64 0.66 0.56 0.52 0.63 0.54 0.52 0.23] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010101110 11111111111111111111111110011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111011011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000001110010101001000010000000010000000000000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.036448, lagrangian_loss: 0.000187, attention_score_distillation_loss: 0.000020 loss: 0.014051, lagrangian_loss: 0.000138, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:17:20 Evaluating: accuracy: 0.8821, eval_loss: 0.519, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 99000 lambda_1: -0.0611, lambda_2: 568.2416 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.66 0.56 0.51 0.63 0.54 0.52 0.23] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100110001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010001000000000000001000 10011111111111111011011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000000110010101001000010000000010000000100000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.019944, lagrangian_loss: 0.000065, attention_score_distillation_loss: 0.000020 loss: 0.009355, lagrangian_loss: 0.000059, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:20:18 Evaluating: accuracy: 0.8839, eval_loss: 0.5029, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 99500 lambda_1: -0.0753, lambda_2: 571.0054 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.66 0.56 0.51 0.62 0.54 0.52 0.24] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101110110101010100010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111011011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000000110010101001000010000000010000000100000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.025599, lagrangian_loss: 0.000201, attention_score_distillation_loss: 0.000020 loss: 0.018277, lagrangian_loss: 0.000745, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:23:14 Evaluating: accuracy: 0.8827, eval_loss: 0.5216, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 100000 lambda_1: -0.2823, lambda_2: 573.6256 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.66 0.56 0.51 0.62 0.53 0.52 0.24] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010101110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111011011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000000110010101001000010000000010000000100000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.064611, lagrangian_loss: 0.000288, attention_score_distillation_loss: 0.000020 loss: 0.041355, lagrangian_loss: -0.000049, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:26:08 Evaluating: accuracy: 0.8821, eval_loss: 0.5243, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 100500 lambda_1: -0.3691, lambda_2: 576.3730 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.66 0.56 0.5 0.63 0.53 0.52 0.23] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010001010000000000000000 10011111111111111011011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000000110010101001000010000000010000000100000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.015014, lagrangian_loss: 0.000531, attention_score_distillation_loss: 0.000020 loss: 0.006308, lagrangian_loss: 0.000607, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:29:01 Evaluating: accuracy: 0.8861, eval_loss: 0.4942, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 101000 lambda_1: -0.0898, lambda_2: 579.3690 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.66 0.56 0.5 0.63 0.53 0.52 0.24] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010001010000000000000000 10011111111111111111011110010000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101110110000000 00000000110010101001000010000100010000000000000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.016041, lagrangian_loss: -0.000003, attention_score_distillation_loss: 0.000020 loss: 0.015913, lagrangian_loss: 0.000060, attention_score_distillation_loss: 0.000020 ETA: 2:52:52 | Epoch 30 finished. Took 1127.96 seconds. ---------------------------------------------------------------------- time: 2023-07-20 00:31:55 Evaluating: accuracy: 0.8876, eval_loss: 0.504, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 101500 lambda_1: -0.1130, lambda_2: 582.3550 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.66 0.56 0.5 0.63 0.53 0.52 0.24] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010001010000000000000000 10011111111111111111011110010000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000000110010101001000010000100010000000000000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 Saving the best model so far: [Epoch 31 | Step: 101500 | MACs sparsity: 0.5858 | Score: 0.8876 | Loss: 0.504] loss: 0.029129, lagrangian_loss: 0.000000, attention_score_distillation_loss: 0.000020 loss: 0.010372, lagrangian_loss: 0.000013, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:35:13 Evaluating: accuracy: 0.8832, eval_loss: 0.5193, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 102000 lambda_1: -0.0899, lambda_2: 585.4129 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.66 0.56 0.5 0.62 0.53 0.52 0.24] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010001010000000000000000 10011111111111111011011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000000110010101001000010000100010000000000000000 Best eval score so far: 0.8876 @ step 101500 epoch 31.00 loss: 0.016138, lagrangian_loss: 0.000039, attention_score_distillation_loss: 0.000020 loss: 0.018942, lagrangian_loss: 0.000091, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:38:09 Evaluating: accuracy: 0.8785, eval_loss: 0.5197, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 102500 lambda_1: -0.0504, lambda_2: 588.0081 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.67 0.56 0.5 0.62 0.53 0.51 0.23] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011001101100110101010100010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010001010000000000000000 10011111111111111111011110010000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000000110010101001000010000100010000000000000000 Best eval score so far: 0.8876 @ step 101500 epoch 31.00 loss: 0.017721, lagrangian_loss: 0.000006, attention_score_distillation_loss: 0.000020 loss: 0.104862, lagrangian_loss: -0.000057, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:41:01 Evaluating: accuracy: 0.8823, eval_loss: 0.5236, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 103000 lambda_1: -0.2173, lambda_2: 591.1951 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.67 0.56 0.49 0.62 0.53 0.51 0.22] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010110010001110 11111111111111111111101110011111001001010010000000 11111111111111111111111110010001000000010000000000 10011111111111111011011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101010110000000 00000000110010101001000010000000010000010000000000 Best eval score so far: 0.8876 @ step 101500 epoch 31.00 loss: 0.030129, lagrangian_loss: 0.000093, attention_score_distillation_loss: 0.000020 loss: 0.023923, lagrangian_loss: 0.001481, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:43:57 Evaluating: accuracy: 0.8816, eval_loss: 0.5137, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 103500 lambda_1: -0.2420, lambda_2: 593.8513 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.67 0.56 0.49 0.61 0.53 0.51 0.22] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010110010001110 11111111111111111111101110011111001001010010000000 11111111111111111111111110010001000000000010000000 10011111111111111111011110010000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101010110000000 00000000110010101001000010000100010000000000000000 Best eval score so far: 0.8876 @ step 101500 epoch 31.00 loss: 0.027373, lagrangian_loss: 0.000385, attention_score_distillation_loss: 0.000020 loss: 0.022553, lagrangian_loss: 0.000678, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:46:53 Evaluating: accuracy: 0.8814, eval_loss: 0.5183, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 104000 lambda_1: -0.1379, lambda_2: 596.6783 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.67 0.56 0.49 0.61 0.53 0.52 0.22] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011100101100110101010100010001110 11111111111111111111101110011111001001010010000000 11111111111111111111111110010001000000100000000000 10011111111111111111011110010000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101010110000000 00000000110010101001000010000100010000000000000000 Best eval score so far: 0.8876 @ step 101500 epoch 31.00 loss: 0.019303, lagrangian_loss: -0.000008, attention_score_distillation_loss: 0.000020 loss: 0.010552, lagrangian_loss: 0.000552, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:49:46 Evaluating: accuracy: 0.8814, eval_loss: 0.5319, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 104500 lambda_1: -0.2552, lambda_2: 599.6215 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.67 0.56 0.49 0.63 0.53 0.52 0.22] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100110001110 11111111111111111111101110011111001001010010000000 11111111111111111111111110010001000000000010000000 10011111111111111111011110010000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101010110000000 00000000110010101001000010000100010000000000000000 Best eval score so far: 0.8876 @ step 101500 epoch 31.00 loss: 0.011960, lagrangian_loss: 0.000014, attention_score_distillation_loss: 0.000020 loss: 0.019688, lagrangian_loss: 0.000106, attention_score_distillation_loss: 0.000020 ETA: 2:33:46 | Epoch 31 finished. Took 1178.22 seconds. ---------------------------------------------------------------------- time: 2023-07-20 00:52:37 Evaluating: accuracy: 0.886, eval_loss: 0.5226, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 105000 lambda_1: -0.1972, lambda_2: 602.2722 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.67 0.56 0.49 0.63 0.53 0.53 0.22] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111111101110111111001001010000000000 11111111111111111111111110010001000000001000000000 10011111111111111111011110010000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101010110000000 00000000110010101001000010000100010000000000000000 Best eval score so far: 0.8876 @ step 101500 epoch 31.00 loss: 0.018185, lagrangian_loss: 0.000124, attention_score_distillation_loss: 0.000020 loss: 0.038352, lagrangian_loss: 0.000876, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:55:33 Evaluating: accuracy: 0.8819, eval_loss: 0.5228, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 105500 lambda_1: -0.0381, lambda_2: 604.9241 lambda_3: 0.0000 train remain: [1. 1. 
----------------------------------------------------------------------
time: 2023-07-20 00:55:33
Evaluating: accuracy: 0.8819, eval_loss: 0.5228, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 105500
lambda_1: -0.0381, lambda_2: 604.9241
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.49 0.63 0.53 0.54 0.23]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010011110
11111111111111111111101110011111001001010010000000
11111111111111111111111110010001000000000010000000
10011111111111111111011110010000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00001111110110101011011010001101010101010110000000
00000000110010101001000010000000010000010000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.022027, lagrangian_loss: -0.000001, attention_score_distillation_loss: 0.000020
loss: 0.017946, lagrangian_loss: 0.000056, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 00:58:28
Evaluating: accuracy: 0.8799, eval_loss: 0.519, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 106000
lambda_1: -0.0027, lambda_2: 607.9390
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.61 0.53 0.53 0.23]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010011110
11111111111111111111101110011111001001010010000000
11111111111111111111111110110001000000000000000000
10011111111111111111011110010000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00001111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.020228, lagrangian_loss: 0.000002, attention_score_distillation_loss: 0.000020
loss: 0.014955, lagrangian_loss: 0.000203, attention_score_distillation_loss: 0.000020
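The layerwise remain vector behaves like a running product of the per-layer infer remain ratios: a token that reaches layer k must have survived every earlier pruned layer, so the expected sequence length decays multiplicatively, with the leading 1.0 entries covering the unpruned front of the encoder. A short check against the numbers logged above (a reading of the log, not code from the repo):

    from itertools import accumulate
    from operator import mul

    infer_remain = [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]

    # Two leading 1.0 entries for the layers in front of the pruned range,
    # then the cumulative product of the per-layer keep ratios.
    layerwise = [1.0, 1.0] + list(accumulate(infer_remain, mul))
    print([round(x, 2) for x in layerwise])
    # [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]

which reproduces the logged layerwise remain to two decimals.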
----------------------------------------------------------------------
time: 2023-07-20 01:01:25
Evaluating: accuracy: 0.8832, eval_loss: 0.5164, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 106500
lambda_1: -0.0252, lambda_2: 610.6922
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.6 0.53 0.54 0.24]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101110100010001110
11111111111111111111101110011111001001010010000000
11111111111111111111111110110001000000000000000000
10011111111111111111011110010000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00001111110110101011011010001101010101010110000000
00000000110010101001000010001000010000000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.043442, lagrangian_loss: 0.000086, attention_score_distillation_loss: 0.000020
loss: 0.007651, lagrangian_loss: 0.001064, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:04:19
Evaluating: accuracy: 0.8854, eval_loss: 0.513, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 107000
lambda_1: -0.1249, lambda_2: 613.4114
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.6 0.53 0.55 0.25]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101110100010001110
11111111111111111111101110011111001001010000000100
11111111111111111111111110110001000000000000000000
10011111111111111111011110010000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00001111110110101011011010001101010101010110000000
00000000110010101001000010001000010000000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.013384, lagrangian_loss: -0.000004, attention_score_distillation_loss: 0.000020
loss: 0.023731, lagrangian_loss: 0.000007, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:07:11
Evaluating: accuracy: 0.8845, eval_loss: 0.5166, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 107500
lambda_1: -0.1067, lambda_2: 615.9699
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.61 0.53 0.54 0.26]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101110100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111011011110110000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00001111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.036126, lagrangian_loss: 0.000340, attention_score_distillation_loss: 0.000020
loss: 0.017057, lagrangian_loss: 0.000863, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:10:07
Evaluating: accuracy: 0.8828, eval_loss: 0.5206, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 108000
lambda_1: -0.1473, lambda_2: 618.8947
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.61 0.53 0.53 0.26]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101110100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111011011110110000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00001111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.029450, lagrangian_loss: 0.000080, attention_score_distillation_loss: 0.000020
ETA: 2:14:33 | Epoch 32 finished. Took 1156.63 seconds.
loss: 0.035288, lagrangian_loss: 0.000073, attention_score_distillation_loss: 0.000020
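The ETA lines are consistent with remaining-epochs-times-epoch-time arithmetic: 2:14:33 is 8073 s, and the ratio against the 1156.63 s epoch time points to seven epochs still to run after epoch 32 (i.e. a 40-epoch schedule). A sketch of that estimate (hypothetical helper; the trainer presumably smooths the per-epoch time, hence the small discrepancy):

    import datetime

    def eta_after_epoch(epoch_just_finished: int,
                        epoch_seconds: float,
                        num_train_epochs: int = 40) -> str:
        # Epochs are 0-indexed in this log, so after "Epoch 32 finished"
        # epochs 33..39 are still to run.
        remaining = num_train_epochs - (epoch_just_finished + 1)
        return str(datetime.timedelta(seconds=int(remaining * epoch_seconds)))

    print(eta_after_epoch(32, 1156.63))  # '2:14:56', close to the logged 2:14:33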
----------------------------------------------------------------------
time: 2023-07-20 01:13:02
Evaluating: accuracy: 0.8812, eval_loss: 0.5171, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 108500
lambda_1: -0.2455, lambda_2: 621.8017
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.6 0.53 0.52 0.25]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101110100010001110
11111111111111111111101110011111101001010000000000
11111111111111111111111110010001000000000000000100
10011111111111111011011110110000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00001111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.090272, lagrangian_loss: 0.000302, attention_score_distillation_loss: 0.000020
loss: 0.312071, lagrangian_loss: -0.000002, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:15:57
Evaluating: accuracy: 0.8816, eval_loss: 0.5111, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 109000
lambda_1: -0.0687, lambda_2: 624.6755
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.6 0.53 0.53 0.28]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010110010001110
11111111111111111111101110011111001001010000000010
11111111111111111111111110010001000000000000000100
10011111111111111011011110110000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00001111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.016343, lagrangian_loss: 0.001062, attention_score_distillation_loss: 0.000020
loss: 0.022479, lagrangian_loss: 0.000448, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:18:51
Evaluating: accuracy: 0.8832, eval_loss: 0.5128, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 109500
lambda_1: -0.0097, lambda_2: 627.6951
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.61 0.54 0.54 0.28]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010110010001110
11111111111111111111101110111111001001010000000000
11111111111111111111111110010001000000000000000100
10011111111111111011011110110000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00001111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.013309, lagrangian_loss: 0.000062, attention_score_distillation_loss: 0.000020
loss: 0.024009, lagrangian_loss: 0.000019, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:21:46
Evaluating: accuracy: 0.8838, eval_loss: 0.487, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 110000
lambda_1: -0.1083, lambda_2: 630.5648
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.62 0.54 0.53 0.32]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111001001010000000010
11111111111111111111111110010001001000000000000000
10011111111111111011011110110000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.019397, lagrangian_loss: 0.000004, attention_score_distillation_loss: 0.000020
loss: 0.012824, lagrangian_loss: 0.000403, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:24:40
Evaluating: accuracy: 0.8839, eval_loss: 0.4972, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 110500
lambda_1: 0.0451, lambda_2: 633.6393
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.62 0.54 0.54 0.33]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111001001010000000010
11111111111111111111111110010001001000000000000000
10011111111111111011011110110000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.033758, lagrangian_loss: 0.003981, attention_score_distillation_loss: 0.000020
loss: 0.185234, lagrangian_loss: 0.000036, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:27:34
Evaluating: accuracy: 0.8808, eval_loss: 0.5088, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 111000
lambda_1: -0.1296, lambda_2: 636.4888
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.62 0.53 0.54 0.33]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111001001010000000010
11111111111111111111111110110001000000000000000000
10011111111111111011011110110000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.017665, lagrangian_loss: 0.000045, attention_score_distillation_loss: 0.000020
loss: 0.056198, lagrangian_loss: 0.002160, attention_score_distillation_loss: 0.000020
ETA: 1:55:15 | Epoch 33 finished. Took 1125.81 seconds.
----------------------------------------------------------------------
time: 2023-07-20 01:30:28
Evaluating: accuracy: 0.8885, eval_loss: 0.484, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 111500
lambda_1: -0.0298, lambda_2: 639.6103
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.61 0.54 0.51 0.35]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101111011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111111011110010000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
Saving the best model so far: [Epoch 34 | Step: 111500 | MACs sparsity: 0.5858 | Score: 0.8885 | Loss: 0.484]
loss: 0.009672, lagrangian_loss: 0.000016, attention_score_distillation_loss: 0.000020
loss: 0.014149, lagrangian_loss: 0.000600, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:33:44
Evaluating: accuracy: 0.8852, eval_loss: 0.4942, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 112000
lambda_1: -0.2453, lambda_2: 642.4060
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.48 0.6 0.54 0.51 0.36]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111001001011000000000
11111111111111111111111110110001000000000000000000
10011111111111111111011110010000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
loss: 0.016415, lagrangian_loss: 0.001111, attention_score_distillation_loss: 0.000020
loss: 0.014697, lagrangian_loss: 0.000108, attention_score_distillation_loss: 0.000020
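The "Saving the best model so far" line fires only on evaluations that beat the previous best score: 0.8885 at step 111500 displaces the 0.8876 incumbent from step 101500, and every later block's "Best eval score so far" line reports the new incumbent. A minimal sketch of that bookkeeping (hypothetical names; the real trainer also writes the checkpoint weights to disk):

    best_score = 0.8876  # incumbent from step 101500

    def maybe_save_best(score, step, epoch, macs_sparsity, loss):
        global best_score
        # Persist a checkpoint only when the eval score improves.
        if score > best_score:
            best_score = score
            print(f"Saving the best model so far: [Epoch {int(epoch)} | "
                  f"Step: {step} | MACs sparsity: {macs_sparsity} | "
                  f"Score: {score} | Loss: {loss}]")

    maybe_save_best(0.8885, 111500, 34.06, 0.5858, 0.484)

which reproduces the save line logged at step 111500.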
----------------------------------------------------------------------
time: 2023-07-20 01:36:36
Evaluating: accuracy: 0.8847, eval_loss: 0.5105, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 112500
lambda_1: -0.1336, lambda_2: 645.3339
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.48 0.61 0.54 0.51 0.38]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111001001010000100000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010000010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00000111110110101011011010011101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
loss: 0.012433, lagrangian_loss: 0.000879, attention_score_distillation_loss: 0.000020
loss: 0.015961, lagrangian_loss: 0.000317, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:39:32
Evaluating: accuracy: 0.8854, eval_loss: 0.5173, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 113000
lambda_1: -0.0414, lambda_2: 648.1123
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.48 0.61 0.54 0.51 0.36]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111001001010000100000
11111111111111111111111110010001010000000000000000
10011111111111111011011110010100010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00000111110110101011011010001101010101010111000000
00000000110010101001000010000000010001000100000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
loss: 0.023253, lagrangian_loss: 0.000001, attention_score_distillation_loss: 0.000020
loss: 0.013246, lagrangian_loss: 0.000484, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:42:28
Evaluating: accuracy: 0.886, eval_loss: 0.494, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 113500
lambda_1: -0.0298, lambda_2: 650.7918
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.48 0.61 0.54 0.51 0.36]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111001001010000100000
11111111111111111111111110010101000000000000000000
10011111111111111011011110010000010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000100000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
loss: 0.012123, lagrangian_loss: 0.000108, attention_score_distillation_loss: 0.000020
loss: 0.027925, lagrangian_loss: 0.000008, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:45:23
Evaluating: accuracy: 0.888, eval_loss: 0.4794, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 114000
lambda_1: -0.0279, lambda_2: 653.8177
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.49 0.61 0.54 0.51 0.37]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011010101100110101010100010001110
11111111111111111111101110011111001001010000100000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010000010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000100000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
loss: 0.017866, lagrangian_loss: 0.000185, attention_score_distillation_loss: 0.000020
loss: 0.011997, lagrangian_loss: 0.000330, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:48:18
Evaluating: accuracy: 0.8863, eval_loss: 0.5032, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 114500
lambda_1: -0.2001, lambda_2: 656.7768
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.49 0.61 0.54 0.51 0.35]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011010101100110101010100010001110
11111111111111111111101110011111001001010000100000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010000010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000100000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
loss: 0.051674, lagrangian_loss: 0.000606, attention_score_distillation_loss: 0.000020
ETA: 1:36:06 | Epoch 34 finished. Took 1179.12 seconds.
loss: 0.025404, lagrangian_loss: 0.000522, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:51:12
Evaluating: accuracy: 0.8856, eval_loss: 0.5026, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 115000
lambda_1: -0.0083, lambda_2: 659.4255
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.49 0.6 0.54 0.51 0.36]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.04, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111001001010000100000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010000010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000100000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
loss: 0.006642, lagrangian_loss: 0.001465, attention_score_distillation_loss: 0.000020
loss: 0.034278, lagrangian_loss: 0.001431, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:54:08
Evaluating: accuracy: 0.8882, eval_loss: 0.497, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 115500
lambda_1: -0.0315, lambda_2: 661.9892
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.49 0.62 0.54 0.51 0.36]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111001001010000100000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010000010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000100000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
loss: 0.023288, lagrangian_loss: 0.000097, attention_score_distillation_loss: 0.000020
loss: 0.124069, lagrangian_loss: 0.001244, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:57:00
Evaluating: accuracy: 0.8869, eval_loss: 0.5081, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 116000
lambda_1: -0.0517, lambda_2: 664.7474
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.49 0.62 0.54 0.51 0.39]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110010001000000000001000000
10011111111111111011011110010000010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000001110010101001000010000000010001000000000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
loss: 0.021439, lagrangian_loss: 0.000098, attention_score_distillation_loss: 0.000020
loss: 0.005964, lagrangian_loss: 0.000119, attention_score_distillation_loss: 0.000020
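Each evaluation dumps ten rows of 50 bits: one row per pruned layer, one bit per token bin, with 1 marking a kept bin. The per-layer "infer remain" values track the fraction of 1s in the corresponding row (the third row above keeps 32 of 50 bins, i.e. 0.64; the last keeps 10, i.e. 0.2). Collapsing learned stochastic gates into such a deterministic 0/1 mask is usually done with the test-time rule of hard-concrete L0 regularisation (Louizos et al.), which the run's temperature and droprate settings point to; the sketch below uses that standard parameterisation with hypothetical names, not code lifted from this repo:

    import torch

    GAMMA, ZETA = -0.1, 1.1  # standard hard-concrete stretch interval

    def deterministic_mask(token_loga: torch.Tensor) -> torch.Tensor:
        # token_loga: [num_pruned_layers, bin_num] gate location parameters.
        # Test-time gate: stretched sigmoid mean, clamped to [0, 1], binarised.
        s = torch.sigmoid(token_loga) * (ZETA - GAMMA) + GAMMA
        z = s.clamp(0.0, 1.0)
        return (z > 0.5).int()

    mask = deterministic_mask(torch.randn(10, 50))
    print(mask[4])                   # one 50-bin row like those above
    print(mask.float().mean(dim=1))  # per-layer keep ratio ("infer remain")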
----------------------------------------------------------------------
time: 2023-07-20 01:59:55
Evaluating: accuracy: 0.8889, eval_loss: 0.5052, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 116500
lambda_1: -0.0838, lambda_2: 667.7610
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.62 0.54 0.51 0.39]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.58, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111101001010000000000
11111111111111111111111110010001000100000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000001110010101001000010000000010001000000000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
Saving the best model so far: [Epoch 35 | Step: 116500 | MACs sparsity: 0.5825 | Score: 0.8889 | Loss: 0.5052]
loss: 0.018273, lagrangian_loss: 0.000901, attention_score_distillation_loss: 0.000020
loss: 0.034840, lagrangian_loss: 0.000091, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:03:16
Evaluating: accuracy: 0.8856, eval_loss: 0.5155, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 117000
lambda_1: -0.1159, lambda_2: 670.6220
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.62 0.55 0.51 0.33]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.58, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111101001010000000000
11111111111111111111111110010001000100000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000001110010101001000010000000010001000000000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.052002, lagrangian_loss: 0.000486, attention_score_distillation_loss: 0.000020
loss: 0.275865, lagrangian_loss: 0.000262, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:06:11
Evaluating: accuracy: 0.8849, eval_loss: 0.5017, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 117500
lambda_1: -0.0307, lambda_2: 673.2815
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.62 0.55 0.51 0.33]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.58, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101110100010001110
11111111111111111111101110011111101001010000000000
11111111111111111111111110010001000100000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000100000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.010287, lagrangian_loss: 0.001855, attention_score_distillation_loss: 0.000020
loss: 0.020126, lagrangian_loss: 0.002051, attention_score_distillation_loss: 0.000020
ETA: 1:16:53 | Epoch 35 finished. Took 1151.57 seconds.
----------------------------------------------------------------------
time: 2023-07-20 02:09:04
Evaluating: accuracy: 0.8861, eval_loss: 0.5174, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 118000
lambda_1: -0.0687, lambda_2: 676.1065
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.61 0.55 0.52 0.33]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.58, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101110100010001110
11111111111111111111101110011111101001010000000000
11111111111111111111111110010001010000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000100000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.018968, lagrangian_loss: 0.000013, attention_score_distillation_loss: 0.000020
loss: 0.009348, lagrangian_loss: 0.000524, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:11:57
Evaluating: accuracy: 0.8839, eval_loss: 0.5171, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 118500
lambda_1: -0.0222, lambda_2: 678.8206
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.61 0.56 0.51 0.33]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111101110011111011001010000000000
11111111111111111111111110010001100000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000100000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.094354, lagrangian_loss: 0.000007, attention_score_distillation_loss: 0.000020
loss: 0.072993, lagrangian_loss: 0.001410, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:14:51
Evaluating: accuracy: 0.8843, eval_loss: 0.5155, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5809, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 119000
lambda_1: -0.0245, lambda_2: 681.9456
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.61 0.55 0.51 0.31]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111011000101100110101010100010001110
11111111111111111111101110011111101001010000000000
11111111111111111111111110010001100000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010001000010001000000000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.017492, lagrangian_loss: 0.000471, attention_score_distillation_loss: 0.000020
loss: 0.010002, lagrangian_loss: 0.000056, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:17:45
Evaluating: accuracy: 0.8863, eval_loss: 0.5078, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5809, expected_sequence_sparsity: 0.8359, target_sparsity: 0.58, step: 119500
lambda_1: -0.2100, lambda_2: 684.5416
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.61 0.56 0.51 0.27]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110111010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110010011000000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010001000010000000000000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.007061, lagrangian_loss: 0.000066, attention_score_distillation_loss: 0.000020
loss: 0.013536, lagrangian_loss: 0.000733, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:20:38
Evaluating: accuracy: 0.8839, eval_loss: 0.5118, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5809, expected_sequence_sparsity: 0.8359, target_sparsity: 0.58, step: 120000
lambda_1: -0.1203, lambda_2: 687.2068
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.61 0.56 0.5 0.26]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110010011000000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010001000010000000000000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.021576, lagrangian_loss: -0.000003, attention_score_distillation_loss: 0.000020
loss: 0.012608, lagrangian_loss: 0.003057, attention_score_distillation_loss: 0.000020
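The ten booleans in token_prune_loc line up one-to-one with the ten prunable layers (layers 2-11 of the 12-layer encoder, per the run's prune-location setting), and an entry reads True once that layer's deterministic mask actually drops bins; the leading False, False matches the two all-1 rows at the top of each mask dump and the infer remain entries pinned at 1.0. A small illustration of that reading (the exact flip condition is an assumption):

    prune_location = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
    infer_remain = [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.18]

    # A layer counts as a pruning location once it keeps < 100% of its bins.
    token_prune_loc = [r < 1.0 for r in infer_remain]
    print(dict(zip(prune_location, token_prune_loc)))
    # {2: False, 3: False, 4: True, 5: True, 6: True, 7: True,
    #  8: True, 9: True, 10: True, 11: True}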
----------------------------------------------------------------------
time: 2023-07-20 02:23:34
Evaluating: accuracy: 0.883, eval_loss: 0.5232, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 120500
lambda_1: -0.0161, lambda_2: 690.1986
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.51 0.61 0.56 0.5 0.3 ]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110010001000000000001000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000001110010101001000010000000010000000000000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.012848, lagrangian_loss: 0.000075, attention_score_distillation_loss: 0.000020
loss: 0.028520, lagrangian_loss: 0.000243, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:26:31
Evaluating: accuracy: 0.8858, eval_loss: 0.5248, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 121000
lambda_1: -0.1056, lambda_2: 693.0524
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.61 0.57 0.5 0.29]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110010001000100000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000001110010101001000010000000010000000000000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.013956, lagrangian_loss: 0.001404, attention_score_distillation_loss: 0.000020
ETA: 0:57:40 | Epoch 36 finished. Took 1154.19 seconds.
loss: 0.010747, lagrangian_loss: 0.000768, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:29:25
Evaluating: accuracy: 0.8836, eval_loss: 0.5295, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 121500
lambda_1: -0.1803, lambda_2: 695.7605
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.61 0.57 0.51 0.29]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111101111011111001001010000000000
11111111111111111111111110010001000100000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000001110010101001000010000000010000000000000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.009487, lagrangian_loss: 0.000044, attention_score_distillation_loss: 0.000020
loss: 0.025625, lagrangian_loss: 0.001445, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:32:19
Evaluating: accuracy: 0.8849, eval_loss: 0.5215, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5844, expected_sequence_sparsity: 0.8373, target_sparsity: 0.58, step: 122000
lambda_1: 0.0195, lambda_2: 698.6012
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.61 0.57 0.5 0.32]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.58, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110010001000100000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000001110010101001000010000100010000000000000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.024159, lagrangian_loss: 0.000060, attention_score_distillation_loss: 0.000020
loss: 0.180402, lagrangian_loss: 0.000060, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:35:13
Evaluating: accuracy: 0.8896, eval_loss: 0.5091, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 122500
lambda_1: -0.1475, lambda_2: 701.2036
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.66 0.56 0.5 0.62 0.57 0.5 0.3 ]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111101111011111001001010000000000
11111111111111111111111110010001000000000000001000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000001110010101001000010000000010000000000000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
Saving the best model so far: [Epoch 37 | Step: 122500 | MACs sparsity: 0.5907 | Score: 0.8896 | Loss: 0.5091]
loss: 0.027328, lagrangian_loss: 0.000009, attention_score_distillation_loss: 0.000020
loss: 0.015560, lagrangian_loss: 0.001482, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:38:28
Evaluating: accuracy: 0.8871, eval_loss: 0.5127, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 123000
lambda_1: -0.0213, lambda_2: 703.9400
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.66 0.56 0.5 0.63 0.57 0.5 0.3 ]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111101110011111011001010000000000
11111111111111111111111110010001010000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000001110010101001000010000000010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.008344, lagrangian_loss: 0.000499, attention_score_distillation_loss: 0.000020
loss: 0.019039, lagrangian_loss: 0.000813, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:41:22
Evaluating: accuracy: 0.8876, eval_loss: 0.5088, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 123500
lambda_1: -0.1699, lambda_2: 706.9811
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.66 0.56 0.5 0.62 0.57 0.5 0.29]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111101110011111011001010000000000
11111111111111111111111110010001000000000000010000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000001110010101001000010000000010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.012543, lagrangian_loss: 0.000236, attention_score_distillation_loss: 0.000020
loss: 0.018498, lagrangian_loss: 0.000248, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:44:16
Evaluating: accuracy: 0.888, eval_loss: 0.5014, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 124000
lambda_1: -0.0337, lambda_2: 709.7526
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.66 0.57 0.51 0.62 0.57 0.51 0.29]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111101111011111001001010000000000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000001110010101001000010000000010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.007905, lagrangian_loss: 0.000554, attention_score_distillation_loss: 0.000020
loss: 0.022859, lagrangian_loss: 0.000130, attention_score_distillation_loss: 0.000020
ETA: 0:38:26 | Epoch 37 finished. Took 1145.31 seconds.
----------------------------------------------------------------------
time: 2023-07-20 02:47:10
Evaluating: accuracy: 0.8856, eval_loss: 0.5157, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 124500
lambda_1: -0.0528, lambda_2: 712.5897
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.66 0.57 0.51 0.62 0.58 0.51 0.29]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111101110011111011001010000000000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000001110010101001000010000000010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.018599, lagrangian_loss: 0.000060, attention_score_distillation_loss: 0.000020
loss: 0.010490, lagrangian_loss: 0.000759, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:50:06
Evaluating: accuracy: 0.8856, eval_loss: 0.5223, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 125000
lambda_1: 0.0169, lambda_2: 715.5607
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.66 0.57 0.51 0.62 0.58 0.51 0.33]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111101110011111011001010000000000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010001000010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.016980, lagrangian_loss: 0.000000, attention_score_distillation_loss: 0.000020
loss: 0.011128, lagrangian_loss: 0.000029, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:52:59
Evaluating: accuracy: 0.8845, eval_loss: 0.5246, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 125500
lambda_1: -0.0808, lambda_2: 718.1848
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.66 0.57 0.51 0.61 0.59 0.51 0.31]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111101110011111011001010000000000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010001000010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.004824, lagrangian_loss: 0.000259, attention_score_distillation_loss: 0.000020
loss: 0.011993, lagrangian_loss: 0.000695, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:55:52
Evaluating: accuracy: 0.8882, eval_loss: 0.5184, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 126000
lambda_1: -0.1046, lambda_2: 721.0481
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.66 0.57 0.51 0.63 0.59 0.51 0.28]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001010010000000010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.030306, lagrangian_loss: 0.000542, attention_score_distillation_loss: 0.000020
loss: 0.004088, lagrangian_loss: 0.000430, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:58:47
Evaluating: accuracy: 0.888, eval_loss: 0.5171, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 126500
lambda_1: -0.1985, lambda_2: 723.7037
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.66 0.56 0.51 0.63 0.59 0.5 0.27]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010000000010001000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.018511, lagrangian_loss: 0.000958, attention_score_distillation_loss: 0.000020
loss: 0.013215, lagrangian_loss: 0.000503, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 03:01:45
Evaluating: accuracy: 0.8883, eval_loss: 0.5052, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 127000
lambda_1: -0.1221, lambda_2: 726.5182
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.51 0.63 0.58 0.5 0.27]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110010001001000000000000000
10011111111111111011011110110100010000000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010000000010001000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.056806, lagrangian_loss: 0.000014, attention_score_distillation_loss: 0.000020
loss: 0.054186, lagrangian_loss: 0.000159, attention_score_distillation_loss: 0.000020
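Each 50-character row printed with an evaluation is the binarized token mask for one prune location (rows ordered by layer, columns being the 50 sequence bins); a 1 keeps tokens falling in that bin. The per-layer infer remain entries are simply the fraction of ones in the corresponding row, e.g. for the first pruned layer above:

```python
# Mask row for the first pruned layer, copied from the log above.
mask = "11111111111110111011000101100110101010100010001110"
print(mask.count("1") / len(mask))  # 0.62, matching that layer's infer remain
```

The train remain values differ slightly, presumably because they average the soft gate values during training rather than the thresholded 0/1 masks.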
----------------------------------------------------------------------
time: 2023-07-20 03:04:40
Evaluating: accuracy: 0.888, eval_loss: 0.5053, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5856, expected_sequence_sparsity: 0.8377, target_sparsity: 0.58, step: 127500
lambda_1: -0.0592, lambda_2: 729.4135
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.51 0.63 0.58 0.5 0.27]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.48, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111011011110110000010000000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010000000010001000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.015133, lagrangian_loss: 0.000475, attention_score_distillation_loss: 0.000020
ETA: 0:19:13 | Epoch 38 finished. Took 1157.29 seconds.
loss: 0.032904, lagrangian_loss: 0.000816, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 03:07:34
Evaluating: accuracy: 0.8883, eval_loss: 0.5059, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 128000
lambda_1: 0.0053, lambda_2: 732.3585
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.57 0.51 0.63 0.58 0.5 0.27]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010000000010001000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.028195, lagrangian_loss: 0.002901, attention_score_distillation_loss: 0.000020
loss: 0.016411, lagrangian_loss: 0.000012, attention_score_distillation_loss: 0.000020
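Likewise, the token_prune_loc flags in each Evaluating line can be read off the mask rows: a location is reported True once its row contains any zero, which is why the first two entries stay False (those rows are still all ones). A quick check on the first three rows above:

```python
# First three mask rows from the evaluation above; the remaining seven
# rows all contain zeros as well.
rows = [
    "11111111111111111111111111111111111111111111111111",
    "11111111111111111111111111111111111111111111111111",
    "11111111111110111011000101100110101010100010001110",
]
print([("0" in row) for row in rows])  # [False, False, True] -> token_prune_loc prefix
```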
----------------------------------------------------------------------
time: 2023-07-20 03:10:27
Evaluating: accuracy: 0.8869, eval_loss: 0.5113, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5856, expected_sequence_sparsity: 0.8377, target_sparsity: 0.58, step: 128500
lambda_1: 0.0081, lambda_2: 735.3931
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.57 0.51 0.62 0.58 0.5 0.27]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.48, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111011011110010100010000000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010000100010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.016679, lagrangian_loss: 0.000161, attention_score_distillation_loss: 0.000020
loss: 0.016979, lagrangian_loss: 0.000000, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 03:13:23
Evaluating: accuracy: 0.8894, eval_loss: 0.4967, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5856, expected_sequence_sparsity: 0.8377, target_sparsity: 0.58, step: 129000
lambda_1: -0.0743, lambda_2: 738.4060
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.57 0.5 0.63 0.57 0.5 0.26]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.48, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111011011110010100010000000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010000000010001000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.297057, lagrangian_loss: 0.000963, attention_score_distillation_loss: 0.000020
loss: 0.012158, lagrangian_loss: 0.000028, attention_score_distillation_loss: 0.000020
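The fractional train remain values come from soft gates: there is one trainable gate per mask cell (10 prune locations times 50 bins, matching the rows and columns printed above). How this repository binarizes them at inference is not shown in the log; below is a minimal sketch of the usual deterministic test-time reduction for hard-concrete gates, assuming the standard stretch interval (gamma, zeta) = (-0.1, 1.1) from Louizos et al. (2018), so the details may differ from the actual l0 module.

```python
import torch

def deterministic_token_mask(log_alpha: torch.Tensor,
                             gamma: float = -0.1, zeta: float = 1.1) -> torch.Tensor:
    # Hard-concrete test-time estimate: squash log-alpha, stretch (0, 1)
    # onto (gamma, zeta), clamp back to [0, 1], then threshold to get the
    # kind of 0/1 rows printed in the log.
    s = torch.sigmoid(log_alpha)
    z = torch.clamp(s * (zeta - gamma) + gamma, min=0.0, max=1.0)
    return (z > 0.5).float()

# One gate per mask cell: 10 prune locations x 50 sequence bins.
mask = deterministic_token_mask(torch.randn(10, 50))
```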
----------------------------------------------------------------------
time: 2023-07-20 03:16:20
Evaluating: accuracy: 0.8883, eval_loss: 0.5075, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5856, expected_sequence_sparsity: 0.8377, target_sparsity: 0.58, step: 129500
lambda_1: -0.1297, lambda_2: 740.8897
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.62 0.57 0.5 0.25]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.48, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111011011110010100010000000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010000000010001000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.018539, lagrangian_loss: 0.001272, attention_score_distillation_loss: 0.000020
loss: 0.130012, lagrangian_loss: 0.000784, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 03:19:14
Evaluating: accuracy: 0.8894, eval_loss: 0.5067, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5856, expected_sequence_sparsity: 0.8377, target_sparsity: 0.58, step: 130000
lambda_1: -0.1158, lambda_2: 743.7689
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.62 0.57 0.5 0.24]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.48, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111011011110010100010000000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010000100010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.005933, lagrangian_loss: 0.000256, attention_score_distillation_loss: 0.000020
loss: 0.022068, lagrangian_loss: 0.000556, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 03:22:09
Evaluating: accuracy: 0.8902, eval_loss: 0.5014, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5856, expected_sequence_sparsity: 0.8377, target_sparsity: 0.58, step: 130500
lambda_1: -0.0685, lambda_2: 746.9197
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.57 0.5 0.61 0.57 0.5 0.28]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.48, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111011000101100110101010100000001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111011011110010100010000000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010000100010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
Saving the best model so far: [Epoch 39 | Step: 130500 | MACs sparsity: 0.5907 | Score: 0.8902 | Loss: 0.5014]
loss: 0.019569, lagrangian_loss: 0.000286, attention_score_distillation_loss: 0.000020
loss: 0.027820, lagrangian_loss: 0.000030, attention_score_distillation_loss: 0.000020
ETA: 0:00:00 | Epoch 39 finished. Took 1159.91 seconds.
07/20/2023 03:26:58 - WARNING - urllib3.connectionpool - Retrying (Retry(total=4, connect=5, read=4, redirect=5, status=5)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='southcentralus.api.azureml.ms', port=443): Read timed out. (read timeout=120)")': /mlflow/v2.0/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourceGroups/gcr-singularity-octo/providers/Microsoft.MachineLearningServices/workspaces/msroctows/api/2.0/mlflow/runs/get?run_uuid=8f1ed327-ef83-4836-9c66-d06bcf6f5683&run_id=8f1ed327-ef83-4836-9c66-d06bcf6f5683
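The best checkpoint is replaced at step 130500 because the score of 0.8902 beats the previous best of 0.8896 while the MACs sparsity (0.5907) is at or above the 0.58 target. A hypothetical sketch of that bookkeeping; the helper name and the sparsity-gating condition are illustrative guesses, not taken from the repository:

```python
best_score = None

def maybe_save_best(score, macs_sparsity, target_sparsity, save_checkpoint):
    # Only let a checkpoint compete for "best" once the model has actually
    # reached the requested sparsity; otherwise early, barely-pruned
    # checkpoints with higher raw accuracy would win. (Illustrative logic.)
    global best_score
    if macs_sparsity >= target_sparsity and (best_score is None or score > best_score):
        best_score = score
        save_checkpoint()
```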