/home/aiscuser/.local/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
2023/07/19 14:34:12 WARNING mlflow.utils.autologging_utils: You are using an unsupported version of transformers. If you encounter errors during autologging, try upgrading / downgrading transformers to a supported version, or try upgrading MLflow.
2023/07/19 14:34:13 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2023/07/19 14:34:13 INFO mlflow.tracking.fluent: Autologging successfully enabled for transformers.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging results (for instance --report_to none).
Downloading and preparing dataset glue/qnli to /home/aiscuser/.cache/huggingface/datasets/glue/qnli/1.0.0/a420f5e518f42454003587c47467370329f9fc0c6508d1ae0c45b58ea266a353...
Downloading data: 0%| | 0.00/10.6M [00:00
Training Arguments
TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=500,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=40,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/mnt/data/device-aware-bert/token_pruning/experiments/QNLI/reproduce1/s0.58_lr2e-05_reglr0.01_alpha0.0002_warmup10_bin50/runs/Jul19_14-34-14_node-0,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=100,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=40.0,
optim=OptimizerNames.ADAMW_HF,
output_dir=/mnt/data/device-aware-bert/token_pruning/experiments/QNLI/reproduce1/s0.58_lr2e-05_reglr0.01_alpha0.0002_warmup10_bin50,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=32,
per_device_train_batch_size=32,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
remove_unused_columns=True,
report_to=['mlflow'],
resume_from_checkpoint=None,
run_name=/mnt/data/device-aware-bert/token_pruning/experiments/QNLI/reproduce1/s0.58_lr2e-05_reglr0.01_alpha0.0002_warmup10_bin50,
save_on_each_node=False,
save_steps=0,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=57,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
Additional Arguments
AdditionalArguments(test=False, ex_name='s0.58_lr2e-05_reglr0.01_alpha0.0002_warmup10_bin50', pruning_type='token+pruner', reg_learning_rate=0.01, scheduler_type='linear', freeze_embeddings=True, pretrained_pruned_model=None, droprate_init=0.01, temperature=0.6666666666666666, prepruning_finetune_epochs=1, lagrangian_warmup_epochs=10, target_sparsity=0.58, sparsity_epsilon=0, distillation_path='/mnt/data/device-aware-bert/token_pruning/teachers/QNLI', do_distill=True, do_layer_distill=False, layer_distill_version=4, distill_loss_alpha=0.9, distill_ce_loss_alpha=0.0002, distill_temp=2.0, use_mac_l0=True, prune_location=[2, 3, 4, 5, 6, 7, 8, 9, 10, 11], bin_num=50, topk=20)
----------------------------------------------------------------------
time: 2023-07-19 14:35:57
Evaluating: accuracy: 0.9165, eval_loss: 0.2978, step: 0
lambda_1: 0.0000, lambda_2: 0.0000 lambda_3: 0.0000
Starting l0 regularization! using , temperature: 0.67, init drop rate: 0.01
token_loga shape: [10, 50]
prune location: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
NDCG TOPK= 20
loss: 0.151606, lagrangian_loss: -0.001826, attention_score_distillation_loss: 0.001971
----------------------------------------------------------------------
time: 2023-07-19 14:38:49
Evaluating: accuracy: 0.912, eval_loss: 0.3301, token_prune_loc: [False, False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.6069, target_sparsity: 0.0088, step: 500
lambda_1: 0.6712, lambda_2: 5.6444 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
loss: 0.249322, lagrangian_loss: -0.001798, attention_score_distillation_loss: 0.001956
loss: 0.021923, lagrangian_loss: 0.025116, attention_score_distillation_loss: 0.001941
----------------------------------------------------------------------
time: 2023-07-19 14:41:40
Evaluating: accuracy: 0.9112, eval_loss: 0.3331, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0136, expected_sparsity: 0.0136, expected_sequence_sparsity: 0.6122, target_sparsity: 0.0177, step: 1000
lambda_1: -5.9906, lambda_2: 14.0124 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1.
0.99 0.92] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 01111111111111111111111111111111111111110111110100 loss: 0.039104, lagrangian_loss: -0.028921, attention_score_distillation_loss: 0.001926 loss: 0.021610, lagrangian_loss: -0.007255, attention_score_distillation_loss: 0.001912 ---------------------------------------------------------------------- time: 2023-07-19 14:44:30 Evaluating: accuracy: 0.907, eval_loss: 0.3781, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0191, expected_sparsity: 0.019, expected_sequence_sparsity: 0.6144, target_sparsity: 0.0265, step: 1500 lambda_1: 1.4706, lambda_2: 22.5720 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 1. 1. 1. 1. 0.99 0.88] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 01111111111111111111111111111111111011110111100100 loss: 0.021560, lagrangian_loss: -0.000941, attention_score_distillation_loss: 0.001896 loss: 0.228817, lagrangian_loss: 0.012241, attention_score_distillation_loss: 0.001879 ---------------------------------------------------------------------- time: 2023-07-19 14:47:24 Evaluating: accuracy: 0.9063, eval_loss: 0.3578, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0245, expected_sparsity: 0.0245, expected_sequence_sparsity: 0.6165, target_sparsity: 0.0354, step: 2000 lambda_1: -3.3169, lambda_2: 26.2438 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 1. 1. 1. 
0.99 0.84] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 01111111111111111111111111111111111011100111000100 loss: 0.135710, lagrangian_loss: 0.013382, attention_score_distillation_loss: 0.001867 loss: 0.136615, lagrangian_loss: -0.005482, attention_score_distillation_loss: 0.001851 ---------------------------------------------------------------------- time: 2023-07-19 14:50:16 Evaluating: accuracy: 0.909, eval_loss: 0.3409, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0327, expected_sparsity: 0.0326, expected_sequence_sparsity: 0.6197, target_sparsity: 0.0443, step: 2500 lambda_1: -1.6806, lambda_2: 28.0291 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.98 0.99 1. 1. 0.98 0.77] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 01111111111111111111111111111111001011100111000000 loss: 0.061354, lagrangian_loss: -0.011220, attention_score_distillation_loss: 0.001835 loss: 0.048383, lagrangian_loss: -0.000103, attention_score_distillation_loss: 0.001821 ---------------------------------------------------------------------- time: 2023-07-19 14:53:11 Evaluating: accuracy: 0.909, eval_loss: 0.3394, token_prune_loc: [False, False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0327, expected_sparsity: 0.0326, expected_sequence_sparsity: 0.6197, target_sparsity: 0.0531, step: 3000 lambda_1: -1.2986, lambda_2: 30.1448 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.99 1. 1. 1. 0.98 0.77] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 01111111111111111111111111111111001011100111000000 loss: 0.039288, lagrangian_loss: 0.010155, attention_score_distillation_loss: 0.001804 loss: 0.246656, lagrangian_loss: 0.010731, attention_score_distillation_loss: 0.001787 ETA: 12:04:03 | Epoch 0 finished. Took 1113.93 seconds. 
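The lagrangian_loss values above swing between small positive and negative numbers while lambda_1 and lambda_2 drift away from zero. This is consistent with the two-multiplier Lagrangian used in CoFi-style L0 pruning, which penalizes the gap between the expected sparsity and the warmed-up target. A minimal sketch of that term, assuming the standard form (class and argument names here are illustrative, not this repo's exact code):

```python
import torch

class SparsityLagrangian(torch.nn.Module):
    """Lagrangian penalty lambda_1 * gap + lambda_2 * gap**2 on the sparsity gap.

    The multipliers are parameters with their own optimizer (the log's
    reg_learning_rate=0.01) and act as a restoring force: once expected
    sparsity tracks the target, the gap is small and the printed
    lagrangian_loss hovers around zero, going negative whenever the signs
    of the gap and lambda_1 disagree.
    """

    def __init__(self):
        super().__init__()
        self.lambda_1 = torch.nn.Parameter(torch.tensor(0.0))
        self.lambda_2 = torch.nn.Parameter(torch.tensor(0.0))

    def forward(self, expected_sparsity: torch.Tensor,
                target_sparsity: float) -> torch.Tensor:
        gap = expected_sparsity - target_sparsity
        return self.lambda_1 * gap + self.lambda_2 * gap.pow(2)
```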
---------------------------------------------------------------------- time: 2023-07-19 14:56:04 Evaluating: accuracy: 0.9132, eval_loss: 0.3481, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0529, expected_sparsity: 0.0524, expected_sequence_sparsity: 0.6275, target_sparsity: 0.062, step: 3500 lambda_1: -2.9663, lambda_2: 32.3318 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.98 1. 1. 0.99 0.95 0.72] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.72] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.66] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111101111111111101111010 01111111111111111111110111111101001011100111000000 loss: 0.167372, lagrangian_loss: -0.016864, attention_score_distillation_loss: 0.001776 loss: 0.348476, lagrangian_loss: -0.000200, attention_score_distillation_loss: 0.001758 ---------------------------------------------------------------------- time: 2023-07-19 14:58:57 Evaluating: accuracy: 0.9116, eval_loss: 0.3552, token_prune_loc: [False, False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0529, expected_sparsity: 0.0524, expected_sequence_sparsity: 0.6275, target_sparsity: 0.0708, step: 4000 lambda_1: -1.6665, lambda_2: 35.0824 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.98 1. 1. 0.99 0.95 0.72] infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.72] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.66] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111101111111111101111010 01111111111111111111110111111101001011100111000000 loss: 0.028537, lagrangian_loss: 0.007263, attention_score_distillation_loss: 0.001742 loss: 0.115705, lagrangian_loss: -0.003871, attention_score_distillation_loss: 0.001729 ---------------------------------------------------------------------- time: 2023-07-19 15:01:51 Evaluating: accuracy: 0.9088, eval_loss: 0.3366, token_prune_loc: [False, False, False, False, True, False, False, False, True, True], macs_sparsity: 0.0968, expected_sparsity: 0.0943, expected_sequence_sparsity: 0.6441, target_sparsity: 0.0797, step: 4500 lambda_1: -0.7962, lambda_2: 36.9589 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.96 1. 1. 
0.98 0.94 0.7 ] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 1.0, 1.0, 1.0, 0.92, 0.7] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.92, 0.92, 0.85, 0.59] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111111111110100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111101111111111101111010 01011111111111111111110111111101001011100111000000 loss: 0.149126, lagrangian_loss: -0.001737, attention_score_distillation_loss: 0.001714 loss: 0.121980, lagrangian_loss: 0.009807, attention_score_distillation_loss: 0.001698 ---------------------------------------------------------------------- time: 2023-07-19 15:04:42 Evaluating: accuracy: 0.9072, eval_loss: 0.3316, token_prune_loc: [False, False, False, False, True, False, False, False, True, True], macs_sparsity: 0.0995, expected_sparsity: 0.0966, expected_sequence_sparsity: 0.645, target_sparsity: 0.0885, step: 5000 lambda_1: -3.8493, lambda_2: 38.8414 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.96 1. 1. 0.98 0.93 0.69] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 1.0, 1.0, 1.0, 0.92, 0.68] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.92, 0.92, 0.85, 0.58] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111111111101100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111101111111111101111010 01011111111111111111110110111101001011100111000000 loss: 0.522274, lagrangian_loss: 0.000473, attention_score_distillation_loss: 0.001683 loss: 0.259186, lagrangian_loss: -0.000771, attention_score_distillation_loss: 0.001668 ---------------------------------------------------------------------- time: 2023-07-19 15:07:35 Evaluating: accuracy: 0.907, eval_loss: 0.3364, token_prune_loc: [False, False, False, False, True, False, False, False, True, True], macs_sparsity: 0.0995, expected_sparsity: 0.0966, expected_sequence_sparsity: 0.645, target_sparsity: 0.0974, step: 5500 lambda_1: 0.0348, lambda_2: 44.5078 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.95 1. 1. 
0.98 0.94 0.69] infer remain: [1.0, 1.0, 1.0, 1.0, 0.92, 1.0, 1.0, 1.0, 0.92, 0.68] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92, 0.92, 0.92, 0.92, 0.85, 0.58] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111111111110100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111101111111111101111010 01011111111111111111110111111101001011100011000000 loss: 0.122168, lagrangian_loss: 0.001977, attention_score_distillation_loss: 0.001651 loss: 0.342982, lagrangian_loss: 0.010217, attention_score_distillation_loss: 0.001638 ---------------------------------------------------------------------- time: 2023-07-19 15:10:27 Evaluating: accuracy: 0.9032, eval_loss: 0.3616, token_prune_loc: [False, False, False, False, True, False, False, True, True, True], macs_sparsity: 0.1257, expected_sparsity: 0.1245, expected_sequence_sparsity: 0.656, target_sparsity: 0.1063, step: 6000 lambda_1: -2.0497, lambda_2: 48.2584 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.92 0.99 1. 0.97 0.92 0.67] infer remain: [1.0, 1.0, 1.0, 1.0, 0.9, 1.0, 1.0, 0.94, 0.9, 0.66] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.9, 0.85, 0.76, 0.5] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111111111100100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111011111100 11111111111111111111111111111101111111111101110010 01011111111111111111110110111101001011100011000000 loss: 0.628217, lagrangian_loss: -0.010831, attention_score_distillation_loss: 0.001621 loss: 0.434961, lagrangian_loss: 0.000020, attention_score_distillation_loss: 0.001606 ---------------------------------------------------------------------- time: 2023-07-19 15:13:20 Evaluating: accuracy: 0.9085, eval_loss: 0.3411, token_prune_loc: [False, False, False, False, True, False, False, True, True, True], macs_sparsity: 0.1257, expected_sparsity: 0.1245, expected_sequence_sparsity: 0.656, target_sparsity: 0.1151, step: 6500 lambda_1: -1.6943, lambda_2: 50.6412 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.93 0.99 1. 0.97 0.92 0.67] infer remain: [1.0, 1.0, 1.0, 1.0, 0.9, 1.0, 1.0, 0.94, 0.9, 0.66] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.9, 0.85, 0.76, 0.5] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111111111100100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111011111100 11111111111111111111111111111101111111111101110010 01011111111111111111110110111101001011100011000000 loss: 0.028765, lagrangian_loss: 0.009957, attention_score_distillation_loss: 0.001592 ETA: 11:54:39 | Epoch 1 finished. Took 1142.86 seconds. 
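target_sparsity in the eval lines grows by roughly 0.0089 every 500 steps: it is warmed up linearly toward the configured target_sparsity=0.58 over lagrangian_warmup_epochs=10, which at QNLI's ~104,743 training examples and batch size 32 is about 3,274 steps per epoch. A sketch of that schedule (the function name and exact step count are inferred from the printed values, not taken from the code):

```python
def scheduled_target_sparsity(step: int,
                              final_sparsity: float = 0.58,
                              warmup_steps: int = 10 * 3274) -> float:
    """Linear sparsity warm-up: 0 at step 0, final_sparsity after warm-up.

    Reproduces the logged targets up to rounding, e.g. ~0.0089 at step 500,
    ~0.0620 at step 3500, and ~0.2923 at step 16500.
    """
    return final_sparsity * min(1.0, step / warmup_steps)
```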
loss: 0.329838, lagrangian_loss: -0.003458, attention_score_distillation_loss: 0.001577 ---------------------------------------------------------------------- time: 2023-07-19 15:16:12 Evaluating: accuracy: 0.9096, eval_loss: 0.3387, token_prune_loc: [False, False, False, False, True, False, False, True, True, True], macs_sparsity: 0.1257, expected_sparsity: 0.1245, expected_sequence_sparsity: 0.656, target_sparsity: 0.124, step: 7000 lambda_1: -1.1721, lambda_2: 52.2661 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.91 0.99 1. 0.95 0.9 0.66] infer remain: [1.0, 1.0, 1.0, 1.0, 0.9, 1.0, 1.0, 0.94, 0.9, 0.66] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.9, 0.9, 0.85, 0.76, 0.5] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111111111100100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111011111100 11111111111111111111111111111101111111111101110010 01011111111111111111110110111101001011100011000000 loss: 0.370096, lagrangian_loss: -0.002987, attention_score_distillation_loss: 0.001562 loss: 0.029515, lagrangian_loss: 0.001664, attention_score_distillation_loss: 0.001547 ---------------------------------------------------------------------- time: 2023-07-19 15:19:09 Evaluating: accuracy: 0.9046, eval_loss: 0.3777, token_prune_loc: [False, False, False, False, True, False, False, True, True, True], macs_sparsity: 0.1443, expected_sparsity: 0.1427, expected_sequence_sparsity: 0.6631, target_sparsity: 0.1328, step: 7500 lambda_1: -2.2503, lambda_2: 53.1401 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.9 0.99 1. 0.95 0.9 0.65] infer remain: [1.0, 1.0, 1.0, 1.0, 0.88, 1.0, 1.0, 0.92, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.88, 0.81, 0.71, 0.46] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111111111100000 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111011111000 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.124168, lagrangian_loss: 0.001704, attention_score_distillation_loss: 0.001532 loss: 0.047143, lagrangian_loss: -0.002382, attention_score_distillation_loss: 0.001516 ---------------------------------------------------------------------- time: 2023-07-19 15:22:02 Evaluating: accuracy: 0.9063, eval_loss: 0.3526, token_prune_loc: [False, False, False, False, True, False, False, True, True, True], macs_sparsity: 0.1443, expected_sparsity: 0.1427, expected_sequence_sparsity: 0.6631, target_sparsity: 0.1417, step: 8000 lambda_1: -1.5352, lambda_2: 53.3740 lambda_3: 0.0000 train remain: [1. 1. 1. 1. 0.89 0.99 1. 
0.94 0.89 0.65] infer remain: [1.0, 1.0, 1.0, 1.0, 0.88, 1.0, 1.0, 0.92, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.88, 0.81, 0.71, 0.46] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011111111111100000 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111011111000 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.096316, lagrangian_loss: 0.000284, attention_score_distillation_loss: 0.001501 loss: 0.049778, lagrangian_loss: 0.003615, attention_score_distillation_loss: 0.001486 ---------------------------------------------------------------------- time: 2023-07-19 15:24:54 Evaluating: accuracy: 0.9052, eval_loss: 0.378, token_prune_loc: [False, False, False, False, True, False, False, True, True, True], macs_sparsity: 0.1536, expected_sparsity: 0.1519, expected_sequence_sparsity: 0.6668, target_sparsity: 0.1505, step: 8500 lambda_1: -1.1228, lambda_2: 55.3826 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.88 0.96 1. 0.93 0.89 0.64] infer remain: [1.0, 1.0, 1.0, 1.0, 0.86, 1.0, 1.0, 0.92, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.86, 0.86, 0.79, 0.7, 0.45] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101111111100000 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111011110100 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.244794, lagrangian_loss: -0.003341, attention_score_distillation_loss: 0.001469 loss: 0.225897, lagrangian_loss: 0.004802, attention_score_distillation_loss: 0.001457 ---------------------------------------------------------------------- time: 2023-07-19 15:27:46 Evaluating: accuracy: 0.9065, eval_loss: 0.3857, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.1886, expected_sparsity: 0.1842, expected_sequence_sparsity: 0.6795, target_sparsity: 0.1594, step: 9000 lambda_1: -2.3467, lambda_2: 57.3825 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.89 0.94 1. 
0.93 0.89 0.64] infer remain: [1.0, 1.0, 1.0, 1.0, 0.86, 0.9, 1.0, 0.92, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.77, 0.77, 0.71, 0.63, 0.4] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101111111100000 11111111111111111111111111110111110111101101111110 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111011110100 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.704547, lagrangian_loss: -0.005717, attention_score_distillation_loss: 0.001441 loss: 0.104996, lagrangian_loss: 0.001554, attention_score_distillation_loss: 0.001426 ---------------------------------------------------------------------- time: 2023-07-19 15:30:38 Evaluating: accuracy: 0.9076, eval_loss: 0.3862, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.1946, expected_sparsity: 0.1941, expected_sequence_sparsity: 0.6834, target_sparsity: 0.1683, step: 9500 lambda_1: -3.6773, lambda_2: 61.1385 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.89 0.93 1. 0.92 0.88 0.64] infer remain: [1.0, 1.0, 1.0, 1.0, 0.86, 0.88, 1.0, 0.9, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.76, 0.76, 0.68, 0.6, 0.38] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101111111100000 11111111111111111111111111110111110111101101101110 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.325564, lagrangian_loss: 0.002754, attention_score_distillation_loss: 0.001411 loss: 0.064415, lagrangian_loss: -0.000777, attention_score_distillation_loss: 0.001395 ETA: 11:33:46 | Epoch 2 finished. Took 1118.35 seconds. ---------------------------------------------------------------------- time: 2023-07-19 15:33:32 Evaluating: accuracy: 0.9026, eval_loss: 0.377, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.1946, expected_sparsity: 0.1941, expected_sequence_sparsity: 0.6834, target_sparsity: 0.1771, step: 10000 lambda_1: -2.8310, lambda_2: 65.7308 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.88 0.91 1. 
0.91 0.88 0.64] infer remain: [1.0, 1.0, 1.0, 1.0, 0.86, 0.88, 1.0, 0.9, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.76, 0.76, 0.68, 0.6, 0.38] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101111111100000 11111111111111111111111111110111110111101101101110 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.727551, lagrangian_loss: 0.002070, attention_score_distillation_loss: 0.001380 loss: 0.024805, lagrangian_loss: -0.003132, attention_score_distillation_loss: 0.001365 ---------------------------------------------------------------------- time: 2023-07-19 15:36:27 Evaluating: accuracy: 0.9063, eval_loss: 0.3844, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.1979, expected_sparsity: 0.1976, expected_sequence_sparsity: 0.6848, target_sparsity: 0.186, step: 10500 lambda_1: -2.0482, lambda_2: 68.1585 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.88 0.91 1. 0.9 0.88 0.64] infer remain: [1.0, 1.0, 1.0, 1.0, 0.86, 0.88, 1.0, 0.88, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.76, 0.76, 0.67, 0.59, 0.38] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101111111100000 11111111111111111111111111110111110111101101101110 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.017083, lagrangian_loss: 0.006027, attention_score_distillation_loss: 0.001350 loss: 0.179024, lagrangian_loss: -0.000477, attention_score_distillation_loss: 0.001334 ---------------------------------------------------------------------- time: 2023-07-19 15:39:18 Evaluating: accuracy: 0.8986, eval_loss: 0.4046, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.2072, expected_sparsity: 0.2039, expected_sequence_sparsity: 0.6873, target_sparsity: 0.1948, step: 11000 lambda_1: -0.9741, lambda_2: 70.7487 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.87 0.89 1. 
0.88 0.88 0.64] infer remain: [1.0, 1.0, 1.0, 1.0, 0.86, 0.86, 1.0, 0.88, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.74, 0.65, 0.57, 0.37] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101111111100000 11111111111111111111111111110111110110101101101110 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111011011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.043905, lagrangian_loss: -0.001082, attention_score_distillation_loss: 0.001319 loss: 0.537168, lagrangian_loss: 0.007126, attention_score_distillation_loss: 0.001306 ---------------------------------------------------------------------- time: 2023-07-19 15:42:13 Evaluating: accuracy: 0.9054, eval_loss: 0.3918, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.2246, expected_sparsity: 0.2212, expected_sequence_sparsity: 0.6941, target_sparsity: 0.2037, step: 11500 lambda_1: -3.1719, lambda_2: 74.2290 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.86 0.88 0.99 0.87 0.88 0.64] infer remain: [1.0, 1.0, 1.0, 1.0, 0.84, 0.84, 1.0, 0.86, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.71, 0.71, 0.61, 0.53, 0.34] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101101111100000 11111111111111111111111111110111110110101101101100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111001011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.176914, lagrangian_loss: -0.012776, attention_score_distillation_loss: 0.001289 loss: 0.117696, lagrangian_loss: 0.000195, attention_score_distillation_loss: 0.001274 ---------------------------------------------------------------------- time: 2023-07-19 15:45:04 Evaluating: accuracy: 0.9046, eval_loss: 0.4019, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.2246, expected_sparsity: 0.2212, expected_sequence_sparsity: 0.6941, target_sparsity: 0.2125, step: 12000 lambda_1: -3.5314, lambda_2: 80.4260 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.99 0.86 0.87 0.99 0.86 0.88 0.63] infer remain: [1.0, 1.0, 1.0, 1.0, 0.84, 0.84, 1.0, 0.86, 0.88, 0.64] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.71, 0.71, 0.61, 0.53, 0.34] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101011111100000 11111111111111111111111111110111110110101101101100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111001011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110110111001001011100011000000 loss: 0.043688, lagrangian_loss: 0.006887, attention_score_distillation_loss: 0.001259 loss: 0.171558, lagrangian_loss: -0.009203, attention_score_distillation_loss: 0.001244 ---------------------------------------------------------------------- time: 2023-07-19 15:47:59 Evaluating: accuracy: 0.9035, eval_loss: 0.4037, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.2273, expected_sparsity: 0.2226, expected_sequence_sparsity: 0.6947, target_sparsity: 0.2214, step: 12500 lambda_1: -0.7780, lambda_2: 83.1429 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.84 0.86 0.99 0.86 0.88 0.63] infer remain: [1.0, 1.0, 1.0, 1.0, 0.84, 0.84, 1.0, 0.86, 0.88, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.71, 0.71, 0.61, 0.53, 0.33] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101011111100000 11111111111111111111111111110111110110101101101100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111001011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110010111001001011100011000000 loss: 0.121763, lagrangian_loss: 0.001610, attention_score_distillation_loss: 0.001215 loss: 0.370005, lagrangian_loss: 0.000966, attention_score_distillation_loss: 0.001214 ---------------------------------------------------------------------- time: 2023-07-19 15:50:51 Evaluating: accuracy: 0.8993, eval_loss: 0.3841, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.2404, expected_sparsity: 0.2362, expected_sequence_sparsity: 0.7, target_sparsity: 0.2303, step: 13000 lambda_1: -2.3161, lambda_2: 84.2666 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.83 0.85 0.98 0.86 0.88 0.62] infer remain: [1.0, 1.0, 1.0, 1.0, 0.82, 0.82, 1.0, 0.86, 0.88, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.67, 0.67, 0.58, 0.51, 0.32] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101001111100000 11111111111111111111111111110111110110001101101100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111001011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110010111001001011100011000000 loss: 0.358073, lagrangian_loss: -0.002303, attention_score_distillation_loss: 0.001187 ETA: 11:18:16 | Epoch 3 finished. Took 1146.7 seconds. 
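In every block, layerwise remain is the running product of the infer remain ratios over the 12 encoder layers: pruning happens at prune_location layers 2-11, and a token dropped at one layer stays dropped in all later ones. A small check against the step-13000 line above, with illustrative names:

```python
def layerwise_remain(infer_remain, prune_location, num_layers=12):
    """Cumulative token-keep ratio per layer: per-layer ratios multiply
    along the depth because token pruning is irreversible."""
    ratio_at = dict(zip(prune_location, infer_remain))
    out, running = [], 1.0
    for layer in range(num_layers):
        running *= ratio_at.get(layer, 1.0)
        out.append(round(running, 2))
    return out

print(layerwise_remain(
    [1.0, 1.0, 1.0, 1.0, 0.82, 0.82, 1.0, 0.86, 0.88, 0.62],
    [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]))
# -> [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.67, 0.67, 0.58, 0.51, 0.32]
```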
loss: 0.053012, lagrangian_loss: -0.000268, attention_score_distillation_loss: 0.001184 ---------------------------------------------------------------------- time: 2023-07-19 15:53:42 Evaluating: accuracy: 0.8995, eval_loss: 0.4072, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.2404, expected_sparsity: 0.2362, expected_sequence_sparsity: 0.7, target_sparsity: 0.2391, step: 13500 lambda_1: -1.1716, lambda_2: 85.2143 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.83 0.84 0.98 0.86 0.88 0.62] infer remain: [1.0, 1.0, 1.0, 1.0, 0.82, 0.82, 1.0, 0.86, 0.88, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.67, 0.67, 0.58, 0.51, 0.32] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111011101001111100000 11111111111111111111111111110111110110001101101100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111001011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110010111001001011100011000000 loss: 0.137737, lagrangian_loss: 0.001275, attention_score_distillation_loss: 0.001169 loss: 0.155063, lagrangian_loss: -0.000272, attention_score_distillation_loss: 0.001152 ---------------------------------------------------------------------- time: 2023-07-19 15:56:35 Evaluating: accuracy: 0.9026, eval_loss: 0.4081, token_prune_loc: [False, False, False, False, True, True, False, True, True, True], macs_sparsity: 0.2513, expected_sparsity: 0.2495, expected_sequence_sparsity: 0.7052, target_sparsity: 0.248, step: 14000 lambda_1: -0.9965, lambda_2: 87.0314 lambda_3: 0.0000 train remain: [1. 1. 1. 0.99 0.82 0.83 0.97 0.86 0.88 0.62] infer remain: [1.0, 1.0, 1.0, 1.0, 0.8, 0.8, 1.0, 0.86, 0.88, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.64, 0.64, 0.55, 0.48, 0.3] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111010101001111100000 11111111111111111111111111110111110100001101101100 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111001011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110010111001001011100011000000 loss: 0.075654, lagrangian_loss: -0.000343, attention_score_distillation_loss: 0.001137 loss: 0.389590, lagrangian_loss: 0.000534, attention_score_distillation_loss: 0.001122 ---------------------------------------------------------------------- time: 2023-07-19 15:59:29 Evaluating: accuracy: 0.9043, eval_loss: 0.3973, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2655, expected_sparsity: 0.2636, expected_sequence_sparsity: 0.7108, target_sparsity: 0.2568, step: 14500 lambda_1: -3.1236, lambda_2: 88.0091 lambda_3: 0.0000 train remain: [1. 1. 1. 
0.99 0.81 0.81 0.96 0.86 0.88 0.61] infer remain: [1.0, 1.0, 1.0, 1.0, 0.8, 0.8, 0.92, 0.86, 0.88, 0.62] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.64, 0.59, 0.51, 0.45, 0.28] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111010101001111100000 11111111111111111111111111110111110100001101101100 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011110000 01111111111111111111111111111101111111111101110010 01011111111111111111110010111001001011100011000000 loss: 0.376011, lagrangian_loss: -0.002480, attention_score_distillation_loss: 0.001104 loss: 0.168905, lagrangian_loss: -0.000104, attention_score_distillation_loss: 0.001091 ---------------------------------------------------------------------- time: 2023-07-19 16:02:26 Evaluating: accuracy: 0.9001, eval_loss: 0.3743, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2747, expected_sparsity: 0.2702, expected_sequence_sparsity: 0.7134, target_sparsity: 0.2657, step: 15000 lambda_1: -2.7815, lambda_2: 91.4272 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.99 0.81 0.8 0.95 0.86 0.87 0.61] infer remain: [1.0, 1.0, 1.0, 1.0, 0.8, 0.78, 0.92, 0.86, 0.88, 0.6] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.62, 0.57, 0.49, 0.43, 0.26] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111010101001111100000 11111111111111111111111111110111010100001101101100 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011110000 01111111111111111111111111111101111111111101110010 01011111111111111110110010111001001011100011000000 loss: 0.047740, lagrangian_loss: 0.003428, attention_score_distillation_loss: 0.001077 loss: 0.263340, lagrangian_loss: -0.006191, attention_score_distillation_loss: 0.001062 ---------------------------------------------------------------------- time: 2023-07-19 16:05:19 Evaluating: accuracy: 0.8997, eval_loss: 0.4169, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2774, expected_sparsity: 0.2718, expected_sequence_sparsity: 0.714, target_sparsity: 0.2746, step: 15500 lambda_1: -0.5111, lambda_2: 94.1950 lambda_3: 0.0000 train remain: [1. 1. 
0.99 0.99 0.8 0.79 0.95 0.85 0.87 0.6 ] infer remain: [1.0, 1.0, 1.0, 1.0, 0.8, 0.78, 0.92, 0.86, 0.86, 0.6] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.62, 0.57, 0.49, 0.42, 0.25] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111010101001111100000 11111111111111111111111111110111010100001101101100 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011110000 00111111111111111111111111111101111111111101110010 01011111111111111110110010111001001011100011000000 loss: 0.133294, lagrangian_loss: 0.001181, attention_score_distillation_loss: 0.001046 loss: 0.044360, lagrangian_loss: -0.001412, attention_score_distillation_loss: 0.001031 ---------------------------------------------------------------------- time: 2023-07-19 16:08:12 Evaluating: accuracy: 0.8929, eval_loss: 0.403, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2889, expected_sparsity: 0.2862, expected_sequence_sparsity: 0.7197, target_sparsity: 0.2834, step: 16000 lambda_1: -1.3668, lambda_2: 98.4045 lambda_3: 0.0000 train remain: [1. 1. 0.99 0.99 0.78 0.78 0.95 0.85 0.86 0.59] infer remain: [1.0, 1.0, 1.0, 1.0, 0.78, 0.76, 0.92, 0.84, 0.86, 0.6] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.59, 0.55, 0.46, 0.39, 0.24] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111010101000111100000 11111111111111111111111111110111010100001101100100 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111101110010 01011111111111111110110010111001001011100011000000 loss: 0.151958, lagrangian_loss: -0.001073, attention_score_distillation_loss: 0.001016 loss: 0.405352, lagrangian_loss: 0.003347, attention_score_distillation_loss: 0.001001 ETA: 10:58:21 | Epoch 4 finished. Took 1121.19 seconds. ---------------------------------------------------------------------- time: 2023-07-19 16:11:05 Evaluating: accuracy: 0.9008, eval_loss: 0.4311, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.2916, expected_sparsity: 0.2872, expected_sequence_sparsity: 0.7201, target_sparsity: 0.2923, step: 16500 lambda_1: -2.1987, lambda_2: 99.9726 lambda_3: 0.0000 train remain: [1. 1. 
0.97 0.99 0.78 0.78 0.94 0.85 0.86 0.58] infer remain: [1.0, 1.0, 1.0, 1.0, 0.78, 0.76, 0.92, 0.84, 0.86, 0.58] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.78, 0.59, 0.55, 0.46, 0.39, 0.23] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111010101000111100000 11111111111111111111111111110111010100001101100100 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111101110010 00011111111111111110110010111001001011100011000000 loss: 0.588161, lagrangian_loss: -0.005038, attention_score_distillation_loss: 0.000986 loss: 0.295328, lagrangian_loss: 0.004551, attention_score_distillation_loss: 0.000971 ---------------------------------------------------------------------- time: 2023-07-19 16:13:59 Evaluating: accuracy: 0.8986, eval_loss: 0.4349, token_prune_loc: [False, False, False, False, True, True, True, True, True, True], macs_sparsity: 0.3014, expected_sparsity: 0.2988, expected_sequence_sparsity: 0.7247, target_sparsity: 0.3011, step: 17000 lambda_1: -2.7316, lambda_2: 102.9767 lambda_3: 0.0000 train remain: [1. 1. 0.97 0.99 0.77 0.76 0.94 0.85 0.85 0.58] infer remain: [1.0, 1.0, 1.0, 1.0, 0.76, 0.74, 0.92, 0.84, 0.86, 0.58] layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.76, 0.56, 0.52, 0.43, 0.37, 0.22] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111010101000110100000 11111111111111111111111111110111010100000101100100 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111101110010 00011111111111111110110010111001001011100011000000 loss: 0.020533, lagrangian_loss: -0.003328, attention_score_distillation_loss: 0.000954 loss: 0.196831, lagrangian_loss: 0.001611, attention_score_distillation_loss: 0.000940 ---------------------------------------------------------------------- time: 2023-07-19 16:16:53 Evaluating: accuracy: 0.8871, eval_loss: 0.4462, token_prune_loc: [False, False, True, False, True, True, True, True, True, True], macs_sparsity: 0.3287, expected_sparsity: 0.3243, expected_sequence_sparsity: 0.7348, target_sparsity: 0.31, step: 17500 lambda_1: -4.1269, lambda_2: 109.3017 lambda_3: 0.0000 train remain: [1. 1. 
0.97 0.98 0.77 0.75 0.94 0.84 0.85 0.57] infer remain: [1.0, 1.0, 0.94, 1.0, 0.76, 0.74, 0.92, 0.84, 0.86, 0.56] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.94, 0.94, 0.71, 0.53, 0.49, 0.41, 0.35, 0.2] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111111111011011111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111010101000110100000 11111111111111111111111111110111010100000101100100 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111101110010 00011111111111111110110010111001001011000011000000 loss: 0.395330, lagrangian_loss: -0.000525, attention_score_distillation_loss: 0.000925 loss: 0.143363, lagrangian_loss: -0.002671, attention_score_distillation_loss: 0.000910 ---------------------------------------------------------------------- time: 2023-07-19 16:19:45 Evaluating: accuracy: 0.8843, eval_loss: 0.4669, token_prune_loc: [False, False, True, False, True, True, True, True, True, True], macs_sparsity: 0.3314, expected_sparsity: 0.3256, expected_sequence_sparsity: 0.7353, target_sparsity: 0.3188, step: 18000 lambda_1: -1.2977, lambda_2: 112.8449 lambda_3: 0.0000 train remain: [1. 1. 0.96 0.98 0.77 0.75 0.94 0.84 0.85 0.57] infer remain: [1.0, 1.0, 0.94, 1.0, 0.76, 0.74, 0.92, 0.84, 0.84, 0.56] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.94, 0.94, 0.71, 0.53, 0.49, 0.41, 0.34, 0.19] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111111111011011111 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111010101000110100000 11111111111111111111111111110111010100000101100100 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111111110110010111001001011000011000000 loss: 0.150135, lagrangian_loss: 0.006139, attention_score_distillation_loss: 0.000893 loss: 0.297260, lagrangian_loss: 0.001615, attention_score_distillation_loss: 0.000879 ---------------------------------------------------------------------- time: 2023-07-19 16:22:40 Evaluating: accuracy: 0.8827, eval_loss: 0.5073, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3527, expected_sparsity: 0.3482, expected_sequence_sparsity: 0.7442, target_sparsity: 0.3277, step: 18500 lambda_1: -2.0872, lambda_2: 116.5439 lambda_3: 0.0000 train remain: [1. 1. 
0.96 0.97 0.76 0.73 0.93 0.84 0.85 0.55] infer remain: [1.0, 1.0, 0.94, 0.94, 0.76, 0.72, 0.92, 0.84, 0.84, 0.56] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.94, 0.88, 0.67, 0.48, 0.44, 0.37, 0.31, 0.18] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111111111011011111 11111111111111111111111110111111111111111111111010 11111111111111111111111111111111010101000110100000 11111111111111111111111111110111010100000101100000 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111111110110010111001001011000011000000 loss: 0.147284, lagrangian_loss: -0.002430, attention_score_distillation_loss: 0.000865 loss: 0.316753, lagrangian_loss: -0.002380, attention_score_distillation_loss: 0.000850 ---------------------------------------------------------------------- time: 2023-07-19 16:25:34 Evaluating: accuracy: 0.8871, eval_loss: 0.4389, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3604, expected_sparsity: 0.3567, expected_sequence_sparsity: 0.7475, target_sparsity: 0.3366, step: 19000 lambda_1: -3.0211, lambda_2: 117.8724 lambda_3: 0.0000 train remain: [1. 1. 0.95 0.96 0.76 0.72 0.93 0.84 0.85 0.54] infer remain: [1.0, 1.0, 0.92, 0.94, 0.76, 0.72, 0.92, 0.84, 0.84, 0.54] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.86, 0.66, 0.47, 0.44, 0.37, 0.31, 0.17] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111110111011011111 11111111111111111111111111111111111111111111111000 11111111111111111111111111111111010101000110100000 11111111111111111111111111110111010100000101100000 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111111110110010111001001011000010000000 loss: 0.550799, lagrangian_loss: -0.003408, attention_score_distillation_loss: 0.000833 loss: 0.252318, lagrangian_loss: -0.001510, attention_score_distillation_loss: 0.000819 ---------------------------------------------------------------------- time: 2023-07-19 16:28:28 Evaluating: accuracy: 0.8929, eval_loss: 0.4493, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3604, expected_sparsity: 0.3567, expected_sequence_sparsity: 0.7475, target_sparsity: 0.3454, step: 19500 lambda_1: -3.1941, lambda_2: 120.0675 lambda_3: 0.0000 train remain: [1. 1. 0.94 0.96 0.76 0.72 0.93 0.84 0.85 0.53] infer remain: [1.0, 1.0, 0.92, 0.94, 0.76, 0.72, 0.92, 0.84, 0.84, 0.54] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.92, 0.86, 0.66, 0.47, 0.44, 0.37, 0.31, 0.17] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111111111011011110 11111111111111111111111111111111111111111111111000 11111111111111111111111111111111010101000110100000 11111111111111111111111111110111010100000101100000 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111111110110010111001001011000010000000 loss: 0.170814, lagrangian_loss: 0.005152, attention_score_distillation_loss: 0.000803 ETA: 10:41:36 | Epoch 5 finished. Took 1150.5 seconds. 
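The 0/1 rows printed with each evaluation are the bin_num=50 token bins per pruned layer after deterministic binarization, while train remain reports the expected value of the stochastic gates during training (hence its slightly higher fractional values than infer remain). The run's temperature=2/3 and droprate_init=0.01 match the hard-concrete gates of Louizos et al. (2018); below is a sketch of such a gate over the [10, 50] token_loga tensor, assuming the usual stretch interval (-0.1, 1.1), which this repo may set differently:

```python
import torch

def hard_concrete_gate(token_loga: torch.Tensor,
                       temperature: float = 2 / 3,
                       stretch=(-0.1, 1.1),
                       training: bool = True) -> torch.Tensor:
    """Hard-concrete gate over per-layer token bins.

    Training: sample a stretched, clamped sigmoid relaxation (differentiable,
    giving the fractional "train remain" values). Inference: the deterministic
    stretched sigmoid saturates at 0 or 1 for decided bins, matching the 0/1
    rows in the log once thresholded.
    """
    lo, hi = stretch
    if training:
        u = torch.rand_like(token_loga).clamp_(1e-6, 1 - 1e-6)
        s = torch.sigmoid((u.log() - (-u).log1p() + token_loga) / temperature)
    else:
        s = torch.sigmoid(token_loga)
    return (s * (hi - lo) + lo).clamp(0.0, 1.0)
```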
loss: 0.098720, lagrangian_loss: -0.000152, attention_score_distillation_loss: 0.000788 ---------------------------------------------------------------------- time: 2023-07-19 16:31:24 Evaluating: accuracy: 0.8821, eval_loss: 0.4723, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.386, expected_sparsity: 0.3799, expected_sequence_sparsity: 0.7567, target_sparsity: 0.3543, step: 20000 lambda_1: -2.4006, lambda_2: 121.4914 lambda_3: 0.0000 train remain: [1. 1. 0.93 0.95 0.75 0.72 0.93 0.84 0.85 0.52] infer remain: [1.0, 1.0, 0.9, 0.92, 0.74, 0.7, 0.92, 0.84, 0.84, 0.52] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.9, 0.83, 0.61, 0.43, 0.39, 0.33, 0.28, 0.14] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111110111011011110 11111111111111111111111111111111101111111111111000 11111111111111111111111111111101010101000110100000 11111111111111111111111111110111010000000101100000 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111111110110010111001001010000010000000 loss: 0.077142, lagrangian_loss: 0.002063, attention_score_distillation_loss: 0.000773 loss: 0.189560, lagrangian_loss: 0.003385, attention_score_distillation_loss: 0.000757 ---------------------------------------------------------------------- time: 2023-07-19 16:34:17 Evaluating: accuracy: 0.8867, eval_loss: 0.4654, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.386, expected_sparsity: 0.3807, expected_sequence_sparsity: 0.757, target_sparsity: 0.3631, step: 20500 lambda_1: -3.9850, lambda_2: 122.5684 lambda_3: 0.0000 train remain: [1. 1. 0.92 0.93 0.75 0.71 0.93 0.84 0.85 0.51] infer remain: [1.0, 1.0, 0.9, 0.92, 0.74, 0.7, 0.92, 0.84, 0.84, 0.5] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.9, 0.83, 0.61, 0.43, 0.39, 0.33, 0.28, 0.14] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111110111011011110 11111111111111111111111111111111101111111111111000 11111111111111111111111111111101010101000110100000 11111111111111111111111111110111010000000101100000 11111111111111111111111111101111111011111111111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111110110110010111001001010000010000000 loss: 0.241112, lagrangian_loss: -0.005537, attention_score_distillation_loss: 0.000743 loss: 0.166410, lagrangian_loss: 0.000345, attention_score_distillation_loss: 0.000728 ---------------------------------------------------------------------- time: 2023-07-19 16:37:11 Evaluating: accuracy: 0.8905, eval_loss: 0.4697, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3876, expected_sparsity: 0.3828, expected_sequence_sparsity: 0.7578, target_sparsity: 0.372, step: 21000 lambda_1: -2.7639, lambda_2: 124.4564 lambda_3: 0.0000 train remain: [1. 1. 
0.92 0.93 0.74 0.7 0.93 0.84 0.85 0.5 ] infer remain: [1.0, 1.0, 0.9, 0.92, 0.74, 0.7, 0.9, 0.84, 0.84, 0.5] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.9, 0.83, 0.61, 0.43, 0.39, 0.32, 0.27, 0.14] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111110111011011110 11111111111111111111111111111111101111111111111000 11111111111111111111111111111101010101000110100000 11111111111111111111111111110111010000000101100000 11111111111111111111111111101111111011111011111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111110110110010111001001010000010000000 loss: 0.640742, lagrangian_loss: -0.001453, attention_score_distillation_loss: 0.000713 loss: 0.767744, lagrangian_loss: -0.003106, attention_score_distillation_loss: 0.000697 ---------------------------------------------------------------------- time: 2023-07-19 16:40:03 Evaluating: accuracy: 0.8898, eval_loss: 0.4671, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.3919, expected_sparsity: 0.389, expected_sequence_sparsity: 0.7602, target_sparsity: 0.3808, step: 21500 lambda_1: -3.9626, lambda_2: 127.4678 lambda_3: 0.0000 train remain: [1. 1. 0.91 0.91 0.74 0.7 0.93 0.84 0.84 0.49] infer remain: [1.0, 1.0, 0.9, 0.9, 0.74, 0.7, 0.9, 0.84, 0.84, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.9, 0.81, 0.6, 0.42, 0.38, 0.32, 0.27, 0.13] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111110111011011110 11111111111111111111111111111111101111111011111000 11111111111111111111111111111101010101000110100000 11111111111111111111111111110111010000000101100000 11111111111111111111111111101111111011111011111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111110110100010111001001010000010000000 loss: 0.389449, lagrangian_loss: 0.008071, attention_score_distillation_loss: 0.000682 loss: 0.092713, lagrangian_loss: -0.002831, attention_score_distillation_loss: 0.000666 ---------------------------------------------------------------------- time: 2023-07-19 16:42:56 Evaluating: accuracy: 0.888, eval_loss: 0.4444, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4082, expected_sparsity: 0.4027, expected_sequence_sparsity: 0.7656, target_sparsity: 0.3897, step: 22000 lambda_1: -1.9608, lambda_2: 131.4487 lambda_3: 0.0000 train remain: [1. 1. 
0.91 0.9 0.73 0.69 0.93 0.84 0.84 0.49] infer remain: [1.0, 1.0, 0.9, 0.88, 0.72, 0.68, 0.9, 0.84, 0.84, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.9, 0.79, 0.57, 0.39, 0.35, 0.29, 0.25, 0.12] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111110111011011110 11111111111111111111111110111111101111111011111000 11111111111111111111111101111101010101000110100000 11111111111111111111111111110111010000000101000000 11111111111111111111111111101111111011111011111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111110110100010111001001010000010000000 loss: 0.087542, lagrangian_loss: 0.002828, attention_score_distillation_loss: 0.000652 loss: 0.278232, lagrangian_loss: -0.009213, attention_score_distillation_loss: 0.000637 ---------------------------------------------------------------------- time: 2023-07-19 16:45:49 Evaluating: accuracy: 0.8856, eval_loss: 0.4385, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4115, expected_sparsity: 0.4095, expected_sequence_sparsity: 0.7683, target_sparsity: 0.3986, step: 22500 lambda_1: -1.7196, lambda_2: 136.8272 lambda_3: 0.0000 train remain: [1. 1. 0.91 0.89 0.72 0.69 0.92 0.84 0.83 0.48] infer remain: [1.0, 1.0, 0.88, 0.88, 0.72, 0.68, 0.9, 0.84, 0.84, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.88, 0.77, 0.56, 0.38, 0.34, 0.29, 0.24, 0.12] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111110111010011110 11111111111111111111111110111111101111111011111000 11111111111111111111111101111101010101000110100000 11111111111111111111111111110111010000000101000000 11111111111111111111111111101111111011111011111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110010 00011111111111110110100010111001001010000010000000 loss: 0.282216, lagrangian_loss: 0.006731, attention_score_distillation_loss: 0.000622 loss: 0.402815, lagrangian_loss: 0.001438, attention_score_distillation_loss: 0.000607 ETA: 10:21:59 | Epoch 6 finished. Took 1122.81 seconds. ---------------------------------------------------------------------- time: 2023-07-19 16:48:45 Evaluating: accuracy: 0.877, eval_loss: 0.5176, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4208, expected_sparsity: 0.4148, expected_sequence_sparsity: 0.7704, target_sparsity: 0.4074, step: 23000 lambda_1: -3.0206, lambda_2: 140.3611 lambda_3: 0.0000 train remain: [1. 1. 
0.89 0.88 0.7 0.68 0.92 0.83 0.82 0.48] infer remain: [1.0, 1.0, 0.88, 0.88, 0.7, 0.68, 0.9, 0.84, 0.82, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.88, 0.77, 0.54, 0.37, 0.33, 0.28, 0.23, 0.11] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111111111111111110111010011110 11111111111111111111111110111111101111111011111000 11111111111111111111111101111001010101000110100000 11111111111111111111111111110111010000000101000000 11111111111111111111111111101111111011111011111010 11111111111111111111111111111111111111001011010000 00111111111111111111111111111101111111111100110000 00011111111111110110100010111001001010000010000000 loss: 0.370469, lagrangian_loss: -0.006916, attention_score_distillation_loss: 0.000591 loss: 0.098022, lagrangian_loss: -0.000028, attention_score_distillation_loss: 0.000576 ---------------------------------------------------------------------- time: 2023-07-19 16:51:40 Evaluating: accuracy: 0.8827, eval_loss: 0.4909, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.424, expected_sparsity: 0.4211, expected_sequence_sparsity: 0.7729, target_sparsity: 0.4163, step: 23500 lambda_1: -2.9186, lambda_2: 142.3789 lambda_3: 0.0000 train remain: [1. 1. 0.89 0.87 0.7 0.68 0.92 0.83 0.81 0.48] infer remain: [1.0, 1.0, 0.88, 0.86, 0.7, 0.68, 0.9, 0.82, 0.82, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.88, 0.76, 0.53, 0.36, 0.32, 0.27, 0.22, 0.1] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110111011011110 11111111111111111111111110111111101111111011011000 11111111111111111111111101111001010101000110100000 11111111111111111111111111110111010000000101000000 11111111111111111111111111101111111011111011111010 11111111111111111111111111111111111111001010010000 00111111111111111111111111111101111111111100110000 00011111111111110110100010111001001010000010000000 loss: 0.305864, lagrangian_loss: 0.000868, attention_score_distillation_loss: 0.000561 loss: 0.111837, lagrangian_loss: -0.004155, attention_score_distillation_loss: 0.000546 ---------------------------------------------------------------------- time: 2023-07-19 16:54:34 Evaluating: accuracy: 0.8711, eval_loss: 0.5274, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4398, expected_sparsity: 0.4356, expected_sequence_sparsity: 0.7786, target_sparsity: 0.4251, step: 24000 lambda_1: -2.1886, lambda_2: 144.7647 lambda_3: 0.0000 train remain: [1. 1. 
0.88 0.86 0.69 0.67 0.91 0.83 0.81 0.47] infer remain: [1.0, 1.0, 0.86, 0.86, 0.68, 0.66, 0.9, 0.82, 0.8, 0.48] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.86, 0.74, 0.5, 0.33, 0.3, 0.24, 0.2, 0.09] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110111010011110 11111111111111111111111110111111101111111011011000 11111111111111111111111101111001010101000110000000 11111111111111111111111111110111010000000100000000 11111111111111111111111111101111111011111011111010 11111111111111111111111111111111111111001010010000 00111111111111111111111110111101111111111100110000 00011111111111110110100010011101001010000010000000 loss: 0.527923, lagrangian_loss: 0.000995, attention_score_distillation_loss: 0.000531 loss: 0.388383, lagrangian_loss: -0.002323, attention_score_distillation_loss: 0.000515 ---------------------------------------------------------------------- time: 2023-07-19 16:57:24 Evaluating: accuracy: 0.8777, eval_loss: 0.4877, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4463, expected_sparsity: 0.4408, expected_sequence_sparsity: 0.7807, target_sparsity: 0.434, step: 24500 lambda_1: -3.9907, lambda_2: 147.1421 lambda_3: 0.0000 train remain: [1. 1. 0.87 0.85 0.68 0.66 0.92 0.82 0.81 0.47] infer remain: [1.0, 1.0, 0.86, 0.84, 0.68, 0.66, 0.9, 0.82, 0.8, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.86, 0.72, 0.49, 0.32, 0.29, 0.24, 0.19, 0.09] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110111010011110 11111111111111111111111110111111101111110011011000 11111111111111111111111101111001010101000110000000 11111111111111111111111111110111010000000100000000 11111111111111111111111111101111111011111011111010 11111111111111111111111111111111111111001010010000 00111111111111111111111110111101111111111100110000 00011111111111110110100010011001001010000010000000 loss: 0.105796, lagrangian_loss: -0.000029, attention_score_distillation_loss: 0.000500 loss: 0.151019, lagrangian_loss: -0.002177, attention_score_distillation_loss: 0.000485 ---------------------------------------------------------------------- time: 2023-07-19 17:00:21 Evaluating: accuracy: 0.8737, eval_loss: 0.5023, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4496, expected_sparsity: 0.4437, expected_sequence_sparsity: 0.7818, target_sparsity: 0.4428, step: 25000 lambda_1: -3.5187, lambda_2: 148.9342 lambda_3: 0.0000 train remain: [1. 1. 
0.87 0.84 0.67 0.64 0.91 0.82 0.81 0.47] infer remain: [1.0, 1.0, 0.86, 0.84, 0.68, 0.64, 0.9, 0.82, 0.8, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.86, 0.72, 0.49, 0.31, 0.28, 0.23, 0.19, 0.09] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110111010011110 11111111111111111111111110111111101111110011011000 11111111111111111111111101111001010101000110000000 11111111111111111111111111110101010000000100000000 11111111111111111111111111101111111011111011111010 11111111111111111111111111111111111111001010010000 00111111111111111111111110111101111111111100110000 00011111111111110110100010011001001010000010000000 loss: 0.346406, lagrangian_loss: 0.002880, attention_score_distillation_loss: 0.000470 loss: 0.484141, lagrangian_loss: -0.003893, attention_score_distillation_loss: 0.000455 ---------------------------------------------------------------------- time: 2023-07-19 17:03:13 Evaluating: accuracy: 0.8737, eval_loss: 0.5246, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.461, expected_sparsity: 0.4563, expected_sequence_sparsity: 0.7868, target_sparsity: 0.4517, step: 25500 lambda_1: -3.3258, lambda_2: 150.9814 lambda_3: 0.0000 train remain: [1. 1. 0.86 0.83 0.67 0.63 0.89 0.81 0.81 0.46] infer remain: [1.0, 1.0, 0.86, 0.82, 0.66, 0.62, 0.88, 0.82, 0.8, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.86, 0.71, 0.47, 0.29, 0.25, 0.21, 0.17, 0.08] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110111010011110 11111111111111111111111110111111101101110011011000 11111111111111111111111101111001010101000100000000 11111111111111111111111111110101010000000000000000 11111111111111111111111111101111111011111011111000 11111111111111111111111111111111111111001010010000 00111111111111111111111110111101111111111100110000 00011111111111110110100010011001001010000010000000 loss: 0.301119, lagrangian_loss: -0.000738, attention_score_distillation_loss: 0.000440 loss: 0.268379, lagrangian_loss: -0.005348, attention_score_distillation_loss: 0.000425 ---------------------------------------------------------------------- time: 2023-07-19 17:06:08 Evaluating: accuracy: 0.8647, eval_loss: 0.5535, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4751, expected_sparsity: 0.4687, expected_sequence_sparsity: 0.7917, target_sparsity: 0.4606, step: 26000 lambda_1: -3.1859, lambda_2: 153.2161 lambda_3: 0.0000 train remain: [1. 1. 0.85 0.81 0.66 0.62 0.88 0.81 0.81 0.46] infer remain: [1.0, 1.0, 0.84, 0.8, 0.66, 0.62, 0.86, 0.8, 0.8, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.84, 0.67, 0.44, 0.27, 0.24, 0.19, 0.15, 0.07] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110101010011110 11111111111111111111111110111111001101110011011000 11111111111111111111111101111001010101000100000000 11111111111111111111111111110101010000000000000000 11111111111111111111111111101111111011111011011000 11111111111111111111111111111111111111001010000000 00111111111111111111111110111101111111111100110000 00011111111111100110100010011101001010000010000000 loss: 0.357663, lagrangian_loss: 0.000254, attention_score_distillation_loss: 0.000409 ETA: 10:04:37 | Epoch 7 finished. Took 1152.98 seconds. 
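Each evaluation block prints one 50-character 0/1 row per pruned location: a 1 marks a token bin that location keeps at inference. "infer remain" is simply the kept fraction of each row, and "layerwise remain" is the running product of those fractions (a bin dropped at one location stays dropped at all later ones), with two leading 1.0 entries for the layers outside the pruned range. A small reconstruction from the step-26000 block above; the mask list is abbreviated and the variable names are illustrative:

    from itertools import accumulate
    from operator import mul

    masks = [
        "1" * 50,  # first two locations fully open (token_prune_loc False)
        "1" * 50,
        "11111111111110111111111101111111111110101010011110",  # 42 ones -> 0.84
        # ... the remaining seven rows of the step-26000 block
    ]

    # Kept fraction per location -> the printed "infer remain" values.
    infer_remain = [m.count("1") / len(m) for m in masks]

    # Cumulative product -> "layerwise remain" (printed rounded to 2 decimals):
    # 0.84 * 0.8 * 0.66 * 0.62 * 0.86 * 0.8 * 0.8 * 0.46 ~= 0.07 at the end.
    layerwise_remain = [1.0, 1.0] + list(accumulate(infer_remain, mul))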
loss: 0.422970, lagrangian_loss: 0.008728, attention_score_distillation_loss: 0.000393 ---------------------------------------------------------------------- time: 2023-07-19 17:09:05 Evaluating: accuracy: 0.8669, eval_loss: 0.577, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4767, expected_sparsity: 0.4712, expected_sequence_sparsity: 0.7926, target_sparsity: 0.4694, step: 26500 lambda_1: -2.5899, lambda_2: 155.5544 lambda_3: 0.0000 train remain: [1. 1. 0.85 0.8 0.65 0.61 0.88 0.81 0.81 0.46] infer remain: [1.0, 1.0, 0.84, 0.8, 0.66, 0.6, 0.86, 0.8, 0.8, 0.46] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.84, 0.67, 0.44, 0.27, 0.23, 0.18, 0.15, 0.07] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110101010011110 11111111111111111111111110111111101101110010011000 11111111111111111111111101111001010101000100000000 11111111111111111111111111110001010000000000000000 11111111111111111111111111101111111011111011011000 11111111111111111111111111111111111111001010000000 00111111111111111111111110111101111111111100110000 00011111111111100111100010011001001010000010000000 loss: 0.278651, lagrangian_loss: -0.002803, attention_score_distillation_loss: 0.000379 loss: 0.507491, lagrangian_loss: 0.002914, attention_score_distillation_loss: 0.000364 ---------------------------------------------------------------------- time: 2023-07-19 17:12:01 Evaluating: accuracy: 0.8536, eval_loss: 0.5614, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4832, expected_sparsity: 0.4803, expected_sequence_sparsity: 0.7962, target_sparsity: 0.4783, step: 27000 lambda_1: -3.6136, lambda_2: 157.7836 lambda_3: 0.0000 train remain: [1. 1. 0.84 0.79 0.64 0.6 0.86 0.81 0.81 0.45] infer remain: [1.0, 1.0, 0.84, 0.78, 0.64, 0.6, 0.84, 0.8, 0.8, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.84, 0.66, 0.42, 0.25, 0.21, 0.17, 0.14, 0.06] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110101010011110 11111111111111111111111110111111001101110010011000 11111111111111111111111101111001010001000100000000 11111111111111111111111111110001010000000000000000 11111111111111111111111110101111111011111011011000 11111111111111111111111111111111111111001010000000 00111111111111111111111110111101111111111100110000 00011111111111100110100010011001001010000010000000 loss: 0.378651, lagrangian_loss: -0.003923, attention_score_distillation_loss: 0.000349 loss: 0.051046, lagrangian_loss: 0.004717, attention_score_distillation_loss: 0.000333 ---------------------------------------------------------------------- time: 2023-07-19 17:14:56 Evaluating: accuracy: 0.8675, eval_loss: 0.525, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.4914, expected_sparsity: 0.4908, expected_sequence_sparsity: 0.8003, target_sparsity: 0.4871, step: 27500 lambda_1: -5.3874, lambda_2: 160.2762 lambda_3: 0.0000 train remain: [1. 
0.99 0.83 0.77 0.64 0.59 0.84 0.8 0.81 0.45] infer remain: [1.0, 1.0, 0.82, 0.76, 0.64, 0.6, 0.82, 0.8, 0.8, 0.44] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.82, 0.62, 0.4, 0.24, 0.2, 0.16, 0.13, 0.06] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110101010001110 11111111111111111111111110111111001101010010011000 11111111111111111111111101111001010001000100000000 11111111111111111111111111110001010000000000000000 10111111111111111111111110101111111011111011011000 11111111111111111111111111111111111111001010000000 00111111111111111111111110111101111111111100110000 00011111111111100110100010011001001010000010000000 loss: 0.316514, lagrangian_loss: 0.004224, attention_score_distillation_loss: 0.000318 loss: 0.214739, lagrangian_loss: -0.000967, attention_score_distillation_loss: 0.000303 ---------------------------------------------------------------------- time: 2023-07-19 17:17:48 Evaluating: accuracy: 0.858, eval_loss: 0.5666, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.499, expected_sparsity: 0.495, expected_sequence_sparsity: 0.802, target_sparsity: 0.496, step: 28000 lambda_1: -6.6857, lambda_2: 162.8883 lambda_3: 0.0000 train remain: [1. 0.99 0.82 0.76 0.63 0.58 0.82 0.79 0.81 0.43] infer remain: [1.0, 1.0, 0.82, 0.76, 0.64, 0.58, 0.8, 0.78, 0.8, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.82, 0.62, 0.4, 0.23, 0.19, 0.14, 0.12, 0.05] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101111111111110101010001110 11111111111111111111111110111111001101010010011000 11111111111111111111111101111001010001000100000000 11111111111111111111111110110001010000000000000000 10011111111111111111111110101111111011111011011000 11111111111111111111111110111111111111001010000000 00111111111111111111111110111101111111111100110000 00011111111111100110100010011001001010000000000000 loss: 0.257862, lagrangian_loss: 0.002369, attention_score_distillation_loss: 0.000288 loss: 0.274288, lagrangian_loss: -0.005006, attention_score_distillation_loss: 0.000273 ---------------------------------------------------------------------- time: 2023-07-19 17:20:43 Evaluating: accuracy: 0.8528, eval_loss: 0.5108, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5136, expected_sparsity: 0.5074, expected_sequence_sparsity: 0.8069, target_sparsity: 0.5049, step: 28500 lambda_1: -5.1503, lambda_2: 166.8255 lambda_3: 0.0000 train remain: [1. 
0.99 0.8 0.75 0.63 0.57 0.81 0.76 0.81 0.42] infer remain: [1.0, 1.0, 0.8, 0.74, 0.62, 0.58, 0.8, 0.76, 0.8, 0.42] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.8, 0.59, 0.37, 0.21, 0.17, 0.13, 0.1, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101101111111110101010001110 11111111111111111111111110111111001101010000011000 11111111111111111111111101111001010000000100000000 11111111111111111111111110110001010000000000000000 10011111111111111111111110101111111011111011011000 10111111111111111111111110111111111111001010000000 00111111111111111111111110111101111111111100110000 00011111110111100110100010011101001010000000000000 loss: 0.348947, lagrangian_loss: -0.006852, attention_score_distillation_loss: 0.000258 loss: 0.232811, lagrangian_loss: 0.006092, attention_score_distillation_loss: 0.000242 ---------------------------------------------------------------------- time: 2023-07-19 17:23:39 Evaluating: accuracy: 0.8642, eval_loss: 0.4739, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5136, expected_sparsity: 0.5079, expected_sequence_sparsity: 0.8071, target_sparsity: 0.5137, step: 29000 lambda_1: -8.2196, lambda_2: 173.1122 lambda_3: 0.0000 train remain: [1. 0.99 0.79 0.74 0.62 0.57 0.8 0.75 0.81 0.39] infer remain: [1.0, 1.0, 0.8, 0.74, 0.62, 0.58, 0.8, 0.76, 0.8, 0.38] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.8, 0.59, 0.37, 0.21, 0.17, 0.13, 0.1, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111111101101111111110101010001110 11111111111111111111111110111111001101010000011000 11111111111111111111111101111001010000000100000000 11111111111111111111111110110100010000000000000000 10011111111111111111111110101111111011111011011000 10111111111111111111111110111111111111001010000000 00111111111111111111111110111101111111111100110000 00011111110111100010100010011001001010000000000000 loss: 0.377321, lagrangian_loss: 0.018673, attention_score_distillation_loss: 0.000227 loss: 0.348316, lagrangian_loss: -0.021290, attention_score_distillation_loss: 0.000212 ETA: 9:45:28 | Epoch 8 finished. Took 1129.39 seconds. ---------------------------------------------------------------------- time: 2023-07-19 17:26:33 Evaluating: accuracy: 0.8719, eval_loss: 0.4985, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5218, expected_sparsity: 0.5188, expected_sequence_sparsity: 0.8114, target_sparsity: 0.5226, step: 29500 lambda_1: -5.2242, lambda_2: 179.2311 lambda_3: 0.0000 train remain: [1. 
0.99 0.78 0.71 0.61 0.57 0.8 0.75 0.81 0.38] infer remain: [1.0, 1.0, 0.78, 0.72, 0.62, 0.56, 0.8, 0.74, 0.8, 0.38] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.78, 0.56, 0.35, 0.19, 0.16, 0.12, 0.09, 0.04] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011111101101111111110101010001110 11111111111111111111111110111111001001010000011000 11111111111111111111111101111001010000000100000000 11111111111111111111111110110000010000000000000000 10011111111111111111111110101111111011111011011000 10111111111111111111111110111111111111000010000000 00111111111111111111111110111101111111111100110000 10011111110111100010100010010001001010000000000000 loss: 0.610753, lagrangian_loss: -0.012237, attention_score_distillation_loss: 0.000197 loss: 0.527285, lagrangian_loss: 0.001909, attention_score_distillation_loss: 0.000182 ---------------------------------------------------------------------- time: 2023-07-19 17:29:29 Evaluating: accuracy: 0.8521, eval_loss: 0.5629, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5343, expected_sparsity: 0.5301, expected_sequence_sparsity: 0.8158, target_sparsity: 0.5314, step: 30000 lambda_1: -5.1625, lambda_2: 182.8359 lambda_3: 0.0000 train remain: [1. 0.99 0.77 0.7 0.6 0.57 0.8 0.75 0.81 0.37] infer remain: [1.0, 1.0, 0.76, 0.7, 0.6, 0.56, 0.78, 0.74, 0.8, 0.38] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.76, 0.53, 0.32, 0.18, 0.14, 0.1, 0.08, 0.03] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011111101100111111110101010001110 11111111111111111111101110111111001001010000011000 11111111111111111111111101011001010000000100000000 11111111111111111111111110110000010000000000000000 10011111111111111111111110101111111011110011011000 10111111111111111111111110111111111111000010000000 00111111111111111111111110111101111111111100110000 00011111110111100011100010010001001010000000000000 loss: 0.346217, lagrangian_loss: 0.020005, attention_score_distillation_loss: 0.000167 loss: 0.515052, lagrangian_loss: 0.004408, attention_score_distillation_loss: 0.000151 ---------------------------------------------------------------------- time: 2023-07-19 17:32:23 Evaluating: accuracy: 0.8349, eval_loss: 0.507, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5474, expected_sparsity: 0.543, expected_sequence_sparsity: 0.8209, target_sparsity: 0.5403, step: 30500 lambda_1: -6.9070, lambda_2: 184.1763 lambda_3: 0.0000 train remain: [0.99 0.98 0.76 0.69 0.58 0.57 0.8 0.75 0.81 0.37] infer remain: [1.0, 0.96, 0.76, 0.7, 0.58, 0.56, 0.78, 0.74, 0.8, 0.36] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.73, 0.51, 0.3, 0.17, 0.13, 0.1, 0.08, 0.03] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011111101100111111110101010001110 11111111111111111111101110111111001001010000011000 11111111111111111111111100011001010000000100000000 11111111111111111111111110110000010000000000000000 10011111111111111111111110101111111011110011011000 10111111111111111111111110111111111111000010000000 00111111111111111111111110111101111111111100110000 00011111110111100010100010010001001010000000000000 loss: 0.512406, lagrangian_loss: 0.009849, attention_score_distillation_loss: 0.000136 loss: 0.502054, lagrangian_loss: 0.003590, attention_score_distillation_loss: 0.000121 
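The token_prune_loc flags track the same masks: a location is reported True as soon as its deterministic inference mask keeps fewer than all of its bins, which is exactly what happens to the second location between steps 30000 and 30500 above (its ratio drops from 1.0 to 0.96). A one-line reconstruction from the printed "infer remain" vector (illustrative helper, not the repo's code):

    def prune_flags(infer_remain):
        # True once a location keeps < 100% of its token bins.
        return [r < 1.0 for r in infer_remain]

    # Step-30500 block above:
    assert prune_flags([1.0, 0.96, 0.76, 0.7, 0.58, 0.56, 0.78, 0.74, 0.8, 0.36]) == [
        False, True, True, True, True, True, True, True, True, True]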
---------------------------------------------------------------------- time: 2023-07-19 17:35:18 Evaluating: accuracy: 0.8224, eval_loss: 0.5701, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5555, expected_sparsity: 0.5505, expected_sequence_sparsity: 0.8239, target_sparsity: 0.5491, step: 31000 lambda_1: -6.6193, lambda_2: 185.4964 lambda_3: 0.0000 train remain: [0.99 0.98 0.73 0.68 0.57 0.56 0.79 0.75 0.8 0.35] infer remain: [1.0, 0.96, 0.74, 0.68, 0.58, 0.56, 0.78, 0.74, 0.8, 0.34] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.71, 0.48, 0.28, 0.16, 0.12, 0.09, 0.07, 0.02] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011111101100111111010101010001110 11111111111111111111101110011111001001010000011000 11111111111111111111111100111001010000000000000000 11111111111111111111111110110000010000000000000000 10011111111111111111111110101111111011110011011000 10111111111111111111111110111111111111000010000000 00111111111111111111111110111101111111111100110000 00011111110111100010100010010001000010000000000000 loss: 0.317155, lagrangian_loss: 0.003263, attention_score_distillation_loss: 0.000106 loss: 0.637762, lagrangian_loss: -0.005628, attention_score_distillation_loss: 0.000091 ---------------------------------------------------------------------- time: 2023-07-19 17:38:16 Evaluating: accuracy: 0.845, eval_loss: 0.5345, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5636, expected_sparsity: 0.5597, expected_sequence_sparsity: 0.8275, target_sparsity: 0.558, step: 31500 lambda_1: -3.5193, lambda_2: 187.7870 lambda_3: 0.0000 train remain: [0.99 0.97 0.72 0.67 0.56 0.55 0.79 0.75 0.8 0.34] infer remain: [1.0, 0.96, 0.72, 0.66, 0.56, 0.56, 0.78, 0.74, 0.8, 0.34] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.69, 0.46, 0.26, 0.14, 0.11, 0.08, 0.07, 0.02] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011101101100111111010101010001110 11111111111111111111101110011111001001010000010000 11111111111111111111111100011001010000000000000000 11111111111111111111111110110000010000000000000000 10011111111111111111111110101111111011110011011000 10111111111111111111111110111111111111000010000000 00111111111111111111111110111101111111111100110000 00011111110111100010000010010001010010000000000000 loss: 0.926725, lagrangian_loss: -0.000676, attention_score_distillation_loss: 0.000075 loss: 0.258766, lagrangian_loss: 0.002009, attention_score_distillation_loss: 0.000060 ---------------------------------------------------------------------- time: 2023-07-19 17:41:10 Evaluating: accuracy: 0.8267, eval_loss: 0.5846, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5702, expected_sparsity: 0.5651, expected_sequence_sparsity: 0.8296, target_sparsity: 0.5669, step: 32000 lambda_1: -6.5978, lambda_2: 191.7556 lambda_3: 0.0000 train remain: [0.99 0.97 0.7 0.66 0.56 0.54 0.79 0.75 0.8 0.34] infer remain: [1.0, 0.96, 0.7, 0.66, 0.56, 0.54, 0.78, 0.74, 0.8, 0.34] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.67, 0.44, 0.25, 0.13, 0.1, 0.08, 0.06, 0.02] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011010101100111111010101010001110 11111111111111111111101110011111001001010000010000 11111111111111111111111100011001010000000000000000 
10111111111111111111111110110000010000000000000000 10011111111111111111111110101111111011110011011000 10111111111111111111111110111111111111000010000000 00111111111111111111111110111101111111111100110000 00011111110111100011000010010001000010000000000000 loss: 0.488409, lagrangian_loss: -0.000363, attention_score_distillation_loss: 0.000045 loss: 0.232396, lagrangian_loss: -0.003830, attention_score_distillation_loss: 0.000030 ---------------------------------------------------------------------- time: 2023-07-19 17:44:07 Evaluating: accuracy: 0.8444, eval_loss: 0.5212, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5815, expected_sparsity: 0.5765, expected_sequence_sparsity: 0.8341, target_sparsity: 0.5757, step: 32500 lambda_1: -7.1471, lambda_2: 194.2055 lambda_3: 0.0000 train remain: [0.99 0.97 0.66 0.65 0.55 0.53 0.79 0.75 0.79 0.33] infer remain: [1.0, 0.96, 0.66, 0.66, 0.54, 0.52, 0.78, 0.74, 0.78, 0.34] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.63, 0.42, 0.23, 0.12, 0.09, 0.07, 0.05, 0.02] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111111010100010001110 11111111111111111111101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10111111111111111111011110110000010000000000000000 10011111111111111111111110101111111011110011011000 10111111111111111111111110111111111111000010000000 00111111111111111111111110111101011111111100110000 00011111110111100010000010010001010010000000000000 loss: 0.309577, lagrangian_loss: -0.004971, attention_score_distillation_loss: 0.000020 ETA: 9:27:56 | Epoch 9 finished. Took 1160.01 seconds. loss: 0.264256, lagrangian_loss: -0.007043, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 17:47:02 Evaluating: accuracy: 0.8422, eval_loss: 0.6016, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5848, expected_sparsity: 0.5796, expected_sequence_sparsity: 0.8353, target_sparsity: 0.58, step: 33000 lambda_1: -2.7647, lambda_2: 197.9735 lambda_3: 0.0000 train remain: [0.99 0.97 0.65 0.65 0.54 0.52 0.79 0.72 0.77 0.32] infer remain: [1.0, 0.96, 0.66, 0.64, 0.54, 0.52, 0.78, 0.72, 0.76, 0.32] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.63, 0.41, 0.22, 0.11, 0.09, 0.06, 0.05, 0.02] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111111010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10111111111111111111011110110000010000000000000000 10011111111111111111111110101111111011110011011000 10111111111111111111111110111111011111000010000000 00011111111111111111111110111101011111111100110000 10011110110111100010000010010001000010000000000000 loss: 0.353784, lagrangian_loss: -0.006780, attention_score_distillation_loss: 0.000020 loss: 0.238050, lagrangian_loss: 0.000078, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 17:49:55 Evaluating: accuracy: 0.8398, eval_loss: 0.618, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5848, expected_sparsity: 0.5796, expected_sequence_sparsity: 0.8353, target_sparsity: 0.58, step: 33500 lambda_1: -0.4443, lambda_2: 201.4185 lambda_3: 0.0000 train remain: [0.99 0.97 0.65 0.65 
0.55 0.52 0.79 0.72 0.77 0.32] infer remain: [1.0, 0.96, 0.66, 0.64, 0.54, 0.52, 0.78, 0.72, 0.76, 0.32] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.63, 0.41, 0.22, 0.11, 0.09, 0.06, 0.05, 0.02] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111111010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10111111111111111111011110110000010000000000000000 10011111111111111111111110101111111011110011011000 10111111111111111111111110111111011111000010000000 00011111111111111111111110111101011111111100110000 10011110110111100010000010010001000010000000000000 loss: 0.460482, lagrangian_loss: 0.000811, attention_score_distillation_loss: 0.000020 loss: 0.726788, lagrangian_loss: -0.000822, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 17:52:50 Evaluating: accuracy: 0.8567, eval_loss: 0.564, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5848, expected_sparsity: 0.5799, expected_sequence_sparsity: 0.8355, target_sparsity: 0.58, step: 34000 lambda_1: -2.8366, lambda_2: 203.4159 lambda_3: 0.0000 train remain: [0.99 0.97 0.65 0.65 0.54 0.52 0.79 0.71 0.77 0.32] infer remain: [1.0, 0.96, 0.66, 0.64, 0.54, 0.52, 0.78, 0.7, 0.76, 0.32] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.63, 0.41, 0.22, 0.11, 0.09, 0.06, 0.05, 0.02] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111111010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10111111111111111111011110110000010000000000000000 10011111111111111111111110101111111011110011011000 10011111111111111111111110111111011111000010000000 00011111111111111111111110111101011111111100110000 10001110110111100011000010010001000010000000000000 loss: 0.169909, lagrangian_loss: -0.000294, attention_score_distillation_loss: 0.000020 loss: 0.483647, lagrangian_loss: -0.001945, attention_score_distillation_loss: 0.000020 Starting saving the best from epoch 10 and step 34500 ---------------------------------------------------------------------- time: 2023-07-19 17:55:45 Evaluating: accuracy: 0.8486, eval_loss: 0.614, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5924, expected_sparsity: 0.5841, expected_sequence_sparsity: 0.8371, target_sparsity: 0.58, step: 34500 lambda_1: -0.5275, lambda_2: 205.3826 lambda_3: 0.0000 train remain: [0.99 0.97 0.65 0.65 0.54 0.52 0.79 0.69 0.77 0.32] infer remain: [1.0, 0.96, 0.64, 0.64, 0.54, 0.52, 0.78, 0.68, 0.76, 0.32] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.61, 0.39, 0.21, 0.11, 0.09, 0.06, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10111111111111111111011110110000010000000000000000 10011111111111111111111110101111111011110011011000 10011111111111111011111110111111011111000010000000 00011111111111111111111110111101011111111100110000 10001110110111100010000010010001000010010000000000 Saving the best model so far: [Epoch 10 | Step: 34500 | MACs sparsity: 0.5924 | Score: 0.8486 | Loss: 0.614] loss: 0.228054, lagrangian_loss: -0.000191, 
attention_score_distillation_loss: 0.000020 loss: 0.320713, lagrangian_loss: 0.002661, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 17:59:09 Evaluating: accuracy: 0.8603, eval_loss: 0.5008, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5924, expected_sparsity: 0.5841, expected_sequence_sparsity: 0.8371, target_sparsity: 0.58, step: 35000 lambda_1: -2.2603, lambda_2: 207.2545 lambda_3: 0.0000 train remain: [0.99 0.97 0.65 0.65 0.54 0.51 0.79 0.68 0.77 0.32] infer remain: [1.0, 0.96, 0.64, 0.64, 0.54, 0.52, 0.78, 0.68, 0.76, 0.32] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.61, 0.39, 0.21, 0.11, 0.09, 0.06, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10111111111111111111011110110000010000000000000000 10011111111111111111111110101111111011110011011000 10011111111111111011111110111111011111000010000000 00011111111111111111111110111101011111111100110000 10001110110111100010000010010001000010010000000000 Best eval score so far: 0.8486 @ step 34500 epoch 10.54 Saving the best model so far: [Epoch 10 | Step: 35000 | MACs sparsity: 0.5924 | Score: 0.8603 | Loss: 0.5008] loss: 0.792879, lagrangian_loss: -0.000407, attention_score_distillation_loss: 0.000020 loss: 0.256392, lagrangian_loss: 0.002997, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:02:25 Evaluating: accuracy: 0.8497, eval_loss: 0.5694, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5924, expected_sparsity: 0.5841, expected_sequence_sparsity: 0.8371, target_sparsity: 0.58, step: 35500 lambda_1: -1.2645, lambda_2: 209.4315 lambda_3: 0.0000 train remain: [0.99 0.98 0.65 0.65 0.54 0.52 0.79 0.68 0.77 0.32] infer remain: [1.0, 0.96, 0.64, 0.64, 0.54, 0.52, 0.78, 0.68, 0.76, 0.32] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.61, 0.39, 0.21, 0.11, 0.09, 0.06, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10111111111111111111011110110000010000000000000000 10011111111111111111111110101111111011110011011000 10011111110111111011111110111111011111010010000000 00011111111111111111111110111101011111111100110000 10001110110111100010000010010001010010000000000000 Best eval score so far: 0.8603 @ step 35000 epoch 10.69 loss: 0.739461, lagrangian_loss: 0.002644, attention_score_distillation_loss: 0.000020 loss: 0.184049, lagrangian_loss: 0.002549, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:05:19 Evaluating: accuracy: 0.8422, eval_loss: 0.5984, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5924, expected_sparsity: 0.5851, expected_sequence_sparsity: 0.8375, target_sparsity: 0.58, step: 36000 lambda_1: -2.1346, lambda_2: 211.9169 lambda_3: 0.0000 train remain: [0.99 0.98 0.65 0.65 0.54 0.51 0.78 0.68 0.77 0.32] infer remain: [1.0, 0.96, 0.64, 0.64, 0.54, 0.5, 0.78, 0.68, 0.76, 0.32] layerwise remain: [1.0, 1.0, 
1.0, 0.96, 0.61, 0.39, 0.21, 0.11, 0.08, 0.06, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111111111111111111110101111111011110011011000 10011111110111111011111110111111011111010010000000 00011111111111111111111110111101011111111100110000 10001110110111100010000010010001010010000000000000 Best eval score so far: 0.8603 @ step 35000 epoch 10.69 loss: 0.325972, lagrangian_loss: -0.001897, attention_score_distillation_loss: 0.000020 ETA: 9:12:11 | Epoch 10 finished. Took 1208.5 seconds. loss: 0.209630, lagrangian_loss: 0.000010, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:08:15 Evaluating: accuracy: 0.8537, eval_loss: 0.5697, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.5891, expected_sparsity: 0.5832, expected_sequence_sparsity: 0.8368, target_sparsity: 0.58, step: 36500 lambda_1: -0.7173, lambda_2: 215.0196 lambda_3: 0.0000 train remain: [0.99 0.97 0.65 0.65 0.54 0.51 0.78 0.68 0.77 0.32] infer remain: [1.0, 0.96, 0.64, 0.66, 0.54, 0.5, 0.76, 0.68, 0.76, 0.32] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.61, 0.41, 0.22, 0.11, 0.08, 0.06, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111111101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111111111111111111110101101111011110011011000 10011111110111111011111110111111011111010010000000 00011111111111111111111110111101011111111100110000 10001110110111100010000010010001000011000000000000 Best eval score so far: 0.8603 @ step 35000 epoch 10.69 loss: 0.317730, lagrangian_loss: -0.000117, attention_score_distillation_loss: 0.000020 loss: 0.233008, lagrangian_loss: 0.000033, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:11:10 Evaluating: accuracy: 0.8386, eval_loss: 0.6306, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.594, expected_sparsity: 0.5856, expected_sequence_sparsity: 0.8377, target_sparsity: 0.58, step: 37000 lambda_1: -0.8456, lambda_2: 217.3490 lambda_3: 0.0000 train remain: [0.99 0.98 0.65 0.65 0.54 0.51 0.77 0.68 0.76 0.31] infer remain: [1.0, 0.96, 0.64, 0.64, 0.54, 0.5, 0.76, 0.68, 0.76, 0.32] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.61, 0.39, 0.21, 0.11, 0.08, 0.05, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111111111111111111110101101111011110011011000 10011111110111111011111110111111011111010010000000 00011111111111111111111110111101011111111100110000 10001110110111100010000010010001000011000000000000 Best eval score so far: 0.8603 @ step 35000 epoch 10.69 loss: 0.078040, lagrangian_loss: -0.000279, attention_score_distillation_loss: 0.000020 loss: 0.136115, lagrangian_loss: 0.000143, 
attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:14:05 Evaluating: accuracy: 0.8517, eval_loss: 0.6361, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.594, expected_sparsity: 0.5858, expected_sequence_sparsity: 0.8378, target_sparsity: 0.58, step: 37500 lambda_1: -1.5742, lambda_2: 219.3876 lambda_3: 0.0000 train remain: [0.99 0.98 0.65 0.65 0.54 0.51 0.76 0.67 0.75 0.3 ] infer remain: [1.0, 0.96, 0.64, 0.64, 0.54, 0.5, 0.76, 0.68, 0.74, 0.3] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.61, 0.39, 0.21, 0.11, 0.08, 0.05, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111111111111111111110101101111011110011011000 10011111110111111011111110111111011111010010000000 00011111110111111111111110111101011111111100110000 10001110110111100010000010010001000010000000000000 Best eval score so far: 0.8603 @ step 35000 epoch 10.69 loss: 0.263593, lagrangian_loss: 0.000968, attention_score_distillation_loss: 0.000020 loss: 0.544657, lagrangian_loss: 0.000075, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:16:59 Evaluating: accuracy: 0.8539, eval_loss: 0.6166, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.594, expected_sparsity: 0.5862, expected_sequence_sparsity: 0.8379, target_sparsity: 0.58, step: 38000 lambda_1: -0.7117, lambda_2: 221.9220 lambda_3: 0.0000 train remain: [1. 0.98 0.65 0.65 0.54 0.51 0.76 0.68 0.74 0.3 ] infer remain: [1.0, 0.96, 0.64, 0.64, 0.54, 0.5, 0.74, 0.68, 0.74, 0.3] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.61, 0.39, 0.21, 0.11, 0.08, 0.05, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101111011110011011000 10011111110111111011111110111111011111010010000000 00011111110111111111111110111101011111111100110000 10001110110111101010000010010001000000000000000000 Best eval score so far: 0.8603 @ step 35000 epoch 10.69 loss: 0.066888, lagrangian_loss: 0.002472, attention_score_distillation_loss: 0.000020 loss: 0.182690, lagrangian_loss: -0.000801, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:19:55 Evaluating: accuracy: 0.8645, eval_loss: 0.5175, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.594, expected_sparsity: 0.5862, expected_sequence_sparsity: 0.8379, target_sparsity: 0.58, step: 38500 lambda_1: -1.0725, lambda_2: 224.8803 lambda_3: 0.0000 train remain: [1. 
0.98 0.65 0.65 0.54 0.51 0.75 0.67 0.74 0.3 ] infer remain: [1.0, 0.96, 0.64, 0.64, 0.54, 0.5, 0.74, 0.68, 0.74, 0.3] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.61, 0.39, 0.21, 0.11, 0.08, 0.05, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101111011110011011000 10011111110111111011111110111111011111010010000000 00011111110111111111111110111101011111111100110000 10001110110111100010000010010101000000000000000000 Best eval score so far: 0.8603 @ step 35000 epoch 10.69 Saving the best model so far: [Epoch 11 | Step: 38500 | MACs sparsity: 0.594 | Score: 0.8645 | Loss: 0.5175] loss: 0.222608, lagrangian_loss: -0.000724, attention_score_distillation_loss: 0.000020 loss: 0.337309, lagrangian_loss: 0.000538, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:23:13 Evaluating: accuracy: 0.8534, eval_loss: 0.4947, token_prune_loc: [False, True, True, True, True, True, True, True, True, True], macs_sparsity: 0.594, expected_sparsity: 0.5862, expected_sequence_sparsity: 0.8379, target_sparsity: 0.58, step: 39000 lambda_1: -0.6924, lambda_2: 227.3504 lambda_3: 0.0000 train remain: [1. 0.98 0.65 0.64 0.54 0.51 0.74 0.67 0.73 0.3 ] infer remain: [1.0, 0.96, 0.64, 0.64, 0.54, 0.5, 0.74, 0.68, 0.74, 0.3] layerwise remain: [1.0, 1.0, 1.0, 0.96, 0.61, 0.39, 0.21, 0.11, 0.08, 0.05, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 10111111111111111111111111111111111111111111111110 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011111110011011000 10011111110111111011111110111111011111010010000000 00011111110111111111111110111101011111111100110000 00001110110111100010000010010101000000010000000000 Best eval score so far: 0.8645 @ step 38500 epoch 11.76 loss: 0.250423, lagrangian_loss: 0.000850, attention_score_distillation_loss: 0.000020 loss: 0.126676, lagrangian_loss: -0.000397, attention_score_distillation_loss: 0.000020 ETA: 8:53:39 | Epoch 11 finished. Took 1155.55 seconds. ---------------------------------------------------------------------- time: 2023-07-19 18:26:10 Evaluating: accuracy: 0.8713, eval_loss: 0.5835, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5826, expected_sparsity: 0.5779, expected_sequence_sparsity: 0.8347, target_sparsity: 0.58, step: 39500 lambda_1: -0.2773, lambda_2: 229.9755 lambda_3: 0.0000 train remain: [1. 
0.98 0.65 0.64 0.54 0.51 0.75 0.67 0.73 0.29] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.74, 0.68, 0.72, 0.3] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.06, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011111110011011000 10011111110111111011111110111111011111010010000000 00011111110111111111111110011101011111111100110000 10001110110111100010000010010001000000010000000000 Best eval score so far: 0.8645 @ step 38500 epoch 11.76 Saving the best model so far: [Epoch 12 | Step: 39500 | MACs sparsity: 0.5826 | Score: 0.8713 | Loss: 0.5835] loss: 0.288629, lagrangian_loss: 0.002659, attention_score_distillation_loss: 0.000020 loss: 0.233647, lagrangian_loss: 0.002637, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:29:18 Evaluating: accuracy: 0.8737, eval_loss: 0.5163, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5788, expected_sequence_sparsity: 0.835, target_sparsity: 0.58, step: 40000 lambda_1: -0.9066, lambda_2: 233.1449 lambda_3: 0.0000 train remain: [1. 0.98 0.65 0.64 0.54 0.51 0.74 0.66 0.7 0.28] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.7, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111010010000000 00011111110111111011111110011101011111111100110000 10001110110111100010000010010001000000000000000000 Best eval score so far: 0.8713 @ step 39500 epoch 12.06 Saving the best model so far: [Epoch 12 | Step: 40000 | MACs sparsity: 0.5842 | Score: 0.8737 | Loss: 0.5163] loss: 0.081747, lagrangian_loss: -0.000359, attention_score_distillation_loss: 0.000020 loss: 0.485934, lagrangian_loss: -0.000194, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:32:32 Evaluating: accuracy: 0.8731, eval_loss: 0.5205, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.579, expected_sequence_sparsity: 0.8351, target_sparsity: 0.58, step: 40500 lambda_1: -1.6143, lambda_2: 235.9640 lambda_3: 0.0000 train remain: [1. 
0.99 0.65 0.64 0.54 0.51 0.75 0.66 0.69 0.28] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.68, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111010010000000 00011111110111111011111110011101011111111100100000 10001110110111100010000010010001000000000000000000 Best eval score so far: 0.8737 @ step 40000 epoch 12.22 loss: 0.287362, lagrangian_loss: 0.007187, attention_score_distillation_loss: 0.000020 loss: 0.299514, lagrangian_loss: -0.001372, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:35:29 Evaluating: accuracy: 0.8706, eval_loss: 0.5592, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.579, expected_sequence_sparsity: 0.8351, target_sparsity: 0.58, step: 41000 lambda_1: -0.1922, lambda_2: 238.8374 lambda_3: 0.0000 train remain: [1. 0.99 0.65 0.64 0.54 0.5 0.75 0.66 0.68 0.28] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.68, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.04, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100011001000000000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111010010000000 00011111110111111011111110011101011111111100100000 10001110110110100010000010010101000000000000000000 Best eval score so far: 0.8737 @ step 40000 epoch 12.22 loss: 0.325905, lagrangian_loss: -0.000034, attention_score_distillation_loss: 0.000020 loss: 0.054371, lagrangian_loss: 0.001815, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:38:26 Evaluating: accuracy: 0.8689, eval_loss: 0.5074, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5791, expected_sequence_sparsity: 0.8351, target_sparsity: 0.58, step: 41500 lambda_1: -0.3139, lambda_2: 241.9311 lambda_3: 0.0000 train remain: [1. 
0.99 0.65 0.64 0.54 0.5 0.74 0.66 0.66 0.29] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.66, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100010001000000000010000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111010010000000 00011111110111101011111110011101011111111100100000 10001110110110100010000010010101000000000000000000 Best eval score so far: 0.8737 @ step 40000 epoch 12.22 loss: 0.440043, lagrangian_loss: -0.000070, attention_score_distillation_loss: 0.000020 loss: 0.436326, lagrangian_loss: -0.000101, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:41:21 Evaluating: accuracy: 0.8737, eval_loss: 0.5319, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5791, expected_sequence_sparsity: 0.8351, target_sparsity: 0.58, step: 42000 lambda_1: -0.5984, lambda_2: 244.5685 lambda_3: 0.0000 train remain: [1. 0.99 0.65 0.64 0.54 0.51 0.75 0.66 0.66 0.29] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.66, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100010001000010000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111010010000000 00011111110111101011111010011101011111111110100000 10001110110110100010000010010101000000000000000000 Best eval score so far: 0.8737 @ step 40000 epoch 12.22 loss: 0.062840, lagrangian_loss: 0.000442, attention_score_distillation_loss: 0.000020 loss: 0.190441, lagrangian_loss: 0.000324, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:44:15 Evaluating: accuracy: 0.8457, eval_loss: 0.5385, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5791, expected_sequence_sparsity: 0.8351, target_sparsity: 0.58, step: 42500 lambda_1: -0.6183, lambda_2: 247.5235 lambda_3: 0.0000 train remain: [1. 
0.99 0.65 0.64 0.54 0.51 0.75 0.66 0.65 0.31] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.66, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100010001000010000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111000110000000 00011111110111101011111010011101011111111110100000 00001110110110100011000010010101000000000000000000 Best eval score so far: 0.8737 @ step 40000 epoch 12.22 loss: 0.480059, lagrangian_loss: 0.000144, attention_score_distillation_loss: 0.000020 ETA: 8:36:20 | Epoch 12 finished. Took 1193.76 seconds. loss: 0.170166, lagrangian_loss: 0.000170, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:47:09 Evaluating: accuracy: 0.8788, eval_loss: 0.5086, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5791, expected_sequence_sparsity: 0.8351, target_sparsity: 0.58, step: 43000 lambda_1: -0.1994, lambda_2: 250.0612 lambda_3: 0.0000 train remain: [1. 0.99 0.65 0.64 0.54 0.51 0.75 0.66 0.65 0.37] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.66, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100010001000000000010000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111000110000000 00011111110111101011111010011101011111111110100000 10001110110110100011000010010001000000000000000000 Best eval score so far: 0.8737 @ step 40000 epoch 12.22 Saving the best model so far: [Epoch 13 | Step: 43000 | MACs sparsity: 0.5858 | Score: 0.8788 | Loss: 0.5086] loss: 0.040995, lagrangian_loss: 0.000592, attention_score_distillation_loss: 0.000020 loss: 0.251927, lagrangian_loss: 0.000451, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:50:28 Evaluating: accuracy: 0.8616, eval_loss: 0.5874, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5791, expected_sequence_sparsity: 0.8351, target_sparsity: 0.58, step: 43500 lambda_1: -0.5277, lambda_2: 253.0344 lambda_3: 0.0000 train remain: [1. 
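The "Saving the best model so far" message above fires when an evaluation beats the running best (here 0.8788 at step 43000 supersedes 0.8737 from step 40000). A plausible reconstruction of that bookkeeping, with hypothetical names:

class BestModelTracker:
    # Hypothetical helper mirroring the log: remember the best eval score
    # and write a checkpoint only when it improves.
    def __init__(self):
        self.best_score = float("-inf")
        self.best_step = None

    def update(self, score, step, epoch, macs_sparsity, loss, save_fn):
        if score > self.best_score:
            self.best_score, self.best_step = score, step
            print(f"Saving the best model so far: [Epoch {epoch} | Step: {step} | "
                  f"MACs sparsity: {macs_sparsity} | Score: {score} | Loss: {loss}]")
            save_fn()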
0.99 0.65 0.64 0.54 0.51 0.75 0.66 0.65 0.42] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.66, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111100010001000000000000000100 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111010010000000 00011111110111101011111010011101011111111110100000 00001110110110100011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.253333, lagrangian_loss: -0.000115, attention_score_distillation_loss: 0.000020 loss: 0.036641, lagrangian_loss: 0.000462, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:53:25 Evaluating: accuracy: 0.8766, eval_loss: 0.5508, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5791, expected_sequence_sparsity: 0.8351, target_sparsity: 0.58, step: 44000 lambda_1: -0.1536, lambda_2: 255.9957 lambda_3: 0.0000 train remain: [1. 0.99 0.64 0.64 0.54 0.51 0.75 0.66 0.65 0.38] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.66, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110000010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111010010000000 00011111110111101011111010011101011111111110100000 10001110110110100010000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.310916, lagrangian_loss: 0.000239, attention_score_distillation_loss: 0.000020 loss: 0.180653, lagrangian_loss: 0.000746, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:56:20 Evaluating: accuracy: 0.8662, eval_loss: 0.537, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5792, expected_sequence_sparsity: 0.8352, target_sparsity: 0.58, step: 44500 lambda_1: -0.3324, lambda_2: 258.6468 lambda_3: 0.0000 train remain: [1. 
0.99 0.64 0.64 0.54 0.51 0.74 0.66 0.64 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.64, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111011011110110100010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111010010000000 00011111110111101011111010011101011111111100100000 00001110110110100010000010010101000001000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.184043, lagrangian_loss: 0.000094, attention_score_distillation_loss: 0.000020 loss: 0.409894, lagrangian_loss: 0.001817, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 18:59:17 Evaluating: accuracy: 0.8656, eval_loss: 0.5098, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5792, expected_sequence_sparsity: 0.8352, target_sparsity: 0.58, step: 45000 lambda_1: -0.1056, lambda_2: 261.8033 lambda_3: 0.0000 train remain: [1. 0.99 0.64 0.64 0.54 0.51 0.74 0.65 0.63 0.45] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.72, 0.66, 0.64, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111011011110110100010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111010010000000 00011111110111101011111010011101011101111110100000 00001110110110100011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.092437, lagrangian_loss: 0.000001, attention_score_distillation_loss: 0.000020 loss: 0.053508, lagrangian_loss: 0.000077, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:02:12 Evaluating: accuracy: 0.877, eval_loss: 0.5153, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5786, expected_sequence_sparsity: 0.835, target_sparsity: 0.58, step: 45500 lambda_1: -0.4928, lambda_2: 264.7666 lambda_3: 0.0000 train remain: [1. 
0.99 0.64 0.64 0.54 0.52 0.74 0.64 0.63 0.43] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.72, 0.64, 0.62, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101101110011111001001010000010000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111000010000000 00011111110111101011111010011101011101111100100000 10001110110110100010000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.382647, lagrangian_loss: -0.000054, attention_score_distillation_loss: 0.000020 loss: 0.238657, lagrangian_loss: 0.001680, attention_score_distillation_loss: 0.000020 ETA: 8:17:35 | Epoch 13 finished. Took 1159.43 seconds. ---------------------------------------------------------------------- time: 2023-07-19 19:05:09 Evaluating: accuracy: 0.8633, eval_loss: 0.4926, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5786, expected_sequence_sparsity: 0.835, target_sparsity: 0.58, step: 46000 lambda_1: -0.9284, lambda_2: 267.9405 lambda_3: 0.0000 train remain: [1. 0.99 0.64 0.64 0.55 0.52 0.73 0.64 0.61 0.45] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.72, 0.64, 0.62, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111000010000000 00011111110111101011111010011101011101111100100000 00001110110110100011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.109442, lagrangian_loss: -0.000188, attention_score_distillation_loss: 0.000020 loss: 0.059597, lagrangian_loss: -0.000030, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:08:05 Evaluating: accuracy: 0.8733, eval_loss: 0.5329, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5786, expected_sequence_sparsity: 0.835, target_sparsity: 0.58, step: 46500 lambda_1: -0.3270, lambda_2: 270.8869 lambda_3: 0.0000 train remain: [1. 
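A useful invariant for reading these reports: layerwise remain is the running product of the per-layer infer remain ratios, prefixed with 1.0 entries for the unpruned front of the network, since a token dropped at one layer stays dropped at every later layer. Checking this against the ratios printed for step 45500 above:

import numpy as np

# "infer remain" per prunable layer, as logged at step 45500.
infer_remain = [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.72, 0.64, 0.62, 0.28]
layerwise = [1.0, 1.0] + [float(round(v, 2)) for v in np.cumprod(infer_remain)]
print(layerwise)
# [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01]
# which matches the logged "layerwise remain" for that step.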
0.99 0.64 0.64 0.55 0.52 0.73 0.64 0.62 0.49] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.72, 0.64, 0.62, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111000010000000 00011111110110101011111010011101011101111110100000 00001110110110101010000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.071506, lagrangian_loss: 0.000188, attention_score_distillation_loss: 0.000020 loss: 0.331254, lagrangian_loss: 0.000519, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:11:00 Evaluating: accuracy: 0.875, eval_loss: 0.515, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5788, expected_sequence_sparsity: 0.835, target_sparsity: 0.58, step: 47000 lambda_1: -0.4275, lambda_2: 273.4706 lambda_3: 0.0000 train remain: [1. 0.99 0.64 0.64 0.55 0.52 0.73 0.64 0.6 0.44] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.72, 0.64, 0.6, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011001101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111000010000000 00011111110110101011111010011101011101111100100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.367759, lagrangian_loss: 0.002024, attention_score_distillation_loss: 0.000020 loss: 0.027904, lagrangian_loss: 0.001765, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:13:56 Evaluating: accuracy: 0.8693, eval_loss: 0.5222, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5788, expected_sequence_sparsity: 0.835, target_sparsity: 0.58, step: 47500 lambda_1: -0.8592, lambda_2: 276.1123 lambda_3: 0.0000 train remain: [1. 
0.99 0.64 0.63 0.54 0.52 0.73 0.64 0.6 0.36] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.72, 0.64, 0.6, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010101010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111111011111110111101011111000010000000 00011111110110101011111010011101011101110110100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.039616, lagrangian_loss: -0.000269, attention_score_distillation_loss: 0.000020 loss: 0.116070, lagrangian_loss: -0.000028, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:16:52 Evaluating: accuracy: 0.873, eval_loss: 0.5689, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5788, expected_sequence_sparsity: 0.835, target_sparsity: 0.58, step: 48000 lambda_1: -0.3408, lambda_2: 278.9204 lambda_3: 0.0000 train remain: [1. 0.99 0.64 0.64 0.55 0.52 0.73 0.64 0.6 0.41] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.72, 0.64, 0.6, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010101010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111101011111110111101011111000010100000 00011111110110101011111010011101011101110110100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.164379, lagrangian_loss: 0.000214, attention_score_distillation_loss: 0.000020 loss: 0.077955, lagrangian_loss: 0.000890, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:19:47 Evaluating: accuracy: 0.8691, eval_loss: 0.5247, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5788, expected_sequence_sparsity: 0.835, target_sparsity: 0.58, step: 48500 lambda_1: -0.7698, lambda_2: 281.7541 lambda_3: 0.0000 train remain: [1. 
0.99 0.64 0.63 0.55 0.52 0.73 0.64 0.6 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.72, 0.64, 0.6, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10011111110111111111111110101101011011110011011000 10011111110111101011111110111101011111000011000000 00011111110110101011111010011101011101110110100000 00001110110110101010000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.130931, lagrangian_loss: 0.000093, attention_score_distillation_loss: 0.000020 loss: 0.027282, lagrangian_loss: -0.000181, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:22:41 Evaluating: accuracy: 0.8737, eval_loss: 0.5521, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5792, expected_sequence_sparsity: 0.8352, target_sparsity: 0.58, step: 49000 lambda_1: -0.4425, lambda_2: 284.4858 lambda_3: 0.0000 train remain: [1. 0.99 0.64 0.64 0.55 0.52 0.72 0.64 0.6 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.7, 0.64, 0.6, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10011111110111111111111110101101011001110011011000 10011111110111101011111110111101011111010010000000 00011111110110101011111010011101011101110110100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.134201, lagrangian_loss: 0.000294, attention_score_distillation_loss: 0.000020 ETA: 7:58:47 | Epoch 14 finished. Took 1160.45 seconds. loss: 0.148353, lagrangian_loss: -0.000181, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:25:36 Evaluating: accuracy: 0.8627, eval_loss: 0.5654, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5792, expected_sequence_sparsity: 0.8352, target_sparsity: 0.58, step: 49500 lambda_1: -0.2822, lambda_2: 287.3346 lambda_3: 0.0000 train remain: [1. 
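Each block of ten 0/1 rows printed with an evaluation is the deterministic keep/drop mask over the 50 token bins at one prunable layer, and the corresponding infer remain entry is simply that row's mean. A sketch of the eval-time binarization, assuming hard-concrete-style gate logits (the exact deterministic gate in L0 implementations also involves the stretch interval and temperature, omitted here):

import torch

def binarize_token_gates(token_loga, threshold=0.5):
    # token_loga: [num_prunable_layers, num_bins] gate log-alphas.
    # At evaluation the stochastic L0 gates are replaced by a hard 0/1
    # mask; each printed row of ones and zeros is one layer's mask, and
    # its mean is that layer's "infer remain" ratio.
    keep_prob = torch.sigmoid(token_loga)
    masks = (keep_prob > threshold).float()
    return masks, masks.mean(dim=1)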
0.99 0.64 0.63 0.55 0.52 0.72 0.64 0.59 0.43] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.7, 0.64, 0.6, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10011111110111111111111110101101011001110011011000 10011111110111101011111110111101011111010010000000 00011111110110101011111010011101011101110110100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.042877, lagrangian_loss: 0.000351, attention_score_distillation_loss: 0.000020 loss: 0.086501, lagrangian_loss: 0.000065, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:28:32 Evaluating: accuracy: 0.8677, eval_loss: 0.5378, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5793, expected_sequence_sparsity: 0.8352, target_sparsity: 0.58, step: 50000 lambda_1: -0.6061, lambda_2: 290.3694 lambda_3: 0.0000 train remain: [1. 0.99 0.64 0.63 0.55 0.52 0.72 0.63 0.59 0.43] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.7, 0.64, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010101110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10011111110111111111111110101101011001110011011000 10011111110111101011111110111101011111010010000000 00011111110110101011111010011101011101110100100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.040072, lagrangian_loss: 0.000249, attention_score_distillation_loss: 0.000020 loss: 0.287236, lagrangian_loss: 0.000915, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:31:26 Evaluating: accuracy: 0.8733, eval_loss: 0.5234, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5793, expected_sequence_sparsity: 0.8352, target_sparsity: 0.58, step: 50500 lambda_1: -0.1086, lambda_2: 293.2389 lambda_3: 0.0000 train remain: [1. 
0.99 0.64 0.63 0.55 0.52 0.72 0.63 0.58 0.45] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.7, 0.64, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010110010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10011111110111111111111110101101011001110011011000 10011111110111101011111110111101011111010010000000 00011111110110101011111010011101011101110100100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.112167, lagrangian_loss: 0.000069, attention_score_distillation_loss: 0.000020 loss: 0.482560, lagrangian_loss: 0.000126, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:34:21 Evaluating: accuracy: 0.8589, eval_loss: 0.5786, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5793, expected_sequence_sparsity: 0.8352, target_sparsity: 0.58, step: 51000 lambda_1: -0.6258, lambda_2: 295.7832 lambda_3: 0.0000 train remain: [1. 0.99 0.64 0.63 0.55 0.52 0.71 0.64 0.58 0.41] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.7, 0.64, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010110010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10011111110111111111111110101101011001110011011000 10011111110111101011111110111101011111010010000000 00011111110110101011111010011101011101110100100000 00001110110110101010000010010001000000010000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.273637, lagrangian_loss: 0.000390, attention_score_distillation_loss: 0.000020 loss: 0.045290, lagrangian_loss: -0.000121, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:37:14 Evaluating: accuracy: 0.8689, eval_loss: 0.564, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5793, expected_sequence_sparsity: 0.8352, target_sparsity: 0.58, step: 51500 lambda_1: -0.0940, lambda_2: 298.5451 lambda_3: 0.0000 train remain: [1. 
0.99 0.64 0.63 0.55 0.52 0.72 0.63 0.58 0.41] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.7, 0.64, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10011111110111111111111110101101011001110011011000 10011111110111101011111110111101011111000110000000 00011111110110101011111010011101011101110100100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.042610, lagrangian_loss: 0.000492, attention_score_distillation_loss: 0.000020 loss: 0.136418, lagrangian_loss: 0.000698, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:40:10 Evaluating: accuracy: 0.8759, eval_loss: 0.5208, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5793, expected_sequence_sparsity: 0.8352, target_sparsity: 0.58, step: 52000 lambda_1: -0.8201, lambda_2: 301.1749 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.63 0.55 0.52 0.71 0.63 0.58 0.37] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.7, 0.64, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10011111110111111111111110101101011001110011011000 10011111110111101011111110111101011111010010000000 00011111110110101011011010011101011101110110100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.198329, lagrangian_loss: -0.000504, attention_score_distillation_loss: 0.000020 loss: 0.285631, lagrangian_loss: -0.000127, attention_score_distillation_loss: 0.000020 ETA: 7:39:13 | Epoch 15 finished. Took 1132.47 seconds. ---------------------------------------------------------------------- time: 2023-07-19 19:43:08 Evaluating: accuracy: 0.8711, eval_loss: 0.5429, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5797, expected_sequence_sparsity: 0.8354, target_sparsity: 0.58, step: 52500 lambda_1: 0.0207, lambda_2: 304.0379 lambda_3: 0.0000 train remain: [1. 
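The per-epoch footers ("ETA: ... | Epoch N finished. Took ... seconds.") project the remaining wall-clock time of the run. A simplified reconstruction (the logged estimate is evidently smoothed over recent step timings rather than a single epoch, so the numbers will not match exactly):

import datetime

def epoch_footer(epoch, num_epochs, epoch_seconds):
    # Naive projection: remaining epochs times the last epoch's duration.
    remaining = (num_epochs - epoch - 1) * epoch_seconds
    eta = str(datetime.timedelta(seconds=int(remaining)))
    return f"ETA: {eta} | Epoch {epoch} finished. Took {epoch_seconds:.2f} seconds."

print(epoch_footer(15, 40, 1132.47))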
0.99 0.64 0.63 0.55 0.52 0.71 0.63 0.58 0.38] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.64, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10011111110111111111111110101101011001110011010000 10011111110111101011111110111101011111010010000000 00011111110110101011011010011101011101110110100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.023148, lagrangian_loss: 0.000203, attention_score_distillation_loss: 0.000020 loss: 0.197529, lagrangian_loss: 0.000068, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:46:04 Evaluating: accuracy: 0.8614, eval_loss: 0.5646, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5797, expected_sequence_sparsity: 0.8354, target_sparsity: 0.58, step: 53000 lambda_1: -0.7220, lambda_2: 306.8342 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.63 0.55 0.52 0.71 0.63 0.58 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.64, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10011111110111111111111110101101011001110011010000 10011111110111101011111110111101011111010010000000 00011111110110101011011010011101011101110110100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.249287, lagrangian_loss: 0.000696, attention_score_distillation_loss: 0.000020 loss: 0.030555, lagrangian_loss: 0.000652, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:49:01 Evaluating: accuracy: 0.87, eval_loss: 0.5721, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5799, expected_sequence_sparsity: 0.8355, target_sparsity: 0.58, step: 53500 lambda_1: -0.1421, lambda_2: 309.9918 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.71 0.63 0.58 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.62, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010101010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110001010000000000000000 10001111110111111111111110101101011001110011011000 10011111110111101011111110111101011111000010000000 00011111110110101011011010011101011101110110100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.018115, lagrangian_loss: 0.000054, attention_score_distillation_loss: 0.000020 loss: 0.121729, lagrangian_loss: 0.000089, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:51:55 Evaluating: accuracy: 0.875, eval_loss: 0.5168, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5799, expected_sequence_sparsity: 0.8355, target_sparsity: 0.58, step: 54000 lambda_1: -0.7594, lambda_2: 312.8035 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.63 0.55 0.52 0.71 0.62 0.58 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.62, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010101010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111111111111110101101011011110011010000 10011111110111101011111110111101011111000010000000 00011111110110101011011010011101011101110110100000 00001110110110101010000010010001010000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.076535, lagrangian_loss: 0.000040, attention_score_distillation_loss: 0.000020 loss: 0.019201, lagrangian_loss: 0.000153, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:54:47 Evaluating: accuracy: 0.873, eval_loss: 0.5762, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5802, expected_sequence_sparsity: 0.8356, target_sparsity: 0.58, step: 54500 lambda_1: -0.4042, lambda_2: 315.5988 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.71 0.61 0.58 0.36] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.6, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111111111111110101101011011110011010000 10011111110111101011111110011101011111000010000000 10011111110110101011011010011101011101110100100000 00001110110110101010000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.458262, lagrangian_loss: 0.000433, attention_score_distillation_loss: 0.000020 loss: 0.019395, lagrangian_loss: 0.000112, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 19:57:43 Evaluating: accuracy: 0.8675, eval_loss: 0.5985, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5802, expected_sequence_sparsity: 0.8356, target_sparsity: 0.58, step: 55000 lambda_1: -0.0856, lambda_2: 318.5368 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.72 0.61 0.58 0.35] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.6, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011010101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111111111111110101101011011110011010000 10011111110111101011111110011101011111000010000000 00011111110110101011011010011101011101110110100000 00001110110110101011000010010001000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.030892, lagrangian_loss: 0.001352, attention_score_distillation_loss: 0.000020 loss: 0.043613, lagrangian_loss: 0.000026, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:00:39 Evaluating: accuracy: 0.877, eval_loss: 0.5395, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5799, expected_sequence_sparsity: 0.8355, target_sparsity: 0.58, step: 55500 lambda_1: -0.6350, lambda_2: 321.4322 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.71 0.61 0.58 0.36] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.62, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011010101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111111111111110101101011011110011010000 10011111110111101011111110011101011111000011000000 00011111110110101011011010011101011101110110100000 00001110110110101011000010010001000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.056100, lagrangian_loss: -0.000230, attention_score_distillation_loss: 0.000020 ETA: 7:20:20 | Epoch 16 finished. Took 1159.59 seconds. loss: 0.020278, lagrangian_loss: -0.000281, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:03:34 Evaluating: accuracy: 0.8717, eval_loss: 0.5872, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5799, expected_sequence_sparsity: 0.8355, target_sparsity: 0.58, step: 56000 lambda_1: -0.3550, lambda_2: 324.4251 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.71 0.61 0.58 0.32] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.62, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101101110011111001001011000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011111110011010000 10011111110111101011111110011101011111000011000000 10011111110110101011011010011101011101110100100000 00001110110110101011000010010001000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.101723, lagrangian_loss: -0.000083, attention_score_distillation_loss: 0.000020 loss: 0.064691, lagrangian_loss: 0.000446, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:06:29 Evaluating: accuracy: 0.8753, eval_loss: 0.557, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5799, expected_sequence_sparsity: 0.8355, target_sparsity: 0.58, step: 56500 lambda_1: -0.1721, lambda_2: 327.0652 lambda_3: 0.0000 train remain: [1. 1. 
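Throughout this stretch attention_score_distillation_loss holds steady at 0.000020, suggesting the student's attention patterns have already converged to the teacher's at the pruned layers. A generic sketch of such a term, assuming an MSE formulation (the exact loss used by this codebase may differ):

import torch.nn.functional as F

def attention_score_distillation_loss(student_attn, teacher_attn, weight=1.0):
    # student_attn / teacher_attn: attention maps of matching shape, e.g.
    # [batch, heads, seq_len, seq_len]. Pull the pruned student's attention
    # distributions toward the dense teacher's.
    return weight * F.mse_loss(student_attn, teacher_attn)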
0.64 0.64 0.55 0.52 0.7 0.61 0.57 0.34] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.62, 0.58, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101101110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011111110011010000 10011111110111101011111110011101011111010010000000 00011111110110101011011010011101011101110110100000 00001110110110101010000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.122554, lagrangian_loss: 0.001092, attention_score_distillation_loss: 0.000020 loss: 0.325221, lagrangian_loss: -0.000051, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:09:27 Evaluating: accuracy: 0.8715, eval_loss: 0.5532, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5801, expected_sequence_sparsity: 0.8355, target_sparsity: 0.58, step: 57000 lambda_1: -0.2316, lambda_2: 329.9371 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.69 0.61 0.57 0.34] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.68, 0.62, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010101110 11111111111111111101101110011111001001011000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011111110011010000 10011111110111101011111110011101011111010010000000 00011111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.163286, lagrangian_loss: -0.000038, attention_score_distillation_loss: 0.000020 loss: 0.038924, lagrangian_loss: 0.000436, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:12:22 Evaluating: accuracy: 0.8735, eval_loss: 0.5578, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 57500 lambda_1: -0.6205, lambda_2: 332.8638 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.68 0.61 0.56 0.35] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.62, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101011100010001110 11111111111111111101101110011111001001010010000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011011110011010000 10011111110111101011111110011101011111010010000000 00011111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.187014, lagrangian_loss: -0.000031, attention_score_distillation_loss: 0.000020 loss: 0.040996, lagrangian_loss: 0.000667, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:15:16 Evaluating: accuracy: 0.8737, eval_loss: 0.5757, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5807, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 58000 lambda_1: -0.4421, lambda_2: 335.7651 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.68 0.61 0.56 0.35] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.6, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101011100010001110 11111111111111111101101110011111001001011000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011011110011010000 10011111110111101011111110011101011111000010000000 10001111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.265504, lagrangian_loss: -0.000059, attention_score_distillation_loss: 0.000020 loss: 0.283869, lagrangian_loss: -0.000070, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:18:12 Evaluating: accuracy: 0.8742, eval_loss: 0.5676, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5807, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 58500 lambda_1: -0.6451, lambda_2: 338.5070 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.68 0.61 0.56 0.34] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.6, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000111100110101010100010001110 11111111111111111101101110011111001001011000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011011110011010000 10011111110111101011111110011101011111000010000000 10001111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.068650, lagrangian_loss: -0.000297, attention_score_distillation_loss: 0.000020 loss: 0.026540, lagrangian_loss: 0.000359, attention_score_distillation_loss: 0.000020 ETA: 7:00:53 | Epoch 17 finished. Took 1133.54 seconds. ---------------------------------------------------------------------- time: 2023-07-19 20:21:07 Evaluating: accuracy: 0.8755, eval_loss: 0.5649, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5807, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 59000 lambda_1: -0.1968, lambda_2: 341.6069 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.67 0.61 0.56 0.32] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.6, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000111100110101010100010001110 11111111111111111101101110011111101001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011011110011010000 10011111110111101011111110011101011111000010000000 10001111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.083371, lagrangian_loss: 0.000198, attention_score_distillation_loss: 0.000020 loss: 0.118603, lagrangian_loss: -0.000006, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:24:02 Evaluating: accuracy: 0.8742, eval_loss: 0.5648, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5807, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 59500 lambda_1: -0.6989, lambda_2: 344.4115 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.68 0.61 0.56 0.34] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.6, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011010101100110101010100010001110 11111111111111111101101110011111101001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011011110011010000 10011111110111101011111110011101011111000010000000 10001111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.194317, lagrangian_loss: 0.000329, attention_score_distillation_loss: 0.000020 loss: 0.164261, lagrangian_loss: -0.000186, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:26:56 Evaluating: accuracy: 0.8781, eval_loss: 0.5391, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5807, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 60000 lambda_1: -0.3548, lambda_2: 347.2031 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.68 0.6 0.56 0.34] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.6, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.05, 0.03, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011010101100110101010100010001110 11111111111111111101101110011111001011010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011001110111010000 10011111110111101011111110011101011111000010000000 10001111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.031535, lagrangian_loss: 0.000339, attention_score_distillation_loss: 0.000020 loss: 0.132691, lagrangian_loss: 0.000609, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:29:51 Evaluating: accuracy: 0.8748, eval_loss: 0.556, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5809, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 60500 lambda_1: -0.6470, lambda_2: 350.1119 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.7 0.59 0.56 0.36] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011010101100110101010100010001110 11111111111111111101101110011111101001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011001110111010000 10001111110111101011111010011101011111000110000000 10001111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.281685, lagrangian_loss: -0.000284, attention_score_distillation_loss: 0.000020 loss: 0.038755, lagrangian_loss: -0.000014, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:32:47 Evaluating: accuracy: 0.87, eval_loss: 0.5432, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5809, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 61000 lambda_1: -0.3024, lambda_2: 352.7885 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.7 0.59 0.56 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010110010001110 11111111111111111101101110111111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011001110111010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 loss: 0.018308, lagrangian_loss: 0.000454, attention_score_distillation_loss: 0.000020 loss: 0.132226, lagrangian_loss: -0.000008, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:35:43 Evaluating: accuracy: 0.879, eval_loss: 0.5427, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5809, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 61500 lambda_1: -0.5045, lambda_2: 355.7060 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.7 0.59 0.56 0.38] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010110010001110 11111111111111111101101110011111001001010000000010 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011001110111010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 10000110110110101011000010010001000000000000000000 Best eval score so far: 0.8788 @ step 43000 epoch 13.13 Saving the best model so far: [Epoch 18 | Step: 61500 | MACs sparsity: 0.5842 | Score: 0.879 | Loss: 0.5427] loss: 0.027696, lagrangian_loss: -0.000166, attention_score_distillation_loss: 0.000020 loss: 0.105174, lagrangian_loss: 0.000082, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:38:53 Evaluating: accuracy: 0.8746, eval_loss: 0.5542, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5809, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 62000 lambda_1: 0.0505, lambda_2: 358.6501 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.7 0.59 0.56 0.41] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011001101100110101010100010001110 11111111111111111101101110011111001001010000000010 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011001110111010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 10000110110110101011000010010001000000000000000000 Best eval score so far: 0.8790 @ step 61500 epoch 18.78 loss: 0.024550, lagrangian_loss: 0.000187, attention_score_distillation_loss: 0.000020 ETA: 6:42:16 | Epoch 18 finished. Took 1175.83 seconds. loss: 0.037825, lagrangian_loss: 0.000171, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:41:50 Evaluating: accuracy: 0.8781, eval_loss: 0.605, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5809, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 62500 lambda_1: -0.3468, lambda_2: 361.5862 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.7 0.59 0.56 0.45] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.66, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.08, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011001101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111110101101011001110111010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 10000110110110101011000010010001000000000000000000 Best eval score so far: 0.8790 @ step 61500 epoch 18.78 loss: 0.079975, lagrangian_loss: -0.000049, attention_score_distillation_loss: 0.000020 loss: 0.018173, lagrangian_loss: 0.003154, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:44:46 Evaluating: accuracy: 0.8792, eval_loss: 0.5629, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 63000 lambda_1: -0.4614, lambda_2: 364.2297 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.54 0.52 0.69 0.59 0.56 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.64, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101101110011111001001010000000010 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011001110111010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 10000110110110101011000010010001000000000000000000 Best eval score so far: 0.8790 @ step 61500 epoch 18.78 Saving the best model so far: [Epoch 19 | Step: 63000 | MACs sparsity: 0.5858 | Score: 0.8792 | Loss: 0.5629] loss: 0.012827, lagrangian_loss: -0.000131, attention_score_distillation_loss: 0.000020 loss: 0.020887, lagrangian_loss: 0.000083, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:48:05 Evaluating: accuracy: 0.8779, eval_loss: 0.5378, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 63500 lambda_1: -0.2574, lambda_2: 366.9497 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.69 0.59 0.56 0.38] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.64, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101101110011111001001010000000010 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011001110111010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 10000110110110101011000010010001000000000000000000 Best eval score so far: 0.8792 @ step 63000 epoch 19.24 loss: 0.163272, lagrangian_loss: -0.000041, attention_score_distillation_loss: 0.000020 loss: 0.290646, lagrangian_loss: 0.000570, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:50:59 Evaluating: accuracy: 0.8689, eval_loss: 0.5094, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 64000 lambda_1: -0.3177, lambda_2: 369.8412 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.68 0.59 0.56 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.64, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101101110011111001001010000000010 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011001110111010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 00000110110110101011000010010001010000000000000000 Best eval score so far: 0.8792 @ step 63000 epoch 19.24 loss: 0.051100, lagrangian_loss: 0.000511, attention_score_distillation_loss: 0.000020 loss: 0.338850, lagrangian_loss: 0.000601, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:53:55 Evaluating: accuracy: 0.8797, eval_loss: 0.5075, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 64500 lambda_1: -0.3798, lambda_2: 372.7704 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.69 0.59 0.56 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.64, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011001110111010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 00000110110110101011001010010001000000000000000000 Best eval score so far: 0.8792 @ step 63000 epoch 19.24 Saving the best model so far: [Epoch 19 | Step: 64500 | MACs sparsity: 0.5858 | Score: 0.8797 | Loss: 0.5075] loss: 0.276395, lagrangian_loss: 0.000011, attention_score_distillation_loss: 0.000020 loss: 0.260566, lagrangian_loss: -0.000048, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 20:57:12 Evaluating: accuracy: 0.8781, eval_loss: 0.5454, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 65000 lambda_1: -0.4760, lambda_2: 375.3930 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.69 0.59 0.56 0.38] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.64, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011001110111010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 10000110110110101011000010010001000000000000000000 Best eval score so far: 0.8797 @ step 64500 epoch 19.70 loss: 0.072845, lagrangian_loss: -0.000035, attention_score_distillation_loss: 0.000020 loss: 0.072607, lagrangian_loss: 0.000275, attention_score_distillation_loss: 0.000020 ETA: 6:23:35 | Epoch 19 finished. Took 1177.69 seconds. ---------------------------------------------------------------------- time: 2023-07-19 21:00:06 Evaluating: accuracy: 0.881, eval_loss: 0.5374, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 65500 lambda_1: -0.4677, lambda_2: 378.2476 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.7 0.59 0.56 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.64, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011101110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 10000110110110101011000010010001000000000000000000 Best eval score so far: 0.8797 @ step 64500 epoch 19.70 Saving the best model so far: [Epoch 20 | Step: 65500 | MACs sparsity: 0.5858 | Score: 0.881 | Loss: 0.5374] loss: 0.023048, lagrangian_loss: -0.000142, attention_score_distillation_loss: 0.000020 loss: 0.017388, lagrangian_loss: -0.000085, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:03:21 Evaluating: accuracy: 0.8788, eval_loss: 0.538, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 66000 lambda_1: -0.6193, lambda_2: 381.0638 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.69 0.59 0.56 0.36] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.64, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011101110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 10000110110110101011000010010001000000000000000000 Best eval score so far: 0.8810 @ step 65500 epoch 20.01 loss: 0.095806, lagrangian_loss: 0.001000, attention_score_distillation_loss: 0.000020 loss: 0.056584, lagrangian_loss: 0.001137, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:06:16 Evaluating: accuracy: 0.8731, eval_loss: 0.5281, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 66500 lambda_1: -0.4764, lambda_2: 383.5851 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.53 0.68 0.59 0.56 0.34] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.64, 0.58, 0.56, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011101110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110110100000 00000110110110101011000010010001000000010000000000 Best eval score so far: 0.8810 @ step 65500 epoch 20.01 loss: 0.018263, lagrangian_loss: -0.000061, attention_score_distillation_loss: 0.000020 loss: 0.110658, lagrangian_loss: -0.000056, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:09:11 Evaluating: accuracy: 0.8741, eval_loss: 0.5167, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5818, expected_sequence_sparsity: 0.8362, target_sparsity: 0.58, step: 67000 lambda_1: -0.4481, lambda_2: 386.4257 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.53 0.67 0.59 0.55 0.35] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.62, 0.58, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011001110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110100100000 00000110110110101011000010010001010000000000000000 Best eval score so far: 0.8810 @ step 65500 epoch 20.01 loss: 0.021097, lagrangian_loss: -0.000038, attention_score_distillation_loss: 0.000020 loss: 0.038061, lagrangian_loss: 0.000271, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:12:07 Evaluating: accuracy: 0.8774, eval_loss: 0.5126, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5803, expected_sequence_sparsity: 0.8356, target_sparsity: 0.58, step: 67500 lambda_1: -0.3271, lambda_2: 389.7024 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.53 0.66 0.58 0.55 0.37] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.58, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000010000000000000 10011111111111111111011110110100010000000000000000 10001111110111101111111010101101011001110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110100100000 00000110110110101011000010010001010000000000000000 Best eval score so far: 0.8810 @ step 65500 epoch 20.01 loss: 0.016362, lagrangian_loss: 0.001655, attention_score_distillation_loss: 0.000020 loss: 0.063967, lagrangian_loss: 0.000203, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:15:02 Evaluating: accuracy: 0.8807, eval_loss: 0.5286, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5812, expected_sequence_sparsity: 0.8359, target_sparsity: 0.58, step: 68000 lambda_1: -0.1287, lambda_2: 392.2574 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.66 0.58 0.55 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.62, 0.58, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001001000000000000000 10011111111111111011011110110100010000000000000000 10001111110111101111111010101101011001110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110100100000 00000110110110101011000010010001000000010000000000 Best eval score so far: 0.8810 @ step 65500 epoch 20.01 loss: 0.252197, lagrangian_loss: 0.000016, attention_score_distillation_loss: 0.000020 loss: 0.258947, lagrangian_loss: -0.000000, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:17:57 Evaluating: accuracy: 0.8843, eval_loss: 0.5017, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5826, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 68500 lambda_1: -0.2580, lambda_2: 395.0177 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.65 0.58 0.55 0.45] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.5, 0.62, 0.58, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.11, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101101110011111001001010001000000 11111111111111111111111110010001000000000000000000 10011111111111111011011110110100010000000000000000 10001111110111101111111010101101011001110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110100100000 00000110110110101011000010010001000000010000000000 Best eval score so far: 0.8810 @ step 65500 epoch 20.01 Saving the best model so far: [Epoch 20 | Step: 68500 | MACs sparsity: 0.5874 | Score: 0.8843 | Loss: 0.5017] loss: 0.463440, lagrangian_loss: -0.000034, attention_score_distillation_loss: 0.000020 loss: 0.032587, lagrangian_loss: 0.000091, attention_score_distillation_loss: 0.000020 ETA: 6:05:07 | Epoch 20 finished. Took 1197.46 seconds. ---------------------------------------------------------------------- time: 2023-07-19 21:21:12 Evaluating: accuracy: 0.8849, eval_loss: 0.4964, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5812, expected_sequence_sparsity: 0.8359, target_sparsity: 0.58, step: 69000 lambda_1: -0.1886, lambda_2: 398.0453 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.65 0.58 0.55 0.44] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.62, 0.58, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101110110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000100000000000 10011111111111111011011110110100010000000000000000 10001111110111101111111010101101011001110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8843 @ step 68500 epoch 20.92 Saving the best model so far: [Epoch 21 | Step: 69000 | MACs sparsity: 0.5874 | Score: 0.8849 | Loss: 0.4964] loss: 0.070037, lagrangian_loss: 0.000229, attention_score_distillation_loss: 0.000020 loss: 0.015016, lagrangian_loss: 0.000159, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:24:22 Evaluating: accuracy: 0.8814, eval_loss: 0.53, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5803, expected_sequence_sparsity: 0.8356, target_sparsity: 0.58, step: 69500 lambda_1: -0.0890, lambda_2: 400.6562 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.65 0.58 0.55 0.46] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.58, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000100000000000 10011111111111111011011110110100011000000000000000 10001111110111101111111010101101011001110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101011101110100100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.045294, lagrangian_loss: 0.000109, attention_score_distillation_loss: 0.000020 loss: 0.010873, lagrangian_loss: 0.000007, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:27:17 Evaluating: accuracy: 0.8785, eval_loss: 0.5248, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5803, expected_sequence_sparsity: 0.8356, target_sparsity: 0.58, step: 70000 lambda_1: -0.2052, lambda_2: 403.3384 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.65 0.58 0.55 0.46] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.58, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000100000000000 10011111111111111011011110110100011000000000000000 10001111110111101111111010101101011001110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.023941, lagrangian_loss: -0.000002, attention_score_distillation_loss: 0.000020 loss: 0.042479, lagrangian_loss: 0.000123, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:30:11 Evaluating: accuracy: 0.8794, eval_loss: 0.5079, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5803, expected_sequence_sparsity: 0.8356, target_sparsity: 0.58, step: 70500 lambda_1: -0.1424, lambda_2: 406.1184 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.64 0.58 0.55 0.45] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.58, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111011011110110100011000000000000000 10001111110111101111111010101101011001110011010000 10001111110111101011111010011101011111000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.335359, lagrangian_loss: -0.000012, attention_score_distillation_loss: 0.000020 loss: 0.114362, lagrangian_loss: 0.000204, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:33:05 Evaluating: accuracy: 0.8825, eval_loss: 0.5149, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 71000 lambda_1: -0.0808, lambda_2: 408.9560 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.56 0.52 0.65 0.57 0.55 0.47] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.56, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111011011110110100011000000000000000 10001111110111101011111010101101011101110011010000 10001111110111101011111010011101011111000010000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010101000000000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.165304, lagrangian_loss: 0.000107, attention_score_distillation_loss: 0.000020 loss: 0.013163, lagrangian_loss: -0.000017, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:36:02 Evaluating: accuracy: 0.8836, eval_loss: 0.5286, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 71500 lambda_1: -0.7439, lambda_2: 411.7583 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.56 0.52 0.65 0.57 0.55 0.44] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.56, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111011011110110100011000000000000000 10001111110111101011111010101101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010001000000010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.043684, lagrangian_loss: 0.000143, attention_score_distillation_loss: 0.000020 loss: 0.099347, lagrangian_loss: 0.000058, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:38:56 Evaluating: accuracy: 0.8797, eval_loss: 0.5235, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 72000 lambda_1: -0.2411, lambda_2: 414.7701 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.56 0.52 0.66 0.57 0.54 0.41] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.56, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101101110011111011001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111011011110110100011000000000000000 10001111110111101011111010101101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010001000000010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.301586, lagrangian_loss: 0.000069, attention_score_distillation_loss: 0.000020 ETA: 5:46:11 | Epoch 21 finished. Took 1174.74 seconds. loss: 0.066749, lagrangian_loss: -0.000078, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:41:53 Evaluating: accuracy: 0.8781, eval_loss: 0.5336, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 72500 lambda_1: -0.1597, lambda_2: 417.6783 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.56 0.52 0.65 0.57 0.55 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.56, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101101110011111011001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111011011110110100011000000000000000 10001111110111101011111010101101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010001000000010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.070522, lagrangian_loss: 0.000030, attention_score_distillation_loss: 0.000020 loss: 0.036869, lagrangian_loss: 0.000113, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:44:49 Evaluating: accuracy: 0.881, eval_loss: 0.5276, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 73000 lambda_1: -0.3256, lambda_2: 420.6231 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.66 0.57 0.55 0.44] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.56, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101101110011111011001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111011011110110100011000000000000000 10001111110111101011111010101101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010001000000010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.067634, lagrangian_loss: 0.000154, attention_score_distillation_loss: 0.000020 loss: 0.216381, lagrangian_loss: -0.000009, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:47:46 Evaluating: accuracy: 0.8816, eval_loss: 0.5287, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 73500 lambda_1: -0.0074, lambda_2: 423.1298 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.56 0.52 0.66 0.57 0.55 0.45] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.62, 0.56, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101101111011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111011011110110100011000000000000000 10001111110111101011111010101101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010001000000010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.031040, lagrangian_loss: 0.000128, attention_score_distillation_loss: 0.000020 loss: 0.033999, lagrangian_loss: 0.000281, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:50:41 Evaluating: accuracy: 0.8797, eval_loss: 0.511, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.581, expected_sequence_sparsity: 0.8359, target_sparsity: 0.58, step: 74000 lambda_1: -0.2138, lambda_2: 426.0714 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.66 0.57 0.55 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.6, 0.56, 0.54, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101101111011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111011011110110100011000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010001000000000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.033948, lagrangian_loss: 0.001736, attention_score_distillation_loss: 0.000020 loss: 0.339679, lagrangian_loss: 0.000277, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:53:35 Evaluating: accuracy: 0.8816, eval_loss: 0.5339, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5809, expected_sequence_sparsity: 0.8359, target_sparsity: 0.58, step: 74500 lambda_1: -0.1441, lambda_2: 428.8959 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.56 0.53 0.66 0.57 0.55 0.42] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.6, 0.56, 0.54, 0.28] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101101110011111001001010100000000 11111111111111111111111110011001000000000000000000 10011111111111111011011110110100011000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010010001000000010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.016333, lagrangian_loss: 0.000049, attention_score_distillation_loss: 0.000020 loss: 0.029728, lagrangian_loss: 0.000155, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 21:56:29 Evaluating: accuracy: 0.8763, eval_loss: 0.5386, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.581, expected_sequence_sparsity: 0.8359, target_sparsity: 0.58, step: 75000 lambda_1: -0.2069, lambda_2: 431.5603 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.52 0.66 0.56 0.55 0.44] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.6, 0.56, 0.54, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101101110111111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111011111110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010000001000000010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.024295, lagrangian_loss: 0.000959, attention_score_distillation_loss: 0.000020 loss: 0.130474, lagrangian_loss: -0.000029, attention_score_distillation_loss: 0.000020 ETA: 5:26:41 | Epoch 22 finished. Took 1132.14 seconds. ---------------------------------------------------------------------- time: 2023-07-19 21:59:24 Evaluating: accuracy: 0.877, eval_loss: 0.5532, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5824, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 75500 lambda_1: -0.2969, lambda_2: 434.6068 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.52 0.68 0.56 0.55 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.56, 0.54, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111011111110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110110101011000010000001000000010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.117379, lagrangian_loss: 0.000013, attention_score_distillation_loss: 0.000020 loss: 0.012942, lagrangian_loss: 0.000444, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:02:19 Evaluating: accuracy: 0.8735, eval_loss: 0.535, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5824, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 76000 lambda_1: -0.4099, lambda_2: 437.4709 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.53 0.68 0.56 0.55 0.42] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.56, 0.54, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101101110111111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111011111110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110010101011000010000001000001010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.015356, lagrangian_loss: 0.000100, attention_score_distillation_loss: 0.000020 loss: 0.022805, lagrangian_loss: 0.000466, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:05:11 Evaluating: accuracy: 0.877, eval_loss: 0.5574, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5825, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 76500 lambda_1: -0.2786, lambda_2: 440.0410 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.53 0.67 0.56 0.54 0.42] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.56, 0.54, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101111110011111001001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111011111110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110010101011000010000001000001000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.060316, lagrangian_loss: -0.000032, attention_score_distillation_loss: 0.000020 loss: 0.036453, lagrangian_loss: 0.000028, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:08:06 Evaluating: accuracy: 0.8821, eval_loss: 0.506, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5825, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 77000 lambda_1: -0.2779, lambda_2: 443.1506 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.54 0.67 0.56 0.54 0.44] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.56, 0.54, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111101101110011111001001010000000100 11111111111111111111111110010001000000000000000000 10011111111111111011111110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110010101011000010000001000000010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.035918, lagrangian_loss: -0.000030, attention_score_distillation_loss: 0.000020 loss: 0.066926, lagrangian_loss: 0.000072, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:11:02 Evaluating: accuracy: 0.8819, eval_loss: 0.5381, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5824, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 77500 lambda_1: -0.5802, lambda_2: 446.2108 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.54 0.69 0.56 0.54 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.56, 0.54, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111101101110011111001001110000000000 11111111111111111111111110010001000000000000000000 10011111111111111011111110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110010101011000010000101000001000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.010055, lagrangian_loss: 0.000035, attention_score_distillation_loss: 0.000020 loss: 0.041910, lagrangian_loss: 0.001313, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:13:57 Evaluating: accuracy: 0.881, eval_loss: 0.525, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5824, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 78000 lambda_1: -0.0576, lambda_2: 449.4054 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.54 0.69 0.56 0.54 0.34] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.56, 0.54, 0.26] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.01] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000111100110101010100010001110 11111111111111111101101110011111001001010000000001 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110110100000 00000110110010101011000010000001000001010000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.083640, lagrangian_loss: 0.001204, attention_score_distillation_loss: 0.000020 loss: 0.039007, lagrangian_loss: 0.000224, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:16:55 Evaluating: accuracy: 0.8797, eval_loss: 0.5243, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5825, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 78500 lambda_1: -0.1890, lambda_2: 452.1937 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.54 0.68 0.55 0.53 0.38] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.56, 0.52, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100011001110 11111111111111111101101110011111001001010000001000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110100100000 00000110110010101011000010000001000001000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.036491, lagrangian_loss: 0.000262, attention_score_distillation_loss: 0.000020 ETA: 5:07:32 | Epoch 23 finished. Took 1159.49 seconds. loss: 0.086234, lagrangian_loss: -0.000001, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:19:50 Evaluating: accuracy: 0.8752, eval_loss: 0.5335, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5825, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 79000 lambda_1: -0.1797, lambda_2: 454.9331 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.54 0.68 0.55 0.53 0.42] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.56, 0.52, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101110110101010100010001110 11111111111111111101101110011111001001010000001000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000110000000 00001111110110101011011010011101010101110100100000 00000110110010101011000010000001000001000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.037827, lagrangian_loss: -0.000013, attention_score_distillation_loss: 0.000020 loss: 0.024940, lagrangian_loss: 0.000085, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:22:45 Evaluating: accuracy: 0.8814, eval_loss: 0.5207, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5827, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 79500 lambda_1: -0.0871, lambda_2: 457.6575 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.53 0.67 0.55 0.53 0.41] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.54, 0.52, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101101110011111011001010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000010000000 00001111110110101011011010001101010101110110100000 00000110110010101011000010000001000001000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.041184, lagrangian_loss: 0.001178, attention_score_distillation_loss: 0.000020 loss: 0.016891, lagrangian_loss: 0.000692, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:25:39 Evaluating: accuracy: 0.8796, eval_loss: 0.5212, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5827, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 80000 lambda_1: -0.0412, lambda_2: 460.7020 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.53 0.66 0.56 0.53 0.43] infer remain: [1.0, 1.0, 0.64, 0.64, 0.54, 0.52, 0.6, 0.54, 0.52, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.22, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101101110011111001101010000000000 11111111111111111111111110010001000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000010000000 00001111110110101011011010001101010101110110100000 00000110110010101011000010000101000000000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.028016, lagrangian_loss: 0.000206, attention_score_distillation_loss: 0.000020 loss: 0.014005, lagrangian_loss: -0.000021, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:28:34 Evaluating: accuracy: 0.8839, eval_loss: 0.5331, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 80500 lambda_1: -0.0560, lambda_2: 463.2805 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.53 0.65 0.56 0.53 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.6, 0.54, 0.52, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101101110011111001101010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000010000000 00001111110110101011011010001101010101110110100000 00000110110010101011000010000101000000000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.031058, lagrangian_loss: 0.001869, attention_score_distillation_loss: 0.000020 loss: 0.035282, lagrangian_loss: 0.000179, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:31:30 Evaluating: accuracy: 0.8808, eval_loss: 0.5228, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 81000 lambda_1: -0.0867, lambda_2: 466.2064 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.53 0.65 0.56 0.53 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.6, 0.54, 0.52, 0.24] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111101101110011111001101010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000010000000 00001111110110101011011010001101010101110110100000 00000110110010101011000010000101000000000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 loss: 0.020326, lagrangian_loss: 0.000087, attention_score_distillation_loss: 0.000020 loss: 0.237867, lagrangian_loss: 0.000121, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:34:23 Evaluating: accuracy: 0.885, eval_loss: 0.498, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5814, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 81500 lambda_1: -0.1589, lambda_2: 469.1951 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.53 0.66 0.55 0.53 0.38] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.6, 0.54, 0.52, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011010101100110101010100010001110 11111111111111111101101110011111001101010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101011101000010000000 00001111110110101011011010001101010101110110100000 00000110110010101011000010000001000000000000000000 Best eval score so far: 0.8849 @ step 69000 epoch 21.08 Saving the best model so far: [Epoch 24 | Step: 81500 | MACs sparsity: 0.5858 | Score: 0.885 | Loss: 0.498] loss: 0.133779, lagrangian_loss: 0.000331, attention_score_distillation_loss: 0.000020 loss: 0.082751, lagrangian_loss: 0.000102, attention_score_distillation_loss: 0.000020 ETA: 4:48:26 | Epoch 24 finished. Took 1165.38 seconds. ---------------------------------------------------------------------- time: 2023-07-19 22:37:54 Evaluating: accuracy: 0.8838, eval_loss: 0.5325, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5814, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 82000 lambda_1: -0.0668, lambda_2: 471.9885 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.55 0.53 0.67 0.55 0.53 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.6, 0.54, 0.52, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011010101100110101010100010001110 11111111111111111101101110011111001101010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101000110000000 00001111110110101011011010001101010101110110100000 00000110110010101011000010000001000000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.078437, lagrangian_loss: 0.000281, attention_score_distillation_loss: 0.000020 loss: 0.097536, lagrangian_loss: 0.000040, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:40:51 Evaluating: accuracy: 0.8828, eval_loss: 0.5407, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5814, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 82500 lambda_1: -0.1629, lambda_2: 474.8652 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.55 0.53 0.68 0.55 0.53 0.39] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.52, 0.6, 0.54, 0.52, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101101110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110100010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101000110000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000001000000010000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.230822, lagrangian_loss: 0.000075, attention_score_distillation_loss: 0.000020 loss: 0.017206, lagrangian_loss: 0.000369, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:43:46 Evaluating: accuracy: 0.885, eval_loss: 0.5044, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 83000 lambda_1: -0.2070, lambda_2: 477.7310 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.56 0.52 0.68 0.55 0.53 0.33] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101000110000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000001000000010000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.071927, lagrangian_loss: -0.000022, attention_score_distillation_loss: 0.000020 loss: 0.026830, lagrangian_loss: 0.001024, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:46:41 Evaluating: accuracy: 0.8817, eval_loss: 0.5089, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 83500 lambda_1: 0.0075, lambda_2: 480.3314 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.56 0.52 0.68 0.55 0.53 0.37] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101000110000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000100000000010000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.019109, lagrangian_loss: 0.000081, attention_score_distillation_loss: 0.000020 loss: 0.018641, lagrangian_loss: 0.000064, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:49:35 Evaluating: accuracy: 0.8803, eval_loss: 0.5375, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 84000 lambda_1: -0.1322, lambda_2: 483.6827 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.56 0.52 0.68 0.54 0.53 0.37] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101010010000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000100000000000100000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.209876, lagrangian_loss: 0.000086, attention_score_distillation_loss: 0.000020 loss: 0.057056, lagrangian_loss: 0.000157, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:52:33 Evaluating: accuracy: 0.8845, eval_loss: 0.507, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 84500 lambda_1: -0.1916, lambda_2: 486.3956 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.64 0.56 0.52 0.66 0.54 0.53 0.4 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.22] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101010010000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000100000000000100000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.322628, lagrangian_loss: 0.000392, attention_score_distillation_loss: 0.000020 loss: 0.166063, lagrangian_loss: 0.000245, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:55:26 Evaluating: accuracy: 0.8779, eval_loss: 0.5346, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 85000 lambda_1: -0.2417, lambda_2: 488.8941 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.64 0.56 0.52 0.66 0.54 0.53 0.37] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101010010000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000100000000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.091727, lagrangian_loss: -0.000030, attention_score_distillation_loss: 0.000020 ETA: 4:29:17 | Epoch 25 finished. Took 1161.18 seconds. loss: 0.012089, lagrangian_loss: 0.000028, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 22:58:24 Evaluating: accuracy: 0.8841, eval_loss: 0.5069, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 85500 lambda_1: -0.1114, lambda_2: 491.7215 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.65 0.56 0.52 0.65 0.54 0.53 0.37] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101010010000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000100000000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.030648, lagrangian_loss: 0.000231, attention_score_distillation_loss: 0.000020 loss: 0.012096, lagrangian_loss: 0.000040, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:01:18 Evaluating: accuracy: 0.8828, eval_loss: 0.5292, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 86000 lambda_1: -0.1465, lambda_2: 494.5087 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.52 0.65 0.54 0.53 0.36] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000100000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101010010000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000100000000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.016542, lagrangian_loss: -0.000008, attention_score_distillation_loss: 0.000020 loss: 0.016405, lagrangian_loss: 0.000194, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:04:14 Evaluating: accuracy: 0.8819, eval_loss: 0.5187, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 86500 lambda_1: -0.1348, lambda_2: 497.3129 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.65 0.56 0.52 0.66 0.54 0.53 0.33] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000100000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101010010000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000000000000000100000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.067991, lagrangian_loss: 0.000220, attention_score_distillation_loss: 0.000020 loss: 0.016550, lagrangian_loss: 0.000009, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:07:08 Evaluating: accuracy: 0.8843, eval_loss: 0.5264, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 87000 lambda_1: -0.3611, lambda_2: 500.1249 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.52 0.65 0.54 0.53 0.31] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000010000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101010010000000 00001111110110101011011010001101010101110110100000 00000010110010101001000010000100000000000100000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.016051, lagrangian_loss: -0.000007, attention_score_distillation_loss: 0.000020 loss: 0.027265, lagrangian_loss: -0.000026, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:10:02 Evaluating: accuracy: 0.8797, eval_loss: 0.5297, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5822, expected_sequence_sparsity: 0.8364, target_sparsity: 0.58, step: 87500 lambda_1: -0.0226, lambda_2: 502.9646 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.65 0.56 0.52 0.64 0.54 0.53 0.31] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.54, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000100000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101010010000000 00001111110110101011011010001101010101110110100000 00000010110010101001000010000100000000000010000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.060761, lagrangian_loss: 0.000470, attention_score_distillation_loss: 0.000020 loss: 0.014509, lagrangian_loss: 0.000508, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:12:57 Evaluating: accuracy: 0.8817, eval_loss: 0.5085, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5824, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 88000 lambda_1: -0.2795, lambda_2: 505.8651 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.52 0.65 0.54 0.53 0.33] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.52, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010011000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101000010000000 00001111110110101011011010001101010101110110100000 00000010110010101011000010000000000000000100000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.022134, lagrangian_loss: -0.000032, attention_score_distillation_loss: 0.000020 loss: 0.012158, lagrangian_loss: 0.001050, attention_score_distillation_loss: 0.000020 ETA: 4:09:52 | Epoch 26 finished. Took 1131.58 seconds. ---------------------------------------------------------------------- time: 2023-07-19 23:15:52 Evaluating: accuracy: 0.8805, eval_loss: 0.5221, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5824, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 88500 lambda_1: -0.0967, lambda_2: 508.6771 lambda_3: 0.0000 train remain: [1. 1. 
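Each 50-digit row printed under an evaluation is one pruned layer's inferred keep-mask over 50 token-position bins (1 = bin kept), and the `infer remain` entries are simply the row means, which is why they move in steps of 0.02. For instance, the last row of the step-88000 record above has ten 1s, matching its 0.2 entry:

```python
def remain_ratio(mask_row: str) -> float:
    # Fraction of the 50 position bins this layer keeps.
    return mask_row.count("1") / len(mask_row)

row = "00000010110010101011000010000000000000000100000000"
assert remain_ratio(row) == 0.2  # last 'infer remain' entry at step 88000
```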
0.64 0.65 0.56 0.52 0.64 0.53 0.53 0.34] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.52, 0.52, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000001000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101000010000000 00001111110110101011011010001101010101110110100000 00000011110010101001000010000000010000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.013911, lagrangian_loss: 0.000187, attention_score_distillation_loss: 0.000020 loss: 0.028696, lagrangian_loss: 0.000877, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:18:48 Evaluating: accuracy: 0.8836, eval_loss: 0.5095, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5825, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 89000 lambda_1: -0.0785, lambda_2: 511.6286 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.52 0.64 0.53 0.52 0.32] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001001000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101000010000000 00001111110110101011011010001101010101110110000000 00000010110010101001000010000000010000000100000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.030722, lagrangian_loss: 0.000403, attention_score_distillation_loss: 0.000020 loss: 0.038689, lagrangian_loss: 0.000497, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:21:40 Evaluating: accuracy: 0.8816, eval_loss: 0.5278, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5825, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 89500 lambda_1: -0.1155, lambda_2: 514.1971 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.65 0.56 0.52 0.65 0.53 0.52 0.31] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001001000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011111010011101010101000010000000 00001111110110101011011010001101010101110110000000 00000010110010101001000010000000010100000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.019358, lagrangian_loss: 0.000019, attention_score_distillation_loss: 0.000020 loss: 0.016133, lagrangian_loss: 0.000004, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:24:34 Evaluating: accuracy: 0.8819, eval_loss: 0.5351, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5874, expected_sparsity: 0.5825, expected_sequence_sparsity: 0.8365, target_sparsity: 0.58, step: 90000 lambda_1: -0.1443, lambda_2: 517.0613 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.52 0.65 0.53 0.52 0.31] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.6, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010011110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000001000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011010000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101110110000000 10000010110010101001000010000000010000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.038661, lagrangian_loss: 0.000410, attention_score_distillation_loss: 0.000020 loss: 0.014877, lagrangian_loss: -0.000050, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:27:30 Evaluating: accuracy: 0.8812, eval_loss: 0.5484, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5828, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 90500 lambda_1: -0.1692, lambda_2: 519.9286 lambda_3: 0.0000 train remain: [1. 1. 
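Alongside the task loss, every step logs an attention_score_distillation_loss that has flattened out at 0.000020 by this stage of training. Since the run distills from a fixed teacher, a plausible shape for this term (an assumption about its general form, not the repository's actual implementation) is a regression between student and teacher attention scores at the pruned layers:

```python
import torch.nn.functional as F

def attention_score_distillation(student_attn, teacher_attn, mask=None):
    """Hypothetical sketch: pull student attention maps toward the teacher's.

    student_attn / teacher_attn: [batch, heads, seq, seq] attention scores.
    """
    loss = F.mse_loss(student_attn, teacher_attn, reduction="none")
    if mask is not None:  # ignore padded positions if a mask is supplied
        loss = loss * mask
        return loss.sum() / mask.sum().clamp(min=1)
    return loss.mean()
```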
0.64 0.65 0.56 0.51 0.68 0.53 0.51 0.3 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000001000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101110110000000 10000010110010101001000010000000010000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.060792, lagrangian_loss: 0.000433, attention_score_distillation_loss: 0.000020 loss: 0.017730, lagrangian_loss: 0.002628, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:30:26 Evaluating: accuracy: 0.8836, eval_loss: 0.5471, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5828, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 91000 lambda_1: -0.2392, lambda_2: 522.8953 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.51 0.67 0.53 0.51 0.33] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000100000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101110110000000 10000010110010101001000010000000010000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.010168, lagrangian_loss: 0.000062, attention_score_distillation_loss: 0.000020 loss: 0.026259, lagrangian_loss: 0.000977, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:33:22 Evaluating: accuracy: 0.8832, eval_loss: 0.5449, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5828, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 91500 lambda_1: -0.0913, lambda_2: 525.4966 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.65 0.56 0.51 0.65 0.53 0.51 0.34] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000100000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101110110000000 10000010110010101001000010000000010000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.009967, lagrangian_loss: 0.000270, attention_score_distillation_loss: 0.000020 ETA: 3:50:41 | Epoch 27 finished. Took 1159.53 seconds. loss: 0.041729, lagrangian_loss: 0.000000, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:36:17 Evaluating: accuracy: 0.8796, eval_loss: 0.5416, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5828, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 92000 lambda_1: -0.1914, lambda_2: 528.3075 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.5 0.67 0.53 0.51 0.33] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010011000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101110110000000 10000010110010101001000010000000010000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 loss: 0.034962, lagrangian_loss: 0.000348, attention_score_distillation_loss: 0.000020 loss: 0.009019, lagrangian_loss: 0.000027, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:39:11 Evaluating: accuracy: 0.8863, eval_loss: 0.5116, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5828, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 92500 lambda_1: -0.0542, lambda_2: 531.2956 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.65 0.56 0.5 0.67 0.53 0.51 0.33] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101110110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000100000000000 10011111111111111111011110110000010000000000000000 10001111110111101011111010001101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 10000010110010101001000010000000010000000000000000 Best eval score so far: 0.8850 @ step 81500 epoch 24.89 Saving the best model so far: [Epoch 28 | Step: 92500 | MACs sparsity: 0.589 | Score: 0.8863 | Loss: 0.5116] loss: 0.024062, lagrangian_loss: 0.000332, attention_score_distillation_loss: 0.000020 loss: 0.021907, lagrangian_loss: 0.000013, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:42:24 Evaluating: accuracy: 0.8832, eval_loss: 0.5365, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5828, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 93000 lambda_1: -0.0981, lambda_2: 534.1141 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.51 0.67 0.53 0.51 0.31] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000100000000000 10011111111111111111011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 10000010110010101001000010000000010000000000000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.030566, lagrangian_loss: 0.000105, attention_score_distillation_loss: 0.000020 loss: 0.168717, lagrangian_loss: 0.000140, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:45:16 Evaluating: accuracy: 0.8823, eval_loss: 0.5358, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5828, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 93500 lambda_1: -0.0281, lambda_2: 536.7365 lambda_3: 0.0000 train remain: [1. 1. 
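Checkpointing here is best-score-only: the step-92500 evaluation (0.8863) displaces the step-81500 best (0.8850) and triggers a save, while evaluations from before the sparsity warm-up, when accuracy was higher but sparsity near zero, never did. That suggests the save condition is gated on having roughly reached the target sparsity as well as on the score. An illustrative sketch of such logic, with all names hypothetical:

```python
best_score = float("-inf")

def maybe_save_best(score, macs_sparsity, target_sparsity, save_fn, eps=0.01):
    """Save only when sparsity is (about) on target and the score improves."""
    global best_score
    if macs_sparsity >= target_sparsity - eps and score > best_score:
        best_score = score
        save_fn()  # e.g. trainer.save_model(output_dir)
```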
0.64 0.65 0.56 0.51 0.65 0.53 0.51 0.31] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000100000000000 10011111111111111111011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 10000010110010101001000010000000010000000000000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.021299, lagrangian_loss: 0.000387, attention_score_distillation_loss: 0.000020 loss: 0.054373, lagrangian_loss: 0.000078, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:48:14 Evaluating: accuracy: 0.8816, eval_loss: 0.5277, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5828, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 94000 lambda_1: -0.1772, lambda_2: 539.5947 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.51 0.65 0.53 0.51 0.32] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100111101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000100000 10011111111111111111011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 10000010110010101001000010000000010000000000000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.025179, lagrangian_loss: 0.000466, attention_score_distillation_loss: 0.000020 loss: 0.025246, lagrangian_loss: 0.000568, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:51:11 Evaluating: accuracy: 0.8836, eval_loss: 0.5171, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5828, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 94500 lambda_1: -0.1541, lambda_2: 542.3299 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.65 0.56 0.51 0.66 0.53 0.51 0.3 ] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.2] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110010001000000000000100000 10011111111111111111011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000010110010101001000010000000010000000100000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.021549, lagrangian_loss: 0.000965, attention_score_distillation_loss: 0.000020 loss: 0.088531, lagrangian_loss: 0.000408, attention_score_distillation_loss: 0.000020 ETA: 3:31:26 | Epoch 28 finished. Took 1148.81 seconds. ---------------------------------------------------------------------- time: 2023-07-19 23:54:07 Evaluating: accuracy: 0.8832, eval_loss: 0.5069, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5829, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 95000 lambda_1: -0.0965, lambda_2: 545.4095 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.51 0.66 0.54 0.51 0.27] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111111111011000101100110101010100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000010110010101001000010000000010000000000000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.028516, lagrangian_loss: 0.001028, attention_score_distillation_loss: 0.000020 loss: 0.071162, lagrangian_loss: 0.000041, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:57:01 Evaluating: accuracy: 0.8814, eval_loss: 0.5242, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5829, expected_sequence_sparsity: 0.8366, target_sparsity: 0.58, step: 95500 lambda_1: -0.1236, lambda_2: 548.5181 lambda_3: 0.0000 train remain: [1. 1. 
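The epoch footers make the printed ETA easy to sanity-check: with 40 training epochs in total and epoch 28 just completed (29 done, counting from zero), 11 epochs at roughly 1149 s each come to about 3 h 31 m, matching the logged 3:31:26. A reconstruction under the assumption that the estimate is remaining epochs times the recent epoch duration:

```python
import datetime

def eta_after_epoch(finished_epoch, epoch_seconds, total_epochs=40):
    # Epochs appear to be zero-indexed in this log, so epoch 28 finishing
    # means 29 are done. Assumes future epochs run at the same speed.
    remaining = total_epochs - (finished_epoch + 1)
    return datetime.timedelta(seconds=round(remaining * epoch_seconds))

print(eta_after_epoch(28, 1148.81))  # 3:30:37, close to the logged 3:31:26
```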
0.64 0.65 0.56 0.51 0.66 0.54 0.51 0.24] infer remain: [1.0, 1.0, 0.64, 0.64, 0.56, 0.5, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.41, 0.23, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101110100010001110 11111111111111111111101110011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000010110010101001000010000000010000000000000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.026110, lagrangian_loss: 0.000573, attention_score_distillation_loss: 0.000020 loss: 0.018340, lagrangian_loss: 0.000261, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-19 23:59:56 Evaluating: accuracy: 0.8838, eval_loss: 0.529, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 96000 lambda_1: -0.0759, lambda_2: 551.2899 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.66 0.56 0.51 0.64 0.55 0.51 0.25] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.58, 0.54, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101110100010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111111011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101110010000000 00001111110110111011011010001101010101010110000000 00000000110010101001000010000000010000000100000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.026097, lagrangian_loss: 0.000302, attention_score_distillation_loss: 0.000020 loss: 0.010432, lagrangian_loss: 0.000542, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:02:51 Evaluating: accuracy: 0.8845, eval_loss: 0.5358, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5807, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 96500 lambda_1: -0.1349, lambda_2: 554.2318 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.66 0.56 0.51 0.64 0.54 0.51 0.26] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010101010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010001000000000000100000 10011111111111111111011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101010111000000 00000000110010101001000010000000010000000100000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.025863, lagrangian_loss: -0.000008, attention_score_distillation_loss: 0.000020 loss: 0.024088, lagrangian_loss: 0.000186, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:05:45 Evaluating: accuracy: 0.8814, eval_loss: 0.5225, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 97000 lambda_1: -0.0319, lambda_2: 557.1033 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.65 0.56 0.51 0.64 0.55 0.52 0.27] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.58, 0.54, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010101010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010001000000000000100000 10011111111111111111011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010100000 00001111110110101011011010001101010101010111000000 00000000110010101001000010000000010000000100000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.034251, lagrangian_loss: 0.000234, attention_score_distillation_loss: 0.000020 loss: 0.019184, lagrangian_loss: 0.000692, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:08:38 Evaluating: accuracy: 0.8832, eval_loss: 0.5355, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 97500 lambda_1: -0.3556, lambda_2: 559.8083 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.66 0.56 0.5 0.63 0.54 0.52 0.25] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011001101100110101010100010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111111011110010000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000000110010101001010010000000010000000000000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.027194, lagrangian_loss: 0.000046, attention_score_distillation_loss: 0.000020 loss: 0.042669, lagrangian_loss: 0.000126, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:11:32 Evaluating: accuracy: 0.8838, eval_loss: 0.5163, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 98000 lambda_1: -0.3674, lambda_2: 562.6284 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.66 0.56 0.51 0.63 0.54 0.52 0.23] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011001101100110101010100010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111111011110010000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000000110010101001000010000000010000000100000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.009455, lagrangian_loss: 0.000018, attention_score_distillation_loss: 0.000020 ETA: 3:12:13 | Epoch 29 finished. Took 1154.38 seconds. loss: 0.013330, lagrangian_loss: -0.000005, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:14:26 Evaluating: accuracy: 0.8856, eval_loss: 0.5263, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 98500 lambda_1: -0.2245, lambda_2: 565.5886 lambda_3: 0.0000 train remain: [1. 1. 
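The gap between `train remain` and `infer remain` in the same record (the seventh entries read 0.63 versus 0.58 here, for example) reflects two views of the same gates: during training each bin's stochastic gate contributes its expected open probability, while evaluation binarizes the gates, which is also why infer remain only takes values in multiples of 1/50. A hedged sketch of the eval-time binarization (the sign threshold and logit parameterization are assumptions in the spirit of hard-concrete L0 gates, not the exact code):

```python
import torch

def infer_masks(token_logits, threshold=0.0):
    """Deterministic eval-time keep-masks from per-bin gate logits.

    token_logits: [num_pruned_layers, 50] tensor of gate logits.
    """
    masks = (token_logits > threshold).float()  # one keep-bit per bin
    infer_remain = masks.mean(dim=1)            # multiples of 1/50
    return masks, infer_remain
```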
0.64 0.66 0.56 0.52 0.63 0.54 0.52 0.23] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010101110 11111111111111111111111110011111001001010000000000 11111111111111111111111110011001000000000000000000 10011111111111111011011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000001110010101001000010000000010000000000000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.036448, lagrangian_loss: 0.000187, attention_score_distillation_loss: 0.000020 loss: 0.014051, lagrangian_loss: 0.000138, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:17:20 Evaluating: accuracy: 0.8821, eval_loss: 0.519, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 99000 lambda_1: -0.0611, lambda_2: 568.2416 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.66 0.56 0.51 0.63 0.54 0.52 0.23] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100110001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010001000000000000001000 10011111111111111011011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000000110010101001000010000000010000000100000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.019944, lagrangian_loss: 0.000065, attention_score_distillation_loss: 0.000020 loss: 0.009355, lagrangian_loss: 0.000059, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:20:18 Evaluating: accuracy: 0.8839, eval_loss: 0.5029, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 99500 lambda_1: -0.0753, lambda_2: 571.0054 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.66 0.56 0.51 0.62 0.54 0.52 0.24] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101110110101010100010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111011011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000000110010101001000010000000010000000100000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.025599, lagrangian_loss: 0.000201, attention_score_distillation_loss: 0.000020 loss: 0.018277, lagrangian_loss: 0.000745, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:23:14 Evaluating: accuracy: 0.8827, eval_loss: 0.5216, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 100000 lambda_1: -0.2823, lambda_2: 573.6256 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.66 0.56 0.51 0.62 0.53 0.52 0.24] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100010101110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010101000000000000000000 10011111111111111011011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000000110010101001000010000000010000000100000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.064611, lagrangian_loss: 0.000288, attention_score_distillation_loss: 0.000020 loss: 0.041355, lagrangian_loss: -0.000049, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:26:08 Evaluating: accuracy: 0.8821, eval_loss: 0.5243, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 100500 lambda_1: -0.3691, lambda_2: 576.3730 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.66 0.56 0.5 0.63 0.53 0.52 0.23] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010001010000000000000000 10011111111111111011011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000000110010101001000010000000010000000100000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.015014, lagrangian_loss: 0.000531, attention_score_distillation_loss: 0.000020 loss: 0.006308, lagrangian_loss: 0.000607, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:29:01 Evaluating: accuracy: 0.8861, eval_loss: 0.4942, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 101000 lambda_1: -0.0898, lambda_2: 579.3690 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.66 0.56 0.5 0.63 0.53 0.52 0.24] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010001010000000000000000 10011111111111111111011110010000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101110110000000 00000000110010101001000010000100010000000000000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 loss: 0.016041, lagrangian_loss: -0.000003, attention_score_distillation_loss: 0.000020 loss: 0.015913, lagrangian_loss: 0.000060, attention_score_distillation_loss: 0.000020 ETA: 2:52:52 | Epoch 30 finished. Took 1127.96 seconds. ---------------------------------------------------------------------- time: 2023-07-20 00:31:55 Evaluating: accuracy: 0.8876, eval_loss: 0.504, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 101500 lambda_1: -0.1130, lambda_2: 582.3550 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.66 0.56 0.5 0.63 0.53 0.52 0.24] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010001010000000000000000 10011111111111111111011110010000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000000110010101001000010000100010000000000000000 Best eval score so far: 0.8863 @ step 92500 epoch 28.25 Saving the best model so far: [Epoch 31 | Step: 101500 | MACs sparsity: 0.5858 | Score: 0.8876 | Loss: 0.504] loss: 0.029129, lagrangian_loss: 0.000000, attention_score_distillation_loss: 0.000020 loss: 0.010372, lagrangian_loss: 0.000013, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:35:13 Evaluating: accuracy: 0.8832, eval_loss: 0.5193, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 102000 lambda_1: -0.0899, lambda_2: 585.4129 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.66 0.56 0.5 0.62 0.53 0.52 0.24] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010001010000000000000000 10011111111111111011011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000000110010101001000010000100010000000000000000 Best eval score so far: 0.8876 @ step 101500 epoch 31.00 loss: 0.016138, lagrangian_loss: 0.000039, attention_score_distillation_loss: 0.000020 loss: 0.018942, lagrangian_loss: 0.000091, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:38:09 Evaluating: accuracy: 0.8785, eval_loss: 0.5197, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 102500 lambda_1: -0.0504, lambda_2: 588.0081 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.67 0.56 0.5 0.62 0.53 0.51 0.23] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.5, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011001101100110101010100010001110 11111111111111111111111110011111001001010000000000 11111111111111111111111110010001010000000000000000 10011111111111111111011110010000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 10001111110110101011011010001101010101010110000000 00000000110010101001000010000100010000000000000000 Best eval score so far: 0.8876 @ step 101500 epoch 31.00 loss: 0.017721, lagrangian_loss: 0.000006, attention_score_distillation_loss: 0.000020 loss: 0.104862, lagrangian_loss: -0.000057, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:41:01 Evaluating: accuracy: 0.8823, eval_loss: 0.5236, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 103000 lambda_1: -0.2173, lambda_2: 591.1951 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.67 0.56 0.49 0.62 0.53 0.51 0.22] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010110010001110 11111111111111111111101110011111001001010010000000 11111111111111111111111110010001000000010000000000 10011111111111111011011110110000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101010110000000 00000000110010101001000010000000010000010000000000 Best eval score so far: 0.8876 @ step 101500 epoch 31.00 loss: 0.030129, lagrangian_loss: 0.000093, attention_score_distillation_loss: 0.000020 loss: 0.023923, lagrangian_loss: 0.001481, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:43:57 Evaluating: accuracy: 0.8816, eval_loss: 0.5137, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 103500 lambda_1: -0.2420, lambda_2: 593.8513 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.67 0.56 0.49 0.61 0.53 0.51 0.22] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010110010001110 11111111111111111111101110011111001001010010000000 11111111111111111111111110010001000000000010000000 10011111111111111111011110010000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101010110000000 00000000110010101001000010000100010000000000000000 Best eval score so far: 0.8876 @ step 101500 epoch 31.00 loss: 0.027373, lagrangian_loss: 0.000385, attention_score_distillation_loss: 0.000020 loss: 0.022553, lagrangian_loss: 0.000678, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:46:53 Evaluating: accuracy: 0.8814, eval_loss: 0.5183, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 104000 lambda_1: -0.1379, lambda_2: 596.6783 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.67 0.56 0.49 0.61 0.53 0.52 0.22] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011100101100110101010100010001110 11111111111111111111101110011111001001010010000000 11111111111111111111111110010001000000100000000000 10011111111111111111011110010000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101010110000000 00000000110010101001000010000100010000000000000000 Best eval score so far: 0.8876 @ step 101500 epoch 31.00 loss: 0.019303, lagrangian_loss: -0.000008, attention_score_distillation_loss: 0.000020 loss: 0.010552, lagrangian_loss: 0.000552, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:49:46 Evaluating: accuracy: 0.8814, eval_loss: 0.5319, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 104500 lambda_1: -0.2552, lambda_2: 599.6215 lambda_3: 0.0000 train remain: [1. 1. 
0.64 0.67 0.56 0.49 0.63 0.53 0.52 0.22] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111011000101100110101010100110001110 11111111111111111111101110011111001001010010000000 11111111111111111111111110010001000000000010000000 10011111111111111111011110010000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101010110000000 00000000110010101001000010000100010000000000000000 Best eval score so far: 0.8876 @ step 101500 epoch 31.00 loss: 0.011960, lagrangian_loss: 0.000014, attention_score_distillation_loss: 0.000020 loss: 0.019688, lagrangian_loss: 0.000106, attention_score_distillation_loss: 0.000020 ETA: 2:33:46 | Epoch 31 finished. Took 1178.22 seconds. ---------------------------------------------------------------------- time: 2023-07-20 00:52:37 Evaluating: accuracy: 0.886, eval_loss: 0.5226, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 105000 lambda_1: -0.1972, lambda_2: 602.2722 lambda_3: 0.0000 train remain: [1. 1. 0.64 0.67 0.56 0.49 0.63 0.53 0.53 0.22] infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18] layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0] 11111111111111111111111111111111111111111111111111 11111111111111111111111111111111111111111111111111 11111111111110111111000101100110101010100010001110 11111111111111111111101110111111001001010000000000 11111111111111111111111110010001000000001000000000 10011111111111111111011110010000010000000000000000 10001111110111101011011010011101011101110011000000 10001111110111101011011010011101010101010010000000 00001111110110101011011010001101010101010110000000 00000000110010101001000010000100010000000000000000 Best eval score so far: 0.8876 @ step 101500 epoch 31.00 loss: 0.018185, lagrangian_loss: 0.000124, attention_score_distillation_loss: 0.000020 loss: 0.038352, lagrangian_loss: 0.000876, attention_score_distillation_loss: 0.000020 ---------------------------------------------------------------------- time: 2023-07-20 00:55:33 Evaluating: accuracy: 0.8819, eval_loss: 0.5228, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 105500 lambda_1: -0.0381, lambda_2: 604.9241 lambda_3: 0.0000 train remain: [1. 1. 
----------------------------------------------------------------------
time: 2023-07-20 00:55:33
Evaluating: accuracy: 0.8819, eval_loss: 0.5228, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 105500
lambda_1: -0.0381, lambda_2: 604.9241
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.49 0.63 0.53 0.54 0.23]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010011110
11111111111111111111101110011111001001010010000000
11111111111111111111111110010001000000000010000000
10011111111111111111011110010000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00001111110110101011011010001101010101010110000000
00000000110010101001000010000000010000010000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.022027, lagrangian_loss: -0.000001, attention_score_distillation_loss: 0.000020
loss: 0.017946, lagrangian_loss: 0.000056, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 00:58:28
Evaluating: accuracy: 0.8799, eval_loss: 0.519, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 106000
lambda_1: -0.0027, lambda_2: 607.9390
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.61 0.53 0.53 0.23]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010011110
11111111111111111111101110011111001001010010000000
11111111111111111111111110110001000000000000000000
10011111111111111111011110010000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00001111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.020228, lagrangian_loss: 0.000002, attention_score_distillation_loss: 0.000020
loss: 0.014955, lagrangian_loss: 0.000203, attention_score_distillation_loss: 0.000020
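The layerwise remain vector behaves like a running product of the per-layer infer remain ratios: a token that reaches layer k must have survived every earlier pruned layer, so the expected sequence length decays multiplicatively, with the leading 1.0 entries covering the unpruned front of the encoder. A short check against the numbers logged above (a reading of the log, not code from the repo):

    from itertools import accumulate
    from operator import mul

    infer_remain = [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]

    # Two leading 1.0 entries for the layers in front of the pruned range,
    # then the cumulative product of the per-layer keep ratios.
    layerwise = [1.0, 1.0] + list(accumulate(infer_remain, mul))
    print([round(x, 2) for x in layerwise])
    # [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]

which reproduces the logged layerwise remain to two decimals.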
----------------------------------------------------------------------
time: 2023-07-20 01:01:25
Evaluating: accuracy: 0.8832, eval_loss: 0.5164, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 106500
lambda_1: -0.0252, lambda_2: 610.6922
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.6 0.53 0.54 0.24]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101110100010001110
11111111111111111111101110011111001001010010000000
11111111111111111111111110110001000000000000000000
10011111111111111111011110010000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00001111110110101011011010001101010101010110000000
00000000110010101001000010001000010000000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.043442, lagrangian_loss: 0.000086, attention_score_distillation_loss: 0.000020
loss: 0.007651, lagrangian_loss: 0.001064, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:04:19
Evaluating: accuracy: 0.8854, eval_loss: 0.513, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 107000
lambda_1: -0.1249, lambda_2: 613.4114
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.6 0.53 0.55 0.25]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101110100010001110
11111111111111111111101110011111001001010000000100
11111111111111111111111110110001000000000000000000
10011111111111111111011110010000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00001111110110101011011010001101010101010110000000
00000000110010101001000010001000010000000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.013384, lagrangian_loss: -0.000004, attention_score_distillation_loss: 0.000020
loss: 0.023731, lagrangian_loss: 0.000007, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:07:11
Evaluating: accuracy: 0.8845, eval_loss: 0.5166, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 107500
lambda_1: -0.1067, lambda_2: 615.9699
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.61 0.53 0.54 0.26]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101110100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111011011110110000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00001111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.036126, lagrangian_loss: 0.000340, attention_score_distillation_loss: 0.000020
loss: 0.017057, lagrangian_loss: 0.000863, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:10:07
Evaluating: accuracy: 0.8828, eval_loss: 0.5206, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 108000
lambda_1: -0.1473, lambda_2: 618.8947
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.61 0.53 0.53 0.26]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101110100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111011011110110000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00001111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.029450, lagrangian_loss: 0.000080, attention_score_distillation_loss: 0.000020
ETA: 2:14:33 | Epoch 32 finished. Took 1156.63 seconds.
loss: 0.035288, lagrangian_loss: 0.000073, attention_score_distillation_loss: 0.000020
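The ETA lines are consistent with remaining-epochs-times-epoch-time arithmetic: 2:14:33 is 8073 s, and the ratio against the 1156.63 s epoch time points to seven epochs still to run after epoch 32 (i.e. a 40-epoch schedule). A sketch of that estimate (hypothetical helper; the trainer presumably smooths the per-epoch time, hence the small discrepancy):

    import datetime

    def eta_after_epoch(epoch_just_finished: int,
                        epoch_seconds: float,
                        num_train_epochs: int = 40) -> str:
        # Epochs are 0-indexed in this log, so after "Epoch 32 finished"
        # epochs 33..39 are still to run.
        remaining = num_train_epochs - (epoch_just_finished + 1)
        return str(datetime.timedelta(seconds=int(remaining * epoch_seconds)))

    print(eta_after_epoch(32, 1156.63))  # '2:14:56', close to the logged 2:14:33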
----------------------------------------------------------------------
time: 2023-07-20 01:13:02
Evaluating: accuracy: 0.8812, eval_loss: 0.5171, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 108500
lambda_1: -0.2455, lambda_2: 621.8017
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.6 0.53 0.52 0.25]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101110100010001110
11111111111111111111101110011111101001010000000000
11111111111111111111111110010001000000000000000100
10011111111111111011011110110000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00001111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.090272, lagrangian_loss: 0.000302, attention_score_distillation_loss: 0.000020
loss: 0.312071, lagrangian_loss: -0.000002, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:15:57
Evaluating: accuracy: 0.8816, eval_loss: 0.5111, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 109000
lambda_1: -0.0687, lambda_2: 624.6755
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.6 0.53 0.53 0.28]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010110010001110
11111111111111111111101110011111001001010000000010
11111111111111111111111110010001000000000000000100
10011111111111111011011110110000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00001111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.016343, lagrangian_loss: 0.001062, attention_score_distillation_loss: 0.000020
loss: 0.022479, lagrangian_loss: 0.000448, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:18:51
Evaluating: accuracy: 0.8832, eval_loss: 0.5128, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 109500
lambda_1: -0.0097, lambda_2: 627.6951
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.61 0.54 0.54 0.28]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010110010001110
11111111111111111111101110111111001001010000000000
11111111111111111111111110010001000000000000000100
10011111111111111011011110110000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00001111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.013309, lagrangian_loss: 0.000062, attention_score_distillation_loss: 0.000020
loss: 0.024009, lagrangian_loss: 0.000019, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:21:46
Evaluating: accuracy: 0.8838, eval_loss: 0.487, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 110000
lambda_1: -0.1083, lambda_2: 630.5648
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.62 0.54 0.53 0.32]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111001001010000000010
11111111111111111111111110010001001000000000000000
10011111111111111011011110110000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.019397, lagrangian_loss: 0.000004, attention_score_distillation_loss: 0.000020
loss: 0.012824, lagrangian_loss: 0.000403, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:24:40
Evaluating: accuracy: 0.8839, eval_loss: 0.4972, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 110500
lambda_1: 0.0451, lambda_2: 633.6393
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.62 0.54 0.54 0.33]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111001001010000000010
11111111111111111111111110010001001000000000000000
10011111111111111011011110110000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.033758, lagrangian_loss: 0.003981, attention_score_distillation_loss: 0.000020
loss: 0.185234, lagrangian_loss: 0.000036, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:27:34
Evaluating: accuracy: 0.8808, eval_loss: 0.5088, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 111000
lambda_1: -0.1296, lambda_2: 636.4888
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.62 0.53 0.54 0.33]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111001001010000000010
11111111111111111111111110110001000000000000000000
10011111111111111011011110110000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
loss: 0.017665, lagrangian_loss: 0.000045, attention_score_distillation_loss: 0.000020
loss: 0.056198, lagrangian_loss: 0.002160, attention_score_distillation_loss: 0.000020
ETA: 1:55:15 | Epoch 33 finished. Took 1125.81 seconds.
----------------------------------------------------------------------
time: 2023-07-20 01:30:28
Evaluating: accuracy: 0.8885, eval_loss: 0.484, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 111500
lambda_1: -0.0298, lambda_2: 639.6103
lambda_3: 0.0000
train remain: [1. 1. 0.64 0.67 0.56 0.48 0.61 0.54 0.51 0.35]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101111011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111111011110010000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8876 @ step 101500 epoch 31.00
Saving the best model so far: [Epoch 34 | Step: 111500 | MACs sparsity: 0.5858 | Score: 0.8885 | Loss: 0.484]
loss: 0.009672, lagrangian_loss: 0.000016, attention_score_distillation_loss: 0.000020
loss: 0.014149, lagrangian_loss: 0.000600, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:33:44
Evaluating: accuracy: 0.8852, eval_loss: 0.4942, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 112000
lambda_1: -0.2453, lambda_2: 642.4060
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.48 0.6 0.54 0.51 0.36]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111001001011000000000
11111111111111111111111110110001000000000000000000
10011111111111111111011110010000010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
loss: 0.016415, lagrangian_loss: 0.001111, attention_score_distillation_loss: 0.000020
loss: 0.014697, lagrangian_loss: 0.000108, attention_score_distillation_loss: 0.000020
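The "Saving the best model so far" line fires only on evaluations that beat the previous best score: 0.8885 at step 111500 displaces the 0.8876 incumbent from step 101500, and every later block's "Best eval score so far" line reports the new incumbent. A minimal sketch of that bookkeeping (hypothetical names; the real trainer also writes the checkpoint weights to disk):

    best_score = 0.8876  # incumbent from step 101500

    def maybe_save_best(score, step, epoch, macs_sparsity, loss):
        global best_score
        # Persist a checkpoint only when the eval score improves.
        if score > best_score:
            best_score = score
            print(f"Saving the best model so far: [Epoch {int(epoch)} | "
                  f"Step: {step} | MACs sparsity: {macs_sparsity} | "
                  f"Score: {score} | Loss: {loss}]")

    maybe_save_best(0.8885, 111500, 34.06, 0.5858, 0.484)

which reproduces the save line logged at step 111500.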
----------------------------------------------------------------------
time: 2023-07-20 01:36:36
Evaluating: accuracy: 0.8847, eval_loss: 0.5105, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 112500
lambda_1: -0.1336, lambda_2: 645.3339
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.48 0.61 0.54 0.51 0.38]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111001001010000100000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010000010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00000111110110101011011010011101010101010110000000
00000000110010101001000010000000010001000000000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
loss: 0.012433, lagrangian_loss: 0.000879, attention_score_distillation_loss: 0.000020
loss: 0.015961, lagrangian_loss: 0.000317, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:39:32
Evaluating: accuracy: 0.8854, eval_loss: 0.5173, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 113000
lambda_1: -0.0414, lambda_2: 648.1123
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.48 0.61 0.54 0.51 0.36]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111001001010000100000
11111111111111111111111110010001010000000000000000
10011111111111111011011110010100010000000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
00000111110110101011011010001101010101010111000000
00000000110010101001000010000000010001000100000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
loss: 0.023253, lagrangian_loss: 0.000001, attention_score_distillation_loss: 0.000020
loss: 0.013246, lagrangian_loss: 0.000484, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:42:28
Evaluating: accuracy: 0.886, eval_loss: 0.494, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 113500
lambda_1: -0.0298, lambda_2: 650.7918
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.48 0.61 0.54 0.51 0.36]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111001001010000100000
11111111111111111111111110010101000000000000000000
10011111111111111011011110010000010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000100000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
loss: 0.012123, lagrangian_loss: 0.000108, attention_score_distillation_loss: 0.000020
loss: 0.027925, lagrangian_loss: 0.000008, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:45:23
Evaluating: accuracy: 0.888, eval_loss: 0.4794, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 114000
lambda_1: -0.0279, lambda_2: 653.8177
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.49 0.61 0.54 0.51 0.37]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011010101100110101010100010001110
11111111111111111111101110011111001001010000100000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010000010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000100000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
loss: 0.017866, lagrangian_loss: 0.000185, attention_score_distillation_loss: 0.000020
loss: 0.011997, lagrangian_loss: 0.000330, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:48:18
Evaluating: accuracy: 0.8863, eval_loss: 0.5032, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 114500
lambda_1: -0.2001, lambda_2: 656.7768
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.49 0.61 0.54 0.51 0.35]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011010101100110101010100010001110
11111111111111111111101110011111001001010000100000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010000010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000100000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
loss: 0.051674, lagrangian_loss: 0.000606, attention_score_distillation_loss: 0.000020
ETA: 1:36:06 | Epoch 34 finished. Took 1179.12 seconds.
loss: 0.025404, lagrangian_loss: 0.000522, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:51:12
Evaluating: accuracy: 0.8856, eval_loss: 0.5026, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5842, expected_sparsity: 0.5813, expected_sequence_sparsity: 0.836, target_sparsity: 0.58, step: 115000
lambda_1: -0.0083, lambda_2: 659.4255
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.49 0.6 0.54 0.51 0.36]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.04, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111001001010000100000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010000010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000100000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
loss: 0.006642, lagrangian_loss: 0.001465, attention_score_distillation_loss: 0.000020
loss: 0.034278, lagrangian_loss: 0.001431, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:54:08
Evaluating: accuracy: 0.8882, eval_loss: 0.497, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 115500
lambda_1: -0.0315, lambda_2: 661.9892
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.49 0.62 0.54 0.51 0.36]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111001001010000100000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010000010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000100000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
loss: 0.023288, lagrangian_loss: 0.000097, attention_score_distillation_loss: 0.000020
loss: 0.124069, lagrangian_loss: 0.001244, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 01:57:00
Evaluating: accuracy: 0.8869, eval_loss: 0.5081, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5858, expected_sparsity: 0.5815, expected_sequence_sparsity: 0.8361, target_sparsity: 0.58, step: 116000
lambda_1: -0.0517, lambda_2: 664.7474
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.49 0.62 0.54 0.51 0.39]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.48, 0.58, 0.52, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.11, 0.07, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110010001000000000001000000
10011111111111111011011110010000010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010010000000
10000111110110101011011010001101010101010110000000
00000001110010101001000010000000010001000000000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
loss: 0.021439, lagrangian_loss: 0.000098, attention_score_distillation_loss: 0.000020
loss: 0.005964, lagrangian_loss: 0.000119, attention_score_distillation_loss: 0.000020
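Each evaluation dumps ten rows of 50 bits: one row per pruned layer, one bit per token bin, with 1 marking a kept bin. The per-layer "infer remain" values track the fraction of 1s in the corresponding row (the third row above keeps 32 of 50 bins, i.e. 0.64; the last keeps 10, i.e. 0.2). Collapsing learned stochastic gates into such a deterministic 0/1 mask is usually done with the test-time rule of hard-concrete L0 regularisation (Louizos et al.), which the run's temperature and droprate settings point to; the sketch below uses that standard parameterisation with hypothetical names, not code lifted from this repo:

    import torch

    GAMMA, ZETA = -0.1, 1.1  # standard hard-concrete stretch interval

    def deterministic_mask(token_loga: torch.Tensor) -> torch.Tensor:
        # token_loga: [num_pruned_layers, bin_num] gate location parameters.
        # Test-time gate: stretched sigmoid mean, clamped to [0, 1], binarised.
        s = torch.sigmoid(token_loga) * (ZETA - GAMMA) + GAMMA
        z = s.clamp(0.0, 1.0)
        return (z > 0.5).int()

    mask = deterministic_mask(torch.randn(10, 50))
    print(mask[4])                   # one 50-bin row like those above
    print(mask.float().mean(dim=1))  # per-layer keep ratio ("infer remain")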
----------------------------------------------------------------------
time: 2023-07-20 01:59:55
Evaluating: accuracy: 0.8889, eval_loss: 0.5052, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 116500
lambda_1: -0.0838, lambda_2: 667.7610
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.62 0.54 0.51 0.39]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.58, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111101001010000000000
11111111111111111111111110010001000100000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000001110010101001000010000000010001000000000000
Best eval score so far: 0.8885 @ step 111500 epoch 34.06
Saving the best model so far: [Epoch 35 | Step: 116500 | MACs sparsity: 0.5825 | Score: 0.8889 | Loss: 0.5052]
loss: 0.018273, lagrangian_loss: 0.000901, attention_score_distillation_loss: 0.000020
loss: 0.034840, lagrangian_loss: 0.000091, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:03:16
Evaluating: accuracy: 0.8856, eval_loss: 0.5155, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 117000
lambda_1: -0.1159, lambda_2: 670.6220
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.62 0.55 0.51 0.33]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.58, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101101110101010100010001110
11111111111111111111101110011111101001010000000000
11111111111111111111111110010001000100000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000001110010101001000010000000010001000000000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.052002, lagrangian_loss: 0.000486, attention_score_distillation_loss: 0.000020
loss: 0.275865, lagrangian_loss: 0.000262, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:06:11
Evaluating: accuracy: 0.8849, eval_loss: 0.5017, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 117500
lambda_1: -0.0307, lambda_2: 673.2815
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.62 0.55 0.51 0.33]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.58, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101110100010001110
11111111111111111111101110011111101001010000000000
11111111111111111111111110010001000100000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000100000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.010287, lagrangian_loss: 0.001855, attention_score_distillation_loss: 0.000020
loss: 0.020126, lagrangian_loss: 0.002051, attention_score_distillation_loss: 0.000020
ETA: 1:16:53 | Epoch 35 finished. Took 1151.57 seconds.
----------------------------------------------------------------------
time: 2023-07-20 02:09:04
Evaluating: accuracy: 0.8861, eval_loss: 0.5174, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5805, expected_sequence_sparsity: 0.8357, target_sparsity: 0.58, step: 118000
lambda_1: -0.0687, lambda_2: 676.1065
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.61 0.55 0.52 0.33]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.58, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101110100010001110
11111111111111111111101110011111101001010000000000
11111111111111111111111110010001010000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000100000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.018968, lagrangian_loss: 0.000013, attention_score_distillation_loss: 0.000020
loss: 0.009348, lagrangian_loss: 0.000524, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:11:57
Evaluating: accuracy: 0.8839, eval_loss: 0.5171, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 118500
lambda_1: -0.0222, lambda_2: 678.8206
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.61 0.56 0.51 0.33]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111101110011111011001010000000000
11111111111111111111111110010001100000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010000000010001000100000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.094354, lagrangian_loss: 0.000007, attention_score_distillation_loss: 0.000020
loss: 0.072993, lagrangian_loss: 0.001410, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:14:51
Evaluating: accuracy: 0.8843, eval_loss: 0.5155, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5809, expected_sequence_sparsity: 0.8358, target_sparsity: 0.58, step: 119000
lambda_1: -0.0245, lambda_2: 681.9456
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.61 0.55 0.51 0.31]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111011000101100110101010100010001110
11111111111111111111101110011111101001010000000000
11111111111111111111111110010001100000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010001000010001000000000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.017492, lagrangian_loss: 0.000471, attention_score_distillation_loss: 0.000020
loss: 0.010002, lagrangian_loss: 0.000056, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:17:45
Evaluating: accuracy: 0.8863, eval_loss: 0.5078, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5809, expected_sequence_sparsity: 0.8359, target_sparsity: 0.58, step: 119500
lambda_1: -0.2100, lambda_2: 684.5416
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.61 0.56 0.51 0.27]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110111010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110010011000000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010001000010000000000000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.007061, lagrangian_loss: 0.000066, attention_score_distillation_loss: 0.000020
loss: 0.013536, lagrangian_loss: 0.000733, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:20:38
Evaluating: accuracy: 0.8839, eval_loss: 0.5118, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5825, expected_sparsity: 0.5809, expected_sequence_sparsity: 0.8359, target_sparsity: 0.58, step: 120000
lambda_1: -0.1203, lambda_2: 687.2068
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.61 0.56 0.5 0.26]
infer remain: [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.64, 0.42, 0.24, 0.12, 0.07, 0.04, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110010011000000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000000110010101001000010001000010000000000000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.021576, lagrangian_loss: -0.000003, attention_score_distillation_loss: 0.000020
loss: 0.012608, lagrangian_loss: 0.003057, attention_score_distillation_loss: 0.000020
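The ten booleans in token_prune_loc line up one-to-one with the ten prunable layers (layers 2-11 of the 12-layer encoder, per the run's prune-location setting), and an entry reads True once that layer's deterministic mask actually drops bins; the leading False, False matches the two all-1 rows at the top of each mask dump and the infer remain entries pinned at 1.0. A small illustration of that reading (the exact flip condition is an assumption):

    prune_location = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
    infer_remain = [1.0, 1.0, 0.64, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.18]

    # A layer counts as a pruning location once it keeps < 100% of its bins.
    token_prune_loc = [r < 1.0 for r in infer_remain]
    print(dict(zip(prune_location, token_prune_loc)))
    # {2: False, 3: False, 4: True, 5: True, 6: True, 7: True,
    #  8: True, 9: True, 10: True, 11: True}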
----------------------------------------------------------------------
time: 2023-07-20 02:23:34
Evaluating: accuracy: 0.883, eval_loss: 0.5232, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 120500
lambda_1: -0.0161, lambda_2: 690.1986
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.51 0.61 0.56 0.5 0.3 ]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110010001000000000001000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000001110010101001000010000000010000000000000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.012848, lagrangian_loss: 0.000075, attention_score_distillation_loss: 0.000020
loss: 0.028520, lagrangian_loss: 0.000243, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:26:31
Evaluating: accuracy: 0.8858, eval_loss: 0.5248, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 121000
lambda_1: -0.1056, lambda_2: 693.0524
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.61 0.57 0.5 0.29]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110010001000100000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000001110010101001000010000000010000000000000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.013956, lagrangian_loss: 0.001404, attention_score_distillation_loss: 0.000020
ETA: 0:57:40 | Epoch 36 finished. Took 1154.19 seconds.
loss: 0.010747, lagrangian_loss: 0.000768, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:29:25
Evaluating: accuracy: 0.8836, eval_loss: 0.5295, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 121500
lambda_1: -0.1803, lambda_2: 695.7605
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.61 0.57 0.51 0.29]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111101111011111001001010000000000
11111111111111111111111110010001000100000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000001110010101001000010000000010000000000000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.009487, lagrangian_loss: 0.000044, attention_score_distillation_loss: 0.000020
loss: 0.025625, lagrangian_loss: 0.001445, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:32:19
Evaluating: accuracy: 0.8849, eval_loss: 0.5215, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.589, expected_sparsity: 0.5844, expected_sequence_sparsity: 0.8373, target_sparsity: 0.58, step: 122000
lambda_1: 0.0195, lambda_2: 698.6012
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.61 0.57 0.5 0.32]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.58, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.07, 0.04, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110010001000100000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010011101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000001110010101001000010000100010000000000000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
loss: 0.024159, lagrangian_loss: 0.000060, attention_score_distillation_loss: 0.000020
loss: 0.180402, lagrangian_loss: 0.000060, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:35:13
Evaluating: accuracy: 0.8896, eval_loss: 0.5091, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 122500
lambda_1: -0.1475, lambda_2: 701.2036
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.66 0.56 0.5 0.62 0.57 0.5 0.3 ]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.18]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111101111011111001001010000000000
11111111111111111111111110010001000000000000001000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
00000001110010101001000010000000010000000000000000
Best eval score so far: 0.8889 @ step 116500 epoch 35.58
Saving the best model so far: [Epoch 37 | Step: 122500 | MACs sparsity: 0.5907 | Score: 0.8896 | Loss: 0.5091]
loss: 0.027328, lagrangian_loss: 0.000009, attention_score_distillation_loss: 0.000020
loss: 0.015560, lagrangian_loss: 0.001482, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:38:28
Evaluating: accuracy: 0.8871, eval_loss: 0.5127, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 123000
lambda_1: -0.0213, lambda_2: 703.9400
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.66 0.56 0.5 0.63 0.57 0.5 0.3 ]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111101110011111011001010000000000
11111111111111111111111110010001010000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000001110010101001000010000000010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.008344, lagrangian_loss: 0.000499, attention_score_distillation_loss: 0.000020
loss: 0.019039, lagrangian_loss: 0.000813, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:41:22
Evaluating: accuracy: 0.8876, eval_loss: 0.5088, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 123500
lambda_1: -0.1699, lambda_2: 706.9811
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.66 0.56 0.5 0.62 0.57 0.5 0.29]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111101110011111011001010000000000
11111111111111111111111110010001000000000000010000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000001110010101001000010000000010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.012543, lagrangian_loss: 0.000236, attention_score_distillation_loss: 0.000020
loss: 0.018498, lagrangian_loss: 0.000248, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:44:16
Evaluating: accuracy: 0.888, eval_loss: 0.5014, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 124000
lambda_1: -0.0337, lambda_2: 709.7526
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.66 0.57 0.51 0.62 0.57 0.51 0.29]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111101111011111001001010000000000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000001110010101001000010000000010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.007905, lagrangian_loss: 0.000554, attention_score_distillation_loss: 0.000020
loss: 0.022859, lagrangian_loss: 0.000130, attention_score_distillation_loss: 0.000020
ETA: 0:38:26 | Epoch 37 finished. Took 1145.31 seconds.
----------------------------------------------------------------------
time: 2023-07-20 02:47:10
Evaluating: accuracy: 0.8856, eval_loss: 0.5157, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 124500
lambda_1: -0.0528, lambda_2: 712.5897
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.66 0.57 0.51 0.62 0.58 0.51 0.29]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111101110011111011001010000000000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000001110010101001000010000000010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.018599, lagrangian_loss: 0.000060, attention_score_distillation_loss: 0.000020
loss: 0.010490, lagrangian_loss: 0.000759, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:50:06
Evaluating: accuracy: 0.8856, eval_loss: 0.5223, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 125000
lambda_1: 0.0169, lambda_2: 715.5607
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.66 0.57 0.51 0.62 0.58 0.51 0.33]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111101110011111011001010000000000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010001000010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.016980, lagrangian_loss: 0.000000, attention_score_distillation_loss: 0.000020
loss: 0.011128, lagrangian_loss: 0.000029, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:52:59
Evaluating: accuracy: 0.8845, eval_loss: 0.5246, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 125500
lambda_1: -0.0808, lambda_2: 718.1848
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.66 0.57 0.51 0.61 0.59 0.51 0.31]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111101110011111011001010000000000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010001000010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.004824, lagrangian_loss: 0.000259, attention_score_distillation_loss: 0.000020
loss: 0.011993, lagrangian_loss: 0.000695, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:55:52
Evaluating: accuracy: 0.8882, eval_loss: 0.5184, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 126000
lambda_1: -0.1046, lambda_2: 721.0481
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.66 0.57 0.51 0.63 0.59 0.51 0.28]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001010010000000010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.030306, lagrangian_loss: 0.000542, attention_score_distillation_loss: 0.000020
loss: 0.004088, lagrangian_loss: 0.000430, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 02:58:47
Evaluating: accuracy: 0.888, eval_loss: 0.5171, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 126500
lambda_1: -0.1985, lambda_2: 723.7037
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.66 0.56 0.51 0.63 0.59 0.5 0.27]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110010001001000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010000000010001000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.018511, lagrangian_loss: 0.000958, attention_score_distillation_loss: 0.000020
loss: 0.013215, lagrangian_loss: 0.000503, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 03:01:45
Evaluating: accuracy: 0.8883, eval_loss: 0.5052, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 127000
lambda_1: -0.1221, lambda_2: 726.5182
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.51 0.63 0.58 0.5 0.27]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110010001001000000000000000
10011111111111111011011110110100010000000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010000000010001000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.056806, lagrangian_loss: 0.000014, attention_score_distillation_loss: 0.000020
loss: 0.054186, lagrangian_loss: 0.000159, attention_score_distillation_loss: 0.000020
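Each 50-character row printed with an evaluation is the binarized token mask for one prune location (rows ordered by layer, columns being the 50 sequence bins); a 1 keeps tokens falling in that bin. The per-layer infer remain entries are simply the fraction of ones in the corresponding row, e.g. for the first pruned layer above:

```python
# Mask row for the first pruned layer, copied from the log above.
mask = "11111111111110111011000101100110101010100010001110"
print(mask.count("1") / len(mask))  # 0.62, matching that layer's infer remain
```

The train remain values differ slightly, presumably because they average the soft gate values during training rather than the thresholded 0/1 masks.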
----------------------------------------------------------------------
time: 2023-07-20 03:04:40
Evaluating: accuracy: 0.888, eval_loss: 0.5053, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5856, expected_sequence_sparsity: 0.8377, target_sparsity: 0.58, step: 127500
lambda_1: -0.0592, lambda_2: 729.4135
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.51 0.63 0.58 0.5 0.27]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.48, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111011011110110000010000000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010000000010001000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.015133, lagrangian_loss: 0.000475, attention_score_distillation_loss: 0.000020
ETA: 0:19:13 | Epoch 38 finished. Took 1157.29 seconds.
loss: 0.032904, lagrangian_loss: 0.000816, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 03:07:34
Evaluating: accuracy: 0.8883, eval_loss: 0.5059, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5848, expected_sequence_sparsity: 0.8374, target_sparsity: 0.58, step: 128000
lambda_1: 0.0053, lambda_2: 732.3585
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.57 0.51 0.63 0.58 0.5 0.27]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.5, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111011011110010100010100000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010000000010001000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.028195, lagrangian_loss: 0.002901, attention_score_distillation_loss: 0.000020
loss: 0.016411, lagrangian_loss: 0.000012, attention_score_distillation_loss: 0.000020
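Likewise, the token_prune_loc flags in each Evaluating line can be read off the mask rows: a location is reported True once its row contains any zero, which is why the first two entries stay False (those rows are still all ones). A quick check on the first three rows above:

```python
# First three mask rows from the evaluation above; the remaining seven
# rows all contain zeros as well.
rows = [
    "11111111111111111111111111111111111111111111111111",
    "11111111111111111111111111111111111111111111111111",
    "11111111111110111011000101100110101010100010001110",
]
print([("0" in row) for row in rows])  # [False, False, True] -> token_prune_loc prefix
```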
----------------------------------------------------------------------
time: 2023-07-20 03:10:27
Evaluating: accuracy: 0.8869, eval_loss: 0.5113, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5856, expected_sequence_sparsity: 0.8377, target_sparsity: 0.58, step: 128500
lambda_1: 0.0081, lambda_2: 735.3931
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.57 0.51 0.62 0.58 0.5 0.27]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.48, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111011011110010100010000000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010000100010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.016679, lagrangian_loss: 0.000161, attention_score_distillation_loss: 0.000020
loss: 0.016979, lagrangian_loss: 0.000000, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 03:13:23
Evaluating: accuracy: 0.8894, eval_loss: 0.4967, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5856, expected_sequence_sparsity: 0.8377, target_sparsity: 0.58, step: 129000
lambda_1: -0.0743, lambda_2: 738.4060
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.57 0.5 0.63 0.57 0.5 0.26]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.48, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111011011110010100010000000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010000000010001000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.297057, lagrangian_loss: 0.000963, attention_score_distillation_loss: 0.000020
loss: 0.012158, lagrangian_loss: 0.000028, attention_score_distillation_loss: 0.000020
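The fractional train remain values come from soft gates: there is one trainable gate per mask cell (10 prune locations times 50 bins, matching the rows and columns printed above). How this repository binarizes them at inference is not shown in the log; below is a minimal sketch of the usual deterministic test-time reduction for hard-concrete gates, assuming the standard stretch interval (gamma, zeta) = (-0.1, 1.1) from Louizos et al. (2018), so the details may differ from the actual l0 module.

```python
import torch

def deterministic_token_mask(log_alpha: torch.Tensor,
                             gamma: float = -0.1, zeta: float = 1.1) -> torch.Tensor:
    # Hard-concrete test-time estimate: squash log-alpha, stretch (0, 1)
    # onto (gamma, zeta), clamp back to [0, 1], then threshold to get the
    # kind of 0/1 rows printed in the log.
    s = torch.sigmoid(log_alpha)
    z = torch.clamp(s * (zeta - gamma) + gamma, min=0.0, max=1.0)
    return (z > 0.5).float()

# One gate per mask cell: 10 prune locations x 50 sequence bins.
mask = deterministic_token_mask(torch.randn(10, 50))
```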
----------------------------------------------------------------------
time: 2023-07-20 03:16:20
Evaluating: accuracy: 0.8883, eval_loss: 0.5075, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5856, expected_sequence_sparsity: 0.8377, target_sparsity: 0.58, step: 129500
lambda_1: -0.1297, lambda_2: 740.8897
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.62 0.57 0.5 0.25]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.48, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111011011110010100010000000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010000000010001000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.018539, lagrangian_loss: 0.001272, attention_score_distillation_loss: 0.000020
loss: 0.130012, lagrangian_loss: 0.000784, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 03:19:14
Evaluating: accuracy: 0.8894, eval_loss: 0.5067, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5856, expected_sequence_sparsity: 0.8377, target_sparsity: 0.58, step: 130000
lambda_1: -0.1158, lambda_2: 743.7689
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.56 0.5 0.62 0.57 0.5 0.24]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.48, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111110111011000101100110101010100010001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111011011110010100010000000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010000100010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
loss: 0.005933, lagrangian_loss: 0.000256, attention_score_distillation_loss: 0.000020
loss: 0.022068, lagrangian_loss: 0.000556, attention_score_distillation_loss: 0.000020
----------------------------------------------------------------------
time: 2023-07-20 03:22:09
Evaluating: accuracy: 0.8902, eval_loss: 0.5014, token_prune_loc: [False, False, True, True, True, True, True, True, True, True], macs_sparsity: 0.5907, expected_sparsity: 0.5856, expected_sequence_sparsity: 0.8377, target_sparsity: 0.58, step: 130500
lambda_1: -0.0685, lambda_2: 746.9197
lambda_3: 0.0000
train remain: [1. 1. 0.63 0.67 0.57 0.5 0.61 0.57 0.5 0.28]
infer remain: [1.0, 1.0, 0.62, 0.66, 0.56, 0.48, 0.56, 0.54, 0.48, 0.2]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.62, 0.41, 0.23, 0.11, 0.06, 0.03, 0.02, 0.0]
11111111111111111111111111111111111111111111111111
11111111111111111111111111111111111111111111111111
11111111111111111011000101100110101010100000001110
11111111111111111111111110011111001001010000000000
11111111111111111111111110110001000000000000000000
10011111111111111011011110010100010000000000000000
10001111110111101011011010001101011101110011000000
10001111110111101011011010011101010101010110000000
10000111110110101011011010001101010101010110000000
10000000110010101001000010000100010000000000000000
Best eval score so far: 0.8896 @ step 122500 epoch 37.42
Saving the best model so far: [Epoch 39 | Step: 130500 | MACs sparsity: 0.5907 | Score: 0.8902 | Loss: 0.5014]
loss: 0.019569, lagrangian_loss: 0.000286, attention_score_distillation_loss: 0.000020
loss: 0.027820, lagrangian_loss: 0.000030, attention_score_distillation_loss: 0.000020
ETA: 0:00:00 | Epoch 39 finished. Took 1159.91 seconds.
07/20/2023 03:26:58 - WARNING - urllib3.connectionpool - Retrying (Retry(total=4, connect=5, read=4, redirect=5, status=5)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='southcentralus.api.azureml.ms', port=443): Read timed out. (read timeout=120)")': /mlflow/v2.0/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourceGroups/gcr-singularity-octo/providers/Microsoft.MachineLearningServices/workspaces/msroctows/api/2.0/mlflow/runs/get?run_uuid=8f1ed327-ef83-4836-9c66-d06bcf6f5683&run_id=8f1ed327-ef83-4836-9c66-d06bcf6f5683
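The best checkpoint is replaced at step 130500 because the score of 0.8902 beats the previous best of 0.8896 while the MACs sparsity (0.5907) is at or above the 0.58 target. A hypothetical sketch of that bookkeeping; the helper name and the sparsity-gating condition are illustrative guesses, not taken from the repository:

```python
best_score = None

def maybe_save_best(score, macs_sparsity, target_sparsity, save_checkpoint):
    # Only let a checkpoint compete for "best" once the model has actually
    # reached the requested sparsity; otherwise early, barely-pruned
    # checkpoints with higher raw accuracy would win. (Illustrative logic.)
    global best_score
    if macs_sparsity >= target_sparsity and (best_score is None or score > best_score):
        best_score = score
        save_checkpoint()
```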