Downloading (…)lve/main/config.json: 100%|██████████| 718/718 [00:00<00:00, 180kB/s]
Downloading (…)olve/main/vocab.json: 100%|██████████| 1.04M/1.04M [00:00<00:00, 1.34MB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████| 456k/456k [00:01<00:00, 283kB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 1.36M/1.36M [00:03<00:00, 386kB/s]
Downloading (…)"pytorch_model.bin";: 100%|██████████| 1.52G/1.52G [06:17<00:00, 4.03MB/s]
Downloading (…)neration_config.json: 100%|██████████| 124/124 [00:00<00:00, 24.8kB/s]
Found cached dataset common_gen (C:/Users/Jingqian/.cache/huggingface/datasets/common_gen/default/2020.5.30/1a9e8bdc026c41ce7a9e96260debf7d2809cb7fd63fa02b017e4fac1b00c6b23)
100%|██████████| 3/3 [00:00<00:00, 749.61it/s]
100%|██████████| 68/68 [00:01<00:00, 65.32ba/s]
100%|██████████| 5/5 [00:00<00:00, 84.73ba/s]
100%|██████████| 2/2 [00:00<00:00, 133.33ba/s]
100%|██████████| 68/68 [00:03<00:00, 22.30ba/s]
100%|██████████| 5/5 [00:00<00:00, 25.37ba/s]
100%|██████████| 2/2 [00:00<00:00, 76.81ba/s]
C:\Users\Jingqian\anaconda3\lib\site-packages\transformers\optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 4592
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 2870
  Number of trainable parameters = 354823168
 17%|█▋        | 500/2870 [03:25<16:23, 2.41it/s]{'loss': 2.4535, 'learning_rate': 4.128919860627178e-05, 'epoch': 0.87}
 20%|██        | 574/2870 [03:55<15:39, 2.44it/s]***** Running Evaluation *****
  Num examples = 297
  Batch size = 8
  0%|          | 0/38 [00:00
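
For context, the log above is consistent with fine-tuning a GPT-2 medium checkpoint on the CommonGen dataset with the Hugging Face Trainer: 354,823,168 trainable parameters and the ~1.52 GB pytorch_model.bin download match the gpt2-medium weights, "common_gen" is named in the cache path, and the run uses 5 epochs at a per-device batch size of 8. Below is a minimal sketch of a script that would produce a log of this shape. Everything beyond those log facts is an assumption: the "gpt2-medium" model name, the concepts-to-sentence formatting in preprocess(), and the output path are illustrative, and the raw CommonGen splits would print different example counts than the 4592 train / 297 eval shown (the original run evidently filtered or regrouped the data).

# A minimal sketch, not the author's actual script.
# Assumptions are marked in the comments below.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2-medium"  # assumption: inferred from the 354,823,168-parameter count
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("common_gen")  # named in the cache path in the log

def preprocess(batch):
    # Hypothetical formatting: join the concept words, append the target
    # sentence, and train the model to generate the sentence from the concepts.
    texts = [
        " ".join(concepts) + " = " + target + tokenizer.eos_token
        for concepts, target in zip(batch["concepts"], batch["target"])
    ]
    return tokenizer(texts, truncation=True, max_length=128)

tokenized = dataset.map(
    preprocess, batched=True, remove_columns=dataset["train"].column_names
)

training_args = TrainingArguments(
    output_dir="commongen-gpt2",    # hypothetical output path
    num_train_epochs=5,             # matches "Num Epochs = 5"
    per_device_train_batch_size=8,  # matches "Instantaneous batch size per device = 8"
    per_device_eval_batch_size=8,   # matches the evaluation "Batch size = 8"
    evaluation_strategy="epoch",    # the log evaluates at step 574, the end of epoch 1
    logging_steps=500,              # the loss is logged at step 500
    optim="adamw_torch",            # use torch.optim.AdamW, avoiding the FutureWarning
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()

Note that the FutureWarning in the log comes from the Trainer's legacy transformers.optimization.AdamW default in older library versions; passing optim="adamw_torch" (as above) switches to the PyTorch implementation the warning recommends.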