CowardCow / training.log

	[2023-12-11 20:12:03,965] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	[2023-12-11 20:12:05,820] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
	[2023-12-11 20:12:05,820] [INFO] [runner.py:570:main] cmd = /home/t-sokumar/miniconda3/envs/ft/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path local/jsonfile --data_split 1,0,0 --model_name_or_path codellama/CodeLlama-7b-hf --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --max_seq_len 512 --learning_rate 9.65e-6 --weight_decay 0. --num_train_epochs 5 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --gradient_checkpointing --zero_stage 3 --deepspeed --lora_dim 128 --lora_module_name layers. --output_dir ./output_step1_Codellama_7b_lora_llamahub-devrev --add_eot_token
	[2023-12-11 20:12:08,529] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	[2023-12-11 20:12:10,776] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
	[2023-12-11 20:12:10,776] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
	[2023-12-11 20:12:10,776] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
	[2023-12-11 20:12:10,776] [INFO] [launch.py:163:main] dist_world_size=4
	[2023-12-11 20:12:10,776] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
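Note on `--world_info`: the value passed on the launcher command line is base64-encoded JSON. A minimal sketch of decoding it follows; it reproduces the WORLD INFO DICT logged above. The helper name `decode_world_info` is hypothetical, and the choice of URL-safe base64 is an assumption (standard base64 gives the same result for this particular string).

```python
import base64
import json


def decode_world_info(encoded: str) -> dict:
    """Decode a base64-encoded JSON mapping of hostname -> list of GPU indices."""
    raw = base64.urlsafe_b64decode(encoded)
    return json.loads(raw)


# Value copied from the --world_info argument in the log.
world_info = decode_world_info("eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119")

# Number of local processes is the number of GPU slots on this host.
num_local_procs = len(world_info["localhost"])
print(world_info, num_local_procs)
```

The decoded mapping lists four GPU slots on a single host, which is consistent with `nnodes=1, num_local_procs=4` and `dist_world_size=4`.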
	[2023-12-11 20:12:14,340] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	[2023-12-11 20:12:14,349] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	[2023-12-11 20:12:14,559] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	[2023-12-11 20:12:14,602] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
	warnings.warn(
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
	warnings.warn(
	[2023-12-11 20:12:15,940] [INFO] [comm.py:637:init_distributed] cdb=None
	[2023-12-11 20:12:15,940] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
	warnings.warn(
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
	warnings.warn(
	[2023-12-11 20:12:16,326] [INFO] [comm.py:637:init_distributed] cdb=None
	[2023-12-11 20:12:16,414] [INFO] [comm.py:637:init_distributed] cdb=None
	[2023-12-11 20:12:16,446] [INFO] [comm.py:637:init_distributed] cdb=None
	The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
	The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
	The class this function is called from is 'LlamaTokenizer'.
	The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
	The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
	The class this function is called from is 'LlamaTokenizer'.
	The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
	The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
	The class this function is called from is 'LlamaTokenizer'.
	The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
	The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
	The class this function is called from is 'LlamaTokenizer'.
	You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
	You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
	You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
	You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
	[2023-12-11 20:12:19,202] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 291, num_elems = 6.74B
	Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.04it/s]
	Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.05it/s]
	Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00, 1.04it/s]
	Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00, 2.22s/it]
	Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
	Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
	Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
	Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
	Detected CUDA files, patching ldflags
	Emitting ninja build file /home/t-sokumar/.cache/torch_extensions/py311_cu121/fused_adam/build.ninja...
	Building extension module fused_adam...
	Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
	ninja: no work to do.
	Loading extension module fused_adam...
	Time to load fused_adam op: 0.10928606986999512 seconds
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
	self._dummy_overflow_buf = get_accelerator().IntTensor([0])
	Loading extension module fused_adam...
	Loading extension module fused_adam...
	Loading extension module fused_adam...
	Time to load fused_adam op: 0.20180773735046387 seconds
	Time to load fused_adam op: 0.2018909454345703 seconds
	Time to load fused_adam op: 0.20151114463806152 seconds
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
	self._dummy_overflow_buf = get_accelerator().IntTensor([0])
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
	self._dummy_overflow_buf = get_accelerator().IntTensor([0])
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
	self._dummy_overflow_buf = get_accelerator().IntTensor([0])
	[2023-12-11 20:12:28,877] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.4, git-hash=unknown, git-branch=unknown
	[2023-12-11 20:12:28,877] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
	[2023-12-11 20:12:28,899] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
	[2023-12-11 20:12:28,901] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
	[2023-12-11 20:12:28,901] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
	[2023-12-11 20:12:28,939] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
	[2023-12-11 20:12:28,939] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
	[2023-12-11 20:12:28,939] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
	[2023-12-11 20:12:28,940] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
	[2023-12-11 20:12:29,054] [INFO] [utils.py:795:see_memory_usage] Stage 3 initialize beginning
	[2023-12-11 20:12:29,055] [INFO] [utils.py:796:see_memory_usage] MA 4.37 GB Max_MA 4.75 GB CA 8.93 GB Max_CA 9 GB
	[2023-12-11 20:12:29,055] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 95.76 GB, percent = 38.1%
	[2023-12-11 20:12:29,057] [INFO] [stage3.py:127:__init__] Reduce bucket size 500,000,000
	[2023-12-11 20:12:29,057] [INFO] [stage3.py:128:__init__] Prefetch bucket size 30000000
	[2023-12-11 20:12:29,164] [INFO] [utils.py:795:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
	[2023-12-11 20:12:29,165] [INFO] [utils.py:796:see_memory_usage] MA 4.37 GB Max_MA 4.37 GB CA 8.93 GB Max_CA 9 GB
	[2023-12-11 20:12:29,165] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 95.77 GB, percent = 38.1%
	Parameter Offload: Total persistent parameters: 266240 in 65 params
	[2023-12-11 20:12:29,482] [INFO] [utils.py:795:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
	[2023-12-11 20:12:29,483] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB Max_MA 4.43 GB CA 8.94 GB Max_CA 9 GB
	[2023-12-11 20:12:29,483] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 95.79 GB, percent = 38.1%
	[2023-12-11 20:12:29,597] [INFO] [utils.py:795:see_memory_usage] Before creating fp16 partitions
	[2023-12-11 20:12:29,598] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB Max_MA 3.54 GB CA 8.94 GB Max_CA 9 GB
	[2023-12-11 20:12:29,598] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 95.78 GB, percent = 38.1%
	[2023-12-11 20:12:30,301] [INFO] [utils.py:795:see_memory_usage] After creating fp16 partitions: 3
	[2023-12-11 20:12:30,301] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB Max_MA 3.54 GB CA 5.46 GB Max_CA 9 GB
	[2023-12-11 20:12:30,348] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 96.3 GB, percent = 38.3%
	[2023-12-11 20:12:30,468] [INFO] [utils.py:795:see_memory_usage] Before creating fp32 partitions
	[2023-12-11 20:12:30,469] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB Max_MA 3.54 GB CA 5.46 GB Max_CA 5 GB
	[2023-12-11 20:12:30,469] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 93.01 GB, percent = 37.0%
	[2023-12-11 20:12:30,579] [INFO] [utils.py:795:see_memory_usage] After creating fp32 partitions
	[2023-12-11 20:12:30,580] [INFO] [utils.py:796:see_memory_usage] MA 4.09 GB Max_MA 4.24 GB CA 6.16 GB Max_CA 6 GB
	[2023-12-11 20:12:30,580] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 93.01 GB, percent = 37.0%
	[2023-12-11 20:12:30,689] [INFO] [utils.py:795:see_memory_usage] Before initializing optimizer states
	[2023-12-11 20:12:30,690] [INFO] [utils.py:796:see_memory_usage] MA 4.09 GB Max_MA 4.09 GB CA 6.16 GB Max_CA 6 GB
	[2023-12-11 20:12:30,690] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 93.01 GB, percent = 37.0%
	[2023-12-11 20:12:30,815] [INFO] [utils.py:795:see_memory_usage] After initializing optimizer states
	[2023-12-11 20:12:30,815] [INFO] [utils.py:796:see_memory_usage] MA 5.17 GB Max_MA 5.47 GB CA 7.54 GB Max_CA 8 GB
	[2023-12-11 20:12:30,815] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 93.02 GB, percent = 37.0%
	[2023-12-11 20:12:30,816] [INFO] [stage3.py:479:_setup_for_real_optimizer] optimizer state initialized
	[2023-12-11 20:12:31,320] [INFO] [utils.py:795:see_memory_usage] After initializing ZeRO optimizer
	[2023-12-11 20:12:31,321] [INFO] [utils.py:796:see_memory_usage] MA 6.38 GB Max_MA 6.86 GB CA 9.23 GB Max_CA 9 GB
	[2023-12-11 20:12:31,321] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 93.01 GB, percent = 37.0%
	[2023-12-11 20:12:31,321] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
	[2023-12-11 20:12:31,322] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
	[2023-12-11 20:12:31,322] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f31e5b4f890>
	[2023-12-11 20:12:31,322] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[9.65e-06, 0.0005, 9.65e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
	[2023-12-11 20:12:31,323] [INFO] [config.py:979:print] DeepSpeedEngine configuration:
	[2023-12-11 20:12:31,323] [INFO] [config.py:983:print] activation_checkpointing_config {
	"partition_activations": false,
	"contiguous_memory_optimization": false,
	"cpu_checkpointing": false,
	"number_checkpoints": null,
	"synchronize_checkpoint_boundary": false,
	"profile": false
	}
	[2023-12-11 20:12:31,323] [INFO] [config.py:983:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
	[2023-12-11 20:12:31,323] [INFO] [config.py:983:print] amp_enabled .................. False
	[2023-12-11 20:12:31,323] [INFO] [config.py:983:print] amp_params ................... False
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] autotuning_config ............ {
	"enabled": false,
	"start_step": null,
	"end_step": null,
	"metric_path": null,
	"arg_mappings": null,
	"metric": "throughput",
	"model_info": null,
	"results_dir": "autotuning_results",
	"exps_dir": "autotuning_exps",
	"overwrite": true,
	"fast": true,
	"start_profile_step": 3,
	"end_profile_step": 5,
	"tuner_type": "gridsearch",
	"tuner_early_stopping": 5,
	"tuner_num_trials": 50,
	"model_info_path": null,
	"mp_size": 1,
	"max_train_batch_size": null,
	"min_train_batch_size": 1,
	"max_train_micro_batch_size_per_gpu": 1.024000e+03,
	"min_train_micro_batch_size_per_gpu": 1,
	"num_tuning_micro_batch_sizes": 3
	}
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] bfloat16_enabled ............. False
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] checkpoint_parallel_write_pipeline False
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] checkpoint_tag_validation_enabled True
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] checkpoint_tag_validation_fail False
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f3193907bd0>
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] communication_data_type ...... None
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] curriculum_enabled_legacy .... False
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] curriculum_params_legacy ..... False
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] data_efficiency_enabled ...... False
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] dataloader_drop_last ......... False
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] disable_allgather ............ False
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] dump_state ................... False
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'consecutive_hysteresis': False, 'min_scale': 1}
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] eigenvalue_enabled ........... False
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] eigenvalue_gas_boundary_resolution 1
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] eigenvalue_layer_name ........ bert.encoder.layer
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] eigenvalue_layer_num ......... 0
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] eigenvalue_max_iter .......... 100
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] eigenvalue_stability ......... 1e-06
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] eigenvalue_tol ............... 0.01
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] eigenvalue_verbose ........... False
	[2023-12-11 20:12:31,324] [INFO] [config.py:983:print] elasticity_enabled ........... False
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] flops_profiler_config ........ {
	"enabled": false,
	"recompute_fwd_factor": 0.0,
	"profile_step": 1,
	"module_depth": -1,
	"top_modules": 1,
	"detailed": true,
	"output_file": null
	}
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] fp16_auto_cast ............... False
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] fp16_enabled ................. True
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] fp16_master_weights_and_gradients False
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] global_rank .................. 0
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] grad_accum_dtype ............. None
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] gradient_accumulation_steps .. 1
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] gradient_clipping ............ 1.0
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] gradient_predivide_factor .... 1.0
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] initial_dynamic_scale ........ 65536
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] load_universal_checkpoint .... False
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] loss_scale ................... 0
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] memory_breakdown ............. False
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] mics_hierarchial_params_gather False
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] mics_shard_size .............. -1
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='step1_tensorboard/ds_tensorboard_logs/', job_name='step1_model_tensorboard') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] nebula_config ................ {
	"enabled": false,
	"persistent_storage_path": null,
	"persistent_time_interval": 100,
	"num_of_version_in_retention": 2,
	"enable_nebula_load": true,
	"load_path": null
	}
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] optimizer_legacy_fusion ...... False
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] optimizer_name ............... None
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] optimizer_params ............. None
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] pld_enabled .................. False
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] pld_params ................... False
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] prescale_gradients ........... False
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] scheduler_name ............... None
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] scheduler_params ............. None
	[2023-12-11 20:12:31,325] [INFO] [config.py:983:print] seq_parallel_communication_data_type torch.float32
	[2023-12-11 20:12:31,326] [INFO] [config.py:983:print] sparse_attention ............. None
	[2023-12-11 20:12:31,326] [INFO] [config.py:983:print] sparse_gradients_enabled ..... False
	[2023-12-11 20:12:31,326] [INFO] [config.py:983:print] steps_per_print .............. 10
	[2023-12-11 20:12:31,326] [INFO] [config.py:983:print] train_batch_size ............. 32
	[2023-12-11 20:12:31,326] [INFO] [config.py:983:print] train_micro_batch_size_per_gpu 8
	[2023-12-11 20:12:31,326] [INFO] [config.py:983:print] use_data_before_expert_parallel_ False
	[2023-12-11 20:12:31,326] [INFO] [config.py:983:print] use_node_local_storage ....... False
	[2023-12-11 20:12:31,326] [INFO] [config.py:983:print] wall_clock_breakdown ......... False
	[2023-12-11 20:12:31,326] [INFO] [config.py:983:print] weight_quantization_config ... None
	[2023-12-11 20:12:31,326] [INFO] [config.py:983:print] world_size ................... 4
	[2023-12-11 20:12:31,326] [INFO] [config.py:983:print] zero_allow_untested_optimizer False
	[2023-12-11 20:12:31,326] [INFO] [config.py:983:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False pipeline_loading_checkpoint=False override_module_apply=True
	[2023-12-11 20:12:31,326] [INFO] [config.py:983:print] zero_enabled ................. True
	[2023-12-11 20:12:31,326] [INFO] [config.py:983:print] zero_force_ds_cpu_optimizer .. True
	[2023-12-11 20:12:31,326] [INFO] [config.py:983:print] zero_optimization_stage ...... 3
	[2023-12-11 20:12:31,326] [INFO] [config.py:969:print_user_config] json = {
	"train_batch_size": 32,
	"train_micro_batch_size_per_gpu": 8,
	"steps_per_print": 10,
	"zero_optimization": {
	"stage": 3,
	"offload_param": {
	"device": "none"
	},
	"offload_optimizer": {
	"device": "none"
	},
	"stage3_param_persistence_threshold": 1.000000e+04,
	"stage3_max_live_parameters": 3.000000e+07,
	"stage3_prefetch_bucket_size": 3.000000e+07,
	"memory_efficient_linear": false
	},
	"fp16": {
	"enabled": true,
	"loss_scale_window": 100
	},
	"gradient_clipping": 1.0,
	"prescale_gradients": false,
	"wall_clock_breakdown": false,
	"hybrid_engine": {
	"enabled": false,
	"max_out_tokens": 512,
	"inference_tp_size": 1,
	"release_inference_cache": false,
	"pin_parameters": true,
	"tp_gather_partition_size": 8
	},
	"tensorboard": {
	"enabled": false,
	"output_path": "step1_tensorboard/ds_tensorboard_logs/",
	"job_name": "step1_model_tensorboard"
	}
	}
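Note on batch sizes: in a DeepSpeed config, `train_batch_size` is the product of `train_micro_batch_size_per_gpu`, `gradient_accumulation_steps`, and the number of data-parallel ranks. A small sketch checking the values printed above; the function name `global_batch_size` is hypothetical.

```python
def global_batch_size(micro_batch_per_gpu: int, grad_accum_steps: int, world_size: int) -> int:
    """Effective number of samples consumed per optimizer step across all ranks."""
    return micro_batch_per_gpu * grad_accum_steps * world_size


# Values from the log: micro batch 8, gradient_accumulation_steps 1, world_size 4.
train_batch_size = global_batch_size(8, 1, 4)
print(train_batch_size)
```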
	*** Running training ***
	*** Evaluating perplexity, Epoch 0/5 ***
	ppl: 4.460639476776123, loss: 1.4952921867370605
	Beginning of Epoch 1/5, Total Micro Batches 13
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
	warnings.warn(
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
	warnings.warn(
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
	warnings.warn(
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
	warnings.warn(
	Model Parameters: 6.927 B, Latency: 4.17s, TFLOPs: 10.04, Samples/sec: 1.92, Time/seq 0.52s, Batch Size: 8, Sequence Length: 512
	Invalidate trace cache @ step 0: expected module 6, but got module 0
	Model Parameters: 6.927 B, Latency: 3.74s, TFLOPs: 11.20, Samples/sec: 2.14, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.76s, TFLOPs: 11.14, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.51, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	[2023-12-11 20:13:11,248] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[9.097325323776738e-06, 0.00047136400641330245, 9.097325323776738e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
	[2023-12-11 20:13:11,248] [INFO] [timer.py:260:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=8.766147695613881, CurrSamplesPerSec=8.809815752797453, MemAllocated=6.88GB, MaxMemAllocated=10.68GB
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.51, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.51, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.24s, TFLOPs: 12.90, Samples/sec: 2.47, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
	*** Evaluating perplexity, Epoch 1/5 ***
	Invalidate trace cache @ step 0: expected module 0, but got module 6
	ppl: 1.6560871601104736, loss: 0.5044576525688171
	Beginning of Epoch 2/5, Total Micro Batches 13
	Model Parameters: 6.927 B, Latency: 3.75s, TFLOPs: 11.15, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.76s, TFLOPs: 11.15, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.49, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.51, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.50, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.50, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	[2023-12-11 20:13:49,353] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[7.565912402977827e-06, 0.00039201618668278893, 7.565912402977827e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
	[2023-12-11 20:13:49,354] [INFO] [timer.py:260:stop] epoch=1/micro_step=7/global_step=20, RunningAvgSamplesPerSec=8.803895836862662, CurrSamplesPerSec=8.791045583607062, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
	Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.50, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.51, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.50, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.50, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.25s, TFLOPs: 12.88, Samples/sec: 2.46, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
	*** Evaluating perplexity, Epoch 2/5 ***
	Invalidate trace cache @ step 0: expected module 0, but got module 6
	ppl: 1.0178232192993164, loss: 0.01766625978052616
	Beginning of Epoch 3/5, Total Micro Batches 13
	Model Parameters: 6.927 B, Latency: 3.76s, TFLOPs: 11.13, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.77s, TFLOPs: 11.09, Samples/sec: 2.12, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.49, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	[2023-12-11 20:14:27,532] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[5.4065894822319335e-06, 0.0002801341700638307, 5.4065894822319335e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
	[2023-12-11 20:14:27,533] [INFO] [timer.py:260:stop] epoch=2/micro_step=4/global_step=30, RunningAvgSamplesPerSec=8.808840107678392, CurrSamplesPerSec=8.779266138519437, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.48, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.49, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.49, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.49, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.48, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.49, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.49, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.25s, TFLOPs: 12.86, Samples/sec: 2.46, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
	*** Evaluating perplexity, Epoch 3/5 ***
	Invalidate trace cache @ step 0: expected module 0, but got module 6
	ppl: 1.0056875944137573, loss: 0.005671397782862186
	Beginning of Epoch 4/5, Total Micro Batches 13
	[2023-12-11 20:15:05,601] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=0, lr=[3.1140314200197657e-06, 0.00016134877823936609, 3.1140314200197657e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
	[2023-12-11 20:15:05,601] [INFO] [timer.py:260:stop] epoch=3/micro_step=1/global_step=40, RunningAvgSamplesPerSec=8.818374436983056, CurrSamplesPerSec=8.49120081099869, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
	Model Parameters: 6.927 B, Latency: 3.77s, TFLOPs: 11.10, Samples/sec: 2.12, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.77s, TFLOPs: 11.09, Samples/sec: 2.12, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.44, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.45, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	[2023-12-11 20:15:42,281] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=0, lr=[1.2134356400744368e-06, 6.28723129572247e-05, 1.2134356400744368e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
	[2023-12-11 20:15:42,281] [INFO] [timer.py:260:stop] epoch=3/micro_step=11/global_step=50, RunningAvgSamplesPerSec=8.800315028679389, CurrSamplesPerSec=8.764479266712412, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.27s, TFLOPs: 12.79, Samples/sec: 2.44, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
	*** Evaluating perplexity, Epoch 4/5 ***
	Invalidate trace cache @ step 0: expected module 0, but got module 6
	ppl: 1.0032395124435425, loss: 0.0032342304475605488
	Beginning of Epoch 5/5, Total Micro Batches 13
	Model Parameters: 6.927 B, Latency: 3.77s, TFLOPs: 11.09, Samples/sec: 2.12, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.79s, TFLOPs: 11.05, Samples/sec: 2.11, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.45, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.45, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.43, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	[2023-12-11 20:16:20,586] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=0, lr=[1.4020573091929905e-07, 7.2645456434869975e-06, 1.4020573091929905e-07], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
	[2023-12-11 20:16:20,586] [INFO] [timer.py:260:stop] epoch=4/micro_step=8/global_step=60, RunningAvgSamplesPerSec=8.798149665169436, CurrSamplesPerSec=8.756539739490163, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
	Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.45, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.45, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.44, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.44, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.28s, TFLOPs: 12.77, Samples/sec: 2.44, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
	*** Evaluating perplexity, Epoch 5/5 ***
	Invalidate trace cache @ step 0: expected module 0, but got module 6
	ppl: 1.003004550933838, loss: 0.0030000172555446625
	saving the final model ...
	[2023-12-11 20:16:53,814] [INFO] [launch.py:347:main] Process 2392412 exits successfully.
	[2023-12-11 20:16:54,182] [INFO] [launch.py:347:main] Process 2392414 exits successfully.
	[2023-12-11 20:16:54,182] [INFO] [launch.py:347:main] Process 2392413 exits successfully.
	[2023-12-11 20:18:58,197] [INFO] [launch.py:347:main] Process 2392411 exits successfully.