CowardCow / training.log

	[2023-12-11 05:39:03,031] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	[2023-12-11 05:39:04,827] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
	[2023-12-11 05:39:04,828] [INFO] [runner.py:570:main] cmd = /home/t-sokumar/miniconda3/envs/ft/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path local/jsonfile --data_split 1,0,0 --model_name_or_path codellama/CodeLlama-7b-hf --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --max_seq_len 512 --learning_rate 9.65e-6 --weight_decay 0. --num_train_epochs 3 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --gradient_checkpointing --zero_stage 3 --deepspeed --lora_dim 128 --lora_module_name layers. --output_dir ./output_step1_Codellama_7b_lora_llamahub-devrev --add_eot_token
	[2023-12-11 05:39:07,364] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	[2023-12-11 05:39:09,159] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
	[2023-12-11 05:39:09,159] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
	[2023-12-11 05:39:09,159] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
	[2023-12-11 05:39:09,159] [INFO] [launch.py:163:main] dist_world_size=4
	[2023-12-11 05:39:09,159] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
	[2023-12-11 05:39:12,594] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	[2023-12-11 05:39:12,600] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	[2023-12-11 05:39:12,605] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	[2023-12-11 05:39:12,606] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
	warnings.warn(
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
	warnings.warn(
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
	warnings.warn(
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
	warnings.warn(
	[2023-12-11 05:39:14,179] [INFO] [comm.py:637:init_distributed] cdb=None
	[2023-12-11 05:39:14,179] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
	[2023-12-11 05:39:14,642] [INFO] [comm.py:637:init_distributed] cdb=None
	[2023-12-11 05:39:14,646] [INFO] [comm.py:637:init_distributed] cdb=None
	[2023-12-11 05:39:14,678] [INFO] [comm.py:637:init_distributed] cdb=None
	The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
	The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
	The class this function is called from is 'LlamaTokenizer'.
	The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
	The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
	The class this function is called from is 'LlamaTokenizer'.
	The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
	The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
	The class this function is called from is 'LlamaTokenizer'.
	You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
	You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
	You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
	The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
	The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
	The class this function is called from is 'LlamaTokenizer'.
	You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
	[2023-12-11 05:39:17,564] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 291, num_elems = 6.74B
	Loading checkpoint shards: 100%|██████████| 2/2 [00:11<00:00, 5.84s/it]
	Loading checkpoint shards: 100%|██████████| 2/2 [00:11<00:00, 5.84s/it]
	Loading checkpoint shards: 100%|██████████| 2/2 [00:11<00:00, 5.84s/it]
	Loading checkpoint shards: 100%|██████████| 2/2 [00:28<00:00, 14.09s/it]
	Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
	Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
	Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
	Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
	Detected CUDA files, patching ldflags
	Emitting ninja build file /home/t-sokumar/.cache/torch_extensions/py311_cu121/fused_adam/build.ninja...
	Building extension module fused_adam...
	Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
	ninja: no work to do.
	Loading extension module fused_adam...
	Time to load fused_adam op: 0.12226700782775879 seconds
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
	self._dummy_overflow_buf = get_accelerator().IntTensor([0])
	Loading extension module fused_adam...
	Loading extension module fused_adam...
	Time to load fused_adam op: 0.2073993682861328 seconds
	Time to load fused_adam op: 0.2075939178466797 seconds
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
	self._dummy_overflow_buf = get_accelerator().IntTensor([0])
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
	self._dummy_overflow_buf = get_accelerator().IntTensor([0])
	Loading extension module fused_adam...
	[2023-12-11 05:39:51,139] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.4, git-hash=unknown, git-branch=unknown
	[2023-12-11 05:39:51,139] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
	Time to load fused_adam op: 0.2015979290008545 seconds
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
	self._dummy_overflow_buf = get_accelerator().IntTensor([0])
	[2023-12-11 05:39:51,163] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
	[2023-12-11 05:39:51,164] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
	[2023-12-11 05:39:51,164] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
	[2023-12-11 05:39:51,206] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
	[2023-12-11 05:39:51,206] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
	[2023-12-11 05:39:51,206] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
	[2023-12-11 05:39:51,206] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
	[2023-12-11 05:39:51,330] [INFO] [utils.py:795:see_memory_usage] Stage 3 initialize beginning
	[2023-12-11 05:39:51,331] [INFO] [utils.py:796:see_memory_usage] MA 4.37 GB Max_MA 4.75 GB CA 8.35 GB Max_CA 8 GB
	[2023-12-11 05:39:51,331] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 98.11 GB, percent = 39.0%
	[2023-12-11 05:39:51,333] [INFO] [stage3.py:127:__init__] Reduce bucket size 500,000,000
	[2023-12-11 05:39:51,333] [INFO] [stage3.py:128:__init__] Prefetch bucket size 30000000
	[2023-12-11 05:39:51,450] [INFO] [utils.py:795:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
	[2023-12-11 05:39:51,450] [INFO] [utils.py:796:see_memory_usage] MA 4.37 GB Max_MA 4.37 GB CA 8.35 GB Max_CA 8 GB
	[2023-12-11 05:39:51,450] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 98.13 GB, percent = 39.0%
	Parameter Offload: Total persistent parameters: 266240 in 65 params
	[2023-12-11 05:39:51,757] [INFO] [utils.py:795:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
	[2023-12-11 05:39:51,758] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB Max_MA 4.43 GB CA 8.35 GB Max_CA 8 GB
	[2023-12-11 05:39:51,758] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 98.13 GB, percent = 39.0%
	[2023-12-11 05:39:51,866] [INFO] [utils.py:795:see_memory_usage] Before creating fp16 partitions
	[2023-12-11 05:39:51,866] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB Max_MA 3.54 GB CA 8.35 GB Max_CA 8 GB
	[2023-12-11 05:39:51,866] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 98.13 GB, percent = 39.0%
	[2023-12-11 05:39:52,568] [INFO] [utils.py:795:see_memory_usage] After creating fp16 partitions: 3
	[2023-12-11 05:39:52,569] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB Max_MA 3.54 GB CA 5.29 GB Max_CA 8 GB
	[2023-12-11 05:39:52,569] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 98.22 GB, percent = 39.0%
	[2023-12-11 05:39:52,708] [INFO] [utils.py:795:see_memory_usage] Before creating fp32 partitions
	[2023-12-11 05:39:52,709] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB Max_MA 3.54 GB CA 5.29 GB Max_CA 5 GB
	[2023-12-11 05:39:52,709] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 96.08 GB, percent = 38.2%
	[2023-12-11 05:39:52,831] [INFO] [utils.py:795:see_memory_usage] After creating fp32 partitions
	[2023-12-11 05:39:52,832] [INFO] [utils.py:796:see_memory_usage] MA 4.08 GB Max_MA 4.23 GB CA 5.99 GB Max_CA 6 GB
	[2023-12-11 05:39:52,832] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 96.11 GB, percent = 38.2%
	[2023-12-11 05:39:52,942] [INFO] [utils.py:795:see_memory_usage] Before initializing optimizer states
	[2023-12-11 05:39:52,942] [INFO] [utils.py:796:see_memory_usage] MA 4.08 GB Max_MA 4.08 GB CA 5.99 GB Max_CA 6 GB
	[2023-12-11 05:39:52,943] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 96.11 GB, percent = 38.2%
	[2023-12-11 05:39:53,083] [INFO] [utils.py:795:see_memory_usage] After initializing optimizer states
	[2023-12-11 05:39:53,084] [INFO] [utils.py:796:see_memory_usage] MA 5.17 GB Max_MA 5.47 GB CA 7.38 GB Max_CA 7 GB
	[2023-12-11 05:39:53,084] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 96.07 GB, percent = 38.2%
	[2023-12-11 05:39:53,084] [INFO] [stage3.py:479:_setup_for_real_optimizer] optimizer state initialized
	[2023-12-11 05:39:53,479] [INFO] [utils.py:795:see_memory_usage] After initializing ZeRO optimizer
	[2023-12-11 05:39:53,480] [INFO] [utils.py:796:see_memory_usage] MA 6.37 GB Max_MA 6.86 GB CA 9.05 GB Max_CA 9 GB
	[2023-12-11 05:39:53,480] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 96.07 GB, percent = 38.2%
	[2023-12-11 05:39:53,480] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
	[2023-12-11 05:39:53,480] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
	[2023-12-11 05:39:53,480] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f53cfda67d0>
	[2023-12-11 05:39:53,480] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[9.65e-06, 0.0005, 9.65e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
	[2023-12-11 05:39:53,482] [INFO] [config.py:979:print] DeepSpeedEngine configuration:
	[2023-12-11 05:39:53,482] [INFO] [config.py:983:print] activation_checkpointing_config {
	    "partition_activations": false,
	    "contiguous_memory_optimization": false,
	    "cpu_checkpointing": false,
	    "number_checkpoints": null,
	    "synchronize_checkpoint_boundary": false,
	    "profile": false
	}
	[2023-12-11 05:39:53,482] [INFO] [config.py:983:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
	[2023-12-11 05:39:53,482] [INFO] [config.py:983:print] amp_enabled .................. False
	[2023-12-11 05:39:53,482] [INFO] [config.py:983:print] amp_params ................... False
	[2023-12-11 05:39:53,482] [INFO] [config.py:983:print] autotuning_config ............ {
	    "enabled": false,
	    "start_step": null,
	    "end_step": null,
	    "metric_path": null,
	    "arg_mappings": null,
	    "metric": "throughput",
	    "model_info": null,
	    "results_dir": "autotuning_results",
	    "exps_dir": "autotuning_exps",
	    "overwrite": true,
	    "fast": true,
	    "start_profile_step": 3,
	    "end_profile_step": 5,
	    "tuner_type": "gridsearch",
	    "tuner_early_stopping": 5,
	    "tuner_num_trials": 50,
	    "model_info_path": null,
	    "mp_size": 1,
	    "max_train_batch_size": null,
	    "min_train_batch_size": 1,
	    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
	    "min_train_micro_batch_size_per_gpu": 1,
	    "num_tuning_micro_batch_sizes": 3
	}
	[2023-12-11 05:39:53,482] [INFO] [config.py:983:print] bfloat16_enabled ............. False
	[2023-12-11 05:39:53,482] [INFO] [config.py:983:print] checkpoint_parallel_write_pipeline False
	[2023-12-11 05:39:53,482] [INFO] [config.py:983:print] checkpoint_tag_validation_enabled True
	[2023-12-11 05:39:53,482] [INFO] [config.py:983:print] checkpoint_tag_validation_fail False
	[2023-12-11 05:39:53,482] [INFO] [config.py:983:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f53cf91c7d0>
	[2023-12-11 05:39:53,482] [INFO] [config.py:983:print] communication_data_type ...... None
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] curriculum_enabled_legacy .... False
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] curriculum_params_legacy ..... False
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] data_efficiency_enabled ...... False
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] dataloader_drop_last ......... False
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] disable_allgather ............ False
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] dump_state ................... False
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'consecutive_hysteresis': False, 'min_scale': 1}
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_enabled ........... False
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_gas_boundary_resolution 1
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_layer_name ........ bert.encoder.layer
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_layer_num ......... 0
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_max_iter .......... 100
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_stability ......... 1e-06
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_tol ............... 0.01
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_verbose ........... False
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] elasticity_enabled ........... False
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] flops_profiler_config ........ {
	    "enabled": false,
	    "recompute_fwd_factor": 0.0,
	    "profile_step": 1,
	    "module_depth": -1,
	    "top_modules": 1,
	    "detailed": true,
	    "output_file": null
	}
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] fp16_auto_cast ............... False
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] fp16_enabled ................. True
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] fp16_master_weights_and_gradients False
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] global_rank .................. 0
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] grad_accum_dtype ............. None
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] gradient_accumulation_steps .. 1
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] gradient_clipping ............ 1.0
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] gradient_predivide_factor .... 1.0
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
	[2023-12-11 05:39:53,483] [INFO] [config.py:983:print] initial_dynamic_scale ........ 65536
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] load_universal_checkpoint .... False
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] loss_scale ................... 0
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] memory_breakdown ............. False
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] mics_hierarchial_params_gather False
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] mics_shard_size .............. -1
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='step1_tensorboard/ds_tensorboard_logs/', job_name='step1_model_tensorboard') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] nebula_config ................ {
	    "enabled": false,
	    "persistent_storage_path": null,
	    "persistent_time_interval": 100,
	    "num_of_version_in_retention": 2,
	    "enable_nebula_load": true,
	    "load_path": null
	}
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] optimizer_legacy_fusion ...... False
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] optimizer_name ............... None
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] optimizer_params ............. None
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] pld_enabled .................. False
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] pld_params ................... False
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] prescale_gradients ........... False
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] scheduler_name ............... None
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] scheduler_params ............. None
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] seq_parallel_communication_data_type torch.float32
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] sparse_attention ............. None
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] sparse_gradients_enabled ..... False
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] steps_per_print .............. 10
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] train_batch_size ............. 32
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] train_micro_batch_size_per_gpu 8
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] use_data_before_expert_parallel_ False
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] use_node_local_storage ....... False
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] wall_clock_breakdown ......... False
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] weight_quantization_config ... None
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] world_size ................... 4
	[2023-12-11 05:39:53,484] [INFO] [config.py:983:print] zero_allow_untested_optimizer False
	[2023-12-11 05:39:53,485] [INFO] [config.py:983:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False pipeline_loading_checkpoint=False override_module_apply=True
	[2023-12-11 05:39:53,485] [INFO] [config.py:983:print] zero_enabled ................. True
	[2023-12-11 05:39:53,485] [INFO] [config.py:983:print] zero_force_ds_cpu_optimizer .. True
	[2023-12-11 05:39:53,485] [INFO] [config.py:983:print] zero_optimization_stage ...... 3
	[2023-12-11 05:39:53,485] [INFO] [config.py:969:print_user_config] json = {
	    "train_batch_size": 32,
	    "train_micro_batch_size_per_gpu": 8,
	    "steps_per_print": 10,
	    "zero_optimization": {
	        "stage": 3,
	        "offload_param": {
	            "device": "none"
	        },
	        "offload_optimizer": {
	            "device": "none"
	        },
	        "stage3_param_persistence_threshold": 1.000000e+04,
	        "stage3_max_live_parameters": 3.000000e+07,
	        "stage3_prefetch_bucket_size": 3.000000e+07,
	        "memory_efficient_linear": false
	    },
	    "fp16": {
	        "enabled": true,
	        "loss_scale_window": 100
	    },
	    "gradient_clipping": 1.0,
	    "prescale_gradients": false,
	    "wall_clock_breakdown": false,
	    "hybrid_engine": {
	        "enabled": false,
	        "max_out_tokens": 512,
	        "inference_tp_size": 1,
	        "release_inference_cache": false,
	        "pin_parameters": true,
	        "tp_gather_partition_size": 8
	    },
	    "tensorboard": {
	        "enabled": false,
	        "output_path": "step1_tensorboard/ds_tensorboard_logs/",
	        "job_name": "step1_model_tensorboard"
	    }
	}
	*** Running training ***
	*** Evaluating perplexity, Epoch 0/3 ***
	ppl: 4.460639476776123, loss: 1.4952921867370605
	Beginning of Epoch 1/3, Total Micro Batches 13
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
	warnings.warn(
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
	warnings.warn(
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
	warnings.warn(
	/home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
	warnings.warn(
	Model Parameters: 6.927 B, Latency: 4.07s, TFLOPs: 10.30, Samples/sec: 1.97, Time/seq 0.51s, Batch Size: 8, Sequence Length: 512
	Invalidate trace cache @ step 0: expected module 6, but got module 0
	Model Parameters: 6.927 B, Latency: 3.74s, TFLOPs: 11.21, Samples/sec: 2.14, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.75s, TFLOPs: 11.17, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.73s, TFLOPs: 11.24, Samples/sec: 2.15, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.62s, TFLOPs: 11.55, Samples/sec: 2.21, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.62s, TFLOPs: 11.57, Samples/sec: 2.21, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.55, Samples/sec: 2.21, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.62s, TFLOPs: 11.56, Samples/sec: 2.21, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	[2023-12-11 05:40:33,349] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[8.167395005683819e-06, 0.00042318108837739987, 8.167395005683819e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
	[2023-12-11 05:40:33,349] [INFO] [timer.py:260:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=8.762510054419295, CurrSamplesPerSec=8.82404309088149, MemAllocated=6.88GB, MaxMemAllocated=10.68GB
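The `lr` values logged at step 10 are consistent with a plain cosine decay over 39 optimizer steps (3 epochs x 13 micro-batches, gradient accumulation 1, no warmup, base rate 9.65e-6 from the launch command). The sketch below checks the logged numbers under that assumption; it is not the training script's own scheduler. `cosine_lr` is a hypothetical helper, and the 5e-4 base rate for the second parameter group is inferred from the logged value, not stated in the log.

```python
import math

def cosine_lr(base_lr, step, total_steps, warmup_steps=0):
    """Cosine decay to zero with optional linear warmup (assumed schedule shape)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 3 * 13  # epochs x micro-batches per epoch, as reported in the log

# Base rate from --learning_rate; logged value at step 10: 8.167395005683819e-06
lr_step10 = cosine_lr(9.65e-6, 10, TOTAL_STEPS)

# ASSUMPTION: 5e-4 is inferred for the second (likely LoRA) parameter group;
# logged value at step 10: 0.00042318108837739987
lora_lr_step10 = cosine_lr(5e-4, 10, TOTAL_STEPS)

# Logged value at step 20: 4.6307168389720735e-06
lr_step20 = cosine_lr(9.65e-6, 20, TOTAL_STEPS)
```

If the values agree to several significant digits, the schedule shape and the total step count used here are likely the ones in effect for this run.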
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.54, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.62s, TFLOPs: 11.55, Samples/sec: 2.21, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.62s, TFLOPs: 11.56, Samples/sec: 2.21, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.24s, TFLOPs: 12.92, Samples/sec: 2.47, Time/seq 0.40s, Batch Size: 8, Sequence Length: 512
	*** Evaluating perplexity, Epoch 1/3 ***
	Invalidate trace cache @ step 0: expected module 0, but got module 6
	ppl: 1.7355413436889648, loss: 0.551319420337677
	Beginning of Epoch 2/3, Total Micro Batches 13
	Model Parameters: 6.927 B, Latency: 3.75s, TFLOPs: 11.17, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.76s, TFLOPs: 11.14, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.54, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	[2023-12-11 05:41:11,363] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[4.6307168389720735e-06, 0.0002399335149726463, 4.6307168389720735e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
	[2023-12-11 05:41:11,363] [INFO] [timer.py:260:stop] epoch=1/micro_step=7/global_step=20, RunningAvgSamplesPerSec=8.813860487355969, CurrSamplesPerSec=8.815178915300866, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
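The per-step `Samples/sec` figure (about 2.2) and the timer's `CurrSamplesPerSec` figure (about 8.8) appear to differ by the world size of 4: the first looks like a per-rank rate, the second an aggregate across ranks. The sketch below reproduces both from the logged batch size and latency; `samples_per_sec` is a hypothetical helper and the per-rank versus aggregate reading is an interpretation of the numbers, not something the log states.

```python
def samples_per_sec(batch_size, latency_s, world_size=1):
    """Throughput in samples/second for `world_size` ranks running in lockstep."""
    return batch_size * world_size / latency_s

# Values taken from the log: per-device batch size 8, latency 3.63 s, 4 ranks.
per_rank = samples_per_sec(8, 3.63)                 # compare with "Samples/sec: 2.20"
aggregate = samples_per_sec(8, 3.63, world_size=4)  # compare with CurrSamplesPerSec=8.815...
```

Because the logged latency is rounded to two decimals, agreement is only expected to within a small rounding error.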
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.24s, TFLOPs: 12.90, Samples/sec: 2.47, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
	*** Evaluating perplexity, Epoch 2/3 ***
	Invalidate trace cache @ step 0: expected module 0, but got module 6
	ppl: 1.0645378828048706, loss: 0.0625406950712204
	Beginning of Epoch 3/3, Total Micro Batches 13
	Model Parameters: 6.927 B, Latency: 3.75s, TFLOPs: 11.16, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.77s, TFLOPs: 11.11, Samples/sec: 2.12, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	[2023-12-11 05:41:49,440] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[1.2134356400744368e-06, 6.28723129572247e-05, 1.2134356400744368e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
	[2023-12-11 05:41:49,441] [INFO] [timer.py:260:stop] epoch=2/micro_step=4/global_step=30, RunningAvgSamplesPerSec=8.8235289391519, CurrSamplesPerSec=8.771353693343526, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.51, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.48, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.50, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
	Model Parameters: 6.927 B, Latency: 3.26s, TFLOPs: 12.83, Samples/sec: 2.45, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
	*** Evaluating perplexity, Epoch 3/3 ***
	Invalidate trace cache @ step 0: expected module 0, but got module 6
	ppl: 1.031400442123413, loss: 0.03091755136847496
	saving the final model ...
	[2023-12-11 05:42:35,188] [INFO] [launch.py:347:main] Process 1247715 exits successfully.
	[2023-12-11 05:42:35,188] [INFO] [launch.py:347:main] Process 1247716 exits successfully.
	[2023-12-11 05:42:35,189] [INFO] [launch.py:347:main] Process 1247717 exits successfully.
	[2023-12-11 05:44:14,200] [INFO] [launch.py:347:main] Process 1247714 exits successfully.
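Each `ppl` / `loss` pair reported after the three evaluation passes above is consistent with perplexity being the exponential of the mean loss. The sketch below checks that relation against the logged values; `perplexity` is a hypothetical helper, and small differences are expected if the logged numbers were computed in float32.

```python
import math

def perplexity(mean_loss):
    """Perplexity as exp of the mean per-token cross-entropy loss."""
    return math.exp(mean_loss)

# (loss, ppl) pairs copied from the evaluation lines for epochs 1 to 3.
logged = [
    (0.551319420337677, 1.7355413436889648),
    (0.0625406950712204, 1.0645378828048706),
    (0.03091755136847496, 1.031400442123413),
]

# Largest relative gap between recomputed and logged perplexity.
max_rel_error = max(abs(perplexity(loss) - ppl) / ppl for loss, ppl in logged)
```

A perplexity this close to 1.0 by epoch 3 means the evaluation loss is near zero. Since the launch command uses `--data_split 1,0,0`, it may be worth confirming which split the evaluation pass actually reads before treating these numbers as held-out performance.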