aaaacash/CowardCow

aaaacash committed on Dec 12, 2023

Commit ccf4931 (1 parent: 28c7f0e)

Upload folder using huggingface_hub

Files changed (1)

training.log +209 -209

@@ -1,29 +1,29 @@
-[2023-12-11 18:39:50,465] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
-[2023-12-11 18:39:52,336] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
-[2023-12-11 18:39:52,336] [INFO] [runner.py:570:main] cmd = /home/t-sokumar/miniconda3/envs/ft/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path local/jsonfile --data_split 1,0,0 --model_name_or_path codellama/CodeLlama-7b-hf --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --max_seq_len 512 --learning_rate 9.65e-6 --weight_decay 0. --num_train_epochs 5 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --gradient_checkpointing --zero_stage 3 --deepspeed --lora_dim 128 --lora_module_name layers. --output_dir ./output_step1_Codellama_7b_lora_llamahub-devrev --add_eot_token
-[2023-12-11 18:39:54,950] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
-[2023-12-11 18:39:57,147] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
-[2023-12-11 18:39:57,147] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
-[2023-12-11 18:39:57,147] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
-[2023-12-11 18:39:57,147] [INFO] [launch.py:163:main] dist_world_size=4
-[2023-12-11 18:39:57,147] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
-[2023-12-11 18:40:00,872] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
-[2023-12-11 18:40:00,873] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
-[2023-12-11 18:40:00,878] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
-[2023-12-11 18:40:00,879] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
   warnings.warn(
 /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
   warnings.warn(
 /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
   warnings.warn(
 /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
   warnings.warn(
-[2023-12-11 18:40:02,568] [INFO] [comm.py:637:init_distributed] cdb=None
-[2023-12-11 18:40:02,568] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
-[2023-12-11 18:40:02,810] [INFO] [comm.py:637:init_distributed] cdb=None
-[2023-12-11 18:40:02,842] [INFO] [comm.py:637:init_distributed] cdb=None
-[2023-12-11 18:40:02,862] [INFO] [comm.py:637:init_distributed] cdb=None
 The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
 The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
 The class this function is called from is 'LlamaTokenizer'.
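
Note on the launcher arguments in the hunk above: the `--world_info` value looks like base64-encoded JSON, and decoding it reproduces the `WORLD INFO DICT` that the launcher prints a few lines later. The sketch below is not part of the training code. It assumes URL-safe base64 and uses the hypothetical helper names `decode_world_info` and `effective_batch_size`. It also applies the usual DeepSpeed relation that the global batch size is the per-GPU micro-batch size times the gradient accumulation steps times the number of ranks.

```python
import base64
import json

# Copied verbatim from the --world_info flag in the launcher command above.
WORLD_INFO_B64 = "eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119"


def decode_world_info(encoded: str) -> dict:
    """Decode base64-encoded JSON mapping each hostname to its GPU indices."""
    return json.loads(base64.urlsafe_b64decode(encoded).decode("utf-8"))


def effective_batch_size(micro_batch_per_gpu: int, grad_accum_steps: int, world_size: int) -> int:
    """Samples consumed per optimizer step across all ranks."""
    return micro_batch_per_gpu * grad_accum_steps * world_size


world_info = decode_world_info(WORLD_INFO_B64)
world_size = sum(len(gpus) for gpus in world_info.values())

# 8 and 1 come from --per_device_train_batch_size and --gradient_accumulation_steps.
train_batch_size = effective_batch_size(8, 1, world_size)

print(world_info)
print(world_size, train_batch_size)
```

The decoded mapping and rank count agree with the `WORLD INFO DICT` and `dist_world_size=4` lines above, and the product agrees with the `train_batch_size` of 32 that the DeepSpeedEngine configuration reports later in the log.
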
@@ -40,11 +40,11 @@ You are using the default legacy behaviour of the <class 'transformers.models.ll
 You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
 You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
 You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
-[2023-12-11 18:40:05,507] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 291, num_elems = 6.74B
 Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
 Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
 Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
@@ -55,70 +55,70 @@ Building extension module fused_adam...
 Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
 ninja: no work to do.
 Loading extension module fused_adam...
-Loading extension module fused_adam...
-Time to load fused_adam op: 0.10220003128051758 secondsTime to load fused_adam op: 0.11474394798278809 seconds
 /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
   self._dummy_overflow_buf = get_accelerator().IntTensor([0])
 /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
   self._dummy_overflow_buf = get_accelerator().IntTensor([0])
-[2023-12-11 18:40:15,841] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.4, git-hash=unknown, git-branch=unknown
-[2023-12-11 18:40:15,842] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
-[2023-12-11 18:40:15,862] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
-[2023-12-11 18:40:15,864] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
-[2023-12-11 18:40:15,864] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
-[2023-12-11 18:40:15,906] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
-[2023-12-11 18:40:15,906] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
-[2023-12-11 18:40:15,906] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
-[2023-12-11 18:40:15,906] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
-Loading extension module fused_adam...
-Time to load fused_adam op: 0.20157265663146973 seconds
 /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
   self._dummy_overflow_buf = get_accelerator().IntTensor([0])
-Loading extension module fused_adam...
-Time to load fused_adam op: 0.20161700248718262 seconds
 /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
   self._dummy_overflow_buf = get_accelerator().IntTensor([0])
-[2023-12-11 18:40:16,029] [INFO] [utils.py:795:see_memory_usage] Stage 3 initialize beginning
-[2023-12-11 18:40:16,030] [INFO] [utils.py:796:see_memory_usage] MA 4.37 GB         Max_MA 4.75 GB         CA 8.09 GB         Max_CA 8 GB
-[2023-12-11 18:40:16,030] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 95.88 GB, percent = 38.1%
-[2023-12-11 18:40:16,032] [INFO] [stage3.py:127:__init__] Reduce bucket size 500,000,000
-[2023-12-11 18:40:16,032] [INFO] [stage3.py:128:__init__] Prefetch bucket size 30000000
-[2023-12-11 18:40:16,138] [INFO] [utils.py:795:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
-[2023-12-11 18:40:16,139] [INFO] [utils.py:796:see_memory_usage] MA 4.37 GB         Max_MA 4.37 GB         CA 8.09 GB         Max_CA 8 GB
-[2023-12-11 18:40:16,139] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 95.91 GB, percent = 38.1%
 Parameter Offload: Total persistent parameters: 266240 in 65 params
-[2023-12-11 18:40:16,505] [INFO] [utils.py:795:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
-[2023-12-11 18:40:16,506] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB         Max_MA 4.43 GB         CA 8.1 GB         Max_CA 8 GB
-[2023-12-11 18:40:16,506] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 95.88 GB, percent = 38.1%
-[2023-12-11 18:40:16,620] [INFO] [utils.py:795:see_memory_usage] Before creating fp16 partitions
-[2023-12-11 18:40:16,621] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB         Max_MA 3.54 GB         CA 8.1 GB         Max_CA 8 GB
-[2023-12-11 18:40:16,621] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 95.88 GB, percent = 38.1%
-[2023-12-11 18:40:17,385] [INFO] [utils.py:795:see_memory_usage] After creating fp16 partitions: 3
-[2023-12-11 18:40:17,386] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB         Max_MA 3.54 GB         CA 4.96 GB         Max_CA 8 GB
-[2023-12-11 18:40:17,386] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 96.0 GB, percent = 38.2%
-[2023-12-11 18:40:17,512] [INFO] [utils.py:795:see_memory_usage] Before creating fp32 partitions
-[2023-12-11 18:40:17,513] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB         Max_MA 3.54 GB         CA 4.96 GB         Max_CA 5 GB
-[2023-12-11 18:40:17,513] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 93.83 GB, percent = 37.3%
-[2023-12-11 18:40:17,659] [INFO] [utils.py:795:see_memory_usage] After creating fp32 partitions
-[2023-12-11 18:40:17,659] [INFO] [utils.py:796:see_memory_usage] MA 4.09 GB         Max_MA 4.23 GB         CA 5.78 GB         Max_CA 6 GB
-[2023-12-11 18:40:17,659] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 93.78 GB, percent = 37.3%
-[2023-12-11 18:40:17,777] [INFO] [utils.py:795:see_memory_usage] Before initializing optimizer states
-[2023-12-11 18:40:17,778] [INFO] [utils.py:796:see_memory_usage] MA 4.09 GB         Max_MA 4.09 GB         CA 5.78 GB         Max_CA 6 GB
-[2023-12-11 18:40:17,778] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 93.78 GB, percent = 37.3%
-[2023-12-11 18:40:17,916] [INFO] [utils.py:795:see_memory_usage] After initializing optimizer states
-[2023-12-11 18:40:17,916] [INFO] [utils.py:796:see_memory_usage] MA 5.17 GB         Max_MA 5.47 GB         CA 7.16 GB         Max_CA 7 GB
-[2023-12-11 18:40:17,917] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 93.78 GB, percent = 37.3%
-[2023-12-11 18:40:17,917] [INFO] [stage3.py:479:_setup_for_real_optimizer] optimizer state initialized
-[2023-12-11 18:40:18,316] [INFO] [utils.py:795:see_memory_usage] After initializing ZeRO optimizer
-[2023-12-11 18:40:18,317] [INFO] [utils.py:796:see_memory_usage] MA 6.38 GB         Max_MA 6.86 GB         CA 8.85 GB         Max_CA 9 GB
-[2023-12-11 18:40:18,317] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 93.09 GB, percent = 37.0%
-[2023-12-11 18:40:18,318] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
-[2023-12-11 18:40:18,318] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
-[2023-12-11 18:40:18,318] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f41c409efd0>
-[2023-12-11 18:40:18,318] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[9.65e-06, 0.0005, 9.65e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
-[2023-12-11 18:40:18,319] [INFO] [config.py:979:print] DeepSpeedEngine configuration:
-[2023-12-11 18:40:18,319] [INFO] [config.py:983:print]   activation_checkpointing_config  {
     "partition_activations": false,
     "contiguous_memory_optimization": false,
     "cpu_checkpointing": false,
@@ -126,10 +126,10 @@ Parameter Offload: Total persistent parameters: 266240 in 65 params
     "synchronize_checkpoint_boundary": false,
     "profile": false
 }
-[2023-12-11 18:40:18,319] [INFO] [config.py:983:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   amp_enabled .................. False
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   amp_params ................... False
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   autotuning_config ............ {
     "enabled": false,
     "start_step": null,
     "end_step": null,
@@ -154,31 +154,31 @@ Parameter Offload: Total persistent parameters: 266240 in 65 params
     "min_train_micro_batch_size_per_gpu": 1,
     "num_tuning_micro_batch_sizes": 3
 }
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   bfloat16_enabled ............. False
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   checkpoint_parallel_write_pipeline  False
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   checkpoint_tag_validation_enabled  True
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   checkpoint_tag_validation_fail  False
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f429937da10>
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   communication_data_type ...... None
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   curriculum_enabled_legacy .... False
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   curriculum_params_legacy ..... False
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   data_efficiency_enabled ...... False
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   dataloader_drop_last ......... False
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   disable_allgather ............ False
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   dump_state ................... False
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'consecutive_hysteresis': False, 'min_scale': 1}
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   eigenvalue_enabled ........... False
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   eigenvalue_gas_boundary_resolution  1
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   eigenvalue_layer_name ........ bert.encoder.layer
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   eigenvalue_layer_num ......... 0
-[2023-12-11 18:40:18,320] [INFO] [config.py:983:print]   eigenvalue_max_iter .......... 100
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   eigenvalue_stability ......... 1e-06
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   eigenvalue_tol ............... 0.01
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   eigenvalue_verbose ........... False
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   elasticity_enabled ........... False
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   flops_profiler_config ........ {
     "enabled": false,
     "recompute_fwd_factor": 0.0,
     "profile_step": 1,
@@ -187,23 +187,23 @@ Parameter Offload: Total persistent parameters: 266240 in 65 params
     "detailed": true,
     "output_file": null
 }
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   fp16_auto_cast ............... False
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   fp16_enabled ................. True
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   fp16_master_weights_and_gradients  False
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   global_rank .................. 0
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   grad_accum_dtype ............. None
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   gradient_accumulation_steps .. 1
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   gradient_clipping ............ 1.0
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   gradient_predivide_factor .... 1.0
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   initial_dynamic_scale ........ 65536
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   load_universal_checkpoint .... False
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   loss_scale ................... 0
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   memory_breakdown ............. False
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   mics_hierarchial_params_gather  False
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   mics_shard_size .............. -1
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='step1_tensorboard/ds_tensorboard_logs/', job_name='step1_model_tensorboard') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   nebula_config ................ {
     "enabled": false,
     "persistent_storage_path": null,
     "persistent_time_interval": 100,
@@ -211,32 +211,32 @@ Parameter Offload: Total persistent parameters: 266240 in 65 params
     "enable_nebula_load": true,
     "load_path": null
 }
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   optimizer_legacy_fusion ...... False
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   optimizer_name ............... None
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   optimizer_params ............. None
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   pld_enabled .................. False
-[2023-12-11 18:40:18,321] [INFO] [config.py:983:print]   pld_params ................... False
-[2023-12-11 18:40:18,322] [INFO] [config.py:983:print]   prescale_gradients ........... False
-[2023-12-11 18:40:18,322] [INFO] [config.py:983:print]   scheduler_name ............... None
-[2023-12-11 18:40:18,322] [INFO] [config.py:983:print]   scheduler_params ............. None
-[2023-12-11 18:40:18,322] [INFO] [config.py:983:print]   seq_parallel_communication_data_type  torch.float32
-[2023-12-11 18:40:18,322] [INFO] [config.py:983:print]   sparse_attention ............. None
-[2023-12-11 18:40:18,322] [INFO] [config.py:983:print]   sparse_gradients_enabled ..... False
-[2023-12-11 18:40:18,322] [INFO] [config.py:983:print]   steps_per_print .............. 10
-[2023-12-11 18:40:18,322] [INFO] [config.py:983:print]   train_batch_size ............. 32
-[2023-12-11 18:40:18,322] [INFO] [config.py:983:print]   train_micro_batch_size_per_gpu  8
-[2023-12-11 18:40:18,322] [INFO] [config.py:983:print]   use_data_before_expert_parallel_  False
-[2023-12-11 18:40:18,322] [INFO] [config.py:983:print]   use_node_local_storage ....... False
-[2023-12-11 18:40:18,322] [INFO] [config.py:983:print]   wall_clock_breakdown ......... False
-[2023-12-11 18:40:18,322] [INFO] [config.py:983:print]   weight_quantization_config ... None
-[2023-12-11 18:40:18,322] [INFO] [config.py:983:print]   world_size ................... 4
-[2023-12-11 18:40:18,322] [INFO] [config.py:983:print]   zero_allow_untested_optimizer  False
-[2023-12-11 18:40:18,322] [INFO] [config.py:983:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False pipeline_loading_checkpoint=False override_module_apply=True
-[2023-12-11 18:40:18,322] [INFO] [config.py:983:print]   zero_enabled ................. True
-[2023-12-11 18:40:18,322] [INFO] [config.py:983:print]   zero_force_ds_cpu_optimizer .. True
-[2023-12-11 18:40:18,322] [INFO] [config.py:983:print]   zero_optimization_stage ...... 3
-[2023-12-11 18:40:18,322] [INFO] [config.py:969:print_user_config]   json = {
     "train_batch_size": 32,
     "train_micro_batch_size_per_gpu": 8,
     "steps_per_print": 10,
@@ -286,105 +286,105 @@ Beginning of Epoch 1/5, Total Micro Batches 13
   warnings.warn(
 /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
   warnings.warn(
-Model Parameters: 6.927 B, Latency: 4.17s, TFLOPs: 10.05, Samples/sec: 1.92, Time/seq 0.52s, Batch Size: 8, Sequence Length: 512
 Invalidate trace cache @ step 0: expected module 6, but got module 0
-Model Parameters: 6.927 B, Latency: 3.75s, TFLOPs: 11.17, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.76s, TFLOPs: 11.13, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 11.38, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.49, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.49, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.49, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
-[2023-12-11 18:40:58,482] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[9.097325323776738e-06, 0.00047136400641330245, 9.097325323776738e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
-[2023-12-11 18:40:58,482] [INFO] [timer.py:260:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=8.742200576698968, CurrSamplesPerSec=8.808332763470538, MemAllocated=6.88GB, MaxMemAllocated=10.68GB
 Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.51, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.50, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.24s, TFLOPs: 12.90, Samples/sec: 2.47, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
 ***** Evaluating perplexity, Epoch 1/5 *****
 Invalidate trace cache @ step 0: expected module 0, but got module 6
 ppl: 1.6560871601104736, loss: 0.5044576525688171
 Beginning of Epoch 2/5, Total Micro Batches 13
-Model Parameters: 6.927 B, Latency: 3.76s, TFLOPs: 11.12, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.78s, TFLOPs: 11.09, Samples/sec: 2.12, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.48, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.49, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.50, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-[2023-12-11 18:41:36,640] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[7.565912402977827e-06, 0.00039201618668278893, 7.565912402977827e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
-[2023-12-11 18:41:36,641] [INFO] [timer.py:260:stop] epoch=1/micro_step=7/global_step=20, RunningAvgSamplesPerSec=8.786215139346892, CurrSamplesPerSec=8.784209255946163, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
-Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.49, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.49, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.49, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.49, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.25s, TFLOPs: 12.89, Samples/sec: 2.46, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
 ***** Evaluating perplexity, Epoch 2/5 *****
 Invalidate trace cache @ step 0: expected module 0, but got module 6
 ppl: 1.0178232192993164, loss: 0.01766625978052616
 Beginning of Epoch 3/5, Total Micro Batches 13
-Model Parameters: 6.927 B, Latency: 3.77s, TFLOPs: 11.12, Samples/sec: 2.12, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.78s, TFLOPs: 11.07, Samples/sec: 2.12, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-[2023-12-11 18:42:14,847] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[5.4065894822319335e-06, 0.0002801341700638307, 5.4065894822319335e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
-[2023-12-11 18:42:14,847] [INFO] [timer.py:260:stop] epoch=2/micro_step=4/global_step=30, RunningAvgSamplesPerSec=8.794852545546625, CurrSamplesPerSec=8.779833541428898, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
-Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.48, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.48, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.48, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.48, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.26s, TFLOPs: 12.85, Samples/sec: 2.46, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
 ***** Evaluating perplexity, Epoch 3/5 *****
 Invalidate trace cache @ step 0: expected module 0, but got module 6
 ppl: 1.0056875944137573, loss: 0.005671397782862186
 Beginning of Epoch 4/5, Total Micro Batches 13
-[2023-12-11 18:42:52,948] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=0, lr=[3.1140314200197657e-06, 0.00016134877823936609, 3.1140314200197657e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
-[2023-12-11 18:42:52,948] [INFO] [timer.py:260:stop] epoch=3/micro_step=1/global_step=40, RunningAvgSamplesPerSec=8.805863440078602, CurrSamplesPerSec=8.48791662450673, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
 Model Parameters: 6.927 B, Latency: 3.77s, TFLOPs: 11.10, Samples/sec: 2.12, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.78s, TFLOPs: 11.06, Samples/sec: 2.11, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.44, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.44, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.44, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.43, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-[2023-12-11 18:43:29,661] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=0, lr=[1.2134356400744368e-06, 6.28723129572247e-05, 1.2134356400744368e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
-[2023-12-11 18:43:29,661] [INFO] [timer.py:260:stop] epoch=3/micro_step=11/global_step=50, RunningAvgSamplesPerSec=8.788842827422865, CurrSamplesPerSec=8.75888550529353, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
-Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.45, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.45, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.27s, TFLOPs: 12.81, Samples/sec: 2.45, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
 ***** Evaluating perplexity, Epoch 4/5 *****
 Invalidate trace cache @ step 0: expected module 0, but got module 6
 ppl: 1.0032395124435425, loss: 0.0032342304475605488
 Beginning of Epoch 5/5, Total Micro Batches 13
 Model Parameters: 6.927 B, Latency: 3.79s, TFLOPs: 11.05, Samples/sec: 2.11, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.79s, TFLOPs: 11.05, Samples/sec: 2.11, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.45, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.44, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.44, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 11.42, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 11.40, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-[2023-12-11 18:44:08,017] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=0, lr=[1.4020573091929905e-07, 7.2645456434869975e-06, 1.4020573091929905e-07], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
-[2023-12-11 18:44:08,018] [INFO] [timer.py:260:stop] epoch=4/micro_step=8/global_step=60, RunningAvgSamplesPerSec=8.7865790752149, CurrSamplesPerSec=8.748600897610435, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
-Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.44, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.43, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 11.40, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 11.39, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 11.38, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
-Model Parameters: 6.927 B, Latency: 3.27s, TFLOPs: 12.81, Samples/sec: 2.45, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
 ***** Evaluating perplexity, Epoch 5/5 *****
 Invalidate trace cache @ step 0: expected module 0, but got module 6
 ppl: 1.003004550933838, loss: 0.0030000172555446625
 saving the final model ...
-[2023-12-11 18:44:41,184] [INFO] [launch.py:347:main] Process 2269765 exits successfully.
-[2023-12-11 18:44:42,473] [INFO] [launch.py:347:main] Process 2269766 exits successfully.
-[2023-12-11 18:44:42,474] [INFO] [launch.py:347:main] Process 2269767 exits successfully.
-[2023-12-11 18:46:44,489] [INFO] [launch.py:347:main] Process 2269764 exits successfully.

+[2023-12-11 20:12:03,965] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
+[2023-12-11 20:12:05,820] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
+[2023-12-11 20:12:05,820] [INFO] [runner.py:570:main] cmd = /home/t-sokumar/miniconda3/envs/ft/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path local/jsonfile --data_split 1,0,0 --model_name_or_path codellama/CodeLlama-7b-hf --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --max_seq_len 512 --learning_rate 9.65e-6 --weight_decay 0. --num_train_epochs 5 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --gradient_checkpointing --zero_stage 3 --deepspeed --lora_dim 128 --lora_module_name layers. --output_dir ./output_step1_Codellama_7b_lora_llamahub-devrev --add_eot_token
+[2023-12-11 20:12:08,529] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
+[2023-12-11 20:12:10,776] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
+[2023-12-11 20:12:10,776] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
+[2023-12-11 20:12:10,776] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
+[2023-12-11 20:12:10,776] [INFO] [launch.py:163:main] dist_world_size=4
+[2023-12-11 20:12:10,776] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
+[2023-12-11 20:12:14,340] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
+[2023-12-11 20:12:14,349] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
+[2023-12-11 20:12:14,559] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
+[2023-12-11 20:12:14,602] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
   warnings.warn(
 /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
   warnings.warn(
+[2023-12-11 20:12:15,940] [INFO] [comm.py:637:init_distributed] cdb=None
+[2023-12-11 20:12:15,940] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
 /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
   warnings.warn(
 /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
   warnings.warn(
+[2023-12-11 20:12:16,326] [INFO] [comm.py:637:init_distributed] cdb=None
+[2023-12-11 20:12:16,414] [INFO] [comm.py:637:init_distributed] cdb=None
+[2023-12-11 20:12:16,446] [INFO] [comm.py:637:init_distributed] cdb=None
 The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
 The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
 The class this function is called from is 'LlamaTokenizer'.
 You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
 You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
 You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
+[2023-12-11 20:12:19,202] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 291, num_elems = 6.74B
 Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
 Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
 Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
 Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
 ninja: no work to do.
 Loading extension module fused_adam...
+Time to load fused_adam op: 0.10928606986999512 seconds
 /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
   self._dummy_overflow_buf = get_accelerator().IntTensor([0])
+Loading extension module fused_adam...
+Loading extension module fused_adam...
+Loading extension module fused_adam...
+Time to load fused_adam op: 0.20180773735046387 seconds
+Time to load fused_adam op: 0.2018909454345703 seconds
+Time to load fused_adam op: 0.20151114463806152 seconds
 /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
   self._dummy_overflow_buf = get_accelerator().IntTensor([0])
 /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
   self._dummy_overflow_buf = get_accelerator().IntTensor([0])
 /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
   self._dummy_overflow_buf = get_accelerator().IntTensor([0])
+[2023-12-11 20:12:28,877] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.4, git-hash=unknown, git-branch=unknown
+[2023-12-11 20:12:28,877] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
+[2023-12-11 20:12:28,899] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
+[2023-12-11 20:12:28,901] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
+[2023-12-11 20:12:28,901] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
+[2023-12-11 20:12:28,939] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
+[2023-12-11 20:12:28,939] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
+[2023-12-11 20:12:28,939] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
+[2023-12-11 20:12:28,940] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
+[2023-12-11 20:12:29,054] [INFO] [utils.py:795:see_memory_usage] Stage 3 initialize beginning
+[2023-12-11 20:12:29,055] [INFO] [utils.py:796:see_memory_usage] MA 4.37 GB         Max_MA 4.75 GB         CA 8.93 GB         Max_CA 9 GB
+[2023-12-11 20:12:29,055] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 95.76 GB, percent = 38.1%
+[2023-12-11 20:12:29,057] [INFO] [stage3.py:127:__init__] Reduce bucket size 500,000,000
+[2023-12-11 20:12:29,057] [INFO] [stage3.py:128:__init__] Prefetch bucket size 30000000
+[2023-12-11 20:12:29,164] [INFO] [utils.py:795:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
+[2023-12-11 20:12:29,165] [INFO] [utils.py:796:see_memory_usage] MA 4.37 GB         Max_MA 4.37 GB         CA 8.93 GB         Max_CA 9 GB
+[2023-12-11 20:12:29,165] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 95.77 GB, percent = 38.1%
 Parameter Offload: Total persistent parameters: 266240 in 65 params
+[2023-12-11 20:12:29,482] [INFO] [utils.py:795:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
+[2023-12-11 20:12:29,483] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB         Max_MA 4.43 GB         CA 8.94 GB         Max_CA 9 GB
+[2023-12-11 20:12:29,483] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 95.79 GB, percent = 38.1%
+[2023-12-11 20:12:29,597] [INFO] [utils.py:795:see_memory_usage] Before creating fp16 partitions
+[2023-12-11 20:12:29,598] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB         Max_MA 3.54 GB         CA 8.94 GB         Max_CA 9 GB
+[2023-12-11 20:12:29,598] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 95.78 GB, percent = 38.1%
+[2023-12-11 20:12:30,301] [INFO] [utils.py:795:see_memory_usage] After creating fp16 partitions: 3
+[2023-12-11 20:12:30,301] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB         Max_MA 3.54 GB         CA 5.46 GB         Max_CA 9 GB
+[2023-12-11 20:12:30,348] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 96.3 GB, percent = 38.3%
+[2023-12-11 20:12:30,468] [INFO] [utils.py:795:see_memory_usage] Before creating fp32 partitions
+[2023-12-11 20:12:30,469] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB         Max_MA 3.54 GB         CA 5.46 GB         Max_CA 5 GB
+[2023-12-11 20:12:30,469] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 93.01 GB, percent = 37.0%
+[2023-12-11 20:12:30,579] [INFO] [utils.py:795:see_memory_usage] After creating fp32 partitions
+[2023-12-11 20:12:30,580] [INFO] [utils.py:796:see_memory_usage] MA 4.09 GB         Max_MA 4.24 GB         CA 6.16 GB         Max_CA 6 GB
+[2023-12-11 20:12:30,580] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 93.01 GB, percent = 37.0%
+[2023-12-11 20:12:30,689] [INFO] [utils.py:795:see_memory_usage] Before initializing optimizer states
+[2023-12-11 20:12:30,690] [INFO] [utils.py:796:see_memory_usage] MA 4.09 GB         Max_MA 4.09 GB         CA 6.16 GB         Max_CA 6 GB
+[2023-12-11 20:12:30,690] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 93.01 GB, percent = 37.0%
+[2023-12-11 20:12:30,815] [INFO] [utils.py:795:see_memory_usage] After initializing optimizer states
+[2023-12-11 20:12:30,815] [INFO] [utils.py:796:see_memory_usage] MA 5.17 GB         Max_MA 5.47 GB         CA 7.54 GB         Max_CA 8 GB
+[2023-12-11 20:12:30,815] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 93.02 GB, percent = 37.0%
+[2023-12-11 20:12:30,816] [INFO] [stage3.py:479:_setup_for_real_optimizer] optimizer state initialized
+[2023-12-11 20:12:31,320] [INFO] [utils.py:795:see_memory_usage] After initializing ZeRO optimizer
+[2023-12-11 20:12:31,321] [INFO] [utils.py:796:see_memory_usage] MA 6.38 GB         Max_MA 6.86 GB         CA 9.23 GB         Max_CA 9 GB
+[2023-12-11 20:12:31,321] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory:  used = 93.01 GB, percent = 37.0%
+[2023-12-11 20:12:31,321] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
+[2023-12-11 20:12:31,322] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
+[2023-12-11 20:12:31,322] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f31e5b4f890>
+[2023-12-11 20:12:31,322] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[9.65e-06, 0.0005, 9.65e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+[2023-12-11 20:12:31,323] [INFO] [config.py:979:print] DeepSpeedEngine configuration:
+[2023-12-11 20:12:31,323] [INFO] [config.py:983:print]   activation_checkpointing_config  {
     "partition_activations": false,
     "contiguous_memory_optimization": false,
     "cpu_checkpointing": false,
     "synchronize_checkpoint_boundary": false,
     "profile": false
 }
+[2023-12-11 20:12:31,323] [INFO] [config.py:983:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
+[2023-12-11 20:12:31,323] [INFO] [config.py:983:print]   amp_enabled .................. False
+[2023-12-11 20:12:31,323] [INFO] [config.py:983:print]   amp_params ................... False
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   autotuning_config ............ {
     "enabled": false,
     "start_step": null,
     "end_step": null,
     "min_train_micro_batch_size_per_gpu": 1,
     "num_tuning_micro_batch_sizes": 3
 }
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   bfloat16_enabled ............. False
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   checkpoint_parallel_write_pipeline  False
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   checkpoint_tag_validation_enabled  True
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   checkpoint_tag_validation_fail  False
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f3193907bd0>
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   communication_data_type ...... None
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   curriculum_enabled_legacy .... False
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   curriculum_params_legacy ..... False
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   data_efficiency_enabled ...... False
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   dataloader_drop_last ......... False
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   disable_allgather ............ False
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   dump_state ................... False
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'consecutive_hysteresis': False, 'min_scale': 1}
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   eigenvalue_enabled ........... False
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   eigenvalue_gas_boundary_resolution  1
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   eigenvalue_layer_name ........ bert.encoder.layer
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   eigenvalue_layer_num ......... 0
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   eigenvalue_max_iter .......... 100
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   eigenvalue_stability ......... 1e-06
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   eigenvalue_tol ............... 0.01
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   eigenvalue_verbose ........... False
+[2023-12-11 20:12:31,324] [INFO] [config.py:983:print]   elasticity_enabled ........... False
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   flops_profiler_config ........ {
     "enabled": false,
     "recompute_fwd_factor": 0.0,
     "profile_step": 1,
     "detailed": true,
     "output_file": null
 }
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   fp16_auto_cast ............... False
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   fp16_enabled ................. True
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   fp16_master_weights_and_gradients  False
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   global_rank .................. 0
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   grad_accum_dtype ............. None
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   gradient_accumulation_steps .. 1
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   gradient_clipping ............ 1.0
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   gradient_predivide_factor .... 1.0
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   initial_dynamic_scale ........ 65536
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   load_universal_checkpoint .... False
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   loss_scale ................... 0
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   memory_breakdown ............. False
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   mics_hierarchial_params_gather  False
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   mics_shard_size .............. -1
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='step1_tensorboard/ds_tensorboard_logs/', job_name='step1_model_tensorboard') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   nebula_config ................ {
     "enabled": false,
     "persistent_storage_path": null,
     "persistent_time_interval": 100,
     "enable_nebula_load": true,
     "load_path": null
 }
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   optimizer_legacy_fusion ...... False
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   optimizer_name ............... None
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   optimizer_params ............. None
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   pld_enabled .................. False
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   pld_params ................... False
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   prescale_gradients ........... False
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   scheduler_name ............... None
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   scheduler_params ............. None
+[2023-12-11 20:12:31,325] [INFO] [config.py:983:print]   seq_parallel_communication_data_type  torch.float32
+[2023-12-11 20:12:31,326] [INFO] [config.py:983:print]   sparse_attention ............. None
+[2023-12-11 20:12:31,326] [INFO] [config.py:983:print]   sparse_gradients_enabled ..... False
+[2023-12-11 20:12:31,326] [INFO] [config.py:983:print]   steps_per_print .............. 10
+[2023-12-11 20:12:31,326] [INFO] [config.py:983:print]   train_batch_size ............. 32
+[2023-12-11 20:12:31,326] [INFO] [config.py:983:print]   train_micro_batch_size_per_gpu  8
+[2023-12-11 20:12:31,326] [INFO] [config.py:983:print]   use_data_before_expert_parallel_  False
+[2023-12-11 20:12:31,326] [INFO] [config.py:983:print]   use_node_local_storage ....... False
+[2023-12-11 20:12:31,326] [INFO] [config.py:983:print]   wall_clock_breakdown ......... False
+[2023-12-11 20:12:31,326] [INFO] [config.py:983:print]   weight_quantization_config ... None
+[2023-12-11 20:12:31,326] [INFO] [config.py:983:print]   world_size ................... 4
+[2023-12-11 20:12:31,326] [INFO] [config.py:983:print]   zero_allow_untested_optimizer  False
+[2023-12-11 20:12:31,326] [INFO] [config.py:983:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False pipeline_loading_checkpoint=False override_module_apply=True
+[2023-12-11 20:12:31,326] [INFO] [config.py:983:print]   zero_enabled ................. True
+[2023-12-11 20:12:31,326] [INFO] [config.py:983:print]   zero_force_ds_cpu_optimizer .. True
+[2023-12-11 20:12:31,326] [INFO] [config.py:983:print]   zero_optimization_stage ...... 3
+[2023-12-11 20:12:31,326] [INFO] [config.py:969:print_user_config]   json = {
     "train_batch_size": 32,
     "train_micro_batch_size_per_gpu": 8,
     "steps_per_print": 10,
   warnings.warn(
 /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
   warnings.warn(
+Model Parameters: 6.927 B, Latency: 4.17s, TFLOPs: 10.04, Samples/sec: 1.92, Time/seq 0.52s, Batch Size: 8, Sequence Length: 512
 Invalidate trace cache @ step 0: expected module 6, but got module 0
+Model Parameters: 6.927 B, Latency: 3.74s, TFLOPs: 11.20, Samples/sec: 2.14, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.76s, TFLOPs: 11.14, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.51, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
+[2023-12-11 20:13:11,248] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[9.097325323776738e-06, 0.00047136400641330245, 9.097325323776738e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+[2023-12-11 20:13:11,248] [INFO] [timer.py:260:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=8.766147695613881, CurrSamplesPerSec=8.809815752797453, MemAllocated=6.88GB, MaxMemAllocated=10.68GB
+Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.51, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.51, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.24s, TFLOPs: 12.90, Samples/sec: 2.47, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
 ***** Evaluating perplexity, Epoch 1/5 *****
 Invalidate trace cache @ step 0: expected module 0, but got module 6
 ppl: 1.6560871601104736, loss: 0.5044576525688171
 Beginning of Epoch 2/5, Total Micro Batches 13
+Model Parameters: 6.927 B, Latency: 3.75s, TFLOPs: 11.15, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.76s, TFLOPs: 11.15, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.49, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.51, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.50, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.50, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+[2023-12-11 20:13:49,353] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[7.565912402977827e-06, 0.00039201618668278893, 7.565912402977827e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+[2023-12-11 20:13:49,354] [INFO] [timer.py:260:stop] epoch=1/micro_step=7/global_step=20, RunningAvgSamplesPerSec=8.803895836862662, CurrSamplesPerSec=8.791045583607062, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
+Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.50, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.51, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.50, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.50, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.25s, TFLOPs: 12.88, Samples/sec: 2.46, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
 ***** Evaluating perplexity, Epoch 2/5 *****
 Invalidate trace cache @ step 0: expected module 0, but got module 6
 ppl: 1.0178232192993164, loss: 0.01766625978052616
 Beginning of Epoch 3/5, Total Micro Batches 13
+Model Parameters: 6.927 B, Latency: 3.76s, TFLOPs: 11.13, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.77s, TFLOPs: 11.09, Samples/sec: 2.12, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.49, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+[2023-12-11 20:14:27,532] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[5.4065894822319335e-06, 0.0002801341700638307, 5.4065894822319335e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+[2023-12-11 20:14:27,533] [INFO] [timer.py:260:stop] epoch=2/micro_step=4/global_step=30, RunningAvgSamplesPerSec=8.808840107678392, CurrSamplesPerSec=8.779266138519437, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.48, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.49, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.49, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.49, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.48, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.49, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.49, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.25s, TFLOPs: 12.86, Samples/sec: 2.46, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
 ***** Evaluating perplexity, Epoch 3/5 *****
 Invalidate trace cache @ step 0: expected module 0, but got module 6
 ppl: 1.0056875944137573, loss: 0.005671397782862186
 Beginning of Epoch 4/5, Total Micro Batches 13
+[2023-12-11 20:15:05,601] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=0, lr=[3.1140314200197657e-06, 0.00016134877823936609, 3.1140314200197657e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+[2023-12-11 20:15:05,601] [INFO] [timer.py:260:stop] epoch=3/micro_step=1/global_step=40, RunningAvgSamplesPerSec=8.818374436983056, CurrSamplesPerSec=8.49120081099869, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
 Model Parameters: 6.927 B, Latency: 3.77s, TFLOPs: 11.10, Samples/sec: 2.12, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.77s, TFLOPs: 11.09, Samples/sec: 2.12, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.44, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.45, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+[2023-12-11 20:15:42,281] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=0, lr=[1.2134356400744368e-06, 6.28723129572247e-05, 1.2134356400744368e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+[2023-12-11 20:15:42,281] [INFO] [timer.py:260:stop] epoch=3/micro_step=11/global_step=50, RunningAvgSamplesPerSec=8.800315028679389, CurrSamplesPerSec=8.764479266712412, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
+Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.27s, TFLOPs: 12.79, Samples/sec: 2.44, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
 ***** Evaluating perplexity, Epoch 4/5 *****
 Invalidate trace cache @ step 0: expected module 0, but got module 6
 ppl: 1.0032395124435425, loss: 0.0032342304475605488
 Beginning of Epoch 5/5, Total Micro Batches 13
+Model Parameters: 6.927 B, Latency: 3.77s, TFLOPs: 11.09, Samples/sec: 2.12, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.79s, TFLOPs: 11.05, Samples/sec: 2.11, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.45, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.45, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.43, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+[2023-12-11 20:16:20,586] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=0, lr=[1.4020573091929905e-07, 7.2645456434869975e-06, 1.4020573091929905e-07], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+[2023-12-11 20:16:20,586] [INFO] [timer.py:260:stop] epoch=4/micro_step=8/global_step=60, RunningAvgSamplesPerSec=8.798149665169436, CurrSamplesPerSec=8.756539739490163, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
+Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.45, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.45, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.46, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.44, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
 Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 11.44, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+Model Parameters: 6.927 B, Latency: 3.28s, TFLOPs: 12.77, Samples/sec: 2.44, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
 ***** Evaluating perplexity, Epoch 5/5 *****
 Invalidate trace cache @ step 0: expected module 0, but got module 6
 ppl: 1.003004550933838, loss: 0.0030000172555446625
 saving the final model ...
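
Note on the evaluation lines above: each reported `ppl` appears to be the exponential of the reported `loss`. The sketch below checks that relationship on the five evaluation lines copied from this log. The helper names (`parse_eval_lines`, `ppl_matches_loss`) and the tolerance are illustrative assumptions, not part of the training script.

```python
import math
import re

# Evaluation lines copied verbatim from the log above (epochs 1 to 5).
EVAL_LINES = """\
ppl: 1.6560871601104736, loss: 0.5044576525688171
ppl: 1.0178232192993164, loss: 0.01766625978052616
ppl: 1.0056875944137573, loss: 0.005671397782862186
ppl: 1.0032395124435425, loss: 0.0032342304475605488
ppl: 1.003004550933838, loss: 0.0030000172555446625
"""

# Hypothetical pattern for the "ppl: X, loss: Y" lines.
EVAL_RE = re.compile(r"ppl:\s*([0-9.eE+-]+),\s*loss:\s*([0-9.eE+-]+)")


def parse_eval_lines(text):
    """Return a list of (ppl, loss) float pairs found in the text."""
    return [(float(p), float(l)) for p, l in EVAL_RE.findall(text)]


def ppl_matches_loss(ppl, loss, rel_tol=1e-6):
    """True if ppl is exp(loss) within a tolerance that allows for
    single-precision rounding in the logged values (an assumption)."""
    return math.isclose(math.exp(loss), ppl, rel_tol=rel_tol)


if __name__ == "__main__":
    for epoch, (ppl, loss) in enumerate(parse_eval_lines(EVAL_LINES), start=1):
        print(f"epoch {epoch}: ppl={ppl:.6f} exp(loss)={math.exp(loss):.6f} "
              f"consistent={ppl_matches_loss(ppl, loss)}")
```

Read this way, the drop from a loss of about 0.50 after epoch 1 to about 0.003 after epoch 5 corresponds to perplexity approaching 1.0.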
+[2023-12-11 20:16:53,814] [INFO] [launch.py:347:main] Process 2392412 exits successfully.
+[2023-12-11 20:16:54,182] [INFO] [launch.py:347:main] Process 2392414 exits successfully.
+[2023-12-11 20:16:54,182] [INFO] [launch.py:347:main] Process 2392413 exits successfully.
+[2023-12-11 20:18:58,197] [INFO] [launch.py:347:main] Process 2392411 exits successfully.
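
Note on the `log_dist` and `timer.py` lines in this log: the first entry of each `lr=[...]` list is consistent with a half-cosine decay from the `--learning_rate 9.65e-6` given in the launch command, with no warmup, over 13 micro batches x 5 epochs = 65 steps. The sketch below reproduces those values and the logged `epoch`/`micro_step` positions. It assumes a gradient accumulation of 1 and the standard cosine formula; the function names are hypothetical and the training script's scheduler is not shown here.

```python
import math

BASE_LR = 9.65e-6        # --learning_rate from the launch command
STEPS_PER_EPOCH = 13     # "Total Micro Batches 13" in the log
NUM_EPOCHS = 5           # --num_train_epochs from the launch command
TOTAL_STEPS = STEPS_PER_EPOCH * NUM_EPOCHS  # 65, assuming gradient accumulation of 1

# First entry of each lr=[...] list logged by log_dist, keyed by global step.
LOGGED_LR = {
    10: 9.097325323776738e-06,
    20: 7.565912402977827e-06,
    30: 5.4065894822319335e-06,
    40: 3.1140314200197657e-06,
    50: 1.2134356400744368e-06,
    60: 1.4020573091929905e-07,
}

# (epoch, micro_step) logged by timer.py for the same global steps.
LOGGED_POSITION = {
    10: (0, 10), 20: (1, 7), 30: (2, 4),
    40: (3, 1), 50: (3, 11), 60: (4, 8),
}


def cosine_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS):
    """Half-cosine decay from base_lr to 0 with no warmup (assumed schedule)."""
    progress = step / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def position(global_step, steps_per_epoch=STEPS_PER_EPOCH):
    """Map a 1-based global step to (0-based epoch, 1-based micro step)."""
    epoch, micro = divmod(global_step - 1, steps_per_epoch)
    return epoch, micro + 1


if __name__ == "__main__":
    for step, logged in LOGGED_LR.items():
        print(f"step {step}: logged lr={logged:.6e} "
              f"cosine lr={cosine_lr(step):.6e} position={position(step)}")
```

Under these assumptions the learning rate has decayed to roughly 1.4e-7 by step 60, which fits the small change in loss between epochs 4 and 5.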