aaaacash committed
Commit cde919e · 1 Parent(s): f1479f3

Upload folder using huggingface_hub
Files changed (2)
  1. pytorch_model.bin +1 -1
  2. training.log +405 -222
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:e79895167c822572d1cf89779c4a10a05433bd2d166150e2c85d5d321054d016
+ oid sha256:8326234814289d31a8746916a6bdfda29077c0adf1e00abdb7368af6575068fc
  size 13477321262
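
Only the LFS pointer changed: the size is identical and only the SHA-256 oid differs, i.e. the 13.5 GB checkpoint was re-uploaded with new weights. A minimal sketch (hypothetical local path) for checking a downloaded pytorch_model.bin against the new oid:

    import hashlib

    EXPECTED = "8326234814289d31a8746916a6bdfda29077c0adf1e00abdb7368af6575068fc"

    def sha256_of(path, chunk_size=1 << 20):
        # Stream in 1 MiB chunks so the 13.5 GB blob never sits in memory.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                h.update(chunk)
        return h.hexdigest()

    assert sha256_of("pytorch_model.bin") == EXPECTED, "file does not match the LFS oid"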
training.log CHANGED
@@ -1,32 +1,26 @@
- [2023-12-11 05:39:03,031] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- [2023-12-11 05:39:04,827] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
- [2023-12-11 05:39:04,828] [INFO] [runner.py:570:main] cmd = /home/t-sokumar/miniconda3/envs/ft/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path local/jsonfile --data_split 1,0,0 --model_name_or_path codellama/CodeLlama-7b-hf --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --max_seq_len 512 --learning_rate 9.65e-6 --weight_decay 0. --num_train_epochs 3 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --gradient_checkpointing --zero_stage 3 --deepspeed --lora_dim 128 --lora_module_name layers. --output_dir ./output_step1_Codellama_7b_lora_llamahub-devrev --add_eot_token
- [2023-12-11 05:39:07,364] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- [2023-12-11 05:39:09,159] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
- [2023-12-11 05:39:09,159] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
- [2023-12-11 05:39:09,159] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
- [2023-12-11 05:39:09,159] [INFO] [launch.py:163:main] dist_world_size=4
- [2023-12-11 05:39:09,159] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
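
The --world_info value in the cmd line above is base64-encoded JSON; decoding it reproduces the WORLD INFO DICT the launcher logs (a quick check, not part of the log):

    import base64, json

    world_info = "eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119"  # copied from the cmd line
    print(json.loads(base64.urlsafe_b64decode(world_info)))  # {'localhost': [0, 1, 2, 3]}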
- [2023-12-11 05:39:12,594] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- [2023-12-11 05:39:12,600] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- [2023-12-11 05:39:12,605] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- [2023-12-11 05:39:12,606] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
- /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
- warnings.warn(
- [2023-12-11 05:39:14,179] [INFO] [comm.py:637:init_distributed] cdb=None
- [2023-12-11 05:39:14,179] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
- [2023-12-11 05:39:14,642] [INFO] [comm.py:637:init_distributed] cdb=None
- [2023-12-11 05:39:14,646] [INFO] [comm.py:637:init_distributed] cdb=None
- [2023-12-11 05:39:14,678] [INFO] [comm.py:637:init_distributed] cdb=None
- The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
- The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
- The class this function is called from is 'LlamaTokenizer'.
  The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
  The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
  The class this function is called from is 'LlamaTokenizer'.
@@ -35,17 +29,14 @@ The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
  The class this function is called from is 'LlamaTokenizer'.
  You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
  You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
- You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
  The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
  The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
  The class this function is called from is 'LlamaTokenizer'.
  You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
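
These repeated warnings are benign: the script loads the tokenizer through LlamaTokenizer while the checkpoint declares CodeLlamaTokenizer. Loading via AutoTokenizer would resolve the class from the checkpoint's own config and silence them (a sketch, not what main.py actually does):

    from transformers import AutoTokenizer

    # Resolves to CodeLlamaTokenizer from the checkpoint's tokenizer_config.json.
    tok = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")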
- [2023-12-11 05:39:17,564] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 291, num_elems = 6.74B
-
-
-
-
- Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
  Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
  Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
  Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
@@ -55,70 +46,66 @@ Building extension module fused_adam...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  Loading extension module fused_adam...
- Time to load fused_adam op: 0.12226700782775879 seconds
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
- Loading extension module fused_adam...
- Loading extension module fused_adam...
- Time to load fused_adam op: 0.2073993682861328 seconds
- Time to load fused_adam op: 0.2075939178466797 seconds
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
- Loading extension module fused_adam...
- [2023-12-11 05:39:51,139] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.4, git-hash=unknown, git-branch=unknown
- [2023-12-11 05:39:51,139] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
- Time to load fused_adam op: 0.2015979290008545 seconds
- /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
- self._dummy_overflow_buf = get_accelerator().IntTensor([0])
- [2023-12-11 05:39:51,163] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
- [2023-12-11 05:39:51,164] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
- [2023-12-11 05:39:51,164] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
- [2023-12-11 05:39:51,206] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
- [2023-12-11 05:39:51,206] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
- [2023-12-11 05:39:51,206] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
- [2023-12-11 05:39:51,206] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
- [2023-12-11 05:39:51,330] [INFO] [utils.py:795:see_memory_usage] Stage 3 initialize beginning
- [2023-12-11 05:39:51,331] [INFO] [utils.py:796:see_memory_usage] MA 4.37 GB Max_MA 4.75 GB CA 8.35 GB Max_CA 8 GB
- [2023-12-11 05:39:51,331] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 98.11 GB, percent = 39.0%
- [2023-12-11 05:39:51,333] [INFO] [stage3.py:127:__init__] Reduce bucket size 500,000,000
- [2023-12-11 05:39:51,333] [INFO] [stage3.py:128:__init__] Prefetch bucket size 30000000
- [2023-12-11 05:39:51,450] [INFO] [utils.py:795:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
- [2023-12-11 05:39:51,450] [INFO] [utils.py:796:see_memory_usage] MA 4.37 GB Max_MA 4.37 GB CA 8.35 GB Max_CA 8 GB
- [2023-12-11 05:39:51,450] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 98.13 GB, percent = 39.0%
  Parameter Offload: Total persistent parameters: 266240 in 65 params
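
The persistent-parameter count matches the model's norm layers: with param_persistence_threshold=10000 (see the zero_config dump below), tensors smaller than the threshold are kept resident on every rank rather than partitioned, and for a 7B Llama those are exactly the RMSNorm weights. A quick check, assuming the stock CodeLlama-7b shape:

    layers, hidden = 32, 4096                    # CodeLlama-7b architecture
    norm_tensors = 2 * layers + 1                # two norms per block + final norm
    print(norm_tensors, norm_tensors * hidden)   # 65 params, 266240 elements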
- [2023-12-11 05:39:51,757] [INFO] [utils.py:795:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
- [2023-12-11 05:39:51,758] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB Max_MA 4.43 GB CA 8.35 GB Max_CA 8 GB
- [2023-12-11 05:39:51,758] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 98.13 GB, percent = 39.0%
- [2023-12-11 05:39:51,866] [INFO] [utils.py:795:see_memory_usage] Before creating fp16 partitions
- [2023-12-11 05:39:51,866] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB Max_MA 3.54 GB CA 8.35 GB Max_CA 8 GB
- [2023-12-11 05:39:51,866] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 98.13 GB, percent = 39.0%
- [2023-12-11 05:39:52,568] [INFO] [utils.py:795:see_memory_usage] After creating fp16 partitions: 3
- [2023-12-11 05:39:52,569] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB Max_MA 3.54 GB CA 5.29 GB Max_CA 8 GB
- [2023-12-11 05:39:52,569] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 98.22 GB, percent = 39.0%
- [2023-12-11 05:39:52,708] [INFO] [utils.py:795:see_memory_usage] Before creating fp32 partitions
- [2023-12-11 05:39:52,709] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB Max_MA 3.54 GB CA 5.29 GB Max_CA 5 GB
- [2023-12-11 05:39:52,709] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 96.08 GB, percent = 38.2%
- [2023-12-11 05:39:52,831] [INFO] [utils.py:795:see_memory_usage] After creating fp32 partitions
- [2023-12-11 05:39:52,832] [INFO] [utils.py:796:see_memory_usage] MA 4.08 GB Max_MA 4.23 GB CA 5.99 GB Max_CA 6 GB
- [2023-12-11 05:39:52,832] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 96.11 GB, percent = 38.2%
- [2023-12-11 05:39:52,942] [INFO] [utils.py:795:see_memory_usage] Before initializing optimizer states
- [2023-12-11 05:39:52,942] [INFO] [utils.py:796:see_memory_usage] MA 4.08 GB Max_MA 4.08 GB CA 5.99 GB Max_CA 6 GB
- [2023-12-11 05:39:52,943] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 96.11 GB, percent = 38.2%
- [2023-12-11 05:39:53,083] [INFO] [utils.py:795:see_memory_usage] After initializing optimizer states
- [2023-12-11 05:39:53,084] [INFO] [utils.py:796:see_memory_usage] MA 5.17 GB Max_MA 5.47 GB CA 7.38 GB Max_CA 7 GB
- [2023-12-11 05:39:53,084] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 96.07 GB, percent = 38.2%
- [2023-12-11 05:39:53,084] [INFO] [stage3.py:479:_setup_for_real_optimizer] optimizer state initialized
- [2023-12-11 05:39:53,479] [INFO] [utils.py:795:see_memory_usage] After initializing ZeRO optimizer
- [2023-12-11 05:39:53,480] [INFO] [utils.py:796:see_memory_usage] MA 6.37 GB Max_MA 6.86 GB CA 9.05 GB Max_CA 9 GB
- [2023-12-11 05:39:53,480] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 96.07 GB, percent = 38.2%
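
In these see_memory_usage lines, MA/Max_MA are torch's live-allocation counters and CA/Max_CA the allocator's cached (reserved) pool, all in GB; roughly what is being read (assumed from the standard torch.cuda API):

    import torch

    ma     = torch.cuda.memory_allocated() / 2**30      # "MA": live tensors
    max_ma = torch.cuda.max_memory_allocated() / 2**30  # "Max_MA": peak
    ca     = torch.cuda.memory_reserved() / 2**30       # "CA": allocator cache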
- [2023-12-11 05:39:53,480] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
- [2023-12-11 05:39:53,480] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
- [2023-12-11 05:39:53,480] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f53cfda67d0>
- [2023-12-11 05:39:53,480] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[9.65e-06, 0.0005, 9.65e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
- [2023-12-11 05:39:53,482] [INFO] [config.py:979:print] DeepSpeedEngine configuration:
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] activation_checkpointing_config {
  "partition_activations": false,
  "contiguous_memory_optimization": false,
  "cpu_checkpointing": false,
@@ -126,10 +113,10 @@ Parameter Offload: Total persistent parameters: 266240 in 65 params
  "synchronize_checkpoint_boundary": false,
  "profile": false
  }
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] amp_enabled .................. False
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] amp_params ................... False
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] autotuning_config ............ {
  "enabled": false,
  "start_step": null,
  "end_step": null,
@@ -154,31 +141,31 @@ Parameter Offload: Total persistent parameters: 266240 in 65 params
  "min_train_micro_batch_size_per_gpu": 1,
  "num_tuning_micro_batch_sizes": 3
  }
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] bfloat16_enabled ............. False
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] checkpoint_parallel_write_pipeline False
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] checkpoint_tag_validation_enabled True
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] checkpoint_tag_validation_fail False
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f53cf91c7d0>
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] communication_data_type ...... None
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] curriculum_enabled_legacy .... False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] curriculum_params_legacy ..... False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] data_efficiency_enabled ...... False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] dataloader_drop_last ......... False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] disable_allgather ............ False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] dump_state ................... False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'consecutive_hysteresis': False, 'min_scale': 1}
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_enabled ........... False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_gas_boundary_resolution 1
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_layer_name ........ bert.encoder.layer
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_layer_num ......... 0
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_max_iter .......... 100
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_stability ......... 1e-06
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_tol ............... 0.01
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_verbose ........... False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] elasticity_enabled ........... False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] flops_profiler_config ........ {
  "enabled": false,
  "recompute_fwd_factor": 0.0,
  "profile_step": 1,
@@ -187,23 +174,23 @@ Parameter Offload: Total persistent parameters: 266240 in 65 params
  "detailed": true,
  "output_file": null
  }
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] fp16_auto_cast ............... False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] fp16_enabled ................. True
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] fp16_master_weights_and_gradients False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] global_rank .................. 0
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] grad_accum_dtype ............. None
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] gradient_accumulation_steps .. 1
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] gradient_clipping ............ 1.0
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] gradient_predivide_factor .... 1.0
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] initial_dynamic_scale ........ 65536
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] load_universal_checkpoint .... False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] loss_scale ................... 0
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] memory_breakdown ............. False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] mics_hierarchial_params_gather False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] mics_shard_size .............. -1
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='step1_tensorboard/ds_tensorboard_logs/', job_name='step1_model_tensorboard') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] nebula_config ................ {
  "enabled": false,
  "persistent_storage_path": null,
  "persistent_time_interval": 100,
@@ -211,33 +198,33 @@ Parameter Offload: Total persistent parameters: 266240 in 65 params
  "enable_nebula_load": true,
  "load_path": null
  }
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] optimizer_legacy_fusion ...... False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] optimizer_name ............... None
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] optimizer_params ............. None
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] pld_enabled .................. False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] pld_params ................... False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] prescale_gradients ........... False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] scheduler_name ............... None
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] scheduler_params ............. None
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] seq_parallel_communication_data_type torch.float32
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] sparse_attention ............. None
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] sparse_gradients_enabled ..... False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] steps_per_print .............. 10
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] train_batch_size ............. 32
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] train_micro_batch_size_per_gpu 8
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] use_data_before_expert_parallel_ False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] use_node_local_storage ....... False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] wall_clock_breakdown ......... False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] weight_quantization_config ... None
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] world_size ................... 4
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] zero_allow_untested_optimizer False
- [2023-12-11 05:39:53,485] [INFO] [config.py:983:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False pipeline_loading_checkpoint=False override_module_apply=True
- [2023-12-11 05:39:53,485] [INFO] [config.py:983:print] zero_enabled ................. True
- [2023-12-11 05:39:53,485] [INFO] [config.py:983:print] zero_force_ds_cpu_optimizer .. True
- [2023-12-11 05:39:53,485] [INFO] [config.py:983:print] zero_optimization_stage ...... 3
- [2023-12-11 05:39:53,485] [INFO] [config.py:969:print_user_config] json = {
- "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 8,
  "steps_per_print": 10,
  "zero_optimization": {
@@ -275,76 +262,272 @@ Parameter Offload: Total persistent parameters: 266240 in 65 params
  }
  }
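
The train_batch_size printed above is not an independent knob; DeepSpeed asserts it equals micro-batch size x gradient-accumulation steps x world size. For this run:

    micro_batch, grad_accum, world_size = 8, 1, 4  # from the config and launch log
    print(micro_batch * grad_accum * world_size)   # 32 == train_batch_size

The rerun added below uses only three GPUs, which is why its config dump shows train_batch_size 24.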
  ***** Running training *****
- ***** Evaluating perplexity, Epoch 0/3 *****
- ppl: 4.460639476776123, loss: 1.4952921867370605
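
The evaluation prints perplexity as the exponential of the mean loss, so the pair is internally consistent at every epoch:

    import math
    print(math.exp(1.4952921867370605))  # 4.4606... = the ppl logged above
    print(math.exp(0.551319420337677))   # 1.7355... = the Epoch 1 eval below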
- Beginning of Epoch 1/3, Total Micro Batches 13
- /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
- warnings.warn(
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
- Model Parameters: 6.927 B, Latency: 4.07s, TFLOPs: 10.30, Samples/sec: 1.97, Time/seq 0.51s, Batch Size: 8, Sequence Length: 512
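
The throughput fields are plain ratios of the logged per-device batch and latency (the TFLOPs figure comes from the script's model-FLOPs estimate and is not recomputed here):

    batch, latency = 8, 4.07
    print(batch / latency)   # ~1.97 samples/sec
    print(latency / batch)   # ~0.51 s per sequence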
  Invalidate trace cache @ step 0: expected module 6, but got module 0
- Model Parameters: 6.927 B, Latency: 3.74s, TFLOPs: 11.21, Samples/sec: 2.14, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.75s, TFLOPs: 11.17, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.73s, TFLOPs: 11.24, Samples/sec: 2.15, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.62s, TFLOPs: 11.55, Samples/sec: 2.21, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.62s, TFLOPs: 11.57, Samples/sec: 2.21, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.55, Samples/sec: 2.21, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.62s, TFLOPs: 11.56, Samples/sec: 2.21, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- [2023-12-11 05:40:33,349] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[8.167395005683819e-06, 0.00042318108837739987, 8.167395005683819e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
- [2023-12-11 05:40:33,349] [INFO] [timer.py:260:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=8.762510054419295, CurrSamplesPerSec=8.82404309088149, MemAllocated=6.88GB, MaxMemAllocated=10.68GB
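
The logged learning rates follow a zero-warmup cosine schedule over 13 micro-batches x 3 epochs = 39 steps; both parameter groups (9.65e-6 and the 5e-4 LoRA group) are scaled by the same factor. A check using the usual closed form (assumed to match the script's scheduler):

    import math

    total_steps = 39
    factor = lambda step: 0.5 * (1 + math.cos(math.pi * step / total_steps))
    print(9.65e-6 * factor(10), 5e-4 * factor(10))  # ~8.167e-06, ~4.232e-04 (step=10)
    print(9.65e-6 * factor(20), 5e-4 * factor(20))  # ~4.631e-06, ~2.399e-04 (step=20)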
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.54, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.62s, TFLOPs: 11.55, Samples/sec: 2.21, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.62s, TFLOPs: 11.56, Samples/sec: 2.21, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.24s, TFLOPs: 12.92, Samples/sec: 2.47, Time/seq 0.40s, Batch Size: 8, Sequence Length: 512
- ***** Evaluating perplexity, Epoch 1/3 *****
  Invalidate trace cache @ step 0: expected module 0, but got module 6
- ppl: 1.7355413436889648, loss: 0.551319420337677
- Beginning of Epoch 2/3, Total Micro Batches 13
- Model Parameters: 6.927 B, Latency: 3.75s, TFLOPs: 11.17, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.76s, TFLOPs: 11.14, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.54, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- [2023-12-11 05:41:11,363] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[4.6307168389720735e-06, 0.0002399335149726463, 4.6307168389720735e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
- [2023-12-11 05:41:11,363] [INFO] [timer.py:260:stop] epoch=1/micro_step=7/global_step=20, RunningAvgSamplesPerSec=8.813860487355969, CurrSamplesPerSec=8.815178915300866, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.24s, TFLOPs: 12.90, Samples/sec: 2.47, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
- ***** Evaluating perplexity, Epoch 2/3 *****
  Invalidate trace cache @ step 0: expected module 0, but got module 6
- ppl: 1.0645378828048706, loss: 0.0625406950712204
- Beginning of Epoch 3/3, Total Micro Batches 13
- Model Parameters: 6.927 B, Latency: 3.75s, TFLOPs: 11.16, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.77s, TFLOPs: 11.11, Samples/sec: 2.12, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- [2023-12-11 05:41:49,440] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[1.2134356400744368e-06, 6.28723129572247e-05, 1.2134356400744368e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
- [2023-12-11 05:41:49,441] [INFO] [timer.py:260:stop] epoch=2/micro_step=4/global_step=30, RunningAvgSamplesPerSec=8.8235289391519, CurrSamplesPerSec=8.771353693343526, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
- Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.51, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.48, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.50, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.26s, TFLOPs: 12.83, Samples/sec: 2.45, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
- ***** Evaluating perplexity, Epoch 3/3 *****
  Invalidate trace cache @ step 0: expected module 0, but got module 6
- ppl: 1.031400442123413, loss: 0.03091755136847496
  saving the final model ...
- [2023-12-11 05:42:35,188] [INFO] [launch.py:347:main] Process 1247715 exits successfully.
- [2023-12-11 05:42:35,188] [INFO] [launch.py:347:main] Process 1247716 exits successfully.
- [2023-12-11 05:42:35,189] [INFO] [launch.py:347:main] Process 1247717 exits successfully.
- [2023-12-11 05:44:14,200] [INFO] [launch.py:347:main] Process 1247714 exits successfully.
 
+ [2023-12-11 10:42:54,890] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
+ [2023-12-11 10:42:56,697] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
+ Detected CUDA_VISIBLE_DEVICES=0,1,2: setting --include=localhost:0,1,2
+ [2023-12-11 10:42:56,698] [INFO] [runner.py:570:main] cmd = /home/t-sokumar/miniconda3/envs/ft/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path local/jsonfile --data_split 1,0,0 --model_name_or_path codellama/CodeLlama-7b-hf --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --max_seq_len 512 --learning_rate 9.65e-6 --weight_decay 0. --num_train_epochs 10 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --gradient_checkpointing --zero_stage 3 --deepspeed --lora_dim 128 --lora_module_name layers. --output_dir ./output_step1_Codellama_7b_lora_llamahub-devrev --add_eot_token
+ [2023-12-11 10:42:59,233] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
+ [2023-12-11 10:43:01,086] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
+ [2023-12-11 10:43:01,086] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=3, node_rank=0
+ [2023-12-11 10:43:01,086] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
+ [2023-12-11 10:43:01,086] [INFO] [launch.py:163:main] dist_world_size=3
+ [2023-12-11 10:43:01,086] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
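
Relative to the deleted run, this rerun changes two launch flags: --num_train_epochs 10 instead of 3, and three visible GPUs instead of four. With dist_world_size=3 the effective batch size printed later in this config dump follows directly:

    micro_batch, grad_accum, world_size = 8, 1, 3
    print(micro_batch * grad_accum * world_size)  # 24 == train_batch_size below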
+ [2023-12-11 10:43:04,573] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
+ [2023-12-11 10:43:04,579] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
+ [2023-12-11 10:43:04,650] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
+ [2023-12-11 10:43:06,219] [INFO] [comm.py:637:init_distributed] cdb=None
+ [2023-12-11 10:43:06,219] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
+ [2023-12-11 10:43:06,306] [INFO] [comm.py:637:init_distributed] cdb=None
+ [2023-12-11 10:43:06,307] [INFO] [comm.py:637:init_distributed] cdb=None
  The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
  The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
  The class this function is called from is 'LlamaTokenizer'.
 
  The class this function is called from is 'LlamaTokenizer'.
  You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
  You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
 
  The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
  The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
  The class this function is called from is 'LlamaTokenizer'.
  You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
+ [2023-12-11 10:43:09,096] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 291, num_elems = 6.74B
+
+
+
  Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
  Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
  Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
 
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  Loading extension module fused_adam...
+ Time to load fused_adam op: 0.1028139591217041 seconds
+ Loading extension module fused_adam...
+ Loading extension module fused_adam...
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
+ Time to load fused_adam op: 0.10137510299682617 seconds
+ Time to load fused_adam op: 0.10164141654968262 seconds
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
+ [2023-12-11 10:43:18,099] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.4, git-hash=unknown, git-branch=unknown
+ [2023-12-11 10:43:18,099] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
+ [2023-12-11 10:43:18,121] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
+ [2023-12-11 10:43:18,123] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
+ [2023-12-11 10:43:18,123] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
+ [2023-12-11 10:43:18,161] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
+ [2023-12-11 10:43:18,161] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
+ [2023-12-11 10:43:18,161] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
+ [2023-12-11 10:43:18,161] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
+ [2023-12-11 10:43:18,286] [INFO] [utils.py:795:see_memory_usage] Stage 3 initialize beginning
+ [2023-12-11 10:43:18,287] [INFO] [utils.py:796:see_memory_usage] MA 5.37 GB Max_MA 5.79 GB CA 11.7 GB Max_CA 12 GB
+ [2023-12-11 10:43:18,287] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 93.3 GB, percent = 37.1%
+ [2023-12-11 10:43:18,289] [INFO] [stage3.py:127:__init__] Reduce bucket size 500,000,000
+ [2023-12-11 10:43:18,289] [INFO] [stage3.py:128:__init__] Prefetch bucket size 30000000
+ [2023-12-11 10:43:18,399] [INFO] [utils.py:795:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
+ [2023-12-11 10:43:18,400] [INFO] [utils.py:796:see_memory_usage] MA 5.37 GB Max_MA 5.37 GB CA 11.7 GB Max_CA 12 GB
+ [2023-12-11 10:43:18,400] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 93.27 GB, percent = 37.1%
  Parameter Offload: Total persistent parameters: 266240 in 65 params
+ [2023-12-11 10:43:18,806] [INFO] [utils.py:795:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
+ [2023-12-11 10:43:18,807] [INFO] [utils.py:796:see_memory_usage] MA 4.64 GB Max_MA 5.46 GB CA 11.7 GB Max_CA 12 GB
+ [2023-12-11 10:43:18,807] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 93.24 GB, percent = 37.1%
+ [2023-12-11 10:43:18,913] [INFO] [utils.py:795:see_memory_usage] Before creating fp16 partitions
+ [2023-12-11 10:43:18,914] [INFO] [utils.py:796:see_memory_usage] MA 4.64 GB Max_MA 4.64 GB CA 11.7 GB Max_CA 12 GB
+ [2023-12-11 10:43:18,914] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 93.24 GB, percent = 37.1%
+ [2023-12-11 10:43:19,685] [INFO] [utils.py:795:see_memory_usage] After creating fp16 partitions: 3
+ [2023-12-11 10:43:19,686] [INFO] [utils.py:796:see_memory_usage] MA 4.64 GB Max_MA 4.64 GB CA 7.4 GB Max_CA 12 GB
+ [2023-12-11 10:43:19,686] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 93.67 GB, percent = 37.2%
+ [2023-12-11 10:43:19,803] [INFO] [utils.py:795:see_memory_usage] Before creating fp32 partitions
+ [2023-12-11 10:43:19,804] [INFO] [utils.py:796:see_memory_usage] MA 4.64 GB Max_MA 4.64 GB CA 7.4 GB Max_CA 7 GB
+ [2023-12-11 10:43:19,804] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 93.07 GB, percent = 37.0%
+ [2023-12-11 10:43:19,932] [INFO] [utils.py:795:see_memory_usage] After creating fp32 partitions
+ [2023-12-11 10:43:19,932] [INFO] [utils.py:796:see_memory_usage] MA 5.36 GB Max_MA 5.56 GB CA 8.49 GB Max_CA 8 GB
+ [2023-12-11 10:43:19,964] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 93.04 GB, percent = 37.0%
+ [2023-12-11 10:43:20,076] [INFO] [utils.py:795:see_memory_usage] Before initializing optimizer states
+ [2023-12-11 10:43:20,077] [INFO] [utils.py:796:see_memory_usage] MA 5.36 GB Max_MA 5.36 GB CA 8.49 GB Max_CA 8 GB
+ [2023-12-11 10:43:20,077] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 92.33 GB, percent = 36.7%
+ [2023-12-11 10:43:20,189] [INFO] [utils.py:795:see_memory_usage] After initializing optimizer states
+ [2023-12-11 10:43:20,190] [INFO] [utils.py:796:see_memory_usage] MA 6.81 GB Max_MA 7.21 GB CA 10.34 GB Max_CA 10 GB
+ [2023-12-11 10:43:20,190] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 92.33 GB, percent = 36.7%
+ [2023-12-11 10:43:20,190] [INFO] [stage3.py:479:_setup_for_real_optimizer] optimizer state initialized
+ [2023-12-11 10:43:20,563] [INFO] [utils.py:795:see_memory_usage] After initializing ZeRO optimizer
+ [2023-12-11 10:43:20,564] [INFO] [utils.py:796:see_memory_usage] MA 8.1 GB Max_MA 8.59 GB CA 12.01 GB Max_CA 12 GB
+ [2023-12-11 10:43:20,565] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 91.63 GB, percent = 36.4%
+ [2023-12-11 10:43:20,565] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
+ [2023-12-11 10:43:20,565] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
+ [2023-12-11 10:43:20,565] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f02815f2b50>
+ [2023-12-11 10:43:20,565] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[9.65e-06, 0.0005, 9.65e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:43:20,566] [INFO] [config.py:979:print] DeepSpeedEngine configuration:
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] activation_checkpointing_config {
  "partition_activations": false,
  "contiguous_memory_optimization": false,
  "cpu_checkpointing": false,
 
  "synchronize_checkpoint_boundary": false,
  "profile": false
  }
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] amp_enabled .................. False
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] amp_params ................... False
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] autotuning_config ............ {
  "enabled": false,
  "start_step": null,
  "end_step": null,
 
  "min_train_micro_batch_size_per_gpu": 1,
  "num_tuning_micro_batch_sizes": 3
  }
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] bfloat16_enabled ............. False
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] checkpoint_parallel_write_pipeline False
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] checkpoint_tag_validation_enabled True
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] checkpoint_tag_validation_fail False
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f0281706bd0>
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] communication_data_type ...... None
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] curriculum_enabled_legacy .... False
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] curriculum_params_legacy ..... False
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] data_efficiency_enabled ...... False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] dataloader_drop_last ......... False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] disable_allgather ............ False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] dump_state ................... False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'consecutive_hysteresis': False, 'min_scale': 1}
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] eigenvalue_enabled ........... False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] eigenvalue_gas_boundary_resolution 1
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] eigenvalue_layer_name ........ bert.encoder.layer
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] eigenvalue_layer_num ......... 0
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] eigenvalue_max_iter .......... 100
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] eigenvalue_stability ......... 1e-06
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] eigenvalue_tol ............... 0.01
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] eigenvalue_verbose ........... False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] elasticity_enabled ........... False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] flops_profiler_config ........ {
  "enabled": false,
  "recompute_fwd_factor": 0.0,
  "profile_step": 1,
 
  "detailed": true,
  "output_file": null
  }
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] fp16_auto_cast ............... False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] fp16_enabled ................. True
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] fp16_master_weights_and_gradients False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] global_rank .................. 0
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] grad_accum_dtype ............. None
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] gradient_accumulation_steps .. 1
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] gradient_clipping ............ 1.0
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] gradient_predivide_factor .... 1.0
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] initial_dynamic_scale ........ 65536
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] load_universal_checkpoint .... False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] loss_scale ................... 0
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] memory_breakdown ............. False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] mics_hierarchial_params_gather False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] mics_shard_size .............. -1
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='step1_tensorboard/ds_tensorboard_logs/', job_name='step1_model_tensorboard') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] nebula_config ................ {
  "enabled": false,
  "persistent_storage_path": null,
  "persistent_time_interval": 100,
 
  "enable_nebula_load": true,
  "load_path": null
  }
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] optimizer_legacy_fusion ...... False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] optimizer_name ............... None
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] optimizer_params ............. None
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] pld_enabled .................. False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] pld_params ................... False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] prescale_gradients ........... False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] scheduler_name ............... None
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] scheduler_params ............. None
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] seq_parallel_communication_data_type torch.float32
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] sparse_attention ............. None
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] sparse_gradients_enabled ..... False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] steps_per_print .............. 10
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] train_batch_size ............. 24
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] train_micro_batch_size_per_gpu 8
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] use_data_before_expert_parallel_ False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] use_node_local_storage ....... False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] wall_clock_breakdown ......... False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] weight_quantization_config ... None
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] world_size ................... 3
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] zero_allow_untested_optimizer False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False pipeline_loading_checkpoint=False override_module_apply=True
+ [2023-12-11 10:43:20,570] [INFO] [config.py:983:print] zero_enabled ................. True
+ [2023-12-11 10:43:20,570] [INFO] [config.py:983:print] zero_force_ds_cpu_optimizer .. True
+ [2023-12-11 10:43:20,570] [INFO] [config.py:983:print] zero_optimization_stage ...... 3
+ [2023-12-11 10:43:20,570] [INFO] [config.py:969:print_user_config] json = {
+ "train_batch_size": 24,
  "train_micro_batch_size_per_gpu": 8,
  "steps_per_print": 10,
  "zero_optimization": {
 
  }
  }
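Note on the printed batch sizes: DeepSpeed requires the global train_batch_size to equal train_micro_batch_size_per_gpu x gradient_accumulation_steps x world_size, which is why 24 appears alongside a micro batch of 8 and a world size of 3 in the config print above. A minimal sketch of that arithmetic, using only values from the config print (variable names are illustrative, not from the run):

# Values copied from the config print above; names are illustrative.
train_micro_batch_size_per_gpu = 8
gradient_accumulation_steps = 1
world_size = 3

train_batch_size = (train_micro_batch_size_per_gpu
                    * gradient_accumulation_steps
                    * world_size)
assert train_batch_size == 24  # matches the printed train_batch_size
print(train_batch_size)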
  ***** Running training *****
+ ***** Evaluating perplexity, Epoch 0/10 *****
+ ppl: 4.454780578613281, loss: 1.4939777851104736
+ Beginning of Epoch 1/10, Total Micro Batches 18
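The evaluation lines report perplexity and loss together; they are consistent with perplexity being the exponential of the mean cross-entropy loss, ppl = exp(loss). A quick check against the Epoch 0 numbers above (a sketch, not code from the run):

import math

loss = 1.4939777851104736  # from the Epoch 0/10 evaluation line
ppl = math.exp(loss)
print(ppl)                 # ~4.45478, matching the reported perplexity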
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
+ Model Parameters: 6.927 B, Latency: 4.19s, TFLOPs: 13.34, Samples/sec: 1.91, Time/seq 0.52s, Batch Size: 8, Sequence Length: 512
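Each "Model Parameters" line is a per-rank throughput report: Samples/sec is Batch Size divided by Latency, and Time/seq is Latency divided by Batch Size; the RunningAvgSamplesPerSec in the timer lines is the aggregate across ranks, roughly the per-rank rate times the world size of 3. A sketch of that arithmetic (the TFLOPs column comes from a model-specific FLOPs estimate and is left aside here):

batch_size = 8
latency_s = 3.66     # steady-state step latency seen below
world_size = 3

samples_per_sec = batch_size / latency_s       # ~2.19, as reported per rank
time_per_seq = latency_s / batch_size          # ~0.46 s, as reported
aggregate_sps = samples_per_sec * world_size   # ~6.6, close to RunningAvgSamplesPerSec
print(samples_per_sec, time_per_seq, aggregate_sps)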
  Invalidate trace cache @ step 0: expected module 6, but got module 0
+ Model Parameters: 6.927 B, Latency: 3.83s, TFLOPs: 14.59, Samples/sec: 2.09, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.84s, TFLOPs: 14.54, Samples/sec: 2.08, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.12, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.26, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.24, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.27, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.25, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.26, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:44:01,475] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[9.576697408283905e-06, 0.000496201938253052, 9.576697408283905e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:44:01,475] [INFO] [timer.py:260:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=6.5135351330213425, CurrSamplesPerSec=6.560741213043025, MemAllocated=8.61GB, MaxMemAllocated=14.1GB
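The lr values logged every 10 steps trace a zero-warmup cosine decay over the whole run: 10 epochs x 18 micro-batches = 180 optimizer steps, the step-90 value (4.825e-06) is exactly half the implied base of 9.65e-06, and step 180 reaches 0. The second parameter group decays from roughly 5e-4 on the same schedule (presumably a separately scheduled group such as the LoRA weights; that grouping is an assumption, not something the log states). A sketch reproducing the step-10 value for the first group:

import math

base_lr = 9.65e-06  # implied by the logged schedule; half of it appears at step 90
total_steps = 180   # 10 epochs x 18 micro-batches, gradient accumulation 1
step = 10

lr = base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
print(lr)           # ~9.5767e-06, matching the step=10 line above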
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.25, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.24, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.20, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.25, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.22, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.24, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.22, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.24, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 1.58s, TFLOPs: 35.24, Samples/sec: 5.05, Time/seq 0.20s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 1/10 *****
+ Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.1439684629440308, loss: 0.1345033049583435
+ Beginning of Epoch 2/10, Total Micro Batches 18
+ Model Parameters: 6.927 B, Latency: 3.83s, TFLOPs: 14.58, Samples/sec: 2.09, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:44:38,766] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[9.35901689529201e-06, 0.0004849231551964771, 9.35901689529201e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:44:38,766] [INFO] [timer.py:260:stop] epoch=1/micro_step=2/global_step=20, RunningAvgSamplesPerSec=6.710114711955848, CurrSamplesPerSec=6.246338308005135, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.84s, TFLOPs: 14.52, Samples/sec: 2.08, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.23, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.23, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.24, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.23, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.23, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.20, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.22, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.23, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:45:15,457] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[9.003572573259918e-06, 0.00046650635094610973, 9.003572573259918e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:45:15,457] [INFO] [timer.py:260:stop] epoch=1/micro_step=12/global_step=30, RunningAvgSamplesPerSec=6.650983352459916, CurrSamplesPerSec=6.541507580323197, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.22, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.23, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.22, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.23, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 1.58s, TFLOPs: 35.33, Samples/sec: 5.06, Time/seq 0.20s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 2/10 *****
+ Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.0047215223312378, loss: 0.004710380919277668
+ Beginning of Epoch 3/10, Total Micro Batches 18
+ Model Parameters: 6.927 B, Latency: 3.83s, TFLOPs: 14.58, Samples/sec: 2.09, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.86s, TFLOPs: 14.47, Samples/sec: 2.07, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:45:52,775] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=0, lr=[8.52116443804907e-06, 0.0004415111107797445, 8.52116443804907e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:45:52,776] [INFO] [timer.py:260:stop] epoch=2/micro_step=4/global_step=40, RunningAvgSamplesPerSec=6.7074234396499754, CurrSamplesPerSec=6.540627329973461, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.22, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.20, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.20, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.20, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:46:29,517] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=0, lr=[7.926450216737553e-06, 0.0004106969024216348, 7.926450216737553e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:46:29,518] [INFO] [timer.py:260:stop] epoch=2/micro_step=14/global_step=50, RunningAvgSamplesPerSec=6.671387689388046, CurrSamplesPerSec=6.52887655567313, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.18, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 1.58s, TFLOPs: 35.30, Samples/sec: 5.06, Time/seq 0.20s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 3/10 *****
+ Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.0025222301483154, loss: 0.002519124187529087
+ Beginning of Epoch 4/10, Total Micro Batches 18
+ Model Parameters: 6.927 B, Latency: 3.83s, TFLOPs: 14.57, Samples/sec: 2.09, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.85s, TFLOPs: 14.49, Samples/sec: 2.08, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.18, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:47:06,889] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=0, lr=[7.237500000000001e-06, 0.000375, 7.237500000000001e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:47:06,889] [INFO] [timer.py:260:stop] epoch=3/micro_step=6/global_step=60, RunningAvgSamplesPerSec=6.703453633726282, CurrSamplesPerSec=6.5185040516557375, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.18, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:47:43,680] [INFO] [logging.py:96:log_dist] [Rank 0] step=70, skipped=0, lr=[6.475247191546353e-06, 0.0003355050358314172, 6.475247191546353e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:47:43,681] [INFO] [timer.py:260:stop] epoch=3/micro_step=16/global_step=70, RunningAvgSamplesPerSec=6.6772772936024944, CurrSamplesPerSec=6.529561351866838, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.18, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 1.59s, TFLOPs: 35.12, Samples/sec: 5.03, Time/seq 0.20s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 4/10 *****
+ Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.0022900104522705, loss: 0.002287394367158413
+ Beginning of Epoch 5/10, Total Micro Batches 18
+ Model Parameters: 6.927 B, Latency: 3.84s, TFLOPs: 14.53, Samples/sec: 2.08, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.85s, TFLOPs: 14.49, Samples/sec: 2.08, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.13, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:48:21,116] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=0, lr=[5.66285245724294e-06, 0.00029341204441673266, 5.66285245724294e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:48:21,116] [INFO] [timer.py:260:stop] epoch=4/micro_step=8/global_step=80, RunningAvgSamplesPerSec=6.698826811268296, CurrSamplesPerSec=6.5242438758469135, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.18, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.18, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:48:55,839] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=0, lr=[4.825e-06, 0.00025, 4.825e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:48:55,839] [INFO] [timer.py:260:stop] epoch=4/micro_step=18/global_step=90, RunningAvgSamplesPerSec=6.72308755096596, CurrSamplesPerSec=15.258959684074387, MemAllocated=8.24GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 1.57s, TFLOPs: 35.46, Samples/sec: 5.08, Time/seq 0.20s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 5/10 *****
+ Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.002268671989441, loss: 0.002266054507344961
+ Beginning of Epoch 6/10, Total Micro Batches 18
+ Model Parameters: 6.927 B, Latency: 3.83s, TFLOPs: 14.56, Samples/sec: 2.09, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.85s, TFLOPs: 14.49, Samples/sec: 2.08, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.18, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.18, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:49:35,362] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=0, lr=[3.987147542757061e-06, 0.00020658795558326743, 3.987147542757061e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:49:35,363] [INFO] [timer.py:260:stop] epoch=5/micro_step=10/global_step=100, RunningAvgSamplesPerSec=6.69592354348812, CurrSamplesPerSec=6.516857819843849, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 1.69s, TFLOPs: 33.09, Samples/sec: 4.74, Time/seq 0.21s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 6/10 *****
+ Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.002125859260559, loss: 0.0021236357279121876
+ Beginning of Epoch 7/10, Total Micro Batches 18
+ Model Parameters: 6.927 B, Latency: 3.83s, TFLOPs: 14.57, Samples/sec: 2.09, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:50:12,933] [INFO] [logging.py:96:log_dist] [Rank 0] step=110, skipped=0, lr=[3.174752808453649e-06, 0.00016449496416858284, 3.174752808453649e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:50:12,933] [INFO] [timer.py:260:stop] epoch=6/micro_step=2/global_step=110, RunningAvgSamplesPerSec=6.707642153640435, CurrSamplesPerSec=6.182524933965045, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.88s, TFLOPs: 14.37, Samples/sec: 2.06, Time/seq 0.49s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:50:49,797] [INFO] [logging.py:96:log_dist] [Rank 0] step=120, skipped=0, lr=[2.4125000000000015e-06, 0.00012500000000000006, 2.4125000000000015e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:50:49,797] [INFO] [timer.py:260:stop] epoch=6/micro_step=12/global_step=120, RunningAvgSamplesPerSec=6.691004676878778, CurrSamplesPerSec=6.515100253599061, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.12, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 1.59s, TFLOPs: 35.12, Samples/sec: 5.03, Time/seq 0.20s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 7/10 *****
+ Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.0021148920059204, loss: 0.0021126093342900276
+ Beginning of Epoch 8/10, Total Micro Batches 18
+ Model Parameters: 6.927 B, Latency: 3.82s, TFLOPs: 14.60, Samples/sec: 2.09, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.85s, TFLOPs: 14.51, Samples/sec: 2.08, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:51:27,245] [INFO] [logging.py:96:log_dist] [Rank 0] step=130, skipped=0, lr=[1.7235497832624478e-06, 8.930309757836516e-05, 1.7235497832624478e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:51:27,246] [INFO] [timer.py:260:stop] epoch=7/micro_step=4/global_step=130, RunningAvgSamplesPerSec=6.703016297794747, CurrSamplesPerSec=6.513660995687833, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.12, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.13, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.12, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.18, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:52:04,119] [INFO] [logging.py:96:log_dist] [Rank 0] step=140, skipped=0, lr=[1.1288355619509317e-06, 5.848888922025553e-05, 1.1288355619509317e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:52:04,120] [INFO] [timer.py:260:stop] epoch=7/micro_step=14/global_step=140, RunningAvgSamplesPerSec=6.689002559054329, CurrSamplesPerSec=6.521014034361496, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 1.57s, TFLOPs: 35.54, Samples/sec: 5.09, Time/seq 0.20s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 8/10 *****
  Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.0020469427108765, loss: 0.0020447911228984594
+ Beginning of Epoch 9/10, Total Micro Batches 18
+ Model Parameters: 6.927 B, Latency: 3.83s, TFLOPs: 14.58, Samples/sec: 2.09, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.86s, TFLOPs: 14.45, Samples/sec: 2.07, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:52:41,563] [INFO] [logging.py:96:log_dist] [Rank 0] step=150, skipped=0, lr=[6.464274267400833e-07, 3.3493649053890325e-05, 6.464274267400833e-07], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:52:41,564] [INFO] [timer.py:260:stop] epoch=8/micro_step=6/global_step=150, RunningAvgSamplesPerSec=6.699587482336564, CurrSamplesPerSec=6.5183191731058265, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.13, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.70s, TFLOPs: 15.10, Samples/sec: 2.16, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.12, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:53:18,461] [INFO] [logging.py:96:log_dist] [Rank 0] step=160, skipped=0, lr=[2.909831047079924e-07, 1.5076844803522921e-05, 2.909831047079924e-07], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:53:18,461] [INFO] [timer.py:260:stop] epoch=8/micro_step=16/global_step=160, RunningAvgSamplesPerSec=6.6873052348933255, CurrSamplesPerSec=6.516400938574413, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 1.61s, TFLOPs: 34.78, Samples/sec: 4.98, Time/seq 0.20s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 9/10 *****
  Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.0019946098327637, loss: 0.001992669189348817
+ Beginning of Epoch 10/10, Total Micro Batches 18
+ Model Parameters: 6.927 B, Latency: 3.83s, TFLOPs: 14.58, Samples/sec: 2.09, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.86s, TFLOPs: 14.45, Samples/sec: 2.07, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:53:55,945] [INFO] [logging.py:96:log_dist] [Rank 0] step=170, skipped=0, lr=[7.330259171609631e-08, 3.798061746947995e-06, 7.330259171609631e-08], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:53:55,945] [INFO] [timer.py:260:stop] epoch=9/micro_step=8/global_step=170, RunningAvgSamplesPerSec=6.696277604142081, CurrSamplesPerSec=6.512733867002891, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:54:30,744] [INFO] [logging.py:96:log_dist] [Rank 0] step=180, skipped=0, lr=[0.0, 0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:54:30,744] [INFO] [timer.py:260:stop] epoch=9/micro_step=18/global_step=180, RunningAvgSamplesPerSec=6.707583967509366, CurrSamplesPerSec=14.954943653382319, MemAllocated=8.24GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 1.61s, TFLOPs: 34.75, Samples/sec: 4.98, Time/seq 0.20s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 10/10 *****
  Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.0020090341567993, loss: 0.002007057424634695
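Collected from the evaluation lines above, the per-epoch perplexities show the model essentially converging within the first two epochs and plateauing near 1.002 thereafter (a summary sketch of values already printed in this log):

# Perplexity after each evaluation, copied from the lines above.
ppl_by_epoch = {
    0: 4.454780578613281,   # before any training steps
    1: 1.1439684629440308,
    2: 1.0047215223312378,
    3: 1.0025222301483154,
    4: 1.0022900104522705,
    5: 1.002268671989441,
    6: 1.002125859260559,
    7: 1.0021148920059204,
    8: 1.0020469427108765,
    9: 1.0019946098327637,
    10: 1.0020090341567993,
}
best_epoch = min(ppl_by_epoch, key=ppl_by_epoch.get)
print(best_epoch, ppl_by_epoch[best_epoch])  # epoch 9 is the minimum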
  saving the final model ...
+ [2023-12-11 10:54:47,172] [INFO] [launch.py:347:main] Process 1648663 exits successfully.
+ [2023-12-11 10:54:48,665] [INFO] [launch.py:347:main] Process 1648664 exits successfully.
+ [2023-12-11 10:56:50,679] [INFO] [launch.py:347:main] Process 1648662 exits successfully.