aaaacash committed
Commit cde919e · 1 Parent(s): f1479f3

Upload folder using huggingface_hub
Files changed (2)
  1. pytorch_model.bin +1 -1
  2. training.log +405 -222
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:e79895167c822572d1cf89779c4a10a05433bd2d166150e2c85d5d321054d016
+ oid sha256:8326234814289d31a8746916a6bdfda29077c0adf1e00abdb7368af6575068fc
  size 13477321262
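
Only the LFS pointer changed: the size is identical and only the SHA-256 oid differs, i.e. the 13.5 GB checkpoint was re-uploaded with new weights. A minimal sketch (hypothetical local path) for checking a downloaded pytorch_model.bin against the new oid:

    import hashlib

    EXPECTED = "8326234814289d31a8746916a6bdfda29077c0adf1e00abdb7368af6575068fc"

    def sha256_of(path, chunk_size=1 << 20):
        # Stream in 1 MiB chunks so the 13.5 GB blob never sits in memory.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                h.update(chunk)
        return h.hexdigest()

    assert sha256_of("pytorch_model.bin") == EXPECTED, "file does not match the LFS oid"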
training.log CHANGED
@@ -1,32 +1,26 @@
- [2023-12-11 05:39:03,031] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- [2023-12-11 05:39:04,827] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
- [2023-12-11 05:39:04,828] [INFO] [runner.py:570:main] cmd = /home/t-sokumar/miniconda3/envs/ft/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path local/jsonfile --data_split 1,0,0 --model_name_or_path codellama/CodeLlama-7b-hf --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --max_seq_len 512 --learning_rate 9.65e-6 --weight_decay 0. --num_train_epochs 3 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --gradient_checkpointing --zero_stage 3 --deepspeed --lora_dim 128 --lora_module_name layers. --output_dir ./output_step1_Codellama_7b_lora_llamahub-devrev --add_eot_token
- [2023-12-11 05:39:07,364] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- [2023-12-11 05:39:09,159] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
- [2023-12-11 05:39:09,159] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
- [2023-12-11 05:39:09,159] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
- [2023-12-11 05:39:09,159] [INFO] [launch.py:163:main] dist_world_size=4
- [2023-12-11 05:39:09,159] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
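
The --world_info value in the cmd line above is base64-encoded JSON; decoding it reproduces the WORLD INFO DICT the launcher logs (a quick check, not part of the log):

    import base64, json

    world_info = "eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119"  # copied from the cmd line
    print(json.loads(base64.urlsafe_b64decode(world_info)))  # {'localhost': [0, 1, 2, 3]}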
- [2023-12-11 05:39:12,594] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- [2023-12-11 05:39:12,600] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- [2023-12-11 05:39:12,605] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
- [2023-12-11 05:39:12,606] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
- /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
- warnings.warn(
- [2023-12-11 05:39:14,179] [INFO] [comm.py:637:init_distributed] cdb=None
- [2023-12-11 05:39:14,179] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
- [2023-12-11 05:39:14,642] [INFO] [comm.py:637:init_distributed] cdb=None
- [2023-12-11 05:39:14,646] [INFO] [comm.py:637:init_distributed] cdb=None
- [2023-12-11 05:39:14,678] [INFO] [comm.py:637:init_distributed] cdb=None
- The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
- The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
- The class this function is called from is 'LlamaTokenizer'.
  The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
  The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
  The class this function is called from is 'LlamaTokenizer'.
@@ -35,17 +29,14 @@ The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
  The class this function is called from is 'LlamaTokenizer'.
  You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
  You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
- You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
  The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
  The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
  The class this function is called from is 'LlamaTokenizer'.
  You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
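
These repeated warnings are benign: the script loads the tokenizer through LlamaTokenizer while the checkpoint declares CodeLlamaTokenizer. Loading via AutoTokenizer would resolve the class from the checkpoint's own config and silence them (a sketch, not what main.py actually does):

    from transformers import AutoTokenizer

    # Resolves to CodeLlamaTokenizer from the checkpoint's tokenizer_config.json.
    tok = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")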
- [2023-12-11 05:39:17,564] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 291, num_elems = 6.74B
-
-
-
-
- Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
  Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
  Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
  Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
@@ -55,70 +46,66 @@ Building extension module fused_adam...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  Loading extension module fused_adam...
- Time to load fused_adam op: 0.12226700782775879 seconds
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
- Loading extension module fused_adam...
- Loading extension module fused_adam...
- Time to load fused_adam op: 0.2073993682861328 seconds
- Time to load fused_adam op: 0.2075939178466797 seconds
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
- Loading extension module fused_adam...
- [2023-12-11 05:39:51,139] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.4, git-hash=unknown, git-branch=unknown
- [2023-12-11 05:39:51,139] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
- Time to load fused_adam op: 0.2015979290008545 seconds
- /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
- self._dummy_overflow_buf = get_accelerator().IntTensor([0])
- [2023-12-11 05:39:51,163] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
- [2023-12-11 05:39:51,164] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
- [2023-12-11 05:39:51,164] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
- [2023-12-11 05:39:51,206] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
- [2023-12-11 05:39:51,206] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
- [2023-12-11 05:39:51,206] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
- [2023-12-11 05:39:51,206] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
- [2023-12-11 05:39:51,330] [INFO] [utils.py:795:see_memory_usage] Stage 3 initialize beginning
- [2023-12-11 05:39:51,331] [INFO] [utils.py:796:see_memory_usage] MA 4.37 GB Max_MA 4.75 GB CA 8.35 GB Max_CA 8 GB
- [2023-12-11 05:39:51,331] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 98.11 GB, percent = 39.0%
- [2023-12-11 05:39:51,333] [INFO] [stage3.py:127:__init__] Reduce bucket size 500,000,000
- [2023-12-11 05:39:51,333] [INFO] [stage3.py:128:__init__] Prefetch bucket size 30000000
- [2023-12-11 05:39:51,450] [INFO] [utils.py:795:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
- [2023-12-11 05:39:51,450] [INFO] [utils.py:796:see_memory_usage] MA 4.37 GB Max_MA 4.37 GB CA 8.35 GB Max_CA 8 GB
- [2023-12-11 05:39:51,450] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 98.13 GB, percent = 39.0%
  Parameter Offload: Total persistent parameters: 266240 in 65 params
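
The persistent-parameter count matches the model's norm layers: with param_persistence_threshold=10000 (see the zero_config dump below), tensors smaller than the threshold are kept resident on every rank rather than partitioned, and for a 7B Llama those are exactly the RMSNorm weights. A quick check, assuming the stock CodeLlama-7b shape:

    layers, hidden = 32, 4096                    # CodeLlama-7b architecture
    norm_tensors = 2 * layers + 1                # two norms per block + final norm
    print(norm_tensors, norm_tensors * hidden)   # 65 params, 266240 elements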
- [2023-12-11 05:39:51,757] [INFO] [utils.py:795:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
- [2023-12-11 05:39:51,758] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB Max_MA 4.43 GB CA 8.35 GB Max_CA 8 GB
- [2023-12-11 05:39:51,758] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 98.13 GB, percent = 39.0%
- [2023-12-11 05:39:51,866] [INFO] [utils.py:795:see_memory_usage] Before creating fp16 partitions
- [2023-12-11 05:39:51,866] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB Max_MA 3.54 GB CA 8.35 GB Max_CA 8 GB
- [2023-12-11 05:39:51,866] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 98.13 GB, percent = 39.0%
- [2023-12-11 05:39:52,568] [INFO] [utils.py:795:see_memory_usage] After creating fp16 partitions: 3
- [2023-12-11 05:39:52,569] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB Max_MA 3.54 GB CA 5.29 GB Max_CA 8 GB
- [2023-12-11 05:39:52,569] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 98.22 GB, percent = 39.0%
- [2023-12-11 05:39:52,708] [INFO] [utils.py:795:see_memory_usage] Before creating fp32 partitions
- [2023-12-11 05:39:52,709] [INFO] [utils.py:796:see_memory_usage] MA 3.54 GB Max_MA 3.54 GB CA 5.29 GB Max_CA 5 GB
- [2023-12-11 05:39:52,709] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 96.08 GB, percent = 38.2%
- [2023-12-11 05:39:52,831] [INFO] [utils.py:795:see_memory_usage] After creating fp32 partitions
- [2023-12-11 05:39:52,832] [INFO] [utils.py:796:see_memory_usage] MA 4.08 GB Max_MA 4.23 GB CA 5.99 GB Max_CA 6 GB
- [2023-12-11 05:39:52,832] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 96.11 GB, percent = 38.2%
- [2023-12-11 05:39:52,942] [INFO] [utils.py:795:see_memory_usage] Before initializing optimizer states
- [2023-12-11 05:39:52,942] [INFO] [utils.py:796:see_memory_usage] MA 4.08 GB Max_MA 4.08 GB CA 5.99 GB Max_CA 6 GB
- [2023-12-11 05:39:52,943] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 96.11 GB, percent = 38.2%
- [2023-12-11 05:39:53,083] [INFO] [utils.py:795:see_memory_usage] After initializing optimizer states
- [2023-12-11 05:39:53,084] [INFO] [utils.py:796:see_memory_usage] MA 5.17 GB Max_MA 5.47 GB CA 7.38 GB Max_CA 7 GB
- [2023-12-11 05:39:53,084] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 96.07 GB, percent = 38.2%
- [2023-12-11 05:39:53,084] [INFO] [stage3.py:479:_setup_for_real_optimizer] optimizer state initialized
- [2023-12-11 05:39:53,479] [INFO] [utils.py:795:see_memory_usage] After initializing ZeRO optimizer
- [2023-12-11 05:39:53,480] [INFO] [utils.py:796:see_memory_usage] MA 6.37 GB Max_MA 6.86 GB CA 9.05 GB Max_CA 9 GB
- [2023-12-11 05:39:53,480] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 96.07 GB, percent = 38.2%
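
In these see_memory_usage lines, MA/Max_MA are torch's live-allocation counters and CA/Max_CA the allocator's cached (reserved) pool, all in GB; roughly what is being read (assumed from the standard torch.cuda API):

    import torch

    ma     = torch.cuda.memory_allocated() / 2**30      # "MA": live tensors
    max_ma = torch.cuda.max_memory_allocated() / 2**30  # "Max_MA": peak
    ca     = torch.cuda.memory_reserved() / 2**30       # "CA": allocator cache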
- [2023-12-11 05:39:53,480] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
- [2023-12-11 05:39:53,480] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
- [2023-12-11 05:39:53,480] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f53cfda67d0>
- [2023-12-11 05:39:53,480] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[9.65e-06, 0.0005, 9.65e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
- [2023-12-11 05:39:53,482] [INFO] [config.py:979:print] DeepSpeedEngine configuration:
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] activation_checkpointing_config {
  "partition_activations": false,
  "contiguous_memory_optimization": false,
  "cpu_checkpointing": false,
@@ -126,10 +113,10 @@ Parameter Offload: Total persistent parameters: 266240 in 65 params
  "synchronize_checkpoint_boundary": false,
  "profile": false
  }
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] amp_enabled .................. False
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] amp_params ................... False
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] autotuning_config ............ {
  "enabled": false,
  "start_step": null,
  "end_step": null,
@@ -154,31 +141,31 @@ Parameter Offload: Total persistent parameters: 266240 in 65 params
  "min_train_micro_batch_size_per_gpu": 1,
  "num_tuning_micro_batch_sizes": 3
  }
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] bfloat16_enabled ............. False
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] checkpoint_parallel_write_pipeline False
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] checkpoint_tag_validation_enabled True
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] checkpoint_tag_validation_fail False
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f53cf91c7d0>
- [2023-12-11 05:39:53,482] [INFO] [config.py:983:print] communication_data_type ...... None
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] curriculum_enabled_legacy .... False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] curriculum_params_legacy ..... False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] data_efficiency_enabled ...... False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] dataloader_drop_last ......... False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] disable_allgather ............ False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] dump_state ................... False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'consecutive_hysteresis': False, 'min_scale': 1}
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_enabled ........... False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_gas_boundary_resolution 1
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_layer_name ........ bert.encoder.layer
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_layer_num ......... 0
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_max_iter .......... 100
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_stability ......... 1e-06
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_tol ............... 0.01
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] eigenvalue_verbose ........... False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] elasticity_enabled ........... False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] flops_profiler_config ........ {
  "enabled": false,
  "recompute_fwd_factor": 0.0,
  "profile_step": 1,
@@ -187,23 +174,23 @@ Parameter Offload: Total persistent parameters: 266240 in 65 params
  "detailed": true,
  "output_file": null
  }
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] fp16_auto_cast ............... False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] fp16_enabled ................. True
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] fp16_master_weights_and_gradients False
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] global_rank .................. 0
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] grad_accum_dtype ............. None
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] gradient_accumulation_steps .. 1
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] gradient_clipping ............ 1.0
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] gradient_predivide_factor .... 1.0
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
- [2023-12-11 05:39:53,483] [INFO] [config.py:983:print] initial_dynamic_scale ........ 65536
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] load_universal_checkpoint .... False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] loss_scale ................... 0
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] memory_breakdown ............. False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] mics_hierarchial_params_gather False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] mics_shard_size .............. -1
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='step1_tensorboard/ds_tensorboard_logs/', job_name='step1_model_tensorboard') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] nebula_config ................ {
  "enabled": false,
  "persistent_storage_path": null,
  "persistent_time_interval": 100,
@@ -211,33 +198,33 @@ Parameter Offload: Total persistent parameters: 266240 in 65 params
  "enable_nebula_load": true,
  "load_path": null
  }
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] optimizer_legacy_fusion ...... False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] optimizer_name ............... None
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] optimizer_params ............. None
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] pld_enabled .................. False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] pld_params ................... False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] prescale_gradients ........... False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] scheduler_name ............... None
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] scheduler_params ............. None
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] seq_parallel_communication_data_type torch.float32
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] sparse_attention ............. None
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] sparse_gradients_enabled ..... False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] steps_per_print .............. 10
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] train_batch_size ............. 32
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] train_micro_batch_size_per_gpu 8
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] use_data_before_expert_parallel_ False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] use_node_local_storage ....... False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] wall_clock_breakdown ......... False
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] weight_quantization_config ... None
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] world_size ................... 4
- [2023-12-11 05:39:53,484] [INFO] [config.py:983:print] zero_allow_untested_optimizer False
- [2023-12-11 05:39:53,485] [INFO] [config.py:983:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False pipeline_loading_checkpoint=False override_module_apply=True
- [2023-12-11 05:39:53,485] [INFO] [config.py:983:print] zero_enabled ................. True
- [2023-12-11 05:39:53,485] [INFO] [config.py:983:print] zero_force_ds_cpu_optimizer .. True
- [2023-12-11 05:39:53,485] [INFO] [config.py:983:print] zero_optimization_stage ...... 3
- [2023-12-11 05:39:53,485] [INFO] [config.py:969:print_user_config] json = {
- "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 8,
  "steps_per_print": 10,
  "zero_optimization": {
@@ -275,76 +262,272 @@ Parameter Offload: Total persistent parameters: 266240 in 65 params
  }
  }
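
The train_batch_size printed above is not an independent knob; DeepSpeed asserts it equals micro-batch size x gradient-accumulation steps x world size. For this run:

    micro_batch, grad_accum, world_size = 8, 1, 4  # from the config and launch log
    print(micro_batch * grad_accum * world_size)   # 32 == train_batch_size

The rerun added below uses only three GPUs, which is why its config dump shows train_batch_size 24.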
  ***** Running training *****
- ***** Evaluating perplexity, Epoch 0/3 *****
- ppl: 4.460639476776123, loss: 1.4952921867370605
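
The evaluation prints perplexity as the exponential of the mean loss, so the pair is internally consistent at every epoch:

    import math
    print(math.exp(1.4952921867370605))  # 4.4606... = the ppl logged above
    print(math.exp(0.551319420337677))   # 1.7355... = the Epoch 1 eval below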
- Beginning of Epoch 1/3, Total Micro Batches 13
- /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
- warnings.warn(
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
- Model Parameters: 6.927 B, Latency: 4.07s, TFLOPs: 10.30, Samples/sec: 1.97, Time/seq 0.51s, Batch Size: 8, Sequence Length: 512
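
The throughput fields are plain ratios of the logged per-device batch and latency (the TFLOPs figure comes from the script's model-FLOPs estimate and is not recomputed here):

    batch, latency = 8, 4.07
    print(batch / latency)   # ~1.97 samples/sec
    print(latency / batch)   # ~0.51 s per sequence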
  Invalidate trace cache @ step 0: expected module 6, but got module 0
- Model Parameters: 6.927 B, Latency: 3.74s, TFLOPs: 11.21, Samples/sec: 2.14, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.75s, TFLOPs: 11.17, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.73s, TFLOPs: 11.24, Samples/sec: 2.15, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.62s, TFLOPs: 11.55, Samples/sec: 2.21, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.62s, TFLOPs: 11.57, Samples/sec: 2.21, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.55, Samples/sec: 2.21, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.62s, TFLOPs: 11.56, Samples/sec: 2.21, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- [2023-12-11 05:40:33,349] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[8.167395005683819e-06, 0.00042318108837739987, 8.167395005683819e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
- [2023-12-11 05:40:33,349] [INFO] [timer.py:260:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=8.762510054419295, CurrSamplesPerSec=8.82404309088149, MemAllocated=6.88GB, MaxMemAllocated=10.68GB
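
The logged learning rates follow a zero-warmup cosine schedule over 13 micro-batches x 3 epochs = 39 steps; both parameter groups (9.65e-6 and the 5e-4 LoRA group) are scaled by the same factor. A check using the usual closed form (assumed to match the script's scheduler):

    import math

    total_steps = 39
    factor = lambda step: 0.5 * (1 + math.cos(math.pi * step / total_steps))
    print(9.65e-6 * factor(10), 5e-4 * factor(10))  # ~8.167e-06, ~4.232e-04 (step=10)
    print(9.65e-6 * factor(20), 5e-4 * factor(20))  # ~4.631e-06, ~2.399e-04 (step=20)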
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.54, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.62s, TFLOPs: 11.55, Samples/sec: 2.21, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.62s, TFLOPs: 11.56, Samples/sec: 2.21, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.24s, TFLOPs: 12.92, Samples/sec: 2.47, Time/seq 0.40s, Batch Size: 8, Sequence Length: 512
- ***** Evaluating perplexity, Epoch 1/3 *****
  Invalidate trace cache @ step 0: expected module 0, but got module 6
- ppl: 1.7355413436889648, loss: 0.551319420337677
- Beginning of Epoch 2/3, Total Micro Batches 13
- Model Parameters: 6.927 B, Latency: 3.75s, TFLOPs: 11.17, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.76s, TFLOPs: 11.14, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.54, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- [2023-12-11 05:41:11,363] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[4.6307168389720735e-06, 0.0002399335149726463, 4.6307168389720735e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
- [2023-12-11 05:41:11,363] [INFO] [timer.py:260:stop] epoch=1/micro_step=7/global_step=20, RunningAvgSamplesPerSec=8.813860487355969, CurrSamplesPerSec=8.815178915300866, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.24s, TFLOPs: 12.90, Samples/sec: 2.47, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
- ***** Evaluating perplexity, Epoch 2/3 *****
  Invalidate trace cache @ step 0: expected module 0, but got module 6
- ppl: 1.0645378828048706, loss: 0.0625406950712204
- Beginning of Epoch 3/3, Total Micro Batches 13
- Model Parameters: 6.927 B, Latency: 3.75s, TFLOPs: 11.16, Samples/sec: 2.13, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.77s, TFLOPs: 11.11, Samples/sec: 2.12, Time/seq 0.47s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- [2023-12-11 05:41:49,440] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[1.2134356400744368e-06, 6.28723129572247e-05, 1.2134356400744368e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
- [2023-12-11 05:41:49,441] [INFO] [timer.py:260:stop] epoch=2/micro_step=4/global_step=30, RunningAvgSamplesPerSec=8.8235289391519, CurrSamplesPerSec=8.771353693343526, MemAllocated=6.88GB, MaxMemAllocated=11.06GB
- Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.47, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.53, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.51, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.65s, TFLOPs: 11.48, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.64s, TFLOPs: 11.50, Samples/sec: 2.20, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.63s, TFLOPs: 11.52, Samples/sec: 2.20, Time/seq 0.45s, Batch Size: 8, Sequence Length: 512
- Model Parameters: 6.927 B, Latency: 3.26s, TFLOPs: 12.83, Samples/sec: 2.45, Time/seq 0.41s, Batch Size: 8, Sequence Length: 512
- ***** Evaluating perplexity, Epoch 3/3 *****
  Invalidate trace cache @ step 0: expected module 0, but got module 6
- ppl: 1.031400442123413, loss: 0.03091755136847496
  saving the final model ...
- [2023-12-11 05:42:35,188] [INFO] [launch.py:347:main] Process 1247715 exits successfully.
- [2023-12-11 05:42:35,188] [INFO] [launch.py:347:main] Process 1247716 exits successfully.
- [2023-12-11 05:42:35,189] [INFO] [launch.py:347:main] Process 1247717 exits successfully.
- [2023-12-11 05:44:14,200] [INFO] [launch.py:347:main] Process 1247714 exits successfully.
 
+ [2023-12-11 10:42:54,890] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
+ [2023-12-11 10:42:56,697] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
+ Detected CUDA_VISIBLE_DEVICES=0,1,2: setting --include=localhost:0,1,2
+ [2023-12-11 10:42:56,698] [INFO] [runner.py:570:main] cmd = /home/t-sokumar/miniconda3/envs/ft/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path local/jsonfile --data_split 1,0,0 --model_name_or_path codellama/CodeLlama-7b-hf --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --max_seq_len 512 --learning_rate 9.65e-6 --weight_decay 0. --num_train_epochs 10 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --gradient_checkpointing --zero_stage 3 --deepspeed --lora_dim 128 --lora_module_name layers. --output_dir ./output_step1_Codellama_7b_lora_llamahub-devrev --add_eot_token
+ [2023-12-11 10:42:59,233] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
+ [2023-12-11 10:43:01,086] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
+ [2023-12-11 10:43:01,086] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=3, node_rank=0
+ [2023-12-11 10:43:01,086] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
+ [2023-12-11 10:43:01,086] [INFO] [launch.py:163:main] dist_world_size=3
+ [2023-12-11 10:43:01,086] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
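
Relative to the deleted run, this rerun changes two launch flags: --num_train_epochs 10 instead of 3, and three visible GPUs instead of four. With dist_world_size=3 the effective batch size printed later in this config dump follows directly:

    micro_batch, grad_accum, world_size = 8, 1, 3
    print(micro_batch * grad_accum * world_size)  # 24 == train_batch_size below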
+ [2023-12-11 10:43:04,573] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
+ [2023-12-11 10:43:04,579] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
+ [2023-12-11 10:43:04,650] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
+ [2023-12-11 10:43:06,219] [INFO] [comm.py:637:init_distributed] cdb=None
+ [2023-12-11 10:43:06,219] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
+ [2023-12-11 10:43:06,306] [INFO] [comm.py:637:init_distributed] cdb=None
+ [2023-12-11 10:43:06,307] [INFO] [comm.py:637:init_distributed] cdb=None
  The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
  The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
  The class this function is called from is 'LlamaTokenizer'.
 
  The class this function is called from is 'LlamaTokenizer'.
  You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
  You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
 
  The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
  The tokenizer class you load from this checkpoint is 'CodeLlamaTokenizer'.
  The class this function is called from is 'LlamaTokenizer'.
  You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
+ [2023-12-11 10:43:09,096] [INFO] [partition_parameters.py:348:__exit__] finished initializing model - num_params = 291, num_elems = 6.74B
+
+
+
  Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
  Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
  Using /home/t-sokumar/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
 
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  ninja: no work to do.
  Loading extension module fused_adam...
+ Time to load fused_adam op: 0.1028139591217041 seconds
+ Loading extension module fused_adam...
+ Loading extension module fused_adam...
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
+ Time to load fused_adam op: 0.10137510299682617 seconds
+ Time to load fused_adam op: 0.10164141654968262 seconds
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  self._dummy_overflow_buf = get_accelerator().IntTensor([0])
+ [2023-12-11 10:43:18,099] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.4, git-hash=unknown, git-branch=unknown
+ [2023-12-11 10:43:18,099] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
+ [2023-12-11 10:43:18,121] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
+ [2023-12-11 10:43:18,123] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
+ [2023-12-11 10:43:18,123] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
+ [2023-12-11 10:43:18,161] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
+ [2023-12-11 10:43:18,161] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
+ [2023-12-11 10:43:18,161] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
+ [2023-12-11 10:43:18,161] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
+ [2023-12-11 10:43:18,286] [INFO] [utils.py:795:see_memory_usage] Stage 3 initialize beginning
+ [2023-12-11 10:43:18,287] [INFO] [utils.py:796:see_memory_usage] MA 5.37 GB Max_MA 5.79 GB CA 11.7 GB Max_CA 12 GB
+ [2023-12-11 10:43:18,287] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 93.3 GB, percent = 37.1%
+ [2023-12-11 10:43:18,289] [INFO] [stage3.py:127:__init__] Reduce bucket size 500,000,000
+ [2023-12-11 10:43:18,289] [INFO] [stage3.py:128:__init__] Prefetch bucket size 30000000
+ [2023-12-11 10:43:18,399] [INFO] [utils.py:795:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
+ [2023-12-11 10:43:18,400] [INFO] [utils.py:796:see_memory_usage] MA 5.37 GB Max_MA 5.37 GB CA 11.7 GB Max_CA 12 GB
+ [2023-12-11 10:43:18,400] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 93.27 GB, percent = 37.1%
  Parameter Offload: Total persistent parameters: 266240 in 65 params
+ [2023-12-11 10:43:18,806] [INFO] [utils.py:795:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
+ [2023-12-11 10:43:18,807] [INFO] [utils.py:796:see_memory_usage] MA 4.64 GB Max_MA 5.46 GB CA 11.7 GB Max_CA 12 GB
+ [2023-12-11 10:43:18,807] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 93.24 GB, percent = 37.1%
+ [2023-12-11 10:43:18,913] [INFO] [utils.py:795:see_memory_usage] Before creating fp16 partitions
+ [2023-12-11 10:43:18,914] [INFO] [utils.py:796:see_memory_usage] MA 4.64 GB Max_MA 4.64 GB CA 11.7 GB Max_CA 12 GB
+ [2023-12-11 10:43:18,914] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 93.24 GB, percent = 37.1%
+ [2023-12-11 10:43:19,685] [INFO] [utils.py:795:see_memory_usage] After creating fp16 partitions: 3
+ [2023-12-11 10:43:19,686] [INFO] [utils.py:796:see_memory_usage] MA 4.64 GB Max_MA 4.64 GB CA 7.4 GB Max_CA 12 GB
+ [2023-12-11 10:43:19,686] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 93.67 GB, percent = 37.2%
+ [2023-12-11 10:43:19,803] [INFO] [utils.py:795:see_memory_usage] Before creating fp32 partitions
+ [2023-12-11 10:43:19,804] [INFO] [utils.py:796:see_memory_usage] MA 4.64 GB Max_MA 4.64 GB CA 7.4 GB Max_CA 7 GB
+ [2023-12-11 10:43:19,804] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 93.07 GB, percent = 37.0%
+ [2023-12-11 10:43:19,932] [INFO] [utils.py:795:see_memory_usage] After creating fp32 partitions
+ [2023-12-11 10:43:19,932] [INFO] [utils.py:796:see_memory_usage] MA 5.36 GB Max_MA 5.56 GB CA 8.49 GB Max_CA 8 GB
+ [2023-12-11 10:43:19,964] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 93.04 GB, percent = 37.0%
+ [2023-12-11 10:43:20,076] [INFO] [utils.py:795:see_memory_usage] Before initializing optimizer states
+ [2023-12-11 10:43:20,077] [INFO] [utils.py:796:see_memory_usage] MA 5.36 GB Max_MA 5.36 GB CA 8.49 GB Max_CA 8 GB
+ [2023-12-11 10:43:20,077] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 92.33 GB, percent = 36.7%
+ [2023-12-11 10:43:20,189] [INFO] [utils.py:795:see_memory_usage] After initializing optimizer states
+ [2023-12-11 10:43:20,190] [INFO] [utils.py:796:see_memory_usage] MA 6.81 GB Max_MA 7.21 GB CA 10.34 GB Max_CA 10 GB
+ [2023-12-11 10:43:20,190] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 92.33 GB, percent = 36.7%
+ [2023-12-11 10:43:20,190] [INFO] [stage3.py:479:_setup_for_real_optimizer] optimizer state initialized
+ [2023-12-11 10:43:20,563] [INFO] [utils.py:795:see_memory_usage] After initializing ZeRO optimizer
+ [2023-12-11 10:43:20,564] [INFO] [utils.py:796:see_memory_usage] MA 8.1 GB Max_MA 8.59 GB CA 12.01 GB Max_CA 12 GB
+ [2023-12-11 10:43:20,565] [INFO] [utils.py:803:see_memory_usage] CPU Virtual Memory: used = 91.63 GB, percent = 36.4%
+ [2023-12-11 10:43:20,565] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
+ [2023-12-11 10:43:20,565] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
+ [2023-12-11 10:43:20,565] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f02815f2b50>
+ [2023-12-11 10:43:20,565] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[9.65e-06, 0.0005, 9.65e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:43:20,566] [INFO] [config.py:979:print] DeepSpeedEngine configuration:
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] activation_checkpointing_config {
  "partition_activations": false,
  "contiguous_memory_optimization": false,
  "cpu_checkpointing": false,
 
  "synchronize_checkpoint_boundary": false,
  "profile": false
  }
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] amp_enabled .................. False
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] amp_params ................... False
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] autotuning_config ............ {
  "enabled": false,
  "start_step": null,
  "end_step": null,
 
  "min_train_micro_batch_size_per_gpu": 1,
  "num_tuning_micro_batch_sizes": 3
  }
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] bfloat16_enabled ............. False
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] checkpoint_parallel_write_pipeline False
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] checkpoint_tag_validation_enabled True
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] checkpoint_tag_validation_fail False
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f0281706bd0>
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] communication_data_type ...... None
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] curriculum_enabled_legacy .... False
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] curriculum_params_legacy ..... False
+ [2023-12-11 10:43:20,567] [INFO] [config.py:983:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] data_efficiency_enabled ...... False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] dataloader_drop_last ......... False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] disable_allgather ............ False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] dump_state ................... False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'consecutive_hysteresis': False, 'min_scale': 1}
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] eigenvalue_enabled ........... False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] eigenvalue_gas_boundary_resolution 1
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] eigenvalue_layer_name ........ bert.encoder.layer
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] eigenvalue_layer_num ......... 0
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] eigenvalue_max_iter .......... 100
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] eigenvalue_stability ......... 1e-06
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] eigenvalue_tol ............... 0.01
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] eigenvalue_verbose ........... False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] elasticity_enabled ........... False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] flops_profiler_config ........ {
  "enabled": false,
  "recompute_fwd_factor": 0.0,
  "profile_step": 1,
 
  "detailed": true,
  "output_file": null
  }
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] fp16_auto_cast ............... False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] fp16_enabled ................. True
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] fp16_master_weights_and_gradients False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] global_rank .................. 0
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] grad_accum_dtype ............. None
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] gradient_accumulation_steps .. 1
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] gradient_clipping ............ 1.0
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] gradient_predivide_factor .... 1.0
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] initial_dynamic_scale ........ 65536
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] load_universal_checkpoint .... False
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] loss_scale ................... 0
+ [2023-12-11 10:43:20,568] [INFO] [config.py:983:print] memory_breakdown ............. False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] mics_hierarchial_params_gather False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] mics_shard_size .............. -1
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='step1_tensorboard/ds_tensorboard_logs/', job_name='step1_model_tensorboard') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] nebula_config ................ {
  "enabled": false,
  "persistent_storage_path": null,
  "persistent_time_interval": 100,
 
  "enable_nebula_load": true,
  "load_path": null
  }
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] optimizer_legacy_fusion ...... False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] optimizer_name ............... None
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] optimizer_params ............. None
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] pld_enabled .................. False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] pld_params ................... False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] prescale_gradients ........... False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] scheduler_name ............... None
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] scheduler_params ............. None
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] seq_parallel_communication_data_type torch.float32
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] sparse_attention ............. None
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] sparse_gradients_enabled ..... False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] steps_per_print .............. 10
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] train_batch_size ............. 24
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] train_micro_batch_size_per_gpu 8
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] use_data_before_expert_parallel_ False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] use_node_local_storage ....... False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] wall_clock_breakdown ......... False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] weight_quantization_config ... None
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] world_size ................... 3
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] zero_allow_untested_optimizer False
+ [2023-12-11 10:43:20,569] [INFO] [config.py:983:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=False pipeline_loading_checkpoint=False override_module_apply=True
+ [2023-12-11 10:43:20,570] [INFO] [config.py:983:print] zero_enabled ................. True
+ [2023-12-11 10:43:20,570] [INFO] [config.py:983:print] zero_force_ds_cpu_optimizer .. True
+ [2023-12-11 10:43:20,570] [INFO] [config.py:983:print] zero_optimization_stage ...... 3
+ [2023-12-11 10:43:20,570] [INFO] [config.py:969:print_user_config] json = {
+ "train_batch_size": 24,
  "train_micro_batch_size_per_gpu": 8,
  "steps_per_print": 10,
  "zero_optimization": {
 
  }
  }
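Note on the printed batch sizes: DeepSpeed requires the global train_batch_size to equal train_micro_batch_size_per_gpu x gradient_accumulation_steps x world_size, which is why 24 appears alongside a micro batch of 8 and a world size of 3 in the config print above. A minimal sketch of that arithmetic, using only values from the config print (variable names are illustrative, not from the run):

# Values copied from the config print above; names are illustrative.
train_micro_batch_size_per_gpu = 8
gradient_accumulation_steps = 1
world_size = 3

train_batch_size = (train_micro_batch_size_per_gpu
                    * gradient_accumulation_steps
                    * world_size)
assert train_batch_size == 24  # matches the printed train_batch_size
print(train_batch_size)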
  ***** Running training *****
+ ***** Evaluating perplexity, Epoch 0/10 *****
+ ppl: 4.454780578613281, loss: 1.4939777851104736
+ Beginning of Epoch 1/10, Total Micro Batches 18
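The evaluation lines report perplexity and loss together; they are consistent with perplexity being the exponential of the mean cross-entropy loss, ppl = exp(loss). A quick check against the Epoch 0 numbers above (a sketch, not code from the run):

import math

loss = 1.4939777851104736  # from the Epoch 0/10 evaluation line
ppl = math.exp(loss)
print(ppl)                 # ~4.45478, matching the reported perplexity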
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
  /home/t-sokumar/miniconda3/envs/ft/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
+ Model Parameters: 6.927 B, Latency: 4.19s, TFLOPs: 13.34, Samples/sec: 1.91, Time/seq 0.52s, Batch Size: 8, Sequence Length: 512
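Each "Model Parameters" line is a per-rank throughput report: Samples/sec is Batch Size divided by Latency, and Time/seq is Latency divided by Batch Size; the RunningAvgSamplesPerSec in the timer lines is the aggregate across ranks, roughly the per-rank rate times the world size of 3. A sketch of that arithmetic (the TFLOPs column comes from a model-specific FLOPs estimate and is left aside here):

batch_size = 8
latency_s = 3.66     # steady-state step latency seen below
world_size = 3

samples_per_sec = batch_size / latency_s       # ~2.19, as reported per rank
time_per_seq = latency_s / batch_size          # ~0.46 s, as reported
aggregate_sps = samples_per_sec * world_size   # ~6.6, close to RunningAvgSamplesPerSec
print(samples_per_sec, time_per_seq, aggregate_sps)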
  Invalidate trace cache @ step 0: expected module 6, but got module 0
+ Model Parameters: 6.927 B, Latency: 3.83s, TFLOPs: 14.59, Samples/sec: 2.09, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.84s, TFLOPs: 14.54, Samples/sec: 2.08, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.12, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.26, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.24, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.27, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.25, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.26, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:44:01,475] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[9.576697408283905e-06, 0.000496201938253052, 9.576697408283905e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:44:01,475] [INFO] [timer.py:260:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=6.5135351330213425, CurrSamplesPerSec=6.560741213043025, MemAllocated=8.61GB, MaxMemAllocated=14.1GB
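The lr values logged every 10 steps trace a zero-warmup cosine decay over the whole run: 10 epochs x 18 micro-batches = 180 optimizer steps, the step-90 value (4.825e-06) is exactly half the implied base of 9.65e-06, and step 180 reaches 0. The second parameter group decays from roughly 5e-4 on the same schedule (presumably a separately scheduled group such as the LoRA weights; that grouping is an assumption, not something the log states). A sketch reproducing the step-10 value for the first group:

import math

base_lr = 9.65e-06  # implied by the logged schedule; half of it appears at step 90
total_steps = 180   # 10 epochs x 18 micro-batches, gradient accumulation 1
step = 10

lr = base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
print(lr)           # ~9.5767e-06, matching the step=10 line above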
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.25, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.24, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.20, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.25, Samples/sec: 2.19, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.22, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.24, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.22, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.24, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 1.58s, TFLOPs: 35.24, Samples/sec: 5.05, Time/seq 0.20s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 1/10 *****
+ Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.1439684629440308, loss: 0.1345033049583435
+ Beginning of Epoch 2/10, Total Micro Batches 18
+ Model Parameters: 6.927 B, Latency: 3.83s, TFLOPs: 14.58, Samples/sec: 2.09, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:44:38,766] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[9.35901689529201e-06, 0.0004849231551964771, 9.35901689529201e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:44:38,766] [INFO] [timer.py:260:stop] epoch=1/micro_step=2/global_step=20, RunningAvgSamplesPerSec=6.710114711955848, CurrSamplesPerSec=6.246338308005135, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.84s, TFLOPs: 14.52, Samples/sec: 2.08, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.23, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.23, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.24, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.23, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.23, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.20, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.22, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.66s, TFLOPs: 15.23, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:45:15,457] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[9.003572573259918e-06, 0.00046650635094610973, 9.003572573259918e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:45:15,457] [INFO] [timer.py:260:stop] epoch=1/micro_step=12/global_step=30, RunningAvgSamplesPerSec=6.650983352459916, CurrSamplesPerSec=6.541507580323197, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.22, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.23, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.22, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.23, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 1.58s, TFLOPs: 35.33, Samples/sec: 5.06, Time/seq 0.20s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 2/10 *****
+ Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.0047215223312378, loss: 0.004710380919277668
+ Beginning of Epoch 3/10, Total Micro Batches 18
+ Model Parameters: 6.927 B, Latency: 3.83s, TFLOPs: 14.58, Samples/sec: 2.09, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.86s, TFLOPs: 14.47, Samples/sec: 2.07, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:45:52,775] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=0, lr=[8.52116443804907e-06, 0.0004415111107797445, 8.52116443804907e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:45:52,776] [INFO] [timer.py:260:stop] epoch=2/micro_step=4/global_step=40, RunningAvgSamplesPerSec=6.7074234396499754, CurrSamplesPerSec=6.540627329973461, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.22, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.20, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.20, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.20, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:46:29,517] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=0, lr=[7.926450216737553e-06, 0.0004106969024216348, 7.926450216737553e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:46:29,518] [INFO] [timer.py:260:stop] epoch=2/micro_step=14/global_step=50, RunningAvgSamplesPerSec=6.671387689388046, CurrSamplesPerSec=6.52887655567313, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.18, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.21, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 1.58s, TFLOPs: 35.30, Samples/sec: 5.06, Time/seq 0.20s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 3/10 *****
+ Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.0025222301483154, loss: 0.002519124187529087
+ Beginning of Epoch 4/10, Total Micro Batches 18
+ Model Parameters: 6.927 B, Latency: 3.83s, TFLOPs: 14.57, Samples/sec: 2.09, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.85s, TFLOPs: 14.49, Samples/sec: 2.08, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.18, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:47:06,889] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=0, lr=[7.237500000000001e-06, 0.000375, 7.237500000000001e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:47:06,889] [INFO] [timer.py:260:stop] epoch=3/micro_step=6/global_step=60, RunningAvgSamplesPerSec=6.703453633726282, CurrSamplesPerSec=6.5185040516557375, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.18, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:47:43,680] [INFO] [logging.py:96:log_dist] [Rank 0] step=70, skipped=0, lr=[6.475247191546353e-06, 0.0003355050358314172, 6.475247191546353e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:47:43,681] [INFO] [timer.py:260:stop] epoch=3/micro_step=16/global_step=70, RunningAvgSamplesPerSec=6.6772772936024944, CurrSamplesPerSec=6.529561351866838, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.18, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.67s, TFLOPs: 15.19, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 1.59s, TFLOPs: 35.12, Samples/sec: 5.03, Time/seq 0.20s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 4/10 *****
+ Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.0022900104522705, loss: 0.002287394367158413
+ Beginning of Epoch 5/10, Total Micro Batches 18
+ Model Parameters: 6.927 B, Latency: 3.84s, TFLOPs: 14.53, Samples/sec: 2.08, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.85s, TFLOPs: 14.49, Samples/sec: 2.08, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.13, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:48:21,116] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=0, lr=[5.66285245724294e-06, 0.00029341204441673266, 5.66285245724294e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:48:21,116] [INFO] [timer.py:260:stop] epoch=4/micro_step=8/global_step=80, RunningAvgSamplesPerSec=6.698826811268296, CurrSamplesPerSec=6.5242438758469135, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.18, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.18, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:48:55,839] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=0, lr=[4.825e-06, 0.00025, 4.825e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:48:55,839] [INFO] [timer.py:260:stop] epoch=4/micro_step=18/global_step=90, RunningAvgSamplesPerSec=6.72308755096596, CurrSamplesPerSec=15.258959684074387, MemAllocated=8.24GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 1.57s, TFLOPs: 35.46, Samples/sec: 5.08, Time/seq 0.20s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 5/10 *****
+ Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.002268671989441, loss: 0.002266054507344961
+ Beginning of Epoch 6/10, Total Micro Batches 18
+ Model Parameters: 6.927 B, Latency: 3.83s, TFLOPs: 14.56, Samples/sec: 2.09, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.85s, TFLOPs: 14.49, Samples/sec: 2.08, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.18, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.18, Samples/sec: 2.18, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:49:35,362] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=0, lr=[3.987147542757061e-06, 0.00020658795558326743, 3.987147542757061e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:49:35,363] [INFO] [timer.py:260:stop] epoch=5/micro_step=10/global_step=100, RunningAvgSamplesPerSec=6.69592354348812, CurrSamplesPerSec=6.516857819843849, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 1.69s, TFLOPs: 33.09, Samples/sec: 4.74, Time/seq 0.21s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 6/10 *****
+ Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.002125859260559, loss: 0.0021236357279121876
+ Beginning of Epoch 7/10, Total Micro Batches 18
+ Model Parameters: 6.927 B, Latency: 3.83s, TFLOPs: 14.57, Samples/sec: 2.09, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:50:12,933] [INFO] [logging.py:96:log_dist] [Rank 0] step=110, skipped=0, lr=[3.174752808453649e-06, 0.00016449496416858284, 3.174752808453649e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:50:12,933] [INFO] [timer.py:260:stop] epoch=6/micro_step=2/global_step=110, RunningAvgSamplesPerSec=6.707642153640435, CurrSamplesPerSec=6.182524933965045, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.88s, TFLOPs: 14.37, Samples/sec: 2.06, Time/seq 0.49s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:50:49,797] [INFO] [logging.py:96:log_dist] [Rank 0] step=120, skipped=0, lr=[2.4125000000000015e-06, 0.00012500000000000006, 2.4125000000000015e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:50:49,797] [INFO] [timer.py:260:stop] epoch=6/micro_step=12/global_step=120, RunningAvgSamplesPerSec=6.691004676878778, CurrSamplesPerSec=6.515100253599061, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.17, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.12, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 1.59s, TFLOPs: 35.12, Samples/sec: 5.03, Time/seq 0.20s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 7/10 *****
+ Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.0021148920059204, loss: 0.0021126093342900276
+ Beginning of Epoch 8/10, Total Micro Batches 18
+ Model Parameters: 6.927 B, Latency: 3.82s, TFLOPs: 14.60, Samples/sec: 2.09, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.85s, TFLOPs: 14.51, Samples/sec: 2.08, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:51:27,245] [INFO] [logging.py:96:log_dist] [Rank 0] step=130, skipped=0, lr=[1.7235497832624478e-06, 8.930309757836516e-05, 1.7235497832624478e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:51:27,246] [INFO] [timer.py:260:stop] epoch=7/micro_step=4/global_step=130, RunningAvgSamplesPerSec=6.703016297794747, CurrSamplesPerSec=6.513660995687833, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.12, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.13, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.12, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.18, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:52:04,119] [INFO] [logging.py:96:log_dist] [Rank 0] step=140, skipped=0, lr=[1.1288355619509317e-06, 5.848888922025553e-05, 1.1288355619509317e-06], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:52:04,120] [INFO] [timer.py:260:stop] epoch=7/micro_step=14/global_step=140, RunningAvgSamplesPerSec=6.689002559054329, CurrSamplesPerSec=6.521014034361496, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 1.57s, TFLOPs: 35.54, Samples/sec: 5.09, Time/seq 0.20s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 8/10 *****
  Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.0020469427108765, loss: 0.0020447911228984594
+ Beginning of Epoch 9/10, Total Micro Batches 18
+ Model Parameters: 6.927 B, Latency: 3.83s, TFLOPs: 14.58, Samples/sec: 2.09, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.86s, TFLOPs: 14.45, Samples/sec: 2.07, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:52:41,563] [INFO] [logging.py:96:log_dist] [Rank 0] step=150, skipped=0, lr=[6.464274267400833e-07, 3.3493649053890325e-05, 6.464274267400833e-07], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:52:41,564] [INFO] [timer.py:260:stop] epoch=8/micro_step=6/global_step=150, RunningAvgSamplesPerSec=6.699587482336564, CurrSamplesPerSec=6.5183191731058265, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.13, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.70s, TFLOPs: 15.10, Samples/sec: 2.16, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.12, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:53:18,461] [INFO] [logging.py:96:log_dist] [Rank 0] step=160, skipped=0, lr=[2.909831047079924e-07, 1.5076844803522921e-05, 2.909831047079924e-07], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:53:18,461] [INFO] [timer.py:260:stop] epoch=8/micro_step=16/global_step=160, RunningAvgSamplesPerSec=6.6873052348933255, CurrSamplesPerSec=6.516400938574413, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 1.61s, TFLOPs: 34.78, Samples/sec: 4.98, Time/seq 0.20s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 9/10 *****
  Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.0019946098327637, loss: 0.001992669189348817
+ Beginning of Epoch 10/10, Total Micro Batches 18
+ Model Parameters: 6.927 B, Latency: 3.83s, TFLOPs: 14.58, Samples/sec: 2.09, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.86s, TFLOPs: 14.45, Samples/sec: 2.07, Time/seq 0.48s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.16, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:53:55,945] [INFO] [logging.py:96:log_dist] [Rank 0] step=170, skipped=0, lr=[7.330259171609631e-08, 3.798061746947995e-06, 7.330259171609631e-08], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:53:55,945] [INFO] [timer.py:260:stop] epoch=9/micro_step=8/global_step=170, RunningAvgSamplesPerSec=6.696277604142081, CurrSamplesPerSec=6.512733867002891, MemAllocated=8.61GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.68s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.14, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ Model Parameters: 6.927 B, Latency: 3.69s, TFLOPs: 15.15, Samples/sec: 2.17, Time/seq 0.46s, Batch Size: 8, Sequence Length: 512
+ [2023-12-11 10:54:30,744] [INFO] [logging.py:96:log_dist] [Rank 0] step=180, skipped=0, lr=[0.0, 0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95), (0.9, 0.95)]
+ [2023-12-11 10:54:30,744] [INFO] [timer.py:260:stop] epoch=9/micro_step=18/global_step=180, RunningAvgSamplesPerSec=6.707583967509366, CurrSamplesPerSec=14.954943653382319, MemAllocated=8.24GB, MaxMemAllocated=14.23GB
+ Model Parameters: 6.927 B, Latency: 1.61s, TFLOPs: 34.75, Samples/sec: 4.98, Time/seq 0.20s, Batch Size: 8, Sequence Length: 512
+ ***** Evaluating perplexity, Epoch 10/10 *****
  Invalidate trace cache @ step 0: expected module 0, but got module 6
+ ppl: 1.0020090341567993, loss: 0.002007057424634695
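Collected from the evaluation lines above, the per-epoch perplexities show the model essentially converging within the first two epochs and plateauing near 1.002 thereafter (a summary sketch of values already printed in this log):

# Perplexity after each evaluation, copied from the lines above.
ppl_by_epoch = {
    0: 4.454780578613281,   # before any training steps
    1: 1.1439684629440308,
    2: 1.0047215223312378,
    3: 1.0025222301483154,
    4: 1.0022900104522705,
    5: 1.002268671989441,
    6: 1.002125859260559,
    7: 1.0021148920059204,
    8: 1.0020469427108765,
    9: 1.0019946098327637,
    10: 1.0020090341567993,
}
best_epoch = min(ppl_by_epoch, key=ppl_by_epoch.get)
print(best_epoch, ppl_by_epoch[best_epoch])  # epoch 9 is the minimum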
  saving the final model ...
+ [2023-12-11 10:54:47,172] [INFO] [launch.py:347:main] Process 1648663 exits successfully.
+ [2023-12-11 10:54:48,665] [INFO] [launch.py:347:main] Process 1648664 exits successfully.
+ [2023-12-11 10:56:50,679] [INFO] [launch.py:347:main] Process 1648662 exits successfully.