almersawi/llama-13b-chat-ft-test


almersawi committed on Apr 21, 2024

Commit c97ed50 (verified) · 1 Parent(s): 7422540

Upload logs


Files changed (1)

logs.log +51 -51


@@ -1,53 +1,53 @@
-2024-04-21 14:56:41,306 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total: 4 local rank: 0.
-2024-04-21 14:56:41,306 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 2, total: 4 local rank: 2.
-2024-04-21 14:56:41,306 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 1, total: 4 local rank: 1.
-2024-04-21 14:56:41,307 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 3, total: 4 local rank: 3.
-2024-04-21 14:56:42,018 - INFO: Problem Type: text_causal_language_modeling
-2024-04-21 14:56:42,018 - INFO: Global random seed: 537049
-2024-04-21 14:56:42,018 - INFO: Preparing the data...
-2024-04-21 14:56:42,018 - INFO: Setting up automatic validation split...
-2024-04-21 14:56:42,044 - INFO: Preparing train and validation data
-2024-04-21 14:56:42,044 - INFO: Loading train dataset...
-2024-04-21 14:56:42,794 - INFO: Stop token ids: []
-2024-04-21 14:56:42,800 - INFO: Loading validation dataset...
-2024-04-21 14:56:43,317 - INFO: Stop token ids: []
-2024-04-21 14:56:43,322 - INFO: Number of observations in train dataset: 495
-2024-04-21 14:56:43,322 - INFO: Number of observations in validation dataset: 5
-2024-04-21 14:56:43,693 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
-2024-04-21 14:56:43,693 - INFO: Setting pretraining_tp of model config to 1.
-2024-04-21 14:56:43,696 - INFO: Using bfloat16 for backbone
-2024-04-21 14:56:43,709 - INFO: Stop token ids: []
-2024-04-21 14:56:43,712 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
-2024-04-21 14:56:43,712 - INFO: Setting pretraining_tp of model config to 1.
-2024-04-21 14:56:43,714 - INFO: Using bfloat16 for backbone
-2024-04-21 14:56:43,714 - INFO: Loading meta-llama/Llama-2-13b-hf. This may take a while.
-2024-04-21 14:56:43,728 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
-2024-04-21 14:56:43,728 - INFO: Setting pretraining_tp of model config to 1.
-2024-04-21 14:56:43,731 - INFO: Using bfloat16 for backbone
-2024-04-21 14:56:43,950 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
-2024-04-21 14:56:43,950 - INFO: Setting pretraining_tp of model config to 1.
-2024-04-21 14:56:43,953 - INFO: Using bfloat16 for backbone
-2024-04-21 14:58:13,062 - INFO: Loaded meta-llama/Llama-2-13b-hf.
-2024-04-21 14:58:13,067 - INFO: Lora module names: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
-2024-04-21 14:58:14,607 - INFO: Enough space available for saving model weights.Required space: 25632.04MB, Available space: 973819.37MB.
-2024-04-21 14:58:14,610 - INFO: Enough space available for saving model weights.Required space: 25632.04MB, Available space: 973819.37MB.
-2024-04-21 14:58:14,612 - INFO: Enough space available for saving model weights.Required space: 25632.04MB, Available space: 973819.37MB.
-2024-04-21 14:58:14,612 - INFO: Enough space available for saving model weights.Required space: 25632.04MB, Available space: 973819.37MB.
-2024-04-21 14:58:14,617 - INFO: Optimizer AdamW has been provided with parameters {'weight_decay': 0.0, 'eps': 1e-08, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
-2024-04-21 14:58:14,619 - INFO: Optimizer AdamW has been provided with parameters {'eps': 1e-08, 'weight_decay': 0.0, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
-2024-04-21 14:58:14,621 - INFO: Optimizer AdamW has been provided with parameters {'weight_decay': 0.0, 'eps': 1e-08, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
-2024-04-21 14:58:14,622 - INFO: Optimizer AdamW has been provided with parameters {'weight_decay': 0.0, 'eps': 1e-08, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
-2024-04-21 14:58:17,247 - INFO: started process: 1, can_track: False, tracking_mode: TrackingMode.DURING_EPOCH
-2024-04-21 14:58:17,251 - INFO: started process: 2, can_track: False, tracking_mode: TrackingMode.DURING_EPOCH
-2024-04-21 14:58:17,255 - INFO: started process: 3, can_track: False, tracking_mode: TrackingMode.DURING_EPOCH
-2024-04-21 14:58:17,389 - INFO: Evaluation step: 61
-2024-04-21 14:58:17,409 - INFO: Evaluation step: 61
-2024-04-21 14:58:17,411 - INFO: Evaluation step: 61
-2024-04-21 14:58:18,191 - INFO: started process: 0, can_track: True, tracking_mode: TrackingMode.DURING_EPOCH
-2024-04-21 14:58:18,191 - INFO: Training Epoch: 1 / 1
-2024-04-21 14:58:18,192 - INFO: train loss:   0%|          | 0/61 [00:00<?, ?it/s]
-2024-04-21 14:58:18,266 - INFO: Evaluation step: 61
-2024-04-21 15:00:34,754 - ERROR: Exception occurred during the run:
 Traceback (most recent call last):
   File "/app/finetuning/train.py", line 1179, in <module>
     run(cfg=cfg)
@@ -69,4 +69,4 @@ Traceback (most recent call last):
     response = verify_rest_response(response, endpoint)
   File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 152, in verify_rest_response
     raise RestException(json.loads(response.text))
-mlflow.exceptions.RestException: 409: INSERT ERROR, ENTITY:mlflow_metric, Duplicate _id: c754e19e-e264-459f-bab5-acfa2f30a15e

+2024-04-21 15:07:21,870 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 1, total: 4 local rank: 1.
+2024-04-21 15:07:21,870 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 3, total: 4 local rank: 3.
+2024-04-21 15:07:21,870 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total: 4 local rank: 0.
+2024-04-21 15:07:21,870 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 2, total: 4 local rank: 2.
+2024-04-21 15:07:22,585 - INFO: Problem Type: text_causal_language_modeling
+2024-04-21 15:07:22,586 - INFO: Global random seed: 879809
+2024-04-21 15:07:22,586 - INFO: Preparing the data...
+2024-04-21 15:07:22,586 - INFO: Setting up automatic validation split...
+2024-04-21 15:07:22,613 - INFO: Preparing train and validation data
+2024-04-21 15:07:22,613 - INFO: Loading train dataset...
+2024-04-21 15:07:23,453 - INFO: Stop token ids: []
+2024-04-21 15:07:23,459 - INFO: Loading validation dataset...
+2024-04-21 15:07:23,933 - INFO: Stop token ids: []
+2024-04-21 15:07:23,937 - INFO: Number of observations in train dataset: 495
+2024-04-21 15:07:23,937 - INFO: Number of observations in validation dataset: 5
+2024-04-21 15:07:24,567 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
+2024-04-21 15:07:24,568 - INFO: Setting pretraining_tp of model config to 1.
+2024-04-21 15:07:24,570 - INFO: Using bfloat16 for backbone
+2024-04-21 15:07:24,576 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
+2024-04-21 15:07:24,576 - INFO: Setting pretraining_tp of model config to 1.
+2024-04-21 15:07:24,579 - INFO: Using bfloat16 for backbone
+2024-04-21 15:07:24,614 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
+2024-04-21 15:07:24,614 - INFO: Setting pretraining_tp of model config to 1.
+2024-04-21 15:07:24,617 - INFO: Using bfloat16 for backbone
+2024-04-21 15:07:24,657 - INFO: Stop token ids: []
+2024-04-21 15:07:24,659 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
+2024-04-21 15:07:24,660 - INFO: Setting pretraining_tp of model config to 1.
+2024-04-21 15:07:24,662 - INFO: Using bfloat16 for backbone
+2024-04-21 15:07:24,662 - INFO: Loading meta-llama/Llama-2-13b-hf. This may take a while.
+2024-04-21 15:14:07,270 - INFO: Loaded meta-llama/Llama-2-13b-hf.
+2024-04-21 15:14:07,274 - INFO: Lora module names: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
+2024-04-21 15:14:09,060 - INFO: Enough space available for saving model weights.Required space: 25632.04MB, Available space: 973919.94MB.
+2024-04-21 15:14:09,062 - INFO: Enough space available for saving model weights.Required space: 25632.04MB, Available space: 973919.94MB.
+2024-04-21 15:14:09,064 - INFO: Enough space available for saving model weights.Required space: 25632.04MB, Available space: 973919.94MB.
+2024-04-21 15:14:09,064 - INFO: Enough space available for saving model weights.Required space: 25632.04MB, Available space: 973919.94MB.
+2024-04-21 15:14:09,070 - INFO: Optimizer AdamW has been provided with parameters {'eps': 1e-08, 'weight_decay': 0.0, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
+2024-04-21 15:14:09,072 - INFO: Optimizer AdamW has been provided with parameters {'weight_decay': 0.0, 'eps': 1e-08, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
+2024-04-21 15:14:09,074 - INFO: Optimizer AdamW has been provided with parameters {'eps': 1e-08, 'weight_decay': 0.0, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
+2024-04-21 15:14:09,074 - INFO: Optimizer AdamW has been provided with parameters {'weight_decay': 0.0, 'eps': 1e-08, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
+2024-04-21 15:14:11,746 - INFO: started process: 2, can_track: False, tracking_mode: TrackingMode.DURING_EPOCH
+2024-04-21 15:14:11,752 - INFO: started process: 3, can_track: False, tracking_mode: TrackingMode.DURING_EPOCH
+2024-04-21 15:14:11,753 - INFO: started process: 1, can_track: False, tracking_mode: TrackingMode.DURING_EPOCH
+2024-04-21 15:14:11,890 - INFO: Evaluation step: 61
+2024-04-21 15:14:11,902 - INFO: Evaluation step: 61
+2024-04-21 15:14:11,902 - INFO: Evaluation step: 61
+2024-04-21 15:14:12,694 - INFO: started process: 0, can_track: True, tracking_mode: TrackingMode.DURING_EPOCH
+2024-04-21 15:14:12,694 - INFO: Training Epoch: 1 / 1
+2024-04-21 15:14:12,695 - INFO: train loss:   0%|          | 0/61 [00:00<?, ?it/s]
+2024-04-21 15:14:12,765 - INFO: Evaluation step: 61
+2024-04-21 15:16:27,176 - ERROR: Exception occurred during the run:
 Traceback (most recent call last):
   File "/app/finetuning/train.py", line 1179, in <module>
     run(cfg=cfg)
     response = verify_rest_response(response, endpoint)
   File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 152, in verify_rest_response
     raise RestException(json.loads(response.text))
+mlflow.exceptions.RestException: 409: INSERT ERROR, ENTITY:mlflow_metric, Duplicate _id: ceb1b9c6-ddd4-4ac3-9db0-564af323463b