almersawi committed on
Commit c97ed50 · verified · 1 Parent(s): 7422540

Upload logs

Files changed (1):
  1. logs.log +51 -51
logs.log CHANGED
@@ -1,53 +1,53 @@
- 2024-04-21 14:56:41,306 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total: 4 local rank: 0.
- 2024-04-21 14:56:41,306 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 2, total: 4 local rank: 2.
- 2024-04-21 14:56:41,306 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 1, total: 4 local rank: 1.
- 2024-04-21 14:56:41,307 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 3, total: 4 local rank: 3.
- 2024-04-21 14:56:42,018 - INFO: Problem Type: text_causal_language_modeling
- 2024-04-21 14:56:42,018 - INFO: Global random seed: 537049
- 2024-04-21 14:56:42,018 - INFO: Preparing the data...
- 2024-04-21 14:56:42,018 - INFO: Setting up automatic validation split...
- 2024-04-21 14:56:42,044 - INFO: Preparing train and validation data
- 2024-04-21 14:56:42,044 - INFO: Loading train dataset...
- 2024-04-21 14:56:42,794 - INFO: Stop token ids: []
- 2024-04-21 14:56:42,800 - INFO: Loading validation dataset...
- 2024-04-21 14:56:43,317 - INFO: Stop token ids: []
- 2024-04-21 14:56:43,322 - INFO: Number of observations in train dataset: 495
- 2024-04-21 14:56:43,322 - INFO: Number of observations in validation dataset: 5
- 2024-04-21 14:56:43,693 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
- 2024-04-21 14:56:43,693 - INFO: Setting pretraining_tp of model config to 1.
- 2024-04-21 14:56:43,696 - INFO: Using bfloat16 for backbone
- 2024-04-21 14:56:43,709 - INFO: Stop token ids: []
- 2024-04-21 14:56:43,712 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
- 2024-04-21 14:56:43,712 - INFO: Setting pretraining_tp of model config to 1.
- 2024-04-21 14:56:43,714 - INFO: Using bfloat16 for backbone
- 2024-04-21 14:56:43,714 - INFO: Loading meta-llama/Llama-2-13b-hf. This may take a while.
- 2024-04-21 14:56:43,728 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
- 2024-04-21 14:56:43,728 - INFO: Setting pretraining_tp of model config to 1.
- 2024-04-21 14:56:43,731 - INFO: Using bfloat16 for backbone
- 2024-04-21 14:56:43,950 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
- 2024-04-21 14:56:43,950 - INFO: Setting pretraining_tp of model config to 1.
- 2024-04-21 14:56:43,953 - INFO: Using bfloat16 for backbone
- 2024-04-21 14:58:13,062 - INFO: Loaded meta-llama/Llama-2-13b-hf.
- 2024-04-21 14:58:13,067 - INFO: Lora module names: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
- 2024-04-21 14:58:14,607 - INFO: Enough space available for saving model weights.Required space: 25632.04MB, Available space: 973819.37MB.
- 2024-04-21 14:58:14,610 - INFO: Enough space available for saving model weights.Required space: 25632.04MB, Available space: 973819.37MB.
- 2024-04-21 14:58:14,612 - INFO: Enough space available for saving model weights.Required space: 25632.04MB, Available space: 973819.37MB.
- 2024-04-21 14:58:14,612 - INFO: Enough space available for saving model weights.Required space: 25632.04MB, Available space: 973819.37MB.
- 2024-04-21 14:58:14,617 - INFO: Optimizer AdamW has been provided with parameters {'weight_decay': 0.0, 'eps': 1e-08, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
- 2024-04-21 14:58:14,619 - INFO: Optimizer AdamW has been provided with parameters {'eps': 1e-08, 'weight_decay': 0.0, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
- 2024-04-21 14:58:14,621 - INFO: Optimizer AdamW has been provided with parameters {'weight_decay': 0.0, 'eps': 1e-08, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
- 2024-04-21 14:58:14,622 - INFO: Optimizer AdamW has been provided with parameters {'weight_decay': 0.0, 'eps': 1e-08, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
- 2024-04-21 14:58:17,247 - INFO: started process: 1, can_track: False, tracking_mode: TrackingMode.DURING_EPOCH
- 2024-04-21 14:58:17,251 - INFO: started process: 2, can_track: False, tracking_mode: TrackingMode.DURING_EPOCH
- 2024-04-21 14:58:17,255 - INFO: started process: 3, can_track: False, tracking_mode: TrackingMode.DURING_EPOCH
- 2024-04-21 14:58:17,389 - INFO: Evaluation step: 61
- 2024-04-21 14:58:17,409 - INFO: Evaluation step: 61
- 2024-04-21 14:58:17,411 - INFO: Evaluation step: 61
- 2024-04-21 14:58:18,191 - INFO: started process: 0, can_track: True, tracking_mode: TrackingMode.DURING_EPOCH
- 2024-04-21 14:58:18,191 - INFO: Training Epoch: 1 / 1
- 2024-04-21 14:58:18,192 - INFO: train loss: 0%| | 0/61 [00:00<?, ?it/s]
- 2024-04-21 14:58:18,266 - INFO: Evaluation step: 61
- 2024-04-21 15:00:34,754 - ERROR: Exception occurred during the run:
  Traceback (most recent call last):
  File "/app/finetuning/train.py", line 1179, in <module>
  run(cfg=cfg)
@@ -69,4 +69,4 @@ Traceback (most recent call last):
  response = verify_rest_response(response, endpoint)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 152, in verify_rest_response
  raise RestException(json.loads(response.text))
- mlflow.exceptions.RestException: 409: INSERT ERROR, ENTITY:mlflow_metric, Duplicate _id: c754e19e-e264-459f-bab5-acfa2f30a15e
 
+ 2024-04-21 15:07:21,870 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 1, total: 4 local rank: 1.
+ 2024-04-21 15:07:21,870 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 3, total: 4 local rank: 3.
+ 2024-04-21 15:07:21,870 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total: 4 local rank: 0.
+ 2024-04-21 15:07:21,870 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 2, total: 4 local rank: 2.
+ 2024-04-21 15:07:22,585 - INFO: Problem Type: text_causal_language_modeling
+ 2024-04-21 15:07:22,586 - INFO: Global random seed: 879809
+ 2024-04-21 15:07:22,586 - INFO: Preparing the data...
+ 2024-04-21 15:07:22,586 - INFO: Setting up automatic validation split...
+ 2024-04-21 15:07:22,613 - INFO: Preparing train and validation data
+ 2024-04-21 15:07:22,613 - INFO: Loading train dataset...
+ 2024-04-21 15:07:23,453 - INFO: Stop token ids: []
+ 2024-04-21 15:07:23,459 - INFO: Loading validation dataset...
+ 2024-04-21 15:07:23,933 - INFO: Stop token ids: []
+ 2024-04-21 15:07:23,937 - INFO: Number of observations in train dataset: 495
+ 2024-04-21 15:07:23,937 - INFO: Number of observations in validation dataset: 5
+ 2024-04-21 15:07:24,567 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
+ 2024-04-21 15:07:24,568 - INFO: Setting pretraining_tp of model config to 1.
+ 2024-04-21 15:07:24,570 - INFO: Using bfloat16 for backbone
+ 2024-04-21 15:07:24,576 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
+ 2024-04-21 15:07:24,576 - INFO: Setting pretraining_tp of model config to 1.
+ 2024-04-21 15:07:24,579 - INFO: Using bfloat16 for backbone
+ 2024-04-21 15:07:24,614 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
+ 2024-04-21 15:07:24,614 - INFO: Setting pretraining_tp of model config to 1.
+ 2024-04-21 15:07:24,617 - INFO: Using bfloat16 for backbone
+ 2024-04-21 15:07:24,657 - INFO: Stop token ids: []
+ 2024-04-21 15:07:24,659 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
+ 2024-04-21 15:07:24,660 - INFO: Setting pretraining_tp of model config to 1.
+ 2024-04-21 15:07:24,662 - INFO: Using bfloat16 for backbone
+ 2024-04-21 15:07:24,662 - INFO: Loading meta-llama/Llama-2-13b-hf. This may take a while.
+ 2024-04-21 15:14:07,270 - INFO: Loaded meta-llama/Llama-2-13b-hf.
+ 2024-04-21 15:14:07,274 - INFO: Lora module names: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
+ 2024-04-21 15:14:09,060 - INFO: Enough space available for saving model weights.Required space: 25632.04MB, Available space: 973919.94MB.
+ 2024-04-21 15:14:09,062 - INFO: Enough space available for saving model weights.Required space: 25632.04MB, Available space: 973919.94MB.
+ 2024-04-21 15:14:09,064 - INFO: Enough space available for saving model weights.Required space: 25632.04MB, Available space: 973919.94MB.
+ 2024-04-21 15:14:09,064 - INFO: Enough space available for saving model weights.Required space: 25632.04MB, Available space: 973919.94MB.
+ 2024-04-21 15:14:09,070 - INFO: Optimizer AdamW has been provided with parameters {'eps': 1e-08, 'weight_decay': 0.0, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
+ 2024-04-21 15:14:09,072 - INFO: Optimizer AdamW has been provided with parameters {'weight_decay': 0.0, 'eps': 1e-08, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
+ 2024-04-21 15:14:09,074 - INFO: Optimizer AdamW has been provided with parameters {'eps': 1e-08, 'weight_decay': 0.0, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
+ 2024-04-21 15:14:09,074 - INFO: Optimizer AdamW has been provided with parameters {'weight_decay': 0.0, 'eps': 1e-08, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
+ 2024-04-21 15:14:11,746 - INFO: started process: 2, can_track: False, tracking_mode: TrackingMode.DURING_EPOCH
+ 2024-04-21 15:14:11,752 - INFO: started process: 3, can_track: False, tracking_mode: TrackingMode.DURING_EPOCH
+ 2024-04-21 15:14:11,753 - INFO: started process: 1, can_track: False, tracking_mode: TrackingMode.DURING_EPOCH
+ 2024-04-21 15:14:11,890 - INFO: Evaluation step: 61
+ 2024-04-21 15:14:11,902 - INFO: Evaluation step: 61
+ 2024-04-21 15:14:11,902 - INFO: Evaluation step: 61
+ 2024-04-21 15:14:12,694 - INFO: started process: 0, can_track: True, tracking_mode: TrackingMode.DURING_EPOCH
+ 2024-04-21 15:14:12,694 - INFO: Training Epoch: 1 / 1
+ 2024-04-21 15:14:12,695 - INFO: train loss: 0%| | 0/61 [00:00<?, ?it/s]
+ 2024-04-21 15:14:12,765 - INFO: Evaluation step: 61
+ 2024-04-21 15:16:27,176 - ERROR: Exception occurred during the run:
  Traceback (most recent call last):
  File "/app/finetuning/train.py", line 1179, in <module>
  run(cfg=cfg)
 
  response = verify_rest_response(response, endpoint)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 152, in verify_rest_response
  raise RestException(json.loads(response.text))
+ mlflow.exceptions.RestException: 409: INSERT ERROR, ENTITY:mlflow_metric, Duplicate _id: ceb1b9c6-ddd4-4ac3-9db0-564af323463b
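Both runs in the diff fail the same way: shortly after training starts, the MLflow tracking backend rejects a metric insert with a 409 duplicate-`_id` error, so retrying the run with a fresh seed at 15:07 only reproduced the crash with a different `_id`. A minimal sketch of a client-side guard is below; all names here are hypothetical (the real code would catch `mlflow.exceptions.RestException` around its metric-logging call), and this does not fix the tracking store itself, it only keeps a transient conflict from killing a long fine-tuning run.

```python
class TrackingConflictError(Exception):
    """Stand-in for a tracking-store 409 (e.g. mlflow.exceptions.RestException)."""

def safe_log_metric(log_fn, key, value, warn=print):
    """Call log_fn(key, value); report a tracking-store conflict instead of
    letting it propagate and abort the training process."""
    try:
        log_fn(key, value)
    except TrackingConflictError as exc:
        warn(f"metric '{key}' rejected by tracking store: {exc}")
        return False
    return True
```

Whether swallowing the error is appropriate depends on how critical the metric history is; an alternative is a bounded retry before giving up.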