2024-04-21 15:07:21,870 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 1, total: 4 local rank: 1.
2024-04-21 15:07:21,870 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 3, total: 4 local rank: 3.
2024-04-21 15:07:21,870 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total: 4 local rank: 0.
2024-04-21 15:07:21,870 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 2, total: 4 local rank: 2.
2024-04-21 15:07:22,585 - INFO: Problem Type: text_causal_language_modeling
2024-04-21 15:07:22,586 - INFO: Global random seed: 879809
2024-04-21 15:07:22,586 - INFO: Preparing the data...
2024-04-21 15:07:22,586 - INFO: Setting up automatic validation split...
2024-04-21 15:07:22,613 - INFO: Preparing train and validation data
2024-04-21 15:07:22,613 - INFO: Loading train dataset...
2024-04-21 15:07:23,453 - INFO: Stop token ids: []
2024-04-21 15:07:23,459 - INFO: Loading validation dataset...
2024-04-21 15:07:23,933 - INFO: Stop token ids: []
2024-04-21 15:07:23,937 - INFO: Number of observations in train dataset: 495
2024-04-21 15:07:23,937 - INFO: Number of observations in validation dataset: 5
2024-04-21 15:07:24,567 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
2024-04-21 15:07:24,568 - INFO: Setting pretraining_tp of model config to 1.
2024-04-21 15:07:24,570 - INFO: Using bfloat16 for backbone
2024-04-21 15:07:24,576 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
2024-04-21 15:07:24,576 - INFO: Setting pretraining_tp of model config to 1.
2024-04-21 15:07:24,579 - INFO: Using bfloat16 for backbone
2024-04-21 15:07:24,614 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
2024-04-21 15:07:24,614 - INFO: Setting pretraining_tp of model config to 1.
2024-04-21 15:07:24,617 - INFO: Using bfloat16 for backbone
2024-04-21 15:07:24,657 - INFO: Stop token ids: []
2024-04-21 15:07:24,659 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
2024-04-21 15:07:24,660 - INFO: Setting pretraining_tp of model config to 1.
2024-04-21 15:07:24,662 - INFO: Using bfloat16 for backbone
2024-04-21 15:07:24,662 - INFO: Loading meta-llama/Llama-2-13b-hf. This may take a while.
2024-04-21 15:14:07,270 - INFO: Loaded meta-llama/Llama-2-13b-hf.
2024-04-21 15:14:07,274 - INFO: Lora module names: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
2024-04-21 15:14:09,060 - INFO: Enough space available for saving model weights. Required space: 25632.04MB, Available space: 973919.94MB.
2024-04-21 15:14:09,062 - INFO: Enough space available for saving model weights. Required space: 25632.04MB, Available space: 973919.94MB.
2024-04-21 15:14:09,064 - INFO: Enough space available for saving model weights. Required space: 25632.04MB, Available space: 973919.94MB.
2024-04-21 15:14:09,064 - INFO: Enough space available for saving model weights. Required space: 25632.04MB, Available space: 973919.94MB.
2024-04-21 15:14:09,070 - INFO: Optimizer AdamW has been provided with parameters {'eps': 1e-08, 'weight_decay': 0.0, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
2024-04-21 15:14:09,072 - INFO: Optimizer AdamW has been provided with parameters {'weight_decay': 0.0, 'eps': 1e-08, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
2024-04-21 15:14:09,074 - INFO: Optimizer AdamW has been provided with parameters {'eps': 1e-08, 'weight_decay': 0.0, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
2024-04-21 15:14:09,074 - INFO: Optimizer AdamW has been provided with parameters {'weight_decay': 0.0, 'eps': 1e-08, 'betas': (0.8999999762, 0.9990000129), 'lr': 0.0001}
2024-04-21 15:14:11,746 - INFO: started process: 2, can_track: False, tracking_mode: TrackingMode.DURING_EPOCH
2024-04-21 15:14:11,752 - INFO: started process: 3, can_track: False, tracking_mode: TrackingMode.DURING_EPOCH
2024-04-21 15:14:11,753 - INFO: started process: 1, can_track: False, tracking_mode: TrackingMode.DURING_EPOCH
2024-04-21 15:14:11,890 - INFO: Evaluation step: 61
2024-04-21 15:14:11,902 - INFO: Evaluation step: 61
2024-04-21 15:14:11,902 - INFO: Evaluation step: 61
2024-04-21 15:14:12,694 - INFO: started process: 0, can_track: True, tracking_mode: TrackingMode.DURING_EPOCH
2024-04-21 15:14:12,694 - INFO: Training Epoch: 1 / 1
2024-04-21 15:14:12,695 - INFO: train loss: 0%| | 0/61 [00:00<?, ?it/s]
Traceback (most recent call last):
  ...
    run(cfg=cfg)
  File "/app/finetuning/train.py", line 1082, in run
    train_function(
  File "/app/finetuning/train.py", line 373, in run_train
    TrackingClient.log_metric(
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/fluent.py", line 809, in log_metric
    return MlflowClient().log_metric(
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/client.py", line 803, in log_metric
    return self._tracking_client.log_metric(
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/_tracking_service/client.py", line 317, in log_metric
    self.store.log_metric(run_id, metric)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 208, in log_metric
    self._call_endpoint(LogMetric, req_body)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 60, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 220, in call_endpoint
    response = verify_rest_response(response, endpoint)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 152, in verify_rest_response
    raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: 409: INSERT ERROR, ENTITY:mlflow_metric, Duplicate _id: ceb1b9c6-ddd4-4ac3-9db0-564af323463b
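The run dies when `TrackingClient.log_metric` hits the MLflow REST store and the backend rejects the insert with a 409 duplicate `_id`: with 4 processes running, more than one rank can attempt to record the same metric. A common mitigation is to gate tracking calls on the global rank so only one process talks to the tracking server. The sketch below is illustrative only, not the code from `/app/finetuning/train.py`; the helper names `is_main_process` and `log_metric_rank0` are hypothetical, and it assumes a torchrun-style launcher that sets the `RANK` environment variable.

```python
import os


def is_main_process() -> bool:
    # Hypothetical helper: torchrun/torch.distributed launchers export RANK
    # for each worker; rank 0 is conventionally the only tracking process.
    return int(os.environ.get("RANK", "0")) == 0


def log_metric_rank0(log_fn, key, value, step=None):
    # Hypothetical wrapper: forward the metric to the tracking client's
    # log_metric callable only on rank 0, so the other ranks never race
    # to insert the same metric row into the tracking backend.
    if is_main_process():
        log_fn(key, value, step=step)


# Demonstration with a stand-in "tracking client" that just records calls:
calls = []
os.environ["RANK"] = "1"
log_metric_rank0(lambda k, v, step=None: calls.append((k, v)), "train/loss", 0.5)
os.environ["RANK"] = "0"
log_metric_rank0(lambda k, v, step=None: calls.append((k, v)), "train/loss", 0.5)
```

After both calls, only the rank-0 invocation reaches the logger, which is the behavior hinted at by the `can_track: True` flag that this log shows for process 0 only.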