diff --git "a/training_log_20250116_223318.txt" "b/training_log_20250116_223318.txt" new file mode 100644--- /dev/null +++ "b/training_log_20250116_223318.txt" @@ -0,0 +1,1117 @@ +[2025-01-16 22:33:22,720] torch.distributed.run: [WARNING] +[2025-01-16 22:33:22,720] torch.distributed.run: [WARNING] ***************************************** +[2025-01-16 22:33:22,720] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. +[2025-01-16 22:33:22,720] torch.distributed.run: [WARNING] ***************************************** +/cpfs02/user/zhaoxiangyu/miniconda3/envs/llava/lib/python3.10/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml + warnings.warn( +/cpfs02/user/zhaoxiangyu/miniconda3/envs/llava/lib/python3.10/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml + warnings.warn( +/cpfs02/user/zhaoxiangyu/miniconda3/envs/llava/lib/python3.10/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml + warnings.warn( +/cpfs02/user/zhaoxiangyu/miniconda3/envs/llava/lib/python3.10/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml + warnings.warn( +/cpfs02/user/zhaoxiangyu/miniconda3/envs/llava/lib/python3.10/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml + warnings.warn( +/cpfs02/user/zhaoxiangyu/miniconda3/envs/llava/lib/python3.10/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. 
Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml + warnings.warn( +/cpfs02/user/zhaoxiangyu/miniconda3/envs/llava/lib/python3.10/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml + warnings.warn( +/cpfs02/user/zhaoxiangyu/miniconda3/envs/llava/lib/python3.10/site-packages/_distutils_hack/__init__.py:54: UserWarning: Reliance on distutils from stdlib is deprecated. Users must rely on setuptools to provide the distutils module. Avoid importing distutils or import setuptools first, and avoid setting SETUPTOOLS_USE_DISTUTILS=stdlib. Register concerns at https://github.com/pypa/setuptools/issues/new?template=distutils-deprecation.yml + warnings.warn( +[2025-01-16 22:33:35,902] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) +[2025-01-16 22:33:35,902] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) +[2025-01-16 22:33:35,902] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) +[2025-01-16 22:33:35,902] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) +[2025-01-16 22:33:35,902] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) +[2025-01-16 22:33:35,902] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) +[2025-01-16 22:33:35,950] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) +[2025-01-16 22:33:36,138] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) +df: df: /root/.triton/autotune/root/.triton/autotunedf: /root/.triton/autotune: 没有那个文件或目录: 没有那个文件或目录 + +: 没有那个文件或目录 +df: /root/.triton/autotune: 没有那个文件或目录 +df: /root/.triton/autotune: 没有那个文件或目录 + [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH + [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH + [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH + [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH + [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH + [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH + [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH + [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH + [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1 + [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible + [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1 + [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible + [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1 + [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible + [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1 + 
[WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible + [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1 + [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible + [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1 + [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible + [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1 + [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible + [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1 + [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible +[2025-01-16 22:33:46,812] [INFO] [comm.py:637:init_distributed] cdb=None +[2025-01-16 22:33:46,813] [INFO] [comm.py:637:init_distributed] cdb=None +[2025-01-16 22:33:46,813] [INFO] [comm.py:637:init_distributed] cdb=None +[2025-01-16 22:33:46,813] [INFO] [comm.py:637:init_distributed] cdb=None +[2025-01-16 22:33:46,813] [INFO] [comm.py:637:init_distributed] cdb=None +[2025-01-16 22:33:46,813] [INFO] [comm.py:637:init_distributed] cdb=None +[2025-01-16 22:33:46,813] [INFO] [comm.py:637:init_distributed] cdb=None +[2025-01-16 22:33:46,813] [INFO] [comm.py:637:init_distributed] cdb=None +[W socket.cpp:663] [c10d] The IPv6 network addresses of (dlc1bfcxl3sf01eb-master-0, 23456) cannot be retrieved (gai error: -2 - Name or service not known). +[W socket.cpp:663] [c10d] The IPv6 network addresses of (dlc1bfcxl3sf01eb-master-0, 23456) cannot be retrieved (gai error: -2 - Name or service not known). +[W socket.cpp:663] [c10d] The IPv6 network addresses of (dlc1bfcxl3sf01eb-master-0, 23456) cannot be retrieved (gai error: -2 - Name or service not known). 
+01/16/2025 22:33:46 - WARNING - llava.train.train - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
+01/16/2025 22:33:46 - INFO - llava.train.train - Training/evaluation parameters TrainingArguments(
+_n_gpu=1,
+adafactor=False,
+adam_beta1=0.9,
+adam_beta2=0.999,
+adam_epsilon=1e-08,
+auto_find_batch_size=False,
+bf16=True,
+bf16_full_eval=False,
+bits=16,
+cache_dir=None,
+data_seed=None,
+dataloader_drop_last=False,
+dataloader_num_workers=4,
+dataloader_persistent_workers=False,
+dataloader_pin_memory=True,
+ddp_backend=None,
+ddp_broadcast_buffers=None,
+ddp_bucket_cap_mb=None,
+ddp_find_unused_parameters=None,
+ddp_timeout=1800,
+debug=[],
+deepspeed=./scripts/zero3.json,
+disable_tqdm=False,
+dispatch_batches=None,
+do_eval=False,
+do_predict=False,
+do_train=False,
+double_quant=True,
+eval_accumulation_steps=None,
+eval_delay=0,
+eval_steps=None,
+evaluation_strategy=no,
+fp16=False,
+fp16_backend=auto,
+fp16_full_eval=False,
+fp16_opt_level=O1,
+freeze_mm_mlp_adapter=False,
+fsdp=[],
+fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
+fsdp_min_num_params=0,
+fsdp_transformer_layer_cls_to_wrap=None,
+full_determinism=False,
+gradient_accumulation_steps=2,
+gradient_checkpointing=True,
+gradient_checkpointing_kwargs=None,
+greater_is_better=None,
+group_by_length=False,
+group_by_modality_length=True,
+half_precision_backend=auto,
+hub_always_push=False,
+hub_model_id=None,
+hub_private_repo=False,
+hub_strategy=every_save,
+hub_token=,
+ignore_data_skip=False,
+include_inputs_for_metrics=False,
+include_num_input_tokens_seen=False,
+include_tokens_per_second=False,
+jit_mode_eval=False,
+label_names=None,
+label_smoothing_factor=0.0,
+learning_rate=2e-05,
+length_column_name=length,
+load_best_model_at_end=False,
+local_rank=0,
+log_level=passive,
+log_level_replica=warning,
+log_on_each_node=True,
+logging_dir=./checkpoints/llavaAR4-internlm2_5-7b-sft-llavanext-notext-kn-infpolishmd-detail-knins40k-creationme10kfixed-chart11kmerge-tqa8k-info28kgpt/runs/Jan16_22-33-46_dlc1bfcxl3sf01eb-worker-0,
+logging_first_step=False,
+logging_nan_inf_filter=True,
+logging_steps=1.0,
+logging_strategy=steps,
+lora_alpha=16,
+lora_bias=none,
+lora_dropout=0.05,
+lora_enable=False,
+lora_r=64,
+lora_weight_path=,
+lr_scheduler_kwargs={},
+lr_scheduler_type=cosine,
+max_grad_norm=1.0,
+max_steps=-1,
+metric_for_best_model=None,
+mm_projector_lr=None,
+mm_vision_tower_lr=2e-06,
+model_max_length=32768,
+mp_parameters=,
+mpt_attn_impl=triton,
+neftune_noise_alpha=None,
+no_cuda=False,
+num_train_epochs=1.0,
+optim=adamw_torch,
+optim_args=None,
+output_dir=./checkpoints/llavaAR4-internlm2_5-7b-sft-llavanext-notext-kn-infpolishmd-detail-knins40k-creationme10kfixed-chart11kmerge-tqa8k-info28kgpt,
+overwrite_output_dir=False,
+past_index=-1,
+per_device_eval_batch_size=4,
+per_device_train_batch_size=4,
+prediction_loss_only=False,
+push_to_hub=False,
+push_to_hub_model_id=None,
+push_to_hub_organization=None,
+push_to_hub_token=,
+quant_type=nf4,
+ray_scope=last,
+remove_unused_columns=False,
+report_to=['wandb'],
+resume_from_checkpoint=None,
+run_name=llavaAR4-internlm2_5-7b-sft-llavanext-notext-kn-infpolishmd-detail-knins40k-creationme10kfixed-chart11kmerge-tqa8k-info28kgpt,
+save_on_each_node=False,
+save_only_model=False,
+save_safetensors=True,
+save_steps=10000,
+save_strategy=steps,
+save_total_limit=1,
+seed=42,
+skip_memory_metrics=True,
+split_batches=False,
+tf32=True,
+torch_compile=False,
+torch_compile_backend=None,
+torch_compile_mode=None,
+torchdynamo=None,
+tpu_metrics_debug=False,
+tpu_num_cores=None,
+use_cpu=False,
+use_ipex=False,
+use_legacy_prediction_loop=False,
+use_mps_device=False,
+warmup_ratio=0.03,
+warmup_steps=0,
+weight_decay=0.0,
+)
+01/16/2025 22:33:46 - INFO - llava.train.train - Training/evaluation parameters DataArguments(data_path=None, meta_path='playground/meta_json/llavanext_sample/llava_next_notext_inf37kpolishmd_de35k_know40k_knins40k_creation10kfixed_chart11kmerge_tqa8k_info28k_gpt.json', lazy_preprocess=True, is_multimodal=False, image_folder=None, image_aspect_ratio='anyres', image_grid_pinpoints='[(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)]', image_crop_resolution=None, image_split_resolution=None, use_data_resampling=False)
+[INFO|configuration_utils.py:727] 2025-01-16 22:33:46,829 >> loading configuration file models/internlm/internlm2_5-7b-chat/config.json
+[INFO|configuration_utils.py:727] 2025-01-16 22:33:46,851 >> loading configuration file models/internlm/internlm2_5-7b-chat/config.json
+[INFO|configuration_utils.py:792] 2025-01-16 22:33:46,851 >> Model config InternLM2Config {
+  "_name_or_path": "models/internlm/internlm2_5-7b-chat",
+  "architectures": [
+    "InternLM2ForCausalLM"
+  ],
+  "attn_implementation": "eager",
+  "auto_map": {
+    "AutoConfig": "configuration_internlm2.InternLM2Config",
+    "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
+    "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
+  },
+  "bias": false,
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "hidden_act": "silu",
+  "hidden_size": 4096,
+  "initializer_range": 0.02,
+  "intermediate_size": 14336,
+  "max_position_embeddings": 32768,
+  "model_type": "internlm2",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 32,
+  "num_key_value_heads": 8,
+  "pad_token_id": 2,
+  "pretraining_tp": 1,
+  "rms_norm_eps": 1e-05,
+  "rope_scaling": {
+    "factor": 2.0,
+    "type": "dynamic"
+  },
+  "rope_theta": 1000000,
+  "tie_word_embeddings": false,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.37.2",
+  "use_cache": true,
+  "vocab_size": 92544
+}
+
+[WARNING|modeling_utils.py:2918] 2025-01-16 22:33:46,855 >> The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
+[INFO|modeling_utils.py:3473] 2025-01-16 22:33:46,863 >> loading weights file models/internlm/internlm2_5-7b-chat/model.safetensors.index.json
+[INFO|modeling_utils.py:1426] 2025-01-16 22:33:46,864 >> Instantiating LlavaInternlm2ForCausalLM model under default dtype torch.bfloat16.
+[INFO|modeling_utils.py:3582] 2025-01-16 22:33:46,864 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
+[INFO|configuration_utils.py:826] 2025-01-16 22:33:46,870 >> Generate config GenerationConfig {
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "pad_token_id": 2
+}
+
+01/16/2025 22:33:48 - WARNING - llava.train.train - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False
+01/16/2025 22:33:48 - WARNING - llava.train.train - Process rank: 5, device: cuda:5, n_gpu: 1, distributed training: True, 16-bits training: False
+01/16/2025 22:33:48 - WARNING - llava.train.train - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: False
+[WARNING|modeling_utils.py:2918] 2025-01-16 22:33:48,865 >> The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
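Two details in the argument dumps above are worth making concrete. First, with per_device_train_batch_size=4, gradient_accumulation_steps=2, and the 16 ranks visible later in the NCCL lines (nranks 16), the effective global batch is 4 × 2 × 16 = 128 samples per optimizer step. Second, image_aspect_ratio='anyres' with the five image_grid_pinpoints means each training image is first matched to the candidate grid that best fits it, then cut into 336×336 tiles for the vision encoder. A minimal sketch of that selection in the spirit of LLaVA-NeXT's select_best_resolution; the function name, scoring rule, and (height, width) orientation are assumptions, not taken from this log:

```python
import ast

# Pinpoints exactly as printed in the DataArguments dump above.
PINPOINTS = ast.literal_eval('[(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)]')

def select_best_resolution(orig_w: int, orig_h: int, pinpoints=PINPOINTS):
    """Pick the grid resolution that keeps the most usable pixels while
    wasting the least padding (a LLaVA-NeXT-style heuristic)."""
    best, best_fit = None, (-1, float("inf"))
    for h, w in pinpoints:  # treated here as (height, width); the list is symmetric
        scale = min(w / orig_w, h / orig_h)            # fit image inside the grid
        eff = int(orig_w * scale) * int(orig_h * scale)  # usable pixels after resize
        waste = w * h - eff                              # padding pixels
        if (eff, -waste) > (best_fit[0], -best_fit[1]):
            best, best_fit = (h, w), (eff, waste)
    return best

print(select_best_resolution(1024, 768))  # -> (672, 672)
```

For a 1024×768 photo this picks the 672×672 pinpoint, i.e. a 2×2 grid of 336-pixel tiles alongside the global 336×336 view.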
+[WARNING|modeling_utils.py:2918] 2025-01-16 22:33:48,874 >> The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
+[WARNING|modeling_utils.py:2918] 2025-01-16 22:33:48,879 >> The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
+01/16/2025 22:33:48 - WARNING - llava.train.train - Process rank: 4, device: cuda:4, n_gpu: 1, distributed training: True, 16-bits training: False
+[WARNING|modeling_utils.py:2918] 2025-01-16 22:33:48,958 >> The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
+dlc1bfcxl3sf01eb-worker-0:74:74 [1] NCCL INFO cudaDriverVersion 12010
+dlc1bfcxl3sf01eb-worker-0:75:75 [2] NCCL INFO cudaDriverVersion 12010
+dlc1bfcxl3sf01eb-worker-0:78:78 [5] NCCL INFO cudaDriverVersion 12010
+dlc1bfcxl3sf01eb-worker-0:73:73 [0] NCCL INFO cudaDriverVersion 12010
+dlc1bfcxl3sf01eb-worker-0:77:77 [4] NCCL INFO cudaDriverVersion 12010
+dlc1bfcxl3sf01eb-worker-0:78:78 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
+dlc1bfcxl3sf01eb-worker-0:75:75 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
+dlc1bfcxl3sf01eb-worker-0:74:74 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
+dlc1bfcxl3sf01eb-worker-0:77:77 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
+dlc1bfcxl3sf01eb-worker-0:73:73 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
+dlc1bfcxl3sf01eb-worker-0:75:75 [2] NCCL INFO Bootstrap : Using eth0:22.8.0.221<0>
+dlc1bfcxl3sf01eb-worker-0:78:78 [5] NCCL INFO Bootstrap : Using eth0:22.8.0.221<0>
+dlc1bfcxl3sf01eb-worker-0:73:73 [0] NCCL INFO Bootstrap : Using eth0:22.8.0.221<0>
+dlc1bfcxl3sf01eb-worker-0:74:74 [1] NCCL INFO Bootstrap : Using eth0:22.8.0.221<0>
+dlc1bfcxl3sf01eb-worker-0:75:75 [2] NCCL INFO Plugin name set by env to libnccl-net-none.so
+dlc1bfcxl3sf01eb-worker-0:78:78 [5] NCCL INFO Plugin name set by env to libnccl-net-none.so
+dlc1bfcxl3sf01eb-worker-0:73:73 [0] NCCL INFO Plugin name set by env to libnccl-net-none.so
+dlc1bfcxl3sf01eb-worker-0:74:74 [1] NCCL INFO Plugin name set by env to libnccl-net-none.so
+dlc1bfcxl3sf01eb-worker-0:77:77 [4] NCCL INFO Bootstrap : Using eth0:22.8.0.221<0>
+dlc1bfcxl3sf01eb-worker-0:77:77 [4] NCCL INFO Plugin name set by env to libnccl-net-none.so
+dlc1bfcxl3sf01eb-worker-0:74:74 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net-none.so) returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory
+dlc1bfcxl3sf01eb-worker-0:75:75 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net-none.so) returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory
+dlc1bfcxl3sf01eb-worker-0:78:78 [5] NCCL INFO NET/Plugin : Plugin load (libnccl-net-none.so) returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory
+dlc1bfcxl3sf01eb-worker-0:73:73 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net-none.so) returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory
+dlc1bfcxl3sf01eb-worker-0:74:74 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
+dlc1bfcxl3sf01eb-worker-0:78:78 [5] NCCL INFO NET/Plugin : No plugin found, using internal implementation
+dlc1bfcxl3sf01eb-worker-0:73:73 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
+dlc1bfcxl3sf01eb-worker-0:75:75 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation
+dlc1bfcxl3sf01eb-worker-0:77:77 [4] NCCL INFO NET/Plugin : Plugin load (libnccl-net-none.so) returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory
+dlc1bfcxl3sf01eb-worker-0:77:77 [4] NCCL INFO NET/Plugin : No plugin found, using internal implementation
+dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
+dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
+dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
+dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
+dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
+dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO NCCL_IB_HCA set to mlx5
+dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO NCCL_IB_HCA set to mlx5
+dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO NCCL_IB_HCA set to mlx5
+dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO NCCL_IB_HCA set to mlx5
+dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO NCCL_IB_HCA set to mlx5
+dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:22.8.0.221<0>
+dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Using network IB
+dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:22.8.0.221<0>
+dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Using network IB
+dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:22.8.0.221<0>
+dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Using network IB
+dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:22.8.0.221<0>
+dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:22.8.0.221<0>
+dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Using network IB
+dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Using network IB
+01/16/2025 22:33:49 - WARNING - llava.train.train - Process rank: 7, device: cuda:7, n_gpu: 1, distributed training: True, 16-bits training: False
+01/16/2025 22:33:49 - WARNING - llava.train.train - Process rank: 6, device: cuda:6, n_gpu: 1, distributed training: True, 16-bits training: False
+[WARNING|modeling_utils.py:2918] 2025-01-16 22:33:49,946 >> The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
+[WARNING|modeling_utils.py:2918] 2025-01-16 22:33:49,951 >> The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
+01/16/2025 22:33:50 - WARNING - llava.train.train - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: False
+[WARNING|modeling_utils.py:2918] 2025-01-16 22:33:50,029 >> The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
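The repeated `trust_remote_code` warnings above (one per rank) are benign: the flag only does anything when a checkpoint is resolved through the Auto classes, which is how the `auto_map` entries in the InternLM2 config dispatch to its custom modeling code. A small sketch of the distinction, reusing the paths from this log (illustrative, not code from the training repo):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "models/internlm/internlm2_5-7b-chat"

# Through the Auto classes, trust_remote_code=True authorizes importing the
# custom classes named in the config's "auto_map" (modeling_internlm2.py).
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # matches "default dtype torch.bfloat16" above
)

# Passing trust_remote_code to a concrete class instead -- as the training
# script evidently does with LlavaInternlm2ForCausalLM -- is what triggers
# "It has no effect here and is ignored".
```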
+dlc1bfcxl3sf01eb-worker-0:79:79 [6] NCCL INFO cudaDriverVersion 12010 +dlc1bfcxl3sf01eb-worker-0:79:79 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth +dlc1bfcxl3sf01eb-worker-0:79:79 [6] NCCL INFO Bootstrap : Using eth0:22.8.0.221<0> +dlc1bfcxl3sf01eb-worker-0:79:79 [6] NCCL INFO Plugin name set by env to libnccl-net-none.so +dlc1bfcxl3sf01eb-worker-0:79:79 [6] NCCL INFO NET/Plugin : Plugin load (libnccl-net-none.so) returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory +dlc1bfcxl3sf01eb-worker-0:79:79 [6] NCCL INFO NET/Plugin : No plugin found, using internal implementation +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO NCCL_IB_HCA set to mlx5 +dlc1bfcxl3sf01eb-worker-0:80:80 [7] NCCL INFO cudaDriverVersion 12010 +dlc1bfcxl3sf01eb-worker-0:80:80 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth +dlc1bfcxl3sf01eb-worker-0:80:80 [7] NCCL INFO Bootstrap : Using eth0:22.8.0.221<0> +dlc1bfcxl3sf01eb-worker-0:80:80 [7] NCCL INFO Plugin name set by env to libnccl-net-none.so +dlc1bfcxl3sf01eb-worker-0:80:80 [7] NCCL INFO NET/Plugin : Plugin load (libnccl-net-none.so) returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory +dlc1bfcxl3sf01eb-worker-0:80:80 [7] NCCL INFO NET/Plugin : No plugin found, using internal implementation +dlc1bfcxl3sf01eb-worker-0:76:76 [3] NCCL INFO cudaDriverVersion 12010 +dlc1bfcxl3sf01eb-worker-0:76:76 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth +dlc1bfcxl3sf01eb-worker-0:76:76 [3] NCCL INFO Bootstrap : Using eth0:22.8.0.221<0> +dlc1bfcxl3sf01eb-worker-0:76:76 [3] NCCL INFO Plugin name set by env to libnccl-net-none.so +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO NCCL_IB_HCA set to mlx5 +dlc1bfcxl3sf01eb-worker-0:76:76 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net-none.so) returned 2 : libnccl-net-none.so: cannot open shared object file: No such file or directory +dlc1bfcxl3sf01eb-worker-0:76:76 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:22.8.0.221<0> +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Using network IB +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO NCCL_IB_HCA set to mlx5 +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:22.8.0.221<0> +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Using network IB +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE [2]mlx5_2:1/RoCE [3]mlx5_3:1/RoCE [RO]; OOB eth0:22.8.0.221<0> +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Using network IB +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO comm 0x9be138b0 rank 15 nranks 16 cudaDev 7 nvmlDev 7 busId 80 commId 0x38f5ae6ce4fb61e2 - Init START +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO comm 0x9a3cae90 rank 14 nranks 16 cudaDev 6 nvmlDev 6 busId 70 commId 0x38f5ae6ce4fb61e2 - Init START +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO comm 0x9a88a060 rank 9 nranks 16 cudaDev 1 nvmlDev 1 busId 20 commId 0x38f5ae6ce4fb61e2 - Init START +dlc1bfcxl3sf01eb-worker-0:73:318 
[0] NCCL INFO comm 0x9b16d510 rank 8 nranks 16 cudaDev 0 nvmlDev 0 busId 10 commId 0x38f5ae6ce4fb61e2 - Init START +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO comm 0x9bbffb30 rank 11 nranks 16 cudaDev 3 nvmlDev 3 busId 40 commId 0x38f5ae6ce4fb61e2 - Init START +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO comm 0x9abceb70 rank 13 nranks 16 cudaDev 5 nvmlDev 5 busId 60 commId 0x38f5ae6ce4fb61e2 - Init START +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO comm 0x99f52590 rank 12 nranks 16 cudaDev 4 nvmlDev 4 busId 50 commId 0x38f5ae6ce4fb61e2 - Init START +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO comm 0x9b68b8f0 rank 10 nranks 16 cudaDev 2 nvmlDev 2 busId 30 commId 0x38f5ae6ce4fb61e2 - Init START +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO NVLS multicast support is not available on dev 7 +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO NVLS multicast support is not available on dev 5 +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO NVLS multicast support is not available on dev 6 +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,ffffffff +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO NVLS multicast support is not available on dev 0 +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,ffffffff,ffffffff +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO NVLS multicast support is not available on dev 1 +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,ffffffff +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO NVLS multicast support is not available on dev 2 +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO NVLS multicast support is not available on dev 4 +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ffffffff +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO NVLS multicast support is not available on dev 3 +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 4. +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 4. +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Trees [0] 12/-1/-1->11->10 [1] 12/-1/-1->11->10 [2] -1/-1/-1->11->10 [3] 12/-1/-1->11->10 [4] 12/-1/-1->11->10 [5] 12/-1/-1->11->10 [6] -1/-1/-1->11->10 [7] 12/-1/-1->11->10 +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 4. +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 4. +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO P2P Chunksize set to 131072 +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 4. 
+dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Trees [0] 13/-1/-1->12->11 [1] 13/-1/-1->12->11 [2] 13/-1/-1->12->4 [3] 13/-1/-1->12->11 [4] 13/-1/-1->12->11 [5] 13/-1/-1->12->11 [6] 13/4/-1->12->-1 [7] 13/-1/-1->12->11 +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO P2P Chunksize set to 131072 +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Trees [0] -1/-1/-1->15->14 [1] 8/-1/-1->15->14 [2] 8/-1/-1->15->14 [3] 8/-1/-1->15->14 [4] -1/-1/-1->15->14 [5] 8/-1/-1->15->14 [6] 8/-1/-1->15->14 [7] 8/-1/-1->15->14 +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13 [2] 15/-1/-1->14->13 [3] 15/-1/-1->14->6 [4] 15/-1/-1->14->13 [5] 15/-1/-1->14->13 [6] 15/-1/-1->14->13 [7] 15/6/-1->14->-1 +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12 [2] 14/-1/-1->13->12 [3] -1/-1/-1->13->12 [4] 14/-1/-1->13->12 [5] 14/-1/-1->13->12 [6] 14/-1/-1->13->12 [7] -1/-1/-1->13->12 +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO P2P Chunksize set to 131072 +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO P2P Chunksize set to 131072 +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO P2P Chunksize set to 131072 +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 4. +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 4. +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Trees [0] 11/-1/-1->10->9 [1] 11/-1/-1->10->2 [2] 11/-1/-1->10->9 [3] 11/-1/-1->10->9 [4] 11/-1/-1->10->9 [5] 11/2/-1->10->-1 [6] 11/-1/-1->10->9 [7] 11/-1/-1->10->9 +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Trees [0] 10/-1/-1->9->8 [1] -1/-1/-1->9->8 [2] 10/-1/-1->9->8 [3] 10/-1/-1->9->8 [4] 10/-1/-1->9->8 [5] -1/-1/-1->9->8 [6] 10/-1/-1->9->8 [7] 10/-1/-1->9->8 +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO P2P Chunksize set to 131072 +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO P2P Chunksize set to 131072 +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 4. 
+dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Trees [0] 9/-1/-1->8->0 [1] 9/-1/-1->8->15 [2] 9/-1/-1->8->15 [3] 9/-1/-1->8->15 [4] 9/0/-1->8->-1 [5] 9/-1/-1->8->15 [6] 9/-1/-1->8->15 [7] 9/-1/-1->8->15 +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO P2P Chunksize set to 131072 +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 03/0 : 12[4] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 01/0 : 8[0] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 01/0 : 10[2] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 07/0 : 12[4] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 03/0 : 7[7] -> 14[6] [receive] via NET/IB/3/GDRDMA +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 07/0 : 7[7] -> 14[6] [receive] via NET/IB/3/GDRDMA +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 05/0 : 8[0] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 05/0 : 10[2] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Channel 00/0 : 9[1] -> 0[0] [send] via NET/IB/0/GDRDMA +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Channel 04/0 : 9[1] -> 0[0] [send] via NET/IB/0/GDRDMA +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 03/0 : 8[0] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 07/0 : 8[0] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 02/0 : 5[5] -> 12[4] [receive] via NET/IB/2/GDRDMA +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 06/0 : 5[5] -> 12[4] [receive] via NET/IB/2/GDRDMA +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Channel 01/0 : 11[3] -> 2[2] [send] via NET/IB/1/GDRDMA +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Channel 05/0 : 11[3] -> 2[2] [send] via NET/IB/1/GDRDMA +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 01/0 : 3[3] -> 10[2] [receive] via NET/IB/1/GDRDMA +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 05/0 : 3[3] -> 10[2] [receive] via NET/IB/1/GDRDMA +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Channel 02/0 : 13[5] -> 4[4] [send] via NET/IB/2/GDRDMA +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Channel 06/0 : 13[5] -> 4[4] [send] via NET/IB/2/GDRDMA +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 00/0 : 1[1] -> 8[0] [receive] via NET/IB/0/GDRDMA +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 04/0 : 1[1] -> 8[0] [receive] via NET/IB/0/GDRDMA +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 00/0 : 8[0] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Channel 03/0 : 15[7] -> 6[6] [send] via NET/IB/3/GDRDMA +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Channel 07/0 : 15[7] -> 6[6] [send] via NET/IB/3/GDRDMA +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 01/0 : 12[4] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 03/0 : 14[6] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 02/0 : 8[0] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 05/0 : 12[4] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 04/0 : 8[0] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 07/0 : 14[6] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 06/0 : 8[0] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 00/0 : 10[2] -> 9[1] via 
P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Channel 00/0 : 15[7] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 02/0 : 10[2] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Channel 01/0 : 15[7] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Channel 00/0 : 13[5] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 03/0 : 10[2] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Channel 02/0 : 15[7] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Channel 01/0 : 13[5] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 04/0 : 10[2] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Channel 00/0 : 11[3] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Channel 04/0 : 15[7] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Channel 03/0 : 13[5] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 06/0 : 10[2] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Channel 02/0 : 11[3] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Channel 05/0 : 15[7] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Channel 04/0 : 13[5] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 07/0 : 10[2] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 00/0 : 14[6] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Channel 03/0 : 11[3] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Channel 06/0 : 15[7] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Channel 05/0 : 13[5] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 01/0 : 14[6] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Channel 04/0 : 11[3] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Channel 07/0 : 13[5] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 02/0 : 14[6] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Channel 06/0 : 11[3] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 04/0 : 14[6] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Channel 07/0 : 11[3] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 00/0 : 12[4] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Channel 01/0 : 9[1] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 05/0 : 14[6] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 02/0 : 12[4] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Channel 02/0 : 9[1] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 06/0 : 14[6] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 04/0 : 12[4] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Channel 03/0 : 9[1] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 06/0 : 12[4] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Channel 05/0 : 9[1] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:315 [1] 
NCCL INFO Channel 06/0 : 9[1] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Channel 07/0 : 9[1] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:356 [5] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 8. +dlc1bfcxl3sf01eb-worker-0:76:355 [3] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 8. +dlc1bfcxl3sf01eb-worker-0:74:358 [1] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 8. +dlc1bfcxl3sf01eb-worker-0:80:357 [7] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 8. +dlc1bfcxl3sf01eb-worker-0:77:354 [4] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. +dlc1bfcxl3sf01eb-worker-0:77:354 [4] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 8. +dlc1bfcxl3sf01eb-worker-0:76:355 [3] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. +dlc1bfcxl3sf01eb-worker-0:79:361 [6] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. +dlc1bfcxl3sf01eb-worker-0:79:361 [6] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 8. +dlc1bfcxl3sf01eb-worker-0:78:356 [5] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. +dlc1bfcxl3sf01eb-worker-0:75:359 [2] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. +dlc1bfcxl3sf01eb-worker-0:75:359 [2] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 8. +dlc1bfcxl3sf01eb-worker-0:74:358 [1] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. +dlc1bfcxl3sf01eb-worker-0:77:354 [4] NCCL INFO NCCL_IB_TC set by environment to 136. +dlc1bfcxl3sf01eb-worker-0:77:354 [4] NCCL INFO NCCL_IB_SL set by environment to 5. +dlc1bfcxl3sf01eb-worker-0:80:357 [7] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. +dlc1bfcxl3sf01eb-worker-0:77:354 [4] NCCL INFO NCCL_IB_TIMEOUT set by environment to 22. +dlc1bfcxl3sf01eb-worker-0:79:361 [6] NCCL INFO NCCL_IB_TC set by environment to 136. +dlc1bfcxl3sf01eb-worker-0:79:361 [6] NCCL INFO NCCL_IB_SL set by environment to 5. +dlc1bfcxl3sf01eb-worker-0:75:359 [2] NCCL INFO NCCL_IB_TC set by environment to 136. +dlc1bfcxl3sf01eb-worker-0:75:359 [2] NCCL INFO NCCL_IB_SL set by environment to 5. +dlc1bfcxl3sf01eb-worker-0:79:361 [6] NCCL INFO NCCL_IB_TIMEOUT set by environment to 22. +dlc1bfcxl3sf01eb-worker-0:75:359 [2] NCCL INFO NCCL_IB_TIMEOUT set by environment to 22. +dlc1bfcxl3sf01eb-worker-0:78:356 [5] NCCL INFO NCCL_IB_TC set by environment to 136. +dlc1bfcxl3sf01eb-worker-0:78:356 [5] NCCL INFO NCCL_IB_SL set by environment to 5. +dlc1bfcxl3sf01eb-worker-0:76:355 [3] NCCL INFO NCCL_IB_TC set by environment to 136. +dlc1bfcxl3sf01eb-worker-0:76:355 [3] NCCL INFO NCCL_IB_SL set by environment to 5. +dlc1bfcxl3sf01eb-worker-0:78:356 [5] NCCL INFO NCCL_IB_TIMEOUT set by environment to 22. +dlc1bfcxl3sf01eb-worker-0:76:355 [3] NCCL INFO NCCL_IB_TIMEOUT set by environment to 22. +dlc1bfcxl3sf01eb-worker-0:73:360 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3. +dlc1bfcxl3sf01eb-worker-0:73:360 [0] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 8. +dlc1bfcxl3sf01eb-worker-0:80:357 [7] NCCL INFO NCCL_IB_TC set by environment to 136. +dlc1bfcxl3sf01eb-worker-0:80:357 [7] NCCL INFO NCCL_IB_SL set by environment to 5. +dlc1bfcxl3sf01eb-worker-0:80:357 [7] NCCL INFO NCCL_IB_TIMEOUT set by environment to 22. +dlc1bfcxl3sf01eb-worker-0:73:360 [0] NCCL INFO NCCL_IB_TC set by environment to 136. +dlc1bfcxl3sf01eb-worker-0:73:360 [0] NCCL INFO NCCL_IB_SL set by environment to 5. +dlc1bfcxl3sf01eb-worker-0:73:360 [0] NCCL INFO NCCL_IB_TIMEOUT set by environment to 22. 
+dlc1bfcxl3sf01eb-worker-0:74:358 [1] NCCL INFO NCCL_IB_TC set by environment to 136. +dlc1bfcxl3sf01eb-worker-0:74:358 [1] NCCL INFO NCCL_IB_SL set by environment to 5. +dlc1bfcxl3sf01eb-worker-0:74:358 [1] NCCL INFO NCCL_IB_TIMEOUT set by environment to 22. +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Connected all rings +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Connected all rings +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Connected all rings +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Connected all rings +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Connected all rings +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Connected all rings +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Channel 00/0 : 13[5] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Channel 01/0 : 13[5] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Connected all rings +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 00/0 : 8[0] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Connected all rings +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Channel 02/0 : 13[5] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 01/0 : 8[0] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Channel 00/0 : 11[3] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 02/0 : 8[0] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Channel 04/0 : 13[5] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Channel 01/0 : 11[3] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 03/0 : 8[0] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Channel 05/0 : 13[5] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 00/0 : 14[6] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Channel 03/0 : 11[3] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 04/0 : 8[0] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Channel 06/0 : 13[5] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Channel 00/0 : 9[1] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 00/0 : 12[4] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 01/0 : 14[6] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Channel 04/0 : 11[3] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 05/0 : 8[0] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Channel 02/0 : 9[1] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 01/0 : 12[4] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 02/0 : 14[6] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Channel 05/0 : 11[3] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 06/0 : 8[0] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Channel 03/0 : 9[1] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 00/0 : 10[2] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 02/0 : 12[4] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 03/0 : 14[6] -> 15[7] via P2P/IPC/read 
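The "set by environment" lines above document the cluster's NCCL tuning for RoCE. Gathered in one place for reference: the values are read straight off this log, and setting them from Python before the first communicator is created is just one way to apply them (the job launcher presumably exports them in the environment instead):

```python
import os

# NCCL settings evidenced by the "set by environment" lines in this log.
NCCL_ENV = {
    "NCCL_SOCKET_IFNAME": "eth",       # bootstrap over eth0
    "NCCL_IB_HCA": "mlx5",             # use the four mlx5_* RoCE NICs
    "NCCL_IB_GID_INDEX": "3",          # RoCEv2 GID index
    "NCCL_IB_QPS_PER_CONNECTION": "8",
    "NCCL_IB_TC": "136",               # traffic class
    "NCCL_IB_SL": "5",                 # service level
    "NCCL_IB_TIMEOUT": "22",
    "NCCL_MIN_NCHANNELS": "4",
}

# Must happen before torch.distributed creates the first NCCL communicator.
os.environ.update(NCCL_ENV)
```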
+dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Channel 07/0 : 11[3] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 07/0 : 8[0] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Channel 04/0 : 9[1] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 01/0 : 10[2] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 03/0 : 12[4] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 04/0 : 14[6] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Channel 06/0 : 9[1] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 02/0 : 10[2] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 04/0 : 12[4] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 05/0 : 14[6] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Channel 07/0 : 9[1] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 03/0 : 10[2] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 05/0 : 12[4] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 06/0 : 14[6] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 01/0 : 8[0] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 04/0 : 10[2] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 06/0 : 12[4] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 07/0 : 14[6] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 05/0 : 10[2] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 03/0 : 8[0] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 07/0 : 12[4] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 03/0 : 6[6] -> 14[6] [receive] via NET/IB/3/GDRDMA +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 07/0 : 6[6] -> 14[6] [receive] via NET/IB/3/GDRDMA +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 03/0 : 14[6] -> 6[6] [send] via NET/IB/3/GDRDMA +dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Channel 07/0 : 14[6] -> 6[6] [send] via NET/IB/3/GDRDMA +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 06/0 : 10[2] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 05/0 : 8[0] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Channel 02/0 : 13[5] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 02/0 : 4[4] -> 12[4] [receive] via NET/IB/2/GDRDMA +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 06/0 : 4[4] -> 12[4] [receive] via NET/IB/2/GDRDMA +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 02/0 : 12[4] -> 4[4] [send] via NET/IB/2/GDRDMA +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 06/0 : 12[4] -> 4[4] [send] via NET/IB/2/GDRDMA +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 07/0 : 10[2] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 07/0 : 8[0] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Channel 06/0 : 13[5] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Channel 00/0 : 9[1] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 01/0 
: 2[2] -> 10[2] [receive] via NET/IB/1/GDRDMA +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 05/0 : 2[2] -> 10[2] [receive] via NET/IB/1/GDRDMA +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 01/0 : 10[2] -> 2[2] [send] via NET/IB/1/GDRDMA +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Channel 05/0 : 10[2] -> 2[2] [send] via NET/IB/1/GDRDMA +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 01/0 : 12[4] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Channel 04/0 : 9[1] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Channel 01/0 : 15[7] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 03/0 : 12[4] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [receive] via NET/IB/0/GDRDMA +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 04/0 : 0[0] -> 8[0] [receive] via NET/IB/0/GDRDMA +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 00/0 : 8[0] -> 0[0] [send] via NET/IB/0/GDRDMA +dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Channel 04/0 : 8[0] -> 0[0] [send] via NET/IB/0/GDRDMA +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Channel 02/0 : 15[7] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 05/0 : 12[4] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Channel 03/0 : 15[7] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Channel 07/0 : 12[4] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Channel 01/0 : 11[3] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Channel 05/0 : 15[7] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Channel 05/0 : 11[3] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Channel 06/0 : 15[7] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Channel 07/0 : 15[7] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Channel 03/0 : 15[7] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Channel 07/0 : 15[7] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO Connected all trees +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 +dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO 8 coll channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO Connected all trees +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 +dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO 8 coll channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO Connected all trees +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 +dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO 8 coll channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO Connected all trees +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 +dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO 8 coll channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO Connected all trees +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 +dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO 8 coll 
channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
+dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO Connected all trees
+dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
+dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO 8 coll channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
+dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO Connected all trees
+dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
+dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO 8 coll channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
+dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO Connected all trees
+dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
+dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO 8 coll channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
+dlc1bfcxl3sf01eb-worker-0:80:342 [7] NCCL INFO comm 0x9be138b0 rank 15 nranks 16 cudaDev 7 nvmlDev 7 busId 80 commId 0x38f5ae6ce4fb61e2 - Init COMPLETE
+dlc1bfcxl3sf01eb-worker-0:76:346 [3] NCCL INFO comm 0x9bbffb30 rank 11 nranks 16 cudaDev 3 nvmlDev 3 busId 40 commId 0x38f5ae6ce4fb61e2 - Init COMPLETE
+dlc1bfcxl3sf01eb-worker-0:75:316 [2] NCCL INFO comm 0x9b68b8f0 rank 10 nranks 16 cudaDev 2 nvmlDev 2 busId 30 commId 0x38f5ae6ce4fb61e2 - Init COMPLETE
+dlc1bfcxl3sf01eb-worker-0:79:339 [6] NCCL INFO comm 0x9a3cae90 rank 14 nranks 16 cudaDev 6 nvmlDev 6 busId 70 commId 0x38f5ae6ce4fb61e2 - Init COMPLETE
+dlc1bfcxl3sf01eb-worker-0:77:317 [4] NCCL INFO comm 0x99f52590 rank 12 nranks 16 cudaDev 4 nvmlDev 4 busId 50 commId 0x38f5ae6ce4fb61e2 - Init COMPLETE
+dlc1bfcxl3sf01eb-worker-0:73:318 [0] NCCL INFO comm 0x9b16d510 rank 8 nranks 16 cudaDev 0 nvmlDev 0 busId 10 commId 0x38f5ae6ce4fb61e2 - Init COMPLETE
+dlc1bfcxl3sf01eb-worker-0:78:314 [5] NCCL INFO comm 0x9abceb70 rank 13 nranks 16 cudaDev 5 nvmlDev 5 busId 60 commId 0x38f5ae6ce4fb61e2 - Init COMPLETE
+dlc1bfcxl3sf01eb-worker-0:74:315 [1] NCCL INFO comm 0x9a88a060 rank 9 nranks 16 cudaDev 1 nvmlDev 1 busId 20 commId 0x38f5ae6ce4fb61e2 - Init COMPLETE
+
+Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]
+Loading checkpoint shards: 100%|██████████| 8/8 [00:08<00:00, 1.06s/it]
(shard-loading progress is printed once per rank; per-rank rates range from 1.03s/it to 1.06s/it)
+[INFO|modeling_utils.py:4350] 2025-01-16 22:33:59,927 >> All model checkpoint weights were used when initializing LlavaInternlm2ForCausalLM.
+[WARNING|modeling_utils.py:4352] 2025-01-16 22:33:59,927 >> Some weights of LlavaInternlm2ForCausalLM were not initialized from the model checkpoint at models/internlm/internlm2_5-7b-chat and are newly initialized: ['lm_head.weight']
+You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(the warning above is repeated once per rank, 22:33:59,927 through 22:33:59,932; several repeats were interleaved with prints of the cache path /fs-computility/mllm1/shared/hub/)
+Using tokenizer from models/internlm/internlm2_5-7b-chat
+using cache dir None
(the two lines above are logged once per rank)
+[INFO|configuration_utils.py:779] 2025-01-16 22:33:59,931 >> loading configuration file models/internlm/internlm2_5-7b-chat/generation_config.json
+[INFO|configuration_utils.py:826] 2025-01-16 22:33:59,931 >> Generate config GenerationConfig {
+  "bos_token_id": 1,
+  "eos_token_id": [
+    2,
+    92542
+  ],
+  "pad_token_id": 2
+}
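The "newly initialized: ['lm_head.weight']" warning above is most likely a key-name mismatch rather than lost data: if, as in the upstream InternLM2 implementation, the checkpoint stores its output projection under the name output.weight, then a wrapper class that declares the head as lm_head finds no matching tensor and initializes it fresh. A quick diagnostic for which names the sharded checkpoint actually contains (a sketch; the index path comes from the "loading weights file" line earlier in this log, and the expected names are assumptions):

```python
import json

# The shard index maps every checkpoint tensor name to its shard file.
with open("models/internlm/internlm2_5-7b-chat/model.safetensors.index.json") as f:
    ckpt_keys = set(json.load(f)["weight_map"])

# If the head is absent under the wrapper's name but present under
# InternLM2's own name, the warning is purely a naming mismatch.
print("lm_head.weight" in ckpt_keys)                      # expected: False
print([k for k in sorted(ckpt_keys) if "output" in k])    # expected to include "output.weight"
```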
+[INFO|tokenization_utils_base.py:2025] 2025-01-16 22:33:59,946 >> loading file ./tokenizer.model
+[INFO|tokenization_utils_base.py:2025] 2025-01-16 22:33:59,946 >> loading file added_tokens.json
+[INFO|tokenization_utils_base.py:2025] 2025-01-16 22:33:59,946 >> loading file special_tokens_map.json
+[INFO|tokenization_utils_base.py:2025] 2025-01-16 22:33:59,946 >> loading file tokenizer_config.json
+[INFO|tokenization_utils_base.py:2025] 2025-01-16 22:33:59,946 >> loading file tokenizer.json
+01/16/2025 22:34:00 - INFO - llava.train.train - Using conversation template: Conversation(system='<|im_start|>system\nYou are a helpful assistant. ', roles=('<|im_start|>user\n', '<|im_start|>assistant\n'), messages=[], offset=0, sep_style=, sep='<|im_end|>', sep2=None, version='internlm_v2', mm_system=None, skip_next=False)
+[INFO|image_processing_utils.py:373] 2025-01-16 22:34:00,114 >> loading configuration file /fs-computility/mllm1/shared/hub/models--openai--clip-vit-large-patch14-336/snapshots/ce19dc912ca5cd21c8a653c79e251e808ccabcd1/preprocessor_config.json
+[INFO|image_processing_utils.py:738] 2025-01-16 22:34:00,114 >> size should be a dictionary one of the following set of keys: ({'height', 'width'}, {'shortest_edge'}, {'shortest_edge', 'longest_edge'}, {'longest_edge'}), got 336. Converted to {'shortest_edge': 336}.
+[INFO|image_processing_utils.py:738] 2025-01-16 22:34:00,114 >> crop_size should be a dictionary one of the following set of keys: ({'height', 'width'}, {'shortest_edge'}, {'shortest_edge', 'longest_edge'}, {'longest_edge'}), got 336. Converted to {'height': 336, 'width': 336}.
+[INFO|image_processing_utils.py:425] 2025-01-16 22:34:00,114 >> Image processor CLIPImageProcessor {
+  "crop_size": {
+    "height": 336,
+    "width": 336
+  },
+  "do_center_crop": true,
+  "do_convert_rgb": true,
+  "do_normalize": true,
+  "do_rescale": true,
+  "do_resize": true,
+  "image_mean": [
+    0.48145466,
+    0.4578275,
+    0.40821073
+  ],
+  "image_processor_type": "CLIPImageProcessor",
+  "image_std": [
+    0.26862954,
+    0.26130258,
+    0.27577711
+  ],
+  "resample": 3,
+  "rescale_factor": 0.00392156862745098,
+  "size": {
+    "shortest_edge": 336
+  }
+}
+
+[INFO|configuration_utils.py:727] 2025-01-16 22:34:00,123 >> loading configuration file /fs-computility/mllm1/shared/hub/models--openai--clip-vit-large-patch14-336/snapshots/ce19dc912ca5cd21c8a653c79e251e808ccabcd1/config.json
+[INFO|configuration_utils.py:792] 2025-01-16 22:34:00,124 >> Model config CLIPVisionConfig {
+  "attention_dropout": 0.0,
+  "dropout": 0.0,
+  "hidden_act": "quick_gelu",
+  "hidden_size": 1024,
+  "image_size": 336,
+  "initializer_factor": 1.0,
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "layer_norm_eps": 1e-05,
+  "model_type": "clip_vision_model",
+  "num_attention_heads": 16,
+  "num_channels": 3,
+  "num_hidden_layers": 24,
+  "patch_size": 14,
+  "projection_dim": 768,
+  "transformers_version": "4.37.2"
+}
+
+[INFO|modeling_utils.py:3473] 2025-01-16 22:34:00,125 >> loading weights file /fs-computility/mllm1/shared/hub/models--openai--clip-vit-large-patch14-336/snapshots/ce19dc912ca5cd21c8a653c79e251e808ccabcd1/pytorch_model.bin
+[INFO|modeling_utils.py:3582] 2025-01-16 22:34:03,503 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
+[INFO|modeling_utils.py:4340] 2025-01-16 22:34:05,335 >> Some weights of the model checkpoint at /fs-computility/mllm1/shared/hub/models--openai--clip-vit-large-patch14-336/snapshots/ce19dc912ca5cd21c8a653c79e251e808ccabcd1 were not used when initializing CLIPVisionModel: ['logit_scale', 'text_model.embeddings.position_embedding.weight', 'text_model.embeddings.position_ids', 'text_model.embeddings.token_embedding.weight', 'text_model.encoder.layers.0.layer_norm1.bias', 'text_model.encoder.layers.0.layer_norm1.weight', 'text_model.encoder.layers.0.layer_norm2.bias', 'text_model.encoder.layers.0.layer_norm2.weight', 'text_model.encoder.layers.0.mlp.fc1.bias', 'text_model.encoder.layers.0.mlp.fc1.weight', 'text_model.encoder.layers.0.mlp.fc2.bias', 'text_model.encoder.layers.0.mlp.fc2.weight', 'text_model.encoder.layers.0.self_attn.k_proj.bias', 'text_model.encoder.layers.0.self_attn.k_proj.weight',
'text_model.encoder.layers.0.self_attn.out_proj.bias', 'text_model.encoder.layers.0.self_attn.out_proj.weight', 'text_model.encoder.layers.0.self_attn.q_proj.bias', 'text_model.encoder.layers.0.self_attn.q_proj.weight', 'text_model.encoder.layers.0.self_attn.v_proj.bias', 'text_model.encoder.layers.0.self_attn.v_proj.weight', 'text_model.encoder.layers.1.layer_norm1.bias', 'text_model.encoder.layers.1.layer_norm1.weight', 'text_model.encoder.layers.1.layer_norm2.bias', 'text_model.encoder.layers.1.layer_norm2.weight', 'text_model.encoder.layers.1.mlp.fc1.bias', 'text_model.encoder.layers.1.mlp.fc1.weight', 'text_model.encoder.layers.1.mlp.fc2.bias', 'text_model.encoder.layers.1.mlp.fc2.weight', 'text_model.encoder.layers.1.self_attn.k_proj.bias', 'text_model.encoder.layers.1.self_attn.k_proj.weight', 'text_model.encoder.layers.1.self_attn.out_proj.bias', 'text_model.encoder.layers.1.self_attn.out_proj.weight', 'text_model.encoder.layers.1.self_attn.q_proj.bias', 'text_model.encoder.layers.1.self_attn.q_proj.weight', 'text_model.encoder.layers.1.self_attn.v_proj.bias', 'text_model.encoder.layers.1.self_attn.v_proj.weight', 'text_model.encoder.layers.10.layer_norm1.bias', 'text_model.encoder.layers.10.layer_norm1.weight', 'text_model.encoder.layers.10.layer_norm2.bias', 'text_model.encoder.layers.10.layer_norm2.weight', 'text_model.encoder.layers.10.mlp.fc1.bias', 'text_model.encoder.layers.10.mlp.fc1.weight', 'text_model.encoder.layers.10.mlp.fc2.bias', 'text_model.encoder.layers.10.mlp.fc2.weight', 'text_model.encoder.layers.10.self_attn.k_proj.bias', 'text_model.encoder.layers.10.self_attn.k_proj.weight', 'text_model.encoder.layers.10.self_attn.out_proj.bias', 'text_model.encoder.layers.10.self_attn.out_proj.weight', 'text_model.encoder.layers.10.self_attn.q_proj.bias', 'text_model.encoder.layers.10.self_attn.q_proj.weight', 'text_model.encoder.layers.10.self_attn.v_proj.bias', 'text_model.encoder.layers.10.self_attn.v_proj.weight', 'text_model.encoder.layers.11.layer_norm1.bias', 'text_model.encoder.layers.11.layer_norm1.weight', 'text_model.encoder.layers.11.layer_norm2.bias', 'text_model.encoder.layers.11.layer_norm2.weight', 'text_model.encoder.layers.11.mlp.fc1.bias', 'text_model.encoder.layers.11.mlp.fc1.weight', 'text_model.encoder.layers.11.mlp.fc2.bias', 'text_model.encoder.layers.11.mlp.fc2.weight', 'text_model.encoder.layers.11.self_attn.k_proj.bias', 'text_model.encoder.layers.11.self_attn.k_proj.weight', 'text_model.encoder.layers.11.self_attn.out_proj.bias', 'text_model.encoder.layers.11.self_attn.out_proj.weight', 'text_model.encoder.layers.11.self_attn.q_proj.bias', 'text_model.encoder.layers.11.self_attn.q_proj.weight', 'text_model.encoder.layers.11.self_attn.v_proj.bias', 'text_model.encoder.layers.11.self_attn.v_proj.weight', 'text_model.encoder.layers.2.layer_norm1.bias', 'text_model.encoder.layers.2.layer_norm1.weight', 'text_model.encoder.layers.2.layer_norm2.bias', 'text_model.encoder.layers.2.layer_norm2.weight', 'text_model.encoder.layers.2.mlp.fc1.bias', 'text_model.encoder.layers.2.mlp.fc1.weight', 'text_model.encoder.layers.2.mlp.fc2.bias', 'text_model.encoder.layers.2.mlp.fc2.weight', 'text_model.encoder.layers.2.self_attn.k_proj.bias', 'text_model.encoder.layers.2.self_attn.k_proj.weight', 'text_model.encoder.layers.2.self_attn.out_proj.bias', 'text_model.encoder.layers.2.self_attn.out_proj.weight', 'text_model.encoder.layers.2.self_attn.q_proj.bias', 'text_model.encoder.layers.2.self_attn.q_proj.weight', 'text_model.encoder.layers.2.self_attn.v_proj.bias', 
'text_model.encoder.layers.2.self_attn.v_proj.weight', 'text_model.encoder.layers.3.layer_norm1.bias', 'text_model.encoder.layers.3.layer_norm1.weight', 'text_model.encoder.layers.3.layer_norm2.bias', 'text_model.encoder.layers.3.layer_norm2.weight', 'text_model.encoder.layers.3.mlp.fc1.bias', 'text_model.encoder.layers.3.mlp.fc1.weight', 'text_model.encoder.layers.3.mlp.fc2.bias', 'text_model.encoder.layers.3.mlp.fc2.weight', 'text_model.encoder.layers.3.self_attn.k_proj.bias', 'text_model.encoder.layers.3.self_attn.k_proj.weight', 'text_model.encoder.layers.3.self_attn.out_proj.bias', 'text_model.encoder.layers.3.self_attn.out_proj.weight', 'text_model.encoder.layers.3.self_attn.q_proj.bias', 'text_model.encoder.layers.3.self_attn.q_proj.weight', 'text_model.encoder.layers.3.self_attn.v_proj.bias', 'text_model.encoder.layers.3.self_attn.v_proj.weight', 'text_model.encoder.layers.4.layer_norm1.bias', 'text_model.encoder.layers.4.layer_norm1.weight', 'text_model.encoder.layers.4.layer_norm2.bias', 'text_model.encoder.layers.4.layer_norm2.weight', 'text_model.encoder.layers.4.mlp.fc1.bias', 'text_model.encoder.layers.4.mlp.fc1.weight', 'text_model.encoder.layers.4.mlp.fc2.bias', 'text_model.encoder.layers.4.mlp.fc2.weight', 'text_model.encoder.layers.4.self_attn.k_proj.bias', 'text_model.encoder.layers.4.self_attn.k_proj.weight', 'text_model.encoder.layers.4.self_attn.out_proj.bias', 'text_model.encoder.layers.4.self_attn.out_proj.weight', 'text_model.encoder.layers.4.self_attn.q_proj.bias', 'text_model.encoder.layers.4.self_attn.q_proj.weight', 'text_model.encoder.layers.4.self_attn.v_proj.bias', 'text_model.encoder.layers.4.self_attn.v_proj.weight', 'text_model.encoder.layers.5.layer_norm1.bias', 'text_model.encoder.layers.5.layer_norm1.weight', 'text_model.encoder.layers.5.layer_norm2.bias', 'text_model.encoder.layers.5.layer_norm2.weight', 'text_model.encoder.layers.5.mlp.fc1.bias', 'text_model.encoder.layers.5.mlp.fc1.weight', 'text_model.encoder.layers.5.mlp.fc2.bias', 'text_model.encoder.layers.5.mlp.fc2.weight', 'text_model.encoder.layers.5.self_attn.k_proj.bias', 'text_model.encoder.layers.5.self_attn.k_proj.weight', 'text_model.encoder.layers.5.self_attn.out_proj.bias', 'text_model.encoder.layers.5.self_attn.out_proj.weight', 'text_model.encoder.layers.5.self_attn.q_proj.bias', 'text_model.encoder.layers.5.self_attn.q_proj.weight', 'text_model.encoder.layers.5.self_attn.v_proj.bias', 'text_model.encoder.layers.5.self_attn.v_proj.weight', 'text_model.encoder.layers.6.layer_norm1.bias', 'text_model.encoder.layers.6.layer_norm1.weight', 'text_model.encoder.layers.6.layer_norm2.bias', 'text_model.encoder.layers.6.layer_norm2.weight', 'text_model.encoder.layers.6.mlp.fc1.bias', 'text_model.encoder.layers.6.mlp.fc1.weight', 'text_model.encoder.layers.6.mlp.fc2.bias', 'text_model.encoder.layers.6.mlp.fc2.weight', 'text_model.encoder.layers.6.self_attn.k_proj.bias', 'text_model.encoder.layers.6.self_attn.k_proj.weight', 'text_model.encoder.layers.6.self_attn.out_proj.bias', 'text_model.encoder.layers.6.self_attn.out_proj.weight', 'text_model.encoder.layers.6.self_attn.q_proj.bias', 'text_model.encoder.layers.6.self_attn.q_proj.weight', 'text_model.encoder.layers.6.self_attn.v_proj.bias', 'text_model.encoder.layers.6.self_attn.v_proj.weight', 'text_model.encoder.layers.7.layer_norm1.bias', 'text_model.encoder.layers.7.layer_norm1.weight', 'text_model.encoder.layers.7.layer_norm2.bias', 'text_model.encoder.layers.7.layer_norm2.weight', 'text_model.encoder.layers.7.mlp.fc1.bias', 
'text_model.encoder.layers.7.mlp.fc1.weight', 'text_model.encoder.layers.7.mlp.fc2.bias', 'text_model.encoder.layers.7.mlp.fc2.weight', 'text_model.encoder.layers.7.self_attn.k_proj.bias', 'text_model.encoder.layers.7.self_attn.k_proj.weight', 'text_model.encoder.layers.7.self_attn.out_proj.bias', 'text_model.encoder.layers.7.self_attn.out_proj.weight', 'text_model.encoder.layers.7.self_attn.q_proj.bias', 'text_model.encoder.layers.7.self_attn.q_proj.weight', 'text_model.encoder.layers.7.self_attn.v_proj.bias', 'text_model.encoder.layers.7.self_attn.v_proj.weight', 'text_model.encoder.layers.8.layer_norm1.bias', 'text_model.encoder.layers.8.layer_norm1.weight', 'text_model.encoder.layers.8.layer_norm2.bias', 'text_model.encoder.layers.8.layer_norm2.weight', 'text_model.encoder.layers.8.mlp.fc1.bias', 'text_model.encoder.layers.8.mlp.fc1.weight', 'text_model.encoder.layers.8.mlp.fc2.bias', 'text_model.encoder.layers.8.mlp.fc2.weight', 'text_model.encoder.layers.8.self_attn.k_proj.bias', 'text_model.encoder.layers.8.self_attn.k_proj.weight', 'text_model.encoder.layers.8.self_attn.out_proj.bias', 'text_model.encoder.layers.8.self_attn.out_proj.weight', 'text_model.encoder.layers.8.self_attn.q_proj.bias', 'text_model.encoder.layers.8.self_attn.q_proj.weight', 'text_model.encoder.layers.8.self_attn.v_proj.bias', 'text_model.encoder.layers.8.self_attn.v_proj.weight', 'text_model.encoder.layers.9.layer_norm1.bias', 'text_model.encoder.layers.9.layer_norm1.weight', 'text_model.encoder.layers.9.layer_norm2.bias', 'text_model.encoder.layers.9.layer_norm2.weight', 'text_model.encoder.layers.9.mlp.fc1.bias', 'text_model.encoder.layers.9.mlp.fc1.weight', 'text_model.encoder.layers.9.mlp.fc2.bias', 'text_model.encoder.layers.9.mlp.fc2.weight', 'text_model.encoder.layers.9.self_attn.k_proj.bias', 'text_model.encoder.layers.9.self_attn.k_proj.weight', 'text_model.encoder.layers.9.self_attn.out_proj.bias', 'text_model.encoder.layers.9.self_attn.out_proj.weight', 'text_model.encoder.layers.9.self_attn.q_proj.bias', 'text_model.encoder.layers.9.self_attn.q_proj.weight', 'text_model.encoder.layers.9.self_attn.v_proj.bias', 'text_model.encoder.layers.9.self_attn.v_proj.weight', 'text_model.final_layer_norm.bias', 'text_model.final_layer_norm.weight', 'text_projection.weight', 'visual_projection.weight'] +- This IS expected if you are initializing CLIPVisionModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). +- This IS NOT expected if you are initializing CLIPVisionModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). +[INFO|modeling_utils.py:4358] 2025-01-16 22:34:05,335 >> All the weights of CLIPVisionModel were initialized from the model checkpoint at /fs-computility/mllm1/shared/hub/models--openai--clip-vit-large-patch14-336/snapshots/ce19dc912ca5cd21c8a653c79e251e808ccabcd1. +If your task is similar to the task the model of the checkpoint was trained on, you can already use CLIPVisionModel for predictions without further training. 
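+
+The long list of discarded text_model.* keys above is by design rather than an error: CLIPVisionModel instantiates only the vision encoder, so loading it from a full CLIP checkpoint drops logit_scale, the entire text tower, and both projection heads. A minimal sketch under that assumption, using the public hub id instead of the local snapshot path from the log:
+
+    from transformers import CLIPImageProcessor, CLIPVisionModel
+
+    # Loading just the vision tower reproduces the "were not used when
+    # initializing CLIPVisionModel" warning for every text_model.* key.
+    vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
+
+    # The processor load also shows the int -> dict normalization logged above:
+    # size=336 -> {'shortest_edge': 336}, crop_size=336 -> {'height': 336, 'width': 336}.
+    processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
+    print(processor.size, processor.crop_size)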
+01/16/2025 22:34:23 - INFO - llava.train.train - Add dataset: llava-next-sft-notext with length: 738601, data type: normal, seed: 0 +01/16/2025 22:34:26 - INFO - llava.train.train - Add dataset: knowledge_gqa9k_art1500_cc3m30k with length: 40813, data type: know, seed: 1 +01/16/2025 22:34:30 - INFO - llava.train.train - Add dataset: Inferencial_flickr7k_cc3m30k_polished_md with length: 37117, data type: inf_polishmd, seed: 2 +01/16/2025 22:34:33 - INFO - llava.train.train - Add dataset: Detail_flickr7k_cc3m28k with length: 35313, data type: detail, seed: 3 +01/16/2025 22:34:37 - INFO - llava.train.train - Add dataset: Knowledge_instruct40k with length: 40218, data type: know_ins, seed: 4 +01/16/2025 22:34:40 - INFO - llava.train.train - Add dataset: Creation10k_fixed with length: 9698, data type: creation, seed: 5 +01/16/2025 22:34:43 - INFO - llava.train.train - Add dataset: Chartqa_generate_11k_gpt_qwen_merge with length: 11160, data type: chart, seed: 6 +01/16/2025 22:34:46 - INFO - llava.train.train - Add dataset: Tqa_detail_qwengenerate_multi8k_gpt with length: 8391, data type: tqa, seed: 7 +01/16/2025 22:34:50 - INFO - llava.train.train - Add dataset: Infovqa_single_gpt with length: 23068, data type: info, seed: 8 +[INFO|trainer.py:571] 2025-01-16 22:34:50,107 >> Using auto half precision backend +[INFO|trainer.py:1721] 2025-01-16 22:35:33,753 >> ***** Running training ***** +[INFO|trainer.py:1722] 2025-01-16 22:35:33,753 >> Num examples = 944,379 +[INFO|trainer.py:1723] 2025-01-16 22:35:33,753 >> Num Epochs = 1 +[INFO|trainer.py:1724] 2025-01-16 22:35:33,753 >> Instantaneous batch size per device = 4 +[INFO|trainer.py:1727] 2025-01-16 22:35:33,753 >> Total train batch size (w. parallel, distributed & accumulation) = 128 +[INFO|trainer.py:1728] 2025-01-16 22:35:33,753 >> Gradient Accumulation steps = 2 +[INFO|trainer.py:1729] 2025-01-16 22:35:33,753 >> Total optimization steps = 7,378 +[INFO|trainer.py:1730] 2025-01-16 22:35:33,755 >> Number of trainable parameters = 8,441,260,032 +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Using network IB +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Using network IB +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Using network IB +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Using network IB +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Using network IB +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Using network IB +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Using network IB +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Using network IB +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO comm 0x7fa03c037ad0 rank 15 nranks 16 cudaDev 7 nvmlDev 7 busId 80 commId 0x338d070d42255f9a - Init START +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO comm 0x7f2518037710 rank 14 nranks 16 cudaDev 6 nvmlDev 6 busId 70 commId 0x338d070d42255f9a - Init START +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO comm 0x7faef00375b0 rank 13 nranks 16 cudaDev 5 nvmlDev 5 busId 60 commId 0x338d070d42255f9a - Init START +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO comm 0x7f08a4037770 rank 12 nranks 16 cudaDev 4 nvmlDev 4 busId 50 commId 0x338d070d42255f9a - Init START +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO comm 0x7f6af80377d0 rank 11 nranks 16 cudaDev 3 nvmlDev 3 busId 40 commId 0x338d070d42255f9a - Init START +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO comm 0x7f22740374a0 rank 8 nranks 16 cudaDev 0 nvmlDev 0 busId 10 commId 0x338d070d42255f9a - Init START +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO comm 0x7fee7c037660 rank 
9 nranks 16 cudaDev 1 nvmlDev 1 busId 20 commId 0x338d070d42255f9a - Init START +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO comm 0x7fbad80375f0 rank 10 nranks 16 cudaDev 2 nvmlDev 2 busId 30 commId 0x338d070d42255f9a - Init START +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO NVLS multicast support is not available on dev 6 +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,ffffffff +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO NVLS multicast support is not available on dev 2 +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO NVLS multicast support is not available on dev 7 +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,ffffffff,ffffffff +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO NVLS multicast support is not available on dev 1 +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO NVLS multicast support is not available on dev 5 +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,ffffffff +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO NVLS multicast support is not available on dev 3 +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO NVLS multicast support is not available on dev 4 +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,ffffffff +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO NVLS multicast support is not available on dev 0 +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Trees [0] 9/-1/-1->8->0 [1] 9/-1/-1->8->15 [2] 9/-1/-1->8->15 [3] 9/-1/-1->8->15 [4] 9/0/-1->8->-1 [5] 9/-1/-1->8->15 [6] 9/-1/-1->8->15 [7] 9/-1/-1->8->15 +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO P2P Chunksize set to 131072 +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Trees [0] 10/-1/-1->9->8 [1] -1/-1/-1->9->8 [2] 10/-1/-1->9->8 [3] 10/-1/-1->9->8 [4] 10/-1/-1->9->8 [5] -1/-1/-1->9->8 [6] 10/-1/-1->9->8 [7] 10/-1/-1->9->8 +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO P2P Chunksize set to 131072 +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Trees [0] 11/-1/-1->10->9 [1] 11/-1/-1->10->2 [2] 11/-1/-1->10->9 [3] 11/-1/-1->10->9 [4] 11/-1/-1->10->9 [5] 11/2/-1->10->-1 [6] 11/-1/-1->10->9 [7] 11/-1/-1->10->9 +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO P2P Chunksize set to 131072 +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Trees [0] -1/-1/-1->15->14 [1] 8/-1/-1->15->14 [2] 8/-1/-1->15->14 [3] 8/-1/-1->15->14 [4] -1/-1/-1->15->14 [5] 8/-1/-1->15->14 [6] 8/-1/-1->15->14 [7] 8/-1/-1->15->14 +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO P2P Chunksize set to 131072 +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Trees [0] 13/-1/-1->12->11 [1] 13/-1/-1->12->11 [2] 13/-1/-1->12->4 [3] 13/-1/-1->12->11 [4] 13/-1/-1->12->11 [5] 13/-1/-1->12->11 [6] 13/4/-1->12->-1 [7] 13/-1/-1->12->11 +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12 [2] 14/-1/-1->13->12 [3] -1/-1/-1->13->12 [4] 14/-1/-1->13->12 [5] 14/-1/-1->13->12 [6] 14/-1/-1->13->12 [7] -1/-1/-1->13->12 +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO P2P Chunksize set to 131072 +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13 [2] 15/-1/-1->14->13 [3] 15/-1/-1->14->6 [4] 15/-1/-1->14->13 [5] 15/-1/-1->14->13 [6] 15/-1/-1->14->13 [7] 15/6/-1->14->-1 +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO P2P Chunksize set to 131072 +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO P2P Chunksize set to 131072 +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Trees [0] 
12/-1/-1->11->10 [1] 12/-1/-1->11->10 [2] -1/-1/-1->11->10 [3] 12/-1/-1->11->10 [4] 12/-1/-1->11->10 [5] 12/-1/-1->11->10 [6] -1/-1/-1->11->10 [7] 12/-1/-1->11->10 +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO P2P Chunksize set to 131072 +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 01/0 : 8[0] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 01/0 : 10[2] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 03/0 : 12[4] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Channel 00/0 : 9[1] -> 0[0] [send] via NET/IB/0/GDRDMA +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Channel 04/0 : 9[1] -> 0[0] [send] via NET/IB/0/GDRDMA +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 03/0 : 7[7] -> 14[6] [receive] via NET/IB/3/GDRDMA +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 07/0 : 7[7] -> 14[6] [receive] via NET/IB/3/GDRDMA +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 05/0 : 8[0] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 05/0 : 10[2] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 07/0 : 12[4] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 03/0 : 8[0] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 07/0 : 8[0] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 03/0 : 14[6] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Channel 01/0 : 11[3] -> 2[2] [send] via NET/IB/1/GDRDMA +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 02/0 : 5[5] -> 12[4] [receive] via NET/IB/2/GDRDMA +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 06/0 : 5[5] -> 12[4] [receive] via NET/IB/2/GDRDMA +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Channel 05/0 : 11[3] -> 2[2] [send] via NET/IB/1/GDRDMA +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 01/0 : 12[4] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 07/0 : 14[6] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 05/0 : 12[4] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Channel 02/0 : 13[5] -> 4[4] [send] via NET/IB/2/GDRDMA +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Channel 06/0 : 13[5] -> 4[4] [send] via NET/IB/2/GDRDMA +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 00/0 : 1[1] -> 8[0] [receive] via NET/IB/0/GDRDMA +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 04/0 : 1[1] -> 8[0] [receive] via NET/IB/0/GDRDMA +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 00/0 : 8[0] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 01/0 : 3[3] -> 10[2] [receive] via NET/IB/1/GDRDMA +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 05/0 : 3[3] -> 10[2] [receive] via NET/IB/1/GDRDMA +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Channel 03/0 : 15[7] -> 6[6] [send] via NET/IB/3/GDRDMA +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Channel 07/0 : 15[7] -> 6[6] [send] via NET/IB/3/GDRDMA +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 02/0 : 8[0] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 04/0 : 8[0] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 06/0 : 8[0] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Channel 00/0 : 11[3] -> 10[2] via 
P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Channel 00/0 : 15[7] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Channel 02/0 : 11[3] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Channel 01/0 : 15[7] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 00/0 : 12[4] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Channel 00/0 : 13[5] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Channel 03/0 : 11[3] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 00/0 : 14[6] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Channel 01/0 : 9[1] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Channel 02/0 : 15[7] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 00/0 : 10[2] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 02/0 : 12[4] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Channel 01/0 : 13[5] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Channel 04/0 : 11[3] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 01/0 : 14[6] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Channel 02/0 : 9[1] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Channel 04/0 : 15[7] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 02/0 : 10[2] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 04/0 : 12[4] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Channel 03/0 : 13[5] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Channel 06/0 : 11[3] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Channel 03/0 : 9[1] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 02/0 : 14[6] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Channel 05/0 : 15[7] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 03/0 : 10[2] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 06/0 : 12[4] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Channel 04/0 : 13[5] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Channel 07/0 : 11[3] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Channel 05/0 : 9[1] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 04/0 : 14[6] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Channel 06/0 : 15[7] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 04/0 : 10[2] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Channel 05/0 : 13[5] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Channel 06/0 : 9[1] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 05/0 : 14[6] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 06/0 : 10[2] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Channel 07/0 : 13[5] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Channel 07/0 : 9[1] -> 8[0] via P2P/IPC/read 
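+
+The per-channel routing above (P2P/IPC within the node, NET/IB with GDRDMA across nodes) is printed because NCCL debug logging is enabled for this job. As an illustration of how such output is typically switched on, not necessarily how this launch was configured:
+
+    import os
+    import torch.distributed as dist
+
+    # NCCL_DEBUG=INFO makes every rank print the ring/tree/channel setup seen
+    # above; NCCL_DEBUG_SUBSYS (e.g. "INIT,GRAPH") narrows it. Both must be
+    # set before the process group is created.
+    os.environ.setdefault("NCCL_DEBUG", "INFO")
+    dist.init_process_group(backend="nccl")  # rank/world size come from torchrun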
+dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 06/0 : 14[6] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 07/0 : 10[2] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Connected all rings +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Connected all rings +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Channel 00/0 : 13[5] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Channel 01/0 : 13[5] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Connected all rings +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Channel 02/0 : 13[5] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Channel 04/0 : 13[5] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Connected all rings +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Channel 05/0 : 13[5] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Connected all rings +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Channel 06/0 : 13[5] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Connected all rings +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 00/0 : 8[0] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Connected all rings +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 01/0 : 8[0] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Connected all rings +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 00/0 : 10[2] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 02/0 : 8[0] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Channel 00/0 : 11[3] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 01/0 : 10[2] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 03/0 : 8[0] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Channel 01/0 : 11[3] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 02/0 : 10[2] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 04/0 : 8[0] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Channel 03/0 : 11[3] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 03/0 : 10[2] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 05/0 : 8[0] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Channel 04/0 : 11[3] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 04/0 : 10[2] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 06/0 : 8[0] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Channel 05/0 : 11[3] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 00/0 : 14[6] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 05/0 : 10[2] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 07/0 : 8[0] -> 9[1] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 00/0 : 12[4] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Channel 07/0 : 11[3] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Channel 00/0 : 9[1] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL 
INFO Channel 01/0 : 14[6] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 06/0 : 10[2] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 01/0 : 12[4] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Channel 02/0 : 9[1] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 02/0 : 14[6] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 07/0 : 10[2] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 02/0 : 12[4] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Channel 03/0 : 9[1] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 03/0 : 14[6] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 03/0 : 12[4] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Channel 04/0 : 9[1] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 04/0 : 14[6] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 04/0 : 12[4] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Channel 06/0 : 9[1] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 05/0 : 14[6] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 05/0 : 12[4] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Channel 07/0 : 9[1] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 06/0 : 14[6] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 06/0 : 12[4] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Channel 00/0 : 9[1] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 01/0 : 8[0] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 01/0 : 2[2] -> 10[2] [receive] via NET/IB/1/GDRDMA +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 05/0 : 2[2] -> 10[2] [receive] via NET/IB/1/GDRDMA +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 01/0 : 10[2] -> 2[2] [send] via NET/IB/1/GDRDMA +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Channel 05/0 : 10[2] -> 2[2] [send] via NET/IB/1/GDRDMA +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 07/0 : 14[6] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 07/0 : 12[4] -> 13[5] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Channel 04/0 : 9[1] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 03/0 : 8[0] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 03/0 : 6[6] -> 14[6] [receive] via NET/IB/3/GDRDMA +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 07/0 : 6[6] -> 14[6] [receive] via NET/IB/3/GDRDMA +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 03/0 : 14[6] -> 6[6] [send] via NET/IB/3/GDRDMA +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Channel 07/0 : 14[6] -> 6[6] [send] via NET/IB/3/GDRDMA +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Channel 02/0 : 13[5] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 02/0 : 4[4] -> 12[4] [receive] via NET/IB/2/GDRDMA +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 06/0 : 4[4] -> 12[4] [receive] via NET/IB/2/GDRDMA +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL 
INFO Channel 02/0 : 12[4] -> 4[4] [send] via NET/IB/2/GDRDMA +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 06/0 : 12[4] -> 4[4] [send] via NET/IB/2/GDRDMA +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 05/0 : 8[0] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Channel 06/0 : 13[5] -> 12[4] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 07/0 : 8[0] -> 15[7] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Channel 01/0 : 15[7] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Channel 01/0 : 11[3] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [receive] via NET/IB/0/GDRDMA +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 04/0 : 0[0] -> 8[0] [receive] via NET/IB/0/GDRDMA +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 00/0 : 8[0] -> 0[0] [send] via NET/IB/0/GDRDMA +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Channel 04/0 : 8[0] -> 0[0] [send] via NET/IB/0/GDRDMA +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Channel 02/0 : 15[7] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Channel 05/0 : 11[3] -> 10[2] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 01/0 : 12[4] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Channel 03/0 : 15[7] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 03/0 : 12[4] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Channel 05/0 : 15[7] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 05/0 : 12[4] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Channel 06/0 : 15[7] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Channel 07/0 : 12[4] -> 11[3] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Channel 07/0 : 15[7] -> 8[0] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Channel 03/0 : 15[7] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Channel 07/0 : 15[7] -> 14[6] via P2P/IPC/read +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO Connected all trees +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO 8 coll channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO Connected all trees +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO 8 coll channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO Connected all trees +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO 8 coll channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO Connected all trees +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO 8 coll channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO Connected all trees +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 
+dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO 8 coll channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO Connected all trees +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO 8 coll channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO Connected all trees +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO 8 coll channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO Connected all trees +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO 8 coll channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer +dlc1bfcxl3sf01eb-worker-0:73:3079 [0] NCCL INFO comm 0x7f22740374a0 rank 8 nranks 16 cudaDev 0 nvmlDev 0 busId 10 commId 0x338d070d42255f9a - Init COMPLETE +dlc1bfcxl3sf01eb-worker-0:77:3080 [4] NCCL INFO comm 0x7f08a4037770 rank 12 nranks 16 cudaDev 4 nvmlDev 4 busId 50 commId 0x338d070d42255f9a - Init COMPLETE +dlc1bfcxl3sf01eb-worker-0:79:3082 [6] NCCL INFO comm 0x7f2518037710 rank 14 nranks 16 cudaDev 6 nvmlDev 6 busId 70 commId 0x338d070d42255f9a - Init COMPLETE +dlc1bfcxl3sf01eb-worker-0:75:3081 [2] NCCL INFO comm 0x7fbad80375f0 rank 10 nranks 16 cudaDev 2 nvmlDev 2 busId 30 commId 0x338d070d42255f9a - Init COMPLETE +dlc1bfcxl3sf01eb-worker-0:74:3078 [1] NCCL INFO comm 0x7fee7c037660 rank 9 nranks 16 cudaDev 1 nvmlDev 1 busId 20 commId 0x338d070d42255f9a - Init COMPLETE +dlc1bfcxl3sf01eb-worker-0:78:3075 [5] NCCL INFO comm 0x7faef00375b0 rank 13 nranks 16 cudaDev 5 nvmlDev 5 busId 60 commId 0x338d070d42255f9a - Init COMPLETE +dlc1bfcxl3sf01eb-worker-0:80:3077 [7] NCCL INFO comm 0x7fa03c037ad0 rank 15 nranks 16 cudaDev 7 nvmlDev 7 busId 80 commId 0x338d070d42255f9a - Init COMPLETE +dlc1bfcxl3sf01eb-worker-0:76:3076 [3] NCCL INFO comm 0x7f6af80377d0 rank 11 nranks 16 cudaDev 3 nvmlDev 3 busId 40 commId 0x338d070d42255f9a - Init COMPLETE +[INFO|trainer.py:1962] 2025-01-17 17:16:54,974 >> + +Training completed. 
Do not forget to share your model on huggingface.co/models =) + + +dlc1bfcxl3sf01eb-worker-0:77:3090 [4] NCCL INFO [Service thread] Connection closed by localRank 1 +dlc1bfcxl3sf01eb-worker-0:77:354 [4] NCCL INFO [Service thread] Connection closed by localRank 1 +dlc1bfcxl3sf01eb-worker-0:79:3088 [0] NCCL INFO [Service thread] Connection closed by localRank 1 +dlc1bfcxl3sf01eb-worker-0:79:361 [0] NCCL INFO [Service thread] Connection closed by localRank 1 +dlc1bfcxl3sf01eb-worker-0:77:3090 [4] NCCL INFO [Service thread] Connection closed by localRank 3 +dlc1bfcxl3sf01eb-worker-0:77:3090 [4] NCCL INFO [Service thread] Connection closed by localRank 7 +dlc1bfcxl3sf01eb-worker-0:77:3090 [4] NCCL INFO [Service thread] Connection closed by localRank 2 +dlc1bfcxl3sf01eb-worker-0:77:3090 [4] NCCL INFO [Service thread] Connection closed by localRank 0 +dlc1bfcxl3sf01eb-worker-0:77:354 [4] NCCL INFO [Service thread] Connection closed by localRank 2 +dlc1bfcxl3sf01eb-worker-0:77:354 [4] NCCL INFO [Service thread] Connection closed by localRank 0 +dlc1bfcxl3sf01eb-worker-0:77:354 [4] NCCL INFO [Service thread] Connection closed by localRank 3 +dlc1bfcxl3sf01eb-worker-0:77:354 [4] NCCL INFO [Service thread] Connection closed by localRank 7 +dlc1bfcxl3sf01eb-worker-0:77:3090 [4] NCCL INFO [Service thread] Connection closed by localRank 6 +dlc1bfcxl3sf01eb-worker-0:77:354 [4] NCCL INFO [Service thread] Connection closed by localRank 6
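+
+As a final sanity check, the trainer summary at startup is internally consistent: 4 examples per device x 16 ranks x 2 gradient-accumulation steps gives the total train batch size of 128, and 944,379 examples at 128 per optimizer step round up to the 7,378 optimization steps reported:
+
+    import math
+
+    per_device_batch = 4      # "Instantaneous batch size per device"
+    world_size = 16           # nranks 16 in the NCCL init lines
+    grad_accum = 2            # "Gradient Accumulation steps"
+    num_examples = 944_379    # "Num examples"
+
+    total_batch = per_device_batch * world_size * grad_accum
+    steps = math.ceil(num_examples / total_batch)
+    print(total_batch, steps)  # -> 128 7378, matching the trainer log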