Error while fine-tuning

#6
by polieste - opened

I am fine-tuning this model on my own dataset. Here is the framework/version information I used:

  • python: 3.10
  • torch: 2.0.1
  • CUDA: 11.7
  • GPU: RTX 3090
  • Additional libraries: installed as in the instructions on HF
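For context, my custom meta file shell/data/custom_fintune_datasets.json follows the InternVL fine-tuning meta format. A rough sketch of it is below (the root and annotation paths are placeholders for my local data; only the dataset name and length match the real file):

    {
      "vi-medical-vqa": {
        "root": "data/vi_medical_vqa/images/",
        "annotation": "data/vi_medical_vqa/train.jsonl",
        "data_augment": false,
        "repeat_time": 1,
        "length": 4349
      }
    }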
But when I run the fine-tuning script I get the error below and I can't fix it. Please help me:
    /workspace/Vintern/internvl_chat
  • GPUS=1
  • BATCH_SIZE=1
  • PER_DEVICE_BATCH_SIZE=1
  • GRADIENT_ACC=1
  • pwd
  • export PYTHONPATH=:/workspace/Vintern/internvl_chat
  • export MASTER_PORT=34229
  • export TF_CPP_MIN_LOG_LEVEL=3
  • export LAUNCHER=pytorch
  • OUTPUT_DIR=work_dirs/internvl_chat_v2_0/Vintern_1B_v2_finetune_lora_viet_medical_vqa
  • [ ! -d work_dirs/internvl_chat_v2_0/Vintern_1B_v2_finetune_lora_viet_medical_vqa ]
  • torchrun --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --nproc_per_node=1 --master_port=34229 internvl/train/internvl_chat_finetune.py --model_name_or_path ../pretrained/Vintern-1B-v2 --conv_style Hermes-2 --output_dir work_dirs/internvl_chat_v2_0/Vintern_1B_v2_finetune_lora_viet_medical_vqa --meta_path shell/data/custom_fintune_datasets.json --overwrite_output_dir True --force_image_size 448 --max_dynamic_patch 6 --down_sample_ratio 0.5 --drop_path_rate 0.0 --freeze_llm True --freeze_mlp True --freeze_backbone True --use_llm_lora 16 --vision_select_layer -1 --dataloader_num_workers 4 --bf16 True --num_train_epochs 1 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 500 --save_total_limit 2 --learning_rate 4e-5 --weight_decay 0.01 --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 10 --max_seq_length 700 --do_train True --grad_checkpoint True --group_by_length True --dynamic_image_size True --use_thumbnail True --ps_version v2 --deepspeed zero_stage1_config.json --report_to tensorboard
  • tee -a work_dirs/internvl_chat_v2_0/Vintern_1B_v2_finetune_lora_viet_medical_vqa/training_log.txt
    [2024-11-12 08:00:32,227] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    /opt/conda/lib/python3.10/site-packages/torchvision/datapoints/__init__.py:12: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
    warnings.warn(_BETA_TRANSFORMS_WARNING)
    /opt/conda/lib/python3.10/site-packages/torchvision/transforms/v2/__init__.py:54: UserWarning: The torchvision.datapoints and torchvision.transforms.v2 namespaces are still Beta. While we do not expect major breaking changes, some APIs may still change according to user feedback. Please submit any feedback you may have in this issue: https://github.com/pytorch/vision/issues/6753, and you can also check out https://github.com/pytorch/vision/issues/7319 to learn more about the APIs that we suspect might involve future changes. You can silence this warning by calling torchvision.disable_beta_transforms_warning().
    warnings.warn(_BETA_TRANSFORMS_WARNING)
    /opt/conda/lib/python3.10/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
    warnings.warn(f"Importing from {name} is deprecated, please import via timm.layers", FutureWarning)
    petrel_client is not installed. If you read data locally instead of from ceph, ignore it.
    Replace train sampler!!
    petrel_client is not installed. Using PIL to load images.
    [2024-11-12 08:00:36,992] [INFO] [comm.py:652:init_distributed] cdb=None
    [2024-11-12 08:00:36,992] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
    11/12/2024 08:00:37 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
    11/12/2024 08:00:37 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
    _n_gpu=1,
    accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
    adafactor=False,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    auto_find_batch_size=False,
    average_tokens_across_devices=False,
    batch_eval_metrics=False,
    bf16=True,
    bf16_full_eval=False,
    data_seed=None,
    dataloader_drop_last=False,
    dataloader_num_workers=4,
    dataloader_persistent_workers=False,
    dataloader_pin_memory=True,
    dataloader_prefetch_factor=None,
    ddp_backend=None,
    ddp_broadcast_buffers=None,
    ddp_bucket_cap_mb=None,
    ddp_find_unused_parameters=None,
    ddp_timeout=1800,
    debug=[],
    deepspeed=zero_stage1_config.json,
    disable_tqdm=False,
    dispatch_batches=None,
    do_eval=False,
    do_predict=False,
    do_train=True,
    eval_accumulation_steps=None,
    eval_delay=0,
    eval_do_concat_batches=True,
    eval_on_start=False,
    eval_steps=None,
    eval_strategy=no,
    eval_use_gather_object=False,
    evaluation_strategy=no,
    fp16=False,
    fp16_backend=auto,
    fp16_full_eval=False,
    fp16_opt_level=O1,
    fsdp=[],
    fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
    fsdp_min_num_params=0,
    fsdp_transformer_layer_cls_to_wrap=None,
    full_determinism=False,
    gradient_accumulation_steps=1,
    gradient_checkpointing=False,
    gradient_checkpointing_kwargs=None,
    greater_is_better=None,
    group_by_length=True,
    half_precision_backend=auto,
    hub_always_push=False,
    hub_model_id=None,
    hub_private_repo=False,
    hub_strategy=every_save,
    hub_token=<HUB_TOKEN>,
    ignore_data_skip=False,
    include_for_metrics=[],
    include_inputs_for_metrics=False,
    include_num_input_tokens_seen=False,
    include_tokens_per_second=False,
    jit_mode_eval=False,
    label_names=None,
    label_smoothing_factor=0.0,
    learning_rate=4e-05,
    length_column_name=length,
    load_best_model_at_end=False,
    local_rank=0,
    log_level=passive,
    log_level_replica=warning,
    log_on_each_node=True,
    logging_dir=work_dirs/internvl_chat_v2_0/Vintern_1B_v2_finetune_lora_viet_medical_vqa/runs/Nov12_08-00-37_c820fe065355,
    logging_first_step=False,
    logging_nan_inf_filter=True,
    logging_steps=10,
    logging_strategy=steps,
    lr_scheduler_kwargs={},
    lr_scheduler_type=cosine,
    max_grad_norm=1.0,
    max_steps=-1,
    metric_for_best_model=None,
    mp_parameters=,
    neftune_noise_alpha=None,
    no_cuda=False,
    num_train_epochs=1.0,
    optim=adamw_torch,
    optim_args=None,
    optim_target_modules=None,
    output_dir=work_dirs/internvl_chat_v2_0/Vintern_1B_v2_finetune_lora_viet_medical_vqa,
    overwrite_output_dir=True,
    past_index=-1,
    per_device_eval_batch_size=8,
    per_device_train_batch_size=1,
    prediction_loss_only=False,
    push_to_hub=False,
    push_to_hub_model_id=None,
    push_to_hub_organization=None,
    push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
    ray_scope=last,
    remove_unused_columns=True,
    report_to=['tensorboard'],
    restore_callback_states_from_checkpoint=False,
    resume_from_checkpoint=None,
    run_name=work_dirs/internvl_chat_v2_0/Vintern_1B_v2_finetune_lora_viet_medical_vqa,
    save_on_each_node=False,
    save_only_model=False,
    save_safetensors=True,
    save_steps=500,
    save_strategy=steps,
    save_total_limit=2,
    seed=42,
    skip_memory_metrics=True,
    split_batches=None,
    tf32=None,
    torch_compile=False,
    torch_compile_backend=None,
    torch_compile_mode=None,
    torch_empty_cache_steps=None,
    torchdynamo=None,
    tpu_metrics_debug=False,
    tpu_num_cores=None,
    use_cpu=False,
    use_ipex=False,
    use_legacy_prediction_loop=False,
    use_liger_kernel=False,
    use_mps_device=False,
    warmup_ratio=0.03,
    warmup_steps=0,
    weight_decay=0.01,
    )
    11/12/2024 08:00:37 - INFO - __main__ - Loading Tokenizer: ../pretrained/Vintern-1B-v2
    [INFO|tokenization_utils_base.py:2209] 2024-11-12 08:00:37,038 >> loading file vocab.json
    [INFO|tokenization_utils_base.py:2209] 2024-11-12 08:00:37,039 >> loading file merges.txt
    [INFO|tokenization_utils_base.py:2209] 2024-11-12 08:00:37,039 >> loading file added_tokens.json
    [INFO|tokenization_utils_base.py:2209] 2024-11-12 08:00:37,039 >> loading file special_tokens_map.json
    [INFO|tokenization_utils_base.py:2209] 2024-11-12 08:00:37,039 >> loading file tokenizer_config.json
    [INFO|tokenization_utils_base.py:2209] 2024-11-12 08:00:37,039 >> loading file tokenizer.json
    [INFO|tokenization_utils_base.py:2475] 2024-11-12 08:00:37,314 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    11/12/2024 08:00:37 - INFO - __main__ - Loading InternVLChatModel...
    [INFO|configuration_utils.py:677] 2024-11-12 08:00:37,322 >> loading configuration file ../pretrained/Vintern-1B-v2/config.json
    [INFO|configuration_utils.py:746] 2024-11-12 08:00:37,324 >> Model config InternVLChatConfig {
    "_commit_hash": null,
    "_name_or_path": "khang119966/vintern-final",
    "architectures": [
    "InternVLChatModel"
    ],
    "auto_map": {
    "AutoConfig": "5CD-AI/Vintern-1B-v2--configuration_internvl_chat.InternVLChatConfig",
    "AutoModel": "5CD-AI/Vintern-1B-v2--modeling_internvl_chat.InternVLChatModel",
    "AutoModelForCausalLM": "5CD-AI/Vintern-1B-v2--modeling_internvl_chat.InternVLChatModel"
    },
    "downsample_ratio": 0.5,
    "dynamic_image_size": true,
    "force_image_size": 448,
    "llm_config": {
    "_attn_implementation_autoset": false,
    "_name_or_path": "Qwen/Qwen2-0.5B-Instruct",
    "add_cross_attention": false,
    "architectures": [
    "Qwen2ForCausalLM"
    ],
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": 151643,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 151645,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "silu",
    "hidden_size": 896,
    "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
    },
    "initializer_range": 0.02,
    "intermediate_size": 4864,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
    },
    "length_penalty": 1.0,
    "max_length": 20,
    "max_position_embeddings": 32768,
    "max_window_layers": 24,
    "min_length": 0,
    "model_type": "qwen2",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 14,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 24,
    "num_key_value_heads": 2,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rms_norm_eps": 1e-06,
    "rope_scaling": null,
    "rope_theta": 1000000.0,
    "sep_token_id": null,
    "sliding_window": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.46.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_cache": true,
    "use_sliding_window": false,
    "vocab_size": 151655
    },
    "max_dynamic_patch": 12,
    "min_dynamic_patch": 1,
    "model_type": "internvl_chat",
    "pad2square": false,
    "ps_version": "v2",
    "select_layer": -1,
    "template": "Hermes-2",
    "torch_dtype": "bfloat16",
    "transformers_version": null,
    "use_backbone_lora": 0,
    "use_llm_lora": 0,
    "use_thumbnail": true,
    "vision_config": {
    "_attn_implementation_autoset": false,
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": [
    "InternVisionModel"
    ],
    "attention_dropout": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "drop_path_rate": 0.0,
    "dropout": 0.0,
    "early_stopping": false,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_size": 1024,
    "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
    },
    "image_size": 448,
    "initializer_factor": 1.0,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
    },
    "layer_norm_eps": 1e-06,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "model_type": "intern_vit_6b",
    "no_repeat_ngram_size": 0,
    "norm_type": "layer_norm",
    "num_attention_heads": 16,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_hidden_layers": 24,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 14,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "qk_normalization": false,
    "qkv_bias": true,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": "bfloat16",
    "torchscript": false,
    "transformers_version": "4.46.2",
    "typical_p": 1.0,
    "use_bfloat16": true,
    "use_flash_attn": false
    }
    }

11/12/2024 08:00:37 - INFO - __main__ - Using flash_attention_2 for LLaMA
[INFO|modeling_utils.py:3934] 2024-11-12 08:00:37,325 >> loading weights file ../pretrained/Vintern-1B-v2/model.safetensors
[INFO|modeling_utils.py:1670] 2024-11-12 08:00:37,346 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1096] 2024-11-12 08:00:37,348 >> Generate config GenerationConfig {}

[INFO|configuration_utils.py:1096] 2024-11-12 08:00:37,395 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151645
}

[INFO|modeling_utils.py:4800] 2024-11-12 08:00:38,361 >> All model checkpoint weights were used when initializing InternVLChatModel.

[INFO|modeling_utils.py:4808] 2024-11-12 08:00:38,361 >> All the weights of InternVLChatModel were initialized from the model checkpoint at ../pretrained/Vintern-1B-v2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use InternVLChatModel for predictions without further training.
[INFO|configuration_utils.py:1049] 2024-11-12 08:00:38,369 >> loading configuration file ../pretrained/Vintern-1B-v2/generation_config.json
[INFO|configuration_utils.py:1096] 2024-11-12 08:00:38,370 >> Generate config GenerationConfig {}

11/12/2024 08:00:38 - INFO - __main__ - Finished
11/12/2024 08:00:38 - INFO - __main__ - model.config.force_image_size: 448
11/12/2024 08:00:38 - INFO - __main__ - data_args.force_image_size: 448
11/12/2024 08:00:38 - INFO - __main__ - model.config.vision_config.image_size: 448
11/12/2024 08:00:38 - INFO - __main__ - [Dataset] num_image_token: 256
11/12/2024 08:00:38 - INFO - __main__ - [Dataset] dynamic_image_size: True
11/12/2024 08:00:38 - INFO - __main__ - [Dataset] use_thumbnail: True
11/12/2024 08:00:38 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6
11/12/2024 08:00:38 - INFO - __main__ - Formatting inputs...Skip in lazy mode
11/12/2024 08:00:38 - INFO - __main__ - Add dataset: vi-medical-vqa with length: 4349
trainable params: 8,798,208 || all params: 638,462,080 || trainable%: 1.3780
11/12/2024 08:00:39 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight
11/12/2024 08:00:39 - INFO - __main__ - language_model.base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight
...
...
...
11/12/2024 08:00:39 - INFO - __main__ - language_model.base_model.model.model.layers.23.mlp.down_proj.lora_A.default.weight
11/12/2024 08:00:39 - INFO - __main__ - language_model.base_model.model.model.layers.23.mlp.down_proj.lora_B.default.weight
11/12/2024 08:00:39 - WARNING - accelerate.utils.other - Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:699] 2024-11-12 08:00:39,267 >> Using auto half precision backend
[WARNING|trainer.py:761] 2024-11-12 08:00:39,450 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:761] 2024-11-12 08:00:39,450 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[2024-11-12 08:00:39,481] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.15.4, git-hash=unknown, git-branch=unknown
[2024-11-12 08:00:39,481] [INFO] [config.py:733:init] Config mesh_device None world_size = 1
[2024-11-12 08:00:41,092] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Traceback (most recent call last):
File "/workspace/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 847, in
main()
File "/workspace/Vintern/internvl_chat/internvl/train/internvl_chat_finetune.py", line 832, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2123, in train
return inner_training_loop(
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2278, in _inner_training_loop
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1323, in prepare
result = self._prepare_deepspeed(*args)
File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1842, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = ds_initialize(**kwargs)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/init.py", line 193, in initialize
engine = DeepSpeedEngine(args=args,
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 313, in init
self._configure_optimizer(optimizer, model_parameters)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1276, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1353, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in init
fused_adam_cuda = FusedAdamBuilder().load()
File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
return self.jit_load(verbose)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
op_module = load(name=self.name,
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
_write_ninja_file_and_build_library(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1598, in _write_ninja_file_and_build_library
get_compiler_abi_compatibility_and_version(compiler)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 337, in get_compiler_abi_compatibility_and_version
if not check_compiler_ok_for_platform(compiler):
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 291, in check_compiler_ok_for_platform
which = subprocess.check_output(['which', compiler], stderr=subprocess.STDOUT)
File "/opt/conda/lib/python3.10/subprocess.py", line 421, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 7111) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

internvl/train/internvl_chat_finetune.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-11-12_08:00:44
host : c820fe065355
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 7111)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
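
From the traceback, the immediate failure seems to happen while DeepSpeed JIT-compiles its FusedAdam CUDA extension: torch.utils.cpp_extension looks for a C++ compiler and "which c++" returns a non-zero exit status, so I suspect the container simply has no g++ installed. Is installing a compiler toolchain before launching the script the right fix? A sketch of what I plan to try (assuming an apt-based image; the package names are my guess):

    # check whether torch's extension builder can find a C++ compiler
    which c++ || which g++
    # if not, install a toolchain (assumption: Debian/Ubuntu base image)
    apt-get update && apt-get install -y build-essential ninja-build

Alternatively, if I read the DeepSpeed docs correctly, setting "torch_adam": true in the optimizer section of zero_stage1_config.json should make DeepSpeed fall back to the plain PyTorch AdamW and skip the fused CUDA build. Is one of these the recommended fix, or am I missing something else?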
