Deployment framework

#2
by xro7 - opened

What framework did you use to deploy the model? I tried vLLM on 8xH100 but got the following error.

```
2025-01-22T13:22:49.476492425Z (VllmWorkerProcess pid=362) INFO 01-22 05:22:49 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052249.pkl...
2025-01-22T13:22:49.477126901Z (VllmWorkerProcess pid=363) INFO 01-22 05:22:49 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052249.pkl...
2025-01-22T13:22:49.477129206Z (VllmWorkerProcess pid=361) INFO 01-22 05:22:49 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052249.pkl...
```
Cognitive Computations org

Can you provide the full log and your startup command?

I kept the logs from my 4xH200 experiment; the 8xH100 run failed with the same error.

vLLM parameters:

```
--host 0.0.0.0 --port 8000 --model cognitivecomputations/DeepSeek-R1-AWQ --gpu-memory-utilization 0.95 --tensor-parallel-size=4 --trust_remote_code
```
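
(These flags are passed to vLLM's OpenAI-compatible server; the log below shows it starting via `api_server.py`. The full invocation would look something like this sketch:)

```
python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 --port 8000 \
  --model cognitivecomputations/DeepSeek-R1-AWQ \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size=4 \
  --trust_remote_code
```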

Logs:

```
2025-01-22T13:07:50.421133598Z INFO 01-22 05:07:50 api_server.py:712] vLLM API server version 0.6.6.post1
2025-01-22T13:07:50.421303357Z INFO 01-22 05:07:50 api_server.py:713] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='', lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='cognitivecomputations/DeepSeek-R1-AWQ', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=30000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
2025-01-22T13:07:50.430906961Z INFO 01-22 05:07:50 api_server.py:199] Started engine process with PID 89
2025-01-22T13:07:50.643475046Z INFO 01-22 05:07:50 config.py:131] Replacing legacy 'type' key with 'rope_type'
2025-01-22T13:07:53.969666636Z INFO 01-22 05:07:53 config.py:131] Replacing legacy 'type' key with 'rope_type'
2025-01-22T13:07:55.208259634Z INFO 01-22 05:07:55 config.py:510] This model supports multiple tasks: {'score', 'generate', 'reward', 'classify', 'embed'}. Defaulting to 'generate'.
2025-01-22T13:07:55.844302051Z INFO 01-22 05:07:55 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
2025-01-22T13:07:55.888077160Z INFO 01-22 05:07:55 config.py:1310] Defaulting to use mp for distributed inference
2025-01-22T13:07:55.888171171Z WARNING 01-22 05:07:55 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
2025-01-22T13:07:55.888191894Z WARNING 01-22 05:07:55 config.py:642] Async output processing is not supported on the current platform type cuda.
2025-01-22T13:07:58.487429442Z INFO 01-22 05:07:58 config.py:510] This model supports multiple tasks: {'embed', 'reward', 'classify', 'generate', 'score'}. Defaulting to 'generate'.
2025-01-22T13:07:59.106749422Z INFO 01-22 05:07:59 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
2025-01-22T13:07:59.150778826Z INFO 01-22 05:07:59 config.py:1310] Defaulting to use mp for distributed inference
2025-01-22T13:07:59.150878529Z WARNING 01-22 05:07:59 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
2025-01-22T13:07:59.150900534Z WARNING 01-22 05:07:59 config.py:642] Async output processing is not supported on the current platform type cuda.
2025-01-22T13:07:59.173686852Z INFO 01-22 05:07:59 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='cognitivecomputations/DeepSeek-R1-AWQ', speculative_config=None, tokenizer='cognitivecomputations/DeepSeek-R1-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=30000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=cognitivecomputations/DeepSeek-R1-AWQ, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,
2025-01-22T13:07:59.578249195Z WARNING 01-22 05:07:59 multiproc_worker_utils.py:312] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
2025-01-22T13:07:59.583556350Z INFO 01-22 05:07:59 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
2025-01-22T13:07:59.644588714Z INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.700087505Z (VllmWorkerProcess pid=361) INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.700196810Z (VllmWorkerProcess pid=361) INFO 01-22 05:07:59 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
2025-01-22T13:07:59.719623814Z (VllmWorkerProcess pid=363) INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.719626424Z (VllmWorkerProcess pid=362) INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.719717955Z (VllmWorkerProcess pid=362) INFO 01-22 05:07:59 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
2025-01-22T13:07:59.719719661Z (VllmWorkerProcess pid=363) INFO 01-22 05:07:59 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
2025-01-22T13:08:03.041911052Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.041943685Z INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.042058901Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:03.042067625Z INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:03.042084177Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.042089699Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.042269576Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:03.042297664Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:04.762844790Z INFO 01-22 05:08:04 custom_all_reduce_utils.py:204] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714251803Z INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714368438Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714371653Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714609456Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.747454967Z INFO 01-22 05:08:19 shm_broadcast.py:255] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_ed9c7126'), local_subscribe_port=53933, remote_subscribe_port=None)
2025-01-22T13:08:19.783713362Z INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:19.783863117Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:19.784442981Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:19.784445640Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:20.194644565Z Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.194662173Z INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:08:20.234273784Z (VllmWorkerProcess pid=361) Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.234294554Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:08:20.234579652Z (VllmWorkerProcess pid=362) Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.234583739Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:08:20.243174528Z (VllmWorkerProcess pid=363) Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.243179760Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:20:31.182095071Z
Loading safetensors checkpoint shards:   0% Completed | 0/74 [00:00<?, ?it/s]
[... intermediate shard-loading progress lines omitted ...]
2025-01-22T13:21:05.153431782Z
Loading safetensors checkpoint shards: 100% Completed | 74/74 [00:33<00:00,  2.18it/s]
2025-01-22T13:21:22.200528061Z (VllmWorkerProcess pid=361) INFO 01-22 05:21:22 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:23.235285583Z (VllmWorkerProcess pid=362) INFO 01-22 05:21:23 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:23.662143488Z (VllmWorkerProcess pid=363) INFO 01-22 05:21:23 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:24.130898012Z INFO 01-22 05:21:24 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:25.970200306Z (VllmWorkerProcess pid=362) INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.970911363Z INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.973483812Z (VllmWorkerProcess pid=363) INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.973539062Z (VllmWorkerProcess pid=361) INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.978851389Z (VllmWorkerProcess pid=362) INFO 01-22 05:21:25 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl.
2025-01-22T13:21:25.979807043Z INFO 01-22 05:21:25 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl.
2025-01-22T13:21:25.981052641Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
2025-01-22T13:21:25.981054883Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] Traceback (most recent call last):
2025-01-22T13:21:25.981057016Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
2025-01-22T13:21:25.981058711Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return func(*args, **kwargs)
2025-01-22T13:21:25.981060648Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981061614Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1691, in execute_model
2025-01-22T13:21:25.981062571Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     hidden_or_intermediate_states = model_executable(
2025-01-22T13:21:25.981064343Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]                                     ^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981067242Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981068486Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981069736Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981070984Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981072436Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981080706Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981082054Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 527, in forward
2025-01-22T13:21:25.981083090Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     hidden_states = self.model(input_ids, positions, kv_caches,
2025-01-22T13:21:25.981084195Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981085485Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981086427Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981087552Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981088523Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981089556Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981090564Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981091589Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 483, in forward
2025-01-22T13:21:25.981092501Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     hidden_states, residual = layer(positions, hidden_states,
2025-01-22T13:21:25.981093914Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981094910Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981096149Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981097296Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981098229Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981099158Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981100096Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981101044Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 412, in forward
2025-01-22T13:21:25.981101969Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     hidden_states = self.mlp(hidden_states)
2025-01-22T13:21:25.981104922Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]                     ^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981105903Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981106883Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981107795Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981108916Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981109861Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981110821Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981111777Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 158, in forward
2025-01-22T13:21:25.981112700Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     final_hidden_states = self.experts(
2025-01-22T13:21:25.981113637Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]                           ^^^^^^^^^^^^^
2025-01-22T13:21:25.981114804Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981115921Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981117012Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981118081Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981119209Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981120307Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981121683Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 522, in forward
2025-01-22T13:21:25.981123129Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     final_hidden_states = self.quant_method.apply(
2025-01-22T13:21:25.981124109Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]                           ^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981125040Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 463, in apply
2025-01-22T13:21:25.981126118Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return torch.ops.vllm.fused_marlin_moe(
2025-01-22T13:21:25.981127065Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981129560Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1116, in __call__
2025-01-22T13:21:25.981130829Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return self._op(*args, **(kwargs or {}))
2025-01-22T13:21:25.981132369Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981133322Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 202, in fused_marlin_moe
2025-01-22T13:21:25.981134990Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     assert hidden_states.dtype == torch.float16
2025-01-22T13:21:25.981135915Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981136986Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] AssertionError
2025-01-22T13:21:25.981138483Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]
2025-01-22T13:21:25.981140056Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] The above exception was the direct cause of the following exception:
2025-01-22T13:21:25.981141343Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]
2025-01-22T13:21:25.981142564Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] Traceback (most recent call last):
2025-01-22T13:21:25.981143687Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_worker_utils.py", line 230, in _run_worker_process
2025-01-22T13:21:25.981144781Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     output = executor(*args, **kwargs)
2025-01-22T13:21:25.981145931Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]              ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981147028Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-01-22T13:21:25.981148165Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return func(*args, **kwargs)
2025-01-22T13:21:25.981149122Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981150069Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 202, in determine_num_available_blocks
2025-01-22T13:21:25.981150994Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     self.model_runner.profile_run()
2025-01-22T13:21:25.981151936Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-01-22T13:21:25.981152905Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return func(*args, **kwargs)
2025-01-22T13:21:25.981154018Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981155223Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1331, in profile_run
2025-01-22T13:21:25.981158094Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     self.execute_model(model_input, kv_caches, intermediate_tensors)
2025-01-22T13:21:25.981159193Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-01-22T13:21:25.981160159Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return func(*args, **kwargs)
2025-01-22T13:21:25.981161245Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981162372Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
2025-01-22T13:21:25.981179430Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     raise type(err)(
2025-01-22T13:21:25.981180548Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25
```
Cognitive Computations org

Add `--dtype float16`, or use the new moe_wna16 kernel, which has to be built from source. The traceback shows why: the awq_marlin fused-MoE kernel asserts `hidden_states.dtype == torch.float16`, while the engine defaulted to `dtype=torch.bfloat16` (visible in the engine config logged above).
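
Concretely, the failing command from above should work with one flag added (a sketch; every flag except `--dtype float16` is unchanged from the original command):

```
python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 --port 8000 \
  --model cognitivecomputations/DeepSeek-R1-AWQ \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size=4 \
  --trust_remote_code \
  --dtype float16
```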

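
Alternatively, for the moe_wna16 path: vLLM 0.6.6.post1 does not ship that kernel, so a source build is needed. A rough sketch, assuming the standard vLLM source-build procedure and that the kernel is selected with `--quantization moe_wna16` (verify both against the vLLM docs for your revision):

```
# Build vLLM from source (compilation can take a while)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .

# Launch with the MoE WNA16 kernel selected explicitly (assumed flag value)
python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 --port 8000 \
  --model cognitivecomputations/DeepSeek-R1-AWQ \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size=4 \
  --trust_remote_code \
  --quantization moe_wna16
```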