Deployment framework

#2
by xro7 - opened

What framework did you use to deploy the model? I tried vLLM on 8xH100 but got the following error.

```
2025-01-22T13:22:49.476492425Z (VllmWorkerProcess pid=362) INFO 01-22 05:22:49 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052249.pkl...
2025-01-22T13:22:49.477126901Z (VllmWorkerProcess pid=363) INFO 01-22 05:22:49 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052249.pkl...
2025-01-22T13:22:49.477129206Z (VllmWorkerProcess pid=361) INFO 01-22 05:22:49 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052249.pkl...
```
Cognitive Computations org

Can you provide the full log and your startup command?

I kept the logs from my 4xH200 experiment; the 8xH100 run failed with the same error.

vLLM parameters:

```
--host 0.0.0.0 --port 8000 --model cognitivecomputations/DeepSeek-R1-AWQ --gpu-memory-utilization 0.95 --tensor-parallel-size=4 --trust_remote_code
```
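
(These flags are passed to vLLM's OpenAI-compatible server; the log below shows it starting via `api_server.py`. The full invocation would look something like this sketch:)

```
python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 --port 8000 \
  --model cognitivecomputations/DeepSeek-R1-AWQ \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size=4 \
  --trust_remote_code
```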

Logs:

```
2025-01-22T13:07:50.421133598Z INFO 01-22 05:07:50 api_server.py:712] vLLM API server version 0.6.6.post1
2025-01-22T13:07:50.421303357Z INFO 01-22 05:07:50 api_server.py:713] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='', lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='cognitivecomputations/DeepSeek-R1-AWQ', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=30000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
2025-01-22T13:07:50.430906961Z INFO 01-22 05:07:50 api_server.py:199] Started engine process with PID 89
2025-01-22T13:07:50.643475046Z INFO 01-22 05:07:50 config.py:131] Replacing legacy 'type' key with 'rope_type'
2025-01-22T13:07:53.969666636Z INFO 01-22 05:07:53 config.py:131] Replacing legacy 'type' key with 'rope_type'
2025-01-22T13:07:55.208259634Z INFO 01-22 05:07:55 config.py:510] This model supports multiple tasks: {'score', 'generate', 'reward', 'classify', 'embed'}. Defaulting to 'generate'.
2025-01-22T13:07:55.844302051Z INFO 01-22 05:07:55 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
2025-01-22T13:07:55.888077160Z INFO 01-22 05:07:55 config.py:1310] Defaulting to use mp for distributed inference
2025-01-22T13:07:55.888171171Z WARNING 01-22 05:07:55 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
2025-01-22T13:07:55.888191894Z WARNING 01-22 05:07:55 config.py:642] Async output processing is not supported on the current platform type cuda.
2025-01-22T13:07:58.487429442Z INFO 01-22 05:07:58 config.py:510] This model supports multiple tasks: {'embed', 'reward', 'classify', 'generate', 'score'}. Defaulting to 'generate'.
2025-01-22T13:07:59.106749422Z INFO 01-22 05:07:59 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
2025-01-22T13:07:59.150778826Z INFO 01-22 05:07:59 config.py:1310] Defaulting to use mp for distributed inference
2025-01-22T13:07:59.150878529Z WARNING 01-22 05:07:59 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
2025-01-22T13:07:59.150900534Z WARNING 01-22 05:07:59 config.py:642] Async output processing is not supported on the current platform type cuda.
2025-01-22T13:07:59.173686852Z INFO 01-22 05:07:59 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='cognitivecomputations/DeepSeek-R1-AWQ', speculative_config=None, tokenizer='cognitivecomputations/DeepSeek-R1-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=30000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=cognitivecomputations/DeepSeek-R1-AWQ, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,
2025-01-22T13:07:59.578249195Z WARNING 01-22 05:07:59 multiproc_worker_utils.py:312] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
2025-01-22T13:07:59.583556350Z INFO 01-22 05:07:59 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
2025-01-22T13:07:59.644588714Z INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.700087505Z (VllmWorkerProcess pid=361) INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.700196810Z (VllmWorkerProcess pid=361) INFO 01-22 05:07:59 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
2025-01-22T13:07:59.719623814Z (VllmWorkerProcess pid=363) INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.719626424Z (VllmWorkerProcess pid=362) INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.719717955Z (VllmWorkerProcess pid=362) INFO 01-22 05:07:59 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
2025-01-22T13:07:59.719719661Z (VllmWorkerProcess pid=363) INFO 01-22 05:07:59 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
2025-01-22T13:08:03.041911052Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.041943685Z INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.042058901Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:03.042067625Z INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:03.042084177Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.042089699Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.042269576Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:03.042297664Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:04.762844790Z INFO 01-22 05:08:04 custom_all_reduce_utils.py:204] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714251803Z INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714368438Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714371653Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714609456Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.747454967Z INFO 01-22 05:08:19 shm_broadcast.py:255] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_ed9c7126'), local_subscribe_port=53933, remote_subscribe_port=None)
2025-01-22T13:08:19.783713362Z INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:19.783863117Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:19.784442981Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:19.784445640Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:20.194644565Z Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.194662173Z INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:08:20.234273784Z (VllmWorkerProcess pid=361) Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.234294554Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:08:20.234579652Z (VllmWorkerProcess pid=362) Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.234583739Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:08:20.243174528Z (VllmWorkerProcess pid=363) Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.243179760Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:20:31.182095071Z
Loading safetensors checkpoint shards:   0% Completed | 0/74 [00:00<?, ?it/s]
[... intermediate shard-loading progress lines omitted ...]
2025-01-22T13:21:05.153431782Z
Loading safetensors checkpoint shards: 100% Completed | 74/74 [00:33<00:00,  2.18it/s]
2025-01-22T13:21:22.200528061Z (VllmWorkerProcess pid=361) INFO 01-22 05:21:22 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:23.235285583Z (VllmWorkerProcess pid=362) INFO 01-22 05:21:23 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:23.662143488Z (VllmWorkerProcess pid=363) INFO 01-22 05:21:23 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:24.130898012Z INFO 01-22 05:21:24 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:25.970200306Z (VllmWorkerProcess pid=362) INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.970911363Z INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.973483812Z (VllmWorkerProcess pid=363) INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.973539062Z (VllmWorkerProcess pid=361) INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.978851389Z (VllmWorkerProcess pid=362) INFO 01-22 05:21:25 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl.
2025-01-22T13:21:25.979807043Z INFO 01-22 05:21:25 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl.
2025-01-22T13:21:25.981052641Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
2025-01-22T13:21:25.981054883Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] Traceback (most recent call last):
2025-01-22T13:21:25.981057016Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
2025-01-22T13:21:25.981058711Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return func(*args, **kwargs)
2025-01-22T13:21:25.981060648Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981061614Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1691, in execute_model
2025-01-22T13:21:25.981062571Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     hidden_or_intermediate_states = model_executable(
2025-01-22T13:21:25.981064343Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]                                     ^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981067242Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981068486Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981069736Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981070984Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981072436Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981080706Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981082054Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 527, in forward
2025-01-22T13:21:25.981083090Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     hidden_states = self.model(input_ids, positions, kv_caches,
2025-01-22T13:21:25.981084195Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981085485Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981086427Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981087552Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981088523Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981089556Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981090564Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981091589Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 483, in forward
2025-01-22T13:21:25.981092501Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     hidden_states, residual = layer(positions, hidden_states,
2025-01-22T13:21:25.981093914Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981094910Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981096149Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981097296Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981098229Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981099158Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981100096Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981101044Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 412, in forward
2025-01-22T13:21:25.981101969Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     hidden_states = self.mlp(hidden_states)
2025-01-22T13:21:25.981104922Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]                     ^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981105903Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981106883Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981107795Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981108916Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981109861Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981110821Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981111777Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 158, in forward
2025-01-22T13:21:25.981112700Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     final_hidden_states = self.experts(
2025-01-22T13:21:25.981113637Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]                           ^^^^^^^^^^^^^
2025-01-22T13:21:25.981114804Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981115921Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981117012Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981118081Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981119209Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981120307Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981121683Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 522, in forward
2025-01-22T13:21:25.981123129Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     final_hidden_states = self.quant_method.apply(
2025-01-22T13:21:25.981124109Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]                           ^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981125040Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 463, in apply
2025-01-22T13:21:25.981126118Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return torch.ops.vllm.fused_marlin_moe(
2025-01-22T13:21:25.981127065Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981129560Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1116, in __call__
2025-01-22T13:21:25.981130829Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return self._op(*args, **(kwargs or {}))
2025-01-22T13:21:25.981132369Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981133322Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 202, in fused_marlin_moe
2025-01-22T13:21:25.981134990Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     assert hidden_states.dtype == torch.float16
2025-01-22T13:21:25.981135915Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981136986Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] AssertionError
2025-01-22T13:21:25.981138483Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]
2025-01-22T13:21:25.981140056Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] The above exception was the direct cause of the following exception:
2025-01-22T13:21:25.981141343Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]
2025-01-22T13:21:25.981142564Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] Traceback (most recent call last):
2025-01-22T13:21:25.981143687Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_worker_utils.py", line 230, in _run_worker_process
2025-01-22T13:21:25.981144781Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     output = executor(*args, **kwargs)
2025-01-22T13:21:25.981145931Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]              ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981147028Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-01-22T13:21:25.981148165Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return func(*args, **kwargs)
2025-01-22T13:21:25.981149122Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981150069Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 202, in determine_num_available_blocks
2025-01-22T13:21:25.981150994Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     self.model_runner.profile_run()
2025-01-22T13:21:25.981151936Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-01-22T13:21:25.981152905Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return func(*args, **kwargs)
2025-01-22T13:21:25.981154018Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981155223Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1331, in profile_run
2025-01-22T13:21:25.981158094Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     self.execute_model(model_input, kv_caches, intermediate_tensors)
2025-01-22T13:21:25.981159193Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-01-22T13:21:25.981160159Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return func(*args, **kwargs)
2025-01-22T13:21:25.981161245Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981162372Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
2025-01-22T13:21:25.981179430Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     raise type(err)(
2025-01-22T13:21:25.981180548Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25
```
Cognitive Computations org

Add `--dtype float16`, or use the new moe_wna16 kernel, which has to be built from source. The traceback shows why: the awq_marlin fused-MoE kernel asserts `hidden_states.dtype == torch.float16`, while the engine defaulted to `dtype=torch.bfloat16` (visible in the engine config logged above).
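
Concretely, the failing command from above should work with one flag added (a sketch; every flag except `--dtype float16` is unchanged from the original command):

```
python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 --port 8000 \
  --model cognitivecomputations/DeepSeek-R1-AWQ \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size=4 \
  --trust_remote_code \
  --dtype float16
```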

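
Alternatively, for the moe_wna16 path: vLLM 0.6.6.post1 does not ship that kernel, so a source build is needed. A rough sketch, assuming the standard vLLM source-build procedure and that the kernel is selected with `--quantization moe_wna16` (verify both against the vLLM docs for your revision):

```
# Build vLLM from source (compilation can take a while)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .

# Launch with the MoE WNA16 kernel selected explicitly (assumed flag value)
python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 --port 8000 \
  --model cognitivecomputations/DeepSeek-R1-AWQ \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size=4 \
  --trust_remote_code \
  --quantization moe_wna16
```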