update doc, add function call and reason parser back.
README.md
CHANGED
@@ -98,7 +98,9 @@ Our model defaults to using slow-thinking reasoning, and there are two ways to d
1. Pass "enable_thinking=False" when calling apply_chat_template.
2. Adding "/no_think" before the prompt will force the model not to perform CoT reasoning. Similarly, adding "/think" before the prompt will force the model to perform CoT reasoning.

-The following code snippet shows how to use the transformers library to load and apply the model.
+The following code snippet shows how to use the transformers library to load and apply the model.
+It also demonstrates how to enable and disable the reasoning mode,
+and how to parse the reasoning process along with the final output.

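The full snippet referred to here sits between these hunks and is not shown in the diff. As a small illustration of the `/think` and `/no_think` prefixes from point 2 above, a minimal sketch (assuming the `tencent/Hunyuan-A13B-Instruct` checkpoint used by the deployment commands below and the standard `transformers` tokenizer API):

```python
# Hedged sketch: force fast thinking for a single request by prefixing the prompt
# with "/no_think" (use "/think" to force CoT reasoning instead).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "tencent/Hunyuan-A13B-Instruct", trust_remote_code=True
)

messages = [
    {"role": "user", "content": "/no_think Briefly describe a mixture-of-experts model."}
]

# enable_thinking is left at its default; the prompt prefix alone switches the behaviour.
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)
```
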
@@ -135,6 +137,28 @@ print(f"thinking_content:{think_content}\n\n")
print(f"answer_content:{answer_content}\n\n")
```

+### Fast and slow thinking switch
+
+This model supports two modes of operation:
+
+- Slow Thinking Mode (Default): Enables detailed internal reasoning steps before producing the final answer.
+- Fast Thinking Mode: Skips the internal reasoning process for faster inference, going straight to the final answer.
+
+**Switching to Fast Thinking Mode:**
+
+To disable the reasoning process, set `enable_thinking=False` in the apply_chat_template call:
+```
+tokenized_chat = tokenizer.apply_chat_template(
+    messages,
+    tokenize=True,
+    add_generation_prompt=True,
+    return_tensors="pt",
+    enable_thinking=False  # Use fast thinking mode
+)
+```
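For context on how the `think_content` and `answer_content` values printed above are typically obtained, here is a minimal end-to-end sketch of slow-thinking generation and output parsing. It assumes the model wraps its reasoning in `<think>...</think>` and the final reply in `<answer>...</answer>` tags (the exact markers are defined by the model's chat template) and reuses the `tencent/Hunyuan-A13B-Instruct` checkpoint named in the deployment commands below:

```python
# Hedged sketch: slow-thinking generation plus parsing of reasoning and answer.
# The <think>/<answer> markers below are an assumption; adjust them if your
# checkpoint's chat template uses different tags.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tencent/Hunyuan-A13B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Why is seawater salty?"}]
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True,  # slow thinking (the default)
).to(model.device)

outputs = model.generate(tokenized_chat, max_new_tokens=2048)
generated = tokenizer.decode(outputs[0][tokenized_chat.shape[-1]:], skip_special_tokens=True)

# Split the reasoning trace from the final answer.
think_match = re.search(r"<think>(.*?)</think>", generated, re.DOTALL)
answer_match = re.search(r"<answer>(.*?)</answer>", generated, re.DOTALL)
think_content = think_match.group(1).strip() if think_match else ""
answer_content = answer_match.group(1).strip() if answer_match else generated.strip()

print(f"thinking_content:{think_content}\n\n")
print(f"answer_content:{answer_content}\n\n")
```
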

## Deployment

For deployment, you can use frameworks such as **TensorRT-LLM**, **vLLM**, or **SGLang** to serve the model and create an OpenAI-compatible API endpoint.

@@ -195,7 +219,7 @@ trtllm-serve \
```

-###
+### vLLM

#### Docker Image
We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model. The official vLLM release is currently under development; **note: CUDA 12.8 is required for this Docker image**.
@@ -217,25 +241,61 @@ docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm

Model downloaded from Hugging Face:
```
+docker run --rm --ipc=host \
        -v ~/.cache:/root/.cache/ \
+        --security-opt seccomp=unconfined \
+        --net=host \
+        --gpus=all \
+        -it \
+        -e VLLM_USE_V1=0 \
+        --entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
+        -m vllm.entrypoints.openai.api_server \
+        --host 0.0.0.0 \
+        --tensor-parallel-size 4 \
+        --port 8000 \
+        --model tencent/Hunyuan-A13B-Instruct \
+        --trust_remote_code
```

Model downloaded from ModelScope:
```
+docker run --rm --ipc=host \
        -v ~/.cache/modelscope:/root/.cache/modelscope \
+        --security-opt seccomp=unconfined \
+        --net=host \
+        --gpus=all \
+        -it \
+        -e VLLM_USE_V1=0 \
+        --entrypoint python mirror.ccs.tencentyun.com/hunyuaninfer/hunyuan-large:hunyuan-moe-A13B-vllm \
+        -m vllm.entrypoints.openai.api_server \
+        --host 0.0.0.0 \
+        --tensor-parallel-size 4 \
+        --port 8000 \
+        --model /root/.cache/modelscope/hub/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct/ \
+        --trust_remote_code
```

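Either container serves an OpenAI-compatible API on port 8000 (per the `--host`/`--port` flags above). A minimal client-side sketch using the `openai` Python package; the model name must match the `--model` value passed at launch:

```python
# Hedged sketch: query the OpenAI-compatible endpoint exposed by the vLLM container above.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # matches --host 0.0.0.0 / --port 8000
    api_key="EMPTY",                      # vLLM does not validate the key by default
)

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",  # use the local path instead for the ModelScope variant
    messages=[{"role": "user", "content": "/no_think Summarize what this model is in one sentence."}],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```
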
+#### Tool Calling with vLLM
+
+To support agent-based workflows and function calling capabilities, this model includes specialized parsing mechanisms for handling tool calls and internal reasoning steps.
+
+For a complete working example of how to implement and use these features in an agent setting, please refer to our full agent implementation on GitHub:
+🔗 [Hunyuan A13B Agent Example](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/)
+
+When deploying the model using **vLLM**, the following parameters can be used to configure the tool parsing behavior:
+
+| Parameter              | Value                                                                                                                          |
+|------------------------|--------------------------------------------------------------------------------------------------------------------------------|
+| `--tool-parser-plugin` | [Local Hunyuan A13B Tool Parser File](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/hunyuan_tool_parser.py) |
+| `--tool-call-parser`   | `hunyuan`                                                                                                                      |
+
+These settings enable vLLM to correctly interpret and route tool calls generated by the model according to the expected format.
+
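As an illustration of what these parameters enable, here is a minimal, hedged sketch of a function-calling request against a server launched with the two flags above (recent vLLM versions typically also need `--enable-auto-tool-choice`); the `get_weather` tool is a hypothetical example, not something shipped with the model:

```python
# Hedged sketch: function calling against a vLLM server configured with
# --tool-parser-plugin hunyuan_tool_parser.py and --tool-call-parser hunyuan.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool definition, used only to illustrate the request shape.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",
    messages=[{"role": "user", "content": "What is the weather like in Shenzhen right now?"}],
    tools=tools,
)

# With the hunyuan tool-call parser active, calls arrive as structured objects
# rather than raw text in the message content.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```
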
+### Reasoning parser
+
+vLLM reasoning parser support for the Hunyuan A13B model is under development.
+
### SGLang

#### Docker Image