update doc: add function calling and reasoning parser sections back.

#14
by asherszhang - opened
Files changed (1)
  1. README.md +72 -12
README.md CHANGED
@@ -98,7 +98,9 @@ Our model defaults to using slow-thinking reasoning, and there are two ways to disable CoT reasoning.
 1. Pass "enable_thinking=False" when calling apply_chat_template.
 2. Adding "/no_think" before the prompt will force the model not to perform CoT reasoning. Similarly, adding "/think" before the prompt will force the model to perform CoT reasoning.
 
-The following code snippet shows how to use the transformers library to load and apply the model. It also demonstrates how to enable and disable the reasoning mode, and how to parse the reasoning process along with the final output.
+The following code snippet shows how to use the transformers library to load and apply the model.
+It also demonstrates how to enable and disable the reasoning mode,
+and how to parse the reasoning process along with the final output.
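For option 2, a minimal transformers sketch of the prompt-prefix switch (not code from this PR; it assumes the `tencent/Hunyuan-A13B-Instruct` checkpoint named later in this diff, enough GPU memory to load it, and a plain `generate` call):

```
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tencent/Hunyuan-A13B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", trust_remote_code=True
)

# "/no_think" suppresses CoT reasoning; "/think" would force it instead.
messages = [{"role": "user", "content": "/no_think What is 17 * 24?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```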
@@ -135,6 +137,28 @@ print(f"thinking_content:{think_content}\n\n")
 print(f"answer_content:{answer_content}\n\n")
 ```
 
+### Fast and slow thinking switch
+
+This model supports two modes of operation:
+
+- Slow Thinking Mode (Default): Enables detailed internal reasoning steps before producing the final answer.
+- Fast Thinking Mode: Skips the internal reasoning process for faster inference, going straight to the final answer.
+
+**Switching to Fast Thinking Mode:**
+
+To disable the reasoning process, set `enable_thinking=False` in the apply_chat_template call:
+```
+tokenized_chat = tokenizer.apply_chat_template(
+    messages,
+    tokenize=True,
+    add_generation_prompt=True,
+    return_tensors="pt",
+    enable_thinking=False  # Use fast thinking mode
+)
+```
+
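The `thinking_content`/`answer_content` prints in the context lines above come from a parsing step that sits outside this hunk. A minimal sketch of such a parser, assuming the model wraps its reasoning in `<think>...</think>` and the final reply in `<answer>...</answer>` tags (check the model card for the exact output format):

```
import re

def split_reasoning(output_text: str):
    # Pull the reasoning block and the final answer out of the raw output.
    think = re.search(r"<think>(.*?)</think>", output_text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output_text, re.DOTALL)
    think_content = think.group(1).strip() if think else ""
    # Fall back to the whole text if no answer tags were emitted.
    answer_content = answer.group(1).strip() if answer else output_text.strip()
    return think_content, answer_content

think_content, answer_content = split_reasoning(
    "<think>17 * 24 = 408</think><answer>408</answer>"
)
print(f"thinking_content:{think_content}\n\n")
print(f"answer_content:{answer_content}\n\n")
```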
 ## Deployment
 
 For deployment, you can use frameworks such as **TensorRT-LLM**, **vLLM**, or **SGLang** to serve the model and create an OpenAI-compatible API endpoint.
@@ -195,7 +219,7 @@ trtllm-serve \
 ```
 
 
-### vllm
+### vLLM
 
 #### Docker Image
 We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model. Support in the official vLLM release is still under development; **note: CUDA 12.8 is required for this Docker image**.
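A quick sanity check for the CUDA 12.8 requirement, run inside the container (a sketch assuming the image ships PyTorch, which vLLM depends on):

```
import torch

print(torch.version.cuda)         # expect "12.8" per the note above
print(torch.cuda.is_available())  # True when the GPUs are passed through
```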
@@ -217,25 +241,61 @@ docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm
 
 model downloaded from Hugging Face:
 ```
-docker run --privileged --user root --net=host --ipc=host \
+docker run --rm --ipc=host \
     -v ~/.cache:/root/.cache/ \
-    --gpus=all -it --entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
-    -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 \
-    --tensor-parallel-size 4 --model tencent/Hunyuan-A13B-Instruct --trust-remote-code
+    --security-opt seccomp=unconfined \
+    --net=host \
+    --gpus=all \
+    -it \
+    -e VLLM_USE_V1=0 \
+    --entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
+    -m vllm.entrypoints.openai.api_server \
+    --host 0.0.0.0 \
+    --tensor-parallel-size 4 \
+    --port 8000 \
+    --model tencent/Hunyuan-A13B-Instruct \
+    --trust-remote-code
 ```
 
 model downloaded from ModelScope:
 ```
-docker run --privileged --user root --net=host --ipc=host \
+docker run --rm --ipc=host \
     -v ~/.cache/modelscope:/root/.cache/modelscope \
-    --gpus=all -it --entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
-    -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --tensor-parallel-size 4 --port 8000 \
-    --model /root/.cache/modelscope/hub/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct/ --trust-remote-code
+    --security-opt seccomp=unconfined \
+    --net=host \
+    --gpus=all \
+    -it \
+    -e VLLM_USE_V1=0 \
+    --entrypoint python mirror.ccs.tencentyun.com/hunyuaninfer/hunyuan-large:hunyuan-moe-A13B-vllm \
+    -m vllm.entrypoints.openai.api_server \
+    --host 0.0.0.0 \
+    --tensor-parallel-size 4 \
+    --port 8000 \
+    --model /root/.cache/modelscope/hub/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct/ \
+    --trust-remote-code
 ```
 
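Either command exposes an OpenAI-compatible endpoint, so a smoke test can use the standard `openai` client. A sketch assuming the defaults above: port 8000 and `tencent/Hunyuan-A13B-Instruct` as the served model name (for the ModelScope container, pass the local path given to `--model` instead):

```
from openai import OpenAI

# Local vLLM servers ignore the key, but the client requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",
    messages=[{"role": "user", "content": "/no_think Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```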
+#### Tool Calling with vLLM
+
+To support agent-based workflows and function calling capabilities, this model includes specialized parsing mechanisms for handling tool calls and internal reasoning steps.
+
+For a complete working example of how to implement and use these features in an agent setting, please refer to our full agent implementation on GitHub:
+🔗 [Hunyuan A13B Agent Example](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/)
+
+When deploying the model with **vLLM**, the following parameters can be used to configure the tool parsing behavior:
+
+| Parameter              | Value |
+|------------------------|-------|
+| `--tool-parser-plugin` | [Local Hunyuan A13B Tool Parser File](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/hunyuan_tool_parser.py) |
+| `--tool-call-parser`   | `hunyuan` |
+
+These settings enable vLLM to correctly interpret and route tool calls generated by the model according to the expected format.
+
+### Reasoning parser
+
+vLLM reasoning parser support for the Hunyuan A13B model is under development.
+
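Once the server is launched with the two parser flags above, tool calls should come back as structured objects on the OpenAI-compatible API. A client-side sketch (the `get_weather` tool is hypothetical, purely for illustration):

```
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool schema, for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Shenzhen?"}],
    tools=tools,
)

msg = resp.choices[0].message
# With --tool-call-parser hunyuan, structured calls appear in msg.tool_calls;
# without a parser, the raw tagged text would stay in msg.content.
print(msg.tool_calls or msg.content)
```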
 ### SGLang
 
 #### Docker Image
 