update doc, add function call and reason parser back.
README.md
CHANGED
@@ -98,7 +98,9 @@ Our model defaults to using slow-thinking reasoning, and there are two ways to d
1. Pass "enable_thinking=False" when calling apply_chat_template.
2. Adding "/no_think" before the prompt will force the model not to perform CoT reasoning. Similarly, adding "/think" before the prompt will force the model to perform CoT reasoning.

-The following code snippet shows how to use the transformers library to load and apply the model.
+The following code snippet shows how to use the transformers library to load and apply the model.
+It also demonstrates how to enable and disable the reasoning mode,
+and how to parse the reasoning process along with the final output.

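The full snippet referred to here sits between these hunks and is not shown in the diff. As a small illustration of the `/think` and `/no_think` prefixes from point 2 above, a minimal sketch (assuming the `tencent/Hunyuan-A13B-Instruct` checkpoint used by the deployment commands below and the standard `transformers` tokenizer API):

```python
# Hedged sketch: force fast thinking for a single request by prefixing the prompt
# with "/no_think" (use "/think" to force CoT reasoning instead).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "tencent/Hunyuan-A13B-Instruct", trust_remote_code=True
)

messages = [
    {"role": "user", "content": "/no_think Briefly describe a mixture-of-experts model."}
]

# enable_thinking is left at its default; the prompt prefix alone switches the behaviour.
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)
```
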
@@ -135,6 +137,28 @@ print(f"thinking_content:{think_content}\n\n")
print(f"answer_content:{answer_content}\n\n")
```

+### Fast and slow thinking switch
+
+This model supports two modes of operation:
+
+- Slow Thinking Mode (Default): Enables detailed internal reasoning steps before producing the final answer.
+- Fast Thinking Mode: Skips the internal reasoning process for faster inference, going straight to the final answer.
+
+**Switching to Fast Thinking Mode:**
+
+To disable the reasoning process, set `enable_thinking=False` in the apply_chat_template call:
+```
+tokenized_chat = tokenizer.apply_chat_template(
+    messages,
+    tokenize=True,
+    add_generation_prompt=True,
+    return_tensors="pt",
+    enable_thinking=False  # Use fast thinking mode
+)
+```
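For context on how the `think_content` and `answer_content` values printed above are typically obtained, here is a minimal end-to-end sketch of slow-thinking generation and output parsing. It assumes the model wraps its reasoning in `<think>...</think>` and the final reply in `<answer>...</answer>` tags (the exact markers are defined by the model's chat template) and reuses the `tencent/Hunyuan-A13B-Instruct` checkpoint named in the deployment commands below:

```python
# Hedged sketch: slow-thinking generation plus parsing of reasoning and answer.
# The <think>/<answer> markers below are an assumption; adjust them if your
# checkpoint's chat template uses different tags.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tencent/Hunyuan-A13B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Why is seawater salty?"}]
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True,  # slow thinking (the default)
).to(model.device)

outputs = model.generate(tokenized_chat, max_new_tokens=2048)
generated = tokenizer.decode(outputs[0][tokenized_chat.shape[-1]:], skip_special_tokens=True)

# Split the reasoning trace from the final answer.
think_match = re.search(r"<think>(.*?)</think>", generated, re.DOTALL)
answer_match = re.search(r"<answer>(.*?)</answer>", generated, re.DOTALL)
think_content = think_match.group(1).strip() if think_match else ""
answer_content = answer_match.group(1).strip() if answer_match else generated.strip()

print(f"thinking_content:{think_content}\n\n")
print(f"answer_content:{answer_content}\n\n")
```
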

## Deployment

For deployment, you can use frameworks such as **TensorRT-LLM**, **vLLM**, or **SGLang** to serve the model and create an OpenAI-compatible API endpoint.

@@ -195,7 +219,7 @@ trtllm-serve \
```

-###
+### vLLM

#### Docker Image
We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model. The official vLLM release is currently under development; **note: CUDA 12.8 is required for this Docker image**.
@@ -217,25 +241,61 @@ docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm

Model downloaded from Hugging Face:
```
+docker run --rm --ipc=host \
        -v ~/.cache:/root/.cache/ \
+        --security-opt seccomp=unconfined \
+        --net=host \
+        --gpus=all \
+        -it \
+        -e VLLM_USE_V1=0 \
+        --entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
+        -m vllm.entrypoints.openai.api_server \
+        --host 0.0.0.0 \
+        --tensor-parallel-size 4 \
+        --port 8000 \
+        --model tencent/Hunyuan-A13B-Instruct \
+        --trust_remote_code
```

Model downloaded from ModelScope:
```
+docker run --rm --ipc=host \
        -v ~/.cache/modelscope:/root/.cache/modelscope \
+        --security-opt seccomp=unconfined \
+        --net=host \
+        --gpus=all \
+        -it \
+        -e VLLM_USE_V1=0 \
+        --entrypoint python mirror.ccs.tencentyun.com/hunyuaninfer/hunyuan-large:hunyuan-moe-A13B-vllm \
+        -m vllm.entrypoints.openai.api_server \
+        --host 0.0.0.0 \
+        --tensor-parallel-size 4 \
+        --port 8000 \
+        --model /root/.cache/modelscope/hub/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct/ \
+        --trust_remote_code
```

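Either container serves an OpenAI-compatible API on port 8000 (per the `--host`/`--port` flags above). A minimal client-side sketch using the `openai` Python package; the model name must match the `--model` value passed at launch:

```python
# Hedged sketch: query the OpenAI-compatible endpoint exposed by the vLLM container above.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # matches --host 0.0.0.0 / --port 8000
    api_key="EMPTY",                      # vLLM does not validate the key by default
)

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",  # use the local path instead for the ModelScope variant
    messages=[{"role": "user", "content": "/no_think Summarize what this model is in one sentence."}],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```
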
+#### Tool Calling with vLLM
+
+To support agent-based workflows and function calling capabilities, this model includes specialized parsing mechanisms for handling tool calls and internal reasoning steps.
+
+For a complete working example of how to implement and use these features in an agent setting, please refer to our full agent implementation on GitHub:
+🔗 [Hunyuan A13B Agent Example](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/)
+
+When deploying the model using **vLLM**, the following parameters can be used to configure the tool parsing behavior:
+
+| Parameter              | Value                                                                                                                          |
+|------------------------|--------------------------------------------------------------------------------------------------------------------------------|
+| `--tool-parser-plugin` | [Local Hunyuan A13B Tool Parser File](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/hunyuan_tool_parser.py) |
+| `--tool-call-parser`   | `hunyuan`                                                                                                                      |
+
+These settings enable vLLM to correctly interpret and route tool calls generated by the model according to the expected format.
+
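As an illustration of what these parameters enable, here is a minimal, hedged sketch of a function-calling request against a server launched with the two flags above (recent vLLM versions typically also need `--enable-auto-tool-choice`); the `get_weather` tool is a hypothetical example, not something shipped with the model:

```python
# Hedged sketch: function calling against a vLLM server configured with
# --tool-parser-plugin hunyuan_tool_parser.py and --tool-call-parser hunyuan.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool definition, used only to illustrate the request shape.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",
    messages=[{"role": "user", "content": "What is the weather like in Shenzhen right now?"}],
    tools=tools,
)

# With the hunyuan tool-call parser active, calls arrive as structured objects
# rather than raw text in the message content.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```
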
+### Reasoning parser
+
+vLLM reasoning parser support for the Hunyuan A13B model is under development.
+
### SGLang

#### Docker Image