doc: add reason switch and agent function call parameters. (#2)
(commit e1382cc668ab1175c3e3b4e611d1ccb88c4c5db3)
Co-authored-by: asher <[email protected]>
README.md
CHANGED

## Use with transformers

Below is an example of how to use this model with the Hugging Face `transformers` library. It covers loading the model and tokenizer, toggling reasoning ("thinking") mode, and parsing both the reasoning process and the final answer from the output.

```python
import os

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = os.environ['MODEL_PATH']
# model_name_or_path = "tencent/Hunyuan-A13B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=True)  # You may want to use bfloat16 and/or move to GPU here

messages = [
    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
]

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True  # Toggle thinking mode (default: True)
)

outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=4096)

# ... (the unchanged code that parses `outputs` into think_content and
# answer_content is elided in this diff) ...

print(f"thinking_content:{think_content}\n\n")
print(f"answer_content:{answer_content}\n\n")
```
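
The parsing step between `generate` and the two `print` calls above is unchanged in this commit and therefore elided from the diff. As a rough sketch of one way to separate the two parts, assuming the chat template wraps the reasoning in `<think>...</think>` markers (an assumption here; check the tokenizer's chat template for the exact delimiters):

```python
import re

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][tokenized_chat.shape[-1]:],
                            skip_special_tokens=True)

# Assumption: reasoning is delimited by <think>...</think>; adjust the
# pattern if the template uses different markers.
match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
think_content = match.group(1).strip() if match else ""
answer_content = response[match.end():].strip() if match else response.strip()
```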

### Fast and slow thinking switch

This model supports two modes of operation:

- Slow Thinking Mode (default): enables detailed internal reasoning steps before producing the final answer.
- Fast Thinking Mode: skips the internal reasoning process for faster inference, going straight to the final answer.

**Switching to Fast Thinking Mode:**

To disable the reasoning process, set `enable_thinking=False` in the `apply_chat_template` call:

```python
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=False  # Use fast thinking mode
)
```
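
As a quick way to see the difference between the two modes, the same chat can be generated with the flag toggled. This sketch reuses the `tokenizer`, `model`, and `messages` objects from the example above:

```python
# Compare slow (thinking) and fast (non-thinking) generation side by side.
for thinking in (True, False):
    chat = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        enable_thinking=thinking,
    )
    out = model.generate(chat.to(model.device), max_new_tokens=512)
    text = tokenizer.decode(out[0][chat.shape[-1]:], skip_special_tokens=True)
    print(f"enable_thinking={thinking}:\n{text}\n")
```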

## Quantitative Compression

We used our own `AngleSlim` compression tool to produce FP8 and INT4 quantized models. `AngleSlim` is expected to be open-sourced in early July and will support one-click quantization and compression of large models. In the meantime, you can download our quantized models directly for deployment testing.
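
Quantized checkpoints published on the Hub generally load through the same `from_pretrained` call as the base model, with the quantization scheme picked up from the checkpoint's config. A minimal sketch, assuming a hypothetical repo id like `tencent/Hunyuan-A13B-Instruct-FP8` (check the model collection for the exact names):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id for a quantized variant; verify the exact name on the
# Hugging Face Hub. Depending on the format, an extra backend package
# (e.g. for FP8 or GPTQ weights) may be required.
quant_model_id = "tencent/Hunyuan-A13B-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(quant_model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(quant_model_id,
                                             device_map="auto",
                                             trust_remote_code=True)
```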

### vLLM

#### Docker Image

We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model. The official vLLM release is currently under development. **Note: CUDA 12.8 is required for this Docker image.**

#### Tool Calling with vLLM

To support agent-based workflows and function-calling capabilities, this model includes specialized parsing mechanisms for handling tool calls and internal reasoning steps.

For a complete working example of how to implement and use these features in an agent setting, please refer to our full agent implementation on GitHub:
🔗 [Hunyuan A13B Agent Example](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/)

When deploying the model with **vLLM**, the following parameters can be used to configure the tool-parsing behavior:

| Parameter | Value |
|-----------|-------|
| `--tool-parser-plugin` | [Local Hunyuan A13B Tool Parser File](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/hunyuan_tool_parser.py) |
| `--tool-call-parser` | `hunyuan` |

These settings enable vLLM to correctly interpret and route tool calls generated by the model according to the expected format.
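
For reference, a client-side sketch of what these settings enable. Assuming the server was launched with something like `vllm serve tencent/Hunyuan-A13B-Instruct --enable-auto-tool-choice --tool-parser-plugin hunyuan_tool_parser.py --tool-call-parser hunyuan` (the two parser flags per the table above; `--enable-auto-tool-choice` is vLLM's usual companion flag for tool calling), recognized tool calls come back as structured fields on the OpenAI-compatible API rather than raw text:

```python
from openai import OpenAI

# Assumes a vLLM server with the tool-parser flags above is running locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# A hypothetical tool definition in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Shenzhen?"}],
    tools=tools,
)

# With the hunyuan parser configured, tool calls arrive as structured
# `tool_calls` entries on the message.
print(response.choices[0].message.tool_calls)
```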

### SGLang

#### Docker Image