manaestras asherszhang committed on
Commit fee83ca · verified · 1 Parent(s): 19cc7fc

doc: add reason switch and agent function call parameters. (#2)

- doc: add reason switch and agent function call parameters. (e1382cc668ab1175c3e3b4e611d1ccb88c4c5db3)


Co-authored-by: asher <[email protected]>

Files changed (1)
  1. README.md +53 -6
README.md CHANGED
@@ -91,7 +91,7 @@ Hunyuan-A13B-Instruct has achieved highly competitive performance across multipl
  &nbsp;

  ## Use with transformers
- The following code snippet shows how to use the transformers library to load and apply the model. It also demonstrates how to enable and disable the reasoning mode , and how to parse the reasoning process along with the final output.
+ Below is an example of how to use this model with the Hugging Face transformers library. This includes loading the model and tokenizer, toggling reasoning (thinking) mode, and parsing both the reasoning process and final answer from the output.

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
 
@@ -102,13 +102,20 @@ model_name_or_path = os.environ['MODEL_PATH']
  # model_name_or_path = "tencent/Hunyuan-A13B-Instruct"

  tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
- model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto",trust_remote_code=True) # You may want to use bfloat16 and/or move to GPU here
+ model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
+                                              device_map="auto", trust_remote_code=True)  # You may want to use bfloat16 and/or move to GPU here
+
  messages = [
      {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
  ]
- tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True,return_tensors="pt",
-     enable_thinking=True # Toggle thinking mode (default: True)
- )
+
+ tokenized_chat = tokenizer.apply_chat_template(
+     messages,
+     tokenize=True,
+     add_generation_prompt=True,
+     return_tensors="pt",
+     enable_thinking=True  # Toggle thinking mode (default: True)
+ )

  outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=4096)
 
@@ -126,6 +133,27 @@ print(f"thinking_content:{think_content}\n\n")
  print(f"answer_content:{answer_content}\n\n")
  ```

+ ### Fast and slow thinking switch
+
+ This model supports two modes of operation:
+
+ - Slow Thinking Mode (Default): Enables detailed internal reasoning steps before producing the final answer.
+ - Fast Thinking Mode: Skips the internal reasoning process for faster inference, going straight to the final answer.
+
+ **Switching to Fast Thinking Mode:**
+
+ To disable the reasoning process, set `enable_thinking=False` in the `apply_chat_template` call:
+ ```
+ tokenized_chat = tokenizer.apply_chat_template(
+     messages,
+     tokenize=True,
+     add_generation_prompt=True,
+     return_tensors="pt",
+     enable_thinking=False  # Use fast thinking mode
+ )
+ ```
+
+
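The example above prints `think_content` and `answer_content`, but the lines that extract them from `outputs` fall outside the edited range. A minimal sketch of that step, assuming the chat template wraps the model's reasoning in `<think>...</think>` markers (an assumption; check the model's chat template for the exact delimiters), could look like this:

```python
import re

# Decode only the newly generated tokens (everything after the prompt).
decoded = tokenizer.decode(
    outputs[0][tokenized_chat.shape[-1]:], skip_special_tokens=True
)

# Assumed format: reasoning wrapped in <think>...</think>, answer afterwards.
match = re.search(r"<think>(.*?)</think>(.*)", decoded, flags=re.DOTALL)
if match:
    think_content, answer_content = match.group(1).strip(), match.group(2).strip()
else:
    # Fast thinking mode (enable_thinking=False) emits no reasoning block,
    # so the whole completion is treated as the answer.
    think_content, answer_content = "", decoded.strip()

print(f"thinking_content:{think_content}\n\n")
print(f"answer_content:{answer_content}\n\n")
```

With `enable_thinking=False`, the fallback branch applies and only `answer_content` is populated.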
  ## Quantitative Compression
  We used our own `AngleSlim` compression tool to produce FP8 and INT4 quantization models. `AngleSlim` compression tool is expected to be open source in early July, which will support one-click quantization and compression of large models, please look forward to it, and you can download our quantization models directly for deployment testing now.
 
 
@@ -197,7 +225,7 @@ trtllm-serve \
  ```


- ### vllm
+ ### vLLM

  #### Docker Image
  We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model. The official vllm release is currently under development, **note: cuda 12.8 is require for this docker**.
 
@@ -238,6 +266,25 @@ docker run --privileged --user root --net=host --ipc=host \
  ```


+
+ #### Tool Calling with vLLM
+
+ To support agent-based workflows and function calling capabilities, this model includes specialized parsing mechanisms for handling tool calls and internal reasoning steps.
+
+ For a complete working example of how to implement and use these features in an agent setting, please refer to our full agent implementation on GitHub:
+ 🔗 [Hunyuan A13B Agent Example](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/)
+
+ When deploying the model using **vLLM**, the following parameters can be used to configure the tool parsing behavior:
+
+ | Parameter              | Value |
+ |------------------------|-------|
+ | `--tool-parser-plugin` | [Local Hunyuan A13B Tool Parser File](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/hunyuan_tool_parser.py) |
+ | `--tool-call-parser`   | `hunyuan` |
+
+ These settings enable vLLM to correctly interpret and route tool calls generated by the model according to the expected format.
+
+
+
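As an illustration of what the parameters above enable: once a vLLM OpenAI-compatible server is running for this model with `--tool-call-parser hunyuan` and the `--tool-parser-plugin` file from the agent repository, function calls can be requested through the standard chat completions API. The sketch below assumes a local server on port 8000 serving the model under its repository name, and uses a made-up `get_weather` tool purely for illustration; the linked agent examples define their own tools and endpoints.

```python
from openai import OpenAI

# Assumed local endpoint; adjust to wherever the vLLM server is listening.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Illustrative tool schema (not taken from the agent repository).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Shenzhen right now?"}],
    tools=tools,
)

# With the hunyuan tool parser configured, tool invocations come back as
# structured tool_calls rather than raw text in message.content.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

If the model decides no tool is needed, `tool_calls` is empty and the reply arrives in `message.content` as usual.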
  ### SGLang

  #### Docker Image