asherszhang committed
Commit e1382cc · verified · 1 Parent(s): e9d2b83

doc: add reason switch and agent function call parameters.

Files changed (1)
  1. README.md +53 -6
README.md CHANGED
@@ -90,7 +90,7 @@ Hunyuan-A13B-Instruct has achieved highly competitive performance across multipl
 
 
 ## Use with transformers
-The following code snippet shows how to use the transformers library to load and apply the model. It also demonstrates how to enable and disable the reasoning mode, and how to parse the reasoning process along with the final output.
+Below is an example of how to use this model with the Hugging Face transformers library. This includes loading the model and tokenizer, toggling reasoning (thinking) mode, and parsing both the reasoning process and final answer from the output.
 
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -101,13 +101,20 @@ model_name_or_path = os.environ['MODEL_PATH']
 # model_name_or_path = "tencent/Hunyuan-A13B-Instruct"
 
 tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
-model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)  # You may want to use bfloat16 and/or move to GPU here
+model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
+                                             device_map="auto", trust_remote_code=True)  # You may want to use bfloat16 and/or move to GPU here
+
 messages = [
     {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
 ]
-tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
-    enable_thinking=True  # Toggle thinking mode (default: True)
-)
+
+tokenized_chat = tokenizer.apply_chat_template(
+    messages,
+    tokenize=True,
+    add_generation_prompt=True,
+    return_tensors="pt",
+    enable_thinking=True  # Toggle thinking mode (default: True)
+)
 
 outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=4096)
 
@@ -125,6 +132,27 @@ print(f"thinking_content:{think_content}\n\n")
 print(f"answer_content:{answer_content}\n\n")
 ```
 
+### Fast and slow thinking switch
+
+This model supports two modes of operation:
+
+- Slow Thinking Mode (default): enables detailed internal reasoning steps before producing the final answer.
+- Fast Thinking Mode: skips the internal reasoning process for faster inference, going straight to the final answer.
+
+**Switching to Fast Thinking Mode:**
+
+To disable the reasoning process, set `enable_thinking=False` in the `apply_chat_template` call:
+```python
+tokenized_chat = tokenizer.apply_chat_template(
+    messages,
+    tokenize=True,
+    add_generation_prompt=True,
+    return_tensors="pt",
+    enable_thinking=False  # Use fast thinking mode
+)
+```
+
+
 ## Quantitative Compression
 We used our own `AngleSlim` compression tool to produce FP8 and INT4 quantization models. `AngleSlim` is expected to be open-sourced in early July and will support one-click quantization and compression of large models; in the meantime, you can download our quantized models directly for deployment testing.
 
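The example above ends by printing `think_content` and `answer_content`, but the lines that actually extract them from the generation fall outside this diff's context. As a minimal sketch (not part of the commit), assuming the model emits its reasoning wrapped in `<think>...</think>` tags ahead of the final answer, the parsing step might look like this, continuing from the `outputs` tensor produced above:

```python
import re

# Decode only the newly generated tokens, skipping the prompt portion.
output_text = tokenizer.decode(
    outputs[0][tokenized_chat.shape[-1]:], skip_special_tokens=True
)

# Assumption: reasoning is delimited by <think>...</think>; everything after it is the answer.
match = re.search(r"<think>(.*?)</think>", output_text, flags=re.DOTALL)
think_content = match.group(1).strip() if match else ""
answer_content = output_text.split("</think>")[-1].strip()

print(f"thinking_content:{think_content}\n\n")
print(f"answer_content:{answer_content}\n\n")
```

Under the same assumption, running with `enable_thinking=False` would produce no `<think>` block, so `think_content` simply comes back empty.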
 
@@ -196,7 +224,7 @@ trtllm-serve \
 ```
 
 
-### vllm
+### vLLM
 
 #### Docker Image
 We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model. The official vLLM release is currently under development. **Note: CUDA 12.8 is required for this Docker image.**
@@ -237,6 +265,25 @@ docker run --privileged --user root --net=host --ipc=host \
 ```
 
 
+
+#### Tool Calling with vLLM
+
+To support agent-based workflows and function calling capabilities, this model includes specialized parsing mechanisms for handling tool calls and internal reasoning steps.
+
+For a complete working example of how to implement and use these features in an agent setting, please refer to our full agent implementation on GitHub:
+🔗 [Hunyuan A13B Agent Example](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/)
+
+When deploying the model using **vLLM**, the following parameters can be used to configure the tool parsing behavior:
+
+| Parameter | Value |
+|------------------------|--------------------------------------------------------------------------------------------------------------|
+| `--tool-parser-plugin` | [Local Hunyuan A13B Tool Parser File](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/hunyuan_tool_parser.py) |
+| `--tool-call-parser`   | `hunyuan` |
+
+These settings enable vLLM to correctly interpret and route tool calls generated by the model according to the expected format.
+
+
+
 ### SGLang
 
 #### Docker Image
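The tool-calling hunk above stops at the parser flags. As an illustrative sketch only, the snippet below exercises function calling against a vLLM OpenAI-compatible server assumed to have been launched with those flags; the endpoint address, served model name, and the `get_weather` tool are placeholders for illustration, not values taken from this commit:

```python
# Sketch only. Assumes the server was started with something like:
#   vllm serve tencent/Hunyuan-A13B-Instruct \
#       --tool-parser-plugin hunyuan_tool_parser.py \
#       --tool-call-parser hunyuan \
#       --enable-auto-tool-choice
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local endpoint

# Hypothetical tool definition, used only to demonstrate the request shape.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",  # must match the name the server exposes
    messages=[{"role": "user", "content": "What's the weather in Shenzhen?"}],
    tools=tools,
)

# With the hunyuan tool parser enabled, tool calls come back as structured objects
# rather than raw text, so they can be dispatched directly.
print(resp.choices[0].message.tool_calls)
```

For the full request/response loop (executing the tool and feeding the result back to the model), see the agent example linked in the hunk above.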
 