Files changed (2)
  1. README.md +40 -18
  2. README_CN.md +70 -3
README.md CHANGED
@@ -117,11 +117,17 @@ model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="aut
messages = [
    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
]
- tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt",
-                                                enable_thinking=True # Toggle thinking mode (default: True)
-                                                )
-
- outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=4096)
+
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     enable_thinking=True
+ )
+
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+ model_inputs.pop("token_type_ids", None)
+ outputs = model.generate(**model_inputs, max_new_tokens=4096)
+

output_text = tokenizer.decode(outputs[0])

@@ -148,13 +154,12 @@ This model supports two modes of operation:

To disable the reasoning process, set `enable_thinking=False` in the apply_chat_template call:
```
- tokenized_chat = tokenizer.apply_chat_template(
-     messages,
-     tokenize=True,
-     add_generation_prompt=True,
-     return_tensors="pt",
-     enable_thinking=False # Use fast thinking mode
- )
+
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     enable_thinking=False
+ )
```


@@ -172,13 +177,30 @@ image: https://hub.docker.com/r/hunyuaninfer/hunyuan-a13b/tags

We provide a pre-built Docker image based on the latest version of TensorRT-LLM.

- - To get started:
-
- https://hub.docker.com/r/hunyuaninfer/hunyuan-large/tags
+ - To get started, download the Docker image:

+ **From Docker Hub:**
```
docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
```
+
+ **From the China mirror (thanks to [CNB](https://cnb.cool/ "CNB.cool")):**
+
+
+ First, pull the image from CNB:
+ ```
+ docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-trtllm
+ ```
+
+ Then, tag the image so it matches the name used in the scripts below:
+ ```
+
+ docker tag docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-trtllm hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
+ ```
+
+
+ - Start the Docker container:
+
```
docker run --name hunyuanLLM_infer --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
```
@@ -287,10 +309,10 @@ docker run --rm --ipc=host \
```

### Source Code
- Support for this model has been added via this [PR 20114](https://github.com/vllm-project/vllm/pull/20114 ) in the vLLM project.
-
- You can build and run vLLM from source after merging this pull request into your local repository.
+ Support for this model was added via [PR 20114](https://github.com/vllm-project/vllm/pull/20114) in the vLLM project;
+ the patch was merged by the community on July 1, 2025.

+ You can build and run vLLM from source from any commit at or after `ecad85`.

### Model Context Length Support
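The toggle shown in the hunks above only changes whether the model emits its reasoning; the decoded output is parsed with the `<think>`/`<answer>` regular expressions that appear in the README_CN.md diff below. The following is a small Python helper along those lines; the fallbacks are an assumption for fast-thinking mode, where a `<think>` block may be absent from the output.

```python
import re

def split_think_and_answer(output_text: str):
    """Split a decoded response into (thinking, answer).

    Mirrors the <think>/<answer> regex parsing shown in the README_CN.md
    diff below. The fallbacks are an assumption for fast-thinking mode
    (enable_thinking=False), where the <think> block may be absent.
    """
    think_matches = re.findall(r"<think>(.*?)</think>", output_text, re.DOTALL)
    answer_matches = re.findall(r"<answer>(.*?)</answer>", output_text, re.DOTALL)
    think_content = think_matches[0].strip() if think_matches else ""
    answer_content = answer_matches[0].strip() if answer_matches else output_text.strip()
    return think_content, answer_content
```

Called on `output_text` from the snippet above, it returns the reasoning trace (possibly empty) and the final answer.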
 
README_CN.md CHANGED
@@ -89,6 +89,75 @@ Hunyuan-A13B adopts a fine-grained Mixture of Experts (Fine-grained Mixture of Experts, F
| **NLU** | ComplexNLU<br>Word-Task | 64.7<br>67.1 | 64.5<br>76.3 | 59.8<br>56.4 | 61.2<br>62.9 |
| **Agent** | BDCL v3<br> τ-Bench<br>ComplexFuncBench<br> $C^3$-Bench | 67.8<br>60.4<br>47.6<br>58.8 | 56.9<br>43.8<br>41.1<br>55.3 | 70.8<br>44.6<br>40.6<br>51.7 | 78.3<br>54.7<br>61.2<br>63.5 |

+ ## Inference with transformers
+
+ Our model uses "slow thinking" (i.e., reasoning mode) by default. There are two ways to disable CoT (Chain-of-Thought) reasoning:
+ 1. Pass `enable_thinking=False` when calling `apply_chat_template`.
+ 2. Prefix the prompt with `/no_think` to force the model to skip CoT reasoning; likewise, prefix it with `/think` to force CoT reasoning on (see the short sketch after this diff).
+
+ The following code snippet shows how to load and use the model with the `transformers` library.
+ It also demonstrates how to enable and disable reasoning mode,
+ and how to parse the reasoning process and the final output.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import os
+ import re
+
+ model_name_or_path = os.environ['MODEL_PATH']
+ # model_name_or_path = "tencent/Hunyuan-A13B-Instruct"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)  # You may want to use bfloat16 and/or move to GPU here
+ messages = [
+     {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
+ ]
+
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     enable_thinking=True
+ )
+
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+ model_inputs.pop("token_type_ids", None)
+ outputs = model.generate(**model_inputs, max_new_tokens=4096)
+
+
+ output_text = tokenizer.decode(outputs[0])
+
+ think_pattern = r'<think>(.*?)</think>'
+ think_matches = re.findall(think_pattern, output_text, re.DOTALL)
+
+ answer_pattern = r'<answer>(.*?)</answer>'
+ answer_matches = re.findall(answer_pattern, output_text, re.DOTALL)
+
+ think_content = [match.strip() for match in think_matches][0]
+ answer_content = [match.strip() for match in answer_matches][0]
+ print(f"thinking_content:{think_content}\n\n")
+ print(f"answer_content:{answer_content}\n\n")
+ ```
+
+
+
+ ### Switching Between Fast and Slow Thinking
+
+ This model supports two modes of operation:
+
+ - **Slow thinking mode (default)**: performs detailed internal reasoning steps before generating the final answer.
+ - **Fast thinking mode**: skips the internal reasoning process and outputs the final answer directly, for faster inference.
+
+ **Switching to fast thinking mode:**
+
+ To disable the reasoning process, set `enable_thinking=False` in the `apply_chat_template` call:
+
+ ```python
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     enable_thinking=False  # Use fast thinking mode
+ )
+ ```

## Inference and Deployment

@@ -246,9 +315,7 @@ docker run --rm --ipc=host \

### Source Code Deployment

- Support for this model has been submitted to the vLLM project via [PR 20114](https://github.com/vllm-project/vllm/pull/20114).
-
- After merging this PR into your local repository, you can build and run vLLM from source.
+ Support for this model was submitted to the vLLM project via [PR 20114](https://github.com/vllm-project/vllm/pull/20114) and has since been merged; you can build vLLM from source from any version at or after git commit `ecad85`.


### Model Context Length Support
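Item 2 in the list above mentions the `/think` and `/no_think` prompt prefixes but does not show them in code. Below is a minimal sketch, assuming the prefix is simply prepended to the user message content; the exact placement is not specified in the diff.

```python
from transformers import AutoTokenizer

# Same model path as in the example above.
tokenizer = AutoTokenizer.from_pretrained("tencent/Hunyuan-A13B-Instruct", trust_remote_code=True)

# /no_think asks the model to skip CoT reasoning; /think asks it to reason.
# Prepending the prefix to the user message is an assumption for illustration.
no_think_messages = [
    {"role": "user", "content": "/no_think Write a short summary of the benefits of regular exercise"},
]
think_messages = [
    {"role": "user", "content": "/think Write a short summary of the benefits of regular exercise"},
]

# The rest of the pipeline (tokenize, generate, decode) is identical to the snippet above.
text = tokenizer.apply_chat_template(no_think_messages, tokenize=False)
```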