update-readme #30
by asherszhang - opened

- README.md +40 -18
- README_CN.md +70 -3

README.md CHANGED
@@ -117,11 +117,17 @@ model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="aut
messages = [
    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
]
-
-
-
-
-
+
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    enable_thinking=True
+)
+
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+model_inputs.pop("token_type_ids", None)
+outputs = model.generate(**model_inputs, max_new_tokens=4096)
+

output_text = tokenizer.decode(outputs[0])

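A note on the hunk above, not part of the diff itself: the `from_pretrained` call in the hunk context loads the model with `device_map="auto"`, and the full snippet (repeated in the README_CN.md changes below) comments that you may want to use bfloat16 and/or move the model to GPU. A minimal sketch of one way to do that with standard `transformers` arguments, assuming a CUDA-capable GPU and the checkpoint name taken from the commented-out line in that snippet:

```python
import torch
from transformers import AutoModelForCausalLM

model_name_or_path = "tencent/Hunyuan-A13B-Instruct"  # assumed checkpoint name

# Load the weights in bfloat16 (roughly half the memory of fp32) and let
# accelerate place them across the available GPUs automatically.
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```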
@@ -148,13 +154,12 @@ This model supports two modes of operation:

To disable the reasoning process, set `enable_thinking=False` in the apply_chat_template call:
```
-
-
-
-
-
-
-)
+
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    enable_thinking=False
+)
```

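The README_CN.md additions later in this PR also describe a prompt-level switch: prefixing the user prompt with `/no_think` forces CoT reasoning off, while `/think` forces it on. A small illustrative sketch of that prefix (the prompt text and checkpoint name are only examples, not part of the diff):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tencent/Hunyuan-A13B-Instruct", trust_remote_code=True)

# Ask the model to skip CoT reasoning for this request by prefixing the prompt with /no_think.
messages = [
    {"role": "user", "content": "/no_think Write a short summary of the benefits of regular exercise"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
```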
@@ -172,13 +177,30 @@ image: https://hub.docker.com/r/hunyuaninfer/hunyuan-a13b/tags

We provide a pre-built Docker image based on the latest version of TensorRT-LLM.

-- To
-
-https://hub.docker.com/r/hunyuaninfer/hunyuan-large/tags
+- To get started, download the Docker image:

+**From Docker Hub:**
```
docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
```
+
+**From the China mirror (thanks to [CNB](https://cnb.cool/ "CNB.cool")):**
+
+
+First, pull the image from CNB:
+```
+docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-trtllm
+```
+
+Then, retag the image so that it matches the scripts below:
+```
+
+docker tag docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-trtllm hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
+```
+
+
+- Start the Docker container:
+
```
docker run --name hunyuanLLM_infer --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
```
@@ -287,10 +309,10 @@ docker run --rm --ipc=host \
```

### Source Code
-Support for this model has been added via this [PR 20114](https://github.com/vllm-project/vllm/pull/20114 ) in the vLLM project
-
-You can build and run vLLM from source after merging this pull request into your local repository.
+Support for this model has been added via [PR 20114](https://github.com/vllm-project/vllm/pull/20114) in the vLLM project.
+The patch was merged by the community on Jul 1, 2025.

+You can build and run vLLM from source from any commit after `ecad85`.

### Model Context Length Support

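As a quick usage check once a vLLM build containing that patch (any commit after `ecad85`) is installed, the offline `LLM` API can be used. This is a hedged sketch, not taken from the PR: the sampling settings are arbitrary, the prompt is passed raw rather than chat-templated, and the checkpoint name is assumed from the README snippet:

```python
from vllm import LLM, SamplingParams

# Assumes a vLLM build that already includes PR 20114 (any commit after `ecad85`).
llm = LLM(model="tencent/Hunyuan-A13B-Instruct", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)

outputs = llm.generate(
    ["Write a short summary of the benefits of regular exercise"],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```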
README_CN.md CHANGED
@@ -89,6 +89,75 @@ Hunyuan-A13B adopts a fine-grained Mixture of Experts (Fine-grained Mixture of Experts, F
| **NLU** | ComplexNLU<br>Word-Task | 64.7<br>67.1 | 64.5<br>76.3 | 59.8<br>56.4 | 61.2<br>62.9 |
| **Agent** | BDCL v3<br> τ-Bench<br>ComplexFuncBench<br> $C^3$-Bench | 67.8<br>60.4<br>47.6<br>58.8 | 56.9<br>43.8<br>41.1<br>55.3 | 70.8<br>44.6<br>40.6<br>51.7 | 78.3<br>54.7<br>61.2<br>63.5 |

+## Inference with transformers
+
+By default, our model uses "slow thinking" (reasoning mode). There are two ways to disable CoT (Chain-of-Thought) reasoning:
+1. Pass `"enable_thinking=False"` when calling `apply_chat_template`.
+2. Prefix the prompt with `/no_think` to force the model not to use CoT reasoning. Similarly, prefixing the prompt with `/think` forces the model to enable CoT reasoning.
+
+The following code snippet shows how to load and use the model with the `transformers` library.
+It also demonstrates how to enable and disable the reasoning mode,
+and how to parse the reasoning process and the final output.
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import os
+import re
+
+model_name_or_path = os.environ['MODEL_PATH']
+# model_name_or_path = "tencent/Hunyuan-A13B-Instruct"
+
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)  # You may want to use bfloat16 and/or move to GPU here
+messages = [
+    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
+]
+
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    enable_thinking=True
+)
+
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+model_inputs.pop("token_type_ids", None)
+outputs = model.generate(**model_inputs, max_new_tokens=4096)
+
+
+output_text = tokenizer.decode(outputs[0])
+
+think_pattern = r'<think>(.*?)</think>'
+think_matches = re.findall(think_pattern, output_text, re.DOTALL)
+
+answer_pattern = r'<answer>(.*?)</answer>'
+answer_matches = re.findall(answer_pattern, output_text, re.DOTALL)
+
+think_content = [match.strip() for match in think_matches][0]
+answer_content = [match.strip() for match in answer_matches][0]
+print(f"thinking_content:{think_content}\n\n")
+print(f"answer_content:{answer_content}\n\n")
+```
+
+
+
+### Switching Between Fast and Slow Thinking
+
+This model supports two modes of operation:
+
+- **Slow thinking mode (default)**: performs detailed internal reasoning steps before generating the final answer.
+- **Fast thinking mode**: skips the internal reasoning process and outputs the final answer directly, for faster inference.
+
+**How to switch to fast thinking mode:**
+
+To disable the reasoning process, set `enable_thinking=False` in the `apply_chat_template` call:
+
+```python
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    enable_thinking=False  # use fast thinking mode
+)
+```

## Inference and Deployment

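One note on the snippet above (not part of the PR): `tokenizer.decode(outputs[0])` returns the prompt together with the completion, because `generate` keeps the input ids at the front of each returned sequence for decoder-only models. If only the newly generated text is wanted, the prompt tokens can be sliced off first; a minimal continuation of the same snippet:

```python
# Continues from the snippet above: keep only the tokens generated after the prompt.
prompt_len = model_inputs.input_ids.shape[-1]
generated_text = tokenizer.decode(outputs[0][prompt_len:])
print(generated_text)
```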
@@ -246,9 +315,7 @@ docker run --rm --ipc=host \

### Source Code Deployment

-Support for this model has been submitted to vLLM via [PR 20114](https://github.com/vllm-project/vllm/pull/20114)
-
-You can build and run vLLM from source after merging this PR into your local repository.
+Support for this model was submitted to the vLLM project via [PR 20114](https://github.com/vllm-project/vllm/pull/20114) and has already been merged; you can build vLLM from source from any version after git commit `ecad85`.


### Model Context Length Support