Files changed (2)
  1. README.md +40 -18
  2. README_CN.md +70 -3
README.md CHANGED
@@ -117,11 +117,17 @@ model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="aut
messages = [
    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
]
- tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, return_tensors="pt",
-                                                enable_thinking=True # Toggle thinking mode (default: True)
-                                                )
-
- outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=4096)
+
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     enable_thinking=True
+ )
+
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+ model_inputs.pop("token_type_ids", None)
+ outputs = model.generate(**model_inputs, max_new_tokens=4096)
+

output_text = tokenizer.decode(outputs[0])

@@ -148,13 +154,12 @@ This model supports two modes of operation:

To disable the reasoning process, set `enable_thinking=False` in the apply_chat_template call:
```
- tokenized_chat = tokenizer.apply_chat_template(
-     messages,
-     tokenize=True,
-     add_generation_prompt=True,
-     return_tensors="pt",
-     enable_thinking=False # Use fast thinking mode
- )
+
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     enable_thinking=False
+ )
```


@@ -172,13 +177,30 @@ image: https://hub.docker.com/r/hunyuaninfer/hunyuan-a13b/tags

We provide a pre-built Docker image based on the latest version of TensorRT-LLM.

- - To get started:
-
- https://hub.docker.com/r/hunyuaninfer/hunyuan-large/tags
+ - To get started, download the Docker image:

+ **From Docker Hub:**
```
docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
```
+
+ **From the China mirror (thanks to [CNB](https://cnb.cool/ "CNB.cool")):**
+
+
+ First, pull the image from CNB:
+ ```
+ docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-trtllm
+ ```
+
+ Then, tag the image so it matches the name used in the scripts below:
+ ```
+
+ docker tag docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-trtllm hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
+ ```
+
+
+ - Start the Docker container:
+
```
docker run --name hunyuanLLM_infer --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
```
@@ -287,10 +309,10 @@ docker run --rm --ipc=host \
```

### Source Code
- Support for this model has been added via this [PR 20114](https://github.com/vllm-project/vllm/pull/20114 ) in the vLLM project.
-
- You can build and run vLLM from source after merging this pull request into your local repository.
+ Support for this model was added via [PR 20114](https://github.com/vllm-project/vllm/pull/20114) in the vLLM project;
+ the patch was merged by the community on July 1, 2025.

+ You can build and run vLLM from source from any commit at or after `ecad85`.

### Model Context Length Support
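The toggle shown in the hunks above only changes whether the model emits its reasoning; the decoded output is parsed with the `<think>`/`<answer>` regular expressions that appear in the README_CN.md diff below. The following is a small Python helper along those lines; the fallbacks are an assumption for fast-thinking mode, where a `<think>` block may be absent from the output.

```python
import re

def split_think_and_answer(output_text: str):
    """Split a decoded response into (thinking, answer).

    Mirrors the <think>/<answer> regex parsing shown in the README_CN.md
    diff below. The fallbacks are an assumption for fast-thinking mode
    (enable_thinking=False), where the <think> block may be absent.
    """
    think_matches = re.findall(r"<think>(.*?)</think>", output_text, re.DOTALL)
    answer_matches = re.findall(r"<answer>(.*?)</answer>", output_text, re.DOTALL)
    think_content = think_matches[0].strip() if think_matches else ""
    answer_content = answer_matches[0].strip() if answer_matches else output_text.strip()
    return think_content, answer_content
```

Called on `output_text` from the snippet above, it returns the reasoning trace (possibly empty) and the final answer.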
 
README_CN.md CHANGED
@@ -89,6 +89,75 @@ Hunyuan-A13B adopts a fine-grained Mixture of Experts (Fine-grained Mixture of Experts, F
| **NLU** | ComplexNLU<br>Word-Task | 64.7<br>67.1 | 64.5<br>76.3 | 59.8<br>56.4 | 61.2<br>62.9 |
| **Agent** | BDCL v3<br> τ-Bench<br>ComplexFuncBench<br> $C^3$-Bench | 67.8<br>60.4<br>47.6<br>58.8 | 56.9<br>43.8<br>41.1<br>55.3 | 70.8<br>44.6<br>40.6<br>51.7 | 78.3<br>54.7<br>61.2<br>63.5 |

+ ## Inference with transformers
+
+ Our model uses "slow thinking" (i.e., reasoning mode) by default. There are two ways to disable CoT (Chain-of-Thought) reasoning:
+ 1. Pass `enable_thinking=False` when calling `apply_chat_template`.
+ 2. Prefix the prompt with `/no_think` to force the model to skip CoT reasoning; likewise, prefix it with `/think` to force CoT reasoning on (see the short sketch after this diff).
+
+ The following code snippet shows how to load and use the model with the `transformers` library.
+ It also demonstrates how to enable and disable reasoning mode,
+ and how to parse the reasoning process and the final output.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import os
+ import re
+
+ model_name_or_path = os.environ['MODEL_PATH']
+ # model_name_or_path = "tencent/Hunyuan-A13B-Instruct"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)  # You may want to use bfloat16 and/or move to GPU here
+ messages = [
+     {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
+ ]
+
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     enable_thinking=True
+ )
+
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+ model_inputs.pop("token_type_ids", None)
+ outputs = model.generate(**model_inputs, max_new_tokens=4096)
+
+
+ output_text = tokenizer.decode(outputs[0])
+
+ think_pattern = r'<think>(.*?)</think>'
+ think_matches = re.findall(think_pattern, output_text, re.DOTALL)
+
+ answer_pattern = r'<answer>(.*?)</answer>'
+ answer_matches = re.findall(answer_pattern, output_text, re.DOTALL)
+
+ think_content = [match.strip() for match in think_matches][0]
+ answer_content = [match.strip() for match in answer_matches][0]
+ print(f"thinking_content:{think_content}\n\n")
+ print(f"answer_content:{answer_content}\n\n")
+ ```
+
+
+
+ ### Switching Between Fast and Slow Thinking
+
+ This model supports two modes of operation:
+
+ - **Slow thinking mode (default)**: performs detailed internal reasoning steps before generating the final answer.
+ - **Fast thinking mode**: skips the internal reasoning process and outputs the final answer directly, for faster inference.
+
+ **Switching to fast thinking mode:**
+
+ To disable the reasoning process, set `enable_thinking=False` in the `apply_chat_template` call:
+
+ ```python
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     enable_thinking=False  # Use fast thinking mode
+ )
+ ```

## Inference and Deployment

@@ -246,9 +315,7 @@ docker run --rm --ipc=host \

### Source Code Deployment

- Support for this model has been submitted to the vLLM project via [PR 20114](https://github.com/vllm-project/vllm/pull/20114).
-
- After merging this PR into your local repository, you can build and run vLLM from source.
+ Support for this model was submitted to the vLLM project via [PR 20114](https://github.com/vllm-project/vllm/pull/20114) and has since been merged; you can build vLLM from source from any version at or after git commit `ecad85`.


### Model Context Length Support
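Item 2 in the list above mentions the `/think` and `/no_think` prompt prefixes but does not show them in code. Below is a minimal sketch, assuming the prefix is simply prepended to the user message content; the exact placement is not specified in the diff.

```python
from transformers import AutoTokenizer

# Same model path as in the example above.
tokenizer = AutoTokenizer.from_pretrained("tencent/Hunyuan-A13B-Instruct", trust_remote_code=True)

# /no_think asks the model to skip CoT reasoning; /think asks it to reason.
# Prepending the prefix to the user message is an assumption for illustration.
no_think_messages = [
    {"role": "user", "content": "/no_think Write a short summary of the benefits of regular exercise"},
]
think_messages = [
    {"role": "user", "content": "/think Write a short summary of the benefits of regular exercise"},
]

# The rest of the pipeline (tokenize, generate, decode) is identical to the snippet above.
text = tokenizer.apply_chat_template(no_think_messages, tokenize=False)
```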