manaestras committed
Commit cb87680 · verified · 1 Parent(s): 7ae536b

Update README.md

Files changed (1):
1. README.md +100 -146

README.md CHANGED
@@ -1,35 +1,53 @@
  ---
  license: other
  license_name: tencent-hunyuan-a13b
- license_link: LICENSE
  ---

  <p align="center">
  <img src="https://dscache.tencent-cloud.cn/upload/uploader/hunyuan-64b418fd052c033b228e04bc77bbc4b54fd7f5bc.png" width="400"/> <br>
  </p><p></p>

  <p align="center">
- &nbsp<a href="https://github.com/Tencent/Hunyuan-A13B"><b>GITHUB</b></a>&nbsp&nbsp

- Welcome to the official repository of **Hunyuan-A13B**, an innovative and open-source large language model (LLM) built on a fine-grained Mixture-of-Experts (MoE) architecture. Designed for efficiency and scalability, Hunyuan-A13B delivers cutting-edge performance with minimal computational overhead, making it an ideal choice for advanced reasoning and general-purpose applications, especially in resource-constrained environments.
-
- ## Key Features and Highlights
-
- - **High Performance with Fewer Parameters**: With only 13B active parameters (out of a total of 80B), Hunyuan-A13B achieves competitive results compared to much larger models across diverse benchmark tasks.
- - **Robust Pre-Training and Optimization**: Trained on a massive 20TB high-quality dataset, the model benefits from structured supervised fine-tuning and reinforcement learning strategies to enhance its reasoning, language comprehension, and general knowledge capabilities.
- - **Dual-Mode Chain-of-Thought (CoT) Framework**: This unique feature allows dynamic adjustment of reasoning depth, balancing computational efficiency with accuracy. It supports both concise responses for simple tasks and in-depth reasoning for complex challenges.
- - **Exceptional Long-Context Understanding**: Hunyuan-A13B natively supports a 256K context window, maintaining robust performance in long-text tasks.
- - **Advanced Agent-Oriented Capabilities**: Tailored optimizations enable effective handling of complex decision-making, with leading performance on agent benchmarks such as BFCL-v3 and τ-Bench.
- - **Superior Inference Efficiency**: Architectural innovations, including Grouped Query Attention (GQA) and support for multiple quantization formats, result in exceptional inference speed.
-
- ## Why Choose Hunyuan-A13B?
-
- Hunyuan-A13B stands out as a powerful, scalable, and computationally efficient LLM, perfectly suited for researchers and developers seeking high performance without the burden of excessive resource demands. Whether you're working on academic research, building cost-effective AI solutions, or exploring novel applications, Hunyuan-A13B provides a versatile foundation to build upon.
-
  &nbsp;

  ## Related News
@@ -60,146 +78,120 @@ Note: The following benchmarks are evaluated by TRT-LLM-backend
  | MMMLU | 76.89 | 79.28 * | 83.83 | 86.70 | 84.68 |

- &nbsp;

- | Topic | Bench | OpenAI-o1-1217 | DeepSeek R1 | Qwen3-A22B | Hunyuan-A13B-Instruct |
- |:-------------------:|:-----------------------------:|:-------------:|:------------:|:-----------:|:---------------------:|
- | **Mathematics** | AIME 2024<br>AIME 2025<br>MATH | 74.3<br>79.2<br>96.4 | 79.8<br>70<br>94.9 | 85.7<br>81.5<br>94.0 | 87.3<br>76.8<br>94.3 |
- | **Science** | GPQA-Diamond<br>OlympiadBench | 78<br>83.1 | 71.5<br>82.4 | 71.1<br>85.7 | 71.2<br>82.7 |
- | **Coding** | Livecodebench<br>Fullstackbench<br>ArtifactsBench | 63.9<br>64.6<br>38.6 | 65.9<br>71.6<br>44.6 | 70.7<br>65.6<br>44.6 | 63.9<br>67.8<br>43 |
- | **Reasoning** | BBH<br>DROP<br>ZebraLogic | 80.4<br>90.2<br>81 | 83.7<br>92.2<br>78.7 | 88.9<br>90.3<br>80.3 | 89.1<br>91.1<br>84.7 |
- | **Instruction<br>Following** | IF-Eval<br>SysBench | 91.8<br>82.5 | 88.3<br>77.7 | 83.4<br>74.2 | 84.7<br>76.1 |
- | **Text<br>Creation**| LengthCtrl<br>InsCtrl | 60.1<br>74.8 | 55.9<br>69 | 53.3<br>73.7 | 55.4<br>71.9 |
- | **NLU** | ComplexNLU<br>Word-Task | 64.7<br>67.1 | 64.5<br>81.8 | 59.8<br>56.4 | 61.2<br>62.9 |
- | **Agent** | BDCL v3<br> $\tau$-bench<br>ComplexFuncBench<br> $C^3$-Bench | 67.8<br>60.4<br>47.6<br>58.8 | 63.8<br>58.7<br>n/a<br>55.3 | 70.8<br>46.7<br>n/a<br>51.7 | 78.3<br>54.7<br>51.2<br>63.5 |
- | **Average** | - | n/a | n/a | n/a | n/a |

-
-
- ## Quick Start
-
- You can refer to the content in [Hunyuan-A13B](https://github.com/Tencent-Hunyuan/Hunyuan-A13B) to get started quickly. The training and inference code can use the version provided in this github repository.
-
- ### Transformer

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  import os

- def main():
-     model_name_or_path = os.environ['MODEL_PATH']
-
-     tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
-     model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto",
-                                                  trust_remote_code=True)  # You may want to use bfloat16 and/or move to GPU here
-     for name, param in model.named_parameters():
-         print(f"{name}: {param.size()}")
-     messages = [
-         {
-             "role": "system",
-             "content": "You are a helpful assistant.",
-         },
-         {"role": "user", "content": "Write a short summary of the benefits of regular exercise."},
-     ]
-     tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
-     outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=100, do_sample=True)
-     print(tokenizer.decode(outputs[0]))
-
- if __name__ == '__main__':
-     main()
  ```

- ## Deployment
-
- For deployment, you can use frameworks such as *vLLM*, *SGLang*, or *TensorRT-LLM* to serve the model and create an OpenAI-compatible API endpoint.

  ### vllm

  #### Docker Image
- We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model. The official support is currently under development.
-
  - To get started:
- ```
- Pull the Docker image: docker pull xxx
- ```
- - Start the API server:
  ```
- docker start xxx
- ```
-
- #### Source Code
-
- Support for this model has been added via this PR (https://github.com/vllm-project/vllm/pull/20114) in the vLLM project.
- You can build and run vLLM from source after merging this pull request into your local repository.
-
- After applying the changes, you can start the API server by following the standard vLLM setup instructions.
-
- ### SGLang
-
- #### Docker Image
-
- We also provide a pre-built Docker image based on the latest version of SGLang.
-
- To get started:
-
- - Pull the Docker image

  ```
- docker pull xxx
- ```

  - Start the API server:

  ```
- docker run --gpus all \
-     --shm-size 32g \
-     -p 30000:30000 \
-     --ipc=host \
-     xxx \
-     python3 -m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000
- ```
-
- #### Source Code

- The necessary integration has already been merged into the main branch via this PR (https://github.com/sgl-project/sglang/pull/7549).
- Once you have cloned or updated your local SGLang repository, you can build and run the API server using the standard SGLang setup process.
-
- After applying the changes, you can start the API server by following the standard SGLang setup instructions.

  ```
- python3 -m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000
  ```

-
- ### TensorRT-LLM
-
  #### Docker Image

- We also provide a pre-built Docker image based on the latest version of TensorRT-LLM.

  To get started:

  - Pull the Docker image

  ```
- docker pull xxx
  ```

  - Start the API server:
@@ -209,48 +201,10 @@ docker run --gpus all \
      --shm-size 32g \
      -p 30000:30000 \
      --ipc=host \
-     xxx \
-     python3 -m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000
  ```

- #### Source Code
-
- The necessary integration has already been merged into the main branch via this PR (xxx).
- Once you have cloned or updated your local TensorRT-LLM repository, you can build and run the API server using the standard TensorRT-LLM setup process.
-
- After applying the changes, you can start the API server by following the standard TensorRT-LLM setup instructions.
-
- ## Inference Performance
-
- This section presents the efficiency test results of deploying various models using vLLM, including inference speed (tokens/s) under different batch sizes.
-
- Evaluation Script:
- ```bash
- python3 benchmark_throughput.py --backend vllm \
-     --input-len 2048 \
-     --output-len 14336 \
-     --model $MODEL_PATH \
-     --tensor-parallel-size $TP \
-     --use-v2-block-manager \
-     --async-engine \
-     --trust-remote-code \
-     --num_prompts $BATCH_SIZE \
-     --max-num-seqs $BATCH_SIZE
- ```
-
- | Inference Framework | Model | Number of GPUs (GPU productA) | input_length | batch=1 | batch=16 | batch=32 |
- |------|-----------------------------|-----------|-------------------------|---------------------|----------------------|----------------------|
- | vLLM | Hunyuan-A13B-Instruct | 8 | 2048 | 190.84 | 1246.54 | 1981.99 |
- | vLLM | Hunyuan-A13B-Instruct | 4 | 2048 | 158.90 | 779.10 | 1301.75 |
- | vLLM | Hunyuan-A13B-Instruct | 2 | 2048 | 111.72 | 327.31 | 346.54 |
- | vLLM | Hunyuan-A13B-Instruct (int8 weight only) | 2 | 2048 | 109.10 | 444.17 | 721.93 |
- | vLLM | Hunyuan-A13B-Instruct (W8A8C8-FP8) | 2 | 2048 | 91.83 | 372.01 | 617.70 |
- | vLLM | Hunyuan-A13B-Instruct (W8A8C8-FP8) | 1 | 2048 | 60.07 | 148.80 | 160.41 |

  ## Contact Us

  ---
  license: other
  license_name: tencent-hunyuan-a13b
+ license_link: https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/LICENSE
  ---

+ <p align="left">
+ <a href="README_CN.md">中文</a>&nbsp;|&nbsp;English
+ </p>
+ <br><br>
+

  <p align="center">
  <img src="https://dscache.tencent-cloud.cn/upload/uploader/hunyuan-64b418fd052c033b228e04bc77bbc4b54fd7f5bc.png" width="400"/> <br>
  </p><p></p>

+
+ <p align="center">
+ 🫣&nbsp;<a href="https://huggingface.co/tencent/Hunyuan-A13B-Instruct"><b>Hugging Face</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;
+ 🖥️&nbsp;<a href="https://llm.hunyuan.tencent.com/" style="color: red;"><b>Official Website</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;
+ 🕖&nbsp;<a href="https://cloud.tencent.com/product/hunyuan"><b>HunyuanAPI</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;
+ 🕹️&nbsp;<a href="https://hunyuan.tencent.com/?model=hunyuan-a13b"><b>Demo</b></a>&nbsp;&nbsp;|&nbsp;&nbsp;
+ <img src="https://avatars.githubusercontent.com/u/109945100?s=200&v=4" width="16"/>&nbsp;<a href="https://modelscope.cn/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct"><b>ModelScope</b></a>
+ </p>
+
  <p align="center">
+ <a href="https://github.com/Tencent/Hunyuan-A13B"><b>GITHUB</b></a>
+ </p>
+

+ Welcome to the official repository of **Hunyuan-A13B**, an innovative and open-source large language model (LLM) built on a fine-grained Mixture-of-Experts (MoE) architecture. Designed for efficiency and scalability, Hunyuan-A13B delivers cutting-edge performance with minimal computational overhead, making it an ideal choice for advanced reasoning and general-purpose applications, especially in resource-constrained environments.
+
+ ## Model Introduction

+ With the rapid advancement of artificial intelligence technology, large language models (LLMs) have achieved remarkable progress in natural language processing, computer vision, and scientific tasks. However, as model scales continue to expand, optimizing resource consumption while maintaining high performance has become a critical challenge. To address this, we have explored Mixture-of-Experts (MoE) architectures. The newly introduced Hunyuan-A13B model features a total of 80 billion parameters with 13 billion active parameters. It not only delivers high-performance results but also achieves optimal resource efficiency, successfully balancing computational power and resource utilization.

+ ### Key Features and Advantages

+ - **Compact yet Powerful**: With only 13 billion active parameters (out of a total of 80 billion), the model delivers competitive performance on a wide range of benchmark tasks, rivaling much larger models.
+ - **Hybrid Inference Support**: Supports both fast and slow thinking modes, allowing users to flexibly choose according to their needs.
+ - **Ultra-Long Context Understanding**: Natively supports a 256K context window, maintaining stable performance on long-text tasks.
+ - **Enhanced Agent Capabilities**: Optimized for agent tasks, achieving leading results on benchmarks such as BFCL-v3 and τ-Bench.
+ - **Efficient Inference**: Utilizes Grouped Query Attention (GQA) and supports multiple quantization formats, enabling highly efficient inference.

+ ### Why Choose Hunyuan-A13B?
+
+ As a powerful yet computationally efficient large model, Hunyuan-A13B is an ideal choice for researchers and developers seeking high performance under resource constraints. Whether for academic research, cost-effective AI solution development, or innovative application exploration, this model provides a robust foundation for advancement.

  &nbsp;

  ## Related News

  | MMMLU | 76.89 | 79.28 * | 83.83 | 86.70 | 84.68 |

+ Hunyuan-A13B-Instruct has achieved highly competitive performance across multiple benchmarks, particularly in mathematics, science, agent domains, and more. We compared it with several strong models, and the results are shown below.

+ | Topic | Bench | OpenAI-o1-1217 | DeepSeek R1 | Qwen3-A22B | Hunyuan-A13B-Instruct |
+ |:-------------------:|:------------------------------------------------------------:|:-------------:|:------------:|:-----------:|:---------------------:|
+ | **Mathematics** | AIME 2024<br>AIME 2025<br>MATH | 74.3<br>79.2<br>96.4 | 79.8<br>70<br>94.9 | 85.7<br>81.5<br>94.0 | 87.3<br>76.8<br>94.3 |
+ | **Science** | GPQA-Diamond<br>OlympiadBench | 78<br>83.1 | 71.5<br>82.4 | 71.1<br>85.7 | 71.2<br>82.7 |
+ | **Coding** | Livecodebench<br>Fullstackbench<br>ArtifactsBench | 63.9<br>64.6<br>38.6 | 65.9<br>71.6<br>44.6 | 70.7<br>65.6<br>44.6 | 63.9<br>67.8<br>43 |
+ | **Reasoning** | BBH<br>DROP<br>ZebraLogic | 80.4<br>90.2<br>81 | 83.7<br>92.2<br>78.7 | 88.9<br>90.3<br>80.3 | 89.1<br>91.1<br>84.7 |
+ | **Instruction<br>Following** | IF-Eval<br>SysBench | 91.8<br>82.5 | 88.3<br>77.7 | 83.4<br>74.2 | 84.7<br>76.1 |
+ | **Text<br>Creation**| LengthCtrl<br>InsCtrl | 60.1<br>74.8 | 55.9<br>69 | 53.3<br>73.7 | 55.4<br>71.9 |
+ | **NLU** | ComplexNLU<br>Word-Task | 64.7<br>67.1 | 64.5<br>81.8 | 59.8<br>56.4 | 61.2<br>62.9 |
+ | **Agent** | BFCL v3<br> $\tau$-bench<br>ComplexFuncBench<br> $C^3$-Bench | 67.8<br>60.4<br>47.6<br>58.8 | 63.8<br>58.7<br>n/a<br>55.3 | 70.8<br>46.7<br>n/a<br>51.7 | 78.3<br>54.7<br>51.2<br>63.5 |
+ | **Average** | - | n/a | n/a | n/a | n/a |

+ &nbsp;

+ ## Use with transformers
+ The following code snippet shows how to use the transformers library to load and run the model. It also demonstrates how to enable or disable the reasoning mode (a variant with reasoning disabled is sketched right after the snippet), and how to parse the reasoning process along with the final output.

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  import os
+ import re
+
+ model_name_or_path = os.environ['MODEL_PATH']
+ # model_name_or_path = "tencent/Hunyuan-A13B-Instruct"

+ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)  # You may want to use bfloat16 and/or move to GPU here
+ messages = [
+     {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
+ ]
+ tokenized_chat = tokenizer.apply_chat_template(
+     messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
+     enable_thinking=True  # Toggle thinking mode (default: True)
+ )
+
+ outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=4096)
+
+ output_text = tokenizer.decode(outputs[0])
+
+ # The reasoning trace is wrapped in <think>...</think> and the final reply in <answer>...</answer>
+ think_pattern = r'<think>(.*?)</think>'
+ think_matches = re.findall(think_pattern, output_text, re.DOTALL)
+
+ answer_pattern = r'<answer>(.*?)</answer>'
+ answer_matches = re.findall(answer_pattern, output_text, re.DOTALL)
+
+ # Guard against empty match lists (e.g. when thinking mode is disabled)
+ think_content = think_matches[0].strip() if think_matches else ""
+ answer_content = answer_matches[0].strip() if answer_matches else ""
+ print(f"thinking_content:{think_content}\n\n")
+ print(f"answer_content:{answer_content}\n\n")
  ```
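The same pipeline also works with the reasoning mode turned off. A minimal sketch that reuses the `tokenizer`, `model`, and `messages` objects from the snippet above (it assumes the chat template simply omits the thinking stage when `enable_thinking=False`):

```python
# Fast-thinking variant: disable the reasoning stage via the chat template.
tokenized_chat = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
    enable_thinking=False  # skip slow thinking for a shorter, direct reply
)
outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=1024)
output_text = tokenizer.decode(outputs[0])

# Without the thinking stage there may be no <think> block, so parse defensively.
answer_matches = re.findall(r'<answer>(.*?)</answer>', output_text, re.DOTALL)
print(answer_matches[0].strip() if answer_matches else output_text)
```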

+ ## Deployment

+ For deployment, you can use frameworks such as **vLLM**, **SGLang**, or **TensorRT-LLM** to serve the model and create an OpenAI-compatible API endpoint.


  ### vllm

  #### Docker Image
+ We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model. The official vLLM release is currently under development. **Note: CUDA 12.8 is required for this Docker image.**

  - To get started:

+ https://hub.docker.com/r/hunyuaninfer/hunyuan-large/tags

  ```
+ docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm
  ```
+
+ - Download the model files:
+   - Hugging Face: downloaded automatically by vLLM (a pre-download option is sketched below).
+   - ModelScope: `modelscope download --model Tencent-Hunyuan/Hunyuan-A13B-Instruct`
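If you prefer to fetch the Hugging Face weights ahead of time instead of letting vLLM download them on first start, a minimal sketch using `huggingface_hub` (the local target directory is just an example):

```python
from huggingface_hub import snapshot_download

# Pre-download the model weights into a local directory; vLLM can then be pointed at this path.
local_dir = snapshot_download(
    repo_id="tencent/Hunyuan-A13B-Instruct",
    local_dir="./Hunyuan-A13B-Instruct",
)
print(f"Model downloaded to: {local_dir}")
```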

  - Start the API server:

+ Model downloaded from Hugging Face:
  ```
+ docker run --privileged --user root --net=host --ipc=host \
+     -v ~/.cache:/root/.cache/ \
+     --gpus=all -it --entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
+     -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 \
+     --tensor-parallel-size 4 --model tencent/Hunyuan-A13B-Instruct --trust-remote-code
+ ```

+ Model downloaded from ModelScope:
  ```
+ docker run --privileged --user root --net=host --ipc=host \
+     -v ~/.cache/modelscope:/root/.cache/modelscope \
+     --gpus=all -it --entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
+     -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --tensor-parallel-size 4 --port 8000 \
+     --model /root/.cache/modelscope/hub/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct/ --trust-remote-code
  ```
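Once the server is running, it exposes an OpenAI-compatible API on port 8000. A minimal client sketch using the `openai` Python package (the package is assumed to be installed separately; adjust host, port, and model name to match your deployment):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not require a real API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",  # must match the --model value passed to the server
    messages=[{"role": "user", "content": "Write a short summary of the benefits of regular exercise"}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```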

+ ### SGLang

  #### Docker Image

+ We also provide a pre-built Docker image based on the latest version of SGLang.

  To get started:

  - Pull the Docker image

  ```
+ docker pull tiacc-test.tencentcloudcr.com/tiacc/sglang:0.4.7
  ```

  - Start the API server:

  ```
  docker run --gpus all \
      --shm-size 32g \
      -p 30000:30000 \
      --ipc=host \
+     tiacc-test.tencentcloudcr.com/tiacc/sglang:0.4.7 \
+     -m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000
  ```
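SGLang's `launch_server` also exposes an OpenAI-compatible chat completions route, so the server started above can be queried on port 30000. A minimal sketch using `requests` (host, port, and model name here are assumptions; the model name should generally match the `--model-path` value):

```python
import requests

# Simple sanity check against the OpenAI-compatible endpoint served by SGLang.
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "hunyuan/huanyuan_A13B",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```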

  ## Contact Us