---
license: other
license_name: tencent-hunyuan-a13b
license_link: LICENSE
---


<p align="center">
 <img src="https://dscache.tencent-cloud.cn/upload/uploader/hunyuan-64b418fd052c033b228e04bc77bbc4b54fd7f5bc.png" width="400"/> <br>
</p><p></p>

<p align="center">
 &nbsp;<a href="https://github.com/Tencent/Hunyuan-A13B"><b>GITHUB</b></a>&nbsp;&nbsp;
</p>

## Model Introduction

This release from Tencent Hunyuan includes the A13B models [Tencent-Hunyuan-A13B-Pretrain](https://huggingface.co/tencent/Hunyuan-A13B-Pretrain), [Tencent-Hunyuan-A13B-Instruct](https://huggingface.co/tencent/Hunyuan-A13B-Instruct), and [Tencent-Hunyuan-A13B-Instruct-FP8](https://huggingface.co/tencent/Tencent-Hunyuan-A13B-Instruct-FP8). These models benefit from improved data allocation and training, deliver strong performance, and strike a good balance between compute cost and capability. Hunyuan-A13B stands out among large language models and is currently one of the strongest Chinese Mixture-of-Experts (MoE) models, with 80 billion total parameters and 13 billion active parameters.

### Technical Advantages

**Model**

- **High-Quality Synthetic Data**: By enhancing training with synthetic data, Hunyuan-A13B learns richer representations, handles long-context inputs, and generalizes better to unseen data.

- **KV Cache Compression**: Grouped Query Attention (GQA) and Cross-Layer Attention (CLA) significantly reduce the memory usage and computational overhead of the KV cache, improving inference throughput (see the sizing sketch after this list).

- **Expert-Specific Learning Rate Scaling**: Different learning rates are assigned to different experts, ensuring that each sub-model learns effectively from the data and contributes to overall performance.

- **Long-Context Processing Capability**: Both the pre-trained model and the instruction-tuned model support text sequences of up to 256K tokens, significantly enhancing the ability to handle long-context tasks.

- **Extensive Benchmarking**: Extensive experiments across multiple languages and tasks have validated the practical effectiveness and safety of Hunyuan-A13B.

- **Hybrid Reasoning Capability**: The model supports both fast-thinking and slow-thinking inference modes (see the usage note in the Quick Start section).
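
To make the KV-cache savings concrete, here is a back-of-envelope sizing sketch. The layer count, head count, and head dimension come from the architecture specs below; the KV-head count, the CLA sharing factor, and FP16 storage are illustrative assumptions, not the model's published configuration.

```python
# Rough KV-cache sizing: 2 (K and V) x layers x KV heads x head_dim x tokens x bytes.
# The 8 KV heads and the 2x cross-layer sharing factor are assumptions for illustration.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

layers, q_heads, head_dim, seq_len = 32, 32, 128, 256 * 1024  # 256K-token context

mha = kv_cache_bytes(layers, q_heads, head_dim, seq_len)  # baseline: one KV head per query head
gqa = kv_cache_bytes(layers, 8, head_dim, seq_len)        # GQA: e.g. 8 KV heads serve 32 query heads
cla = kv_cache_bytes(layers // 2, 8, head_dim, seq_len)   # CLA: e.g. adjacent layer pairs share one KV cache

print(f"MHA baseline: {mha / 2**30:.0f} GiB per 256K-token sequence")
print(f"With GQA:     {gqa / 2**30:.0f} GiB ({mha // gqa}x smaller)")
print(f"GQA + CLA:    {cla / 2**30:.0f} GiB ({mha // cla}x smaller)")
```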

**Architecture**

Hunyuan-A13B adopts a fine-grained Mixture of Experts (MoE) architecture, comprising 80 billion total parameters with 13 billion active parameters. The model was trained on over 20 trillion tokens and supports a context length of up to 256K tokens. The detailed specifications of the model architecture are as follows (a toy routing sketch follows the list):

- **Total Parameters**: 80B
- **Active Parameters**: 13B
- **Number of Layers**: 32
- **Attention Heads**: 32
- **Number of Shared Experts**: 1
- **Number of Non-Shared Experts**: 64
- **Routing Strategy**: Top-8
- **Activation Function**: SwiGLU
- **Hidden Layer Dimension**: 4096
- **Expert Hidden Layer Dimension**: 3072
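
As a concrete illustration of the routing scheme, the toy layer below implements top-8 routing over 64 non-shared experts plus one always-active shared expert, using the dimensions listed above. It is a minimal sketch for intuition, not the model's actual implementation; the naive per-token loop trades speed for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp(hidden, expert_hidden):
    # Simplified expert: the real model uses SwiGLU; plain SiLU stands in here for brevity.
    return nn.Sequential(nn.Linear(hidden, expert_hidden), nn.SiLU(), nn.Linear(expert_hidden, hidden))


class ToyFineGrainedMoE(nn.Module):
    def __init__(self, hidden=4096, expert_hidden=3072, n_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.shared_expert = mlp(hidden, expert_hidden)  # 1 shared expert, always active
        self.experts = nn.ModuleList(mlp(hidden, expert_hidden) for _ in range(n_experts))

    def forward(self, x):                                # x: (tokens, hidden)
        probs = F.softmax(self.router(x), dim=-1)        # (tokens, n_experts) routing probabilities
        top_w, top_i = probs.topk(self.top_k, dim=-1)    # top-8 routing
        top_w = top_w / top_w.sum(-1, keepdim=True)      # renormalize over the chosen experts
        rows = []
        for t in range(x.size(0)):                       # naive per-token dispatch, for clarity
            acc = self.shared_expert(x[t])
            for w, i in zip(top_w[t], top_i[t]):
                acc = acc + w * self.experts[int(i)](x[t])
            rows.append(acc)
        return torch.stack(rows)


# Shrunken dimensions so the example runs quickly; the real layer uses the specs above.
layer = ToyFineGrainedMoE(hidden=64, expert_hidden=48, n_experts=16, top_k=4)
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```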

&nbsp;

## Related News
* 2025.6.27 We open-sourced **Hunyuan-A13B-Pretrain**, **Hunyuan-A13B-Instruct**, and **Hunyuan-A13B-Instruct-FP8** on Hugging Face.
<br>

## Benchmark

Note: The following benchmarks were evaluated with the TRT-LLM backend.

| Model | Hunyuan-Large | Qwen2.5-72B | Qwen3-32B | Qwen3-A22B | Hunyuan-A13B |
|------------------|---------------|--------------|---------------|-------------|---------------|
| MMLU | 88.4 | 86.1 | 83.61 | 87.81 | 88.17 |
| MMLU-Pro | 60.20 | 58.10 | 65.54 | 68.18 | 67.23 |
| MMLU-Redux | 87.47 | 83.90 | 83.41 | 87.40 | 87.67 |
| BBH | 86.30 | 85.8 | 87.38 | 88.87 | 87.56 |
| SuperGPQA | 38.90 | 37.84 * | 39.78 | 44.06 | 41.32 |
| EvalPlus | 75.69 | 66.05 | 72.05 | 77.60 | 78.64 |
| MultiPL-E | 59.13 | 61.00 | 67.06 | 65.94 | 69.33 |
| MBPP | 72.60 | 84.70 | 78.20 | 81.40 | 83.86 |
| CRUX-O | 60.63 | 56.00 * | 72.50 | 79.00 | 77.00 |
| MATH | 69.80 | 62.1 | 61.62 | 71.84 | 72.35 |
| GSM8k | 92.80 | 91.5 | 93.40 | 94.39 | 91.83 |
| GPQA | - | 45.9 | 47.97 | 47.47 | 43.44 |
| INCLUDE | 66.48 | 76.98 * | 67.97 | 73.46 | 74.90 |
| MGSM | 67.52 | 79.53 * | 82.68 | 83.53 | 76.00 |
| MMMLU | 76.89 | 79.28 * | 83.83 | 86.70 | 84.68 |


&nbsp;


| Topic | Bench | OpenAI-o1-1217 | DeepSeek R1 | Qwen3-A22B | Hunyuan-A13B-Instruct |
|:-------------------:|:-----------------------------:|:-------------:|:------------:|:-----------:|:---------------------:|
| **Mathematics** | AIME 2024<br>AIME 2025<br>MATH | 74.3<br>79.2<br>96.4 | 79.8<br>70<br>94.9 | 85.7<br>81.5<br>94.0 | 87.3<br>76.8<br>94.3 |
| **Science** | GPQA-Diamond<br>OlympiadBench | 78<br>83.1 | 71.5<br>82.4 | 71.1<br>85.7 | 71.2<br>82.7 |
| **Coding** | Livecodebench<br>Fullstackbench<br>ArtifactsBench | 63.9<br>64.6<br>38.6 | 65.9<br>71.6<br>44.6 | 70.7<br>65.6<br>44.6 | 63.9<br>67.8<br>43 |
| **Reasoning** | BBH<br>DROP<br>ZebraLogic | 80.4<br>90.2<br>81 | 83.7<br>92.2<br>78.7 | 88.9<br>90.3<br>80.3 | 89.1<br>91.1<br>84.7 |
| **Instruction<br>Following** | IF-Eval<br>SysBench | 91.8<br>82.5 | 88.3<br>77.7 | 83.4<br>74.2 | 84.7<br>76.1 |
| **Text<br>Creation**| LengthCtrl<br>InsCtrl | 60.1<br>74.8 | 55.9<br>69 | 53.3<br>73.7 | 55.4<br>71.9 |
| **NLU** | ComplexNLU<br>Word-Task | 64.7<br>67.1 | 64.5<br>81.8 | 59.8<br>56.4 | 61.2<br>62.9 |
| **Agent** | BFCL v3<br>$\tau$-bench<br>ComplexFuncBench<br>$C^3$-Bench | 67.8<br>60.4<br>47.6<br>58.8 | 63.8<br>58.7<br>n/a<br>55.3 | 70.8<br>46.7<br>n/a<br>51.7 | 78.3<br>54.7<br>51.2<br>63.5 |


## Quick Start

You can refer to [Hunyuan-A13B](https://github.com/Tencent-Hunyuan/Hunyuan-A13B) to get started quickly. For training and inference, use the code provided in that GitHub repository.

### Transformers

```python
import os

from transformers import AutoModelForCausalLM, AutoTokenizer


def main():
    model_name_or_path = os.environ["MODEL_PATH"]

    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
    # device_map="auto" spreads the model across available GPUs; you may also
    # want to pass torch_dtype="bfloat16" to reduce memory usage.
    model = AutoModelForCausalLM.from_pretrained(
        model_name_or_path, device_map="auto", trust_remote_code=True
    )

    # Optional: inspect parameter names and shapes.
    for name, param in model.named_parameters():
        print(f"{name}: {param.size()}")

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short summary of the benefits of regular exercise."},
    ]
    tokenized_chat = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    )
    outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=100, do_sample=True)
    print(tokenizer.decode(outputs[0]))


if __name__ == "__main__":
    main()
```
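
The hybrid reasoning modes mentioned above are switched at prompt-construction time. The `enable_thinking` argument below follows the convention used by recent hybrid-reasoning chat templates and is an assumption on our part; check this model's chat template for the exact switch it expects.

```python
# Hypothetical fast/slow thinking toggle, passed through to the chat template.
# Extra keyword arguments to apply_chat_template are forwarded to the template,
# so this only works if the template actually defines an `enable_thinking` flag.
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=False,  # False: fast thinking (direct answer); True: slow thinking (reasoning)
)
```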

## Deployment

For deployment, you can use frameworks such as *vLLM*, *SGLang*, or *TensorRT-LLM* to serve the model and create an OpenAI-compatible API endpoint.
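
Once one of the servers below is running, you can exercise the endpoint with any OpenAI-compatible client. The sketch below assumes a server listening on `localhost:30000` (the port used in the launch examples that follow) and that the served model name matches the model path passed at launch.

```python
# Minimal client for an OpenAI-compatible endpoint served by vLLM/SGLang/TensorRT-LLM.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")  # local servers ignore the key

response = client.chat.completions.create(
    model="hunyuan/huanyuan_A13B",  # must match the model name/path the server was launched with
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short summary of the benefits of regular exercise."},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```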

### vLLM

#### Docker Image

We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model. Official upstream support is currently under development.

To get started:

- Pull the Docker image:

```
docker pull xxx
```

- Start the API server:

```
docker run xxx
```

#### Source Code

Support for this model has been added to the vLLM project via [PR #20114](https://github.com/vllm-project/vllm/pull/20114).
You can build and run vLLM from source after merging this pull request into your local repository.

After applying the changes, you can start the API server by following the standard vLLM setup instructions.

### SGLang

#### Docker Image

We also provide a pre-built Docker image based on the latest version of SGLang.

To get started:

- Pull the Docker image:

```
docker pull xxx
```

- Start the API server:

```
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    --ipc=host \
    xxx \
    python3 -m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000
```

#### Source Code

The necessary integration has already been merged into the SGLang main branch via [PR #7549](https://github.com/sgl-project/sglang/pull/7549).
Once you have cloned or updated your local SGLang repository, you can build it using the standard SGLang setup process and then start the API server:

```
python3 -m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000
```

### TensorRT-LLM

#### Docker Image

We also provide a pre-built Docker image based on the latest version of TensorRT-LLM.

To get started:

- Pull the Docker image:

```
docker pull xxx
```

- Start the API server. The launch command below is a sketch: `trtllm-serve` is TensorRT-LLM's OpenAI-compatible serving command, but consult the TensorRT-LLM documentation for the exact flags:

```
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    --ipc=host \
    xxx \
    trtllm-serve hunyuan/huanyuan_A13B --host 0.0.0.0 --port 30000
```

#### Source Code

The necessary integration has already been merged into the main branch via this PR (xxx).
Once you have cloned or updated your local TensorRT-LLM repository, you can build it using the standard TensorRT-LLM setup process and then start the API server by following the standard TensorRT-LLM setup instructions.

## Inference Performance

This section presents the efficiency test results of deploying various models with vLLM, including inference speed (tokens/s) under different batch sizes.

Evaluation script:

```
python3 benchmark_throughput.py --backend vllm \
    --input-len 2048 \
    --output-len 14336 \
    --model $MODEL_PATH \
    --tensor-parallel-size $TP \
    --use-v2-block-manager \
    --async-engine \
    --trust-remote-code \
    --num-prompts $BATCH_SIZE \
    --max-num-seqs $BATCH_SIZE
```

| Inference Framework | Model | Number of GPUs (GPU product A) | Input Length | batch=1 | batch=16 | batch=32 |
|------|-----------------------------|-----------|-------------------------|---------------------|----------------------|----------------------|
| vLLM | Hunyuan-A13B-Instruct | 8 | 2048 | 190.84 | 1246.54 | 1981.99 |
| vLLM | Hunyuan-A13B-Instruct | 4 | 2048 | 158.90 | 779.10 | 1301.75 |
| vLLM | Hunyuan-A13B-Instruct | 2 | 2048 | 111.72 | 327.31 | 346.54 |
| vLLM | Hunyuan-A13B-Instruct (int8 weight-only) | 2 | 2048 | 109.10 | 444.17 | 721.93 |
| vLLM | Hunyuan-A13B-Instruct (W8A8C8-FP8) | 2 | 2048 | 91.83 | 372.01 | 617.70 |
| vLLM | Hunyuan-A13B-Instruct (W8A8C8-FP8) | 1 | 2048 | 60.07 | 148.80 | 160.41 |

## Contact Us

If you would like to leave a message for our R&D or product teams, you are welcome to contact our open-source team. You can also reach us via email ([email protected]).