update doc
#28 by asherszhang · opened

- README.md +19 -4
- README_CN.md +18 -3
README.md
CHANGED

@@ -221,15 +221,30 @@ trtllm-serve \
 
 ### vLLM
 
-#### Docker Image
-We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model. The official vLLM release is currently under development. **Note: CUDA 12.4 is required for this Docker image.**
-
-- To get started:
+#### Inference from Docker Image
+We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model. The official vLLM release is currently under development. **Note: CUDA 12.4 is required for this Docker image.**
+
+- To get started, download the Docker image:
+
+**From Docker Hub:**
 ```
 docker pull hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1
 ```
 
+**From the China mirror (thanks to [CNB](https://cnb.cool/ "CNB.cool")):**
+
+First, pull the image from CNB:
+```
+docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b/hunyuan-infer-vllm-cuda12.4:v1
+```
+
+Then retag the image so it matches the names used in the scripts that follow:
+```
+docker tag docker.cnb.cool/tencent/hunyuan/hunyuan-a13b/hunyuan-infer-vllm-cuda12.4:v1 hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1
+```
+
 - Download the model files:
   - Hugging Face: downloaded automatically by vLLM.
   - ModelScope: `modelscope download --model Tencent-Hunyuan/Hunyuan-A13B-Instruct`
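Reviewer note: the hunk above ends before any launch command. Below is a minimal sketch of serving the model from this image, assuming vLLM's standard OpenAI-compatible `vllm serve` entrypoint is on the image's PATH; the Hugging Face model ID, port, and parallelism settings are illustrative, and the README's actual launch script (referenced in a later hunk header) may differ.

```
# Launch the pre-built image with GPU access; --ipc=host mirrors the
# docker run invocation that appears elsewhere in the README.
docker run --rm --ipc=host --gpus all -p 8000:8000 \
    hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1 \
    vllm serve tencent/Hunyuan-A13B-Instruct \
        --trust-remote-code \
        --tensor-parallel-size 2  # adjust to the number of local GPUs
```

Once up, the container exposes vLLM's OpenAI-compatible API on port 8000 (e.g. `POST /v1/chat/completions`).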
@@ -279,7 +294,7 @@ You can build and run vLLM from source after merging this pull request into your
 
 ### Model Context Length Support
 
-The Hunyuan A13B model supports a maximum context length of **256K (262,144) tokens**.
+The Hunyuan A13B model supports a maximum context length of **256K (262,144) tokens**. However, due to GPU memory constraints on most hardware setups, the default configuration in `config.json` limits the context length to **32K tokens** to prevent out-of-memory (OOM) errors.
 
 #### Extending Context Length to 256K
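The added sentence introduces the 32K default without showing here how to lift it. A minimal sketch of requesting the full window at serve time via vLLM's standard `--max-model-len` option; whether the "Extending Context Length to 256K" section uses this flag or edits `config.json` directly is not visible in this diff:

```
# Ask vLLM for the full 262,144-token window; this requires enough GPU
# memory for the correspondingly larger KV cache.
vllm serve tencent/Hunyuan-A13B-Instruct \
    --trust-remote-code \
    --max-model-len 262144
```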
README_CN.md
CHANGED

@@ -178,16 +178,31 @@ print(response)
 
 ## vLLM Deployment
 
-### Docker
+### Inference from Docker Image
 
 We provide a Docker image based on the official vLLM 0.8.5 release for quick deployment and testing. **Note: this image requires CUDA 12.4.**
 
+- First, download the Docker image:
+
+**Download from Docker Hub:**
 ```
 docker pull hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1
 ```
 
+**China mirror:**
+
+For faster downloads, you can also pull the image from CNB; thanks to [CNB Cloud Native Build](https://cnb.cool/) for the support:
+
+1. Pull the image:
+```
+docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b/hunyuan-infer-vllm-cuda12.4:v1
+```
+
+2. Retag the image (optional, so the name matches the scripts below):
+```
+docker tag docker.cnb.cool/tencent/hunyuan/hunyuan-a13b/hunyuan-infer-vllm-cuda12.4:v1 hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1
+```
+
 - Download the model files:
   - Hugging Face: downloaded automatically by vLLM.
   - ModelScope: `modelscope download --model Tencent-Hunyuan/Hunyuan-A13B-Instruct`
@@ -238,7 +253,7 @@ docker run --rm --ipc=host \
 
 ### Model Context Length Support
 
-The Hunyuan A13B model supports a maximum context length of **256K (262,144) tokens**.
+The Hunyuan A13B model supports a maximum context length of **256K (262,144) tokens**. However, due to GPU memory limits on most hardware setups, the default `config.json` caps the context length at **32K tokens** to avoid out-of-memory (OOM) errors.
 
 #### Extending Context Length to 256K
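Both files list ModelScope as the alternative model source. A minimal sketch of making that explicit, assuming ModelScope's standard `--local_dir` download option and vLLM's documented `VLLM_USE_MODELSCOPE` environment switch; neither appears in this diff, and the local path below is illustrative:

```
# Option 1: download the weights to an explicit local directory first,
# then mount that path into the container and serve it.
modelscope download --model Tencent-Hunyuan/Hunyuan-A13B-Instruct \
    --local_dir ./Hunyuan-A13B-Instruct

# Option 2: let vLLM resolve the model ID against ModelScope instead of
# Hugging Face when it downloads automatically.
export VLLM_USE_MODELSCOPE=True
```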