update doc
#28 by asherszhang · opened

- README.md +19 -4
- README_CN.md +18 -3
README.md
CHANGED

@@ -221,15 +221,30 @@ trtllm-serve \
 
 ### vLLM
 
-#### Docker Image
-We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model. The official vLLM release is currently under development. **Note: CUDA 12.4 is required for this Docker image.**
-
-- To get started:
+#### Inference from Docker Image
+We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model. The official vLLM release is currently under development. **Note: CUDA 12.4 is required for this Docker image.**
+
+- To get started, download the Docker image:
+
+**From Docker Hub:**
 ```
 docker pull hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1
 ```
 
+**From the China mirror (thanks to [CNB](https://cnb.cool/ "CNB.cool")):**
+
+First, pull the image from CNB:
+```
+docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b/hunyuan-infer-vllm-cuda12.4:v1
+```
+
+Then retag the image so it matches the names used in the scripts that follow:
+```
+docker tag docker.cnb.cool/tencent/hunyuan/hunyuan-a13b/hunyuan-infer-vllm-cuda12.4:v1 hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1
+```
+
 - Download the model files:
   - Hugging Face: downloaded automatically by vLLM.
   - ModelScope: `modelscope download --model Tencent-Hunyuan/Hunyuan-A13B-Instruct`
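Reviewer note: the hunk above ends before any launch command. Below is a minimal sketch of serving the model from this image, assuming vLLM's standard OpenAI-compatible `vllm serve` entrypoint is on the image's PATH; the Hugging Face model ID, port, and parallelism settings are illustrative, and the README's actual launch script (referenced in a later hunk header) may differ.

```
# Launch the pre-built image with GPU access; --ipc=host mirrors the
# docker run invocation that appears elsewhere in the README.
docker run --rm --ipc=host --gpus all -p 8000:8000 \
    hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1 \
    vllm serve tencent/Hunyuan-A13B-Instruct \
        --trust-remote-code \
        --tensor-parallel-size 2  # adjust to the number of local GPUs
```

Once up, the container exposes vLLM's OpenAI-compatible API on port 8000 (e.g. `POST /v1/chat/completions`).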
@@ -279,7 +294,7 @@ You can build and run vLLM from source after merging this pull request into your
 
 ### Model Context Length Support
 
-The Hunyuan A13B model supports a maximum context length of **256K (262,144) tokens**.
+The Hunyuan A13B model supports a maximum context length of **256K (262,144) tokens**. However, due to GPU memory constraints on most hardware setups, the default configuration in `config.json` limits the context length to **32K tokens** to prevent out-of-memory (OOM) errors.
 
 #### Extending Context Length to 256K
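The added sentence introduces the 32K default without showing here how to lift it. A minimal sketch of requesting the full window at serve time via vLLM's standard `--max-model-len` option; whether the "Extending Context Length to 256K" section uses this flag or edits `config.json` directly is not visible in this diff:

```
# Ask vLLM for the full 262,144-token window; this requires enough GPU
# memory for the correspondingly larger KV cache.
vllm serve tencent/Hunyuan-A13B-Instruct \
    --trust-remote-code \
    --max-model-len 262144
```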
README_CN.md
CHANGED

@@ -178,16 +178,31 @@ print(response)
 
 ## vLLM Deployment
 
-### Docker
+### Inference from Docker Image
 
 We provide a Docker image based on the official vLLM 0.8.5 release for quick deployment and testing. **Note: this image requires CUDA 12.4.**
 
+- First, download the Docker image:
+
+**Download from Docker Hub:**
 ```
 docker pull hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1
 ```
 
+**China mirror:**
+
+For faster downloads, you can also pull the image from CNB; thanks to [CNB Cloud Native Build](https://cnb.cool/) for the support:
+
+1. Pull the image:
+```
+docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b/hunyuan-infer-vllm-cuda12.4:v1
+```
+
+2. Retag the image (optional, so the name matches the scripts below):
+```
+docker tag docker.cnb.cool/tencent/hunyuan/hunyuan-a13b/hunyuan-infer-vllm-cuda12.4:v1 hunyuaninfer/hunyuan-infer-vllm-cuda12.4:v1
+```
+
 - Download the model files:
   - Hugging Face: downloaded automatically by vLLM.
   - ModelScope: `modelscope download --model Tencent-Hunyuan/Hunyuan-A13B-Instruct`
@@ -238,7 +253,7 @@ docker run --rm --ipc=host \
 
 ### Model Context Length Support
 
-The Hunyuan A13B model supports a maximum context length of **256K (262,144) tokens**.
+The Hunyuan A13B model supports a maximum context length of **256K (262,144) tokens**. However, due to GPU memory limits on most hardware setups, the default `config.json` caps the context length at **32K tokens** to avoid out-of-memory (OOM) errors.
 
 #### Extending Context Length to 256K
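Both files list ModelScope as the alternative model source. A minimal sketch of making that explicit, assuming ModelScope's standard `--local_dir` download option and vLLM's documented `VLLM_USE_MODELSCOPE` environment switch; neither appears in this diff, and the local path below is illustrative:

```
# Option 1: download the weights to an explicit local directory first,
# then mount that path into the container and serve it.
modelscope download --model Tencent-Hunyuan/Hunyuan-A13B-Instruct \
    --local_dir ./Hunyuan-A13B-Instruct

# Option 2: let vLLM resolve the model ID against ModelScope instead of
# Hugging Face when it downloads automatically.
export VLLM_USE_MODELSCOPE=True
```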