# 🚀 MiniMax Models vLLM Deployment Guide
[vLLM Deployment Guide (Chinese)](./vllm_deployment_guide_cn.md)
## 📖 Introduction
We recommend using [vLLM](https://docs.vllm.ai/en/latest/) to deploy [MiniMax-M1](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k) model. Based on our testing, vLLM performs excellently when deploying this model, with the following features:
- 🔥 Outstanding service throughput performance
- ⚡ Efficient and intelligent memory management
- 📦 Powerful batch request processing capability
- ⚙️ Deeply optimized underlying performance
The MiniMax-M1 model can run efficiently on a single server equipped with 8 H800 or 8 H20 GPUs. In terms of hardware configuration, a server with 8 H800 GPUs can process context inputs of up to 2 million tokens, while a server with 8 H20 GPUs supports ultra-long contexts of up to 5 million tokens.
## 💾 Obtaining MiniMax Models
### Obtaining the MiniMax-M1 Model
You can download the model from our official HuggingFace repositories: [MiniMax-M1-40k](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k) and [MiniMax-M1-80k](https://huggingface.co/MiniMaxAI/MiniMax-M1-80k).
Download command:
```bash
pip install -U huggingface-hub
huggingface-cli download MiniMaxAI/MiniMax-M1-40k
# huggingface-cli download MiniMaxAI/MiniMax-M1-80k
# If you encounter network issues, you can set a proxy
export HF_ENDPOINT=https://hf-mirror.com
```
Or download using git:
```bash
git lfs install
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-40k
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-80k
```
⚠️ **Important Note**: Please ensure that [Git LFS](https://git-lfs.github.com/) is installed on your system; it is required to fully download the model weight files.
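After cloning, you can optionally verify that the weight files were actually materialized rather than left as small LFS pointer stubs. This is a minimal sketch assuming the weights are safetensors shards; adjust the paths to your local copy:
```bash
cd MiniMax-M1-40k
# Pointer stubs are only a few hundred bytes; real shards are multiple GB each
ls -lh *.safetensors
# Fetch any files that are still LFS pointers
git lfs pull
```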
## 🛠️ Deployment Options
### Option 1: Deploy Using Docker (Recommended)
To ensure consistency and stability of the deployment environment, we recommend using Docker for deployment.
⚠️ **Version Requirements**:
- The MiniMax-M1 model requires vLLM 0.8.3 or later for full support
- If you are using a Docker image with vLLM version lower than the required version, you will need to:
1. Update to the latest vLLM code
2. Recompile vLLM from source. Follow the compilation instructions in Solution 2 of the Common Issues section
- Special Note: For vLLM versions between 0.8.3 and 0.9.2, you need to modify the model configuration (a minimal sketch of the edit follows this list):
1. Open `config.json`
2. Change `"architectures": ["MiniMaxM1ForCausalLM"]` to `"architectures": ["MiniMaxText01ForCausalLM"]`
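Below is a minimal sketch of that edit, assuming the model was downloaded to `<model storage path>` and that the architecture name appears nowhere else in `config.json`; you can equally make the change in a text editor:
```bash
CONFIG=<model storage path>/config.json
# Keep a backup, then swap in the architecture name expected by vLLM 0.8.3–0.9.2
sed -i.bak 's/MiniMaxM1ForCausalLM/MiniMaxText01ForCausalLM/' "$CONFIG"
grep -A1 '"architectures"' "$CONFIG"   # confirm the change
```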
1. Get the container image:
```bash
docker pull vllm/vllm-openai:v0.8.3
```
2. Run the container:
```bash
# Set environment variables
IMAGE=vllm/vllm-openai:v0.8.3
MODEL_DIR=<model storage path>
CODE_DIR=<code path>
NAME=MiniMaxImage
# Docker run configuration
DOCKER_RUN_CMD="--network=host --privileged --ipc=host --ulimit memlock=-1 --shm-size=2gb --rm --gpus all --ulimit stack=67108864"
# Start the container
sudo docker run -it \
-v $MODEL_DIR:$MODEL_DIR \
-v $CODE_DIR:$CODE_DIR \
--name $NAME \
$DOCKER_RUN_CMD \
$IMAGE /bin/bash
```
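Once inside the container, it is worth confirming that all eight GPUs are visible before launching the service. A quick sanity check:
```bash
# Should list 8 H800 or H20 GPUs with their memory
nvidia-smi --query-gpu=index,name,memory.total --format=csv
```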
### Option 2: Direct Installation of vLLM
If your environment meets the following requirements:
- CUDA 12.1
- PyTorch 2.1
you can install vLLM directly.
Installation command:
```bash
pip install vllm
```
💡 If you are using a different environment configuration, please refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/latest/getting_started/installation.html).
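After installation, you can quickly confirm that the installed version meets the 0.8.3 minimum mentioned above:
```bash
python3 -c "import vllm; print(vllm.__version__)"
```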
## 🚀 Starting the Service
### Launch MiniMax-M1 Service
```bash
# Load safetensors weights directly onto the GPU for faster startup
export SAFETENSORS_FAST_GPU=1
# Use the vLLM V0 engine
export VLLM_USE_V1=0
python3 -m vllm.entrypoints.openai.api_server \
--model <model storage path> \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8 \
--max_model_len 4096 \
--dtype bfloat16
```
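Model loading can take several minutes. Before sending real requests, you can probe the OpenAI-compatible endpoint to confirm the server is up:
```bash
# Lists the models the server is currently serving
curl http://localhost:8000/v1/models
```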
### API Call Example
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMaxAI/MiniMax-M1",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
]
}'
```
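The response follows the OpenAI chat-completions format. If `jq` is available on your machine, a small sketch for extracting just the generated text:
```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMaxAI/MiniMax-M1",
    "messages": [
      {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
    ]
  }' | jq -r '.choices[0].message.content'
```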
## ❗ Common Issues
### Module Loading Problems
If you encounter the following error:
```
import vllm._C # noqa
ModuleNotFoundError: No module named 'vllm._C'
```
Or
```
MiniMax-M1 model is not currently supported
```
We provide two solutions:
#### Solution 1: Copy Dependency Files
```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm
# Copy the compiled extension modules from the installed package into the source tree;
# adjust the python3.12 path if your image ships a different Python version
cp /usr/local/lib/python3.12/dist-packages/vllm/*.so vllm
cp -r /usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/* vllm/vllm_flash_attn
```
#### Solution 2: Install from Source
```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm/
# Editable install; builds vLLM from source, which can take some time
pip install -e .
```
## 📮 Getting Support
If you encounter any issues while deploying the MiniMax-M1 model:
- Please check our official documentation
- Contact our technical support team through official channels
- Submit an [Issue](https://github.com/MiniMax-AI/MiniMax-M1/issues) on our GitHub repository
We will continuously optimize the deployment experience of this model and welcome your feedback!