|
# 🚀 MiniMax Models vLLM Deployment Guide
|
|
|
[vLLM Deployment Guide (Chinese version)](./vllm_deployment_guide_cn.md)
|
|
|
## 📖 Introduction
|
|
|
We recommend using [vLLM](https://docs.vllm.ai/en/latest/) to deploy the [MiniMax-M1](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k) model. Based on our testing, vLLM performs excellently when serving this model, with the following strengths:
|
|
|
- 🔥 Outstanding serving throughput

- ⚡ Efficient and intelligent memory management

- 📦 Powerful batch request processing capability

- ⚙️ Deeply optimized underlying performance
|
|
|
The MiniMax-M1 model can run efficiently on a single server equipped with 8 H800 or 8 H20 GPUs. In terms of hardware configuration, a server with 8 H800 GPUs can process context inputs of up to 2 million tokens, while a server with 8 H20 GPUs supports ultra-long contexts of up to 5 million tokens.
|
|
|
## 💾 Obtaining MiniMax Models
|
|
|
### Obtaining the MiniMax-M1 Model
|
|
|
You can download the model from our official HuggingFace repository: [MiniMax-M1-40k](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k), [MiniMax-M1-80k](https://huggingface.co/MiniMaxAI/MiniMax-M1-80k) |
|
|
|
Download command: |
|
```bash
pip install -U huggingface-hub

# If you encounter network issues, set a mirror endpoint before downloading
# export HF_ENDPOINT=https://hf-mirror.com

huggingface-cli download MiniMaxAI/MiniMax-M1-40k
# huggingface-cli download MiniMaxAI/MiniMax-M1-80k
```
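
If you want the weights in a specific local directory to pass to vLLM later, `huggingface-cli` supports `--local-dir` (the target path below is just an example):

```bash
huggingface-cli download MiniMaxAI/MiniMax-M1-40k --local-dir ./MiniMax-M1-40k
```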
|
|
|
Or download using git: |
|
|
|
```bash
git lfs install
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-40k
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-80k
```
|
|
|
⚠️ **Important Note**: Please ensure that [Git LFS](https://git-lfs.github.com/) is installed on your system; it is required to fully download the model weight files.
|
|
|
## 🛠️ Deployment Options
|
|
|
### Option 1: Deploy Using Docker (Recommended) |
|
|
|
To ensure consistency and stability of the deployment environment, we recommend using Docker for deployment. |
|
|
|
⚠️ **Version Requirements**:

- The MiniMax-M1 model requires vLLM 0.8.3 or later for full support

- If your Docker image ships a vLLM version older than that, you will need to:

  1. Update to the latest vLLM code

  2. Recompile vLLM from source, following the compilation instructions in Solution 2 of the Common Issues section

- Special note: for vLLM versions between 0.8.3 and 0.9.2, you need to modify the model configuration, as sketched below:

  1. Open `config.json`

  2. Change `"architectures": ["MiniMaxM1ForCausalLM"]` to `"architectures": ["MiniMaxText01ForCausalLM"]`
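
A minimal sketch of that `config.json` edit, assuming the model was downloaded to a local directory (adjust the path to your setup):

```bash
python3 - <<'EOF'
import json

path = "./MiniMax-M1-40k/config.json"  # adjust to your model storage path

with open(path) as f:
    cfg = json.load(f)

# Only needed for vLLM versions between 0.8.3 and 0.9.2
cfg["architectures"] = ["MiniMaxText01ForCausalLM"]

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
EOF
```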
|
|
|
1. Get the container image: |
|
```bash
docker pull vllm/vllm-openai:v0.8.3
```
|
|
|
2. Run the container: |
|
```bash
# Set environment variables
IMAGE=vllm/vllm-openai:v0.8.3
MODEL_DIR=<model storage path>
CODE_DIR=<code path>
NAME=MiniMaxImage

# Docker run configuration
DOCKER_RUN_CMD="--network=host --privileged --ipc=host --ulimit memlock=-1 --shm-size=2gb --rm --gpus all --ulimit stack=67108864"

# Start the container
sudo docker run -it \
    -v $MODEL_DIR:$MODEL_DIR \
    -v $CODE_DIR:$CODE_DIR \
    --name $NAME \
    $DOCKER_RUN_CMD \
    $IMAGE /bin/bash
```
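
Once inside the container, it is worth confirming that the bundled vLLM meets the version requirement above (a quick sanity check, assuming the image's default Python environment):

```bash
python3 -c "import vllm; print(vllm.__version__)"
```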
|
|
|
|
|
### Option 2: Direct Installation of vLLM |
|
|
|
If your environment meets the following requirements: |
|
|
|
- CUDA 12.1 |
|
- PyTorch 2.1 |
|
|
|
you can install vLLM directly.
|
|
|
Installation command: |
|
```bash
pip install vllm
```
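
To match the version requirement noted earlier, you can pin the install and verify the CUDA/PyTorch environment vLLM will use (a quick check; the version constraint reflects this guide's requirement):

```bash
pip install "vllm>=0.8.3"

# Confirm the PyTorch build and its CUDA version
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
```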
|
|
|
💡 If you are using a different environment configuration, please refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/latest/getting_started/installation.html).
|
|
|
## 🚀 Starting the Service
|
|
|
### Launch MiniMax-M1 Service |
|
|
|
```bash
# Speed up loading of safetensors weights onto the GPU
export SAFETENSORS_FAST_GPU=1
# Disable the vLLM V1 engine (fall back to V0)
export VLLM_USE_V1=0
python3 -m vllm.entrypoints.openai.api_server \
    --model <model storage path> \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --quantization experts_int8 \
    --max-model-len 4096 \
    --dtype bfloat16
```
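
Once the server is up, you can verify it is serving and check the registered model name via vLLM's standard OpenAI-compatible endpoint (assuming the default port 8000):

```bash
curl http://localhost:8000/v1/models
```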
|
|
|
### API Call Example |
|
|
|
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "MiniMaxAI/MiniMax-M1",
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
            {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
        ]
    }'
```
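
Note that the `model` field must match the name the server registers, which by default is the value passed to `--model`; launching with `--served-model-name MiniMaxAI/MiniMax-M1` makes the request above work as written. The same endpoint also supports streaming responses (standard OpenAI-compatible behavior):

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "MiniMaxAI/MiniMax-M1",
        "stream": true,
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
        ]
    }'
```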
|
|
|
## ❓ Common Issues
|
|
|
### Module Loading Problems |
|
If you encounter the following error: |
|
```
import vllm._C # noqa
ModuleNotFoundError: No module named 'vllm._C'
```
|
|
|
Or |
|
|
|
```
MiniMax-M1 model is not currently supported
```
|
|
|
We provide two solutions: |
|
|
|
#### Solution 1: Copy Dependency Files |
|
```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm
# Copy the prebuilt extension modules from the installed package into the source tree
# (adjust python3.12 to match your Python version)
cp /usr/local/lib/python3.12/dist-packages/vllm/*.so vllm
cp -r /usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/* vllm/vllm_flash_attn
```
|
|
|
#### Solution 2: Install from Source |
|
```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git

cd vllm/
# Build and install vLLM from source (compilation can take a while)
pip install -e .
```
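
After either solution, a quick way to check that the compiled extension now loads (run it from outside the `vllm/` source directory so Python picks up the installed package):

```bash
python3 -c "import vllm._C; print('vllm._C loaded')"
```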
|
|
|
## 📮 Getting Support
|
|
|
If you encounter any issues while deploying the MiniMax-M1 model:
|
- Please check our official documentation |
|
- Contact our technical support team through official channels |
|
- Submit an [Issue](https://github.com/MiniMax-AI/MiniMax-M1/issues) on our GitHub repository |
|
|
|
We will continue to optimize the deployment experience for this model and welcome your feedback!
|
|