# 🚀 MiniMax Models vLLM Deployment Guide

[Chinese version of this guide / vLLM中文版部署指南](./vllm_deployment_guide_cn.md)

## 📖 Introduction

We recommend using [vLLM](https://docs.vllm.ai/en/latest/) to deploy the [MiniMax-M1](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k) model. Based on our testing, vLLM performs excellently when serving this model, with the following features:

- 🔥 Outstanding service throughput performance
- ⚡ Efficient and intelligent memory management
- 📦 Powerful batch request processing capability
- ⚙️ Deeply optimized underlying performance

The MiniMax-M1 model runs efficiently on a single server equipped with 8 H800 or 8 H20 GPUs. A server with 8 H800 GPUs can process context inputs of up to 2 million tokens, while a server with 8 H20 GPUs supports ultra-long contexts of up to 5 million tokens.

## 💾 Obtaining MiniMax Models

### Obtaining the MiniMax-M1 Model

You can download the model from our official HuggingFace repositories: [MiniMax-M1-40k](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k), [MiniMax-M1-80k](https://huggingface.co/MiniMaxAI/MiniMax-M1-80k)

Download command:

```bash
pip install -U huggingface-hub
huggingface-cli download MiniMaxAI/MiniMax-M1-40k
# huggingface-cli download MiniMaxAI/MiniMax-M1-80k

# If you encounter network issues, you can use a mirror endpoint
export HF_ENDPOINT=https://hf-mirror.com
```

Or download using git:

```bash
git lfs install
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-40k
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-80k
```

⚠️ **Important Note**: Please ensure that [Git LFS](https://git-lfs.github.com/) is installed on your system; it is required to download the model weight files completely.

## 🛠️ Deployment Options

### Option 1: Deploy Using Docker (Recommended)

To ensure consistency and stability of the deployment environment, we recommend using Docker for deployment.

⚠️ **Version Requirements**:

- The MiniMax-M1 model requires vLLM version 0.8.3 or later for full support.
- If you are using a Docker image with a vLLM version lower than required, you will need to:
  1. Update to the latest vLLM code
  2. Recompile vLLM from source, following the compilation instructions in Solution 2 of the Common Issues section
- Special Note: for vLLM versions between 0.8.3 and 0.9.2, you need to modify the model configuration (this can also be scripted, as shown in the sketch after this list):
  1. Open `config.json`
  2. Change `"architectures": ["MiniMaxM1ForCausalLM"]` to `"architectures": ["MiniMaxText01ForCausalLM"]`
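A minimal sketch of applying the rename from the shell, assuming the hypothetical path `./MiniMax-M1-40k` for the downloaded weights (set `MODEL_DIR` to your own model directory):

```bash
# Hypothetical path to the downloaded model; point this at your own directory
MODEL_DIR=./MiniMax-M1-40k

# Replace the architectures entry in config.json; -i.bak keeps a backup copy
sed -i.bak 's/"MiniMaxM1ForCausalLM"/"MiniMaxText01ForCausalLM"/' "$MODEL_DIR/config.json"
```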
1. Get the container image:

```bash
docker pull vllm/vllm-openai:v0.8.3
```

2. Run the container:

```bash
# Set environment variables (replace the placeholders with your local paths)
IMAGE=vllm/vllm-openai:v0.8.3
MODEL_DIR=<model storage path>
CODE_DIR=<code path>
NAME=MiniMaxImage

# Docker run configuration
DOCKER_RUN_CMD="--network=host --privileged --ipc=host --ulimit memlock=-1 --shm-size=2gb --rm --gpus all --ulimit stack=67108864"

# Start the container
sudo docker run -it \
    -v $MODEL_DIR:$MODEL_DIR \
    -v $CODE_DIR:$CODE_DIR \
    --name $NAME \
    $DOCKER_RUN_CMD \
    $IMAGE /bin/bash
```

### Option 2: Direct Installation of vLLM

If your environment meets the following requirements:

- CUDA 12.1
- PyTorch 2.1

you can install vLLM directly:

```bash
pip install vllm
```

💡 If you are using other environment configurations, please refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/latest/getting_started/installation.html)

## 🚀 Starting the Service

### Launch MiniMax-M1 Service

```bash
export SAFETENSORS_FAST_GPU=1
export VLLM_USE_V1=0
python3 -m vllm.entrypoints.openai.api_server \
    --model <model storage path> \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --quantization experts_int8 \
    --max_model_len 4096 \
    --dtype bfloat16
```

### API Call Example

Once the service is up (it listens on port 8000 by default), you can test it with an OpenAI-compatible request:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "MiniMaxAI/MiniMax-M1",
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
            {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
        ]
    }'
```

## ❗ Common Issues

### Module Loading Problems

If you encounter either of the following errors:

```
import vllm._C # noqa
ModuleNotFoundError: No module named 'vllm._C'
```

or

```
MiniMax-M1 model is not currently supported
```

We provide two solutions:

#### Solution 1: Copy Dependency Files

```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm

# Copy the precompiled extension libraries from the installed package into
# the source checkout (paths assume Python 3.12; adjust for your version)
cp /usr/local/lib/python3.12/dist-packages/vllm/*.so vllm
cp -r /usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/* vllm/vllm_flash_attn
```

#### Solution 2: Install from Source

```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm/

# Build and install vLLM from the current source tree
pip install -e .
```

## 📮 Getting Support

If you encounter any issues while deploying the MiniMax-M1 model:

- Please check our official documentation
- Contact our technical support team through official channels
- Submit an [Issue](https://github.com/MiniMax-AI/MiniMax-M1/issues) on our GitHub repository

We will continuously optimize the deployment experience of this model and welcome your feedback!
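When submitting an issue, attaching a quick snapshot of your environment speeds up diagnosis. A minimal sketch of such a check (it assumes the server is running on the default port 8000 used in the examples above):

```bash
# Installed vLLM and PyTorch versions
python3 -c "import vllm; print('vllm', vllm.__version__)"
python3 -c "import torch; print('torch', torch.__version__, 'cuda', torch.version.cuda)"

# GPU model and total memory
nvidia-smi --query-gpu=name,memory.total --format=csv

# Confirm the API server is reachable and lists the served model
curl -s http://localhost:8000/v1/models
```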