# 🚀 MiniMax Models vLLM Deployment Guide

[vLLM Deployment Guide (Chinese Version)](./vllm_deployment_guide_cn.md)

## 📖 Introduction

We recommend using [vLLM](https://docs.vllm.ai/en/latest/) to deploy the [MiniMax-M1](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k) model. Based on our testing, vLLM performs excellently when deploying this model, with the following features:

- 🔥 Outstanding service throughput performance
- ⚡ Efficient and intelligent memory management
- 📦 Powerful batch request processing capability
- ⚙️ Deeply optimized underlying performance

The MiniMax-M1 model can run efficiently on a single server equipped with 8 H800 or 8 H20 GPUs. In terms of hardware configuration, a server with 8 H800 GPUs can process context inputs of up to 2 million tokens, while a server with 8 H20 GPUs can support ultra-long context processing of up to 5 million tokens.

## 💾 Obtaining MiniMax Models

### Obtaining the MiniMax-M1 Model

You can download the models from our official HuggingFace repositories: [MiniMax-M1-40k](https://huggingface.co/MiniMaxAI/MiniMax-M1-40k) and [MiniMax-M1-80k](https://huggingface.co/MiniMaxAI/MiniMax-M1-80k).

Download command:
```bash
pip install -U huggingface-hub

# If you encounter network issues, you can set a mirror endpoint before downloading
# export HF_ENDPOINT=https://hf-mirror.com

huggingface-cli download MiniMaxAI/MiniMax-M1-40k
# huggingface-cli download MiniMaxAI/MiniMax-M1-80k
```

Or download using git:

```bash
git lfs install
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-40k
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-80k
```

โš ๏ธ **Important Note**: Please ensure that [Git LFS](https://git-lfs.github.com/) is installed on your system, which is necessary for completely downloading the model weight files.

## 🛠️ Deployment Options

### Option 1: Deploy Using Docker (Recommended)

To ensure consistency and stability of the deployment environment, we recommend using Docker for deployment.

โš ๏ธ **Version Requirements**: 
- MiniMax-M1 model requires vLLM version 0.8.3 or later for full support
- If you are using a Docker image with vLLM version lower than the required version, you will need to:
  1. Update to the latest vLLM code
  2. Recompile vLLM from source. Follow the compilation instructions in Solution 2 of the Common Issues section
- Special Note: For vLLM versions between 0.8.3 and 0.9.2, you need to modify the model configuration:
  1. Open `config.json`
  2. Change `config['architectures'] = ["MiniMaxM1ForCausalLM"]` to `config['architectures'] = ["MiniMaxText01ForCausalLM"]`
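
The `config.json` edit mentioned in the Special Note can be applied with a few lines of Python. This is a minimal sketch; the config path below is illustrative and should point to your downloaded model directory:

```python
# Minimal sketch: switch MiniMax-M1 to the MiniMaxText01 architecture name
# required by vLLM versions between 0.8.3 and 0.9.2.
# The config path is illustrative; use the directory you downloaded the weights into.
import json

config_path = "./MiniMax-M1-40k/config.json"

with open(config_path) as f:
    config = json.load(f)

config["architectures"] = ["MiniMaxText01ForCausalLM"]

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```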

1. Get the container image:
```bash
docker pull vllm/vllm-openai:v0.8.3
```

2. Run the container:
```bash
# Set environment variables
IMAGE=vllm/vllm-openai:v0.8.3
MODEL_DIR=<model storage path>
CODE_DIR=<code path>
NAME=MiniMaxImage

# Docker run configuration
DOCKER_RUN_CMD="--network=host --privileged --ipc=host --ulimit memlock=-1 --shm-size=2gb --rm --gpus all --ulimit stack=67108864"

# Start the container
sudo docker run -it \
    -v $MODEL_DIR:$MODEL_DIR \
    -v $CODE_DIR:$CODE_DIR \
    --name $NAME \
    $DOCKER_RUN_CMD \
    $IMAGE /bin/bash
```
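
Once inside the container, you can obtain the model weights as described above and then start the service following the "Starting the Service" section below.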


### Option 2: Direct Installation of vLLM

If your environment meets the following requirements:

- CUDA 12.1
- PyTorch 2.1

You can install vLLM directly.

Installation command:
```bash
pip install vllm
```

💡 If you are using other environment configurations, please refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/latest/getting_started/installation.html)
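
After installation, you can quickly confirm that the installed vLLM meets the version 0.8.3+ requirement noted above. A minimal check:

```python
# Minimal check: print the installed vLLM version; it should be 0.8.3 or later.
import vllm

print(vllm.__version__)
```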

## 🚀 Starting the Service

### Launch the MiniMax-M1 Service

```bash
export SAFETENSORS_FAST_GPU=1
export VLLM_USE_V1=0
python3 -m vllm.entrypoints.openai.api_server \
--model <model storage path> \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8  \
--max_model_len 4096 \
--dtype bfloat16
```
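
Once the server is up, you can confirm it is reachable and see which model name it registered (by default this is the value passed to `--model`; the `--served-model-name` flag can override it). A minimal check using only the standard library:

```python
# Minimal sketch: query the OpenAI-compatible /v1/models endpoint to confirm
# the server is running and to see the registered model name.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    models = json.load(resp)

for model in models["data"]:
    print(model["id"])
```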

### API Call Example

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "MiniMaxAI/MiniMax-M1",
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
            {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
        ]
    }'
```
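
The same request can also be sent with the official `openai` Python client pointed at the local server. This is a minimal sketch; the `model` value must match the name the server actually registered (see the check above), and the API key is a placeholder since vLLM does not verify it unless one is configured:

```python
# Minimal sketch: call the local vLLM OpenAI-compatible server with the openai client.
# The model name must match the served model name; the API key is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M1",  # must match the served model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
)

print(response.choices[0].message.content)
```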

## โ— Common Issues

### Module Loading Problems
If you encounter the following error:
```
import vllm._C  # noqa
ModuleNotFoundError: No module named 'vllm._C'
```

Or

```
MiniMax-M1 model is not currently supported
```

We provide two solutions:

#### Solution 1: Copy Dependency Files
```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm
# Adjust the site-packages path below if your Python version or install location differs
cp /usr/local/lib/python3.12/dist-packages/vllm/*.so vllm
cp -r /usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/* vllm/vllm_flash_attn
```

#### Solution 2: Install from Source
```bash
cd <working directory>
git clone https://github.com/vllm-project/vllm.git

cd vllm/
pip install -e .
```
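
After installing from source, you can confirm that the freshly built copy is the one Python imports. A minimal check:

```python
# Minimal check: confirm the source-built vLLM is the copy being imported.
import vllm

print(vllm.__version__)  # version of the source checkout
print(vllm.__file__)     # should point into your <working directory>/vllm clone
```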

## 📮 Getting Support

If you encounter any issues while deploying the MiniMax-M1 model:
- Please check our official documentation
- Contact our technical support team through official channels
- Submit an [Issue](https://github.com/MiniMax-AI/MiniMax-M1/issues) on our GitHub repository

We will continuously optimize the deployment experience of this model and welcome your feedback!