QscQ committed on
Commit 75b83bd · 1 Parent(s): 6b33168
function_call_guide.md ADDED
@@ -0,0 +1,270 @@
1
+ # MiniMax-M1 Function Call Guide
2
+
3
+ [FunctionCall中文使用指南](./function_call_guide_cn.md)
4
+
5
+ ## 📖 Introduction
6
+
7
+ The MiniMax-M1 model supports function calling capabilities, enabling the model to identify when external functions need to be called and output function call parameters in a structured format. This document provides detailed instructions on how to use the function calling feature of MiniMax-M1.
8
+
9
+ ## 🚀 Quick Start
10
+
11
+ ### Using Chat Template
12
+
13
+ MiniMax-M1 uses a specific chat template format to handle function calls. The chat template is defined in `tokenizer_config.json`, and you can use it in your code through the template.
14
+
15
+ ```python
16
+ from transformers import AutoTokenizer
17
+
18
+ def get_default_tools():
+     return [
+         {
+             "name": "get_current_weather",
+             "description": "Get the latest weather for a location",
+             "parameters": {
+                 "type": "object",
+                 "properties": {
+                     "location": {
+                         "type": "string",
+                         "description": "A certain city, such as Beijing, Shanghai"
+                     }
+                 },
+                 "required": ["location"]
+             }
+         }
+     ]
38
+
39
+ # Load the tokenizer (model_id should point to the MiniMax-M1 weights,
+ # e.g. "MiniMaxAI/MiniMax-M1" or a local download path)
+ model_id = "MiniMaxAI/MiniMax-M1"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
41
+ prompt = "What's the weather like in Shanghai today?"
42
+ messages = [
43
+ {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by Minimax based on MiniMax-Text-01 model."}]},
44
+ {"role": "user", "content": [{"type": "text", "text": prompt}]},
45
+ ]
46
+
47
+ # Enable function call tools
48
+ tools = get_default_tools()
49
+
50
+ # Apply chat template and add tool definitions
51
+ text = tokenizer.apply_chat_template(
52
+ messages,
53
+ tokenize=False,
54
+ add_generation_prompt=True,
55
+ tools=tools
56
+ )
57
+ ```
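+
+ The snippet above stops once the templated `text` is built. As a minimal sketch (not part of the original guide), generation could then be run with a standard Transformers causal LM loaded from the same `model_id`, for example:
+
+ ```python
+ from transformers import AutoModelForCausalLM
+
+ # Assumption: model_id points to the MiniMax-M1 weights and the machine has
+ # enough GPU memory; see the Transformers deployment guide for details.
+ model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)
+
+ model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
+ generated_ids = model.generate(**model_inputs, max_new_tokens=512)
+
+ # Keep only the newly generated tokens; the output may contain
+ # <think>...</think> and <tool_calls>...</tool_calls> blocks.
+ output = tokenizer.decode(generated_ids[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=True)
+ print(output)
+ ```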
58
+
59
+ ## 🛠️ Function Call Definition
60
+
61
+ ### Function Structure
62
+
63
+ Function calls need to be defined in the `tools` field of the request body. Each function consists of the following components:
64
+
65
+ ```json
66
+ {
67
+ "tools": [
68
+ {
69
+ "name": "search_web",
70
+ "description": "Search function.",
71
+ "parameters": {
72
+ "properties": {
73
+ "query_list": {
74
+ "description": "Keywords for search, with list element count of 1.",
75
+ "items": { "type": "string" },
76
+ "type": "array"
77
+ },
78
+ "query_tag": {
79
+ "description": "Classification of the query",
80
+ "items": { "type": "string" },
81
+ "type": "array"
82
+ }
83
+ },
84
+ "required": [ "query_list", "query_tag" ],
85
+ "type": "object"
86
+ }
87
+ }
88
+ ]
89
+ }
90
+ ```
91
+
92
+ **Field Descriptions:**
93
+ - `name`: Function name
94
+ - `description`: Function description
95
+ - `parameters`: Function parameter definition
96
+ - `properties`: Parameter property definitions, where key is the parameter name and value contains detailed parameter description
97
+ - `required`: List of required parameters
98
+ - `type`: Parameter type (usually "object")
99
+
100
+ ### Internal Model Processing Format
101
+
102
+ When processed internally by the model, function definitions are converted to a special format and concatenated to the input text:
103
+
104
+ ```
105
+ ]~!b[]~b]system ai_setting=Conch AI
106
+ MiniMax AI is an AI assistant independently developed by MiniMax. [e~[
107
+ ]~b]system tool_setting=tools
108
+ You are provided with these tools:
109
+ <tools>
110
+ {"name": "search_web", "description": "Search function.", "parameters": {"properties": {"query_list": {"description": "Keywords for search, with list element count of 1.", "items": {"type": "string"}, "type": "array"}, "query_tag": {"description": "Classification of the query", "items": {"type": "string"}, "type": "array"}}, "required": ["query_list", "query_tag"], "type": "object"}}
111
+ </tools>
112
+
113
+ If you need to call tools, please respond with <tool_calls></tool_calls> XML tags, and provide tool-name and json-object of arguments, following the format below:
114
+ <tool_calls>
115
+ {"name": <tool-name>, "arguments": <args-json-object>}
116
+ ...
117
+ </tool_calls>[e~[
118
+ ]~b]user name=User
119
+ When were the most recent launch events for OpenAI and Gemini?[e~[
120
+ ]~b]ai name=Conch AI
121
+ ```
122
+
123
+ ### Model Output Format
124
+
125
+ The model outputs function calls in the following format:
126
+
127
+ ```xml
128
+ <think>
129
+ Okay, I will search for the OpenAI and Gemini latest release.
130
+ </think>
131
+ <tool_calls>
132
+ {"name": "search_web", "arguments": {"query_tag": ["technology", "events"], "query_list": ["\"OpenAI\" \"latest\" \"release\""]}}
133
+ {"name": "search_web", "arguments": {"query_tag": ["technology", "events"], "query_list": ["\"Gemini\" \"latest\" \"release\""]}}
134
+ </tool_calls>
135
+ ```
136
+
137
+ ## 📥 Function Call Result Processing
138
+
139
+ ### Parsing Function Calls
140
+
141
+ You can use the following code to parse function calls from the model output:
142
+
143
+ ```python
+ import re
+ import json
+
+ def parse_function_calls(content: str):
+     """
+     Parse function calls from model output
+     """
+     function_calls = []
+
+     # Match content within <tool_calls> tags
+     tool_calls_pattern = r"<tool_calls>(.*?)</tool_calls>"
+     tool_calls_match = re.search(tool_calls_pattern, content, re.DOTALL)
+
+     if not tool_calls_match:
+         return function_calls
+
+     tool_calls_content = tool_calls_match.group(1).strip()
+
+     # Parse each function call (one JSON object per line)
+     for line in tool_calls_content.split('\n'):
+         line = line.strip()
+         if not line:
+             continue
+
+         try:
+             # Parse JSON format function call
+             call_data = json.loads(line)
+             function_name = call_data.get("name")
+             arguments = call_data.get("arguments", {})
+
+             function_calls.append({
+                 "name": function_name,
+                 "arguments": arguments
+             })
+
+             print(f"Function call: {function_name}, Arguments: {arguments}")
+
+         except json.JSONDecodeError as e:
+             print(f"Parameter parsing failed: {line}, Error: {e}")
+
+     return function_calls
+
+ # Example: Handle weather query function
+ def execute_function_call(function_name: str, arguments: dict):
+     """
+     Execute function call and return result
+     """
+     if function_name == "get_current_weather":
+         location = arguments.get("location", "Unknown location")
+         # Build function execution result
+         return {
+             "role": "tool",
+             "name": function_name,
+             "content": json.dumps({
+                 "location": location,
+                 "temperature": "25",
+                 "unit": "celsius",
+                 "weather": "Sunny"
+             }, ensure_ascii=False)
+         }
+     elif function_name == "search_web":
+         query_list = arguments.get("query_list", [])
+         query_tag = arguments.get("query_tag", [])
+         # Simulate search results
+         return {
+             "role": "tool",
+             "name": function_name,
+             "content": f"Search keywords: {query_list}, Categories: {query_tag}\nSearch results: Relevant information found"
+         }
+
+     return None
+ ```
216
+
217
+ ### Returning Function Execution Results to the Model
218
+
219
+ After successfully parsing function calls, you should add the function execution results to the conversation history so that the model can access and utilize this information in subsequent interactions.
220
+
221
+ #### Single Result
222
+
223
+ If the model decides to call `search_web`, we suggest returning the function result in the following format, with the `name` field set to the specific tool name.
224
+
225
+ ```json
226
+ {
227
+ "data": [
228
+ {
229
+ "role": "tool",
230
+ "name": "search_web",
231
+ "content": "search_result"
232
+ }
233
+ ]
234
+ }
235
+ ```
236
+
237
+ Corresponding model input format:
238
+ ```
239
+ ]~b]tool name=search_web
240
+ search_result[e~[
241
+ ```
242
+
243
+
244
+ #### Multiple Results
+ If the model decides to call `search_web` and `get_current_weather` at the same time, we suggest returning the multiple function results in the following format, with the `name` field set to "tools" and the `content` field containing all of the results.
246
+
247
+
248
+ ```json
249
+ {
250
+ "data": [
251
+ {
252
+ "role": "tool",
253
+ "name": "tools",
254
+ "content": "Tool name: search_web\nTool result: test_result1\n\nTool name: get_current_weather\nTool result: test_result2"
255
+ }
256
+ ]
257
+ }
258
+ ```
259
+
260
+ Corresponding model input format:
261
+ ```
262
+ ]~b]tool name=tools
263
+ Tool name: search_web
264
+ Tool result: test_result1
265
+
266
+ Tool name: get_current_weather
267
+ Tool result: test_result2[e~[
268
+ ```
269
+
270
+ While we suggest following the above formats, as long as the model input is easy to understand, the specific values of `name` and `content` are entirely up to the caller.
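+
+ Putting the pieces above together, a minimal end-to-end sketch (assuming the helper functions from this guide, plus a `generate(text) -> str` function provided by whatever serving stack you use, which is not defined here) might look like:
+
+ ```python
+ # Hypothetical glue code; `generate` is a placeholder for your inference call
+ # (Transformers, vLLM, or an HTTP client), not an API from this repository.
+ model_output = generate(text)          # may contain <tool_calls>...</tool_calls>
+ calls = parse_function_calls(model_output)
+
+ for call in calls:
+     result = execute_function_call(call["name"], call["arguments"])
+     if result is not None:
+         # Append the tool result to the conversation history so the model
+         # can use it in the next turn.
+         messages.append(result)
+
+ # Re-apply the chat template with the tool results included and generate
+ # the model's final answer.
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+     tools=tools,
+ )
+ final_answer = generate(text)
+ ```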
function_call_guide_cn.md ADDED
@@ -0,0 +1,267 @@
1
+ # MiniMax-M1 函数调用(Function Call)功能指南
2
+
3
+ ## 📖 简介
4
+
5
+ MiniMax-M1 模型支持函数调用功能,使模型能够识别何时需要调用外部函数,并以结构化格式输出函数调用参数。本文档详细介绍了如何使用 MiniMax-M1 的函数调用功能。
6
+
7
+ ## 🚀 快速开始
8
+
9
+ ### 聊天模板使用
10
+
11
+ MiniMax-M1 使用特定的聊天模板格式处理函数调用。聊天模板定义在 `tokenizer_config.json` 中,你可以在代码中通过 template 来进行使用。
12
+
13
+ ```python
14
+ from transformers import AutoTokenizer
15
+
16
+ def get_default_tools():
+     return [
+         {
+             "name": "get_current_weather",
+             "description": "Get the latest weather for a location",
+             "parameters": {
+                 "type": "object",
+                 "properties": {
+                     "location": {
+                         "type": "string",
+                         "description": "A certain city, such as Beijing, Shanghai"
+                     }
+                 },
+                 "required": ["location"]
+             }
+         }
+     ]
36
+
37
+ # 加载模型和分词器
38
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
39
+ prompt = "What's the weather like in Shanghai today?"
40
+ messages = [
41
+ {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by Minimax based on MiniMax-Text-01 model."}]},
42
+ {"role": "user", "content": [{"type": "text", "text": prompt}]},
43
+ ]
44
+
45
+ # 启用函数调用工具
46
+ tools = get_default_tools()
47
+
48
+ # 应用聊天模板,并加入工具定义
49
+ text = tokenizer.apply_chat_template(
50
+ messages,
51
+ tokenize=False,
52
+ add_generation_prompt=True,
53
+ tools=tools
54
+ )
55
+ ```
56
+
57
+ ## 🛠️ 函数调用的定义
58
+
59
+ ### 函数结构体
60
+
61
+ 函数调用需要在请求体中定义 `tools` 字段,每个函数由以下部分组成:
62
+
63
+ ```json
64
+ {
65
+ "tools": [
66
+ {
67
+ "name": "search_web",
68
+ "description": "搜索函数。",
69
+ "parameters": {
70
+ "properties": {
71
+ "query_list": {
72
+ "description": "进行搜索的关键词,列表元素个数为1。",
73
+ "items": { "type": "string" },
74
+ "type": "array"
75
+ },
76
+ "query_tag": {
77
+ "description": "query的分类",
78
+ "items": { "type": "string" },
79
+ "type": "array"
80
+ }
81
+ },
82
+ "required": [ "query_list", "query_tag" ],
83
+ "type": "object"
84
+ }
85
+ }
86
+ ]
87
+ }
88
+ ```
89
+
90
+ **字段说明:**
91
+ - `name`: 函数名称
92
+ - `description`: 函数功能描述
93
+ - `parameters`: 函数参数定义
94
+ - `properties`: 参数属性定义,key 是参数名,value 包含参数的详细描述
95
+ - `required`: 必填参数列表
96
+ - `type`: 参数类型(通常为 "object")
97
+
98
+ ### 模型内部处理格式
99
+
100
+ 在模型内部处理时,函数定义会被转换为特殊格式并拼接到输入文本中:
101
+
102
+ ```
103
+ ]~!b[]~b]system ai_setting=海螺AI
104
+ MiniMax AI是由上海稀宇科技有限公司(MiniMax)自主研发的AI助理。[e~[
105
+ ]~b]system tool_setting=tools
106
+ You are provided with these tools:
107
+ <tools>
108
+ {"name": "search_web", "description": "搜索函数。", "parameters": {"properties": {"query_list": {"description": "进行搜索的关键词,列表元素个数为1。", "items": {"type": "string"}, "type": "array"}, "query_tag": {"description": "query的分类", "items": {"type": "string"}, "type": "array"}}, "required": ["query_list", "query_tag"], "type": "object"}}
109
+ </tools>
110
+
111
+ If you need to call tools, please respond with <tool_calls></tool_calls> XML tags, and provide tool-name and json-object of arguments, following the format below:
112
+ <tool_calls>
113
+ {"name": <tool-name>, "arguments": <args-json-object>}
114
+ ...
115
+ </tool_calls>[e~[
116
+ ]~b]user name=用户
117
+ OpenAI 和 Gemini 的最近一次发布会都是什么时候?[e~[
118
+ ]~b]ai name=海螺AI
119
+ ```
120
+
121
+ ### 模型输出格式
122
+
123
+ 模型会以以下格式输出函数调用:
124
+
125
+ ```xml
126
+ <think>
127
+ Okay, I will search for the OpenAI and Gemini latest release.
128
+ </think>
129
+ <tool_calls>
130
+ {"name": "search_web", "arguments": {"query_tag": ["technology", "events"], "query_list": ["\"OpenAI\" \"latest\" \"release\""]}}
131
+ {"name": "search_web", "arguments": {"query_tag": ["technology", "events"], "query_list": ["\"Gemini\" \"latest\" \"release\""]}}
132
+ </tool_calls>
133
+ ```
134
+
135
+ ## 📥 函数调用结果处理
136
+
137
+ ### 解析函数调用
138
+
139
+ 您可以使用以下代码解析模型输出的函数调用:
140
+
141
+ ```python
142
+ import re
143
+ import json
144
+
145
+ def parse_function_calls(content: str):
146
+ """
147
+ 解析模型输出中的函数调用
148
+ """
149
+ function_calls = []
150
+
151
+ # 匹配 <tool_calls> 标签内的内容
152
+ tool_calls_pattern = r"<tool_calls>(.*?)</tool_calls>"
153
+ tool_calls_match = re.search(tool_calls_pattern, content, re.DOTALL)
154
+
155
+ if not tool_calls_match:
156
+ return function_calls
157
+
158
+ tool_calls_content = tool_calls_match.group(1).strip()
159
+
160
+ # 解析每个函数调用(每行一个JSON对象)
161
+ for line in tool_calls_content.split('\n'):
162
+ line = line.strip()
163
+ if not line:
164
+ continue
165
+
166
+ try:
167
+ # 解析JSON格式的函数调用
168
+ call_data = json.loads(line)
169
+ function_name = call_data.get("name")
170
+ arguments = call_data.get("arguments", {})
171
+
172
+ function_calls.append({
173
+ "name": function_name,
174
+ "arguments": arguments
175
+ })
176
+
177
+ print(f"调用函数: {function_name}, 参数: {arguments}")
178
+
179
+ except json.JSONDecodeError as e:
180
+ print(f"参数解析失败: {line}, 错误: {e}")
181
+
182
+ return function_calls
183
+
184
+ # 示例:处理天气查询函数
185
+ def execute_function_call(function_name: str, arguments: dict):
186
+ """
187
+ 执行函数调用并返回结果
188
+ """
189
+ if function_name == "get_current_weather":
190
+ location = arguments.get("location", "未知位置")
191
+ # 构建函数执行结果
192
+ return {
193
+ "role": "tool",
194
+ "name": function_name,
195
+ "content": json.dumps({
196
+ "location": location,
197
+ "temperature": "25",
198
+ "unit": "celsius",
199
+ "weather": "晴朗"
200
+ }, ensure_ascii=False)
201
+ }
202
+ elif function_name == "search_web":
203
+ query_list = arguments.get("query_list", [])
204
+ query_tag = arguments.get("query_tag", [])
205
+ # 模拟搜索结果
206
+ return {
207
+ "role": "tool",
208
+ "name": function_name,
209
+ "content": f"搜索关键词: {query_list}, 分类: {query_tag}\n搜索结果: 相关信息已找到"
210
+ }
211
+
212
+ return None
213
+ ```
214
+
215
+ ### 将函数执行结果返回给模型
216
+
217
+ 成功解析函数调用后,您应将函数执行结果添加到对话历史中,以便模型在后续交互中能够访问和利用这些信息。
218
+
219
+ #### 单个结果
220
+
221
+ 假如模型调用了 `search_web` 函数,您可以参考如下格式添加执行结果,`name` 字段为具体的函数名称。
222
+
223
+ ```json
224
+ {
225
+ "data": [
226
+ {
227
+ "role": "tool",
228
+ "name": "search_web",
229
+ "content": "search_result"
230
+ }
231
+ ]
232
+ }
233
+ ```
234
+
235
+ 对应如下的模型输入格式:
236
+ ```
237
+ ]~b]tool name=search_web
238
+ search_result[e~[
239
+ ```
240
+
241
+
242
+ #### 多个结果
243
+ 假如模型同时调用了 `search_web` 和 `get_current_weather` 函数,您可以参考如下格式添加执行结果,`name` 字段为"tools",`content`包含多个结果。
244
+
245
+ ```json
246
+ {
247
+ "data": [
248
+ {
249
+ "role": "tool",
250
+ "name": "tools",
251
+ "content": "Tool name: search_web\nTool result: test_result1\n\nTool name: get_current_weather\nTool result: test_result2"
252
+ }
253
+ ]
254
+ }
255
+ ```
256
+
257
+ 对应如下的模型输入格式:
258
+ ```
259
+ ]~b]tool name=tools
260
+ Tool name: search_web
261
+ Tool result: test_result1
262
+
263
+ Tool name: get_current_weather
264
+ Tool result: test_result2[e~[
265
+ ```
266
+
267
+ 虽然我们建议您参考以上格式,但只要返回给模型的输入易于理解,`name` 和 `content` 的具体内容完全由您自主决定。
transformers_deployment_guide.md ADDED
@@ -0,0 +1,97 @@
1
+ # 🚀 MiniMax Model Transformers Deployment Guide
2
+
3
+ [Transformers中文版部署指南](./transformers_deployment_guide_cn.md)
4
+
5
+ ## 📖 Introduction
6
+
7
+ This guide will help you deploy the MiniMax-M1 model using the [Transformers](https://huggingface.co/docs/transformers/index) library. Transformers is a widely used deep learning library that provides a rich collection of pre-trained models and flexible model operation interfaces.
8
+
9
+ ## 🛠️ Environment Setup
10
+
11
+ ### Installing Transformers
12
+
13
+ ```bash
14
+ pip install transformers torch accelerate
15
+ ```
16
+
17
+ ## 📋 Basic Usage Example
18
+
19
+ The pre-trained model can be used as follows:
20
+
21
+ ```python
22
+ from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
23
+
24
+ MODEL_PATH = "{MODEL_PATH}"
25
+ model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", trust_remote_code=True)
26
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
27
+
28
+ messages = [
29
+ {"role": "user", "content": "What is your favourite condiment?"},
30
+ {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
31
+ {"role": "user", "content": "Do you have mayonnaise recipes?"}
32
+ ]
33
+
34
+ text = tokenizer.apply_chat_template(
35
+ messages,
36
+ tokenize=False,
37
+ add_generation_prompt=True
38
+ )
39
+
40
+ model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
41
+
42
+ generation_config = GenerationConfig(
43
+ max_new_tokens=20,
44
+ eos_token_id=tokenizer.eos_token_id,
45
+ use_cache=True,
46
+ )
47
+
48
+ generated_ids = model.generate(**model_inputs, generation_config=generation_config)
49
+
50
+ generated_ids = [
51
+ output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
52
+ ]
53
+
54
+ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
55
+ print(response)
56
+ ```
57
+
58
+ ## ⚡ Performance Optimization
59
+
60
+ ### Speeding up with Flash Attention
61
+
62
+ The code snippet above showcases inference without any optimization tricks. However, one can drastically speed up the model by leveraging [Flash Attention](https://huggingface.co/docs/transformers/perf_train_gpu_one#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
63
+
64
+ First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature:
65
+
66
+ ```bash
67
+ pip install -U flash-attn --no-build-isolation
68
+ ```
69
+
70
+ Also make sure that you have hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of the [Flash Attention repository](https://github.com/Dao-AILab/flash-attention). Additionally, ensure you load your model in half-precision (e.g. `torch.float16`).
71
+
72
+ To load and run a model using Flash Attention-2, refer to the snippet below:
73
+
74
+ ```python
75
+ import torch
76
+ from transformers import AutoModelForCausalLM, AutoTokenizer
77
+
78
+ MODEL_PATH = "{MODEL_PATH}"
79
+ model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, trust_remote_code=True, torch_dtype=torch.float16, attn_implementation="flash_attention_2", device_map="auto")
80
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
81
+
82
+ prompt = "My favourite condiment is"
83
+
84
+ model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
85
+ generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
86
+ response = tokenizer.batch_decode(generated_ids)[0]
87
+ print(response)
88
+ ```
89
+
90
+ ## 📮 Getting Support
91
+
92
+ If you encounter any issues while deploying the MiniMax-M1 model:
93
+ - Please check our official documentation
94
+ - Contact our technical support team through official channels
95
+ - Submit an Issue on our GitHub repository
96
+
97
+ We continuously optimize the deployment experience on Transformers and welcome your feedback!
transformers_deployment_guide_cn.md ADDED
@@ -0,0 +1,95 @@
1
+ # 🚀 MiniMax 模型 Transformers 部署指南
2
+
3
+ ## 📖 简介
4
+
5
+ 本指南将帮助您使用 [Transformers](https://huggingface.co/docs/transformers/index) 库部署 MiniMax-M1 模型。Transformers 是一个广泛使用的深度学习库,提供了丰富的预训练模型和灵活的模型操作接口。
6
+
7
+ ## 🛠️ 环境准备
8
+
9
+ ### 安装 Transformers
10
+
11
+ ```bash
12
+ pip install transformers torch accelerate
13
+ ```
14
+
15
+ ## 📋 基本使用示例
16
+
17
+ 预训练模型可以按照以下方式使用:
18
+
19
+ ```python
20
+ from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
21
+
22
+ MODEL_PATH = "{MODEL_PATH}"
23
+ model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", trust_remote_code=True)
24
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
25
+
26
+ messages = [
27
+ {"role": "user", "content": "What is your favourite condiment?"},
28
+ {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
29
+ {"role": "user", "content": "Do you have mayonnaise recipes?"}
30
+ ]
31
+
32
+ text = tokenizer.apply_chat_template(
33
+ messages,
34
+ tokenize=False,
35
+ add_generation_prompt=True
36
+ )
37
+
38
+ model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
39
+
40
+ generation_config = GenerationConfig(
41
+ max_new_tokens=20,
42
+ eos_token_id=tokenizer.eos_token_id,
43
+ use_cache=True,
44
+ )
45
+
46
+ generated_ids = model.generate(**model_inputs, generation_config=generation_config)
47
+
48
+ generated_ids = [
49
+ output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
50
+ ]
51
+
52
+ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
53
+ print(response)
54
+ ```
55
+
56
+ ## ⚡ 性能优化
57
+
58
+ ### 使用 Flash Attention 加速
59
+
60
+ 上面的代码片段展示了不使用任何优化技巧的推理过程。但通过利用 [Flash Attention](https://huggingface.co/docs/transformers/perf_train_gpu_one#flash-attention-2),可以大幅加速模型,因为它提供了模型内部使用的注意力机制的更快实现。
61
+
62
+ 首先,确保安装最新版本的 Flash Attention 2 以包含滑动窗口注意力功能:
63
+
64
+ ```bash
65
+ pip install -U flash-attn --no-build-isolation
66
+ ```
67
+
68
+ 还要确保您拥有与 Flash-Attention 2 兼容的硬件。在[Flash Attention 官方仓库](https://github.com/Dao-AILab/flash-attention)的官方文档中了解更多信息。此外,请确保以半精度(例如 `torch.float16`)加载模型。
69
+
70
+ 要使用 Flash Attention-2 加载和运行模型,请参考以下代码片段:
71
+
72
+ ```python
73
+ import torch
74
+ from transformers import AutoModelForCausalLM, AutoTokenizer
75
+
76
+ MODEL_PATH = "{MODEL_PATH}"
77
+ model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, trust_remote_code=True, torch_dtype=torch.float16, attn_implementation="flash_attention_2", device_map="auto")
78
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
79
+
80
+ prompt = "My favourite condiment is"
81
+
82
+ model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
83
+ generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
84
+ response = tokenizer.batch_decode(generated_ids)[0]
85
+ print(response)
86
+ ```
87
+
88
+ ## 📮 获取支持
89
+
90
+ 如果您在部署 MiniMax-M1 模型过程中遇到任何问题:
91
+ - 请查看我们的官方文档
92
+ - 通过官方渠道联系我们的技术支持团队
93
+ - 在我们的 GitHub 仓库提交 Issue
94
+
95
+ 我们会持续优化 Transformers 上的部署体验,欢迎您的反馈!
vllm_deployment_guide.md ADDED
@@ -0,0 +1,166 @@
1
+ # 🚀 MiniMax Models vLLM Deployment Guide
2
+
3
+ [VLLM中文版部署指南](./vllm_deployment_guide_cn.md)
4
+
5
+ ## 📖 Introduction
6
+
7
+ We recommend using [vLLM](https://docs.vllm.ai/en/latest/) to deploy the MiniMax-M1 model. Based on our testing, vLLM performs excellently when deploying this model, with the following features:
8
+
9
+ - 🔥 Outstanding service throughput performance
10
+ - ⚡ Efficient and intelligent memory management
11
+ - 📦 Powerful batch request processing capability
12
+ - ⚙️ Deeply optimized underlying performance
13
+
14
+ The MiniMax-M1 model can run efficiently on a single server equipped with 8 H800 or 8 H20 GPUs. In terms of hardware configuration, a server with 8 H800 GPUs can process context inputs up to 2 million tokens, while a server equipped with 8 H20 GPUs can support ultra-long context processing capabilities of up to 5 million tokens.
15
+
16
+ ## 💾 Obtaining MiniMax Models
17
+
18
+ ### Obtaining the MiniMax-M1 Model
19
+
20
+ You can download the model from our official HuggingFace repository: [MiniMax-M1](https://huggingface.co/MiniMaxAI/MiniMax-M1)
21
+
22
+ Download command:
23
+ ```
24
+ pip install -U huggingface-hub
25
+ huggingface-cli download MiniMaxAI/MiniMax-M1
26
+
27
+ # If you encounter network issues, you can set a proxy
28
+ export HF_ENDPOINT=https://hf-mirror.com
29
+ ```
30
+
31
+ Or download using git:
32
+
33
+ ```bash
34
+ git lfs install
35
+ git clone https://huggingface.co/MiniMaxAI/MiniMax-M1
36
+ ```
37
+
38
+ ⚠️ **Important Note**: Please ensure that [Git LFS](https://git-lfs.github.com/) is installed on your system, which is necessary for completely downloading the model weight files.
39
+
40
+ ## 🛠️ Deployment Options
41
+
42
+ ### Option 1: Deploy Using Docker (Recommended)
43
+
44
+ To ensure consistency and stability of the deployment environment, we recommend using Docker for deployment.
45
+
46
+ ⚠️ **Version Requirements**:
47
+ - MiniMax-M1 model requires vLLM version 0.8.3 or later for full support
48
+ - If you are using a Docker image with vLLM version lower than the required version, you will need to:
49
+ 1. Update to the latest vLLM code
50
+ 2. Recompile vLLM from source. Follow the compilation instructions in Solution 2 of the Common Issues section
51
+
52
+ 1. Get the container image:
53
+ ```bash
54
+ docker pull vllm/vllm-openai:v0.8.3
55
+ ```
56
+
57
+ 2. Run the container:
58
+ ```bash
59
+ # Set environment variables
60
+ IMAGE=vllm/vllm-openai:v0.8.3
61
+ MODEL_DIR=<model storage path>
62
+ CODE_DIR=<code path>
63
+ NAME=MiniMaxImage
64
+
65
+ # Docker run configuration
66
+ DOCKER_RUN_CMD="--network=host --privileged --ipc=host --ulimit memlock=-1 --shm-size=2gb --rm --gpus all --ulimit stack=67108864"
67
+
68
+ # Start the container
69
+ sudo docker run -it \
70
+ -v $MODEL_DIR:$MODEL_DIR \
71
+ -v $CODE_DIR:$CODE_DIR \
72
+ --name $NAME \
73
+ $DOCKER_RUN_CMD \
74
+ $IMAGE /bin/bash
75
+ ```
76
+
77
+
78
+ ### Option 2: Direct Installation of vLLM
79
+
80
+ If your environment meets the following requirements:
81
+
82
+ - CUDA 12.1
83
+ - PyTorch 2.1
84
+
85
+ You can install vLLM directly.
86
+
87
+ Installation command:
88
+ ```bash
89
+ pip install vllm
90
+ ```
91
+
92
+ 💡 If you are using other environment configurations, please refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/latest/getting_started/installation.html)
93
+
94
+ ## 🚀 Starting the Service
95
+
96
+ ### Launch MiniMax-M1 Service
97
+
98
+ ```bash
99
+ export SAFETENSORS_FAST_GPU=1
100
+ export VLLM_USE_V1=0
101
+ python3 -m vllm.entrypoints.openai.api_server \
102
+ --model <model storage path> \
103
+ --tensor-parallel-size 8 \
104
+ --trust-remote-code \
105
+ --quantization experts_int8 \
106
+ --max_model_len 4096 \
107
+ --dtype bfloat16
108
+ ```
109
+
110
+ ### API Call Example
111
+
112
+ ```bash
113
+ curl http://localhost:8000/v1/chat/completions \
114
+ -H "Content-Type: application/json" \
115
+ -d '{
116
+ "model": "MiniMaxAI/MiniMax-Text-01",
117
+ "messages": [
118
+ {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
119
+ {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
120
+ ]
121
+ }'
122
+ ```
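+
+ Equivalently, the same endpoint can be called from Python with the OpenAI client library (a sketch, assuming `pip install openai` and that the `model` value matches the name the server registered, i.e. the `--model` path unless `--served-model-name` was set):
+
+ ```python
+ from openai import OpenAI
+
+ # vLLM's OpenAI-compatible server does not check the API key by default.
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="MiniMaxAI/MiniMax-M1",
+     messages=[
+         {"role": "system", "content": "You are a helpful assistant."},
+         {"role": "user", "content": "Who won the world series in 2020?"},
+     ],
+ )
+ print(response.choices[0].message.content)
+ ```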
123
+
124
+ ## ❗ Common Issues
125
+
126
+ ### Module Loading Problems
127
+ If you encounter the following error:
128
+ ```
129
+ import vllm._C # noqa
130
+ ModuleNotFoundError: No module named 'vllm._C'
131
+ ```
132
+
133
+ Or
134
+
135
+ ```
136
+ MiniMax-M1 model is not currently supported
137
+ ```
138
+
139
+ We provide two solutions:
140
+
141
+ #### Solution 1: Copy Dependency Files
142
+ ```bash
143
+ cd <working directory>
144
+ git clone https://github.com/vllm-project/vllm.git
145
+ cd vllm
146
+ cp /usr/local/lib/python3.12/dist-packages/vllm/*.so vllm
147
+ cp -r /usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/* vllm/vllm_flash_attn
148
+ ```
149
+
150
+ #### Solution 2: Install from Source
151
+ ```bash
152
+ cd <working directory>
153
+ git clone https://github.com/vllm-project/vllm.git
154
+
155
+ cd vllm/
156
+ pip install -e .
157
+ ```
158
+
159
+ ## 📮 Getting Support
160
+
161
+ If you encounter any issues while deploying the MiniMax-M1 model:
162
+ - Please check our official documentation
163
+ - Contact our technical support team through official channels
164
+ - Submit an [Issue](https://github.com/MiniMax-AI/MiniMax-M1/issues) on our GitHub repository
165
+
166
+ We will continuously optimize the deployment experience of this model and welcome your feedback!
vllm_deployment_guide_cn.md ADDED
@@ -0,0 +1,161 @@
1
+ # 🚀 MiniMax 模型 vLLM 部署指南
2
+
3
+ ## 📖 简介
4
+
5
+ 我们推荐使用 [vLLM](https://docs.vllm.ai/en/latest/) 来部署 [MiniMax-M1](https://huggingface.co/MiniMaxAI/MiniMax-M1) 模型。经过我们的测试,vLLM 在部署这个模型时表现出色,具有以下特点:
6
+
7
+ - 🔥 卓越的服务吞吐量性能
8
+ - ⚡ 高效智能的内存管理机制
9
+ - 📦 强大的批量请求处理能力
10
+ - ⚙️ 深度优化的底层性能
11
+
12
+ MiniMax-M1 模型可在单台配备8个H800或8个H20 GPU的服务器上高效运行。在硬件配置方面,搭载8个H800 GPU的服务器可处理长达200万token的上下文输入,而配备8个H20 GPU的服务器则能够支持高达500万token的超长上下文处理能力。
13
+
14
+ ## 💾 获取 MiniMax 模型
15
+
16
+ ### MiniMax-M1 模型获取
17
+
18
+ 您可以从我们的官方 HuggingFace 仓库下载模型:[MiniMax-M1](https://huggingface.co/MiniMaxAI/MiniMax-M1)
19
+
20
+ 下载命令:
21
+ ```
22
+ pip install -U huggingface-hub
23
+ huggingface-cli download MiniMaxAI/MiniMax-M1
24
+
25
+ # 如果遇到网络问题,可以设置代理
26
+ export HF_ENDPOINT=https://hf-mirror.com
27
+ ```
28
+
29
+ 或者使用 git 下载:
30
+
31
+ ```bash
32
+ git lfs install
33
+ git clone https://huggingface.co/MiniMaxAI/MiniMax-M1
34
+ ```
35
+
36
+ ⚠️ **重要提示**:请确保系统已安装 [Git LFS](https://git-lfs.github.com/),这对于完整下载模型权重文件是必需的。
37
+
38
+ ## 🛠️ 部署方案
39
+
40
+ ### 方案一:使用 Docker 部署(推荐)
41
+
42
+ 为确保部署环境的一致性和稳定性,我们推荐使用 Docker 进行部署。
43
+
44
+ ⚠️ **版本要求**:
45
+ - MiniMax-M1 模型需要 vLLM 0.8.3 或更高版本才能获得完整支持
46
+
47
+ 1. 获取容器镜像:
48
+ ```bash
49
+ docker pull vllm/vllm-openai:v0.8.3
50
+ ```
51
+
52
+ 2. 运行容器:
53
+ ```bash
54
+ # 设置环境变量
55
+ IMAGE=vllm/vllm-openai:v0.8.3
56
+ MODEL_DIR=<模型存放路径>
57
+ CODE_DIR=<代码路径>
58
+ NAME=MiniMaxImage
59
+
60
+ # Docker运行配置
61
+ DOCKER_RUN_CMD="--network=host --privileged --ipc=host --ulimit memlock=-1 --shm-size=2gb --rm --gpus all --ulimit stack=67108864"
62
+
63
+ # 启动容器
64
+ sudo docker run -it \
65
+ -v $MODEL_DIR:$MODEL_DIR \
66
+ -v $CODE_DIR:$CODE_DIR \
67
+ --name $NAME \
68
+ $DOCKER_RUN_CMD \
69
+ $IMAGE /bin/bash
70
+ ```
71
+
72
+
73
+ ### 方案二:直接安装 vLLM
74
+
75
+ 如果您的环境满足以下要求:
76
+
77
+ - CUDA 12.1
78
+ - PyTorch 2.1
79
+
80
+ 可以直接安装 vLLM
81
+
82
+ 安装命令:
83
+ ```bash
84
+ pip install vllm
85
+ ```
86
+
87
+ 💡 如果您使用其他环境配置,请参考 [vLLM 安装指南](https://docs.vllm.ai/en/latest/getting_started/installation.html)
88
+
89
+ ## 🚀 启动服务
90
+
91
+ ### 启动 MiniMax-M1 服务
92
+
93
+ ```bash
94
+ export SAFETENSORS_FAST_GPU=1
95
+ export VLLM_USE_V1=0
96
+ python3 -m vllm.entrypoints.openai.api_server \
97
+ --model <模型存放路径> \
98
+ --tensor-parallel-size 8 \
99
+ --trust-remote-code \
100
+ --quantization experts_int8 \
101
+ --max_model_len 4096 \
102
+ --dtype bfloat16
103
+ ```
104
+
105
+ ### API 调用示例
106
+
107
+ ```bash
108
+ curl http://localhost:8000/v1/chat/completions \
109
+ -H "Content-Type: application/json" \
110
+ -d '{
111
+ "model": "MiniMaxAI/MiniMax-Text-01",
112
+ "messages": [
113
+ {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
114
+ {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
115
+ ]
116
+ }'
117
+ ```
118
+
119
+ ## ❗ 常见问题
120
+
121
+ ### 模块加载问题
122
+ 如果遇到以下错误:
123
+ ```
124
+ import vllm._C # noqa
125
+ ModuleNotFoundError: No module named 'vllm._C'
126
+ ```
127
+
+ 或者
+
+ ```
131
+ 当前并不支持 MiniMax-M1 模型
132
+ ```
133
+
134
+ 我们提供两种解决方案:
135
+
136
+ #### 解决方案一:复制依赖文件
137
+ ```bash
138
+ cd <工作目录>
139
+ git clone https://github.com/vllm-project/vllm.git
140
+ cd vllm
141
+ cp /usr/local/lib/python3.12/dist-packages/vllm/*.so vllm
142
+ cp -r /usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/* vllm/vllm_flash_attn
143
+ ```
144
+
145
+ #### 解决方案二:从源码安装
146
+ ```bash
147
+ cd <工作目录>
148
+ git clone https://github.com/vllm-project/vllm.git
149
+
150
+ cd vllm/
151
+ pip install -e .
152
+ ```
153
+
154
+ ## 📮 获取支持
155
+
156
+ 如果您在部署 MiniMax-M1 模型过程中遇到任何问题:
157
+ - 请查看我们的官方文档
158
+ - 通过官方渠道联系我们的技术支持团队
159
+ - 在我们的 GitHub 仓库提交 [Issue](https://github.com/MiniMax-AI/MiniMax-M1/issues)
160
+
161
+ 我们会持续优化模型的部署体验,欢迎您的反馈!