QscQ committed on
Commit 75b83bd · 1 Parent(s): 6b33168
function_call_guide.md ADDED
@@ -0,0 +1,270 @@
1
+ # MiniMax-M1 Function Call Guide
2
+
3
+ [FunctionCall中文使用指南](./function_call_guide_cn.md)
4
+
5
+ ## 📖 Introduction
6
+
7
+ The MiniMax-M1 model supports function calling capabilities, enabling the model to identify when external functions need to be called and output function call parameters in a structured format. This document provides detailed instructions on how to use the function calling feature of MiniMax-M1.
8
+
9
+ ## 🚀 Quick Start
10
+
11
+ ### Using Chat Template
12
+
13
+ MiniMax-M1 uses a specific chat template format to handle function calls. The chat template is defined in `tokenizer_config.json`, and you can use it in your code through the template.
14
+
15
+ ```python
16
+ from transformers import AutoTokenizer
17
+
18
+ def get_default_tools():
+     return [
+         {
+             "name": "get_current_weather",
+             "description": "Get the latest weather for a location",
+             "parameters": {
+                 "type": "object",
+                 "properties": {
+                     "location": {
+                         "type": "string",
+                         "description": "A certain city, such as Beijing, Shanghai"
+                     }
+                 },
+                 "required": ["location"]
+             }
+         }
+     ]
38
+
39
+ # Load the tokenizer (model_id should point to the MiniMax-M1 weights,
+ # e.g. "MiniMaxAI/MiniMax-M1" or a local download path)
+ model_id = "MiniMaxAI/MiniMax-M1"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
41
+ prompt = "What's the weather like in Shanghai today?"
42
+ messages = [
43
+ {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by Minimax based on MiniMax-Text-01 model."}]},
44
+ {"role": "user", "content": [{"type": "text", "text": prompt}]},
45
+ ]
46
+
47
+ # Enable function call tools
48
+ tools = get_default_tools()
49
+
50
+ # Apply chat template and add tool definitions
51
+ text = tokenizer.apply_chat_template(
52
+ messages,
53
+ tokenize=False,
54
+ add_generation_prompt=True,
55
+ tools=tools
56
+ )
57
+ ```
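+
+ The snippet above stops once the templated `text` is built. As a minimal sketch (not part of the original guide), generation could then be run with a standard Transformers causal LM loaded from the same `model_id`, for example:
+
+ ```python
+ from transformers import AutoModelForCausalLM
+
+ # Assumption: model_id points to the MiniMax-M1 weights and the machine has
+ # enough GPU memory; see the Transformers deployment guide for details.
+ model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)
+
+ model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
+ generated_ids = model.generate(**model_inputs, max_new_tokens=512)
+
+ # Keep only the newly generated tokens; the output may contain
+ # <think>...</think> and <tool_calls>...</tool_calls> blocks.
+ output = tokenizer.decode(generated_ids[0][model_inputs.input_ids.shape[1]:], skip_special_tokens=True)
+ print(output)
+ ```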
58
+
59
+ ## 🛠️ Function Call Definition
60
+
61
+ ### Function Structure
62
+
63
+ Function calls need to be defined in the `tools` field of the request body. Each function consists of the following components:
64
+
65
+ ```json
66
+ {
67
+ "tools": [
68
+ {
69
+ "name": "search_web",
70
+ "description": "Search function.",
71
+ "parameters": {
72
+ "properties": {
73
+ "query_list": {
74
+ "description": "Keywords for search, with list element count of 1.",
75
+ "items": { "type": "string" },
76
+ "type": "array"
77
+ },
78
+ "query_tag": {
79
+ "description": "Classification of the query",
80
+ "items": { "type": "string" },
81
+ "type": "array"
82
+ }
83
+ },
84
+ "required": [ "query_list", "query_tag" ],
85
+ "type": "object"
86
+ }
87
+ }
88
+ ]
89
+ }
90
+ ```
91
+
92
+ **Field Descriptions:**
93
+ - `name`: Function name
94
+ - `description`: Function description
95
+ - `parameters`: Function parameter definition
96
+ - `properties`: Parameter property definitions, where key is the parameter name and value contains detailed parameter description
97
+ - `required`: List of required parameters
98
+ - `type`: Parameter type (usually "object")
99
+
100
+ ### Internal Model Processing Format
101
+
102
+ When processed internally by the model, function definitions are converted to a special format and concatenated to the input text:
103
+
104
+ ```
105
+ ]~!b[]~b]system ai_setting=Conch AI
106
+ MiniMax AI is an AI assistant independently developed by MiniMax. [e~[
107
+ ]~b]system tool_setting=tools
108
+ You are provided with these tools:
109
+ <tools>
110
+ {"name": "search_web", "description": "Search function.", "parameters": {"properties": {"query_list": {"description": "Keywords for search, with list element count of 1.", "items": {"type": "string"}, "type": "array"}, "query_tag": {"description": "Classification of the query", "items": {"type": "string"}, "type": "array"}}, "required": ["query_list", "query_tag"], "type": "object"}}
111
+ </tools>
112
+
113
+ If you need to call tools, please respond with <tool_calls></tool_calls> XML tags, and provide tool-name and json-object of arguments, following the format below:
114
+ <tool_calls>
115
+ {"name": <tool-name>, "arguments": <args-json-object>}
116
+ ...
117
+ </tool_calls>[e~[
118
+ ]~b]user name=User
119
+ When were the most recent launch events for OpenAI and Gemini?[e~[
120
+ ]~b]ai name=Conch AI
121
+ ```
122
+
123
+ ### Model Output Format
124
+
125
+ The model outputs function calls in the following format:
126
+
127
+ ```xml
128
+ <think>
129
+ Okay, I will search for the OpenAI and Gemini latest release.
130
+ </think>
131
+ <tool_calls>
132
+ {"name": "search_web", "arguments": {"query_tag": ["technology", "events"], "query_list": ["\"OpenAI\" \"latest\" \"release\""]}}
133
+ {"name": "search_web", "arguments": {"query_tag": ["technology", "events"], "query_list": ["\"Gemini\" \"latest\" \"release\""]}}
134
+ </tool_calls>
135
+ ```
136
+
137
+ ## 📥 Function Call Result Processing
138
+
139
+ ### Parsing Function Calls
140
+
141
+ You can use the following code to parse function calls from the model output:
142
+
143
+ ```python
+ import re
+ import json
+
+ def parse_function_calls(content: str):
+     """
+     Parse function calls from model output
+     """
+     function_calls = []
+
+     # Match content within <tool_calls> tags
+     tool_calls_pattern = r"<tool_calls>(.*?)</tool_calls>"
+     tool_calls_match = re.search(tool_calls_pattern, content, re.DOTALL)
+
+     if not tool_calls_match:
+         return function_calls
+
+     tool_calls_content = tool_calls_match.group(1).strip()
+
+     # Parse each function call (one JSON object per line)
+     for line in tool_calls_content.split('\n'):
+         line = line.strip()
+         if not line:
+             continue
+
+         try:
+             # Parse JSON format function call
+             call_data = json.loads(line)
+             function_name = call_data.get("name")
+             arguments = call_data.get("arguments", {})
+
+             function_calls.append({
+                 "name": function_name,
+                 "arguments": arguments
+             })
+
+             print(f"Function call: {function_name}, Arguments: {arguments}")
+
+         except json.JSONDecodeError as e:
+             print(f"Parameter parsing failed: {line}, Error: {e}")
+
+     return function_calls
+
+ # Example: Handle weather query function
+ def execute_function_call(function_name: str, arguments: dict):
+     """
+     Execute function call and return result
+     """
+     if function_name == "get_current_weather":
+         location = arguments.get("location", "Unknown location")
+         # Build function execution result
+         return {
+             "role": "tool",
+             "name": function_name,
+             "content": json.dumps({
+                 "location": location,
+                 "temperature": "25",
+                 "unit": "celsius",
+                 "weather": "Sunny"
+             }, ensure_ascii=False)
+         }
+     elif function_name == "search_web":
+         query_list = arguments.get("query_list", [])
+         query_tag = arguments.get("query_tag", [])
+         # Simulate search results
+         return {
+             "role": "tool",
+             "name": function_name,
+             "content": f"Search keywords: {query_list}, Categories: {query_tag}\nSearch results: Relevant information found"
+         }
+
+     return None
+ ```
216
+
217
+ ### Returning Function Execution Results to the Model
218
+
219
+ After successfully parsing function calls, you should add the function execution results to the conversation history so that the model can access and utilize this information in subsequent interactions.
220
+
221
+ #### Single Result
222
+
223
+ If the model decides to call `search_web`, we suggest returning the function result in the following format, with the `name` field set to the specific tool name.
224
+
225
+ ```json
226
+ {
227
+ "data": [
228
+ {
229
+ "role": "tool",
230
+ "name": "search_web",
231
+ "content": "search_result"
232
+ }
233
+ ]
234
+ }
235
+ ```
236
+
237
+ Corresponding model input format:
238
+ ```
239
+ ]~b]tool name=search_web
240
+ search_result[e~[
241
+ ```
242
+
243
+
244
+ #### Multiple Results
+ If the model decides to call `search_web` and `get_current_weather` at the same time, we suggest returning the multiple function results in the following format, with the `name` field set to "tools" and the `content` field containing all of the results.
246
+
247
+
248
+ ```json
249
+ {
250
+ "data": [
251
+ {
252
+ "role": "tool",
253
+ "name": "tools",
254
+ "content": "Tool name: search_web\nTool result: test_result1\n\nTool name: get_current_weather\nTool result: test_result2"
255
+ }
256
+ ]
257
+ }
258
+ ```
259
+
260
+ Corresponding model input format:
261
+ ```
262
+ ]~b]tool name=tools
263
+ Tool name: search_web
264
+ Tool result: test_result1
265
+
266
+ Tool name: get_current_weather
267
+ Tool result: test_result2[e~[
268
+ ```
269
+
270
+ While we suggest following the above formats, as long as the model input is easy to understand, the specific values of `name` and `content` are entirely up to the caller.
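+
+ Putting the pieces above together, a minimal end-to-end sketch (assuming the helper functions from this guide, plus a `generate(text) -> str` function provided by whatever serving stack you use, which is not defined here) might look like:
+
+ ```python
+ # Hypothetical glue code; `generate` is a placeholder for your inference call
+ # (Transformers, vLLM, or an HTTP client), not an API from this repository.
+ model_output = generate(text)          # may contain <tool_calls>...</tool_calls>
+ calls = parse_function_calls(model_output)
+
+ for call in calls:
+     result = execute_function_call(call["name"], call["arguments"])
+     if result is not None:
+         # Append the tool result to the conversation history so the model
+         # can use it in the next turn.
+         messages.append(result)
+
+ # Re-apply the chat template with the tool results included and generate
+ # the model's final answer.
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+     tools=tools,
+ )
+ final_answer = generate(text)
+ ```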
function_call_guide_cn.md ADDED
@@ -0,0 +1,267 @@
1
+ # MiniMax-M1 函数调用(Function Call)功能指南
2
+
3
+ ## 📖 简介
4
+
5
+ MiniMax-M1 模型支持函数调用功能,使模型能够识别何时需要调用外部函数,并以结构化格式输出函数调用参数。本文档详细介绍了如何使用 MiniMax-M1 的函数调用功能。
6
+
7
+ ## 🚀 快速开始
8
+
9
+ ### 聊天模板使用
10
+
11
+ MiniMax-M1 使用特定的聊天模板格式处理函数调用。聊天模板定义在 `tokenizer_config.json` 中,你可以在代码中通过 template 来进行使用。
12
+
13
+ ```python
14
+ from transformers import AutoTokenizer
15
+
16
+ def get_default_tools():
+     return [
+         {
+             "name": "get_current_weather",
+             "description": "Get the latest weather for a location",
+             "parameters": {
+                 "type": "object",
+                 "properties": {
+                     "location": {
+                         "type": "string",
+                         "description": "A certain city, such as Beijing, Shanghai"
+                     }
+                 },
+                 "required": ["location"]
+             }
+         }
+     ]
36
+
37
+ # 加载模型和分词器
38
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
39
+ prompt = "What's the weather like in Shanghai today?"
40
+ messages = [
41
+ {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by Minimax based on MiniMax-Text-01 model."}]},
42
+ {"role": "user", "content": [{"type": "text", "text": prompt}]},
43
+ ]
44
+
45
+ # 启用函数调用工具
46
+ tools = get_default_tools()
47
+
48
+ # 应用聊天模板,并加入工具定义
49
+ text = tokenizer.apply_chat_template(
50
+ messages,
51
+ tokenize=False,
52
+ add_generation_prompt=True,
53
+ tools=tools
54
+ )
55
+ ```
56
+
57
+ ## 🛠️ 函数调用的定义
58
+
59
+ ### 函数结构体
60
+
61
+ 函数调用需要在请求体中定义 `tools` 字段,每个函数由以下部分组成:
62
+
63
+ ```json
64
+ {
65
+ "tools": [
66
+ {
67
+ "name": "search_web",
68
+ "description": "搜索函数。",
69
+ "parameters": {
70
+ "properties": {
71
+ "query_list": {
72
+ "description": "进行搜索的关键词,列表元素个数为1。",
73
+ "items": { "type": "string" },
74
+ "type": "array"
75
+ },
76
+ "query_tag": {
77
+ "description": "query的分类",
78
+ "items": { "type": "string" },
79
+ "type": "array"
80
+ }
81
+ },
82
+ "required": [ "query_list", "query_tag" ],
83
+ "type": "object"
84
+ }
85
+ }
86
+ ]
87
+ }
88
+ ```
89
+
90
+ **字段说明:**
91
+ - `name`: 函数名称
92
+ - `description`: 函数功能描述
93
+ - `parameters`: 函数参数定义
94
+ - `properties`: 参数属性定义,key 是参数名,value 包含参数的详细描述
95
+ - `required`: 必填参数列表
96
+ - `type`: 参数类型(通常为 "object")
97
+
98
+ ### 模型内部处理格式
99
+
100
+ 在模型内部处理时,函数定义会被转换为特殊格式并拼接到输入文本中:
101
+
102
+ ```
103
+ ]~!b[]~b]system ai_setting=海螺AI
104
+ MiniMax AI是由上海稀宇科技有限公司(MiniMax)自主研发的AI助理。[e~[
105
+ ]~b]system tool_setting=tools
106
+ You are provided with these tools:
107
+ <tools>
108
+ {"name": "search_web", "description": "搜索函数。", "parameters": {"properties": {"query_list": {"description": "进行搜索的关键词,列表元素个数为1。", "items": {"type": "string"}, "type": "array"}, "query_tag": {"description": "query的分类", "items": {"type": "string"}, "type": "array"}}, "required": ["query_list", "query_tag"], "type": "object"}}
109
+ </tools>
110
+
111
+ If you need to call tools, please respond with <tool_calls></tool_calls> XML tags, and provide tool-name and json-object of arguments, following the format below:
112
+ <tool_calls>
113
+ {"name": <tool-name>, "arguments": <args-json-object>}
114
+ ...
115
+ </tool_calls>[e~[
116
+ ]~b]user name=用户
117
+ OpenAI 和 Gemini 的最近一次发布会都是什么时候?[e~[
118
+ ]~b]ai name=海螺AI
119
+ ```
120
+
121
+ ### 模型输出格式
122
+
123
+ 模型会以以下格式输出函数调用:
124
+
125
+ ```xml
126
+ <think>
127
+ Okay, I will search for the OpenAI and Gemini latest release.
128
+ </think>
129
+ <tool_calls>
130
+ {"name": "search_web", "arguments": {"query_tag": ["technology", "events"], "query_list": ["\"OpenAI\" \"latest\" \"release\""]}}
131
+ {"name": "search_web", "arguments": {"query_tag": ["technology", "events"], "query_list": ["\"Gemini\" \"latest\" \"release\""]}}
132
+ </tool_calls>
133
+ ```
134
+
135
+ ## 📥 函数调用结果处理
136
+
137
+ ### 解析函数调用
138
+
139
+ 您可以使用以下代码解析模型输出的函数调用:
140
+
141
+ ```python
142
+ import re
143
+ import json
144
+
145
+ def parse_function_calls(content: str):
146
+ """
147
+ 解析模型输出中的函数调用
148
+ """
149
+ function_calls = []
150
+
151
+ # 匹配 <tool_calls> 标签内的内容
152
+ tool_calls_pattern = r"<tool_calls>(.*?)</tool_calls>"
153
+ tool_calls_match = re.search(tool_calls_pattern, content, re.DOTALL)
154
+
155
+ if not tool_calls_match:
156
+ return function_calls
157
+
158
+ tool_calls_content = tool_calls_match.group(1).strip()
159
+
160
+ # 解析每个函数调用(每行一个JSON对象)
161
+ for line in tool_calls_content.split('\n'):
162
+ line = line.strip()
163
+ if not line:
164
+ continue
165
+
166
+ try:
167
+ # 解析JSON格式的函数调用
168
+ call_data = json.loads(line)
169
+ function_name = call_data.get("name")
170
+ arguments = call_data.get("arguments", {})
171
+
172
+ function_calls.append({
173
+ "name": function_name,
174
+ "arguments": arguments
175
+ })
176
+
177
+ print(f"调用函数: {function_name}, 参数: {arguments}")
178
+
179
+ except json.JSONDecodeError as e:
180
+ print(f"参数解析失败: {line}, 错误: {e}")
181
+
182
+ return function_calls
183
+
184
+ # 示例:处理天气查询函数
185
+ def execute_function_call(function_name: str, arguments: dict):
186
+ """
187
+ 执行函数调用并返回结果
188
+ """
189
+ if function_name == "get_current_weather":
190
+ location = arguments.get("location", "未知位置")
191
+ # 构建函数执行结果
192
+ return {
193
+ "role": "tool",
194
+ "name": function_name,
195
+ "content": json.dumps({
196
+ "location": location,
197
+ "temperature": "25",
198
+ "unit": "celsius",
199
+ "weather": "晴朗"
200
+ }, ensure_ascii=False)
201
+ }
202
+ elif function_name == "search_web":
203
+ query_list = arguments.get("query_list", [])
204
+ query_tag = arguments.get("query_tag", [])
205
+ # 模拟搜索结果
206
+ return {
207
+ "role": "tool",
208
+ "name": function_name,
209
+ "content": f"搜索关键词: {query_list}, 分类: {query_tag}\n搜索结果: 相关信息已找到"
210
+ }
211
+
212
+ return None
213
+ ```
214
+
215
+ ### 将函数执行结果返回给模型
216
+
217
+ 成功解析函数调用后,您应将函数执行结果添加到对话历史中,以便模型在后续交互中能够访问和利用这些信息。
218
+
219
+ #### 单个结果
220
+
221
+ 假如模型调用了 `search_web` 函数,您可以参考如下格式添加执行结果,`name` 字段为具体的函数名称。
222
+
223
+ ```json
224
+ {
225
+ "data": [
226
+ {
227
+ "role": "tool",
228
+ "name": "search_web",
229
+ "content": "search_result"
230
+ }
231
+ ]
232
+ }
233
+ ```
234
+
235
+ 对应如下的模型输入格式:
236
+ ```
237
+ ]~b]tool name=search_web
238
+ search_result[e~[
239
+ ```
240
+
241
+
242
+ #### 多个结果
243
+ 假如模型同时调用了 `search_web` 和 `get_current_weather` 函数,您可以参考如下格式添加执行结果,`name` 字段为"tools",`content`包含多个结果。
244
+
245
+ ```json
246
+ {
247
+ "data": [
248
+ {
249
+ "role": "tool",
250
+ "name": "tools",
251
+ "content": "Tool name: search_web\nTool result: test_result1\n\nTool name: get_current_weather\nTool result: test_result2"
252
+ }
253
+ ]
254
+ }
255
+ ```
256
+
257
+ 对应如下的模型输入格式:
258
+ ```
259
+ ]~b]tool name=tools
260
+ Tool name: search_web
261
+ Tool result: test_result1
262
+
263
+ Tool name: get_current_weather
264
+ Tool result: test_result2[e~[
265
+ ```
266
+
267
+ 虽然我们建议您参考以上格式,但只要返回给模型的输入易于理解,`name` 和 `content` 的具体内容完全由您自主决定。
transformers_deployment_guide.md ADDED
@@ -0,0 +1,97 @@
1
+ # 🚀 MiniMax Model Transformers Deployment Guide
2
+
3
+ [Transformers中文版部署指南](./transformers_deployment_guide_cn.md)
4
+
5
+ ## 📖 Introduction
6
+
7
+ This guide will help you deploy the MiniMax-M1 model using the [Transformers](https://huggingface.co/docs/transformers/index) library. Transformers is a widely used deep learning library that provides a rich collection of pre-trained models and flexible model operation interfaces.
8
+
9
+ ## 🛠️ Environment Setup
10
+
11
+ ### Installing Transformers
12
+
13
+ ```bash
14
+ pip install transformers torch accelerate
15
+ ```
16
+
17
+ ## 📋 Basic Usage Example
18
+
19
+ The pre-trained model can be used as follows:
20
+
21
+ ```python
22
+ from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
23
+
24
+ MODEL_PATH = "{MODEL_PATH}"
25
+ model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", trust_remote_code=True)
26
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
27
+
28
+ messages = [
29
+ {"role": "user", "content": "What is your favourite condiment?"},
30
+ {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
31
+ {"role": "user", "content": "Do you have mayonnaise recipes?"}
32
+ ]
33
+
34
+ text = tokenizer.apply_chat_template(
35
+ messages,
36
+ tokenize=False,
37
+ add_generation_prompt=True
38
+ )
39
+
40
+ model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
41
+
42
+ generation_config = GenerationConfig(
43
+ max_new_tokens=20,
44
+ eos_token_id=tokenizer.eos_token_id,
45
+ use_cache=True,
46
+ )
47
+
48
+ generated_ids = model.generate(**model_inputs, generation_config=generation_config)
49
+
50
+ generated_ids = [
51
+ output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
52
+ ]
53
+
54
+ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
55
+ print(response)
56
+ ```
57
+
58
+ ## ⚡ Performance Optimization
59
+
60
+ ### Speeding up with Flash Attention
61
+
62
+ The code snippet above showcases inference without any optimization tricks. However, one can drastically speed up the model by leveraging [Flash Attention](https://huggingface.co/docs/transformers/perf_train_gpu_one#flash-attention-2), which is a faster implementation of the attention mechanism used inside the model.
63
+
64
+ First, make sure to install the latest version of Flash Attention 2 to include the sliding window attention feature:
65
+
66
+ ```bash
67
+ pip install -U flash-attn --no-build-isolation
68
+ ```
69
+
70
+ Also make sure that you have hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of the [Flash Attention repository](https://github.com/Dao-AILab/flash-attention). Additionally, ensure you load your model in half-precision (e.g. `torch.float16`).
71
+
72
+ To load and run a model using Flash Attention-2, refer to the snippet below:
73
+
74
+ ```python
75
+ import torch
76
+ from transformers import AutoModelForCausalLM, AutoTokenizer
77
+
78
+ MODEL_PATH = "{MODEL_PATH}"
79
+ model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, trust_remote_code=True, torch_dtype=torch.float16, attn_implementation="flash_attention_2", device_map="auto")
80
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
81
+
82
+ prompt = "My favourite condiment is"
83
+
84
+ model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
85
+ generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
86
+ response = tokenizer.batch_decode(generated_ids)[0]
87
+ print(response)
88
+ ```
89
+
90
+ ## 📮 Getting Support
91
+
92
+ If you encounter any issues while deploying the MiniMax-M1 model:
93
+ - Please check our official documentation
94
+ - Contact our technical support team through official channels
95
+ - Submit an Issue on our GitHub repository
96
+
97
+ We continuously optimize the deployment experience on Transformers and welcome your feedback!
transformers_deployment_guide_cn.md ADDED
@@ -0,0 +1,95 @@
1
+ # 🚀 MiniMax 模型 Transformers 部署指南
2
+
3
+ ## 📖 简介
4
+
5
+ 本指南将帮助您使用 [Transformers](https://huggingface.co/docs/transformers/index) 库部署 MiniMax-M1 模型。Transformers 是一个广泛使用的深度学习库,提供了丰富的预训练模型和灵活的模型操作接口。
6
+
7
+ ## 🛠️ 环境准备
8
+
9
+ ### 安装 Transformers
10
+
11
+ ```bash
12
+ pip install transformers torch accelerate
13
+ ```
14
+
15
+ ## 📋 基本使用示例
16
+
17
+ 预训练模型可以按照以下方式使用:
18
+
19
+ ```python
20
+ from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
21
+
22
+ MODEL_PATH = "{MODEL_PATH}"
23
+ model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", trust_remote_code=True)
24
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
25
+
26
+ messages = [
27
+ {"role": "user", "content": "What is your favourite condiment?"},
28
+ {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
29
+ {"role": "user", "content": "Do you have mayonnaise recipes?"}
30
+ ]
31
+
32
+ text = tokenizer.apply_chat_template(
33
+ messages,
34
+ tokenize=False,
35
+ add_generation_prompt=True
36
+ )
37
+
38
+ model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
39
+
40
+ generation_config = GenerationConfig(
41
+ max_new_tokens=20,
42
+ eos_token_id=tokenizer.eos_token_id,
43
+ use_cache=True,
44
+ )
45
+
46
+ generated_ids = model.generate(**model_inputs, generation_config=generation_config)
47
+
48
+ generated_ids = [
49
+ output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
50
+ ]
51
+
52
+ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
53
+ print(response)
54
+ ```
55
+
56
+ ## ⚡ 性能优化
57
+
58
+ ### 使用 Flash Attention 加速
59
+
60
+ 上面的代码片段展示了不使用任何优化技巧的推理过程。但通过利用 [Flash Attention](https://huggingface.co/docs/transformers/perf_train_gpu_one#flash-attention-2),可以大幅加速模型,因为它提供了模型内部使用的注意力机制的更快实现。
61
+
62
+ 首先,确保安装最新版本的 Flash Attention 2 以包含滑动窗口注意力功能:
63
+
64
+ ```bash
65
+ pip install -U flash-attn --no-build-isolation
66
+ ```
67
+
68
+ 还要确保您拥有与 Flash-Attention 2 兼容的硬件。在[Flash Attention 官方仓库](https://github.com/Dao-AILab/flash-attention)的官方文档中了解更多信息。此外,请确保以半精度(例如 `torch.float16`)加载模型。
69
+
70
+ 要使用 Flash Attention-2 加载和运行模型,请参考以下代码片段:
71
+
72
+ ```python
73
+ import torch
74
+ from transformers import AutoModelForCausalLM, AutoTokenizer
75
+
76
+ MODEL_PATH = "{MODEL_PATH}"
77
+ model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, trust_remote_code=True, torch_dtype=torch.float16, attn_implementation="flash_attention_2", device_map="auto")
78
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
79
+
80
+ prompt = "My favourite condiment is"
81
+
82
+ model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
83
+ generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
84
+ response = tokenizer.batch_decode(generated_ids)[0]
85
+ print(response)
86
+ ```
87
+
88
+ ## 📮 获取支持
89
+
90
+ 如果您在部署 MiniMax-M1 模型过程中遇到任何问题:
91
+ - 请查看我们的官方文档
92
+ - 通过官方渠道联系我们的技术支持团队
93
+ - 在我们的 GitHub 仓库提交 Issue
94
+
95
+ 我们会持续优化 Transformers 上的部署体验,欢迎您的反馈!
vllm_deployment_guide.md ADDED
@@ -0,0 +1,166 @@
1
+ # 🚀 MiniMax Models vLLM Deployment Guide
2
+
3
+ [VLLM中文版部署指南](./vllm_deployment_guide_cn.md)
4
+
5
+ ## 📖 Introduction
6
+
7
+ We recommend using [vLLM](https://docs.vllm.ai/en/latest/) to deploy the MiniMax-M1 model. Based on our testing, vLLM performs excellently when deploying this model, with the following features:
8
+
9
+ - 🔥 Outstanding service throughput performance
10
+ - ⚡ Efficient and intelligent memory management
11
+ - 📦 Powerful batch request processing capability
12
+ - ⚙️ Deeply optimized underlying performance
13
+
14
+ The MiniMax-M1 model can run efficiently on a single server equipped with 8 H800 or 8 H20 GPUs. In terms of hardware configuration, a server with 8 H800 GPUs can process context inputs up to 2 million tokens, while a server equipped with 8 H20 GPUs can support ultra-long context processing capabilities of up to 5 million tokens.
15
+
16
+ ## 💾 Obtaining MiniMax Models
17
+
18
+ ### Obtaining the MiniMax-M1 Model
19
+
20
+ You can download the model from our official HuggingFace repository: [MiniMax-M1](https://huggingface.co/MiniMaxAI/MiniMax-M1)
21
+
22
+ Download command:
23
+ ```
24
+ pip install -U huggingface-hub
25
+ huggingface-cli download MiniMaxAI/MiniMax-M1
26
+
27
+ # If you encounter network issues, you can set a proxy
28
+ export HF_ENDPOINT=https://hf-mirror.com
29
+ ```
30
+
31
+ Or download using git:
32
+
33
+ ```bash
34
+ git lfs install
35
+ git clone https://huggingface.co/MiniMaxAI/MiniMax-M1
36
+ ```
37
+
38
+ ⚠️ **Important Note**: Please ensure that [Git LFS](https://git-lfs.github.com/) is installed on your system, which is necessary for completely downloading the model weight files.
39
+
40
+ ## 🛠️ Deployment Options
41
+
42
+ ### Option 1: Deploy Using Docker (Recommended)
43
+
44
+ To ensure consistency and stability of the deployment environment, we recommend using Docker for deployment.
45
+
46
+ ⚠️ **Version Requirements**:
47
+ - MiniMax-M1 model requires vLLM version 0.8.3 or later for full support
48
+ - If you are using a Docker image with vLLM version lower than the required version, you will need to:
49
+ 1. Update to the latest vLLM code
50
+ 2. Recompile vLLM from source. Follow the compilation instructions in Solution 2 of the Common Issues section
51
+
52
+ 1. Get the container image:
53
+ ```bash
54
+ docker pull vllm/vllm-openai:v0.8.3
55
+ ```
56
+
57
+ 2. Run the container:
58
+ ```bash
59
+ # Set environment variables
60
+ IMAGE=vllm/vllm-openai:v0.8.3
61
+ MODEL_DIR=<model storage path>
62
+ CODE_DIR=<code path>
63
+ NAME=MiniMaxImage
64
+
65
+ # Docker run configuration
66
+ DOCKER_RUN_CMD="--network=host --privileged --ipc=host --ulimit memlock=-1 --shm-size=2gb --rm --gpus all --ulimit stack=67108864"
67
+
68
+ # Start the container
69
+ sudo docker run -it \
70
+ -v $MODEL_DIR:$MODEL_DIR \
71
+ -v $CODE_DIR:$CODE_DIR \
72
+ --name $NAME \
73
+ $DOCKER_RUN_CMD \
74
+ $IMAGE /bin/bash
75
+ ```
76
+
77
+
78
+ ### Option 2: Direct Installation of vLLM
79
+
80
+ If your environment meets the following requirements:
81
+
82
+ - CUDA 12.1
83
+ - PyTorch 2.1
84
+
85
+ You can install vLLM directly.
86
+
87
+ Installation command:
88
+ ```bash
89
+ pip install vllm
90
+ ```
91
+
92
+ 💡 If you are using other environment configurations, please refer to the [vLLM Installation Guide](https://docs.vllm.ai/en/latest/getting_started/installation.html)
93
+
94
+ ## 🚀 Starting the Service
95
+
96
+ ### Launch MiniMax-M1 Service
97
+
98
+ ```bash
99
+ export SAFETENSORS_FAST_GPU=1
100
+ export VLLM_USE_V1=0
101
+ python3 -m vllm.entrypoints.openai.api_server \
102
+ --model <model storage path> \
103
+ --tensor-parallel-size 8 \
104
+ --trust-remote-code \
105
+ --quantization experts_int8 \
106
+ --max_model_len 4096 \
107
+ --dtype bfloat16
108
+ ```
109
+
110
+ ### API Call Example
111
+
112
+ ```bash
113
+ curl http://localhost:8000/v1/chat/completions \
114
+ -H "Content-Type: application/json" \
115
+ -d '{
116
+ "model": "MiniMaxAI/MiniMax-Text-01",
117
+ "messages": [
118
+ {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
119
+ {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
120
+ ]
121
+ }'
122
+ ```
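+
+ Equivalently, the same endpoint can be called from Python with the OpenAI client library (a sketch, assuming `pip install openai` and that the `model` value matches the name the server registered, i.e. the `--model` path unless `--served-model-name` was set):
+
+ ```python
+ from openai import OpenAI
+
+ # vLLM's OpenAI-compatible server does not check the API key by default.
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="MiniMaxAI/MiniMax-M1",
+     messages=[
+         {"role": "system", "content": "You are a helpful assistant."},
+         {"role": "user", "content": "Who won the world series in 2020?"},
+     ],
+ )
+ print(response.choices[0].message.content)
+ ```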
123
+
124
+ ## ❗ Common Issues
125
+
126
+ ### Module Loading Problems
127
+ If you encounter the following error:
128
+ ```
129
+ import vllm._C # noqa
130
+ ModuleNotFoundError: No module named 'vllm._C'
131
+ ```
132
+
133
+ Or
134
+
135
+ ```
136
+ MiniMax-M1 model is not currently supported
137
+ ```
138
+
139
+ We provide two solutions:
140
+
141
+ #### Solution 1: Copy Dependency Files
142
+ ```bash
143
+ cd <working directory>
144
+ git clone https://github.com/vllm-project/vllm.git
145
+ cd vllm
146
+ cp /usr/local/lib/python3.12/dist-packages/vllm/*.so vllm
147
+ cp -r /usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/* vllm/vllm_flash_attn
148
+ ```
149
+
150
+ #### Solution 2: Install from Source
151
+ ```bash
152
+ cd <working directory>
153
+ git clone https://github.com/vllm-project/vllm.git
154
+
155
+ cd vllm/
156
+ pip install -e .
157
+ ```
158
+
159
+ ## 📮 Getting Support
160
+
161
+ If you encounter any issues while deploying the MiniMax-M1 model:
162
+ - Please check our official documentation
163
+ - Contact our technical support team through official channels
164
+ - Submit an [Issue](https://github.com/MiniMax-AI/MiniMax-M1/issues) on our GitHub repository
165
+
166
+ We will continuously optimize the deployment experience of this model and welcome your feedback!
vllm_deployment_guide_cn.md ADDED
@@ -0,0 +1,161 @@
1
+ # 🚀 MiniMax 模型 vLLM 部署指南
2
+
3
+ ## 📖 简介
4
+
5
+ 我们推荐使用 [vLLM](https://docs.vllm.ai/en/latest/) 来部署 [MiniMax-M1](https://huggingface.co/MiniMaxAI/MiniMax-M1) 模型。经过我们的测试,vLLM 在部署这个模型时表现出色,具有以下特点:
6
+
7
+ - 🔥 卓越的服务吞吐量性能
8
+ - ⚡ 高效智能的内存管理机制
9
+ - 📦 强大的批量请求处理能力
10
+ - ⚙️ 深度优化的底层性能
11
+
12
+ MiniMax-M1 模型可在单台配备8个H800或8个H20 GPU的服务器上高效运行。在硬件配置方面,搭载8个H800 GPU的服务器可处理长达200万token的上下文输入,而配备8个H20 GPU的服务器则能够支持高达500万token的超长上下文处理能力。
13
+
14
+ ## 💾 获取 MiniMax 模型
15
+
16
+ ### MiniMax-M1 模型获取
17
+
18
+ 您可以从我们的官方 HuggingFace 仓库下载模型:[MiniMax-M1](https://huggingface.co/MiniMaxAI/MiniMax-M1)
19
+
20
+ 下载命令:
21
+ ```
22
+ pip install -U huggingface-hub
23
+ huggingface-cli download MiniMaxAI/MiniMax-M1
24
+
25
+ # 如果遇到网络问题,可以设置代理
26
+ export HF_ENDPOINT=https://hf-mirror.com
27
+ ```
28
+
29
+ 或者使用 git 下载:
30
+
31
+ ```bash
32
+ git lfs install
33
+ git clone https://huggingface.co/MiniMaxAI/MiniMax-M1
34
+ ```
35
+
36
+ ⚠️ **重要提示**:请确保系统已安装 [Git LFS](https://git-lfs.github.com/),这对于完整下载模型权重文件是必需的。
37
+
38
+ ## 🛠️ 部署方案
39
+
40
+ ### 方案一:使用 Docker 部署(推荐)
41
+
42
+ 为确保部署环境的一致性和稳定性,我们推荐使用 Docker 进行部署。
43
+
44
+ ⚠️ **版本要求**:
45
+ - MiniMax-M1 模型需要 vLLM 0.8.3 或更高版本才能获得完整支持
46
+
47
+ 1. 获取容器镜像:
48
+ ```bash
49
+ docker pull vllm/vllm-openai:v0.8.3
50
+ ```
51
+
52
+ 2. 运行容器:
53
+ ```bash
54
+ # 设置环境变量
55
+ IMAGE=vllm/vllm-openai:v0.8.3
56
+ MODEL_DIR=<模型存放路径>
57
+ CODE_DIR=<代码路径>
58
+ NAME=MiniMaxImage
59
+
60
+ # Docker运行配置
61
+ DOCKER_RUN_CMD="--network=host --privileged --ipc=host --ulimit memlock=-1 --shm-size=2gb --rm --gpus all --ulimit stack=67108864"
62
+
63
+ # 启动容器
64
+ sudo docker run -it \
65
+ -v $MODEL_DIR:$MODEL_DIR \
66
+ -v $CODE_DIR:$CODE_DIR \
67
+ --name $NAME \
68
+ $DOCKER_RUN_CMD \
69
+ $IMAGE /bin/bash
70
+ ```
71
+
72
+
73
+ ### 方案二:直接安装 vLLM
74
+
75
+ 如果您的环境满足以下要求:
76
+
77
+ - CUDA 12.1
78
+ - PyTorch 2.1
79
+
80
+ 可以直接安装 vLLM
81
+
82
+ 安装命令:
83
+ ```bash
84
+ pip install vllm
85
+ ```
86
+
87
+ 💡 如果您使用其他环境配置,请参考 [vLLM 安装指南](https://docs.vllm.ai/en/latest/getting_started/installation.html)
88
+
89
+ ## 🚀 启动服务
90
+
91
+ ### 启动 MiniMax-M1 服务
92
+
93
+ ```bash
94
+ export SAFETENSORS_FAST_GPU=1
95
+ export VLLM_USE_V1=0
96
+ python3 -m vllm.entrypoints.openai.api_server \
97
+ --model <模型存放路径> \
98
+ --tensor-parallel-size 8 \
99
+ --trust-remote-code \
100
+ --quantization experts_int8 \
101
+ --max_model_len 4096 \
102
+ --dtype bfloat16
103
+ ```
104
+
105
+ ### API 调用示例
106
+
107
+ ```bash
108
+ curl http://localhost:8000/v1/chat/completions \
109
+ -H "Content-Type: application/json" \
110
+ -d '{
111
+ "model": "MiniMaxAI/MiniMax-Text-01",
112
+ "messages": [
113
+ {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
114
+ {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
115
+ ]
116
+ }'
117
+ ```
118
+
119
+ ## ❗ 常见问题
120
+
121
+ ### 模块加载问题
122
+ 如果遇到以下错误:
123
+ ```
124
+ import vllm._C # noqa
125
+ ModuleNotFoundError: No module named 'vllm._C'
126
+ ```
127
+
+ 或者
+
+ ```
131
+ 当前并不支持 MiniMax-M1 模型
132
+ ```
133
+
134
+ 我们提供两种解决方案:
135
+
136
+ #### 解决方案一:复制依赖文件
137
+ ```bash
138
+ cd <工作目录>
139
+ git clone https://github.com/vllm-project/vllm.git
140
+ cd vllm
141
+ cp /usr/local/lib/python3.12/dist-packages/vllm/*.so vllm
142
+ cp -r /usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/* vllm/vllm_flash_attn
143
+ ```
144
+
145
+ #### 解决方案二:从源码安装
146
+ ```bash
147
+ cd <工作目录>
148
+ git clone https://github.com/vllm-project/vllm.git
149
+
150
+ cd vllm/
151
+ pip install -e .
152
+ ```
153
+
154
+ ## 📮 获取支持
155
+
156
+ 如果您在部署 MiniMax-M1 模型过程中遇到任何问题:
157
+ - 请查看我们的官方文档
158
+ - 通过官方渠道联系我们的技术支持团队
159
+ - 在我们的 GitHub 仓库提交 [Issue](https://github.com/MiniMax-AI/MiniMax-M1/issues)
160
+
161
+ 我们会持续优化模型的部署体验,欢迎您的反馈!