Correct the spelling error in the README file: change "guild" to "guide"
#11 by QscQ · opened

README.md CHANGED
@@ -60,7 +60,7 @@ pipeline_tag: image-text-to-text
 # MiniMax-VL-01
 
 ## 1. Introduction
-We are delighted to introduce our **MiniMax-VL-01** model. It adopts the
+We are delighted to introduce our **MiniMax-VL-01** model. It adopts the "ViT-MLP-LLM" framework, which is a commonly used technique in the field of multimodal large language models. The model is initialized and trained with three key parts: a 303-million-parameter Vision Transformer (ViT) for visual encoding, a randomly initialized two-layer MLP projector for image adaptation, and the MiniMax-Text-01 as the base LLM.
 MiniMax-VL-01 has a notable dynamic resolution feature. Input images are resized per a pre-set grid, with resolutions from 336×336 to 2016×2016, keeping a 336×336 thumbnail. The resized images are split into non-overlapping patches of the same size. These patches and the thumbnail are encoded separately and then combined for a full image representation.
 The training data for MiniMax-VL-01 consists of caption, description, and instruction data. The Vision Transformer (ViT) is trained on 694 million image-caption pairs from scratch. Across four distinct stages of the training pipeline, a total of 512 billion tokens are processed, leveraging this vast amount of data to endow the model with strong capabilities.
 Finally, MiniMax-VL-01 has reached top-level performance on multimodal leaderboards, demonstrating its edge and dependability in complex multimodal tasks.
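The paragraph added in this hunk describes a "ViT-MLP-LLM" stack plus a grid-based dynamic-resolution scheme. A minimal sketch of both ideas, assuming a simple aspect-ratio rule for grid selection and hypothetical module names and shapes (this is illustrative, not MiniMax's actual preprocessing or modeling code):

```python
# Hypothetical sketch of the "ViT-MLP-LLM" composition and the dynamic-resolution
# tiling described above; names, shapes, and the grid rule are assumptions.
from PIL import Image
import torch
import torch.nn as nn

TILE = 336  # base resolution; grids run from 336x336 up to 2016x2016 (6x6 tiles)

def pick_grid(w: int, h: int, max_tiles: int = 6) -> tuple[int, int]:
    """Pick a (cols, rows) grid of 336px tiles that best matches the aspect ratio."""
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            err = abs(cols / rows - w / h)
            if err < best_err:
                best, best_err = (cols, rows), err
    return best

def tile_image(img: Image.Image) -> list[Image.Image]:
    """Resize to the chosen grid, split into non-overlapping 336x336 tiles,
    and append a 336x336 thumbnail of the whole image."""
    cols, rows = pick_grid(*img.size)
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    tiles.append(img.resize((TILE, TILE)))  # global thumbnail
    return tiles

class VisionToLLM(nn.Module):
    """ViT encoder -> randomly initialized two-layer MLP projector -> LLM embedding space."""
    def __init__(self, vit: nn.Module, vit_dim: int, llm_dim: int):
        super().__init__()
        self.vit = vit  # stand-in for a ~303M-parameter Vision Transformer
        self.projector = nn.Sequential(
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, tiles: list[torch.Tensor]) -> torch.Tensor:
        # Each tile (and the thumbnail) is encoded separately; the resulting token
        # sequences are concatenated into one image representation for the LLM.
        feats = [self.vit(t) for t in tiles]          # each: (1, n_tokens, vit_dim)
        return self.projector(torch.cat(feats, dim=1))  # (1, total_tokens, llm_dim)
```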
@@ -190,9 +190,18 @@ For production deployment, we recommend using [vLLM](https://docs.vllm.ai/en/lat
 ⚡ Efficient and intelligent memory management
 📦 Powerful batch request processing capability
 ⚙️ Deeply optimized underlying performance
-For detailed deployment instructions, please refer to our [vLLM Deployment Guide](https://github.com/MiniMax-AI/MiniMax-01/blob/main/docs/vllm_deployment_guild.md).
+For detailed deployment instructions, please refer to our [vLLM Deployment Guide](https://github.com/MiniMax-AI/MiniMax-01/blob/main/docs/vllm_deployment_guide.md).
 
-
+## 5. Function Calling
+MiniMax-VL-01 supports Function Calling capability, enabling the model to intelligently identify when external functions need to be called and output parameters in structured JSON format. With Function Calling, you can:
+
+- Let the model recognize implicit function call needs in user requests
+- Receive structured parameter outputs for seamless application integration
+- Support various complex parameter types, including nested objects and arrays
+
+Function Calling supports standard OpenAI-compatible format definitions and integrates seamlessly with the Transformers library. For detailed usage instructions, please refer to our [Function Call Guide](./MiniMax-VL-01_Function_Call_Guide.md) or [Chinese Guide](./MiniMax-VL-01_Function_Call_Guide_CN.md).
+
+## 6. Citation
 
 ```
 @misc{minimax2025minimax01scalingfoundationmodels,
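The new Function Calling section describes OpenAI-compatible tool definitions and structured JSON argument output. A minimal sketch of what that looks like in practice, assuming a hypothetical get_weather function rather than anything taken from the linked guide:

```python
# Hypothetical OpenAI-compatible tool definition plus the parse step for the
# structured JSON arguments the model emits; the function name, schema, and
# example output are illustrative assumptions.
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # assumed example function
        "description": "Look up current weather for a city.",
        "parameters": {  # JSON Schema; nested objects and arrays are supported
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

# When the model decides a call is needed, it emits the function name plus the
# arguments as a JSON string, which the application parses and dispatches.
raw_arguments = '{"city": "Shanghai", "units": "celsius"}'
args = json.loads(raw_arguments)
print(args["city"])  # -> Shanghai
```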
@@ -206,9 +215,9 @@ For detailed deployment instructions, please refer to our [vLLM Deployment Guide
 }
 ```
 
-##
+## 7. Chatbot & API
 For general use and evaluation, we provide a [Chatbot](https://www.hailuo.ai/) with online search capabilities and the [online API](https://www.minimax.io/platform) for developers. For general use and evaluation, we provide the [MiniMax MCP Server](https://github.com/MiniMax-AI/MiniMax-MCP) with video generation, image generation, speech synthesis, and voice cloning for developers.
 
 
-##
+## 8. Contact Us
 Contact us at [[email protected]](mailto:[email protected]).
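On the deployment pointer above: vLLM serves an OpenAI-compatible HTTP API, so once the model is launched a client call might look like the sketch below. The endpoint, API key, model id, and image URL are placeholder assumptions; consult the linked vLLM Deployment Guide for the actual launch flags and supported options.

```python
# Minimal client sketch against a vLLM OpenAI-compatible endpoint, assuming the
# model was started with something like: vllm serve MiniMaxAI/MiniMax-VL-01
# The localhost URL, dummy key, and image URL are placeholders for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-VL-01",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```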