metadata

language:
  - en
  - fr
  - de
  - es
  - it
  - pt
  - zh
  - ja
  - ru
  - ko
license: other
license_name: mrl
base_model: mistralai/Pixtral-Large-Instruct-2411
inference: false
license_link: https://mistral.ai/licenses/MRL-0.1.md
library_name: transformers
pipeline_tag: image-text-to-text

Pixtral-Large-Instruct-2411 🧡

Transformers implementation of Pixtral-Large-Instruct-2411.

21 Dec 2024: This model has been a LOT of fun to experiment and learn with. Model card updated below with changes made to this repo over the last week.

Architecture Differences to Pixtral 12B

Pixtral 12B has bias keys for the multi_modal_projector layers, whereas Pixtral Large does not. Rather than adding those keys with low or zero values, this conversion leaves them out, matching the keys present in the original Pixtral Large upload from Mistral. The model's config.json file includes "multimodal_projector_bias": false to flag this. N.B. If anyone in the community confirms that initializing these keys with zero values is the better way to go, I'm happy to reupload with them included.
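A quick way to sanity-check a checkpoint against that flag is to compare the weight names with the config value. The snippet below is only a minimal sketch: the key names (`linear_1`, `linear_2`) and the helper functions are illustrative assumptions, not taken from this repo, and treating a missing flag as "bias expected" is also an assumption.

```python
from typing import Iterable, List


def projector_bias_keys(keys: Iterable[str]) -> List[str]:
    """Return weight names that look like multi_modal_projector bias tensors."""
    return [k for k in keys if "multi_modal_projector" in k and k.endswith(".bias")]


def bias_matches_config(config: dict, keys: Iterable[str]) -> bool:
    """True when the presence of projector bias keys agrees with the config flag."""
    # Assumption: if the flag is absent, the projector is expected to carry biases
    # (the Pixtral 12B layout described above).
    expects_bias = config.get("multimodal_projector_bias", True)
    has_bias = bool(projector_bias_keys(keys))
    return expects_bias == has_bias


# Hypothetical key names, for illustration only.
large_style_keys = [
    "multi_modal_projector.linear_1.weight",
    "multi_modal_projector.linear_2.weight",
]
small_style_keys = large_style_keys + [
    "multi_modal_projector.linear_1.bias",
    "multi_modal_projector.linear_2.bias",
]

large_config = {"multimodal_projector_bias": False}
print(bias_matches_config(large_config, large_style_keys))
print(bias_matches_config(large_config, small_style_keys))
```

With a real checkpoint, the key list would come from the weight files rather than a hard-coded list.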

Tokenizer

This model uses a conversion of the Mistral v7m1 tokenizer. Pixtral 12B and Large use different tokenizers with different vocab sizes, so make sure you use the right tokenizer.

Prompting / Chat Template

The included chat_template.json supports all of Mistral's defined features with some of my own additions.

I believe this implementation gives quite a lot of flexibility for using the model, and it has worked well in my testing.

Example (line breaks added for readability)

<s>[SYSTEM_PROMPT] <system prompt>[/SYSTEM_PROMPT]  
[INST] [IMG]<user message>  
[AVAILABLE_TOOLS] [<tool definitions>][/AVAILABLE_TOOLS][/INST]  
[IMG]<assistant response>  
[TOOL_CALLS] [<tool calls>][/TOOL_CALLS]  
[TOOL_RESULTS] <tool results including images>[/TOOL_RESULTS]  
</s>[INST] <user message>[/INST]

System Prompts:
Messages with role "system" will be parsed as [SYSTEM_PROMPT] <content>[/SYSTEM_PROMPT] anywhere they appear in chat history.

This appears to work pretty well for passing extra instructions at various depths, and keeps instructions separate from conversation.

Allowing Non-Alternating Roles:
Multiple user messages in a row can be provided, and each will be wrapped in its own [INST]...[/INST] pair. This could work well in group conversation settings, or in environments where several user messages can arrive before the model is invoked. Having a [/INST] closing each one appeared to stop the model from thinking it needs to respond to every previous message, and helped it focus on the last message while still retaining knowledge of the messages before it.
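To make the system prompt and non-alternating role behaviour concrete, here is a rough text-only approximation of how messages map onto the format shown in the example above. It is not the actual chat_template.json logic: `render_prompt` is a hypothetical helper, it ignores images and tools, and the exact whitespace around tokens is an assumption.

```python
def render_prompt(messages):
    """Flatten text-only chat messages into a Pixtral-Large-style prompt string.

    Simplified illustration only; the real template also handles images,
    tool definitions, tool calls and tool results.
    """
    parts = ["<s>"]
    for msg in messages:
        role, content = msg["role"], msg["content"]
        if role == "system":
            # System messages are rendered wherever they appear in the history.
            parts.append(f"[SYSTEM_PROMPT] {content}[/SYSTEM_PROMPT]")
        elif role == "user":
            # Every user message gets its own [INST]...[/INST] pair,
            # even when several user messages arrive in a row.
            parts.append(f"[INST] {content}[/INST]")
        elif role == "assistant":
            parts.append(f"{content}</s>")
        else:
            raise ValueError(f"role not covered by this sketch: {role}")
    return "".join(parts)


chat = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Alice: hi"},
    {"role": "user", "content": "Bob: what's for lunch?"},
]
print(render_prompt(chat))
```

In practice you would let the tokenizer or processor apply the bundled template rather than building strings by hand; the sketch is just for reading the format.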

Image Inputs Everywhere:
Images can now be sent in user, assistant, and tool result messages, and this seems to actually work. In one test I included an image in an assistant reply 10-15 messages back in the conversation, then asked the assistant to recall what image it had previously sent, and it was able to describe it accurately.

Having this flexibility could allow for interesting applications. For example, if you were to define a tool for image generation:

- tool is invoked and calls image generation api/model
- image returned inside tool result message
- model responds with a message with context of the image generated
- you can have further conversation about the generated image, or make revisions with the model actually knowing what was created
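The flow above could look something like the message list below. This is a sketch under assumptions: the `generate_image` tool, its arguments, the call id, and the exact message fields the template accepts for tool results are all hypothetical here, written in the common list-of-content-parts chat format with a `{"type": "image"}` placeholder.

```python
def image_tool_exchange(user_request, call_id="abc123xyz"):
    """Build a user -> tool call -> tool result exchange that returns an image.

    The tool name "generate_image" and the call id are made up for illustration.
    """
    return [
        {"role": "user", "content": [{"type": "text", "text": user_request}]},
        {
            "role": "assistant",
            "tool_calls": [
                {
                    "id": call_id,
                    "type": "function",
                    "function": {
                        "name": "generate_image",
                        "arguments": {"prompt": user_request},
                    },
                }
            ],
        },
        {
            "role": "tool",
            "tool_call_id": call_id,
            "name": "generate_image",
            # The generated image travels back inside the tool result,
            # so the model can see what was actually created.
            "content": [
                {"type": "text", "text": "Image generated."},
                {"type": "image"},
            ],
        },
    ]


def roles_with_images(messages):
    """List the roles of messages that carry at least one image part."""
    found = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list) and any(p.get("type") == "image" for p in content):
            found.append(msg["role"])
    return found


exchange = image_tool_exchange("Draw a red fox in the snow")
print(roles_with_images(exchange))
```

The next assistant turn would then be generated with the image in context, and follow-up user messages can ask for revisions.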

Usage

When loading in transformers, you'll probably want to add some handling so that the missing mmproj (multi_modal_projector) bias is respected; otherwise vision input may not be handled properly.

Most of my testing has been using TabbyAPI and ExLlamaV2 (dev branch) with working vision input.

Quantizations

EXL2 quants are available in different sizes here. You'll need to use the dev branch of ExLlamaV2 for vision input.