language:
- en
- fr
- de
- es
- it
- pt
- zh
- ja
- ru
- ko
license: other
license_name: mrl
base_model: mistralai/Pixtral-Large-Instruct-2411
inference: false
license_link: https://mistral.ai/licenses/MRL-0.1.md
library_name: transformers
pipeline_tag: image-text-to-text
Pixtral-Large-Instruct-2411 🧡
Transformers implementation of Pixtral-Large-Instruct-2411.
21 Dec 2024: This model has been a LOT of fun to experiment and learn with. Model card updated below with changes made to this repo over the last week.
Architecture Differences to Pixtral 12B
Pixtral 12B has bias keys for the multi_modal_projector layers, whereas Pixtral Large does not. Instead of including those keys with low/zero values,
this conversion omits them, matching the keys present in the original Pixtral Large upload from Mistral. The
model's config.json file includes "multimodal_projector_bias": false to flag this. n.b. If anyone in the community confirms that initializing
these keys with zero values is the better way to go, I'm happy to reupload with them included.
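If you want to confirm the flag before loading anything heavy, reading config.json directly is enough. This is only a minimal sketch: the helper name `projector_uses_bias` is made up for illustration, and treating a missing key as "bias present" is my assumption, not something the repo defines.

```python
import json
import os
import tempfile


def projector_uses_bias(config_path):
    """Return False when config.json sets "multimodal_projector_bias": false."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    # Assumption: a missing key means the projector keeps its bias terms.
    return bool(config.get("multimodal_projector_bias", True))


# Demo against a throwaway file so the sketch runs without the real repo.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "config.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"multimodal_projector_bias": False}, f)
    demo_result = projector_uses_bias(demo_path)

print(demo_result)
```

Point `config_path` at the downloaded config.json to run the same check on the real file.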
Tokenizer
This model uses a conversion of the Mistral v7m1 tokenizer. Pixtral 12B and Large use different tokenizers with different vocab sizes, so make sure you use the right tokenizer.
Prompting / Chat Template
The included chat_template.json supports all of Mistral's defined features with some of my own additions.
I believe this implementation should give quite a lot of flexibility for using the model, and in my testing has worked quite well.
Example (line breaks added for readability)
<s>[SYSTEM_PROMPT] <system prompt>[/SYSTEM_PROMPT]
[INST] [IMG]<user message>
[AVAILABLE_TOOLS] [<tool definitions>][/AVAILABLE_TOOLS][/INST]
[IMG]<assistant response>
[TOOL_CALLS] [<tool calls>][/TOOL_CALLS]
[TOOL_RESULTS] <tool results including images>[/TOOL_RESULTS]
</s>[INST] <user message>[/INST]
System Prompts:
Messages with role "system" will be parsed as [SYSTEM_PROMPT] <content>[/SYSTEM_PROMPT] anywhere they appear in chat history.
This appears to work pretty well for passing extra instructions at various depths, and keeps instructions separate from conversation.
Allowing Non-Alternating Roles:
Multiple user messages in a row can be provided, and each will be delimited by its own [INST][/INST] pair. This could work well in group conversation
settings, or environments where multiple user messages can be provided before the model is invoked. Having a [/INST] breaking each one up
appeared to help keep the model from thinking it needs to respond to every previous message, so it focuses on the last message while still retaining
knowledge of the messages before it.
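To make the system prompt and non-alternating behaviour concrete, here is a rough Python sketch that renders only system and user messages with the bracket tokens from the example above. It is not the actual chat_template.json logic: the function name `render_turns` is hypothetical, spacing is copied from the example, and tools, images, and assistant turns are deliberately left out.

```python
def render_turns(messages):
    """Render system and user messages using the bracket tokens from the example."""
    parts = ["<s>"]
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            # System messages are wrapped wherever they appear in the history.
            parts.append(f"[SYSTEM_PROMPT] {content}[/SYSTEM_PROMPT]")
        elif role == "user":
            # Each user message gets its own [INST]...[/INST] pair,
            # even when several arrive in a row.
            parts.append(f"[INST] {content}[/INST]")
        else:
            raise ValueError(f"role not covered by this sketch: {role}")
    return "".join(parts)


prompt = render_turns([
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Alice: hi"},
    {"role": "user", "content": "Bob: what's in the photo?"},
])
print(prompt)
```

For real use, rely on the bundled chat template rather than a hand-rolled renderer like this one.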
Image Inputs Everywhere:
Images can now be sent in user, assistant, and tool result messages, and this seems to actually work. In my tests, for example, I included an image in an
assistant reply 10-15 messages back in the conversation and asked the assistant to recall what image it had previously sent; it was able to
accurately describe it.
Having this flexibility could allow for interesting applications, for example if you were to define a tool definition for image generation:
- tool is invoked and calls image generation api/model
- image returned inside tool result message
- model responds with a message with context of the image generated
- you can have further conversation about the generated image, or make revisions with the model actually knowing what was created
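The flow in the list above could be sketched roughly as follows. Everything here is an assumption for illustration: `fake_image_api` stands in for whatever image generation API or model you call, and the message dictionary layout (`tool_calls`, `name`, a text part plus an image part) is a guess at a reasonable shape, so check chat_template.json for the field names it actually expects.

```python
def fake_image_api(prompt):
    """Hypothetical stand-in for an image generation API; returns a placeholder reference."""
    return f"generated://{prompt.replace(' ', '_')}.png"


def run_image_tool_turn(history, tool_call):
    """Append the tool call and its image-bearing result to the chat history."""
    image_ref = fake_image_api(tool_call["arguments"]["prompt"])
    # Record the assistant's tool call (assumed field name: "tool_calls").
    history.append({"role": "assistant", "tool_calls": [tool_call]})
    history.append({
        "role": "tool",
        "name": tool_call["name"],
        # Assumed content layout: a text part plus an image part.
        "content": [
            {"type": "text", "text": "Image generated."},
            {"type": "image", "image": image_ref},
        ],
    })
    return history


history = [{"role": "user", "content": "Draw a red fox in the snow."}]
call = {"name": "generate_image", "arguments": {"prompt": "red fox in snow"}}
history = run_image_tool_turn(history, call)
print([message["role"] for message in history])
```

After this, the model's next turn would see the generated image inside the tool result, which is what lets later revisions refer to what was actually created.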
Usage
When loading in transformers, you'll probably want to add some handling to ensure the lack of mmproj (multi_modal_projector) bias is respected so that vision input is handled properly.
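One cheap sanity check is to scan the loaded parameter names for projector bias terms that should not be there. This is only a sketch: `projector_bias_keys` is a hypothetical helper, and the example key names are illustrative, since the real names depend on the model class you load with.

```python
def projector_bias_keys(param_names):
    """Return parameter names that look like multi_modal_projector bias terms."""
    return [
        name for name in param_names
        if "multi_modal_projector" in name and name.endswith(".bias")
    ]


# Hypothetical key names for illustration; real names depend on the model class.
example_names = [
    "multi_modal_projector.linear_1.weight",
    "multi_modal_projector.linear_1.bias",
    "language_model.lm_head.weight",
]
unexpected = projector_bias_keys(example_names)
print(unexpected)
```

With a loaded model you could pass `model.state_dict().keys()`; an empty result would suggest no projector bias was created, while any hits mean randomly initialized bias terms may be interfering with vision input.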
Most of my testing has been using TabbyAPI and ExLlamaV2 (dev branch) with working vision input.
Quantizations
EXL2 quants are available in different sizes here. You'll need to use the dev branch of ExLlamaV2 for vision input.