---
license: apache-2.0
base_model:
- mistralai/Pixtral-12B-2409
library_name: transformers
tags:
- text-generation-inference
---

# Pixtral-12B-2409 - HuggingFace Transformers Compatible Weights

## Model Overview

This repository contains the Hugging Face Transformers compatible weights for the Pixtral-12B-2409 multimodal model. The weights have been converted to integrate seamlessly with the Hugging Face Transformers library, so the model can be loaded and used directly in your projects.

## Model Details

- **Original Model**: Pixtral-12B-2409 by Mistral AI
- **Model Type**: Multimodal Language Model
- **Parameters**: 12B-parameter multimodal decoder + 400M-parameter vision encoder
- **Sequence Length**: 128k tokens
- **License**: Apache 2.0

## Key Features

- Natively multimodal, trained with interleaved image and text data
- Supports variable image sizes (see the multi-image sketch at the end of this card)
- Leading performance in its weight class on multimodal tasks
- Maintains state-of-the-art performance on text-only benchmarks

## Conversion Details

This repository provides the original Pixtral model weights converted to be fully compatible with the Hugging Face Transformers library. The conversion ensures:

- Seamless loading with `from_pretrained()`
- Full compatibility with the Hugging Face Transformers pipeline
- No modifications to the original model weights or architecture

## Installation

Load the model with the Transformers library:

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

processor = AutoProcessor.from_pretrained("Prarabdha/pixtral-12b-240910-hf")
model = AutoModelForImageTextToText.from_pretrained(
    "Prarabdha/pixtral-12b-240910-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
```

## Example Usage

```python
from PIL import Image
import requests

# Load an image
url = "https://example.com/sample-image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare the conversation
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

# Process and generate
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=[image], text=prompt, return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=30)
response = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
```

## Performance Benchmarks

### Multimodal Benchmarks

| Benchmark | Pixtral 12B | Qwen2-VL 7B | LLaVA-OV 7B | Phi-3 Vision |
|-----------|-------------|-------------|-------------|--------------|
| MMMU (CoT) | 52.5 | 47.6 | 45.1 | 40.3 |
| MathVista (CoT) | 58.0 | 54.4 | 36.1 | 36.4 |
| ChartQA (CoT) | 81.8 | 38.6 | 67.1 | 72.0 |

*(Full benchmark details are available in the original model card.)*

## Acknowledgements

A huge thank you to the Mistral team for creating and releasing the original Pixtral model.

## Citation

If you use this model, please cite the original Mistral AI research.

## License

This model is distributed under the Apache 2.0 License.

## Original Model Card

For more comprehensive details, please refer to the [original Mistral model card](https://huggingface.co/mistralai/Pixtral-12B-2409).
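
## Multi-Image Inference (Sketch)

Since Pixtral accepts several images of varying resolutions in a single prompt, the snippet below sketches how a multi-image conversation can be assembled. It is a minimal sketch, not an official recipe: it reuses the `processor` and `model` objects created in the loading example above, and the image URLs are placeholders you should replace with your own.

```python
from PIL import Image
import requests

# Placeholder URLs -- substitute your own images. Pixtral handles
# variable image sizes natively, so no manual resizing is needed.
urls = [
    "https://example.com/image-1.jpg",
    "https://example.com/image-2.jpg",
]
images = [Image.open(requests.get(u, stream=True).raw) for u in urls]

# One user turn that interleaves two image placeholders with a text question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Compare these two images."},
        ],
    }
]

# `processor` and `model` come from the loading example above.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=images, text=prompt, return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=100)
response = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
print(response)
```

The list passed to `images` must contain one image per `{"type": "image"}` placeholder in the conversation, in the same order.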