---
license: apache-2.0
base_model:
- mistralai/Pixtral-12B-2409
library_name: transformers
tags:
- text-generation-inference
---
# Pixtral-12B-2409 - HuggingFace Transformers Compatible Weights

## Model Overview

This repository contains Hugging Face Transformers-compatible weights for the Pixtral-12B-2409 multimodal model. The weights have been converted for seamless integration with the Transformers library, so they can be loaded and used directly in your projects.

## Model Details

- **Original Model**: Pixtral-12B-2409 by Mistral AI
- **Model Type**: Multimodal Language Model
- **Parameters**: 12B parameters + 400M parameter vision encoder
- **Sequence Length**: 128k tokens
- **License**: Apache 2.0

## Key Features

- Natively multimodal, trained with interleaved image and text data
- Supports variable image sizes
- Leading performance in its weight class on multimodal tasks
- Maintains state-of-the-art performance on text-only benchmarks

## Conversion Details

This repository provides the original Pixtral model weights converted to be fully compatible with the HuggingFace Transformers library. The conversion process ensures:

- Seamless loading using `from_pretrained()`
- Full compatibility with the Transformers `pipeline` API
- No modifications to the original model weights or architecture

## Installation
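
Pixtral support landed in recent Transformers releases, so first make sure your environment is up to date (the package list below is a typical setup, not an exact requirement):

```shell
pip install --upgrade transformers torch pillow requests
```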

Load the model and processor with the Transformers library:

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

processor = AutoProcessor.from_pretrained("Prarabdha/pixtral-12b-240910-hf")
model = AutoModelForImageTextToText.from_pretrained(
    "Prarabdha/pixtral-12b-240910-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
```

## Example Usage

```python
from PIL import Image
import requests

# Load an image
url = "https://example.com/sample-image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare conversation
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

# Process and generate
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=[image], text=prompt, return_tensors="pt")
generate_ids = model.generate(**inputs, max_new_tokens=30)
response = processor.batch_decode(generate_ids, skip_special_tokens=True)
```
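
The conversation format generalizes to multiple images: the processor pairs each `{"type": "image"}` placeholder positionally with an entry in the `images` list, so the counts must match. A small helper (illustrative only, not part of the Transformers API) sketches this:

```python
def build_conversation(question: str, num_images: int = 1) -> list:
    """Build a single-turn conversation with one placeholder per image.

    Placeholders are matched positionally to the `images` list passed to the
    processor, so len(images) must equal num_images.
    """
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

# Two images, one question -- pass images=[img_a, img_b] to the processor
conversation = build_conversation("What differs between these two charts?", num_images=2)
```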

## Performance Benchmarks

### Multimodal Benchmarks

| Benchmark | Pixtral 12B | Qwen2 7B VL | LLaVA-OV 7B | Phi-3 Vision |
|-----------|-------------|-------------|-------------|--------------|
| MMMU (CoT) | 52.5 | 47.6 | 45.1 | 40.3 |
| Mathvista (CoT) | 58.0 | 54.4 | 36.1 | 36.4 |
| ChartQA (CoT) | 81.8 | 38.6 | 67.1 | 72.0 |

*(Full benchmark details available in the original model card)*

## Acknowledgements

A huge thank you to the Mistral team for creating and releasing the original Pixtral model.

## Citation

If you use this model, please cite the original Mistral AI research.

## License

This model is distributed under the Apache 2.0 License.

## Original Model Card

For more comprehensive details, please refer to the [original Mistral model card](https://huggingface.co/mistralai/Pixtral-12B-2409).