---
license: apache-2.0
base_model:
- mistralai/Pixtral-12B-2409
library_name: transformers
tags:
- text-generation-inference
---

# Pixtral-12B-2409 - Hugging Face Transformers Compatible Weights

## Model Overview

This repository contains Hugging Face Transformers-compatible weights for the Pixtral-12B-2409 multimodal model. The weights have been converted so that the model can be loaded and used directly through the Transformers library.

## Model Details

- **Original Model**: Pixtral-12B-2409 by Mistral AI
- **Model Type**: Multimodal Language Model
- **Parameters**: 12B-parameter multimodal decoder plus a 400M-parameter vision encoder
- **Sequence Length**: 128k tokens
- **License**: Apache 2.0

## Key Features

- Natively multimodal, trained with interleaved image and text data
- Supports variable image sizes
- Leading performance in its weight class on multimodal tasks
- Maintains state-of-the-art performance on text-only benchmarks
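
To give a feel for what variable image sizes mean for the 128k-token context, the sketch below estimates how many tokens a single image might occupy. Every number in it is an assumption rather than something stated in this card: a 16x16-pixel patch size, a 1024-pixel cap on the longer side, and one extra separator token per patch row. `estimate_image_tokens` is a hypothetical helper, not part of Transformers; check the processor configuration for the real values.

```python
import math

# Assumed values (not stated in this model card); verify against the processor config.
PATCH_SIZE = 16        # assumed side length of a square patch, in pixels
MAX_IMAGE_SIDE = 1024  # assumed cap on the longer image side, in pixels


def estimate_image_tokens(width: int, height: int) -> int:
    """Roughly estimate the number of tokens one image contributes to the sequence."""
    # Scale the image down (keeping aspect ratio) if its longer side exceeds the cap.
    ratio = max(width / MAX_IMAGE_SIDE, height / MAX_IMAGE_SIDE)
    if ratio > 1:
        width = math.ceil(width / ratio)
        height = math.ceil(height / ratio)
    # Count patches per row and per column, rounding partial patches up.
    cols = math.ceil(width / PATCH_SIZE)
    rows = math.ceil(height / PATCH_SIZE)
    # One token per patch, plus one assumed separator token at the end of each row.
    return rows * (cols + 1)


if __name__ == "__main__":
    for size in [(512, 512), (1024, 768), (2048, 1024)]:
        print(size, estimate_image_tokens(*size))
```

Under these assumptions, larger images cost more tokens up to the size cap, so the number of images that fit in one prompt depends on their resolution.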

## Conversion Details

This repository provides the original Pixtral model weights converted for the Hugging Face Transformers library. The conversion process ensures:

- Direct loading with `from_pretrained()`
- Full compatibility with Hugging Face Transformers pipelines
- No modifications to the original model weights or architecture

## Installation

No separate installation step is needed beyond the Transformers library and PyTorch. Load the processor and model with `from_pretrained()`:

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "Prarabdha/pixtral-12b-240910-hf"
processor = AutoProcessor.from_pretrained(model_id)
# Load in half precision; device_map="auto" places the weights on the available devices
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
```
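
Before loading, it can help to estimate how much memory the weights alone will need. The sketch below multiplies the parameter counts listed under Model Details by the size of the chosen dtype. `estimate_weight_memory_gb` is a hypothetical helper, the counts are the rounded figures from this card, and activations and the KV cache are not included.

```python
# Bytes per parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

# Rounded parameter counts taken from the Model Details list (approximate).
DECODER_PARAMS = 12e9
VISION_ENCODER_PARAMS = 400e6


def estimate_weight_memory_gb(dtype: str = "float16") -> float:
    """Return the approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    total_params = DECODER_PARAMS + VISION_ENCODER_PARAMS
    return total_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in ("float16", "float32"):
        print(f"{dtype}: ~{estimate_weight_memory_gb(dtype):.1f} GB for weights alone")
```

With `float16` this comes to roughly 24.8 GB for the weights alone, which is one reason `device_map="auto"` is convenient when a single GPU may be too small.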

## Example Usage

The example below reuses the `processor` and `model` objects loaded in the Installation section.

```python
from PIL import Image
import requests

# Load an image (placeholder URL; replace it with a real image)
url = "https://example.com/sample-image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare the conversation: one user turn with an image and a question
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

# Build the prompt, preprocess, and move the inputs to the model's device and dtype
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=[image], text=prompt, return_tensors="pt").to(model.device, model.dtype)

# Generate and decode; batch_decode returns one string per sequence
generate_ids = model.generate(**inputs, max_new_tokens=30)
response = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
print(response)
```
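
`generate()` returns the prompt tokens followed by the newly generated ones, so decoding the full output repeats the question. The sketch below shows one way to keep only the new tokens before decoding. `strip_prompt_tokens` is a hypothetical helper, demonstrated on plain lists of token IDs so that it runs without the model.

```python
from typing import List, Sequence


def strip_prompt_tokens(generated: Sequence[Sequence[int]], prompt_length: int) -> List[Sequence[int]]:
    """Drop the first `prompt_length` tokens from every generated sequence."""
    return [sequence[prompt_length:] for sequence in generated]


if __name__ == "__main__":
    # Toy token IDs standing in for model output: 3 prompt tokens, then 2 new ones.
    generated = [[101, 7, 8, 42, 43]]
    print(strip_prompt_tokens(generated, prompt_length=3))
```

With the objects from the example above, the prompt length would be `inputs["input_ids"].shape[1]`. For tensors, the slice `generate_ids[:, inputs["input_ids"].shape[1]:]` should have the same effect.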

## Performance Benchmarks

### Multimodal Benchmarks

| Benchmark | Pixtral 12B | Qwen2 7B VL | LLaVA-OV 7B | Phi-3 Vision |
|-----------|-------------|-------------|-------------|--------------|
| MMMU (CoT) | 52.5 | 47.6 | 45.1 | 40.3 |
| MathVista (CoT) | 58.0 | 54.4 | 36.1 | 36.4 |
| ChartQA (CoT) | 81.8 | 38.6 | 67.1 | 72.0 |

*(Full benchmark details available in the original model card)*
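
To make the gaps in the table explicit, the sketch below computes Pixtral 12B's margin over the best other model on each benchmark. The scores are copied from the table above; the dictionary layout and the `margin_over_best` name are illustrative choices.

```python
# Scores copied from the table above: benchmark -> {model: score}.
SCORES = {
    "MMMU (CoT)": {"Pixtral 12B": 52.5, "Qwen2 7B VL": 47.6, "LLaVA-OV 7B": 45.1, "Phi-3 Vision": 40.3},
    "MathVista (CoT)": {"Pixtral 12B": 58.0, "Qwen2 7B VL": 54.4, "LLaVA-OV 7B": 36.1, "Phi-3 Vision": 36.4},
    "ChartQA (CoT)": {"Pixtral 12B": 81.8, "Qwen2 7B VL": 38.6, "LLaVA-OV 7B": 67.1, "Phi-3 Vision": 72.0},
}


def margin_over_best(scores: dict, model: str = "Pixtral 12B") -> dict:
    """Return, per benchmark, the best other model and `model`'s margin over it."""
    margins = {}
    for benchmark, row in scores.items():
        others = {name: value for name, value in row.items() if name != model}
        best_other = max(others, key=others.get)
        # Round to one decimal to match the precision of the table.
        margins[benchmark] = (best_other, round(row[model] - others[best_other], 1))
    return margins


if __name__ == "__main__":
    for benchmark, (best_other, margin) in margin_over_best(SCORES).items():
        print(f"{benchmark}: +{margin} points over {best_other}")
```

On these three benchmarks the margins are 4.9, 3.6, and 9.8 points respectively.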

## Acknowledgements

A huge thank you to the Mistral team for creating and releasing the original Pixtral model.

## Citation

If you use this model, please cite the original Mistral AI research.

## License

This model is distributed under the Apache 2.0 License.

## Original Model Card

For more comprehensive details, please refer to the [original Mistral model card](https://huggingface.co/mistralai/Pixtral-12B-2409).