---
license: apache-2.0
base_model:
- mistralai/Pixtral-12B-2409
library_name: transformers
tags:
- text-generation-inference
---
# Pixtral-12B-2409 - HuggingFace Transformers Compatible Weights
## Model Overview
This repository contains HuggingFace Transformers compatible weights for the Pixtral-12B-2409 multimodal model. The weights have been converted so that they load directly with the HuggingFace Transformers library, without extra conversion steps in your project.
## Model Details
- **Original Model**: Pixtral-12B-2409 by Mistral AI
- **Model Type**: Multimodal Language Model
- **Parameters**: 12B parameters + 400M parameter vision encoder
- **Sequence Length**: 128k tokens
- **License**: Apache 2.0
## Key Features
- Natively multimodal, trained with interleaved image and text data
- Supports variable image sizes
- Leading performance in its weight class on multimodal tasks
- Maintains state-of-the-art performance on text-only benchmarks
## Conversion Details
This repository provides the original Pixtral model weights converted to be fully compatible with the HuggingFace Transformers library. The conversion process ensures:
- Seamless loading using `from_pretrained()`
- Full compatibility with HuggingFace Transformers pipeline
- No modifications to the original model weights or architecture
## Installation
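Install the Transformers library (for example, `pip install transformers accelerate`; `accelerate` is needed for `device_map="auto"`), then load the processor and model with `from_pretrained()`: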
```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
processor = AutoProcessor.from_pretrained("Prarabdha/pixtral-12b-240910-hf")
model = AutoModelForImageTextToText.from_pretrained("Prarabdha/pixtral-12b-240910-hf", torch_dtype=torch.float16, device_map="auto")
```
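As a rough check on hardware requirements, weight memory is approximately the parameter count multiplied by the bytes per parameter. The sketch below uses the nominal figures from this card (12B parameters plus a 400M parameter vision encoder) and 2 bytes per parameter for `torch.float16`. `estimate_weight_gb` is a hypothetical helper written for illustration, not part of Transformers, and the estimate covers weights only: activations, the KV cache, and framework overhead come on top.

```python
def estimate_weight_gb(num_params, bytes_per_param):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

# Nominal counts from the model card: 12B language model + 400M vision encoder
total_params = 12e9 + 400e6

# float16 stores each parameter in 2 bytes
fp16_gb = estimate_weight_gb(total_params, 2)
print(f"Approximate float16 weight memory: {fp16_gb:.1f} GB")
```

If the estimate exceeds the memory of a single GPU, `device_map="auto"` can spread the weights across the available devices.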
## Example Usage
```python
from PIL import Image
import requests

# Load an image (placeholder URL; replace it with a real image URL)
url = "https://example.com/sample-image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare a single-turn conversation with one image and one question
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

# Build the prompt, preprocess, and move the tensors to the model's device and dtype
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=[image], text=prompt, return_tensors="pt").to(model.device, model.dtype)

# Generate and decode; batch_decode returns one string per generated sequence
generate_ids = model.generate(**inputs, max_new_tokens=30)
response = processor.batch_decode(generate_ids, skip_special_tokens=True)
print(response[0])
```
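The conversation structure above can be built programmatically when the number of images varies. The following is a minimal sketch: `build_conversation` is a hypothetical helper, not part of Transformers, and it assumes that each image passed to the processor corresponds to one `{"type": "image"}` placeholder, in the same order as the `images=[...]` list.

```python
def build_conversation(question, num_images=1):
    """Build a single-turn user message with `num_images` image placeholders."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    # One placeholder per image, followed by the text question
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

# Reproduces the single-image conversation from the example above
conversation = build_conversation("What is shown in this image?", num_images=1)
```

The returned list can be passed to `processor.apply_chat_template(...)` in the same way as the hand-written conversation.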
## Performance Benchmarks
### Multimodal Benchmarks
| Benchmark | Pixtral 12B | Qwen2 7B VL | LLaVA-OV 7B | Phi-3 Vision |
|-----------|-------------|-------------|-------------|--------------|
| MMMU (CoT) | 52.5 | 47.6 | 45.1 | 40.3 |
| Mathvista (CoT) | 58.0 | 54.4 | 36.1 | 36.4 |
| ChartQA (CoT) | 81.8 | 38.6 | 67.1 | 72.0 |
*(Full benchmark details available in the original model card)*
## Acknowledgements
A huge thank you to the Mistral team for creating and releasing the original Pixtral model.
## Citation
If you use this model, please cite the original Mistral AI research.
## License
This model is distributed under the Apache 2.0 License.
## Original Model Card
For more comprehensive details, please refer to the [original Mistral model card](https://huggingface.co/mistralai/Pixtral-12B-2409).