Prarabdha committed on
Commit c2164c1 · verified · 1 Parent(s): 2c4cffa

Update README.md

Files changed (1): README.md +97 -1
README.md CHANGED
@@ -5,4 +5,100 @@ base_model:
  library_name: transformers
  tags:
  - text-generation-inference
- ---
+ ---
+ # Pixtral-12B-2409 - Hugging Face Transformers Compatible Weights
+
+ ## Model Overview
+
+ This repository contains Hugging Face Transformers-compatible weights for the Pixtral-12B-2409 multimodal model. The original weights have been converted so the model loads directly with the Transformers library and can be used in your projects without additional conversion steps.
+
+ ## Model Details
+
+ - **Original Model**: Pixtral-12B-2409 by Mistral AI
+ - **Model Type**: Multimodal Language Model
+ - **Parameters**: 12B parameters + 400M-parameter vision encoder
+ - **Sequence Length**: 128k tokens
+ - **License**: Apache 2.0
+
+ ## Key Features
+
+ - Natively multimodal, trained with interleaved image and text data
+ - Supports variable image sizes
+ - Leading performance in its weight class on multimodal tasks
+ - Maintains state-of-the-art performance on text-only benchmarks
+
+ ## Conversion Details
+
+ This repository provides the original Pixtral model weights converted to be fully compatible with the Hugging Face Transformers library. The conversion process ensures the following (a quick sanity check is sketched after this list):
+
+ - Seamless loading with `from_pretrained()`
+ - Full compatibility with the Hugging Face Transformers pipeline
+ - No modifications to the original model weights or architecture
+
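+ As an optional check, the sketch below (the repo id is a placeholder) confirms that the converted checkpoint resolves to the expected Llava-style wrapper around a Mistral text backbone and a Pixtral vision encoder, without downloading the full weights:
+
+ ```python
+ from transformers import AutoConfig
+
+ # Placeholder repo id - replace with the actual repository name
+ config = AutoConfig.from_pretrained("your-username/pixtral-12b-2409")
+
+ # Expected model types for a Pixtral conversion (assumption, not from the original card)
+ print(config.model_type)                # e.g. "llava"
+ print(config.text_config.model_type)    # e.g. "mistral"
+ print(config.vision_config.model_type)  # e.g. "pixtral"
+ ```
+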
+ ## Installation
+
+ You can load the model with the Transformers library (Pixtral support requires a recent Transformers release):
+
+ ```python
+ from transformers import AutoProcessor, LlavaForConditionalGeneration
+ import torch
+
+ # "your-username/pixtral-12b-2409" is a placeholder - replace it with this repo's id
+ model = LlavaForConditionalGeneration.from_pretrained(
+     "your-username/pixtral-12b-2409", torch_dtype=torch.float16, device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained("your-username/pixtral-12b-2409")
+ ```
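+
+ If GPU memory is limited, one option (a minimal sketch, not part of the original card; it assumes the `bitsandbytes` package and a CUDA GPU are available) is to load the weights in 4-bit precision:
+
+ ```python
+ from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration
+ import torch
+
+ # 4-bit quantized load to reduce memory usage (placeholder repo id)
+ bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
+ model = LlavaForConditionalGeneration.from_pretrained(
+     "your-username/pixtral-12b-2409", quantization_config=bnb_config, device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained("your-username/pixtral-12b-2409")
+ ```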
+
+ ## Example Usage
+
+ ```python
+ from PIL import Image
+ import requests
+
+ # Load an image (placeholder URL - substitute your own image)
+ url = "https://example.com/sample-image.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+
+ # Prepare a chat-style conversation with one image and one text prompt
+ conversation = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image"},
+             {"type": "text", "text": "What is shown in this image?"},
+         ],
+     }
+ ]
+
+ # Build the prompt, process the inputs, and generate a response
+ prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
+ inputs = processor(images=[image], text=prompt, return_tensors="pt").to(model.device)
+ generate_ids = model.generate(**inputs, max_new_tokens=30)
+ response = processor.batch_decode(generate_ids, skip_special_tokens=True)
+ print(response[0])
+ ```
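+
+ Note that decoding the full `generate_ids` returns the prompt text together with the completion. If you only want the newly generated text, a common pattern (not part of the original card) is to slice off the prompt tokens first:
+
+ ```python
+ # Keep only the tokens generated after the prompt, then decode
+ new_tokens = generate_ids[:, inputs["input_ids"].shape[1]:]
+ answer = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
+ print(answer)
+ ```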
+
+ ## Performance Benchmarks
+
+ ### Multimodal Benchmarks
+
+ | Benchmark | Pixtral 12B | Qwen2 7B VL | LLaVA-OV 7B | Phi-3 Vision |
+ |-----------|-------------|-------------|-------------|--------------|
+ | MMMU (CoT) | 52.5 | 47.6 | 45.1 | 40.3 |
+ | MathVista (CoT) | 58.0 | 54.4 | 36.1 | 36.4 |
+ | ChartQA (CoT) | 81.8 | 38.6 | 67.1 | 72.0 |
+
+ *(Full benchmark details are available in the original model card.)*
+
+ ## Acknowledgements
+
+ A huge thank you to the Mistral AI team for creating and releasing the original Pixtral model.
+
+ ## Citation
+
+ If you use this model, please cite the original Mistral AI research.
+
+ ## License
+
+ This model is distributed under the Apache 2.0 License.
+
+ ## Original Model Card
+
+ For more comprehensive details, please refer to the [original Mistral model card](https://huggingface.co/mistralai/Pixtral-12B-2409).