	---
	license: apache-2.0
	base_model:
	- mistralai/Pixtral-12B-2409
	library_name: transformers
	tags:
	- text-generation-inference
	---
	# Pixtral-12B-2409: Hugging Face Transformers-Compatible Weights

	## Model Overview

	This repository contains Hugging Face Transformers-compatible weights for the Pixtral-12B-2409 multimodal model. The weights have been converted so that the model and its processor load directly through the Transformers library with `from_pretrained()`.

	## Model Details

	- Original Model: Pixtral-12B-2409 by Mistral AI
	- Model Type: Multimodal Language Model
	- Parameters: 12B parameters + 400M parameter vision encoder
	- Sequence Length: 128k tokens
	- License: Apache 2.0
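
The parameter counts above give a rough idea of how much memory the weights alone need. The sketch below is back-of-the-envelope arithmetic, assuming the nominal counts (12B + 400M) and the standard storage size of each dtype; the function name `weight_memory_gb` is hypothetical, and the estimate excludes activations, the KV cache, and framework overhead.

```python
# Nominal parameter counts taken from the Model Details list above.
LANGUAGE_MODEL_PARAMS = 12_000_000_000
VISION_ENCODER_PARAMS = 400_000_000

# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(dtype="float16"):
    """Approximate size of the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_params = LANGUAGE_MODEL_PARAMS + VISION_ENCODER_PARAMS
    return total_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{weight_memory_gb(dtype):.1f} GB")
```

Under these assumptions the weights take about 24.8 GB in half precision and about 49.6 GB in full precision, so actual memory use during generation will be higher than either figure.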

	## Key Features

	- Natively multimodal, trained with interleaved image and text data
	- Supports variable image sizes
	- Leading performance in its weight class on multimodal tasks, as reported by Mistral AI
	- Maintains state-of-the-art performance on text-only benchmarks, as reported by Mistral AI

	## Conversion Details

	The original Pixtral checkpoint has been converted to the format expected by the Hugging Face Transformers library. The conversion is intended to ensure:

	- Seamless loading using `from_pretrained()`
	- Full compatibility with HuggingFace Transformers pipeline
	- No modifications to the original model weights or architecture

	## Installation

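	The model itself needs no separate installation. With a Transformers release that provides `AutoModelForImageTextToText`, load the processor and the model as follows: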

	```python
	from transformers import AutoProcessor, AutoModelForImageTextToText
	import torch

	processor = AutoProcessor.from_pretrained("Prarabdha/pixtral-12b-240910-hf")
	model = AutoModelForImageTextToText.from_pretrained("Prarabdha/pixtral-12b-240910-hf", torch_dtype=torch.float16, device_map="auto")
	```

	## Example Usage

	```python
	import requests
	import torch
	from PIL import Image

	# Load an image (placeholder URL; replace with a real image)
	url = "https://example.com/sample-image.jpg"
	image = Image.open(requests.get(url, stream=True).raw)

	# Prepare the conversation: one image placeholder followed by the question
	conversation = [
	    {
	        "role": "user",
	        "content": [
	            {"type": "image"},
	            {"type": "text", "text": "What is shown in this image?"},
	        ],
	    }
	]

	# Build the prompt, then move the inputs to the model's device and dtype
	prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
	inputs = processor(images=[image], text=prompt, return_tensors="pt").to(model.device, torch.float16)

	# Generate and decode (the decoded text includes the prompt)
	generate_ids = model.generate(**inputs, max_new_tokens=30)
	response = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
	print(response)
	```
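
The conversation structure above can be generated for any number of images. The helper below is a minimal sketch, not part of the Transformers API: `build_conversation` is a hypothetical name, and it assumes the chat template expects one `{"type": "image"}` placeholder per image passed to the processor, followed by the text prompt.

```python
def build_conversation(question, num_images=1):
    """Return a single-turn user message with `num_images` image placeholders, then the text."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    # One placeholder per image (assumption), in the order the images are passed.
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


conversation = build_conversation("What differs between these images?", num_images=2)
print(conversation)
```

The result can be passed to `processor.apply_chat_template(...)` exactly as in the example above, with the matching list of images supplied to the processor in the same order.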

	## Performance Benchmarks

	### Multimodal Benchmarks

	| Benchmark | Pixtral 12B | Qwen2 7B VL | LLaVA-OV 7B | Phi-3 Vision |
	|-----------|-------------|-------------|-------------|--------------|
	| MMMU (CoT) | 52.5 | 47.6 | 45.1 | 40.3 |
	| Mathvista (CoT) | 58.0 | 54.4 | 36.1 | 36.4 |
	| ChartQA (CoT) | 81.8 | 38.6 | 67.1 | 72.0 |

	(Full benchmark details available in the original model card)

	## Acknowledgements

	A huge thank you to the Mistral team for creating and releasing the original Pixtral model.

	## Citation

	If you use this model, please cite the original Mistral AI research.

	## License

	This model is distributed under the Apache 2.0 License.

	## Original Model Card

	For more comprehensive details, please refer to the [original Mistral model card](https://huggingface.co/mistralai/Pixtral-12B-2409).