---
license: cc
tags:
- multimodal
- conversational
- GGUF
- Image-Text-to-Text
---

## Model Information

Omni-Vision is a compact multimodal model that processes both visual and text inputs. Built upon LLaVA's architecture principles, it introduces a novel token compression method that reduces the number of image tokens from 729 to 81, achieving best-in-class efficiency on edge devices while maintaining strong visual understanding capabilities.

**Model Architecture:** Omni-Vision's architecture consists of three key components:

- Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
- Vision Encoder: SigLIP-400M operates at 384 resolution with 14×14 patch size to generate image embeddings
- Projection Layer: a Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space

The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.

**Feedback:** Questions or comments about the model can be shared through the community channels listed at the end of this card.

## Intended Use Cases

1. Visual Question Answering (VQA) and Visual Reasoning: answer natural-language questions about an image and reason about its contents.
2. Image Captioning: extract key details from an image, interpret the scene, and generate a concise description of it.

## Benchmarks

| Benchmark        | Nexa AI Omni-Vision | nanoLLAVA | Qwen2-VL-2B |
|------------------|---------------------|-----------|-------------|
| MM-VET           | 27.5                | 23.9      | 49.5        |
| ChartQA (Test)   | 59.2                | NA        | 73.5        |
| MMMU (Test)      | 41.8                | 28.6      | 41.1        |
| MMMU (Eval)      | 39.9                | 30.4      | 41.1        |
| ScienceQA (Eval) | 62.2                | 59.0      | NA          |
| ScienceQA (Test) | 64.5                | 59.0      | NA          |
| POPE             | 89.4                | 84.1      | NA          |

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/6ztPlo5TgBAsFvZpGMy9H.png)

## How to use

**Test in HuggingFace Space**

**Run Locally**

Install Nexa-SDK, then run:

```bash
nexa run omnivision
```

## Training

We developed Omni-Vision through a three-stage training pipeline:

**Pretraining:** The initial stage establishes basic visual-linguistic alignment using image-caption pairs. Only the projection layer parameters are unfrozen during this stage, so the projector learns these fundamental relationships.

**Supervised Fine-tuning (SFT):** We enhance the model's contextual understanding using image-based question-answering datasets. This stage trains on structured chat histories that incorporate images, so the model generates more contextually appropriate responses.

**Direct Preference Optimization (DPO):** The final stage implements DPO by first generating responses to images with the base model. A teacher model then produces minimally edited corrections that maintain high semantic similarity with the original responses, focusing specifically on accuracy-critical elements. These original and corrected outputs form the chosen-rejected pairs (the teacher's correction is the chosen response; the base model's original output is the rejected one). The fine-tuning targets essential improvements to model outputs without altering the model's core response characteristics.
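The exact data format used in this DPO stage is not published here. Purely as an illustration, the sketch below shows how base-model outputs and teacher corrections could be assembled into the prompt/chosen/rejected records that common preference-optimization trainers (for example, TRL's `DPOTrainer`) consume. The two generation callables are hypothetical placeholders, not part of any released API.

```python
# Illustrative only: assembling DPO preference pairs from base-model outputs
# and minimally edited teacher corrections, as described in the Training section.
# `generate_base_answer` and `teacher_minimal_edit` are hypothetical placeholders
# standing in for the base Omni-Vision model and the teacher model.

def build_dpo_pairs(samples, generate_base_answer, teacher_minimal_edit):
    """samples: iterable of dicts with 'image' and 'question' fields."""
    pairs = []
    for sample in samples:
        prompt = f"<image>\n{sample['question']}"
        # The base model's answer becomes the "rejected" response.
        rejected = generate_base_answer(sample["image"], sample["question"])
        # The teacher's minimally edited correction becomes the "chosen" response;
        # it stays close to the original wording but fixes accuracy-critical details.
        chosen = teacher_minimal_edit(sample["image"], sample["question"], rejected)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

Because the teacher's edits are kept minimal, each pair differs mainly in the accuracy-critical details, which concentrates the preference signal on factual corrections rather than on style.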
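Returning to the token compression mentioned in the Model Architecture section: SigLIP-400M at 384 resolution produces a 27×27 grid of patch embeddings (729 tokens), and the card states these are reduced to 81 image tokens before reaching the language model. The mechanism is not documented here, so the PyTorch sketch below shows only one plausible way to wire such a 9× reduction: concatenate each 3×3 patch neighborhood and project it into the language model's hidden size with an MLP. The grouping scheme is an assumption, and the hidden sizes (1152 for SigLIP-400M, 896 for Qwen2.5-0.5B-Instruct) are simply the usual dimensions of those backbones; none of this is the released implementation.

```python
# Illustrative sketch only: one plausible way to turn a 27x27 grid of SigLIP
# patch embeddings (729 tokens) into 81 image tokens aligned with the language
# model's embedding space. The 3x3 grouping and MLP shape are assumptions,
# not the released Omni-Vision projector.
import torch
import torch.nn as nn


class CompressingProjector(nn.Module):
    def __init__(self, vision_dim=1152, llm_dim=896, group=3):
        super().__init__()
        self.group = group
        # Each output token is projected from a concatenated group x group patch block.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * group * group, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeds):                 # (B, 729, vision_dim)
        b, n, d = patch_embeds.shape
        side, g = int(n ** 0.5), self.group          # side = 27
        assert side * side == n and side % g == 0
        x = patch_embeds.view(b, side, side, d)
        # Fold each 3x3 spatial neighborhood into a single vector.
        x = x.view(b, side // g, g, side // g, g, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // g) ** 2, g * g * d)
        return self.mlp(x)                           # (B, 81, llm_dim)


patches = torch.randn(1, 729, 1152)                  # dummy SigLIP-400M features
print(CompressingProjector()(patches).shape)         # torch.Size([1, 81, 896])
```

Whatever the actual mechanism, the effect described in the card is the same: each image contributes 81 rather than 729 tokens to the language model's context, which is where the efficiency gain on edge devices comes from.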
### Learn more in our blogs

### Join Discord Community

### Website: nexa.ai