---
license: cc
tags:
- multimodal
- conversational
- GGUF
- Image-Text-to-Text
---

# OmniVision

## Introduction

OmniVision is a sub-billion-parameter (968M) multimodal model that processes both visual and text inputs. Built on LLaVA's architecture, it introduces a novel token compression technique that reduces the number of image tokens from 729 to 81, improving efficiency without compromising visual understanding on edge devices.

It has two key enhancements:

- **9x Token Reduction through Token Compression**: A significant decrease in image token count reduces latency and computational cost, making the model well suited for on-device applications.
- **Minimal-Edit DPO for Enhanced Response Quality**: Improves model responses through targeted edits, preserving core capabilities without significant behavior shifts.

**Quick Links:**

1. Interact with OmniVision in our HuggingFace Space.
2. [Quickstart to run locally](#how-to-use---quickstart)
3. Learn more in our [blogs](https://nexa.ai)

**Feedback:** Send questions or comments about the model in our [Discord](https://discord.gg/nexa-ai)

## Intended Use Cases

OmniVision is intended for Visual Question Answering (answering questions about images) and Image Captioning (describing scenes in photos), optimized for edge devices.

**Example Demo:** OmniVision-generated caption for a 1046×1568 pixel poster | **Processing time: <2s** | Device: MacBook M4 Pro

## Benchmarks

The figure below compares OmniVision against nanoLLAVA, previously the world's smallest vision-language model. OmniVision outperforms it on every task.

*(Benchmark radar chart)*

We evaluated OmniVision on a series of benchmark datasets, including MM-VET, ChartQA, MMMU, ScienceQA, and POPE.

| Benchmark        | Nexa AI OmniVision | nanoLLAVA | Qwen2-VL-2B |
|------------------|--------------------|-----------|-------------|
| MM-VET           | 27.5               | 23.9      | 49.5        |
| ChartQA (Test)   | 59.2               | NA        | 73.5        |
| MMMU (Test)      | 41.8               | 28.6      | 41.1        |
| MMMU (Eval)      | 39.9               | 30.4      | 41.1        |
| ScienceQA (Eval) | 62.2               | 59.0      | NA          |
| ScienceQA (Test) | 64.5               | 59.0      | NA          |
| POPE             | 89.4               | 84.1      | NA          |

## How to Use - Quickstart

The following steps show how to run OmniVision locally on your device.

**Step 1: Install Nexa-SDK (local on-device inference framework)**

[Install Nexa-SDK](https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#install-option-1-executable-installer)

> Nexa-SDK is an open-source, local on-device inference framework supporting text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It can be installed via a Python package or an executable installer.

**Step 2: Run the following command in your terminal**

```bash
nexa run omnivision
```

## Model Architecture

OmniVision's architecture consists of three key components:

- **Base Language Model**: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
- **Vision Encoder**: SigLIP-400M operates at 384 resolution with a 14×14 patch size to generate image embeddings
- **Projection Layer**: A Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space

The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
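To make the data flow concrete, below is a minimal PyTorch sketch of the pipeline described above: vision embeddings, 9x token compression (729 → 81), then an MLP projection into the language model's token space. The module names, the reshape-based compressor, and the hidden dimensions (1152 for SigLIP-400M, 896 for Qwen2.5-0.5B) are illustrative assumptions and do not reflect the actual OmniVision implementation.

```python
# Conceptual sketch of the OmniVision forward path; names and dims are assumptions.
import torch
import torch.nn as nn


class ProjectionMLP(nn.Module):
    """Maps compressed vision embeddings into the language model's token space."""

    def __init__(self, vision_dim: int, hidden_dim: int, lm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, lm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def compress_tokens(vision_tokens: torch.Tensor, group: int = 9) -> torch.Tensor:
    """Illustrative 9x compression: concatenate each run of 9 consecutive patch
    embeddings into one token (729 -> 81 tokens, dim -> 9 * dim)."""
    b, n, d = vision_tokens.shape
    assert n % group == 0
    return vision_tokens.reshape(b, n // group, group * d)


# SigLIP-400M at 384x384 yields 729 patch tokens; 1152 is its assumed embedding dim.
vision_tokens = torch.randn(1, 729, 1152)           # output of the vision encoder
compressed = compress_tokens(vision_tokens)         # (1, 81, 10368)
# 896 is the assumed hidden size of Qwen2.5-0.5B-Instruct.
projector = ProjectionMLP(vision_dim=10368, hidden_dim=2048, lm_dim=896)
image_embeds = projector(compressed)                # (1, 81, 896)
# image_embeds can now be interleaved with the text token embeddings of the LLM.
```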
## Training

We developed OmniVision through a three-stage training pipeline:

**Pretraining:** The initial stage establishes basic visual-linguistic alignment using image-caption pairs; only the projection layer parameters are unfrozen to learn these fundamental relationships.

**Supervised Fine-tuning (SFT):** We enhance the model's contextual understanding using image-based question-answering datasets. This stage trains on structured chat histories that incorporate images, so the model generates more contextually appropriate responses.

**Direct Preference Optimization (DPO):** The final stage first generates responses to images with the base model. A teacher model then produces minimally edited corrections that keep high semantic similarity with the original responses, focusing on accuracy-critical elements. The original and corrected outputs form chosen-rejected pairs (see the illustrative sketch at the end of this card). The fine-tuning targets essential output improvements without altering the model's core response characteristics.

### Learn more in our blogs

[Blogs](https://nexa.ai)

### Join Discord Community

[Discord](https://discord.gg/nexa-ai)
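### Illustrative sketch: minimal-edit DPO pairs

For readers who want a concrete picture of the minimal-edit DPO stage described in the Training section, the small Python sketch below shows how chosen-rejected pairs might be assembled and scored with the standard DPO objective. The record layout, helper names, and example values are assumptions made for illustration; they are not taken from OmniVision's actual training code.

```python
# Illustrative minimal-edit DPO pair construction; layout and values are assumptions.
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    image_path: str
    prompt: str
    chosen: str    # teacher's minimally edited correction
    rejected: str  # base model's original response


def build_pairs(samples):
    """Turn (image, prompt, base_response, teacher_correction) tuples into
    chosen/rejected pairs for DPO training."""
    return [
        PreferencePair(image_path=img, prompt=p, chosen=corrected, rejected=original)
        for img, p, original, corrected in samples
    ]


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective for one pair, given summed token log-probs under the
    policy and the frozen reference model:
        -log(sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))))"""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# Example: the teacher fixes only an accuracy-critical detail (the banner color).
pairs = build_pairs([
    ("poster.jpg",
     "Describe the poster.",
     "A red banner announcing a summer concert.",    # base model output (rejected)
     "A blue banner announcing a summer concert."),  # minimal teacher edit (chosen)
])
print(pairs[0].chosen)
print(round(dpo_loss(-12.0, -15.0, -12.5, -14.0, beta=0.1), 4))
```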