---
license: cc
tags:
- multimodal
- conversational
- GGUF
- Image-Text-to-Text
---

# Omnivision

## Introduction

Omnivision is a compact, sub-billion-parameter (968M) multimodal model that processes both visual and text inputs and is optimized for edge devices. Building on LLaVA's architecture, it features:

- **9x Token Reduction**: Reduces image tokens from 729 to 81, cutting latency and computational cost.
- **Trustworthy Results**: Reduces hallucinations using **DPO** training on trustworthy data.

**Quick Links:**

1. Interactive demo in our [Hugging Face Space](https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo).
2. [Quickstart for local setup](#how-to-use-on-device)
3. Learn more in our [Blogs](https://nexa.ai)

**Feedback:** Send questions or comments about the model in our [Discord](https://discord.gg/nexa-ai)

## Intended Use Cases

Omnivision is intended for **Visual Question Answering** (answering questions about images) and **Image Captioning** (describing scenes in photos), making it well suited to on-device applications.

**Example Demo:** Omnivision generated captions for a 1046×1568 pixel poster | **Processing time: <2s** | Device: MacBook M4 Pro | FP16 requires 988 MB RAM and 948 MB storage space.

*Figure: example captioning output.*

## Benchmarks

The figure below compares Omnivision with nanoLLAVA, previously the world's smallest vision-language model. Omnivision outperforms it on every task.

*Figure: benchmark radar chart.*

We evaluated Omnivision on a series of benchmark datasets: MM-VET, ChartQA, MMMU, ScienceQA, and POPE.


| Benchmark         | Nexa AI Omnivision   | nanoLLAVA | Qwen2-VL-2B |
|-------------------|----------------------|-----------|-------------|
| MM-VET            | 27.5                 | 23.9      | 49.5        |
| ChartQA (Test)    | 59.2                 | NA        | 73.5        |
| MMMU (Test)       | 41.8                 | 28.6      | 41.1        |
| MMMU (Eval)       | 39.9                 | 30.4      | 41.1        |
| ScienceQA (Eval)  | 62.2                 | 59.0      | NA          |
| ScienceQA (Test)  | 64.5                 | 59.0      | NA          |
| POPE              | 89.4                 | 84.1      | NA          |

## How to Use On Device

The following steps show how to run Omnivision locally on your device.

**Step 1: Install Nexa-SDK (local on-device inference framework)**

[Install Nexa-SDK](https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#install-option-1-executable-installer)

> Nexa-SDK is an open-source, local on-device inference framework that supports text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It can be installed via the Python package or the executable installer.

**Step 2: Run the following command in your terminal**

```bash
nexa run omnivision
```

## Model Architecture

Omnivision's architecture consists of three key components:

- Base Language Model: Qwen2.5-0.5B-Instruct serves as the base model and processes text inputs.
- Vision Encoder: SigLIP-400M operates at 384 resolution with a 14×14 patch size to generate image embeddings.
- Projection Layer: A Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space. Compared with the vanilla LLaVA architecture, we designed a projector that reduces the number of image tokens by 9x.

The vision encoder first transforms input images into embeddings, which the projection layer then maps into the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.

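
The token counts quoted above can be checked with a little arithmetic: a 384-pixel input split into 14-pixel patches gives 27 patches per side, so 729 tokens, and a 9x reduction leaves 81. The sketch below reproduces those numbers. The text does not say how the projector achieves the reduction, so `merge_tokens` (concatenating every 9 consecutive token vectors into one wider vector) is only an assumed, illustrative scheme, and `HIDDEN` is a hypothetical toy embedding size.

```python
# Token-count arithmetic for the figures quoted in the text.
IMAGE_SIZE = 384   # SigLIP input resolution (from the text)
PATCH_SIZE = 14    # patch edge length in pixels (from the text)
REDUCTION = 9      # token reduction factor (from the text)

patches_per_side = IMAGE_SIZE // PATCH_SIZE     # 27 full patches per side
vision_tokens = patches_per_side ** 2           # 27 * 27 = 729
projected_tokens = vision_tokens // REDUCTION   # 729 / 9 = 81


def merge_tokens(tokens, factor):
    """Concatenate every `factor` consecutive token vectors into one wider vector.

    ASSUMPTION: an illustrative reduction scheme only; it is not confirmed
    to be how Omnivision's projector works.
    """
    if len(tokens) % factor != 0:
        raise ValueError("token count must be divisible by factor")
    merged = []
    for start in range(0, len(tokens), factor):
        group = tokens[start:start + factor]
        # Flatten the group of vectors into a single vector.
        merged.append([value for token in group for value in token])
    return merged


# Toy embeddings: 729 tokens with a hypothetical hidden size of 4.
HIDDEN = 4
tokens = [[float(i)] * HIDDEN for i in range(vision_tokens)]
merged = merge_tokens(tokens, REDUCTION)

print(patches_per_side, vision_tokens, projected_tokens)  # 27 729 81
print(len(merged), len(merged[0]))                        # 81 36
```

Under this assumed scheme the sequence gets 9x shorter while each token gets 9x wider, so an MLP applied afterwards would see all of the original values while the language model attends over far fewer positions.
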
## Training

We developed Omnivision through a three-stage training pipeline:

**Pretraining:** The initial stage establishes basic visual-linguistic alignment using image-caption pairs. Only the projection layer parameters are unfrozen during this stage, so the projector alone learns these fundamental relationships.

**Supervised Fine-tuning (SFT):** We enhance the model's contextual understanding using image-based question-answering datasets. This stage trains on structured chat histories that incorporate images, so that the model generates more contextually appropriate responses.

**Direct Preference Optimization (DPO):** The final stage implements DPO by first generating responses to images using the base model. A teacher model then produces minimally edited corrections that maintain high semantic similarity with the original responses, focusing specifically on accuracy-critical elements. The corrected and original outputs form the chosen-rejected pairs. This fine-tuning targets essential improvements to model output without altering the model's core response characteristics.

## What's next for Omnivision?

Omnivision is in early development, and we are working to address its current limitations:

- Expand DPO training: Increase the scope of DPO (Direct Preference Optimization) training in an iterative process to continually improve model performance and response quality.
- Improve document and text understanding.

In the long term, we aim to develop Omnivision into a fully optimized, production-ready solution for edge AI multimodal applications.

### Follow us

[Blogs](https://nexa.ai) | [Discord](https://discord.gg/nexa-ai) | [X (Twitter)](https://x.com/alanzhuly)