---
license: cc
tags:
- multimodal
- conversational
- GGUF
- Image-Text-to-Text
---
# Omnivision

## Introduction

Omnivision is a compact, sub-billion-parameter (968M) multimodal model that processes both visual and text inputs and is optimized for edge devices. Building on LLaVA's architecture, it features:

- **9x Token Reduction**: Reduces image tokens from 729 to 81, cutting latency and computational cost.
- **Trustworthy Results**: Reduces hallucinations through **DPO** training on trustworthy data.
  
**Quick Links:**
1. Interactive Demo in our [Hugging Face Space](https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo).
2. [Quickstart for local setup](#how-to-use-on-device)
3. Learn more in our [Blogs](https://nexa.ai)

**Feedback:** Send questions or comments about the model in our [Discord](https://discord.gg/nexa-ai)

## Intended Use Cases
Omnivision is intended for **Visual Question Answering** (answering questions about images) and **Image Captioning** (describing scenes in photos). Its compact size makes it well suited to on-device applications.

**Example Demo:**
Omnivision generated captions for a 1046×1568 pixel poster | **Processing time: <2s** | Device: MacBook M4 Pro | FP16 requires 988 MB RAM and 948 MB storage space.

<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/PTG3_n_p7_atBHCwRLOEE.png" alt="Example" style="width:700px;"/>


## Benchmarks

The radar chart below compares Omnivision with nanoLLAVA, previously the world's smallest vision-language model. Omnivision outperforms it on every task.

<img src="https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/KsN-gTFM5MfJA5E3aDRJI.png" alt="Benchmark Radar Chart" style="width:500px;"/>

We evaluated Omnivision on a series of benchmark datasets: MM-VET, ChartQA, MMMU, ScienceQA, and POPE. The results are summarized below.

| Benchmark         | Nexa AI Omnivision | nanoLLAVA | Qwen2-VL-2B |
|-------------------|----------------------|-----------|-------------|
| MM-VET            | 27.5                | 23.9      | 49.5        |
| ChartQA (Test)    | 59.2                | NA        | 73.5        |
| MMMU (Test)       | 41.8                | 28.6      | 41.1        |
| MMMU (Eval)       | 39.9                | 30.4      | 41.1        |
| ScienceQA (Eval)  | 62.2                | 59.0      | NA          |
| ScienceQA (Test)  | 64.5                | 59.0      | NA          |
| POPE              | 89.4                | 84.1      | NA          |


## How to Use On Device
In the following, we demonstrate how to run Omnivision locally on your device.

**Step 1: Install Nexa-SDK (local on-device inference framework)**

[Install Nexa-SDK](https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#install-option-1-executable-installer)

> Nexa-SDK is an open-source, local on-device inference framework that supports text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It can be installed via the Python package or the executable installer.

**Step 2: Run the following command in your terminal**

```bash
nexa run omnivision 
```

## Model Architecture
Omnivision's architecture consists of three key components:

- Base Language Model: Qwen2.5-0.5B-Instruct serves as the base model and processes text inputs.
- Vision Encoder: SigLIP-400M operates at 384×384 resolution with a 14×14 patch size to generate image embeddings.
- Projection Layer: A Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space. Compared to the vanilla LLaVA architecture, our projector reduces the number of image tokens by 9x (from 729 to 81).

The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
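
The token counts quoted in this card follow from the encoder geometry. The sketch below works through that arithmetic and then illustrates one way a 9x reduction could be realized: merging each 3×3 neighborhood of patch tokens into a single token. The 3×3 grouping and the helper name `group_patches` are assumptions made for illustration only; this card does not describe the projector's internals.

```python
IMAGE_SIZE = 384   # vision encoder input resolution (from this card)
PATCH_SIZE = 14    # vision encoder patch size (from this card)
REDUCTION = 9      # token reduction factor (from this card)

# A non-overlapping patch grid drops the partial patch at the edge: 384 // 14 = 27.
grid = IMAGE_SIZE // PATCH_SIZE
vision_tokens = grid * grid                     # 27 * 27 = 729
projected_tokens = vision_tokens // REDUCTION   # 729 // 9 = 81


def group_patches(grid_size, block=3):
    """Assumed scheme, not confirmed by the card: for each output token,
    list the flat indices of the block x block patch tokens merged into it."""
    groups = []
    for top in range(0, grid_size, block):
        for left in range(0, grid_size, block):
            groups.append([
                (top + dr) * grid_size + (left + dc)
                for dr in range(block)
                for dc in range(block)
            ])
    return groups


groups = group_patches(grid)
print(vision_tokens, projected_tokens, len(groups))
```

Under this assumed scheme, each of the 81 output tokens would be built from 9 neighboring patch embeddings, so no patch is discarded; the language model simply sees a sequence that is 9 times shorter.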

## Training

We developed Omnivision through a three-stage training pipeline:

**Pretraining:**
The initial stage focuses on establishing basic visual-linguistic alignments using image-caption pairs, during which only the projection layer parameters are unfrozen to learn these fundamental relationships.
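
As a rough illustration of the freezing scheme in this stage, the sketch below marks which of the three components would receive gradient updates. The component names and the helper `pretraining_trainable_flags` are hypothetical and do not come from the Omnivision training code.

```python
# Hypothetical names for the three components listed under Model Architecture.
COMPONENTS = ("vision_encoder", "projection_layer", "language_model")


def pretraining_trainable_flags():
    """Return which components are updated during pretraining.

    Per the description above, only the projection layer is unfrozen.
    """
    return {name: name == "projection_layer" for name in COMPONENTS}


flags = pretraining_trainable_flags()
for name, trainable in flags.items():
    print(f"{name}: {'trainable' if trainable else 'frozen'}")
```

In a framework such as PyTorch, flags like these would typically be applied by setting `requires_grad` to `False` on the parameters of the frozen components.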

**Supervised Fine-tuning (SFT):**
We enhance the model's contextual understanding using image-based question-answering datasets. This stage involves training on structured chat histories that incorporate images for the model to generate more contextually appropriate responses.

**Direct Preference Optimization (DPO):**
The final stage applies DPO. The base model first generates responses to images. A teacher model then produces minimally edited corrections that maintain high semantic similarity with the original responses and focus specifically on accuracy-critical elements. These original and corrected outputs form chosen-rejected pairs. This fine-tuning targets essential improvements to the model's outputs without altering its core response characteristics.
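
For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single chosen-rejected pair. It is a generic illustration, not the Omnivision training code: the function name `dpo_loss`, the default `beta=0.1`, and the example log-probabilities are assumptions. It also assumes the teacher's corrected response plays the role of "chosen" and the base model's original response the role of "rejected".

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one chosen-rejected pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or a frozen reference model.
    """
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Made-up log-probabilities: the policy already prefers the corrected
# (chosen) response slightly more than the reference model does.
loss = dpo_loss(policy_chosen_logp=-5.0, policy_rejected_logp=-9.0,
                ref_chosen_logp=-6.0, ref_rejected_logp=-6.0)
print(loss)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals `log(2)`; the loss falls below that as the policy shifts probability toward the chosen response relative to the reference.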

## What's next for Omnivision?
Omnivision is in early development, and we are working to address its current limitations:
- Expand DPO Training: Increase the scope of DPO (Direct Preference Optimization) training in an iterative process to continually improve model performance and response quality.
- Improve Document and Text Understanding: Strengthen the model's handling of documents and text within images.
  
In the long term, we aim to develop Omnivision as a fully optimized, production-ready solution for edge AI multimodal applications.

### Follow us
[Blogs](https://nexa.ai) | [Discord](https://discord.gg/nexa-ai) | [X(Twitter)](https://x.com/alanzhuly)