alanzhuly committed
Commit
c9118f4
1 Parent(s): dd792c4

Create README.md

Files changed (1): README.md (+73, -0)
 
---
license: cc
tags:
- multimodal
- conversational
- GGUF
- Image-Text-to-Text
---
## Model Information

Omni-Vision is a compact multimodal model that processes both visual and text inputs. Built on LLaVA's architecture principles, it introduces a novel token-compression method that cuts the number of image tokens from 729 to 81, achieving best-in-class efficiency while maintaining strong visual understanding capabilities on edge devices.

**Model Architecture:** Omni-Vision's architecture consists of three key components:
- Base Language Model: Qwen2.5-0.5B-Instruct serves as the base model for processing text inputs
- Vision Encoder: SigLIP-400M operates at 384×384 resolution with a 14×14 patch size to generate image embeddings
- Projection Layer: a Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space

The vision encoder first transforms input images into embeddings, which the projection layer then maps into the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
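
The following is a minimal, hypothetical sketch of how these three components compose, written in PyTorch; the module names, hidden sizes, and pooling-based compressor are placeholders, not the actual Omni-Vision implementation:

```python
import torch
import torch.nn as nn

class OmniVisionPipelineSketch(nn.Module):
    """Illustrative only: vision encoder -> token compression -> MLP projector."""

    def __init__(self, vision_dim=1152, lm_dim=896, num_image_tokens=81):
        super().__init__()
        # Stand-in for SigLIP-400M: embeds each flattened 14x14 RGB patch.
        # A 384x384 input yields a 27x27 grid, i.e. 729 patches.
        self.vision_encoder = nn.Linear(3 * 14 * 14, vision_dim)
        # Token compression: 729 patch embeddings -> 81 image tokens (pooling assumed here).
        self.token_compressor = nn.AdaptiveAvgPool1d(num_image_tokens)
        # Projection layer: MLP aligning vision embeddings with the LM token space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim)
        )

    def forward(self, image_patches: torch.Tensor) -> torch.Tensor:
        # image_patches: (batch, 729, 3 * 14 * 14) flattened patches
        patch_emb = self.vision_encoder(image_patches)        # (batch, 729, vision_dim)
        pooled = self.token_compressor(patch_emb.transpose(1, 2)).transpose(1, 2)
        image_tokens = self.projector(pooled)                 # (batch, 81, lm_dim)
        # The 81 projected tokens would be concatenated with text-token embeddings
        # and fed to Qwen2.5-0.5B-Instruct (omitted here).
        return image_tokens

tokens = OmniVisionPipelineSketch()(torch.randn(1, 729, 3 * 14 * 14))
print(tokens.shape)  # torch.Size([1, 81, 896])
```

Keeping only 81 image tokens instead of 729 shortens the sequence the language model has to process, which is where the edge-efficiency gain comes from.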

**Feedback:** Questions or comments about the model can be shared through the Nexa AI Discord community or via [nexa.ai](https://nexa.ai) (see the links at the end of this card).

## Intended Use Cases

1. Visual Question Answering (VQA) and Visual Reasoning: answer natural-language questions about an image and reason about its contents.
2. Image Captioning: bridge vision and language by extracting details, understanding the scene, and generating a concise description that tells the image's story.

## Benchmarks

| Benchmark        | Nexa AI Omni-Vision | nanoLLAVA | Qwen2-VL-2B |
|------------------|---------------------|-----------|-------------|
| MM-VET           | 27.5                | 23.9      | 49.5        |
| ChartQA (Test)   | 59.2                | NA        | 73.5        |
| MMMU (Test)      | 41.8                | 28.6      | 41.1        |
| MMMU (Eval)      | 39.9                | 30.4      | 41.1        |
| ScienceQA (Eval) | 62.2                | 59.0      | NA          |
| ScienceQA (Test) | 64.5                | 59.0      | NA          |
| POPE             | 89.4                | 84.1      | NA          |

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/6ztPlo5TgBAsFvZpGMy9H.png)

## How to use

This repository contains the GGUF build of Omni-Vision for running the model locally with the Nexa SDK.

**Test in HuggingFace Space**

**Run Locally**

Install the Nexa SDK, then run the model with:

```bash
nexa run omnivision
```

## Training

We developed Omni-Vision through a three-stage training pipeline:

**Pretraining:**
The initial stage focuses on establishing basic visual-linguistic alignments using image-caption pairs, during which only the projection layer parameters are unfrozen to learn these fundamental relationships.
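
As a rough illustration of this freezing scheme, reusing the `OmniVisionPipelineSketch` class from the architecture section above (the `projector` parameter-name prefix is an assumption):

```python
# Hypothetical stage-1 setup: freeze the vision encoder and language model,
# leave only the projection layer trainable for image-caption pretraining.
model = OmniVisionPipelineSketch()
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("projector")  # assumed parameter naming

print([name for name, p in model.named_parameters() if p.requires_grad])
# ['projector.0.weight', 'projector.0.bias', 'projector.2.weight', 'projector.2.bias']
```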

**Supervised Fine-tuning (SFT):**
We enhance the model's contextual understanding using image-based question-answering datasets. This stage trains on structured chat histories that incorporate images, so the model learns to generate more contextually appropriate responses.
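
For illustration, a single SFT sample might be structured as a chat history with an image slot, roughly as follows; the exact schema, field names, and `<image>` placeholder are assumptions, not the dataset's actual format:

```python
# Hypothetical shape of one SFT training sample: a chat history whose user turn
# references an image alongside the question.
sft_sample = {
    "image": "images/000123.jpg",  # placeholder path
    "messages": [
        {"role": "user", "content": "<image>\nWhat is the person in the photo holding?"},
        {"role": "assistant", "content": "The person is holding a red umbrella."},
    ],
}
```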

**Direct Preference Optimization (DPO):**
The final stage implements DPO by first generating responses to images with the base model. A teacher model then produces minimally edited corrections that maintain high semantic similarity with the original responses, focusing specifically on accuracy-critical elements. These original and corrected outputs form chosen-rejected pairs, and the fine-tuning targets essential output improvements without altering the model's core response characteristics.
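
To make the chosen-rejected training concrete, here is a minimal sketch of the standard DPO objective applied to such pairs; the per-sequence log-probabilities are toy values, and this is the generic DPO loss rather than Omni-Vision's actual training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: prefer the teacher-corrected (chosen) response over the
    base model's original (rejected) response, relative to a frozen reference model."""
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the gap between corrected and original responses.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy batch of two preference pairs with made-up sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss.item())
```

Because the rejected response is the base model's own output and the chosen response is a minimally edited correction, the preference signal concentrates on accuracy-critical details rather than on style.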

### Learn more in our blogs
### Join our Discord community
### Website: nexa.ai