---
license: cc
tags:
- multimodal
- conversational
- GGUF
- Image-Text-to-Text
---

## Model Information

Omni-Vision is a compact multimodal model that processes both visual and text inputs. Built upon LLaVA's architecture principles, it introduces a novel token compression method that reduces the number of image tokens from 729 to 81, achieving best-in-class efficiency on edge devices while maintaining strong visual understanding capabilities.
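
The compression is easiest to see as arithmetic on the token grid: the encoder's 729 image tokens form a 27 × 27 grid of patch embeddings, which is regrouped into 81 = 9 × 9 tokens. The sketch below shows one plausible way to do this by merging each 3 × 3 neighborhood of patches; the 3 × 3 grouping and the hidden size are assumptions for illustration, not the released implementation.

```python
import torch

# Assumed shapes: a SigLIP-style encoder emitting a 27x27 grid of patch embeddings.
batch, grid, dim = 1, 27, 1152            # 27 * 27 = 729 image tokens; hidden size assumed
vision_tokens = torch.randn(batch, grid * grid, dim)

# One plausible compression: merge each 3x3 neighborhood of patches into a single
# token by stacking its embeddings along the channel dimension (729 -> 81 tokens).
x = vision_tokens.view(batch, grid, grid, dim)
x = x.view(batch, 9, 3, 9, 3, dim)               # split each spatial axis into 9 groups of 3
x = x.permute(0, 1, 3, 2, 4, 5).contiguous()     # (batch, 9, 9, 3, 3, dim)
compressed = x.view(batch, 81, 9 * dim)          # 81 tokens, 9x wider channels

print(compressed.shape)  # torch.Size([1, 81, 10368])
```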

**Model Architecture:** Omni-Vision's architecture consists of three key components:

- Base Language Model: Qwen2.5-0.5B-Instruct functions as the base model to process text inputs
- Vision Encoder: SigLIP-400M operates at 384 resolution with 14×14 patch size to generate image embeddings
- Projection Layer: Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space

The vision encoder first transforms input images into embeddings, which are then processed by the projection layer to match the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
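
A rough sketch of that forward path is shown below. The module layout, the compressed token count, and the dimensions are illustrative assumptions based on the description above, not the released implementation.

```python
import torch
import torch.nn as nn

class ProjectionMLP(nn.Module):
    """Aligns (compressed) vision embeddings with the language model's token space."""
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, vision_embeds: torch.Tensor) -> torch.Tensor:
        return self.net(vision_embeds)

# Illustrative dimensions only.
vision_dim, lm_dim = 10368, 896                 # lm_dim assumed to match Qwen2.5-0.5B's hidden size
projector = ProjectionMLP(vision_dim, lm_dim)

image_tokens = torch.randn(1, 81, vision_dim)   # compressed image tokens from the vision encoder
text_embeds = torch.randn(1, 32, lm_dim)        # embedded text prompt tokens

# Project image tokens into the LM embedding space, prepend them to the text tokens,
# and feed the combined sequence to Qwen2.5-0.5B-Instruct for generation.
inputs_embeds = torch.cat([projector(image_tokens), text_embeds], dim=1)
print(inputs_embeds.shape)                      # torch.Size([1, 113, 896])
```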
## Intended Use Cases

1. Visual Question Answering (VQA) and Visual Reasoning: answer natural-language questions about an image and reason over its visual content.
2. Image Captioning: generate a concise natural-language description of an image, covering its key objects and the overall scene.
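
Both use cases differ mainly in the prompt sent alongside the image. A minimal sketch, using a hypothetical `generate(image_path, prompt)` helper as a stand-in for whichever inference interface you run the model through:

```python
def generate(image_path: str, prompt: str) -> str:
    """Hypothetical stand-in for the model's inference call; wire this to your backend."""
    return "<model response>"

# Visual Question Answering / visual reasoning: ask a targeted question about the image.
answer = generate("chart.png", "Which month shows the highest revenue in this chart?")

# Image captioning: ask for an open-ended description of the scene.
caption = generate("photo.jpg", "Describe this image in one or two sentences.")
```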

## Benchmarks

| Benchmark | Nexa AI Omni-Vision | nanoLLAVA | Qwen2-VL-2B |
|-------------------|----------------------|-----------|-------------|
| MM-VET | 27.5 | 23.9 | 49.5 |
| ChartQA (Test) | 59.2 | NA | 73.5 |
| MMMU (Test) | 41.8 | 28.6 | 41.1 |
| MMMU (Eval) | 39.9 | 30.4 | 41.1 |
| ScienceQA (Eval) | 62.2 | 59.0 | NA |
| ScienceQA (Test) | 64.5 | 59.0 | NA |
| POPE | 89.4 | 84.1 | NA |

![Benchmark Radar Chart]()

## How to use

This repository provides GGUF builds of Omni-Vision for local use with the Nexa-SDK.

**Test in HuggingFace Space**

**Run Locally**

Install Nexa-SDK, then run the model with:

```bash
nexa run omnivision
```

## Training

We developed Omni-Vision through a three-stage training pipeline:

**Pretraining:**
The initial stage focuses on establishing basic visual-linguistic alignments using image-caption pairs, during which only the projection layer parameters are unfrozen to learn these fundamental relationships.
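
A minimal sketch of that freezing pattern, assuming the model exposes the projection layer as a `projector` submodule (the attribute name is illustrative):

```python
import torch.nn as nn

def set_trainable_for_pretraining(model: nn.Module) -> None:
    """Stage 1: keep the vision encoder and language model frozen; train only the projection layer."""
    for name, param in model.named_parameters():
        # Only parameters under the (assumed) `projector` submodule stay trainable.
        param.requires_grad = name.startswith("projector")
```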

**Supervised Fine-tuning (SFT):**
We enhance the model's contextual understanding with image-based question-answering datasets, training on structured chat histories that incorporate images so the model learns to generate more contextually appropriate responses.
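
For concreteness, one SFT sample might be structured like the sketch below; the field names and the `<image>` placeholder are assumptions, since the exact training schema is not specified here.

```python
# An assumed SFT sample: a structured chat history in which a user turn carries an image.
sft_example = {
    "image": "samples/receipt_001.jpg",
    "messages": [
        {"role": "user", "content": "<image>\nWhat is the total amount on this receipt?"},
        {"role": "assistant", "content": "The total is $42.17, shown on the bottom line of the receipt."},
    ],
}
```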

**Direct Preference Optimization (DPO):**
The final stage implements DPO by first generating responses to images with the base model. A teacher model then produces minimally edited corrections that stay semantically close to the original responses, focusing on accuracy-critical elements. These original and corrected outputs form chosen-rejected pairs, and the fine-tuning targets essential improvements to output accuracy without altering the model's core response characteristics.
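
A sketch of how such pairs could be assembled, with hypothetical `base_model_generate` and `teacher_correct` helpers standing in for the base model's sampling and the teacher's minimal-edit correction:

```python
from typing import Callable

def build_dpo_pairs(
    samples: list[dict],                                # each: {"image": ..., "prompt": ...}
    base_model_generate: Callable[[str, str], str],     # hypothetical: base model's response
    teacher_correct: Callable[[str, str, str], str],    # hypothetical: minimally edited correction
) -> list[dict]:
    """Pair each base-model response (rejected) with its teacher-corrected version (chosen)."""
    pairs = []
    for s in samples:
        rejected = base_model_generate(s["image"], s["prompt"])
        chosen = teacher_correct(s["image"], s["prompt"], rejected)  # accuracy-focused minimal edits
        pairs.append({
            "image": s["image"],
            "prompt": s["prompt"],
            "chosen": chosen,
            "rejected": rejected,
        })
    return pairs
```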

### Learn more in our blogs

### Join Discord Community:

### Website: nexa.ai
|