---
license: cc
tags:
- multimodal
- conversational
- GGUF
- Image-Text-to-Text
---

## Model Information

Omni-Vision is a compact multimodal model that processes both visual and text inputs. Built on LLaVA's architecture principles, it introduces a novel token compression method that reduces the number of image tokens from 729 to 81, achieving best-in-class efficiency while maintaining strong visual understanding capabilities on edge devices.

**Model Architecture:** Omni-Vision's architecture consists of three key components:
- Base Language Model: Qwen2.5-0.5B-Instruct serves as the base model for processing text inputs
- Vision Encoder: SigLIP-400M operates at 384×384 resolution with a 14×14 patch size to generate image embeddings
- Projection Layer: a Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space

The vision encoder first transforms input images into embeddings, which the projection layer then maps into the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
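
The sketch below shows one plausible shape of this projection step, combining a 3×3 token merge (729 to 81 tokens) with the MLP alignment. The class name, hidden sizes (1152 for SigLIP, 896 for Qwen2.5-0.5B), and the merge layout are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative projector: merges 3x3 patch-token neighborhoods (729 -> 81)
    and maps them into the language model's embedding space."""

    def __init__(self, vision_dim=1152, lm_dim=896, merge=3):  # dims are assumptions
        super().__init__()
        self.merge = merge
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim * merge * merge, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, vision_embeds):              # (B, 729, vision_dim)
        b, n, d = vision_embeds.shape
        side = int(n ** 0.5)                       # 27 patches per side
        m = self.merge
        x = vision_embeds.view(b, side, side, d)
        # Group each 3x3 neighborhood into one token: 27x27 -> 9x9 = 81 tokens
        x = x.view(b, side // m, m, side // m, m, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, -1, m * m * d)
        return self.mlp(x)                         # (B, 81, lm_dim)

if __name__ == "__main__":
    feats = torch.randn(1, 729, 1152)              # one image's patch embeddings
    print(VisionProjector()(feats).shape)          # torch.Size([1, 81, 896])
```

In the LLaVA-style design the card references, the 81 projected tokens would then be concatenated with the text token embeddings and passed to Qwen2.5-0.5B-Instruct as a single sequence.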

**Feedback:** Questions or comments about the model can be sent through the Discord community or the nexa.ai website listed at the end of this card.

## Intended Use Cases

1. Visual Question Answering (VQA) and Visual Reasoning: answer natural-language questions about an image and reason over its visual content.
2. Image Captioning: bridge vision and language by extracting details, interpreting the scene, and generating a concise one- or two-sentence description.

## Benchmarks

| Benchmark | Nexa AI Omni-Vision | nanoLLAVA | Qwen2-VL-2B |
|-------------------|----------------------|-----------|-------------|
| MM-VET | 27.5 | 23.9 | 49.5 |
| ChartQA (Test) | 59.2 | NA | 73.5 |
| MMMU (Test) | 41.8 | 28.6 | 41.1 |
| MMMU (Eval) | 39.9 | 30.4 | 41.1 |
| ScienceQA (Eval) | 62.2 | 59.0 | NA |
| ScienceQA (Test) | 64.5 | 59.0 | NA |
| POPE | 89.4 | 84.1 | NA |

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/6ztPlo5TgBAsFvZpGMy9H.png)

## How to use

This repository contains the GGUF build of Omni-Vision for local use with Nexa-SDK.

**Test in HuggingFace Space**

**Run Locally**

Install Nexa-SDK, then run:

```bash
nexa run omnivision
```

## Training

We developed Omni-Vision through a three-stage training pipeline:

**Pretraining:**
The initial stage establishes basic visual-linguistic alignment using image-caption pairs; only the projection layer parameters are unfrozen, so the projector learns these fundamental relationships while the vision encoder and language model stay fixed.
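
A minimal sketch of this stage-1 setup is shown below, assuming the projection layer's parameters carry a `projector` name prefix (an illustrative naming choice, not the actual code).

```python
# Stage 1 (pretraining): freeze the vision encoder and language model, train only
# the projection layer. The "projector" parameter-name prefix is an assumption.
def freeze_for_pretraining(model):
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("projector")
        if param.requires_grad:
            trainable.append(param)
    return trainable  # pass only these projector parameters to the optimizer
```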

**Supervised Fine-tuning (SFT):**
We enhance the model's contextual understanding using image-based question-answering datasets. This stage trains on structured chat histories that incorporate images, so the model learns to generate more contextually appropriate responses.
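
For illustration, one SFT sample in such a dataset could look like the dictionary below; the field names and the `<image>` placeholder convention are assumptions borrowed from common LLaVA-style chat formats, not this model's documented schema.

```python
# Illustrative SFT sample: a structured chat history that incorporates an image.
# Field names and the <image> placeholder are assumptions, not the actual schema.
sft_example = {
    "image": "samples/kitchen.jpg",
    "conversations": [
        {"role": "user", "content": "<image>\nWhat is on the counter?"},
        {"role": "assistant", "content": "A cutting board with sliced tomatoes and a knife."},
    ],
}
```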

**Direct Preference Optimization (DPO):**
The final stage implements DPO: the base model first generates responses to images, and a teacher model then produces minimally edited corrections that maintain high semantic similarity with the original responses while fixing accuracy-critical elements. The corrected and original outputs form chosen-rejected pairs. This fine-tuning targets essential improvements to the model's outputs without altering its core response characteristics.
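
These chosen-rejected pairs plug into the standard DPO objective; the sketch below is the usual formulation over per-sequence log-probabilities, not code taken from this model's training stack.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective over per-sequence log-probabilities. Here 'chosen'
    is the teacher-corrected response and 'rejected' the base model's original."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```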

### Learn more in our blogs
### Join Discord Community:
### Website: nexa.ai