alanzhuly committed
Commit
c9118f4
1 Parent(s): dd792c4

Create README.md

Files changed (1): README.md (+73, -0)
 
---
license: cc
tags:
- multimodal
- conversational
- GGUF
- Image-Text-to-Text
---
## Model Information

Omni-Vision is a compact multimodal model that processes both visual and text inputs. Built on LLaVA's architecture principles, it introduces a novel token-compression method that cuts the number of image tokens from 729 to 81, achieving best-in-class efficiency while maintaining strong visual understanding capabilities on edge devices.

**Model Architecture:** Omni-Vision's architecture consists of three key components:
- Base Language Model: Qwen2.5-0.5B-Instruct serves as the base model for processing text inputs
- Vision Encoder: SigLIP-400M operates at 384×384 resolution with a 14×14 patch size to generate image embeddings
- Projection Layer: a Multi-Layer Perceptron (MLP) aligns the vision encoder's embeddings with the language model's token space

The vision encoder first transforms input images into embeddings, which the projection layer then maps into the token space of Qwen2.5-0.5B-Instruct, enabling end-to-end visual-language understanding.
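
The following is a minimal, hypothetical sketch of how these three components compose, written in PyTorch; the module names, hidden sizes, and pooling-based compressor are placeholders, not the actual Omni-Vision implementation:

```python
import torch
import torch.nn as nn

class OmniVisionPipelineSketch(nn.Module):
    """Illustrative only: vision encoder -> token compression -> MLP projector."""

    def __init__(self, vision_dim=1152, lm_dim=896, num_image_tokens=81):
        super().__init__()
        # Stand-in for SigLIP-400M: embeds each flattened 14x14 RGB patch.
        # A 384x384 input yields a 27x27 grid, i.e. 729 patches.
        self.vision_encoder = nn.Linear(3 * 14 * 14, vision_dim)
        # Token compression: 729 patch embeddings -> 81 image tokens (pooling assumed here).
        self.token_compressor = nn.AdaptiveAvgPool1d(num_image_tokens)
        # Projection layer: MLP aligning vision embeddings with the LM token space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim)
        )

    def forward(self, image_patches: torch.Tensor) -> torch.Tensor:
        # image_patches: (batch, 729, 3 * 14 * 14) flattened patches
        patch_emb = self.vision_encoder(image_patches)        # (batch, 729, vision_dim)
        pooled = self.token_compressor(patch_emb.transpose(1, 2)).transpose(1, 2)
        image_tokens = self.projector(pooled)                 # (batch, 81, lm_dim)
        # The 81 projected tokens would be concatenated with text-token embeddings
        # and fed to Qwen2.5-0.5B-Instruct (omitted here).
        return image_tokens

tokens = OmniVisionPipelineSketch()(torch.randn(1, 729, 3 * 14 * 14))
print(tokens.shape)  # torch.Size([1, 81, 896])
```

Keeping only 81 image tokens instead of 729 shortens the sequence the language model has to process, which is where the edge-efficiency gain comes from.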

**Feedback:** Questions or comments about the model can be shared through the Nexa AI Discord community or via [nexa.ai](https://nexa.ai) (see the links at the end of this card).

## Intended Use Cases

1. Visual Question Answering (VQA) and Visual Reasoning: answer natural-language questions about an image and reason about its contents.
2. Image Captioning: bridge vision and language by extracting details, understanding the scene, and generating a concise description that tells the image's story.

## Benchmarks

| Benchmark        | Nexa AI Omni-Vision | nanoLLAVA | Qwen2-VL-2B |
|------------------|---------------------|-----------|-------------|
| MM-VET           | 27.5                | 23.9      | 49.5        |
| ChartQA (Test)   | 59.2                | NA        | 73.5        |
| MMMU (Test)      | 41.8                | 28.6      | 41.1        |
| MMMU (Eval)      | 39.9                | 30.4      | 41.1        |
| ScienceQA (Eval) | 62.2                | 59.0      | NA          |
| ScienceQA (Test) | 64.5                | 59.0      | NA          |
| POPE             | 89.4                | 84.1      | NA          |

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6618e0424dbef6bd3c72f89a/6ztPlo5TgBAsFvZpGMy9H.png)

## How to use

This repository contains the GGUF build of Omni-Vision for running the model locally with the Nexa SDK.

**Test in HuggingFace Space**

**Run Locally**

Install the Nexa SDK, then run the model with:

```bash
nexa run omnivision
```

## Training

We developed Omni-Vision through a three-stage training pipeline:

**Pretraining:**
The initial stage focuses on establishing basic visual-linguistic alignments using image-caption pairs, during which only the projection layer parameters are unfrozen to learn these fundamental relationships.
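
As a rough illustration of this freezing scheme, reusing the `OmniVisionPipelineSketch` class from the architecture section above (the `projector` parameter-name prefix is an assumption):

```python
# Hypothetical stage-1 setup: freeze the vision encoder and language model,
# leave only the projection layer trainable for image-caption pretraining.
model = OmniVisionPipelineSketch()
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("projector")  # assumed parameter naming

print([name for name, p in model.named_parameters() if p.requires_grad])
# ['projector.0.weight', 'projector.0.bias', 'projector.2.weight', 'projector.2.bias']
```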

**Supervised Fine-tuning (SFT):**
We enhance the model's contextual understanding using image-based question-answering datasets. This stage trains on structured chat histories that incorporate images, so the model learns to generate more contextually appropriate responses.
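
For illustration, a single SFT sample might be structured as a chat history with an image slot, roughly as follows; the exact schema, field names, and `<image>` placeholder are assumptions, not the dataset's actual format:

```python
# Hypothetical shape of one SFT training sample: a chat history whose user turn
# references an image alongside the question.
sft_sample = {
    "image": "images/000123.jpg",  # placeholder path
    "messages": [
        {"role": "user", "content": "<image>\nWhat is the person in the photo holding?"},
        {"role": "assistant", "content": "The person is holding a red umbrella."},
    ],
}
```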

**Direct Preference Optimization (DPO):**
The final stage implements DPO by first generating responses to images with the base model. A teacher model then produces minimally edited corrections that maintain high semantic similarity with the original responses, focusing specifically on accuracy-critical elements. These original and corrected outputs form chosen-rejected pairs, and the fine-tuning targets essential output improvements without altering the model's core response characteristics.
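
To make the chosen-rejected training concrete, here is a minimal sketch of the standard DPO objective applied to such pairs; the per-sequence log-probabilities are toy values, and this is the generic DPO loss rather than Omni-Vision's actual training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: prefer the teacher-corrected (chosen) response over the
    base model's original (rejected) response, relative to a frozen reference model."""
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the gap between corrected and original responses.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy batch of two preference pairs with made-up sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss.item())
```

Because the rejected response is the base model's own output and the chosen response is a minimally edited correction, the preference signal concentrates on accuracy-critical details rather than on style.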

### Learn more in our blogs
### Join our Discord community
### Website: nexa.ai