xiangan committed · Commit 8fec223 · 1 Parent(s): cc018bb

Add MLCD Vision Base Model to HuggingFace


- Updated README.md to include detailed information about the MLCD model
architecture, usage, and performance metrics.

This update aims to provide a robust and scalable vision model to the
HuggingFace community, facilitating advanced research and application
development in computer vision.

Files changed (1)
  1. README.md +87 -3
README.md CHANGED
@@ -1,3 +1,87 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ datasets:
+ - laion/laion400m
+ - kakaobrain/coyo-700m
+ pipeline_tag: feature-extraction
+ tags:
+ - Vision
+ - LLaVA
+ ---
+
+ [[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)
+ ## Model
+ We used the same Vision Transformer architecture as CLIP's [ViT-L/14@336px](https://huggingface.co/openai/clip-vit-large-patch14-336).
+
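For readers who want to try the checkpoint directly, here is a minimal feature-extraction sketch. It assumes the weights are stored in the standard `clip_vision_model` format and can therefore be loaded with the generic `CLIPVisionModel` / `CLIPImageProcessor` classes from `transformers`; the repository id below is a placeholder, so substitute the id of this model card.

```python
# Minimal feature-extraction sketch (illustrative, not official usage code).
# Assumption: the checkpoint follows the clip_vision_model format; the
# repository id is a placeholder -- replace it with this model card's id.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

repo_id = "DeepGlint-AI/mlcd-vit-large-patch14-336"  # placeholder / assumed id

processor = CLIPImageProcessor.from_pretrained(repo_id)
model = CLIPVisionModel.from_pretrained(repo_id).eval()

image = Image.open("example.jpg").convert("RGB")        # any RGB image
inputs = processor(images=image, return_tensors="pt")   # preprocesses to the 336px input

with torch.no_grad():
    outputs = model(**inputs)

pooled = outputs.pooler_output            # (1, 1024) global image embedding
patch_tokens = outputs.last_hidden_state  # (1, 577, 1024): CLS + 24x24 patch tokens
```

The pooled vector serves as a global image embedding, while the per-patch tokens are the kind of features an MLLM vision tower exposes.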
+ ## Data
+ Our model was trained on publicly available image-caption data from the [LAION400M](https://arxiv.org/abs/2111.02114) and [COYO700M](https://github.com/kakaobrain/coyo-dataset) datasets.
+
+ ## Performance and Limitations
+
+ ### A. MLLMs Evaluation Results
+ To evaluate MLCD within Multimodal Large Language Models (MLLMs), we replaced the CLIP vision tower in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with the MLCD model and used [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) as the language model. As the results below show, the MLCD-based model outperforms its CLIP counterpart on most benchmarks (a conceptual sketch of the vision-tower swap follows the table).
+
+ | Vision Tower | MLCD (ViT-L/14@336px) | CLIP (ViT-L/14@336px) |
+ |:----------------|:-------------|:-------------|
+ | LLM | Qwen2.5-7B | Qwen2.5-7B |
+ | AI2D | **76.98** | 73.15 |
+ | ChartQA | **67.84** | 66.52 |
+ | DocVQA_val | **76.46** | 75.21 |
+ | GQA | **64.17** | 63.31 |
+ | InfoVQA_val | **43.48** | 38.88 |
+ | MMBench_cn_dev | **74.83** | 72.51 |
+ | MMBench_en_dev | **76.37** | 74.57 |
+ | MME (cognition) | **432.50** | 384.29 |
+ | MME (perception) | **1598.02** | 1512.37 |
+ | MMMU | **44.30** | 44.20 |
+ | OCRBench | **531.00** | 525.00 |
+ | POPE | 88.69 | **88.83** |
+ | ScienceQA_img | **78.09** | 76.35 |
+ | TextVQA_val | 61.69 | **62.47** |
+ | SeedBench | **68.20** | 66.80 |
+ | SeedBench_img | **73.75** | 72.72 |
+ | MMStar | **50.98** | 48.98 |
+
+
+
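As a conceptual illustration of the swap described above (this is not the actual LLaVA-NeXT training code), the sketch below shows that, assuming this checkpoint loads with the same CLIP vision classes as the original tower, replacing the vision tower reduces to pointing the loader at a different checkpoint path. The MLCD repository id is a placeholder.

```python
# Conceptual sketch of the vision-tower swap (not the LLaVA-NeXT training code).
# Assumption: both checkpoints expose the same CLIP-style vision interface, so
# the swap is essentially a change of checkpoint path.
from transformers import CLIPImageProcessor, CLIPVisionModel

VISION_TOWERS = {
    "clip": "openai/clip-vit-large-patch14-336",
    "mlcd": "DeepGlint-AI/mlcd-vit-large-patch14-336",  # placeholder / assumed id
}

def load_vision_tower(name: str):
    """Load a vision tower and its image processor by short name."""
    repo_id = VISION_TOWERS[name]
    processor = CLIPImageProcessor.from_pretrained(repo_id)
    tower = CLIPVisionModel.from_pretrained(repo_id).eval()
    return tower, processor

# e.g. tower, processor = load_vision_tower("mlcd")
```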
+ ### B. Linear Probe Evaluation Results
+ The table below compares linear probe results for the MLCD and CLIP ViT-L/14@336px models across a range of datasets. In a linear probe, the pre-trained model's weights are frozen and only a linear classifier is trained on top of its features, measuring how well the learned representations transfer to different tasks; a minimal sketch of this protocol is shown after the table.
+
+ | Dataset | MLCD (ViT-L/14@336px) | CLIP (ViT-L/14@336px) |
+ |:---------------|:----------------------|:----------------------|
+ | **AVG** | **87.15** | 85.35 |
+ | Food101 | **96.21** | 95.90 |
+ | CIFAR-10 | **99.36** | 97.90 |
+ | CIFAR-100 | **93.69** | 87.40 |
+ | Birdsnap | **88.18** | 79.90 |
+ | SUN397 | **87.96** | 82.20 |
+ | Stanford Cars | **95.16** | 91.50 |
+ | FGVC Aircraft | **86.38** | 71.60 |
+ | Describable Textures Dataset | **86.70** | 83.00 |
+ | Oxford-IIIT Pets | **96.27** | 95.10 |
+ | Caltech-101 | **97.92** | 96.00 |
+ | Flowers102 | **99.58** | 99.20 |
+ | MNIST | 98.67 | **99.20** |
+ | STL-10 | 99.28 | **99.70** |
+ | EuroSAT | **99.06** | 98.10 |
+ | RESISC45 | **95.48** | 94.90 |
+ | GTSRB | 92.32 | **92.40** |
+ | KITTI | **75.39** | 69.20 |
+ | Country211 | 38.12 | **46.40** |
+ | PatchCamelyon | **88.00** | 85.60 |
+ | UCF101 | **92.86** | 92.00 |
+ | Kinetics-700 | **73.35** | 73.00 |
+ | CLEVR | **64.40** | 60.30 |
+ | Hateful Memes | 72.00 | **77.30** |
+ | SST-2 | 76.33 | **80.50** |
+ | ImageNet | **86.10** | 85.40 |
+
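As a rough sketch of the linear probe protocol described above (not the evaluation code behind the numbers in the table), the example below freezes the backbone, extracts pooled features, and fits a scikit-learn logistic-regression classifier on top. The repository id is a placeholder, and the synthetic images merely stand in for a real benchmark such as CIFAR-10.

```python
# Minimal linear-probe sketch: frozen backbone features + a linear classifier.
# Illustrative only -- the repository id is a placeholder and the tiny synthetic
# dataset stands in for a real benchmark such as CIFAR-10.
import numpy as np
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPImageProcessor, CLIPVisionModel

repo_id = "DeepGlint-AI/mlcd-vit-large-patch14-336"  # placeholder / assumed id
processor = CLIPImageProcessor.from_pretrained(repo_id)
model = CLIPVisionModel.from_pretrained(repo_id).eval()

@torch.no_grad()
def extract_features(images):
    """Pooled (CLS) embeddings with the backbone weights frozen."""
    inputs = processor(images=images, return_tensors="pt")
    return model(**inputs).pooler_output.numpy()

# Synthetic stand-in data so the sketch runs end to end.
rng = np.random.default_rng(0)
def random_images(n):
    return [Image.fromarray(rng.integers(0, 256, (336, 336, 3), dtype=np.uint8)) for _ in range(n)]

X_train = extract_features(random_images(16))
y_train = rng.integers(0, 2, 16)
X_test = extract_features(random_images(8))
y_test = rng.integers(0, 2, 8)

probe = LogisticRegression(max_iter=1000)  # the linear classifier trained on frozen features
probe.fit(X_train, y_train)
print("linear-probe accuracy:", probe.score(X_test, y_test))
```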
+
+ ### C. Limitations
+
+ Models with higher input resolution tend to perform better on OCR-related tasks. We are currently training such higher-resolution models and will release them soon.
+
+
+ ## Acknowledgments
+
+ We would like to express our gratitude to [Xie Yin](https://huggingface.co/Yin-Xie) for her significant contributions to the experimental validation in MLLMs.