Add MLCD Vision Base Model to HuggingFace
Updated README.md to include detailed information about the MLCD model architecture, usage, and performance metrics. This update aims to provide a robust and scalable vision model to the HuggingFace community, facilitating advanced research and application development in computer vision.
README.md
CHANGED
---
license: apache-2.0
datasets:
- laion/laion400m
- kakaobrain/coyo-700m
pipeline_tag: feature-extraction
tags:
- Vision
- LLaVA
---

[[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)

## Model
We use the same Vision Transformer architecture as [CLIP ViT-L/14@336px](https://huggingface.co/openai/clip-vit-large-patch14-336).
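Because the architecture matches CLIP's vision tower, the checkpoint may be loadable with the standard CLIP classes in Hugging Face Transformers. The snippet below is a minimal feature-extraction sketch rather than official usage code: the repository id is a placeholder, and the exact class may differ depending on how the released weights are packaged.

```python
# Minimal feature-extraction sketch (assumptions: CLIP-compatible checkpoint format,
# placeholder repository id -- substitute the actual MLCD model id).
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

model_id = "DeepGlint-AI/mlcd-vit-large-patch14-336"  # placeholder id (assumption)

processor = CLIPImageProcessor.from_pretrained(model_id)
model = CLIPVisionModel.from_pretrained(model_id).eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

pooled = outputs.pooler_output            # (1, 1024): one embedding per image
patch_tokens = outputs.last_hidden_state  # (1, 577, 1024): CLS token + 24x24 patches at 336px
```
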
## Data
Our model was trained on publicly available image-caption data from the [LAION400M](https://arxiv.org/abs/2111.02114) and [COYO700M](https://github.com/kakaobrain/coyo-dataset) datasets.

## Performance and Limitations

### A. MLLMs Evaluation Results
To demonstrate MLCD's performance within Multimodal Large Language Models (MLLMs), we replaced the CLIP vision tower in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with the MLCD model, using [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) as the language model. As the table shows, the MLCD-based model outperforms the CLIP baseline on most benchmarks, validating the effectiveness of MLCD within MLLMs. A minimal sketch of the vision-tower swap follows the table.

| Vision Tower     | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:-----------------|:----------------------|:----------------------|
| LLM              | Qwen2.5-7B            | Qwen2.5-7B            |
| AI2D             | **76.98**             | 73.15                 |
| ChartQA          | **67.84**             | 66.52                 |
| DocVQA_val       | **76.46**             | 75.21                 |
| GQA              | **64.17**             | 63.31                 |
| InfoVQA_val      | **43.48**             | 38.88                 |
| MMBench_cn_dev   | **74.83**             | 72.51                 |
| MMBench_en_dev   | **76.37**             | 74.57                 |
| MME (cognition)  | **432.50**            | 384.29                |
| MME (perception) | **1598.02**           | 1512.37               |
| MMMU             | **44.30**             | 44.20                 |
| OCRBench         | **531.00**            | 525.00                |
| POPE             | 88.69                 | **88.83**             |
| ScienceQA_img    | **78.09**             | 76.35                 |
| TextVQA_val      | 61.69                 | **62.47**             |
| SEED-Bench       | **68.20**             | 66.80                 |
| SEED-Bench_img   | **73.75**             | 72.72                 |
| MMStar           | **50.98**             | 48.98                 |

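For readers who want to reproduce this kind of swap, the sketch below shows the LLaVA-style feature-selection step with MLCD standing in for CLIP. It is a hedged illustration rather than the authors' training code: the repository id is a placeholder, and it assumes the checkpoint loads as a `CLIPVisionModel`.

```python
# Hedged sketch of using MLCD as a LLaVA-style vision tower (placeholder repo id; assumes a
# CLIP-compatible checkpoint). LLaVA conventionally takes the penultimate hidden layer and
# drops the CLS token before projecting patch features into the LLM embedding space.
import torch
from transformers import CLIPVisionModel

vision_tower = CLIPVisionModel.from_pretrained("DeepGlint-AI/mlcd-vit-large-patch14-336")  # placeholder
vision_tower.eval()

pixel_values = torch.randn(1, 3, 336, 336)  # stand-in for a preprocessed image batch

with torch.no_grad():
    out = vision_tower(pixel_values, output_hidden_states=True)

# Penultimate layer, CLS token removed: (1, 576, 1024) patch features ready for the projector.
patch_features = out.hidden_states[-2][:, 1:, :]
```
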
### B. Linear Probe Evaluation Results
This table presents linear probe evaluations comparing the CLIP and MLCD models with the ViT_L_14_336px architecture across various datasets. The linear probe test freezes the pre-trained model's weights and trains a linear classifier on top, assessing how well the model's representations generalize to different tasks (a minimal sketch of this protocol follows the table).

| Dataset                      | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:-----------------------------|:----------------------|:----------------------|
| **AVG**                      | **87.15**             | 85.35                 |
| Food101                      | **96.21**             | 95.90                 |
| CIFAR-10                     | **99.36**             | 97.90                 |
| CIFAR-100                    | **93.69**             | 87.40                 |
| Birdsnap                     | **88.18**             | 79.90                 |
| SUN397                       | **87.96**             | 82.20                 |
| Stanford Cars                | **95.16**             | 91.50                 |
| FGVC Aircraft                | **86.38**             | 71.60                 |
| Describable Textures Dataset | **86.70**             | 83.00                 |
| Oxford-IIIT Pets             | **96.27**             | 95.10                 |
| Caltech-101                  | **97.92**             | 96.00                 |
| Flowers102                   | **99.58**             | 99.20                 |
| MNIST                        | 98.67                 | **99.20**             |
| STL-10                       | 99.28                 | **99.70**             |
| EuroSAT                      | **99.06**             | 98.10                 |
| RESISC45                     | **95.48**             | 94.90                 |
| GTSRB                        | 92.32                 | **92.40**             |
| KITTI                        | **75.39**             | 69.20                 |
| Country211                   | 38.12                 | **46.40**             |
| PatchCamelyon                | **88.00**             | 85.60                 |
| UCF101                       | **92.86**             | 92.00                 |
| Kinetics-700                 | **73.35**             | 73.00                 |
| CLEVR                        | **64.40**             | 60.30                 |
| Hateful Memes                | 72.00                 | **77.30**             |
| SST-2                        | 76.33                 | **80.50**             |
| ImageNet                     | **86.10**             | 85.40                 |

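As a rough illustration of the protocol (not the exact evaluation code used for the table above), a linear probe can be run by caching frozen image embeddings and fitting a logistic-regression classifier on them; the feature files below are hypothetical placeholders.

```python
# Minimal linear-probe sketch (assumption: embeddings were pre-extracted with the frozen
# vision model, e.g. as in the feature-extraction snippet above, and saved to .npy files).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical cached features and labels; file names are placeholders.
X_train, y_train = np.load("train_feats.npy"), np.load("train_labels.npy")
X_test, y_test = np.load("test_feats.npy"), np.load("test_labels.npy")

# The backbone stays frozen; only this linear classifier is trained.
clf = LogisticRegression(max_iter=1000, C=1.0)
clf.fit(X_train, y_train)
print(f"linear-probe accuracy: {100 * clf.score(X_test, y_test):.2f}%")
```
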
### C. Limitations
Models with higher input resolution handle OCR-related tasks better. We are currently training such models and will release them soon.

## Acknowledgments
We would like to express our gratitude to [Xie Yin](https://huggingface.co/Yin-Xie) for her significant contributions to the experimental validation in MLLMs.