---
library_name: transformers
tags:
- AIGC
- LLaVA
---
# Human-LLaVA (HumanCaption-10M dataset)

### Introduction

Human-related vision and language tasks are widely applied across various social scenarios. Recent studies demonstrate that large vision-language models can enhance the performance of many downstream visual-language understanding tasks. However, models trained on the general domain often do not perform well in specialized fields. In this study, we train a domain-specific large vision-language model, Human-LLaVA, which aims to be a unified multimodal language-vision model for human-related tasks.

Specifically, (1) we first construct a large-scale, high-quality human-related image-text (caption) dataset extracted from the Internet for domain-specific alignment in the first stage (coming soon); (2) we also propose multi-granularity captions for human-related images (coming soon), covering the human face, the human body, and the whole image, which are used to fine-tune the large language model. Lastly, we evaluate our model on a series of downstream tasks: Human-LLaVA achieves the best overall performance among multimodal models of similar scale, and in particular it exhibits the best performance on a series of human-related tasks, significantly surpassing similar models and GPT-4o. We believe that the Human-LLaVA model and the datasets presented in this work can promote research in related fields.
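
Given the `transformers` library name and the llava / image-text-to-text tags on this card, inference should follow the standard LLaVA integration in Hugging Face Transformers. The snippet below is a minimal sketch under that assumption: the repository id, example image URL, and prompt template are placeholders, not values confirmed by this card.

```python
# Minimal inference sketch assuming the standard LLaVA integration in
# Hugging Face Transformers; the repo id, image URL, and prompt format
# below are placeholders, not values confirmed by this model card.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "ORG/Human-LLaVA"  # placeholder: replace with this repository's id

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so the model fits on one GPU
    device_map="auto",
)

# Any human-centric photo works; this URL is only an example.
url = "https://example.com/person.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Typical LLaVA-style prompt; the exact chat template for this model may differ.
prompt = "USER: <image>\nPlease describe the person in this photo. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```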

## Architecture

![image/png](https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/SkFB0x3JunWE_Wae808Nq.png)

#### Data Cleaning Process

## Get the Dataset

#### Domain Alignment Stage

HumanCaption-10M (ours): coming soon

#### Instruction Tuning Stage

**Caption**

HumanCaption-300K: [FreedomIntelligence/PubMedVision · Datasets at Hugging Face](https://huggingface.co/datasets/FreedomIntelligence/PubMedVision)

ShareGPT4V:

**VQA**

LLaVA-Instruct_zh:

ShareGPT4V:

**Visual Grounding**

verified_ref3rec:

verified_ref3reg:

verified_shikra:

**Face Attributes Recognition**

celeba_attribute:

Face_hq:
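
As a concrete illustration of the multi-granularity captions described in the introduction, one instruction-tuning record could pair an image with captions at the face, body, and whole-image levels. The sketch below is hypothetical: the field names, schema, and values are invented for illustration and are not taken from the released datasets.

```python
# Hypothetical multi-granularity caption record (face / body / whole image).
# The schema and values here are illustrative only; the released datasets
# may organize their annotations differently.
example_record = {
    "image": "images/000001.jpg",
    "captions": {
        "face": "A middle-aged man with short gray hair, glasses, and a slight smile.",
        "body": "He wears a navy suit and stands with his arms crossed.",
        "whole_image": "A man in a navy suit stands in front of a conference banner.",
    },
    # Conversation turns in the common LLaVA instruction format.
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe this person's face."},
        {"from": "gpt", "value": "A middle-aged man with short gray hair, glasses, and a slight smile."},
    ],
}
```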

## Result

## Citation

```

```

## Contact