library_name: transformers
tags:
- AIGC
- LlaVA
---

# Human-LLaVA (HumanCaption-10M dataset)

### Introduction

Human-related vision and language tasks are widely applied across various social scenarios. Recent studies demonstrate that large vision-language models can enhance performance on a variety of downstream visual-language understanding tasks. However, general-domain models often do not perform well in specialized fields. In this study, we train a domain-specific large vision-language model, Human-LLaVA, which aims to be a unified multimodal vision-language model for human-related tasks.

Specifically, (1) we first construct a large-scale, high-quality human-related image-text (caption) dataset extracted from the Internet for domain-specific alignment in the first stage (coming soon); (2) we then propose multi-granularity captions for human-related images (coming soon), covering the human face, the human body, and the whole image, which are used to fine-tune the large language model. Finally, we evaluate our model on a series of downstream tasks: Human-LLaVA achieves the best overall performance among multimodal models of similar scale, and in particular it achieves the best results on a range of human-related tasks, significantly surpassing similar models and ChatGPT-4o. We believe that the Human-LLaVA model and the datasets presented in this work can promote research in related fields.
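
The card declares `library_name: transformers`, so inference should follow the usual LLaVA pipeline in that library. Below is a minimal sketch, assuming the checkpoint is published as a standard LLaVA checkpoint on the Hugging Face Hub; the repository id, image URL, and `USER:/ASSISTANT:` prompt template are illustrative assumptions, not values confirmed by this card.

```python
# Minimal inference sketch, assuming a standard LLaVA checkpoint on the Hub.
# The repo id and image URL below are placeholders, not confirmed by this card.
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "your-org/Human-LLaVA"  # hypothetical repo id

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Any human-centric photo; the URL is a placeholder.
image = Image.open(requests.get("https://example.com/person.jpg", stream=True).raw)
# LLaVA-1.5 style prompt template (assumed; check the model's actual chat template).
prompt = "USER: <image>\nDescribe this person's appearance and clothing. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```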

## Architecture



#### Data Cleaning Process



## Get the Dataset

#### Domain Alignment Stage

HumanCaption-10M (ours): coming soon

#### Instruction Tuning Stage

**Caption**

HumanCaption-300K: [FreedomIntelligence/PubMedVision · Datasets at Hugging Face](https://huggingface.co/datasets/FreedomIntelligence/PubMedVision)

ShareGPT4V:

**VQA**

LLaVA-Instruct_zh:

ShareGPT4V:

**Visual Grounding**

verified_ref3rec:

verified_ref3reg:

verified_shikra:

**Face Attributes Recognition**

celeba_attribute:

Face_hq:
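
Most of the dataset links above are still to be filled in; once they are, each Hugging Face Hub dataset can be pulled with the `datasets` library. A minimal sketch, assuming the repo id from the one link given above (`FreedomIntelligence/PubMedVision`) and inspecting its configuration names rather than hard-coding one:

```python
# Minimal sketch for pulling an instruction-tuning dataset from the Hub.
# The repo id comes from the link above; configuration and split availability
# should be checked on the dataset page before use.
from datasets import get_dataset_config_names, load_dataset

repo_id = "FreedomIntelligence/PubMedVision"
configs = get_dataset_config_names(repo_id)
print(configs)  # inspect available configurations

# Load the first configuration as an example (replace with the one you need).
ds = load_dataset(repo_id, name=configs[0], split="train")
print(ds[0])
```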

## Result

## Citation

```

```

## Contact

Email: [[email protected]](mailto:[email protected]) or [[email protected]](mailto:[email protected])