---
library_name: transformers
tags:
- AIGC
- LLaVA
---
# Human-LLaVA (HumanCaption-10M dataset)

### Introduction

Human-related vision and language tasks are widely applied across various social scenarios. Recent studies demonstrate that large vision-language models can enhance the performance of many downstream visual-language understanding tasks. However, models trained on the general domain often do not perform well in specialized fields. In this study, we train a domain-specific large vision-language model, Human-LLaVA, which aims to be a unified multimodal language-vision model for human-related tasks.

Specifically, (1) we first construct a large-scale, high-quality human-related image-text (caption) dataset extracted from the Internet for domain-specific alignment in the first stage (coming soon); (2) we also propose multi-granularity captions for human-related images (coming soon), covering the human face, the human body, and the whole image, which are used to fine-tune the large language model. Lastly, we evaluate our model on a series of downstream tasks: Human-LLaVA achieves the best overall performance among multimodal models of similar scale, and in particular it exhibits the best performance on a series of human-related tasks, significantly surpassing similar models and GPT-4o. We believe that the Human-LLaVA model and the datasets presented in this work can promote research in related fields.
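
Given the `transformers` library name and the llava / image-text-to-text tags on this card, inference should follow the standard LLaVA integration in Hugging Face Transformers. The snippet below is a minimal sketch under that assumption: the repository id, example image URL, and prompt template are placeholders, not values confirmed by this card.

```python
# Minimal inference sketch assuming the standard LLaVA integration in
# Hugging Face Transformers; the repo id, image URL, and prompt format
# below are placeholders, not values confirmed by this model card.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "ORG/Human-LLaVA"  # placeholder: replace with this repository's id

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so the model fits on one GPU
    device_map="auto",
)

# Any human-centric photo works; this URL is only an example.
url = "https://example.com/person.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Typical LLaVA-style prompt; the exact chat template for this model may differ.
prompt = "USER: <image>\nPlease describe the person in this photo. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```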

## Architecture

![image/png](https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/SkFB0x3JunWE_Wae808Nq.png)

#### Data Cleaning Process

## Get the Dataset

#### Domain Alignment Stage

HumanCaption-10M (ours): coming soon

#### Instruction Tuning Stage

**Caption**

HumanCaption-300K: [FreedomIntelligence/PubMedVision · Datasets at Hugging Face](https://huggingface.co/datasets/FreedomIntelligence/PubMedVision)

ShareGPT4V:

**VQA**

LLaVA-Instruct_zh:

ShareGPT4V:

**Visual Grounding**

verified_ref3rec:

verified_ref3reg:

verified_shikra:

**Face Attributes Recognition**

celeba_attribute:

Face_hq:
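
As a concrete illustration of the multi-granularity captions described in the introduction, one instruction-tuning record could pair an image with captions at the face, body, and whole-image levels. The sketch below is hypothetical: the field names, schema, and values are invented for illustration and are not taken from the released datasets.

```python
# Hypothetical multi-granularity caption record (face / body / whole image).
# The schema and values here are illustrative only; the released datasets
# may organize their annotations differently.
example_record = {
    "image": "images/000001.jpg",
    "captions": {
        "face": "A middle-aged man with short gray hair, glasses, and a slight smile.",
        "body": "He wears a navy suit and stands with his arms crossed.",
        "whole_image": "A man in a navy suit stands in front of a conference banner.",
    },
    # Conversation turns in the common LLaVA instruction format.
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe this person's face."},
        {"from": "gpt", "value": "A middle-aged man with short gray hair, glasses, and a slight smile."},
    ],
}
```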

## Result

## Citation

```

```

## Contact