---
license: apache-2.0
datasets:
- OpenFace-CQUPT/FaceCaption-15M
language:
- zh
- en
metrics:
- accuracy
pipeline_tag: image-to-text
---
# Demonstration of Cross-modal Retrieval (FLIP-based model)
<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/TGxEwHBbWZIbW67kG9jMH.mp4"></video>
# FLIP (Facial Language Image Pretraining)
This repository is the official implementation of FLIP, introduced in the [FaceCaption-15M](https://arxiv.org/abs/2407.08515) paper.
# Updates:
**[24/07/20] Usage examples for FLIP have been released! [OpenFace-CQUPT/FLIP-demo](https://huggingface.co/OpenFace-CQUPT/FLIP/tree/main/FLIP-demo)**
**[24/07/17] The FLIP model has been released! [OpenFace-CQUPT/FLIP](https://huggingface.co/OpenFace-CQUPT/FLIP)**
**Overview of FLIP architecture.**
![image-20240318101027127](https://img.yutangli.net/img/202403181010116.png)
**Fig. 1: (a) The same color represents shared parameters; “12x” stands for 12-layer transformer modules. (b), (c) and (d) show the FLIP-based model applied to the tasks of text-image retrieval, facial attribute prediction, and sketch-less facial image retrieval, respectively.**
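FLIP uses a CLIP-style dual-encoder design in which an image encoder and a text encoder are trained with an image-text contrastive objective. The snippet below is a minimal sketch of such a symmetric contrastive (InfoNCE) loss; the function name, embedding dimension, and temperature are illustrative assumptions, not the released FLIP training code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric image-text InfoNCE loss (CLIP-style); a sketch, not the official FLIP code."""
    # L2-normalize embeddings so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j) / T.
    logits = image_emb @ text_emb.t() / temperature

    # Matched image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random tensors standing in for encoder outputs.
batch, dim = 8, 512
loss = contrastive_loss(torch.randn(batch, dim), torch.randn(batch, dim))
```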
## Training
Coming soon. (The training code is only meaningful once the dataset has been published.)
```shell
# Run pretraining and redirect stdout to a log file
python pretrain.py > log.log
```
## Pre-trained Models
We provide pretrained model weights:
- FLIP Base: download [here](https://huggingface.co/OpenFace-CQUPT/Facial-language-image-pretraining-model/tree/main/ckpt)
- FLIP Large: coming soon
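As one way to fetch the released checkpoint, the sketch below uses `huggingface_hub` to download only the `ckpt/` folder; the repository id is taken from the link above, and restricting the download to that folder is an assumption about how you want to use it.

```python
from huggingface_hub import snapshot_download

# Download only the checkpoint folder of the FLIP Base weights.
local_dir = snapshot_download(
    repo_id="OpenFace-CQUPT/Facial-language-image-pretraining-model",
    allow_patterns=["ckpt/*"],   # restrict the download to the released ckpt/ directory
)
print("Weights downloaded to:", local_dir)
```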
## Datasets
Download the FaceCaption-15M dataset from [here](https://huggingface.co/datasets/OpenFace-CQUPT/FaceCaption-15M).
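If the dataset repository is accessible through the `datasets` library, loading it could look like the sketch below; streaming is used here only to avoid materializing all 15M records at once, and the exact splits and field names may differ from what the dataset actually exposes.

```python
from datasets import load_dataset

# Stream FaceCaption-15M so the full 15M-sample dataset is not downloaded up front.
ds = load_dataset("OpenFace-CQUPT/FaceCaption-15M", split="train", streaming=True)
for sample in ds.take(3):
    print(sample)   # inspect the available fields (e.g. image, caption)
```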
## Results
### Task1: Text-Image Retrieval
**Table 1:** Comparison with other classical pretrained models. All pretrained model backbones are frozen, with only the linear layer being fine-tuned. † represents the model pretrained on the LAION-Face [86] dataset; * represents the model pretrained on the FaceCaption dataset constructed without using LLM text generation.
![](https://img.yutangli.net/img/202403181015142.png)
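Table 1 follows a linear-probing protocol: the pretrained backbone is frozen and only a linear layer is fine-tuned. The sketch below illustrates that protocol in generic PyTorch; the placeholder backbone, feature dimension, and projection size are assumptions, not the evaluation code used for the paper.

```python
import torch
import torch.nn as nn

# Placeholder "frozen backbone": any module mapping images to feature vectors.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768))
for p in backbone.parameters():
    p.requires_grad = False          # freeze the pretrained encoder

# Only this linear head is trained during evaluation.
head = nn.Linear(768, 512)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

images = torch.randn(4, 3, 224, 224)             # dummy image batch
with torch.no_grad():
    feats = backbone(images)                     # frozen features
emb = head(feats)                                # trainable projection used for retrieval
```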
### Task2: Facial Attributes Prediction
**Table 2:** Comparison with other classical models. † represents the model pre-trained on the original LAION-Face dataset.
![image-20240318101126897](https://img.yutangli.net/img/202403181011115.png)
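Facial attribute prediction is typically framed as multi-label classification over binary attributes (for example, 40 CelebA-style attributes). The snippet below is a generic illustration of such a head on top of frozen image features; the attribute count and dimensions are assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

NUM_ATTRS = 40                                   # e.g. CelebA-style binary attributes (assumption)
feat_dim = 768

attr_head = nn.Linear(feat_dim, NUM_ATTRS)
criterion = nn.BCEWithLogitsLoss()               # one independent sigmoid per attribute

feats = torch.randn(8, feat_dim)                 # stand-in for frozen image features
labels = torch.randint(0, 2, (8, NUM_ATTRS)).float()
loss = criterion(attr_head(feats), labels)
```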
### Task3: Sketch Less Facial Image Retrieval
**Table 3:** Comparative results with different baseline methods. † represents the model pre-trained on the LAION-Face dataset.
![image-20240318101633671](https://img.yutangli.net/img/202403181016876.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/snd-9JBKJnRuZpm0Wp38f.png)
**Fig. 2: Demonstration of our FLIP-based model on the SLFIR task. Both methods can retrieve the target face photo from the top-5 list using a partial sketch. Our proposed FLIP-based model can achieve this using fewer strokes than the baseline. The number at the bottom denotes the rank of the paired (true match) photo at every stage.**
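In the SLFIR setting, the partial sketch is re-encoded at every drawing stage and the gallery of face photos is re-ranked; Fig. 2 reports the rank of the true match at each stage. The snippet below is a small, generic illustration of that ranking step using cosine similarity; the random embeddings stand in for encoder outputs and are not produced by the released model.

```python
import torch
import torch.nn.functional as F

def rank_of_true_match(sketch_emb, gallery_embs, true_idx):
    """1-based rank of the paired photo when the gallery is sorted by cosine similarity."""
    sims = F.cosine_similarity(sketch_emb.unsqueeze(0), gallery_embs, dim=-1)
    order = sims.argsort(descending=True)
    return (order == true_idx).nonzero(as_tuple=True)[0].item() + 1

gallery = torch.randn(100, 512)                                  # 100 candidate face photo embeddings
for stage, sketch in enumerate(torch.randn(5, 512), start=1):    # 5 partial-sketch stages
    print(f"stage {stage}: true match ranked {rank_of_true_match(sketch, gallery, true_idx=0)}")
```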
## Contacts
Email: [email protected] or dw[email protected]
## Citation
```tex
@misc{dai202415mmultimodalfacialimagetext,
title={15M Multimodal Facial Image-Text Dataset},
author={Dawei Dai and YuTang Li and YingGe Liu and Mingming Jia and Zhang YuanHui and Guoyin Wang},
year={2024},
eprint={2407.08515},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2407.08515},
}
``` |