---
language:
- en
license: cc-by-nc-4.0
---
## Model Details
CFT-CLIP was developed by HUMANE Lab researchers at Soongsil University to assess news thumbnail representativeness through counterfactual text-guided contrastive language-image pretraining.

### Model Date
January 2024

### Model Type
The model uses a ViT-L/14 transformer architecture as an image encoder and a causal text transformer as a text encoder.
Both encoders were initialized with the weights of [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) before training.
The model is trained with a contrastive loss so that the similarity of positive (image, text) pairs is high while the similarity of in-batch negatives and hard negatives is low.

Input: image and text

Output: image and text representations
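
As a concrete illustration of this objective, here is a minimal sketch (not the authors' training code) of a contrastive loss over a batch of (image, text) pairs with one counterfactual hard-negative text per pair; the function name, embedding names, and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, hard_negative_embeds, tau=0.05):
    """image_embeds, text_embeds, hard_negative_embeds: tensors of shape (B, D)."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    hard_negative_embeds = F.normalize(hard_negative_embeds, dim=-1)

    # Candidate texts for each image: the B in-batch texts plus B hard negatives.
    candidates = torch.cat([text_embeds, hard_negative_embeds], dim=0)  # (2B, D)
    logits = image_embeds @ candidates.t() / tau                        # (B, 2B)
    targets = torch.arange(image_embeds.size(0), device=logits.device)  # matching text index
    return F.cross_entropy(logits, targets)
```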


## Uses

### Use with Transformers
```python
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load the processor and model weights from the Hugging Face Hub.
processor = AutoProcessor.from_pretrained("humane-lab/CFT-CLIP")
model = AutoModel.from_pretrained("humane-lab/CFT-CLIP")

# Prepare an (image, text) pair.
image = Image.open("cat.jpg")
inputs = processor(text=["this is a cat"], images=image, return_tensors="pt")

# Forward pass returns the projected text and image representations.
outputs = model(**inputs)
text_embeds = outputs.text_embeds
image_embeds = outputs.image_embeds
```
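
The returned embeddings can then be compared with cosine similarity to score how well the image represents the text. A minimal follow-up sketch (the decision threshold itself is selected on a validation set, as described under Metrics):

```python
import torch.nn.functional as F

# Higher cosine similarity suggests the image is more representative of the text.
similarity = F.cosine_similarity(text_embeds, image_embeds).item()
print(similarity)
```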

### Intended Use
The model is intended as a research output for research communities. 

### Primary Intended Users
The primary intended users of these models are AI researchers.

### Out-of-Scope Use Cases
The model was not trained or evaluated on any language other than English; therefore, its use should be limited to English use cases.


## Factors
### Relevant factors
We trained the models with the AdamW optimizer with an initial learning rate of 1e-4, updated by a cosine annealing scheduler.
The minibatch size is 128. The temperature τ in the loss equation is 0.05. Other hyperparameters were optimized by random search using a validation set.
Model training was early-stopped when the validation loss, measured every 20 iterations, did not decrease five times consecutively.
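
For illustration only, the configuration above roughly corresponds to the following PyTorch setup; the module, the step budget, and the validation routine are placeholders rather than the authors' code.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 768)  # placeholder standing in for CFT-CLIP (load the real model as shown in "Uses")
num_steps = 10_000           # placeholder step budget (not reported on this card)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)

best_val, bad_checks = float("inf"), 0
for step in range(1, num_steps + 1):
    # ... compute the contrastive loss on a minibatch of 128 pairs, then:
    # loss.backward(); optimizer.step(); optimizer.zero_grad()
    scheduler.step()

    if step % 20 == 0:  # validation loss is measured every 20 iterations
        val_loss = 0.0  # placeholder: evaluate on the validation set here
        if val_loss < best_val:
            best_val, bad_checks = val_loss, 0
        else:
            bad_checks += 1
            if bad_checks >= 5:  # early stop after five consecutive non-improving checks
                break
```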

### Evaluation factors
We conducted a threshold-based evaluation on [NewsTT](https://github.com/ssu-humane/news-images-acl24), with the decision threshold tuned on the validation set.

## Metrics
Model performance measures: F1-score between model predictions and labels, and Spearman correlation between the model's cosine-similarity scores and the labels.

Decision thresholds: cosine-similarity thresholds selected on the validation set.

Approaches to uncertainty and variability: measured by repeating the experiment with five different random seeds.
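
A minimal sketch of this protocol, assuming `sims` holds image-text cosine similarities and `labels` holds binary representativeness labels; selecting the threshold by maximizing validation F1 is an assumption, not a detail stated on this card.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import f1_score

def select_threshold(val_sims, val_labels):
    # Assumption: choose the cosine-similarity threshold with the best validation F1.
    candidates = np.unique(val_sims)
    return max(candidates, key=lambda t: f1_score(val_labels, val_sims >= t))

def evaluate(test_sims, test_labels, threshold):
    preds = test_sims >= threshold
    f1 = f1_score(test_labels, preds)
    rho, _ = spearmanr(test_sims, test_labels)  # Spearman between similarities and labels
    return f1, rho
```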


## Data
### Training Data
The model was trained on pairs of summary text and thumbnail image, where the image in the first paragraph of each article in the publicly available [BBC English Dataset](https://aclanthology.org/2023.eacl-main.263/) serves as the thumbnail.
The original implementation had two variants: one using [NELA-GT-2021](https://arxiv.org/abs/2203.05659v1) and the other using article titles instead of summary text from the BBC dataset.

### Evaluation Data
For NELA-GT-2021, annotation was performed on 1,000 articles randomly sampled from 10,000 samples not included in the training and validation sets. 
For more details, please refer to [NewsTT](https://github.com/ssu-humane/news-images-acl24).

## Evaluation
We measured the ability of pretrained vision-language models. In addition to CLIP, we evaluated BLIP and BLIP-2. BLIP-2+SBERT is a pipelined approach that integrates BLIP-2 with SentenceBERT.

|Model|F1|Spearman|
|---|---|---|
|CFT-CLIP|**0.815±0.003**|**0.491±0.005**|
|CLIPAdapt|0.767±0.006|0.459±0.004|
|CLIP|0.763|0.409|
|BLIP|0.737|0.408|
|BLIP-2|0.707|0.415|
|BLIP-2+SBERT|0.694|0.341|

## Ethical Considerations
For pretraining, this study used publicly available news articles shared by news media.
While we tried to have a high-quality corpus for pretraining, it is possible that the model learned hidden biases in online news.
Also, since CFT-CLIP was updated from the pretrained CLIP weights, it may inherit the biases of CLIP.
Users should be cautious about applying the method to problems in a general context and should be aware of potential biases.