SanghyukChun committed
Commit d45295a
1 Parent(s): bec627a

Update README.md

Files changed (1)
  1. README.md +68 -3
README.md CHANGED
@@ -2,8 +2,73 @@
  tags:
  - pytorch_model_hub_mixin
  - model_hub_mixin
+ license: mit
+ datasets:
+ - mlfoundations/datacomp_1b
+ base_model:
+ - apple/DFN5B-CLIP-ViT-H-14
  ---
+ ## Official implementation of fine-tuned ViT-H/14 ProLIP on DataComp 1B
  
- This model has been pushed to the Hub using ****:
- - Repo: [More Information Needed]
- - Docs: [More Information Needed]
+ - This checkpoint is a fine-tuned version of the ViT-H/14 provided at https://huggingface.co/apple/DFN5B-CLIP-ViT-H-14
+ - Fine-tuning dataset
+   - DataComp 1B / 1.28B seen samples
+ - Architectural difference
+   - The ProLIP text encoder uses a `[CLS]` token for pooling, while the original model pools the last token and does not have a `[CLS]` token.
+
+ ### Overview
+ - Paper: https://arxiv.org/abs/2410.18857
+ - GitHub: https://github.com/naver-ai/prolip
+ - More models are available at https://huggingface.co/collections/SanghyukChun/prolip-6712595dfc87fd8597350291
+
+ ### Performance overview
+ - Zero-shot ImageNet-1k top-1 accuracy: 79.4%
+ - Zero-shot ImageNet distribution shifts: 68.3%
+ - Zero-shot VTAB performance: 64.4%
+ - Zero-shot retrieval performance: 61.6%
+ - Average zero-shot performance on 38 tasks: 66.9%
+
+ ```python
+ import requests
+ from PIL import Image
+
+ import torch
+ from prolip.model import ProLIPHF
+ from transformers import CLIPProcessor
+ from prolip.tokenizer import HFTokenizer
+
+ import warnings
+ warnings.simplefilter(action='ignore', category=FutureWarning)
+
+ processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
+ model = ProLIPHF.from_pretrained("SanghyukChun/ProLIP-ViT-H-14-FT-DC-1B-1_28M")
+ tokenizer = HFTokenizer("apple/DFN5B-CLIP-ViT-H-14", context_length=77)
+
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+ inputs = processor(images=image, return_tensors="pt", padding=True)
+ texts = ["A couple of cats laying on top of a pink blanket.", "A man walks through a flooded road during a rainstorm", "photo"]
+ texts = tokenizer(texts)
+
+ outputs = model(image=inputs["pixel_values"], text=texts)
+
+ l2_logit = outputs["image_features"]["mean"] @ outputs["text_features"]["mean"].T
+ i_unc = torch.exp(outputs["image_features"]["std"]).sum(dim=-1)
+ t_unc = torch.exp(outputs["text_features"]["std"]).sum(dim=-1)
+ csd_logit = l2_logit - 0.5 * t_unc
+ csd_logit2 = l2_logit.T - 0.5 * i_unc
+ print("Mean-only image-to-text logits (by L2 distance):", l2_logit)
+ print("Uncertainty-aware image-to-text logits (by CSD):", csd_logit)
+ print("Uncertainty-aware text-to-image logits (by CSD):", csd_logit2.T)
+ print("Image uncertainty: ", i_unc)
+ print("Text uncertainty: ", t_unc)
+ ```
+
+ ```
+ @article{chun2024prolip,
+ title={Probabilistic Language-Image Pre-Training},
+ author={Chun, Sanghyuk and Kim, Wonjae and Park, Song and Yun, Sangdoo},
+ journal={arXiv preprint arXiv:2410.18857},
+ year={2024}
+ }
+ ```
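The README example above prints raw CSD logits for one image and three candidate captions. As a minimal follow-up sketch (not part of the commit and not an official ProLIP API), those logits can be turned into a caption ranking; `csd_logit` is assumed to be the (1, 3) tensor computed in the snippet.

```python
# Hedged sketch: rank the three candidate captions for the image using the
# uncertainty-aware CSD logits from the README example. Assumes `csd_logit`
# (shape [1, 3]: one image, three captions) is already in scope.
import torch

probs = torch.softmax(csd_logit, dim=-1)               # per-caption matching probabilities
ranking = csd_logit.argsort(dim=-1, descending=True)   # caption indices, best match first

print("Caption probabilities:", probs)
print("Caption ranking (best first):", ranking)
```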
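The architectural note in the card says the ProLIP text encoder pools with a `[CLS]` token while the original CLIP-style encoder pools the last token. The sketch below only illustrates that difference on dummy tensors; the token positions and the feature width are assumptions for illustration, not the ProLIP implementation.

```python
# Illustrative sketch of the two text-pooling strategies mentioned in the card.
# The [CLS] position, end-of-text positions, and width are assumed values.
import torch

hidden = torch.randn(2, 77, 1024)        # (batch, context_length, width) transformer outputs
cls_position = 0                         # assumed position of the extra [CLS] token
eot_positions = torch.tensor([12, 20])   # assumed position of the last (EOT) token per caption

cls_pooled = hidden[:, cls_position]                               # ProLIP-style: pool the [CLS] token
eot_pooled = hidden[torch.arange(hidden.size(0)), eot_positions]   # CLIP-style: pool the last token
print(cls_pooled.shape, eot_pooled.shape)  # both are (2, 1024)
```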