	---
	tags:
	- pytorch_model_hub_mixin
	- model_hub_mixin
	license: mit
	datasets:
	- mlfoundations/datacomp_1b
	base_model:
	- apple/DFN5B-CLIP-ViT-H-14
	---
	## Official implementation of fine-tuned ViT-H/14 ProLIP on DataComp 1B

	- These weights are a ViT-H/14 model fine-tuned with Probabilistic Language-Image Pre-Training (ProLIP).
	- Pre-trained weights
	  - https://huggingface.co/apple/DFN5B-CLIP-ViT-H-14
	- Fine-tuning dataset
	  - DataComp 1B / Seen samples 1.28B
	- Architectural difference
	  - The ProLIP text encoder uses the `[CLS]` token for pooling, whereas the original model pools from the last token without a dedicated `[CLS]` token.
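
To illustrate the pooling difference noted above, here is a minimal PyTorch sketch. It is not ProLIP's actual code: the token ids, `CLS_ID`, `PAD_ID`, the right-padding convention, and both helper functions are hypothetical, and the position of `[CLS]` in the toy batch is arbitrary.

```python
import torch

CLS_ID = 99  # hypothetical id of the [CLS] token (illustration only)
PAD_ID = 0   # hypothetical padding id; sequences are assumed right-padded


def pool_last_token(hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Pick the hidden state of the last non-padding token in each sequence."""
    lengths = (token_ids != PAD_ID).sum(dim=-1)
    batch_idx = torch.arange(hidden.shape[0])
    return hidden[batch_idx, lengths - 1]


def pool_cls_token(hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Pick the hidden state at the position of the [CLS] token in each sequence."""
    cls_pos = (token_ids == CLS_ID).int().argmax(dim=-1)
    batch_idx = torch.arange(hidden.shape[0])
    return hidden[batch_idx, cls_pos]


# Toy batch: 2 sequences, 4 positions, hidden size 3.
token_ids = torch.tensor([[5, CLS_ID, 7, PAD_ID],
                          [3, 4, CLS_ID, 6]])
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)

last_pooled = pool_last_token(hidden, token_ids)  # rows taken from positions 2 and 3
cls_pooled = pool_cls_token(hidden, token_ids)    # rows taken from positions 1 and 2
print(last_pooled)
print(cls_pooled)
```

Both helpers return one vector per sequence. They differ only in which position is selected: one is fixed by sequence length, the other by an explicit token.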

	### Overview
	- Paper: https://arxiv.org/abs/2410.18857
	- GitHub: https://github.com/naver-ai/prolip
	- More models are available at https://huggingface.co/collections/SanghyukChun/prolip-6712595dfc87fd8597350291

	### Performance overview
	- Zero-shot ImageNet-1k top-1 accuracy: 79.4%
	- Zero-shot ImageNet distribution shifts: 68.3%
	- Zero-shot VTAB performance: 64.4%
	- Zero-shot retrieval performance: 61.6%
	- Average zero-shot performance on 38 tasks: 66.9%

	```python
	import requests
	from PIL import Image

	import torch
	from prolip.model import ProLIPHF
	from transformers import CLIPProcessor
	from prolip.tokenizer import HFTokenizer

	import warnings
	warnings.simplefilter(action='ignore', category=FutureWarning)

	processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
	model = ProLIPHF.from_pretrained("SanghyukChun/ProLIP-ViT-H-14-FT-DC-1B-1_28B")
	tokenizer = HFTokenizer("apple/DFN5B-CLIP-ViT-H-14", context_length=77)

	url = "http://images.cocodataset.org/val2017/000000039769.jpg"
	image = Image.open(requests.get(url, stream=True).raw)
	inputs = processor(images=image, return_tensors="pt", padding=True)
	texts = ["A couple of cats laying on top of a pink blanket.", "A man walks through a flooded road during a rainstorm", "photo"]
	texts = tokenizer(texts)

	outputs = model(image=inputs["pixel_values"], text=texts)

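	# Mean-only similarity: inner product of image and text mean embeddings, shape (num_images, num_texts).
	l2_logit = outputs["image_features"]["mean"] @ outputs["text_features"]["mean"].T
	# Per-input scalar uncertainty: exponentiate the "std" output and sum over the embedding dimension.
	i_unc = torch.exp(outputs["image_features"]["std"]).sum(dim=-1)
	t_unc = torch.exp(outputs["text_features"]["std"]).sum(dim=-1)
	# Uncertainty-aware logits: subtract half the text uncertainty per text column,
	# or half the image uncertainty per image column after transposing.
	csd_logit = l2_logit - 0.5 * t_unc
	csd_logit2 = l2_logit.T - 0.5 * i_unc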
	print("Mean-only image-to-text logits (by L2 distance):", l2_logit)
	print("Uncertainty-aware image-to-text logits (by CSD):", csd_logit)
	print("Uncertainty-aware text-to-image logits (by CSD):", csd_logit2.T)
	print("Image uncertainty: ", i_unc)
	print("Text uncertainty: ", t_unc)
	```
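
The uncertainty adjustment in the snippet above relies on broadcasting, which is easier to follow on small tensors. The sketch below repeats the same arithmetic on made-up values (1 image, 3 texts, embedding size 2) so it runs without downloading the model. All tensor values and variable names are illustrative assumptions, not real ProLIP outputs.

```python
import torch

# Made-up stand-ins for outputs["image_features"] / outputs["text_features"].
img_mean = torch.tensor([[1.0, 0.0]])                          # (1 image, dim 2)
txt_mean = torch.tensor([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])  # (3 texts, dim 2)
img_std = torch.zeros(1, 2)                                    # exp(0) = 1 per dimension
txt_std = torch.log(torch.tensor([[1.0, 1.0], [1.0, 1.0], [2.0, 2.0]]))

# Mean-only logits, shape (num_images, num_texts).
mean_logit = img_mean @ txt_mean.T

# One scalar uncertainty per input, as in the snippet above.
i_unc = torch.exp(img_std).sum(dim=-1)  # shape (1,)
t_unc = torch.exp(txt_std).sum(dim=-1)  # shape (3,)

# t_unc broadcasts across rows, so each text column gets its own penalty.
i2t_logit = mean_logit - 0.5 * t_unc
# After transposing, i_unc broadcasts the same way over image columns.
t2i_logit = (mean_logit.T - 0.5 * i_unc).T

print(i2t_logit)
print(t2i_logit)
```

In this toy case the third text has twice the summed uncertainty of the others, so its image-to-text logit is penalized more heavily than its text-to-image logit.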

	```
	@article{chun2024prolip,
	title={Probabilistic Language-Image Pre-Training},
	author={Chun, Sanghyuk and Kim, Wonjae and Park, Song and Yun, Sangdoo},
	journal={arXiv preprint arXiv:2410.18857},
	year={2024}
	}
	```