A ViT-B/32 CLIP model trained for 4 epochs on the ye-pop dataset (491,520 images paired with detailed captions generated by CogVLM). Research artifact of clip-synthetic-captions. On the DataComp benchmark suite (38 image classification and retrieval tasks), it outperforms the CLIP model trained on the original alt-texts.

Note: this model is likely not directly useful, as it is severely undertrained.
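
To give a rough sense of how small the training budget is, the sketch below computes the number of samples seen from the figures above (491,520 images, 4 epochs). The reference budget is an assumption brought in only for comparison: 12.8M samples seen, which is the budget of DataComp's smallest scale. It is not stated in this model card, and the helper name `samples_seen` is hypothetical.

```python
# Figures taken from the model card.
NUM_IMAGES = 491_520
EPOCHS = 4

# Assumption (not from the model card): reference budget of 12.8M samples
# seen, matching DataComp's smallest ("small") scale. Used for comparison only.
REFERENCE_SAMPLES_SEEN = 12_800_000


def samples_seen(num_images: int, epochs: int) -> int:
    """Total number of image-caption pairs processed during training."""
    return num_images * epochs


total = samples_seen(NUM_IMAGES, EPOCHS)
fraction = total / REFERENCE_SAMPLES_SEEN

print(f"samples seen: {total:,}")
print(f"fraction of reference budget: {fraction:.1%}")
```

Under these assumptions, the model has seen under 2M samples, a small fraction of even the smallest reference budget.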
