---
license: mit
datasets:
- zer0int/CLIP-adversarial-typographic-attack_text-image
- SPRIGHT-T2I/spright_coco
base_model:
- openai/clip-vit-large-patch14
pipeline_tag: zero-shot-image-classification
library_name: transformers
---
# CLIP ViT-L/14 finetune: SAE-informed adversarial training
SAE = Sparse autoencoder
Accuracy (ImageNet/ObjectNet): my GmP finetune: 91% > SAE (this model): 89% > OpenAI pre-trained: 84.5%
Still, it's fun to use with e.g. Flux.1: get the Text-Encoder (TE) only version ⬇️ and try it!
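If you'd rather wire the TE into Flux.1 in code instead of a ComfyUI workflow, here is a minimal diffusers sketch. It is an assumption-laden example, not the recommended setup: `zer0int/CLIP-SAE-ViT-L-14` is a placeholder for this model's actual HF repo id, and it loads the full HF-format weights rather than the TE-only file; only the CLIP-L text encoder is swapped, the T5 encoder stays stock.

```python
# Minimal sketch (assumptions: placeholder repo id, HF-format weights of this model).
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import FluxPipeline

repo = "zer0int/CLIP-SAE-ViT-L-14"  # placeholder: replace with this model's actual HF repo id

# Load only the CLIP text encoder + tokenizer from this finetune.
text_encoder = CLIPTextModel.from_pretrained(repo, torch_dtype=torch.bfloat16)
tokenizer = CLIPTokenizer.from_pretrained(repo)

# Flux.1 uses a CLIP-L text encoder alongside T5; override only the CLIP part.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("a cat holding a sign that reads 'SAE CLIP'", num_inference_steps=28).images[0]
image.save("flux-sae-clip.png")
```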
And this SAE CLIP has the best linear-probe results on LAION-AI/CLIP_benchmark (see below).
This CLIP (direct download) is also the best CLIP to use for HunyuanVideo.
Required: use it with my zer0int/ComfyUI-HunyuanVideo-Nyan node (it changes the relative influence of the LLM vs. CLIP; otherwise, the difference is very small).
- Interesting things to try for adversarial robustness: right-click and download the individual images: Image 1 -- Image 2 -- Image 3
- Upload each into the zero-shot classification widget (hopefully available soon on the right here ->), or use the code sketch below this list.
- Try these labels (class names): a photo of a cat, a photo of a dog, a photo of a text
- Repeat the same with e.g. my GmP models and see what happens. =)
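If the hosted widget isn't available, you can run the same zero-shot check locally. A minimal sketch using the transformers pipeline; the repo id and image filenames below are placeholders (use this model's actual HF repo name and the images you downloaded above):

```python
# Minimal sketch: local zero-shot check with the labels suggested above.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-image-classification",
    model="zer0int/CLIP-SAE-ViT-L-14",  # placeholder: replace with this model's actual HF repo id
)

labels = ["a photo of a cat", "a photo of a dog", "a photo of a text"]
for path in ["image1.jpg", "image2.jpg", "image3.jpg"]:  # the downloaded test images
    results = classifier(path, candidate_labels=labels)
    print(path, {r["label"]: round(r["score"], 3) for r in results})
```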
- I'm really hoping the conversion to HF-format .safetensors didn't mess anything up (it happens!); just in case it did, or if there's no Inference API available to use:
- There's a script on my GitHub repo that does the same thing on the unconverted model. Plus, you can reproduce the fine-tune yourself, as that code is also available! 🤗
- 👉 All training info & code: github.com/zer0int/CLIP-SAE-finetune
- ☕ Buy me a coffee