---
license: mit
datasets:
- zer0int/CLIP-adversarial-typographic-attack_text-image
- SPRIGHT-T2I/spright_coco
base_model:
- openai/clip-vit-large-patch14
pipeline_tag: zero-shot-image-classification
library_name: transformers
---
# CLIP ViT-L/14 finetune: SAE-informed adversarial training
SAE = Sparse autoencoder
Accuracy (ImageNet/ObjectNet): my GmP finetune: 91% > SAE (this model): 89% > OpenAI pre-trained: 84.5%
Still, it's fun to use with e.g. Flux.1: get the Text-Encoder (TE) only version ⬇️ and try it!
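If you'd rather wire the TE into Flux.1 in code instead of a ComfyUI workflow, here is a minimal diffusers sketch. It is an assumption-laden example, not the recommended setup: `zer0int/CLIP-SAE-ViT-L-14` is a placeholder for this model's actual HF repo id, and it loads the full HF-format weights rather than the TE-only file; only the CLIP-L text encoder is swapped, the T5 encoder stays stock.

```python
# Minimal sketch (assumptions: placeholder repo id, HF-format weights of this model).
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import FluxPipeline

repo = "zer0int/CLIP-SAE-ViT-L-14"  # placeholder: replace with this model's actual HF repo id

# Load only the CLIP text encoder + tokenizer from this finetune.
text_encoder = CLIPTextModel.from_pretrained(repo, torch_dtype=torch.bfloat16)
tokenizer = CLIPTokenizer.from_pretrained(repo)

# Flux.1 uses a CLIP-L text encoder alongside T5; override only the CLIP part.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("a cat holding a sign that reads 'SAE CLIP'", num_inference_steps=28).images[0]
image.save("flux-sae-clip.png")
```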
And this SAE CLIP has the best linear-probe results on LAION-AI/CLIP_benchmark (see below).
This CLIP (direct download) is also the best CLIP to use for HunyuanVideo.
Required: use it with my zer0int/ComfyUI-HunyuanVideo-Nyan node (it changes the relative influence of the LLM vs. CLIP; otherwise, the difference is very small).
- Interesting things to try for adversarial robustness: right-click and download the individual images: Image 1 -- Image 2 -- Image 3
- Upload each into the zero-shot classification widget (hopefully available soon on the right here ->), or use the code sketch below this list.
- Try these labels (class names): a photo of a cat, a photo of a dog, a photo of a text
- Repeat the same with e.g. my GmP models and see what happens. =)
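If the hosted widget isn't available, you can run the same zero-shot check locally. A minimal sketch using the transformers pipeline; the repo id and image filenames below are placeholders (use this model's actual HF repo name and the images you downloaded above):

```python
# Minimal sketch: local zero-shot check with the labels suggested above.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-image-classification",
    model="zer0int/CLIP-SAE-ViT-L-14",  # placeholder: replace with this model's actual HF repo id
)

labels = ["a photo of a cat", "a photo of a dog", "a photo of a text"]
for path in ["image1.jpg", "image2.jpg", "image3.jpg"]:  # the downloaded test images
    results = classifier(path, candidate_labels=labels)
    print(path, {r["label"]: round(r["score"], 3) for r in results})
```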
- I'm really hoping the conversion to HF-format .safetensors didn't mess anything up (it happens!); just in case it did, or if there's no Inference API available to use:
- There's a script on my GitHub repo that does the same thing on the unconverted model. Plus, you can reproduce the fine-tune yourself, as that code is also available! 🤗
- 👉 All training info & code: github.com/zer0int/CLIP-SAE-finetune
- ☕ Buy me a coffee