Recommended for Stable Diffusion 3.5 Large?

#12
by mimizukari - opened

Saw this as a possible solution to the "clip missing: ['text_projection.weight']" issue, but on a Flux forum. Does this also work for Stable Diffusion 3.5 Large, or is it incompatible (tri-clip loader with clip_g/t5/clip_l)?

tl;dr: Yes, all my models with "TE-only" in the filename should work fine, as they have a text_projection.


It should work, in general, if there is a full model available. So, if you encountered this issue with the original OpenAI CLIP-L text encoder and you want to use that model, you could download openai/clip-vit-large-patch14, which is the full CLIP (text transformer and vision transformer), and use that.

If you get an error about "unexpected keys", you can choose to ignore it and just "dump the unnecessary keys" (those that belong to the vision transformer, etc.) by finding the loading line in your code and adding strict=False, something along the lines of:

model.load_state_dict(state_dict, strict=False)
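
If you would rather not rely on strict=False, one alternative is to drop the vision keys yourself before loading. The following is only a minimal sketch: the helper name keep_text_encoder_keys is hypothetical, and the key prefixes assume the Hugging Face CLIPModel naming scheme, so check the actual keys in your checkpoint first.

```python
# Hypothetical helper: drop vision-tower keys from a full CLIP checkpoint
# so that only text-encoder weights (plus text_projection) remain.

# ASSUMPTION: key names follow the Hugging Face CLIPModel naming scheme.
# Other checkpoints (e.g. the original OpenAI format) name things differently.
VISION_PREFIXES = ("vision_model.", "visual_projection.")
EXTRA_KEYS = ("logit_scale",)


def keep_text_encoder_keys(state_dict):
    """Return a copy of state_dict without vision-only entries."""
    return {
        key: value
        for key, value in state_dict.items()
        if not key.startswith(VISION_PREFIXES) and key not in EXTRA_KEYS
    }


# Toy stand-in for a real checkpoint: the values would normally be tensors.
full_checkpoint = {
    "text_model.embeddings.token_embedding.weight": "tensor",
    "text_model.final_layer_norm.weight": "tensor",
    "text_projection.weight": "tensor",
    "vision_model.embeddings.class_embedding": "tensor",
    "visual_projection.weight": "tensor",
    "logit_scale": "tensor",
}

text_only = keep_text_encoder_keys(full_checkpoint)
print(sorted(text_only))
```

The filtered dictionary can then be passed to load_state_dict as usual. Whether strict loading succeeds afterwards still depends on the receiving model expecting exactly these key names.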

Reasoning: CLIP uses a "projection" to map both text and vision features into a shared latent space when it is used 'standalone', e.g. for zero-shot classification:

Resblocks of Text -> ln_final (768,) -> text_projection [768, 768] -> 📄👁️ <- visual.proj <- ln_final <- Resblocks of ViT
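
To make the diagram concrete, here is a dependency-free toy sketch of the two projections meeting in a shared space. Everything in it is illustrative: the sizes are tiny, the weights and features are random stand-ins rather than trained CLIP values, and the helper names are made up for this example.

```python
import math
import random

random.seed(0)

# Toy sizes. As far as I know, CLIP ViT-L/14 uses 768 (text width),
# 1024 (vision width) and 768 (shared embedding dimension).
TEXT_WIDTH, VISION_WIDTH, SHARED_DIM = 4, 6, 3


def random_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(features, weight):
    """Compute features @ weight for a 1-D feature vector."""
    return [
        sum(f * weight[i][j] for i, f in enumerate(features))
        for j in range(len(weight[0]))
    ]


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


# Random stand-ins for the trained projections and the layer-norm outputs.
text_projection = random_matrix(TEXT_WIDTH, SHARED_DIM)
visual_proj = random_matrix(VISION_WIDTH, SHARED_DIM)

text_features = [random.gauss(0.0, 1.0) for _ in range(TEXT_WIDTH)]
image_features = [random.gauss(0.0, 1.0) for _ in range(VISION_WIDTH)]

# Both modalities land in the same SHARED_DIM-sized space and can be compared.
text_embedding = project(text_features, text_projection)
image_embedding = project(image_features, visual_proj)

similarity = cosine_similarity(text_embedding, image_embedding)
print(len(text_embedding), len(image_embedding), similarity)
```

Zero-shot classification compares one image embedding against several text embeddings this way and picks the highest similarity. A pipeline that only needs the text side can stop at text_features, or at an earlier hidden layer, and never touch text_projection.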

Technically, you don't need the projection when you are using CLIP only as a text encoder for a diffusion model; you can use a different output instead, as suitable.
For example, SDXL uses the penultimate (second-to-last) layer of the Text Encoder.
In ComfyUI, it seems that Flux (dev) doesn't use the text_projection either, but obtains the step before that - which results in an embedding with a larger norm than an embedding from text_projection.

But yes: since my "TE-only" models, as well as any full model (ViT and text), always contain the text_projection layer, you should be able to use them with SD 3.5 just the same. The only potential issue I can see is that you might have to add strict=False somewhere in your code, as there may be unneeded keys that need to be explicitly ignored ("dumped") during loading.
