Are the long (248-token) and short (77-token) text-enhancing models of similar quality, and just differ in token-length compatibility?
Just making sure the long version of the model is as good as the short one.
Define 'good'? :-)
- In benchmark scores, e.g. zero-shot ImageNet/ObjectNet, Long-CLIP is WORSE than the 'short CLIP'.
- For guiding a text-to-image AI system with long prompts, the Long-CLIP model adds much more intricate detail and control than 'short CLIP' can.
However, it is worse for very specific, fine-grained details, namely TEXT in images. Example without text:
very interesting!!
These images make it quite clear that I should use the short CLIP for objects with text!
Thank you for this!!
Would love to know if you found anything else in your research to improve text quality, but maybe that's not for this thread.
Well, CLIP cannot represent syntax (the full meaning of a sentence) due to the way it is built. In its embedding space, tokens / words end up further apart if unrelated, and closer together if related (contrastive learning).
That's why CLIP can't really distinguish "an orange cat sitting on a box" from "a cat sitting on an orange box"; it's a matter of chance (random seed) whether you get one or the other.
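You can easily check this yourself; here's a minimal sketch (using the small openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers purely as an example - swap in whichever CLIP you actually run) that compares the pooled text embeddings of the two prompts:

```python
# Minimal sketch: how similar are CLIP's pooled text embeddings for two
# prompts that differ only in which noun "orange" is attached to?
# (Checkpoint is just an example - use whichever CLIP you actually run.)
import torch
from transformers import CLIPModel, CLIPTokenizer

model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id)
tokenizer = CLIPTokenizer.from_pretrained(model_id)

prompts = ["an orange cat sitting on a box",
           "a cat sitting on an orange box"]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")

with torch.no_grad():
    emb = model.get_text_features(**inputs)        # shape: (2, 512)

emb = emb / emb.norm(dim=-1, keepdim=True)         # L2-normalize
print("cosine similarity:", (emb[0] @ emb[1]).item())
```

The cosine similarity comes out very high - the two meanings are nearly indistinguishable in the pooled embedding, so the diffusion model gets almost the same guidance either way.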
Though prompt weighting also helps by just "highlighting" ((orange cat)) vs. ((orange box)), that's more of a hack than a solution.
Try generating "a tabby cat sitting on a box" vs. "a cat sitting on a tabby box"; "tabby" is so close to "cat" in the embedding space that it takes a lot of prompt engineering and weighting to get it "apart" from the cat and "onto" the box.
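By the way, that ((...)) weighting is, under the hood, typically just a scaling of the corresponding token embeddings before they go to the diffusion model (A1111 convention: each pair of parentheses multiplies the weight by 1.1; real UIs also renormalize afterwards). Rough sketch of the idea:

```python
# Rough sketch of what ((orange cat)) does under the hood: the per-token
# CLIP hidden states for the emphasized words get scaled up before they
# are handed to the diffusion model as conditioning.
# (A1111-style: each pair of parentheses multiplies the weight by 1.1.)
import torch
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "openai/clip-vit-large-patch14"   # the usual SD/Flux CLIP text encoder
tokenizer = CLIPTokenizer.from_pretrained(model_id)
text_encoder = CLIPTextModel.from_pretrained(model_id)

prompt = "an orange cat sitting on a box"
inputs = tokenizer(prompt, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = text_encoder(**inputs).last_hidden_state   # (1, 77, 768)

weight = 1.1 ** 2                       # ((...)) = two parentheses -> 1.21
emphasized = {"orange", "cat"}          # the words to "highlight"
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0].tolist())
for i, tok in enumerate(tokens):
    if tok.replace("</w>", "") in emphasized:
        hidden[0, i] *= weight          # boost those token embeddings

# 'hidden' is the (modified) conditioning a UI would pass on to the UNet/DiT;
# real implementations also renormalize so the overall magnitude stays sane.
```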
So it's mission impossible to make CLIP 'spell text correctly' as-is, i.e. without modifying the model (compare e.g. "unCLIP" models and other CLIP variants). CLIP's embedding doesn't preserve the token sequence in the correct order, so it can't guide that level of detail.
That's why e.g. Flux.1 uses T5 in addition: T5 'sees sequences', and thus syntactic order and the meaning of the whole. T5, on the other hand, cannot encode the intricacies of image detail - which ARE preserved in the CLIP text encoder, because contrastive training pulls matching text and images 'close' together.
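To make the difference concrete: CLIP's contrastive objective produces one pooled vector for the whole prompt, while T5's encoder hands over one hidden state per token, in order. Quick shape comparison (small public checkpoints just for illustration; Flux.1 actually pairs CLIP ViT-L/14 with a much larger T5):

```python
# Quick illustration: CLIP's contrastive objective yields ONE pooled vector
# per prompt, while T5's encoder keeps one hidden state PER TOKEN, in order.
# (Small checkpoints here just to show the shapes; Flux.1 uses CLIP ViT-L/14
# plus a much larger T5.)
import torch
from transformers import CLIPModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer

prompt = "a cat holding a sign that says hello world"

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
t5 = T5EncoderModel.from_pretrained("t5-small")
t5_tok = T5Tokenizer.from_pretrained("t5-small")

with torch.no_grad():
    clip_emb = clip.get_text_features(**clip_tok(prompt, return_tensors="pt"))
    t5_states = t5(**t5_tok(prompt, return_tensors="pt")).last_hidden_state

print("CLIP pooled embedding:", clip_emb.shape)   # (1, 512): whole prompt -> one vector
print("T5 per-token states:  ", t5_states.shape)  # (1, seq_len, 512): one vector per token
```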
Here's an example. What CLIP 'sees' in the prompt is basically:
cat + sign -> holding + sign <- text <- [artificial redundant cat-words to make it more cat]
As you can see, the stunning detail and capability come from the huge diffusion transformer in Flux.1; the guidance towards WHAT to generate is done by CLIP. And my Text model, as intended, produces the sharpest letters. But CLIP doesn't have a clue about what it is doing.
CLIP just knows there is a sign, and because signs occur with text on them far more often in photos than cats do, the model confidently places the thing it has learned to be 'text' on the sign. If it saw a particular text on enough signs, e.g. a stop sign, CLIP can also guide the correct text ONTO a sign.
But yeah, it does not inherently understand the sequence of words as a whole to capture the full meaning of the sentence like you (or GPT-4, or Claude, or T5) do.
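If you want to poke at this yourself, the Flux.1 pipeline in diffusers exposes both text encoders, and you can swap the CLIP one for a fine-tuned variant. A rough sketch (the fine-tuned repo id is a placeholder, not a real checkpoint):

```python
# Sketch: the Flux.1 pipeline in diffusers carries BOTH text encoders, and
# the CLIP one can be swapped for a fine-tuned variant.
# The fine-tuned repo id below is a PLACEHOLDER, not a real checkpoint.
import torch
from diffusers import FluxPipeline
from transformers import CLIPTextModel

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)

print(type(pipe.text_encoder).__name__)    # CLIPTextModel  (guidance / "what")
print(type(pipe.text_encoder_2).__name__)  # T5EncoderModel (sequence / syntax)

# Swap in a fine-tuned CLIP text encoder (placeholder path):
pipe.text_encoder = CLIPTextModel.from_pretrained(
    "your-username/your-finetuned-clip-text-encoder",  # <- placeholder
    torch_dtype=torch.bfloat16)

pipe.enable_model_cpu_offload()
image = pipe("a cat holding a sign that says 'hello'",
             num_inference_steps=30).images[0]
image.save("cat_sign.png")
```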