
Question About SigLIP 2’s Performance with Newline-Separated Labels

#2
by zfjerome1 - opened

I’ve been experimenting with SigLIP 1 and SigLIP 2 in the Hugging Face Space and noticed something interesting. When I input labels in two formats—comma-separated (e.g., "photojournalism photography, editorial photography") versus newline-separated (e.g., "photojournalism photography,\neditorial photography,\n...")—SigLIP 1 consistently performs more accurately, while SigLIP 2 seems to perform better when the labels are newline-separated. I have also tested with a diverse set of images and noticed a similar pattern.
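For reference, here is a minimal sketch of how I am comparing the two formats outside the Space, using the zero-shot image classification pipeline. The model ids, the example image URL, and the way I mimic the newline-separated input (appending "\n" to each label) are my assumptions about what the Space does, not something confirmed by the model authors:

```python
# Sketch: compare SigLIP 1 vs SigLIP 2 scores for labels with and without
# an embedded newline. Model ids and the sample image URL are assumptions.
import requests
from PIL import Image
from transformers import pipeline

image = Image.open(
    requests.get(
        "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
    ).raw
)

labels = ["photojournalism photography", "editorial photography"]
# Mimic the newline-separated text box input by appending a newline to each label.
labels_with_newline = [f"{label}\n" for label in labels]

for model_id in ("google/siglip-base-patch16-224", "google/siglip2-base-patch16-224"):
    pipe = pipeline("zero-shot-image-classification", model=model_id)
    for variant, candidates in (("comma", labels), ("newline", labels_with_newline)):
        results = pipe(image, candidate_labels=candidates)
        print(model_id, variant, [(r["label"], round(r["score"], 4)) for r in results])
```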

Could you shed light on why SigLIP 2 handles newline-separated labels better? Is this an intentional design choice, like training on noisier text data, or an artifact of the tokenizer?

Comma-separated:
(screenshot attached)

Comma-separated + newline:
(screenshot attached)

Thank you!
