Question About SigLIP 2’s Performance with Newline-Separated Labels
#2, opened by zfjerome1
I’ve been experimenting with SigLIP 1 and SigLIP 2 in the Hugging Face Space and noticed something interesting. When I input labels in two formats, comma-separated (e.g., "photojournalism photography, editorial photography") versus newline-separated (e.g., "photojournalism photography,\neditorial photography,\n..."), SigLIP 1 consistently performs more accurately, while SigLIP 2 seems to perform better when the labels are separated by newlines. I have also tested with a diverse set of images and noticed a similar pattern.
Could you shed light on why SigLIP 2 handles newline-separated labels better? Is this an intentional design choice, like training on noisier text data, or an artifact of the tokenizer?
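To make the two formats concrete, here is a minimal Python sketch of what the label strings might look like after parsing. It assumes the input is split on commas only and not stripped; that is a guess, since the Space's actual parsing is not confirmed here. Both `split_labels` and `normalize_labels` are hypothetical helpers, not part of any library.

```python
def split_labels(raw: str) -> list[str]:
    """Hypothetical parser: split on commas only, keeping any whitespace."""
    return [part for part in raw.split(",") if part]


def normalize_labels(raw: str) -> list[str]:
    """Treat commas and newlines as separators and strip each label."""
    parts = raw.replace("\n", ",").split(",")
    return [p.strip() for p in parts if p.strip()]


comma_input = "photojournalism photography, editorial photography"
newline_input = "photojournalism photography,\neditorial photography"

# Under the comma-only assumption, the second label differs between formats:
# a leading space in one case, a leading newline character in the other.
comma_labels = split_labels(comma_input)
newline_labels = split_labels(newline_input)
print(repr(comma_labels[1]))
print(repr(newline_labels[1]))

# After normalization both formats yield identical label lists.
print(normalize_labels(comma_input) == normalize_labels(newline_input))
```

If the labels reach the tokenizer unstripped, the leading space or newline becomes part of the text the model sees, which would be one possible source of the difference. Running both models on the normalized labels would help separate a parsing effect from a model effect.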
Thank you!