
Model performance is not adequate.

#1
by talrejanikhil - opened

I tried this model on a simple image that contains a picture of a cat:
08710255134840A001.jpeg

However, the probability the model assigns to the correct label is very low:
input labels = ["cat", "dog", "horse", "bird", "rabbit"]
probability % = ['0.0263%', '0.0003%', '0.0000%', '0.0001%', '0.0001%']

When using zero-shot models for image classification, it's important to use adequate prompts. Among the labels you provided, the model did assign the highest probability to 'cat', which is correct. The probability itself is low because the image can't be described simply as 'cat'. If you use a prompt such as 'a picture of a cat printed on a box', the probability will be much higher.
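A minimal sketch of the prompt-templating idea above: instead of passing bare labels, wrap each one in a sentence that describes how the object actually appears in the image. The template string here is an illustration, not the model's required format.

```python
# Hypothetical prompt template describing how the cat appears in
# this particular image (printed on a box, not a plain photo).
labels = ["cat", "dog", "horse", "bird", "rabbit"]
template = "a picture of a {} printed on a box"

# Build one descriptive prompt per label; these strings would be
# passed to the model in place of the bare label names.
prompts = [template.format(label) for label in labels]
print(prompts[0])  # -> "a picture of a cat printed on a box"
```

With the `transformers` zero-shot image classification pipeline, the same effect can be achieved via its `hypothesis_template` argument, so you keep the short labels and only change the template.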

Google org

I answered over there; this is expected and not a problem. You need to either softmax the output or calibrate to your data/task, depending on what exactly you want to do. Doing so is pretty easy: https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384/discussions/3#65f964b748d4f7baa4f1858d
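To illustrate the softmax suggestion: SigLIP scores each label independently with a sigmoid, so the per-label probabilities need not sum to 1 and can all be tiny even when the ranking is correct. Applying a softmax over the raw logits instead makes the labels compete, which is what you want when exactly one label should win. The logit values below are made up for illustration, not actual model outputs.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, then normalize.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw logits for ["cat", "dog", "horse", "bird", "rabbit"]:
# each sigmoid(logit) would be small, but "cat" clearly dominates.
logits = [-8.2, -12.7, -15.1, -14.0, -13.9]
probs = softmax(logits)
print([f"{p:.4f}" for p in probs])  # "cat" now gets most of the mass
```

After the softmax, the probabilities sum to 1 and the relative margin between 'cat' and the other labels becomes obvious, which is the behavior the original sigmoid scores were hiding.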

giffmana changed discussion status to closed
