Model performance is not adequate.
When using zero-shot models for image classification, it's important to use adequate prompts. Among the labels you provided, the model output the highest probability for the 'cat' label, which is correct. The probability itself is low because the picture can't be described simply as 'cat'. If you use a prompt such as 'a picture of a cat printed on a box', the probability will be much higher.
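To make the prompt suggestion concrete, here is a minimal sketch of turning bare labels into descriptive prompts before tokenizing them. The label list and the `template` name are assumptions for illustration; only the 'printed on a box' wording comes from the example above, and the right template depends on your images.

```python
# Hypothetical label set; replace with the labels for your own task.
labels = ["cat", "dog", "car"]

# Describe the image as a whole, not just the object category.
template = "a picture of a {} printed on a box"

# One descriptive prompt per label, in the same order as `labels`.
prompts = [template.format(label) for label in labels]

for label, prompt in zip(labels, prompts):
    print(f"{label!r} -> {prompt!r}")
```

The resulting `prompts` list would then be passed to the model's text tokenizer in place of the bare labels.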
Indeed, see also the discussion here: https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384/discussions/3
I answered over there. This is expected and not a problem: you need to either apply a softmax to the output or calibrate to your data/task, depending on what exactly you want to do. Doing so is pretty easy: https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384/discussions/3#65f964b748d4f7baa4f1858d
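As a rough illustration of the softmax option, the sketch below compares independent per-label sigmoid probabilities with a softmax taken across the candidate labels. The logit values are made up for the example and are not real model outputs; in practice you would use the image-text logits your model returns. It uses only the standard library.

```python
import math

def sigmoid(x: float) -> float:
    """Independent probability for a single image-text pair."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits: list[float]) -> list[float]:
    """Relative probabilities across all candidate labels (sums to 1)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["cat", "dog", "car"]
logits = [-2.0, -6.0, -9.0]  # made-up values, NOT real model outputs

independent = [sigmoid(v) for v in logits]  # each can be low on its own
relative = softmax(logits)                  # compares labels to each other

for label, p_ind, p_rel in zip(labels, independent, relative):
    print(f"{label}: sigmoid={p_ind:.3f}  softmax={p_rel:.3f}")
```

With these assumed logits, every sigmoid probability is low even though 'cat' clearly ranks first, while the softmax puts almost all of the mass on 'cat'. Softmax only tells you which of the provided labels fits best; if you need a meaningful absolute score, calibrating on your own data is the other route mentioned above.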