Transformers documentation

This model was released on 2023-03-27 and added to Hugging Face Transformers on 2024-01-08.

SigLIP

SigLIP is a multimodal image-text model similar to CLIP. It uses separate image and text encoders to generate representations for both modalities.

Unlike CLIP, SigLIP employs a pairwise sigmoid loss on image-text pairs during training. This training loss eliminates the need for a global view of all pairwise similarities between images and texts within a batch. Consequently, it enables more efficient scaling to larger batch sizes while also delivering superior performance with smaller batch sizes.
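
To make the pairwise sigmoid loss concrete, here is a minimal plain-Python sketch on a toy batch. It treats every image-text pair as an independent binary classification, following the formulation described in the SigLIP paper. The function name `pairwise_sigmoid_loss`, the toy embeddings, and the temperature and bias values are illustrative assumptions, not the Transformers training code.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(image_embeds, text_embeds, temperature=10.0, bias=-10.0):
    """Sum the binary log-losses over all image-text pairs, divided by the batch size.

    Embeddings are assumed to be L2-normalized already. Matching pairs
    (i == j) get label +1, every other pair gets label -1.
    """
    n = len(image_embeds)
    total = 0.0
    for i, img in enumerate(image_embeds):
        for j, txt in enumerate(text_embeds):
            similarity = sum(a * b for a, b in zip(img, txt))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0
            # Each pair contributes its own term; nothing is normalized across the batch.
            total += -log_sigmoid(label * logit)
    return total / n


# Toy, already-normalized 2-D embeddings (hypothetical values).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = pairwise_sigmoid_loss(images, aligned_texts)
swapped_loss = pairwise_sigmoid_loss(images, swapped_texts)
print(f"aligned: {aligned_loss:.4f}, swapped: {swapped_loss:.4f}")
```

Each pair's term depends only on that pair's similarity, which is why no softmax normalization over the whole batch is required. The correctly aligned batch yields the lower loss.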

You can find all the original SigLIP checkpoints under the SigLIP collection.

Click on the SigLIP models in the right sidebar for more examples of how to apply SigLIP to different image and text tasks.

The example below demonstrates how to generate similarity scores between texts and image(s) with Pipeline or the AutoModel class.

Pipeline
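
A minimal sketch of the Pipeline route, assuming the `zero-shot-image-classification` task and the `google/siglip-base-patch16-224` checkpoint used elsewhere on this page. The helper name `classify_with_pipeline` is hypothetical, and the call is left commented out because it downloads the checkpoint.

```python
def classify_with_pipeline(image, candidate_labels, checkpoint="google/siglip-base-patch16-224"):
    """Return the pipeline's list of {"score", "label"} dicts for one image."""
    # Imported lazily so that defining the helper does not load any model code.
    from transformers import pipeline

    classifier = pipeline(task="zero-shot-image-classification", model=checkpoint)
    return classifier(image, candidate_labels=candidate_labels)


IMAGE_URL = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
CANDIDATE_LABELS = ["a Pallas cat", "a lion", "a Siberian tiger"]

# Uncomment to download the checkpoint and run the classification:
# for result in classify_with_pipeline(IMAGE_URL, CANDIDATE_LABELS):
#     print(f"{result['label']}: {result['score']:.1%}")
```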

AutoModel
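
A comparable sketch for the AutoModel route, without quantization. The helper name `score_labels` is hypothetical; it expects a PIL image, and calling it downloads the checkpoint.

```python
def score_labels(image, candidate_labels, checkpoint="google/siglip-base-patch16-224"):
    """Return {label: probability} for a PIL image, with each label scored independently."""
    # Imported lazily so that defining the helper does not load any model code.
    import torch
    from transformers import AutoModel, AutoProcessor

    model = AutoModel.from_pretrained(checkpoint)
    processor = AutoProcessor.from_pretrained(checkpoint)

    # Same prompt template and padding strategy as described in the notes on this page.
    texts = [f"This is a photo of {label}." for label in candidate_labels]
    inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image has shape (num_images, num_texts); take the single image's row.
    probs = torch.sigmoid(outputs.logits_per_image)[0]
    return {label: prob.item() for label, prob in zip(candidate_labels, probs)}
```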

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.

The example below uses bitsandbytes to quantize only the weights to int4.

import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel, BitsAndBytesConfig

# Quantize the weights to 4-bit; device_map="auto" places the model on the available device(s)
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModel.from_pretrained("google/siglip-base-patch16-224", quantization_config=bnb_config, device_map="auto", attn_implementation="sdpa")
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
candidate_labels = ["a Pallas cat", "a lion", "a Siberian tiger"]
# Use the same prompt template as the Pipeline
texts = [f"This is a photo of {label}." for label in candidate_labels]
# padding="max_length" matches how the model was trained; move the inputs to the model's device
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

# Sigmoid scores each label independently, so the probabilities need not sum to 1
logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)
print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
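
The scores above come from a sigmoid rather than a softmax, so each label is judged on its own. The short sketch below contrasts the two on made-up logits; the values are hypothetical and are not model output.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, -1.0, -3.0]  # hypothetical image-text logits for three labels

sigmoid_probs = [sigmoid(x) for x in logits]
softmax_probs = softmax(logits)

print("sigmoid:", [round(p, 4) for p in sigmoid_probs], "sum =", round(sum(sigmoid_probs), 4))
print("softmax:", [round(p, 4) for p in softmax_probs], "sum =", round(sum(softmax_probs), 4))
```

The softmax values always sum to 1, whereas the sigmoid values are independent, so several labels (or none) can receive a high score for the same image.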

Notes

Training is supported for DDP and FSDP on single-node multi-GPU setups. However, training does not use torch.distributed utilities, which may limit how well the batch size scales.
When using the standalone SiglipTokenizer or SiglipProcessor, make sure to pass padding="max_length" because that is how the model was trained.
To get the same results as the Pipeline, a prompt template of "This is a photo of {label}." should be passed to the processor.
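
The padding and prompt-template notes above can be sketched as follows. The helper names `build_prompts` and `tokenize_prompts` are hypothetical; only `build_prompts` runs here, since `tokenize_prompts` downloads tokenizer files when called.

```python
PROMPT_TEMPLATE = "This is a photo of {label}."


def build_prompts(candidate_labels, template=PROMPT_TEMPLATE):
    """Fill the prompt template once per candidate label."""
    return [template.format(label=label) for label in candidate_labels]


def tokenize_prompts(prompts, checkpoint="google/siglip-base-patch16-224"):
    """Tokenize prompts with fixed-length padding (downloads the tokenizer files)."""
    # Imported lazily so build_prompts can be used without transformers installed.
    from transformers import SiglipTokenizer

    tokenizer = SiglipTokenizer.from_pretrained(checkpoint)
    # padding="max_length" pads every prompt to the tokenizer's maximum length.
    return tokenizer(prompts, padding="max_length", return_tensors="pt")


prompts = build_prompts(["a Pallas cat", "a lion", "a Siberian tiger"])
print(prompts)
```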

Set the attn_implementation parameter to either "sdpa" or "flash_attention_2" to use a more memory-efficient attention implementation.

# pip install -U flash-attn --no-build-isolation

import torch
from transformers import SiglipModel

# FlashAttention-2 requires half precision (float16 or bfloat16) and a supported GPU
model = SiglipModel.from_pretrained(
    "google/siglip-so400m-patch14-384",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto",
)

SiglipConfig

class transformers.SiglipConfig

from_text_vision_configs

SiglipTextConfig

class transformers.SiglipTextConfig

SiglipVisionConfig

class transformers.SiglipVisionConfig

SiglipTokenizer

class transformers.SiglipTokenizer

build_inputs_with_special_tokens

get_special_tokens_mask

create_token_type_ids_from_sequences

save_vocabulary

SiglipImageProcessor

class transformers.SiglipImageProcessor

preprocess

SiglipImageProcessorFast

class transformers.SiglipImageProcessorFast

preprocess

SiglipProcessor

class transformers.SiglipProcessor

batch_decode

decode

SiglipModel

class transformers.SiglipModel

forward

get_text_features

get_image_features

SiglipTextModel

class transformers.SiglipTextModel

forward

SiglipVisionModel

class transformers.SiglipVisionModel

forward

SiglipForImageClassification

class transformers.SiglipForImageClassification

forward