
metadata

pipeline_tag: zero-shot-classification
base_model: laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K
inference: false
tags:
  - deepsparse

This is an unoptimized, exported version of https://huggingface.co/laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K that is ready to use with DeepSparse. It achieves 95.7% zero-shot top-1 accuracy on Imagenette.

Notebook for basic usage: Notebook for Imagenette evaluation:

Setup for usage

First, install DeepSparse with extensions for CLIP:

pip install "deepsparse-nightly[clip]>=1.7.0.20231210"
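To confirm that the installed build meets the minimum version above, a quick check with the standard library can help. This is a minimal sketch: the helper name `installed_version` is hypothetical, and it assumes the distribution is registered under the name `deepsparse-nightly` used in the install command.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed version string, or None if the package is not installed.
print(installed_version("deepsparse-nightly"))
```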

Download some test images of a church, a dog, and elephants:

wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
wget -O buddy.jpeg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/tests/deepsparse/pipelines/sample_images/buddy.jpeg
wget -O thailand.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolact/sample_images/thailand.jpg
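On systems without `wget`, the same files can be fetched from Python. The sketch below uses only the standard library; the names `SAMPLE_IMAGES` and `fetch_samples` are hypothetical, and the URLs are the ones listed above.

```python
import urllib.request
from pathlib import Path

_BASE = "https://raw.githubusercontent.com/neuralmagic/deepsparse/main"

# Local filename -> source URL (same three files as the wget commands)
SAMPLE_IMAGES = {
    "basilica.jpg": f"{_BASE}/src/deepsparse/yolo/sample_images/basilica.jpg",
    "buddy.jpeg": f"{_BASE}/tests/deepsparse/pipelines/sample_images/buddy.jpeg",
    "thailand.jpg": f"{_BASE}/src/deepsparse/yolact/sample_images/thailand.jpg",
}


def fetch_samples(dest_dir: str = ".") -> list:
    """Download each sample image into dest_dir and return the saved paths."""
    saved = []
    for filename, url in SAMPLE_IMAGES.items():
        target = Path(dest_dir) / filename
        urllib.request.urlretrieve(url, target)  # overwrites, like `wget -O`
        saved.append(target)
    return saved
```

Calling `fetch_samples()` places the three images in the current directory, which is where the later examples expect them.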

The text model in this export takes a second input, the token lengths, which the stock text pipeline does not supply. Run the following override before creating a text pipeline:

import numpy as np
from deepsparse.clip import CLIPTextPipeline

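def custom_process_inputs(self, inputs):
    # Accept a single string as well as a list of strings
    if not isinstance(inputs.text, list):
        inputs.text = [inputs.text]
    # Inputs that are not strings are assumed to be pre-tokenized; pass them through
    if not isinstance(inputs.text[0], str):
        return inputs.text
    # Tokenize each prompt and stack into a (batch, context_length) int32 array
    tokens = [np.array(t).astype(np.int32) for t in self.tokenizer(inputs.text)]
    tokens = np.stack(tokens, axis=0)
    # Second model input: one entry per prompt, each set to context_length - 1
    tokens_lengths = np.array(tokens.shape[0] * [tokens.shape[1] - 1])
    return [tokens, tokens_lengths]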

# This overrides the process_inputs function globally for all CLIPTextPipeline classes
CLIPTextPipeline.process_inputs = custom_process_inputs
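To see what the override passes to the model, the sketch below reproduces its shape logic on a dummy token batch. The context length of 77 is an assumption (the usual CLIP tokenizer default), and the zero-filled array is only a stand-in for real tokenizer output.

```python
import numpy as np

# Assumed sizes: 5 prompts, 77-token CLIP context
batch_size, context_length = 5, 77
# Placeholder for the stacked tokenizer output
tokens = np.zeros((batch_size, context_length), dtype=np.int32)

# Same expression as in custom_process_inputs: one entry per prompt,
# each equal to the last index of the padded sequence.
tokens_lengths = np.array(tokens.shape[0] * [tokens.shape[1] - 1])

print(tokens.shape)    # (5, 77)
print(tokens_lengths)  # [76 76 76 76 76]
```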

Text embedding pipeline

Here is an example of how to create and use a DeepSparse pipeline for text embeddings.

from deepsparse import Pipeline
from huggingface_hub import snapshot_download

# Download the model from HF
model_folder = snapshot_download(repo_id="neuralmagic/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-ds")

text_embed_pipeline = Pipeline.create(task="clip_text", model_path=model_folder + "/textual.onnx")

text = ["ice cream", "an elephant", "a dog", "a building", "a church"]

embeddings = text_embed_pipeline(text=text).text_embeddings
for embedding in embeddings:
    print(embedding.shape)
    print(embedding)

Image embedding pipeline

Here is an example of how to create and use a DeepSparse pipeline for image embeddings.

from deepsparse import Pipeline
from huggingface_hub import snapshot_download

# Download the model from HF
model_folder = snapshot_download(repo_id="neuralmagic/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-ds")

image_embed_pipeline = Pipeline.create(task="clip_visual", model_path=model_folder + "/visual.onnx")

images = ["basilica.jpg", "buddy.jpeg", "thailand.jpg"]

embeddings = image_embed_pipeline(images=images).image_embeddings
for embedding in embeddings:
    print(embedding.shape)
    print(embedding)

Zero-shot image classification pipeline

CLIP trains its text and image encoders jointly, so their embeddings can be compared directly without any retraining. Here is an example of how to create and use a DeepSparse pipeline for zero-shot image classification.

import numpy as np
from deepsparse import Pipeline
from deepsparse.clip import (
    CLIPTextInput,
    CLIPVisualInput,
    CLIPZeroShotInput
)
from huggingface_hub import snapshot_download

# Download the model from HF
model_folder = snapshot_download(repo_id="neuralmagic/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-ds")

possible_classes = ["ice cream", "an elephant", "a dog", "a building", "a church"]
images = ["basilica.jpg", "buddy.jpeg", "thailand.jpg"]

# Load the model into DeepSparse
pipeline = Pipeline.create(
    task="clip_zeroshot",
    visual_model_path=model_folder + "/visual.onnx",
    text_model_path=model_folder + "/textual.onnx"
)

# Infer
output = pipeline(
    image=CLIPVisualInput(images=images),
    text=CLIPTextInput(text=possible_classes),
).text_scores

for i in range(len(output)):
    prediction = possible_classes[np.argmax(output[i])]
    print(f"Image {images[i]} is a picture of {prediction}")

"""
Image basilica.jpg is a picture of a church
Image buddy.jpeg is a picture of a dog
Image thailand.jpg is a picture of an elephant
"""
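For intuition about what `text_scores` represents, the following sketch scores one image embedding against several text embeddings using cosine similarity followed by a softmax. It illustrates the general CLIP zero-shot recipe rather than DeepSparse's actual implementation: the toy 3-dimensional vectors, the helper names, and the logit scale of 100 are all assumptions.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_scores(image_embedding, text_embeddings, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and each text."""
    image = normalize(image_embedding)
    logits = [
        logit_scale * sum(a * b for a, b in zip(image, normalize(text)))
        for text in text_embeddings
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings, made up for illustration; real ones come from the pipelines.
classes = ["a dog", "a church"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_embedding = [0.9, 0.1, 0.0]

scores = zero_shot_scores(image_embedding, text_embeddings)
best = max(range(len(scores)), key=scores.__getitem__)
print(classes[best])  # a dog
```

The class whose text embedding points in the direction closest to the image embedding receives the highest score, which is the same selection the `np.argmax` call performs in the pipeline example.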