	---
	pipeline_tag: zero-shot-classification
	base_model: laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K
	inference: false
	tags:
	- deepsparse
	---
	This is an unoptimized, exported version of https://huggingface.co/laion/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K that is ready to use with [DeepSparse](https://github.com/neuralmagic/deepsparse). It achieves 95.7% zero-shot top-1 accuracy on Imagenette.

	Notebook for basic usage: [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ZvU9ZSHJKSeJyH5bgxo_A-GSVIUcSt2E?usp=sharing)
	Notebook for Imagenette evaluation: [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1-Duq0YNtjzOnmuXCYo-5DDiOzeCItXpN?usp=sharing)

	## Setup for usage
	First, install DeepSparse with extensions for CLIP:
	```
	pip install "deepsparse-nightly[clip]>=1.7.0.20231210"
	```

	Download some test images of a church, a dog, and elephants:
	```
	wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
	wget -O buddy.jpeg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/tests/deepsparse/pipelines/sample_images/buddy.jpeg
	wget -O thailand.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolact/sample_images/thailand.jpg
	```

	This model takes a second input, the token length of each text sequence, so apply the following input override before creating a text pipeline:
	```python
	import numpy as np
	from deepsparse.clip import CLIPTextPipeline

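	def custom_process_inputs(self, inputs):
	    # Accept a single string as well as a list of strings
	    if not isinstance(inputs.text, list):
	        inputs.text = [inputs.text]
	    # Non-string inputs are returned as-is (assumed to be already tokenized)
	    if not isinstance(inputs.text[0], str):
	        return inputs.text
	    tokens = [np.array(t).astype(np.int32) for t in self.tokenizer(inputs.text)]
	    tokens = np.stack(tokens, axis=0)
	    # Second model input: one value per sequence, equal to the context length minus one
	    tokens_lengths = np.array(tokens.shape[0] * [tokens.shape[1] - 1])
	    return [tokens, tokens_lengths]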

	# This overrides the process_inputs function globally for all CLIPTextPipeline classes
	CLIPTextPipeline.process_inputs = custom_process_inputs
	```
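
To make the second input concrete, the sketch below builds the two arrays that the override returns, using a made-up token batch in place of the real tokenizer. The batch size of 3 and the context length of 77 are assumptions for illustration; 77 is the usual CLIP context length, but it is not confirmed here for this export.

```python
import numpy as np

# Hypothetical stand-in for tokenizer output: 3 prompts, context length 77.
# Real token ids would come from the pipeline's tokenizer.
batch_size, context_length = 3, 77
tokens = np.zeros((batch_size, context_length), dtype=np.int32)

# Same expression as in the override: one entry per prompt,
# each equal to the context length minus one.
tokens_lengths = np.array(tokens.shape[0] * [tokens.shape[1] - 1])

print(tokens.shape)    # (3, 77)
print(tokens_lengths)  # [76 76 76]
```

The model therefore receives a 2-D token array plus a 1-D array with one length value per prompt.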

	## Text embedding pipeline

	Here is an example of how to create and use a [DeepSparse pipeline for text embeddings](https://github.com/neuralmagic/deepsparse/blob/main/src/deepsparse/clip/text_pipeline.py).
	```python
	from deepsparse import Pipeline
	from huggingface_hub import snapshot_download

	# Download the model from HF
	model_folder = snapshot_download(repo_id="neuralmagic/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-quant-ds")

	text_embed_pipeline = Pipeline.create(task="clip_text", model_path=model_folder + "/textual.onnx")

	text = ["ice cream", "an elephant", "a dog", "a building", "a church"]

	embeddings = text_embed_pipeline(text=text).text_embeddings
	for i in range(len(embeddings)):
	    print(embeddings[i].shape)
	    print(embeddings[i])
	```

	## Image embedding pipeline

	Here is an example of how to create and use a [DeepSparse pipeline for image embeddings](https://github.com/neuralmagic/deepsparse/blob/main/src/deepsparse/clip/visual_pipeline.py).
	```python
	from deepsparse import Pipeline
	from huggingface_hub import snapshot_download

	# Download the model from HF
	model_folder = snapshot_download(repo_id="neuralmagic/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-quant-ds")

	image_embed_pipeline = Pipeline.create(task="clip_visual", model_path=model_folder + "/visual.onnx")

	images = ["basilica.jpg", "buddy.jpeg", "thailand.jpg"]

	embeddings = image_embed_pipeline(images=images).image_embeddings
	for i in range(len(embeddings)):
	    print(embeddings[i].shape)
	    print(embeddings[i])
	```
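
Embeddings from either pipeline can be compared with cosine similarity, for example to find near-duplicate images or related prompts. The following is a minimal sketch that uses random vectors in place of real pipeline outputs. The helper name `cosine_similarity_matrix` and the embedding width of 512 are assumptions for illustration, and the sketch assumes each embedding can be flattened to a 1-D vector.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Return pairwise cosine similarities for a list of embedding arrays.

    Hypothetical helper for illustration; not part of DeepSparse.
    """
    # Flatten each embedding to 1-D and stack into shape (n, dim)
    matrix = np.stack(
        [np.asarray(e, dtype=np.float64).reshape(-1) for e in embeddings]
    )
    # L2-normalize rows so that a dot product equals cosine similarity
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ matrix.T

# Random stand-ins for three pipeline embeddings (512 is an assumed width)
rng = np.random.default_rng(0)
fake_embeddings = [rng.normal(size=(1, 512)) for _ in range(3)]

similarity = cosine_similarity_matrix(fake_embeddings)
print(similarity.shape)  # (3, 3)
```

Entry `[i, j]` of the result is the similarity between embedding `i` and embedding `j`, so the diagonal is 1.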

	## Zero-shot image classification pipeline

	Because CLIP trains its text and image embedding models in tandem, we can generate embeddings with both and compare them directly without retraining. Here is an example of how to create and use a [DeepSparse pipeline for zero-shot image classification](https://github.com/neuralmagic/deepsparse/blob/main/src/deepsparse/clip/zeroshot_pipeline.py).
	```python
	import numpy as np
	from deepsparse import Pipeline
	from deepsparse.clip import (
	    CLIPTextInput,
	    CLIPVisualInput,
	    CLIPZeroShotInput,
	)
	from huggingface_hub import snapshot_download

	# Download the model from HF
	model_folder = snapshot_download(repo_id="neuralmagic/CLIP-ViT-B-32-256x256-DataComp-s34B-b86K-quant-ds")

	possible_classes = ["ice cream", "an elephant", "a dog", "a building", "a church"]
	images = ["basilica.jpg", "buddy.jpeg", "thailand.jpg"]

	# Load the model into DeepSparse
	pipeline = Pipeline.create(
	    task="clip_zeroshot",
	    visual_model_path=model_folder + "/visual.onnx",
	    text_model_path=model_folder + "/textual.onnx"
	)

	# Infer
	output = pipeline(
	    image=CLIPVisualInput(images=images),
	    text=CLIPTextInput(text=possible_classes),
	).text_scores

	for i in range(len(output)):
	    prediction = possible_classes[np.argmax(output[i])]
	    print(f"Image {images[i]} is a picture of {prediction}")

	"""
	Image basilica.jpg is a picture of a church
	Image buddy.jpeg is a picture of a dog
	Image thailand.jpg is a picture of an elephant
	"""
	```
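
For intuition about how text and image embeddings can be related, here is a sketch of the usual CLIP scoring recipe: L2-normalize both sets of embeddings, take scaled dot products, and apply a softmax over the classes. This illustrates the general technique and is not a statement of what the `clip_zeroshot` pipeline does internally. The function name `zero_shot_scores`, the scale of 100, and the toy 2-D vectors are all assumptions for illustration.

```python
import numpy as np

def zero_shot_scores(image_embeddings, text_embeddings, scale=100.0):
    """Hypothetical helper: per-image probabilities over text classes."""
    image = np.asarray(image_embeddings, dtype=np.float64)
    text = np.asarray(text_embeddings, dtype=np.float64)
    # Normalize so that dot products are cosine similarities
    image = image / np.linalg.norm(image, axis=-1, keepdims=True)
    text = text / np.linalg.norm(text, axis=-1, keepdims=True)
    logits = scale * image @ text.T  # shape (n_images, n_classes)
    # Numerically stable softmax over the class axis
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy 2-D embeddings, chosen by hand for illustration
classes = ["a dog", "a church"]
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
image_embeddings = [[0.9, 0.1], [0.2, 0.8]]

scores = zero_shot_scores(image_embeddings, text_embeddings)
for row in scores:
    print(classes[int(np.argmax(row))])
# prints "a dog", then "a church"
```

Each row of `scores` sums to 1, and the highest entry in a row selects the predicted class, which mirrors the `np.argmax` step in the example above.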