Divyasreepat committed
Commit 9e25465 · verified · 1 Parent(s): c54ed21

Update README.md with new model card content

Files changed (1)
  1. README.md +65 -51
README.md CHANGED
@@ -11,17 +11,20 @@ Weights are released under the [MIT License](https://opensource.org/license/mit)

## Links

- * [CLIP Quickstart Notebook](https://www.kaggle.com/code/divyasss/clip-quickstart-single-shot-classification)
- * [CLIP API Documentation](https://keras.io/api/keras_cv/models/clip/)
* [CLIP Model Card](https://huggingface.co/docs/transformers/en/model_doc/clip)

## Installation

- Keras and KerasCV can be installed with:

```
- pip install -U -q keras-cv
- pip install -U -q keras>=3
```

Jax, TensorFlow, and Torch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment see the [Keras Getting Started](https://keras.io/getting_started/) page.
@@ -35,53 +38,64 @@ The following model checkpoints are provided by the Keras team. Full code exampl
| clip-vit-base-patch32 | 151.28M | The model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 32 and input images of size (224, 224) |
| clip-vit-large-patch14 | 427.62M | The model uses a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (224, 224) |
| clip-vit-large-patch14-336 | 427.94M | The model uses a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (336, 336) |

- ## Example code
- ```
- from keras import ops
import keras
- from keras_cv.models.feature_extractor.clip import CLIPProcessor
- from keras_cv.models import CLIP
-
- processor = CLIPProcessor("vocab.json", "merges.txt")
- # processed_image = transform_image("cat.jpg", 224)
- tokens = processor(["mountains", "cat on tortoise", "house"])
- model = CLIP.from_preset("clip-vit-base-patch32")
- output = model({
-     "images": processed_image,
-     "token_ids": tokens['token_ids'],
-     "padding_mask": tokens['padding_mask']})
-
-
- # optional if you need to pre process image
- def transform_image(image_path, input_resolution):
-     mean = ops.array([0.48145466, 0.4578275, 0.40821073])
-     std = ops.array([0.26862954, 0.26130258, 0.27577711])
-
-     image = keras.utils.load_img(image_path)
-     image = keras.utils.img_to_array(image)
-     image = (
-         ops.image.resize(
-             image,
-             (input_resolution, input_resolution),
-             interpolation="bicubic",
-         )
-         / 255.0
-     )
-     central_fraction = input_resolution / image.shape[0]
-     width, height = image.shape[0], image.shape[1]
-     left = ops.cast((width - width * central_fraction) / 2, dtype="int32")
-     top = ops.cast((height - height * central_fraction) / 2, dtype="int32")
-     right = ops.cast((width + width * central_fraction) / 2, dtype="int32")
-     bottom = ops.cast(
-         (height + height * central_fraction) / 2, dtype="int32"
-     )
-
-     image = ops.slice(
-         image, [left, top, 0], [right - left, bottom - top, 3]
-     )
-
-     image = (image - mean) / std
-     return ops.expand_dims(image, axis=0)
```

## Links

+ * [CLIP Quickstart Notebook](https://www.kaggle.com/code/laxmareddypatlolla/clip-quickstart-notebook)
+ * [CLIP API Documentation](https://keras.io/keras_hub/api/models/clip/)
* [CLIP Model Card](https://huggingface.co/docs/transformers/en/model_doc/clip)
+ * [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/)
+ * [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/)

## Installation

+ Keras and KerasHub can be installed with:

```
+ pip install -U -q keras-hub
+ pip install -U -q keras
+
```

Jax, TensorFlow, and Torch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment see the [Keras Getting Started](https://keras.io/getting_started/) page.
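Keras 3 runs the same code on any of those backends. As a minimal sketch (the backend chosen below is just an example), the backend is selected through the `KERAS_BACKEND` environment variable before Keras is imported:

```python
import os

# pick the backend before the first `import keras`;
# "jax" here is only an example, "tensorflow" and "torch" also work
os.environ["KERAS_BACKEND"] = "jax"

import keras

# confirm which backend is active
print(keras.backend.backend())
```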
 
| clip-vit-base-patch32 | 151.28M | The model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 32 and input images of size (224, 224) |
| clip-vit-large-patch14 | 427.62M | The model uses a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (224, 224) |
| clip-vit-large-patch14-336 | 427.94M | The model uses a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The model uses a patch size of 14 and input images of size (336, 336) |
+ | clip_vit_b_32_laion2b_s34b_b79k | 151.28M | 151 million parameter, 12-layer for vision and 12-layer for text, patch size of 32, Open CLIP model. |
+ | clip_vit_h_14_laion2b_s32b_b79k | 986.11M | 986 million parameter, 32-layer for vision and 24-layer for text, patch size of 14, Open CLIP model. |
+ | clip_vit_g_14_laion2b_s12b_b42k | 1.37B | 1.4 billion parameter, 40-layer for vision and 24-layer for text, patch size of 14, Open CLIP model. |
+ | clip_vit_bigg_14_laion2b_39b_b160k | 2.54B | 2.5 billion parameter, 48-layer for vision and 32-layer for text, patch size of 14, Open CLIP model. |
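Any preset name from this table can be passed to `from_preset`. As a small illustrative check (the preset picked here is just an example), the parameter counts in the table can be verified with the standard Keras `count_params()` method:

```python
from keras_hub.models import CLIPBackbone

# illustrative only: load one of the smaller presets listed above
backbone = CLIPBackbone.from_preset("clip_vit_b_32_laion2b_s34b_b79k")

# should report roughly 151M parameters, matching the table entry
print(f"{backbone.count_params() / 1e6:.2f}M parameters")
```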

+ ## Example Usage
+ ```python
import keras
+ import numpy as np
+ import matplotlib.pyplot as plt
+ from keras_hub.models import CLIPBackbone, CLIPTokenizer
+ from keras_hub.layers import CLIPImageConverter
+
+ # instantiate the model and preprocessing tools
+ clip = CLIPBackbone.from_preset("clip_vit_large_patch14_336")
+ tokenizer = CLIPTokenizer.from_preset("clip_vit_large_patch14_336",
+                                       sequence_length=5)
+ image_converter = CLIPImageConverter.from_preset("clip_vit_large_patch14_336")
+
+ # obtain tokens for some input text
+ tokens = tokenizer.tokenize(["mountains", "cat on tortoise", "house"])
+
+ # preprocess the image
+ image = keras.utils.load_img("cat.jpg")
+ image = image_converter(np.array([image]).astype(float))
+
+ # query the model for similarities
+ clip({
+     "images": image,
+     "token_ids": tokens,
+ })
```
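The call above returns image-text similarity scores. A rough sketch of turning them into per-caption probabilities follows; the `"vision_logits"` output key is an assumption here, so check the keys of the dict the backbone actually returns:

```python
from keras import ops

# illustrative only: assumes the backbone output dict has a "vision_logits"
# entry of shape (num_images, num_captions) with similarity scores
outputs = clip({"images": image, "token_ids": tokens})
probs = ops.softmax(outputs["vision_logits"], axis=-1)

captions = ["mountains", "cat on tortoise", "house"]
best = int(ops.argmax(probs[0]))
print(f"best caption: {captions[best]!r}, probabilities: {probs[0]}")
```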

+ ## Example Usage with Hugging Face URI
+
+ ```python
+ import keras
+ import numpy as np
+ import matplotlib.pyplot as plt
+ from keras_hub.models import CLIPBackbone, CLIPTokenizer
+ from keras_hub.layers import CLIPImageConverter
+
+ # instantiate the model and preprocessing tools
+ clip = CLIPBackbone.from_preset("hf://keras/clip_vit_large_patch14_336")
+ tokenizer = CLIPTokenizer.from_preset("hf://keras/clip_vit_large_patch14_336",
+                                       sequence_length=5)
+ image_converter = CLIPImageConverter.from_preset("hf://keras/clip_vit_large_patch14_336")
+
+ # obtain tokens for some input text
+ tokens = tokenizer.tokenize(["mountains", "cat on tortoise", "house"])
+
+ # preprocess the image
+ image = keras.utils.load_img("cat.jpg")
+ image = image_converter(np.array([image]).astype(float))
+
+ # query the model for similarities
+ clip({
+     "images": image,
+     "token_ids": tokens,
+ })
+ ```
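The KerasHub Model Publishing Guide linked above covers re-uploading weights to your own Hugging Face repo. A rough sketch, assuming `save_to_preset` and `keras_hub.upload_preset` behave as described in that guide (the repo name below is a hypothetical placeholder):

```python
import keras_hub

# rough sketch: save the (possibly fine-tuned) backbone as a local preset directory
clip.save_to_preset("./my_clip_preset")

# then push it to a Hugging Face repo you own; "your-username/my-clip" is a
# hypothetical placeholder, and an authenticated huggingface_hub login is required
keras_hub.upload_preset("hf://your-username/my-clip", "./my_clip_preset")
```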