unum-cloud/uform-vl-english-small-gpu-fp16

kimihailv committed on Mar 28

Commit f884fc9 • 1 Parent(s): c77c978

Update README.md

Files changed (1)

README.md +6 -29

@@ -51,7 +51,7 @@ To load the model:
 ```python
 import uform
-model = uform.get_model_onnx('unum-cloud/uform-vl-english-small', device='gpu', dtype='fp16')
 ```
 To encode data:
@@ -62,11 +62,11 @@ from PIL import Image
 text = 'a small red panda in a zoo'
 image = Image.open('red_panda.jpg')
-image_data = model.preprocess_image(image)
-text_data = model.preprocess_text(text)
-image_embedding = model.encode_image(image_data)
-text_embedding = model.encode_text(text_data)
 score, joint_embedding = model.encode_multimodal(
     image_features=image_features,
     text_features=text_features,
@@ -75,33 +75,10 @@ score, joint_embedding = model.encode_multimodal(
 )
 ```
-To get features:
-```python
-image_features, image_embedding = model.encode_image(image_data, return_features=True)
-text_features, text_embedding = model.encode_text(text_data, return_features=True)
-```
-These features can later be used to produce joint multimodal encodings faster, as the first layers of the transformer can be skipped:
-```python
-joint_embedding = model.encode_multimodal(
-    image_features=image_features,
-    text_features=text_features,
-    attention_mask=text_data['attention_mask']
-)
-```
-There are two options to calculate semantic compatibility between an image and a text: [Cosine Similarity](#cosine-similarity) and [Matching Score](#matching-score).
 ### Cosine Similarity
-```python
-import torch.nn.functional as F
-similarity = F.cosine_similarity(image_embedding, text_embedding)
-```
 The `similarity` will belong to the `[-1, 1]` range, `1` meaning the absolute match.
 __Pros__:

 ```python
 import uform
+model, processor = uform.get_model_onnx('unum-cloud/uform-vl-english-small', device='gpu', dtype='fp16')
 ```
 To encode data:
 text = 'a small red panda in a zoo'
 image = Image.open('red_panda.jpg')
+image_data = processor.preprocess_image(image)
+text_data = processor.preprocess_text(text)
+image_features, image_embedding = model.encode_image(image_data, return_features=True)
+text_features, text_embedding = model.encode_text(text_data, return_features=True)
 score, joint_embedding = model.encode_multimodal(
     image_features=image_features,
     text_features=text_features,
 )
 ```
+There are two options to calculate semantic compatibility between an image and a text: cosine similarity and [Matching Score](#matching-score).
 ### Cosine Similarity
 The `similarity` will belong to the `[-1, 1]` range, `1` meaning the absolute match.
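
The README text refers to a `similarity` value, which the pre-commit version computed with `torch.nn.functional.cosine_similarity`. A minimal NumPy sketch of the same computation follows, assuming the embeddings are array-like with shape `(dim,)` or `(batch, dim)`; the helper name `cosine_similarity` and its `eps` parameter are illustrative and not part of `uform`.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity between two sets of embeddings.

    Hypothetical helper, not part of uform. Inputs are cast to float32
    so that fp16 embeddings do not lose precision in the norms.
    """
    a = np.atleast_2d(np.asarray(a, dtype=np.float32))
    b = np.atleast_2d(np.asarray(b, dtype=np.float32))
    # Clamp the norms to avoid division by zero for all-zero vectors.
    a_norm = np.maximum(np.linalg.norm(a, axis=-1), eps)
    b_norm = np.maximum(np.linalg.norm(b, axis=-1), eps)
    return np.sum(a * b, axis=-1) / (a_norm * b_norm)

# Usage, assuming the embeddings from the snippet above are array-like:
# similarity = cosine_similarity(image_embedding, text_embedding)
```

The result has one value per row, each in `[-1, 1]`.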
 __Pros__: