Update README.md
README.md
CHANGED
@@ -51,7 +51,7 @@ To load the model:
 ```python
 import uform
 
-model = uform.get_model_onnx('unum-cloud/uform-vl-english-small', device='gpu', dtype='fp16')
+model, processor = uform.get_model_onnx('unum-cloud/uform-vl-english-small', device='gpu', dtype='fp16')
 ```
 
 To encode data:
@@ -62,11 +62,11 @@ from PIL import Image
 text = 'a small red panda in a zoo'
 image = Image.open('red_panda.jpg')
 
-image_data =
-text_data =
+image_data = processor.preprocess_image(image)
+text_data = processor.preprocess_text(text)
 
-image_embedding = model.encode_image(image_data)
-text_embedding = model.encode_text(text_data)
+image_features, image_embedding = model.encode_image(image_data, return_features=True)
+text_features, text_embedding = model.encode_text(text_data, return_features=True)
 score, joint_embedding = model.encode_multimodal(
     image_features=image_features,
     text_features=text_features,
@@ -75,33 +75,10 @@ score, joint_embedding = model.encode_multimodal(
 )
 ```
 
-
-
-```python
-image_features, image_embedding = model.encode_image(image_data, return_features=True)
-text_features, text_embedding = model.encode_text(text_data, return_features=True)
-```
-
-These features can later be used to produce joint multimodal encodings faster, as the first layers of the transformer can be skipped:
-
-```python
-joint_embedding = model.encode_multimodal(
-    image_features=image_features,
-    text_features=text_features,
-    attention_mask=text_data['attention_mask']
-)
-```
-
-There are two options to calculate semantic compatibility between an image and a text: [Cosine Similarity](#cosine-similarity) and [Matching Score](#matching-score).
+There are two options to calculate semantic compatibility between an image and a text: cosine similarity and [Matching Score](#matching-score).
 
 ### Cosine Similarity
 
-```python
-import torch.nn.functional as F
-
-similarity = F.cosine_similarity(image_embedding, text_embedding)
-```
-
 The `similarity` will belong to the `[-1, 1]` range, `1` meaning the absolute match.
 
 __Pros__: