Nomic Embeddings API vs Transformers output
I have this code:
import torch.nn.functional as F
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
img = Image.open("./image.jpg")
processor = AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1")
vision_model = AutoModel.from_pretrained(
    "nomic-ai/nomic-embed-vision-v1", trust_remote_code=True
)
inputs = processor([img], return_tensors="pt")
img_emb = vision_model(**inputs).last_hidden_state
img_embeddings = F.normalize(img_emb[:, 0], p=2, dim=1)
print(img_embeddings[0][:5])
This prints: tensor([-0.0672, -0.0483, -0.0122, -0.0547, -0.0542], grad_fn=<SliceBackward0>)
import nomic
from PIL import Image
from nomic import embed
nomic.cli.login(NOMIC_API_KEY)
output = embed.image(
    images=[img],
    model='nomic-embed-vision-v1',
)
print(output["embeddings"][0][:5])
This prints: [-0.06616211, -0.072265625, 0.002506256, -0.05718994, -0.04675293]
Both should produce vectors with similar values, right? What am I missing? And, like the outputs from the nomic case, how do we get values with more precision for the transformers case?
hmm thanks for raising this. when i deployed the models they had equivalent outputs but it seems like something's gone wrong. i will investigate and get back to you asap
The only immediate thing I would check is whether you get similar values when running the transformers model in fp16. The model we have running in production runs at that same precision.
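Something like this (a minimal sketch reusing the same image and model id as above; the only changes are torch_dtype and casting the pixel values to half precision):
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoImageProcessor
from PIL import Image

img = Image.open("./image.jpg")
processor = AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1")
vision_model = AutoModel.from_pretrained(
    "nomic-ai/nomic-embed-vision-v1",
    trust_remote_code=True,
    torch_dtype=torch.float16,  # load the weights in half precision
)

with torch.no_grad():
    inputs = processor([img], return_tensors="pt")
    # cast the float inputs (pixel values) to match the fp16 weights
    inputs = {k: v.half() if v.is_floating_point() else v for k, v in inputs.items()}
    img_emb = vision_model(**inputs).last_hidden_state
    img_embeddings = F.normalize(img_emb[:, 0], p=2, dim=1)

print(img_embeddings[0][:5])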
Ok, will dig in! apologies for this
@rajaiswal ok i believe i've identified an issue. it seems like something is going on when we upload the image via bytes to our API. trying to debug a bit more. a workaround (if possible) is to instead pass urls to the API and you should see similar results. The error should be around
np.abs(emb - nom_emb).min()=0.0
np.abs(emb - nom_emb).mean()=6.73e-05
np.abs(emb - nom_emb).max()=0.0003052
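(those numbers come from a comparison roughly like this sketch, where emb is the local transformers embedding and nom_emb is the API embedding, both as numpy arrays)
import numpy as np

# emb: embedding from the local transformers run; nom_emb: embedding from the API
emb = img_embeddings[0].detach().cpu().numpy()
nom_emb = np.array(output["embeddings"][0])

print(f"{np.abs(emb - nom_emb).min()=}")
print(f"{np.abs(emb - nom_emb).mean()=}")
print(f"{np.abs(emb - nom_emb).max()=}")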
do you mind sharing the image so I can test? it appears there may be a bug in our api, we’re working to fix it asap
@zpn
What I meant was - yes, if I use the image URL then both the transformers and the API produce the same results.
transformers output: tensor([-0.0201, 0.0056, -0.0255, -0.0168, -0.0528], dtype=torch.float16, grad_fn=<SliceBackward0>)
API output: [-0.020095825, 0.0056610107, -0.025756836, -0.016479492, -0.052612305]
Here is the image I am testing with url: https://m.media-amazon.com/images/M/MV5BZTc0ZjNkYTktMmJmOS00OTJlLTg1NWUtMzQ5ZGMxM2NhY2M0L2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyNTAyODkwOQ@@._V1_.jpg
But does this mean that the bug lies within Nomic's hosted API and not with the transformers implementation?
Oh! I realize it is a client-side "bug". If you upload the image directly, we resize it, since otherwise it would be too big for the request: https://github.com/nomic-ai/nomic/blob/main/nomic/embed.py#L342
The API and the transformers should be equivalent. When I've tested for a fixed input, they return nearly identical values.
I think the optimal solution is to use the URLs if possible
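If you want to see the effect of the resize locally, here is a hypothetical illustration of a client-side downscale before embedding. MAX_SIDE is a placeholder assumption, not the actual limit used in embed.py (see the link above for the real logic); processor, vision_model, and F are the same objects as in the snippet at the top of the thread.
from PIL import Image

MAX_SIDE = 512  # placeholder assumption, not the value the nomic client actually uses

img = Image.open("./image.jpg")
if max(img.size) > MAX_SIDE:
    img.thumbnail((MAX_SIDE, MAX_SIDE))  # in-place downscale, preserves aspect ratio

inputs = processor([img], return_tensors="pt")
img_emb = vision_model(**inputs).last_hidden_state
resized_emb = F.normalize(img_emb[:, 0], p=2, dim=1)
print(resized_emb[0][:5])  # will differ slightly from the full-resolution embedding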
@zpn
Okay, that explains the difference. Thanks for investigating! Also, is there any way we can get more precision using transformers, like we get from the API? For the text model I get less precision using transformers but more precision using sentence_transformers. Is there something like that for the vision model?