Dimensions of the embeddings
Hey @RaphaelMourad, amazing work!
Btw, I tried extracting embeddings from a protein sequence using your model.
prot = "JUYTRFDCVBNJKLMNBHGV"
inputs = tokenizer(prot, return_tensors='pt')["input_ids"]
hidden_states = model(inputs.to("cuda"))[0] # [1, sequence_length, 256]
# embedding with max pooling
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape) # expect to be 256
But the output "embedding_max" shape was "torch.Size([1024])", which is far from 256. Is that because this is the 1.6B model?
(I assume you wrote 256 for the other, ~154M models.)
Thanks.
You guessed right: a bigger model means a bigger embedding dimension.
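For reference, the embedding dimension can also be read from the model config instead of being hard-coded. A minimal sketch, assuming the standard transformers loading API (the repo id below is only illustrative, use the checkpoint you actually downloaded):

from transformers import AutoModel

model = AutoModel.from_pretrained("RaphaelMourad/Mistral-Prot-v1-1.6B")  # illustrative repo id
# hidden_size is the width of the last hidden state, i.e. the embedding dimension
print(model.config.hidden_size)  # 1024 for the 1.6B model, matching the shape you saw

The last hidden state then has shape [batch, sequence_length, model.config.hidden_size], so the max-pooled vector always matches whatever checkpoint is loaded.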
Thanks for the timely reply, @RaphaelMourad.
However, while trying out the 1.6B version of the model, I ran into peculiar behavior: all the max-pooled embedding values are NaN.
torch.Size([1024])
tensor([nan, nan, nan, ..., nan, nan, nan], device='cuda:0', grad_fn=)
I loaded the model directly in 16-bit precision (in Colab) and ran the same example from the model card.
insulin = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
inputs = tokenizer(insulin, return_tensors='pt')["input_ids"]
hidden_states = model(inputs.to('cuda'))[0] # [1, sequence_length, 256]
# embedding with max pooling
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape) # expect to be 256
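For what it's worth, one way to check whether the 16-bit load is what produces the NaNs is to run the same forward pass under different dtypes and test the activations for NaN/inf. This is only a sketch: it assumes the usual torch_dtype argument of from_pretrained, and the repo id is illustrative.

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "RaphaelMourad/Mistral-Prot-v1-1.6B"  # illustrative repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)

insulin = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
inputs = tokenizer(insulin, return_tensors='pt')["input_ids"].to("cuda")

for dtype in (torch.float16, torch.bfloat16, torch.float32):
    model = AutoModel.from_pretrained(model_name, torch_dtype=dtype).to("cuda")
    with torch.no_grad():
        hidden_states = model(inputs)[0]
    # Report whether any activation became NaN or inf under this dtype
    print(dtype, torch.isnan(hidden_states).any().item(), torch.isinf(hidden_states).any().item())

If float16 produces NaNs while bfloat16 or float32 stay finite, that would point to a precision/overflow issue rather than a problem with the checkpoint itself.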
I did try 'Mistral-Prot-v1-417M', which gives non-NaN output. Also, how long did it take you to train this from scratch, and what kind of compute did you use?