Update README.md
README.md CHANGED
@@ -16,20 +16,24 @@ Various other optimizations also make the base implementation slightly different
## Use with 🤗 transformers
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_path = 'Synthyra/FastESM2_650'
model = AutoModelForMaskedLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

sequences = ['MPRTEIN', 'MSEQWENCE']
tokenized = tokenizer(sequences, padding=True, return_tensors='pt')
with torch.no_grad():
    embeddings = model(**tokenized, output_hidden_states=True).hidden_states[-1]

print(embeddings.shape) # (2, 11, 1280)
```

Please note that FastESM does not currently work with AutoModel. If you would like to train a model from scratch without a language modeling head, you can still use the base code, but if you load the weights with AutoModel they will not map correctly. AutoModelForSequenceClassification and AutoModelForTokenClassification work as intended.
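For classification fine-tuning, the task-specific auto classes are therefore the path to use. Below is a minimal sketch of sequence classification, assuming the standard 🤗 transformers interface; `num_labels=2` is an arbitrary placeholder for your own downstream task, and the classification head is freshly initialized, so its outputs are meaningless until you fine-tune it:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = 'Synthyra/FastESM2_650'
# num_labels=2 is a placeholder; set it to the number of classes in your task.
model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    num_labels=2,
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

tokenized = tokenizer(['MPRTEIN'], return_tensors='pt')
with torch.no_grad():
    # Untrained classification head: the logits carry no signal until fine-tuned.
    logits = model(**tokenized).logits

print(logits.shape)  # (1, 2)
```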
## Embed entire datasets with no new code
To embed a list of protein sequences **fast**, just call `embed_dataset`. Sequences are sorted to reduce padding tokens, so the initial progress bar estimate is usually much longer than the actual runtime.
```python