lhallee committed
Commit e22ff6e · verified · 1 Parent(s): 8fbd71e

Update README.md

Files changed (1)
README.md +7 -3
README.md CHANGED
@@ -16,20 +16,24 @@ Various other optimizations also make the base implementation slightly different
## Use with 🤗 transformers
```python
import torch
- from transformers import AutoModel, AutoTokenizer
+ from transformers import AutoModelForMaskedLM, AutoTokenizer

model_path = 'Synthyra/FastESM2_650'
- model = AutoModel.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).eval()
+ model = AutoModelForMaskedLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

sequences = ['MPRTEIN', 'MSEQWENCE']
tokenized = tokenizer(sequences, padding=True, return_tensors='pt')
with torch.no_grad():
-     embeddings = model(**tokenized).last_hidden_state
+     embeddings = model(**tokenized, output_hidden_states=True).hidden_states[-1]

print(embeddings.shape) # (2, 11, 1280)
```
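The hidden states above are per-token. To reduce them to one embedding per sequence, a common follow-up is masked mean pooling; a minimal sketch continuing the snippet above (an editorial illustration, not part of this commit):

```python
# Mean-pool token embeddings, ignoring padding positions (continues the snippet above).
mask = tokenized['attention_mask'].unsqueeze(-1).to(embeddings.dtype)  # (2, 11, 1)
pooled = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)              # (2, 1280)
print(pooled.shape)  # torch.Size([2, 1280])
```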

+ Please note that FastESM does not currently work with AutoModel.
+ If you would like to train a model from scratch without a language modeling head, you can still use the base code, but if you load the weights with AutoModel they will not map correctly.
+ AutoModelForSequenceClassification and AutoModelForTokenClassification are working as intended.
+
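Loading one of the supported task heads follows the same pattern; a minimal sketch (an editorial illustration, not part of this commit; `num_labels=2` is an arbitrary choice, and the classification head is freshly initialized, so it needs fine-tuning before use):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = 'Synthyra/FastESM2_650'
# num_labels is illustrative; the head is randomly initialized until fine-tuned.
model = AutoModelForSequenceClassification.from_pretrained(
    model_path, num_labels=2, trust_remote_code=True
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

tokenized = tokenizer(['MPRTEIN'], return_tensors='pt')
with torch.no_grad():
    logits = model(**tokenized).logits
print(logits.shape)  # torch.Size([1, 2])
```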
## Embed entire datasets with no new code
To embed a list of protein sequences **fast**, just call embed_dataset. Sequences are sorted to reduce padding tokens, so the initial progress bar estimate is usually much longer than the actual runtime.
```python
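# Sketch of the call described above (an assumption, not a verified signature):
# embed_dataset ships with the model's remote code; check the model card for
# the full list of keyword arguments.
sequences = ['MPRTEIN', 'MSEQWENCE']
embeddings = model.embed_dataset(sequences)
```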