---
datasets:
- tattabio/OMG
license: apache-2.0
---
# gLM2_650M_embed

gLM2_650M_embed is a fine-tuned version of [`tattabio/gLM2_650M`](https://huggingface.co/tattabio/gLM2_650M) for embedding and retrieval. The model is fine-tuned in two stages:

- The first stage fine-tunes gLM2 for one epoch on UniRef50.
- The second stage trains an adapter layer to align mean-pooled representations with AlphaFold structural [clusters](https://www.nature.com/articles/s41586-023-06510-w); a rough sketch of this pooling-plus-adapter setup follows the list.
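
For intuition, the snippet below is a minimal sketch of that second-stage setup, assuming the adapter is a single linear projection applied to mean-pooled token states. The backbone hidden size of 1280 is an assumption; only the 512-dimensional output matches the shape shown in the usage example further down, and the actual adapter architecture may differ.

```python
import torch
import torch.nn as nn

class MeanPoolAdapter(nn.Module):
    """Hypothetical adapter: mean-pool token states, then project into the embedding space."""

    def __init__(self, hidden_size: int = 1280, embed_dim: int = 512):  # hidden_size is an assumption
        super().__init__()
        self.proj = nn.Linear(hidden_size, embed_dim)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size); attention_mask: (batch, seq_len)
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return self.proj(pooled)  # (batch, embed_dim)
```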
## Getting Started
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('tattabio/gLM2_650M_embed', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()
tokenizer = AutoTokenizer.from_pretrained('tattabio/gLM2_650M_embed', trust_remote_code=True)

# NOTE: Prepend with `<+>` to match gLM2 pre-training.
sequence = "<+>MALTKVEKRNRIKRRVRGKISGTQASPRLSVYKSNK"

# Tokenize the sequence.
encodings = tokenizer([sequence], return_tensors='pt')

# Extract embeddings.
with torch.no_grad():
    embeddings = model(encodings.input_ids.cuda()).pooler_output

print(embeddings.shape)  # torch.Size([1, 512])
```
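
Since the embeddings are intended for retrieval, they can be compared directly with cosine similarity. The snippet below is a minimal sketch that reuses the `model` and `tokenizer` loaded above; the candidate sequences are placeholders rather than curated examples.

```python
import torch
import torch.nn.functional as F

def embed(seq: str) -> torch.Tensor:
    """Return the pooled embedding for a single `<+>`-prefixed sequence."""
    enc = tokenizer([seq], return_tensors='pt')
    with torch.no_grad():
        return model(enc.input_ids.cuda()).pooler_output[0]

query = embed("<+>MALTKVEKRNRIKRRVRGKISGTQASPRLSVYKSNK")
candidates = [
    "<+>MALTKVEKRNRIKRRVRGKISGTQASPRLSVYKSNK",  # identical to the query (placeholder)
    "<+>MSTNPKPQRKTKRNTNRRPQDVKFPGG",           # unrelated placeholder sequence
]
candidate_embeddings = torch.stack([embed(seq) for seq in candidates])

# Rank candidates by cosine similarity to the query (higher = more similar).
scores = F.cosine_similarity(query.unsqueeze(0).float(), candidate_embeddings.float(), dim=-1)
print(scores)  # the identical sequence should score ~1.0
```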