---
datasets:
- tattabio/OMG
license: apache-2.0
---

# gLM2_650M_embed

gLM2_650M_embed is a fine-tuned version of [`tattabio/gLM2_650M`](https://huggingface.co/tattabio/gLM2_650M) for embedding and retrieval. Training proceeds in two stages:

- The first stage fine-tunes gLM2 over one epoch of UniRef50.
- The second stage trains an adapter layer to align mean-pooled representations with AlphaFold structural [clusters](https://www.nature.com/articles/s41586-023-06510-w).

## Getting Started

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    'tattabio/gLM2_650M_embed',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda()
tokenizer = AutoTokenizer.from_pretrained('tattabio/gLM2_650M_embed', trust_remote_code=True)

# NOTE: Prepend with `<+>` to match gLM2 pre-training.
sequence = "<+>MALTKVEKRNRIKRRVRGKISGTQASPRLSVYKSNK"

# Tokenize the sequence.
encodings = tokenizer([sequence], return_tensors='pt')

# Extract embeddings.
with torch.no_grad():
    embeddings = model(encodings.input_ids.cuda()).pooler_output

print(embeddings.shape)  # torch.Size([1, 512])
```
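For retrieval, the pooled embeddings can be compared with cosine similarity. The sketch below is a minimal illustration that reuses the `model` and `tokenizer` loaded above; the query and database sequences and the `embed` helper are hypothetical examples, not part of the model card.

```python
import torch
import torch.nn.functional as F

def embed(seq: str) -> torch.Tensor:
    """Embed one `<+>`-prefixed sequence (one at a time, to sidestep padding)."""
    enc = tokenizer([seq], return_tensors='pt')
    with torch.no_grad():
        return model(enc.input_ids.cuda()).pooler_output  # shape: [1, 512]

# Hypothetical query and database sequences, for illustration only.
query = embed("<+>MALTKVEKRNRIKRRVRGKISGTQASPRLSVYKSNK")
database = torch.cat([
    embed("<+>MSKVEKRNRIKRRVRGKISGTQASPRLSVYKSNK"),
    embed("<+>MEEPQSDPSVEPPLSQETFSDLWKLLPEN"),
])

# Cosine similarity between the query and each database embedding;
# higher scores indicate more similar sequences.
scores = F.cosine_similarity(query, database)  # shape: [2]
print(scores)
```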