---
datasets:
- tattabio/OMG
license: apache-2.0
---
# gLM2_650M_embed

gLM2_650M_embed is a fine-tuned version of [`tattabio/gLM2_650M`](https://huggingface.co/tattabio/gLM2_650M) for embedding and retrieval. The model is fine-tuned in two stages:

- The first stage fine-tunes gLM2 for one epoch on UniRef50.
- The second stage trains an adapter layer to align mean-pooled representations with AlphaFold structural [clusters](https://www.nature.com/articles/s41586-023-06510-w); a rough sketch of this pooling-plus-adapter setup follows the list.
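
For intuition, the snippet below is a minimal sketch of that second-stage setup, assuming the adapter is a single linear projection applied to mean-pooled token states. The backbone hidden size of 1280 is an assumption; only the 512-dimensional output matches the shape shown in the usage example further down, and the actual adapter architecture may differ.

```python
import torch
import torch.nn as nn

class MeanPoolAdapter(nn.Module):
    """Hypothetical adapter: mean-pool token states, then project into the embedding space."""

    def __init__(self, hidden_size: int = 1280, embed_dim: int = 512):  # hidden_size is an assumption
        super().__init__()
        self.proj = nn.Linear(hidden_size, embed_dim)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size); attention_mask: (batch, seq_len)
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return self.proj(pooled)  # (batch, embed_dim)
```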
## Getting Started
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('tattabio/gLM2_650M_embed', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()
tokenizer = AutoTokenizer.from_pretrained('tattabio/gLM2_650M_embed', trust_remote_code=True)

# NOTE: Prepend with `<+>` to match gLM2 pre-training.
sequence = "<+>MALTKVEKRNRIKRRVRGKISGTQASPRLSVYKSNK"

# Tokenize the sequence.
encodings = tokenizer([sequence], return_tensors='pt')

# Extract embeddings.
with torch.no_grad():
    embeddings = model(encodings.input_ids.cuda()).pooler_output

print(embeddings.shape)  # torch.Size([1, 512])
```
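
Since the embeddings are intended for retrieval, they can be compared directly with cosine similarity. The snippet below is a minimal sketch that reuses the `model` and `tokenizer` loaded above; the candidate sequences are placeholders rather than curated examples.

```python
import torch
import torch.nn.functional as F

def embed(seq: str) -> torch.Tensor:
    """Return the pooled embedding for a single `<+>`-prefixed sequence."""
    enc = tokenizer([seq], return_tensors='pt')
    with torch.no_grad():
        return model(enc.input_ids.cuda()).pooler_output[0]

query = embed("<+>MALTKVEKRNRIKRRVRGKISGTQASPRLSVYKSNK")
candidates = [
    "<+>MALTKVEKRNRIKRRVRGKISGTQASPRLSVYKSNK",  # identical to the query (placeholder)
    "<+>MSTNPKPQRKTKRNTNRRPQDVKFPGG",           # unrelated placeholder sequence
]
candidate_embeddings = torch.stack([embed(seq) for seq in candidates])

# Rank candidates by cosine similarity to the query (higher = more similar).
scores = F.cosine_similarity(query.unsqueeze(0).float(), candidate_embeddings.float(), dim=-1)
print(scores)  # the identical sequence should score ~1.0
```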