---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- transformers
- biology
---
# Protein Matryoshka Embeddings
This model generates embeddings for input protein sequences. It was trained with a Matryoshka loss, so the embeddings can be truncated to shorter lengths for faster search and other tasks (see the truncation sketch under Usage below).
Inputs use IUPAC-IUB single-letter codes, where the letters A-Z map to amino acids and residues are separated by spaces. For example:

```
"M A R N W S F R V"
```
The base model is Rostlab/prot_bert_bfd. A sentence-transformers model was then trained on the cosine similarity of protein embeddings from UniProt. The train/test/validation datasets of embeddings and distances are available at: https://huggingface.co/datasets/monsoon-nlp/protein-pairs-uniprot-swissprot
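The pairs can be loaded with the `datasets` library; a sketch, with split and column names left for you to confirm against the dataset card:

```python
from datasets import load_dataset

# Protein-pair similarity data used for training/evaluation
pairs = load_dataset("monsoon-nlp/protein-pairs-uniprot-swissprot")
print(pairs)               # available splits
print(pairs["train"][0])   # inspect one record (split/column names may differ)
```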
## Usage

Install these dependencies:

```bash
pip install -U sentence-transformers datasets
```
Generating embeddings:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("monsoon-nlp/protein-matryoshka-embeddings")

# Space-separated amino acid sequences (IUPAC-IUB single-letter codes)
sequences = ["M S L E Q K...", "M A R N W S F R V..."]

embeddings = model.encode(sequences)
print(embeddings)
```
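Because of the Matryoshka loss, the leading dimensions of each embedding carry most of the signal, so vectors can be truncated and re-normalized for cheaper similarity search. A minimal sketch; the 256-dimension cut-off is an arbitrary illustration, not a recommendation from this card:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("monsoon-nlp/protein-matryoshka-embeddings")
sequences = ["M S L E Q K...", "M A R N W S F R V..."]

# Keep only the first 256 dimensions, then re-normalize to unit length
full = model.encode(sequences)
short = full[:, :256]
short = short / np.linalg.norm(short, axis=1, keepdims=True)

# Cosine similarity between the two shortened embeddings
print(float(short[0] @ short[1]))
```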
## Training + Code
Colab notebook: https://colab.research.google.com/drive/1uBk-jHOAPhIiUPPunfK7bMC8GnzpwmBy?usp=sharing
Results on 1,000 protein pairs from the validation dataset, during training:
| steps | cosine_pearson | cosine_spearman |
|---|---|---|
| 3000 | 0.8599 | 0.8667 |
| 6000 | 0.8693 | 0.8616 |
| 9000 | 0.8780 | 0.8754 |
| 12000 | 0.8877 | 0.8881 |
| 15000 | 0.9027 | 0.8991 |
| 18000 | 0.9047 | 0.9044 |
| 21000 | 0.9166 | 0.9061 |
| 24000 | 0.9128 | 0.9077 |
| 27000 | 0.9185 | 0.9128 |
| 30000 | 0.9239 | 0.9187 |
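These metrics match what sentence-transformers' `EmbeddingSimilarityEvaluator` reports. A sketch of running a similar evaluation on held-out pairs; the split and column names (`protein1`, `protein2`, `similarity`) are assumptions to check against the dataset card:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("monsoon-nlp/protein-matryoshka-embeddings")

# Hypothetical split/column names -- verify against the dataset card
pairs = load_dataset("monsoon-nlp/protein-pairs-uniprot-swissprot", split="validation[:1000]")
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=pairs["protein1"],
    sentences2=pairs["protein2"],
    scores=pairs["similarity"],
    name="protein-pairs-validation",
)
print(evaluator(model))  # correlation of cosine similarity vs. labeled scores
```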
## Future
This model will be updated when I have examples of using it on protein classification tasks.

I'm also interested in whether embedding quantization could make storage and search even more efficient (see the sketch below).

If you want to collaborate on future projects or have resources to train longer on more embeddings, please get in touch.
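For reference, sentence-transformers ships a `quantize_embeddings` helper; this is a minimal, unbenchmarked sketch of int8 quantization, not something this card has evaluated:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("monsoon-nlp/protein-matryoshka-embeddings")
sequences = ["M S L E Q K...", "M A R N W S F R V..."]

# Quantize float32 embeddings to int8 to shrink index size
embeddings = model.encode(sequences)
int8_embeddings = quantize_embeddings(embeddings, precision="int8")
print(embeddings.dtype, int8_embeddings.dtype)
```

In practice you would pass a larger `calibration_embeddings` set to `quantize_embeddings` so the int8 ranges are representative of the corpus.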