library_name: sentence-transformers
pipeline_tag: sentence-similarity
datasets:
  - monsoon-nlp/protein-pairs-uniprot-swissprot
tags:
  - sentence-transformers
  - sentence-similarity
  - transformers
  - biology
license: cc
base_model: Rostlab/prot_bert_bfd

Protein Matryoshka Embeddings

The model generates an embedding for input proteins. It was trained using Matryoshka loss, so shortened embeddings can be used for faster search and other tasks.

Inputs use IUPAC-IUB one-letter amino acid codes (uppercase letters A-Z), with residues separated by spaces. For example:

"M A R N W S F R V"

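Sequences copied from FASTA files or databases are usually written without spaces. A minimal sketch of a helper that converts them to the spaced format shown above; `format_sequence` is a hypothetical name and not part of this model or of sentence-transformers:

```python
def format_sequence(raw: str) -> str:
    """Turn a contiguous one-letter sequence into space-separated residues."""
    # Drop spaces and newlines (e.g. FASTA line wrapping), then upper-case.
    residues = "".join(raw.split()).upper()
    # Reject digits, gap characters, and anything else outside A-Z.
    if not (residues.isascii() and residues.isalpha()):
        raise ValueError("sequence should contain only the letters A-Z")
    return " ".join(residues)


print(format_sequence("marnwsfrv"))  # M A R N W S F R V
```

The helper only checks that the characters are letters. It does not check that each letter is a code the base model's vocabulary handles well.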
The base model is Rostlab/prot_bert_bfd. It was fine-tuned with sentence-transformers so that the cosine similarity between pairs of its embeddings matches the cosine similarity between the corresponding embeddings from UniProt. For the train/test/validation datasets of embeddings and distances, see: https://huggingface.co/datasets/monsoon-nlp/protein-pairs-uniprot-swissprot

Usage

Install these dependencies:

pip install -U sentence-transformers datasets

Generating embeddings:

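from sentence_transformers import SentenceTransformer

# Residues are separated by spaces; "..." marks sequences shortened for display
sequences = ["M S L E Q K...", "M A R N W S F R V..."]

model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')
# encode() returns one embedding per input sequence
embeddings = model.encode(sequences)
print(embeddings)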
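Because the model was trained with Matryoshka loss, a shortened embedding is just the first values of the full one, re-normalized before comparison. The sketch below shows that step in NumPy. It uses random numbers as a stand-in for the output of `model.encode(...)`, and the 1024-column width and the helper names are assumptions made for illustration.

```python
import numpy as np


def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` values of each row and L2-normalize the result."""
    truncated = np.asarray(embeddings, dtype=np.float32)[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    # Guard against division by zero for an all-zero row.
    return truncated / np.clip(norms, 1e-12, None)


def cosine_similarity_matrix(unit_rows: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity for rows that are already unit length."""
    return unit_rows @ unit_rows.T


# Stand-in for `embeddings = model.encode(sequences)`; the width is assumed.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2, 1024))

short = truncate_embeddings(embeddings, dim=128)
print(short.shape)  # (2, 128)
print(cosine_similarity_matrix(short))
```

Recent sentence-transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which may replace the manual slicing. Check the version you have installed.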

Training + Code

Colab notebook: https://colab.research.google.com/drive/1uBk-jHOAPhIiUPPunfK7bMC8GnzpwmBy?usp=sharing

Results on 1,000 protein pairs from the validation dataset, during training:

| steps | cosine_pearson | cosine_spearman |
|-------|----------------|-----------------|
| 3000 | 0.8598688660086558 | 0.8666855900999677 |
| 6000 | 0.8692703523988448 | 0.8615673651584274 |
| 9000 | 0.8779733537629968 | 0.8754158959780602 |
| 12000 | 0.8877422045031667 | 0.8881492475969834 |
| 15000 | 0.9027359688395733 | 0.899106724739699 |
| 18000 | 0.9046675789738002 | 0.9044183600191271 |
| 21000 | 0.9165801536390973 | 0.9061381997421003 |
| 24000 | 0.9128046401341833 | 0.9076748537082228 |
| 27000 | 0.918547416546341 | 0.9127677526055185 |
| 30000 | 0.9239429677657788 | 0.9187051589781693 |
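The two metric columns appear to be the Pearson and Spearman correlations between the model's cosine similarities and the target similarities. That reading is inferred from the column names and is not confirmed by the training notebook. As a rough sketch of what such metrics measure, the NumPy code below computes both on a few made-up values. The numbers and function names are illustrative only.

```python
import numpy as np


def pearson(x, y) -> float:
    """Linear correlation between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


def average_ranks(values) -> np.ndarray:
    """1-based ranks; tied values share the mean of the ranks they span."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks


def spearman(x, y) -> float:
    """Rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up similarities for four protein pairs (not real results).
predicted = [0.91, 0.35, 0.62, 0.10]
target = [0.88, 0.41, 0.70, 0.05]

print("cosine_pearson: ", pearson(predicted, target))
print("cosine_spearman:", spearman(predicted, target))
```

Pearson rewards predictions that track the targets linearly, while Spearman only asks that pairs be ranked in the same order. This is why the two columns can move in different directions between checkpoints.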

Validation

Scatter plots comparing the full and 128-dim embeddings to the original embeddings, using pairs from the test set: https://colab.research.google.com/drive/1hm4IIMXaLt_7QYRNvkiXl5BqmsHdC1Ue?usp=sharing

Future

This page will be updated when I have examples using it on protein classification tasks.

I'm interested in whether embedding quantization could be even more efficient.

If you want to collaborate on future projects / have resources to train longer on more embeddings, please get in touch.