library_name: sentence-transformers
pipeline_tag: sentence-similarity
datasets:
- monsoon-nlp/protein-pairs-uniprot-swissprot
tags:
- sentence-transformers
- sentence-similarity
- transformers
- biology
license: cc
base_model: Rostlab/prot_bert_bfd
Protein Matryoshka Embeddings
The model generates an embedding for each input protein sequence. It was trained using Matryoshka loss, so shortened embeddings can be used for faster search and other tasks.
Inputs use IUPAC-IUB codes where letters A-Z map to amino acids. For example:
"M A R N W S F R V"
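Raw sequences are often stored without spaces, for example in FASTA files. A minimal sketch of a helper that converts them to the space-separated form shown above; `format_sequence` is a hypothetical name for illustration and is not part of the model or of any library.

```python
def format_sequence(raw: str) -> str:
    """Return the residues of `raw` as upper-case letters separated by single spaces."""
    # Drop any whitespace first (spaces, or newlines from wrapped lines).
    residues = "".join(raw.split()).upper()
    return " ".join(residues)


print(format_sequence("MARNWSFRV"))  # M A R N W S F R V
```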
The base model is Rostlab/prot_bert_bfd. It was fine-tuned as a sentence-transformers model on the cosine similarity between pairs of protein embeddings from UniProt. For the train/test/validation datasets of embeddings and distances, see: https://huggingface.co/datasets/monsoon-nlp/protein-pairs-uniprot-swissprot
Usage
Install these dependencies:
pip install -U sentence-transformers datasets
Generating embeddings:
from sentence_transformers import SentenceTransformer
# Amino acid letters are separated by spaces
sequences = ["M S L E Q K...", "M A R N W S F R V..."]
model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')
# encode() returns one embedding per input sequence
embeddings = model.encode(sequences)
print(embeddings)
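Because the model was trained with Matryoshka loss, the leading dimensions of each embedding can be kept and the rest dropped. The sketch below shows one common way to do that with NumPy: slice, then re-normalize so that a dot product equals the cosine similarity. `shorten_embeddings` is a hypothetical helper, and the random array only stands in for the output of `model.encode`; its width of 1024 is an assumption based on the base model's hidden size.

```python
import numpy as np


def shorten_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` values of each row and L2-normalize the result."""
    truncated = np.asarray(embeddings, dtype=np.float32)[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return truncated / np.clip(norms, 1e-12, None)


# Placeholder data standing in for `model.encode(sequences)` (assumed shape).
rng = np.random.default_rng(0)
full = rng.normal(size=(2, 1024)).astype(np.float32)

short = shorten_embeddings(full, 128)
print(short.shape)
# Rows are unit length, so a dot product is the cosine similarity.
print(float(short[0] @ short[1]))
```

Recent sentence-transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which may make the manual slice unnecessary; check the version you have installed.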
Training + Code
CoLab notebook: https://colab.research.google.com/drive/1uBk-jHOAPhIiUPPunfK7bMC8GnzpwmBy?usp=sharing
Results on 1,000 protein pairs from the validation dataset, during training:
steps | cosine_pearson | cosine_spearman |
---|---|---|
3000 | 0.8598688660086558 | 0.8666855900999677 |
6000 | 0.8692703523988448 | 0.8615673651584274 |
9000 | 0.8779733537629968 | 0.8754158959780602 |
12000 | 0.8877422045031667 | 0.8881492475969834 |
15000 | 0.9027359688395733 | 0.899106724739699 |
18000 | 0.9046675789738002 | 0.9044183600191271 |
21000 | 0.9165801536390973 | 0.9061381997421003 |
24000 | 0.9128046401341833 | 0.9076748537082228 |
27000 | 0.918547416546341 | 0.9127677526055185 |
30000 | 0.9239429677657788 | 0.9187051589781693 |
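The column names suggest these are Pearson and Spearman correlations between the cosine similarity predicted for each protein pair and the target similarity in the dataset. A minimal NumPy sketch of how such scores can be computed follows. The function names and the toy numbers are made up for illustration, and the rank step assumes no tied values (SciPy's `spearmanr` handles ties).

```python
import numpy as np


def pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def spearman(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation of the ranks; assumes no ties in `x` or `y`."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)


# Toy values (not real results): predicted vs. target pair similarities.
predicted = np.array([0.10, 0.40, 0.35, 0.80, 0.95])
target = np.array([0.05, 0.50, 0.30, 0.70, 0.90])

print("cosine_pearson:", pearson(predicted, target))
print("cosine_spearman:", spearman(predicted, target))
```

Pearson rewards a linear match between predicted and target values, while Spearman only checks that pairs are ordered the same way, which is why the two columns can move in different directions between checkpoints.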
Validation
Scatter plots comparing the full and 128-dim embeddings to the original embeddings, using pairs from the test set: https://colab.research.google.com/drive/1hm4IIMXaLt_7QYRNvkiXl5BqmsHdC1Ue?usp=sharing
Finetuning / Tasks
One of the more popular evaluations is Tasks Assessing Protein Embeddings (TAPE).
Example using SciKit-Learn to train on Fluorescence, a regression task from TAPE: https://colab.research.google.com/drive/1cH9jOBSC56mqJHU_6ztQPp6qWJguNjAn?usp=sharing
TBD: example using SciKit-Learn on a classification task
TBD: example using Sentence-Transformers to finetune embeddings for a TAPE regression or classification task
TBD: examples using plant proteins from greenbeing-binary to train a binary classifier
Future
This page will be updated when I have examples of using the model on protein classification tasks.
I'm interested in whether embedding quantization could be even more efficient.
If you want to collaborate on future projects / have resources to train longer on more embeddings, please get in touch.