---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- transformers
- biology
---
# Protein Matryoshka Embeddings
This model generates embeddings for input protein sequences. It was trained with a Matryoshka loss, so the embeddings can be truncated to shorter lengths for faster search and other tasks (see the truncation sketch under Usage below).
Inputs use IUPAC-IUB single-letter codes, where the letters A-Z map to amino acids and residues are separated by spaces. For example:

```
"M A R N W S F R V"
```
The base model is Rostlab/prot_bert_bfd. A sentence-transformers model was then trained on the cosine similarity of protein embeddings from UniProt. The train/test/validation datasets of embeddings and distances are available at: https://huggingface.co/datasets/monsoon-nlp/protein-pairs-uniprot-swissprot
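The pairs can be loaded with the `datasets` library; a sketch, with split and column names left for you to confirm against the dataset card:

```python
from datasets import load_dataset

# Protein-pair similarity data used for training/evaluation
pairs = load_dataset("monsoon-nlp/protein-pairs-uniprot-swissprot")
print(pairs)               # available splits
print(pairs["train"][0])   # inspect one record (split/column names may differ)
```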
## Usage

Install these dependencies:

```bash
pip install -U sentence-transformers datasets
```
Generating embeddings:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("monsoon-nlp/protein-matryoshka-embeddings")

# Space-separated amino acid sequences (IUPAC-IUB single-letter codes)
sequences = ["M S L E Q K...", "M A R N W S F R V..."]

embeddings = model.encode(sequences)
print(embeddings)
```
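Because of the Matryoshka loss, the leading dimensions of each embedding carry most of the signal, so vectors can be truncated and re-normalized for cheaper similarity search. A minimal sketch; the 256-dimension cut-off is an arbitrary illustration, not a recommendation from this card:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("monsoon-nlp/protein-matryoshka-embeddings")
sequences = ["M S L E Q K...", "M A R N W S F R V..."]

# Keep only the first 256 dimensions, then re-normalize to unit length
full = model.encode(sequences)
short = full[:, :256]
short = short / np.linalg.norm(short, axis=1, keepdims=True)

# Cosine similarity between the two shortened embeddings
print(float(short[0] @ short[1]))
```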
## Training + Code
Colab notebook: https://colab.research.google.com/drive/1uBk-jHOAPhIiUPPunfK7bMC8GnzpwmBy?usp=sharing
Results on 1,000 protein pairs from the validation dataset, during training:
| steps | cosine_pearson | cosine_spearman |
|---|---|---|
| 3000 | 0.8599 | 0.8667 |
| 6000 | 0.8693 | 0.8616 |
| 9000 | 0.8780 | 0.8754 |
| 12000 | 0.8877 | 0.8881 |
| 15000 | 0.9027 | 0.8991 |
| 18000 | 0.9047 | 0.9044 |
| 21000 | 0.9166 | 0.9061 |
| 24000 | 0.9128 | 0.9077 |
| 27000 | 0.9185 | 0.9128 |
| 30000 | 0.9239 | 0.9187 |
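These metrics match what sentence-transformers' `EmbeddingSimilarityEvaluator` reports. A sketch of running a similar evaluation on held-out pairs; the split and column names (`protein1`, `protein2`, `similarity`) are assumptions to check against the dataset card:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("monsoon-nlp/protein-matryoshka-embeddings")

# Hypothetical split/column names -- verify against the dataset card
pairs = load_dataset("monsoon-nlp/protein-pairs-uniprot-swissprot", split="validation[:1000]")
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=pairs["protein1"],
    sentences2=pairs["protein2"],
    scores=pairs["similarity"],
    name="protein-pairs-validation",
)
print(evaluator(model))  # correlation of cosine similarity vs. labeled scores
```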
## Future
This model will be updated when I have examples of using it on protein classification tasks.

I'm also interested in whether embedding quantization could make storage and search even more efficient (see the sketch below).

If you want to collaborate on future projects or have resources to train longer on more embeddings, please get in touch.
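For reference, sentence-transformers ships a `quantize_embeddings` helper; this is a minimal, unbenchmarked sketch of int8 quantization, not something this card has evaluated:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("monsoon-nlp/protein-matryoshka-embeddings")
sequences = ["M S L E Q K...", "M A R N W S F R V..."]

# Quantize float32 embeddings to int8 to shrink index size
embeddings = model.encode(sequences)
int8_embeddings = quantize_embeddings(embeddings, precision="int8")
print(embeddings.dtype, int8_embeddings.dtype)
```

In practice you would pass a larger `calibration_embeddings` set to `quantize_embeddings` so the int8 ranges are representative of the corpus.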