---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
datasets:
- monsoon-nlp/protein-pairs-uniprot-swissprot
tags:
- sentence-transformers
- sentence-similarity
- transformers
- biology
- protein language model
license: cc
base_model: Rostlab/prot_bert_bfd
---
# Protein Matryoshka Embeddings
This model generates an embedding for an input protein sequence. It was trained using [Matryoshka loss](https://huggingface.co/blog/matryoshka),
so embeddings can be truncated to fewer dimensions for faster search and other tasks.
Inputs use [IUPAC-IUB codes](https://en.wikipedia.org/wiki/FASTA_format#Sequence_representation), where letters A-Z map to amino acids, with a space between each letter. For example:
"M A R N W S F R V"
The base model was [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd).
It was trained as a [sentence-transformers](https://github.com/UKPLab/sentence-transformers) model on the cosine similarity between pairs of protein embeddings
from [UniProt](https://www.uniprot.org/help/downloads#embeddings).
For train/test/validation datasets of embeddings and distances, see: https://huggingface.co/datasets/monsoon-nlp/protein-pairs-uniprot-swissprot
## Usage
Install these dependencies:
```
pip install -U sentence-transformers datasets
```
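Preparing inputs: sequences copied from FASTA records are usually unbroken strings, while the input format shown earlier puts a space between letters. The helper below is a minimal sketch of that conversion; `format_sequence` is a hypothetical name for illustration and is not part of this model or of sentence-transformers.

```python
import re


def format_sequence(raw: str) -> str:
    """Turn a raw amino-acid string into space-separated, uppercase letters."""
    # Drop spaces and line breaks, which often appear in FASTA-style records.
    compact = re.sub(r"\s+", "", raw).upper()
    # Only the letters A-Z are expected in the input format.
    if not re.fullmatch(r"[A-Z]*", compact):
        raise ValueError("sequence contains characters outside A-Z")
    return " ".join(compact)


print(format_sequence("MARNWSFRV"))  # M A R N W S F R V
```

The returned strings can be collected into a list and passed to the model in place of hand-typed sequences.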
Generating embeddings:
```python
from sentence_transformers import SentenceTransformer

# Amino acids are separated by spaces
sequences = ["M S L E Q K...", "M A R N W S F R V..."]
model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')
# encode() returns one embedding per input sequence
embeddings = model.encode(sequences)
print(embeddings)
```
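Shortening embeddings: because the model was trained with Matryoshka loss, a shorter embedding is taken from the leading dimensions of the full one. The sketch below shows one common way to do this with NumPy, truncating and then re-normalizing each row. The helper name `truncate_embeddings` is hypothetical, and the random array with a placeholder width of 1024 only stands in for the output of `model.encode`; 128 matches the shortened size used in the Validation section.

```python
import numpy as np


def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` values of each embedding and L2-normalize the rows."""
    truncated = np.asarray(embeddings, dtype=np.float32)[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return truncated / np.clip(norms, 1e-12, None)


# Placeholder data (assumption): stands in for the output of model.encode(...).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(2, 1024)).astype(np.float32)

short = truncate_embeddings(fake_embeddings, 128)
print(short.shape)  # (2, 128)
# Rows are unit length, so a dot product gives the cosine similarity.
print(float(short[0] @ short[1]))
```

Recent sentence-transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which may be more convenient; check the version you have installed.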
## Training + Code
CoLab notebook: https://colab.research.google.com/drive/1uBk-jHOAPhIiUPPunfK7bMC8GnzpwmBy?usp=sharing
Results during training, evaluated on 1,000 protein pairs from the validation dataset:
|steps|cosine_pearson|cosine_spearman|
|-----|--------------|---------------|
|3000|0.8598688660086558|0.8666855900999677|
|6000|0.8692703523988448|0.8615673651584274|
|9000|0.8779733537629968|0.8754158959780602|
|12000|0.8877422045031667|0.8881492475969834|
|15000|0.9027359688395733|0.899106724739699|
|18000|0.9046675789738002|0.9044183600191271|
|21000|0.9165801536390973|0.9061381997421003|
|24000|0.9128046401341833|0.9076748537082228|
|27000|0.918547416546341|0.9127677526055185|
|30000|0.9239429677657788|0.9187051589781693|
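Assuming the column names in the table above carry their usual meaning, `cosine_pearson` and `cosine_spearman` are the Pearson and Spearman correlations between the model's cosine similarity for each pair and the reference similarity for that pair. The sketch below illustrates the calculation with NumPy only. The function names are hypothetical, the scores are made up for illustration, and the rank computation assumes no tied scores.

```python
import numpy as np


def pairwise_cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between row i of `a` and row i of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)


def pearson(x, y) -> float:
    """Pearson correlation coefficient of two 1-D score arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def spearman(x, y) -> float:
    """Spearman correlation: Pearson on ranks (assumes no tied scores)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)


# Made-up scores for four protein pairs, for illustration only.
predicted = np.array([0.10, 0.40, 0.35, 0.80])  # model cosine similarities
reference = np.array([0.20, 0.50, 0.30, 0.90])  # target similarities

print(pearson(predicted, reference))
# Both arrays order the pairs identically, so the rank correlation is perfect.
print(spearman(predicted, reference))
```

In practice `predicted` would come from `pairwise_cosine` applied to the embeddings of the two proteins in each pair.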
## Validation
Scatter plots comparing the full and 128-dim embeddings to the original embeddings, using pairs from the test set: https://colab.research.google.com/drive/1hm4IIMXaLt_7QYRNvkiXl5BqmsHdC1Ue?usp=sharing
## Finetuning / Tasks
One of the more popular evaluations for protein embeddings is [Tasks Assessing Protein Embeddings (TAPE)](https://github.com/songlab-cal/tape).
Example using scikit-learn to train on Fluorescence, a regression task from TAPE: https://colab.research.google.com/drive/1cH9jOBSC56mqJHU_6ztQPp6qWJguNjAn?usp=sharing
Example using scikit-learn to train on a classification task from [greenbeing-binary](https://huggingface.co/datasets/monsoon-nlp/greenbeing-binary): https://colab.research.google.com/drive/1MCTn8f3oeIKpB6n_8mPumet3ukm7GD8a?usp=sharing
## Future
This page will be updated when I have examples of using the model on protein classification tasks.
I'm interested in whether [embedding quantization](https://huggingface.co/blog/embedding-quantization) could be even more efficient.
If you want to collaborate on future projects / have resources to train longer on more embeddings, please get in touch.