---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
datasets:
- monsoon-nlp/protein-pairs-uniprot-swissprot
tags:
- sentence-transformers
- sentence-similarity
- transformers
- biology
- protein language model
license: cc
base_model: Rostlab/prot_bert_bfd
---

# Protein Matryoshka Embeddings

This model generates an embedding for an input protein sequence. It was trained using [Matryoshka loss](https://huggingface.co/blog/matryoshka),
so shortened embeddings can be used for faster search and other tasks.

Inputs use [IUPAC-IUB codes](https://en.wikipedia.org/wiki/FASTA_format#Sequence_representation) where letters A-Z map to amino acids, with a space between residues. For example:

"M A R N W S F R V"

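FASTA files usually store sequences without spaces, so they need reformatting first. A minimal sketch of that conversion follows. `format_sequence` is a hypothetical helper, not part of this model or of sentence-transformers, and upper-casing the input is an assumption based on the example above.

```python
def format_sequence(raw: str) -> str:
    """Turn a contiguous amino-acid string into space-separated residues."""
    # Remove whitespace and line breaks (e.g. from a wrapped FASTA record),
    # then upper-case so the letters match the example format.
    residues = "".join(raw.split()).upper()
    # Put a single space between consecutive residues.
    return " ".join(residues)


print(format_sequence("MARNWSFRV"))  # M A R N W S F R V
```
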
The base model was [Rostlab/prot_bert_bfd](https://huggingface.co/Rostlab/prot_bert_bfd).
A [sentence-transformers](https://github.com/UKPLab/sentence-transformers) model was trained on the cosine similarity between pairs of embeddings
from [UniProt](https://www.uniprot.org/help/downloads#embeddings).
For train/test/validation datasets of embeddings and distances, see: https://huggingface.co/datasets/monsoon-nlp/protein-pairs-uniprot-swissprot

## Usage

Install these dependencies:

```
pip install -U sentence-transformers datasets
```

Generating embeddings:

```python
from sentence_transformers import SentenceTransformer

# Space-separated amino acid sequences (truncated here with "...")
sequences = ["M S L E Q K...", "M A R N W S F R V..."]

model = SentenceTransformer('monsoon-nlp/protein-matryoshka-embeddings')
# encode() returns one embedding per input sequence
embeddings = model.encode(sequences)
print(embeddings)
```

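Because the model was trained with Matryoshka loss, an embedding can be shortened by keeping only its leading values. The sketch below shows one way to do that with NumPy: truncate, then re-normalize so a dot product equals cosine similarity. The random array is placeholder data standing in for the output of `model.encode`, its width of 1024 is an assumption, and `truncate_embeddings` is a hypothetical helper. The 128-dim size matches the shortened embeddings compared in the Validation section.

```python
import numpy as np


def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` values of each row and rescale rows to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    # Guard against division by zero for an all-zero row.
    return truncated / np.clip(norms, 1e-12, None)


# Placeholder data in place of `model.encode(sequences)`; the shape is assumed.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2, 1024)).astype(np.float32)

short = truncate_embeddings(embeddings, 128)
# Rows are unit length, so the dot product is the cosine similarity.
similarity = float(short[0] @ short[1])
print(short.shape, similarity)
```

Recent sentence-transformers releases also accept a `truncate_dim` argument when loading a model, which may make the manual step unnecessary; check the documentation for your installed version.
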
## Training + Code

Colab notebook: https://colab.research.google.com/drive/1uBk-jHOAPhIiUPPunfK7bMC8GnzpwmBy?usp=sharing

Results on 1,000 protein pairs from the validation dataset, during training:

|steps|cosine_pearson|cosine_spearman|
|-----|--------------|---------------|
|3000|0.8598688660086558|0.8666855900999677|
|6000|0.8692703523988448|0.8615673651584274|
|9000|0.8779733537629968|0.8754158959780602|
|12000|0.8877422045031667|0.8881492475969834|
|15000|0.9027359688395733|0.899106724739699|
|18000|0.9046675789738002|0.9044183600191271|
|21000|0.9165801536390973|0.9061381997421003|
|24000|0.9128046401341833|0.9076748537082228|
|27000|0.918547416546341|0.9127677526055185|
|30000|0.9239429677657788|0.9187051589781693|

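To illustrate the two metrics in the table, the sketch below assumes that `cosine_pearson` and `cosine_spearman` are the Pearson and Spearman correlations between the model's cosine similarity for each pair and the reference similarity for that pair. All data here is random placeholder data, not output from this model, and the helper names are hypothetical. The rank step ignores ties, which is a simplification.

```python
import numpy as np


def cosine_similarities(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (n_pairs, dim) arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)


def pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def spearman(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation of the ranks (ties are not averaged here)."""
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    return pearson(rank_x, rank_y)


# Placeholder pairs: 1,000 pairs of 64-dim vectors (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(1000, 64))
emb_b = emb_a + rng.normal(scale=0.5, size=(1000, 64))

predicted = cosine_similarities(emb_a, emb_b)
# Noisy stand-in for the reference similarity of each pair.
target = predicted + rng.normal(scale=0.05, size=1000)

print(pearson(predicted, target), spearman(predicted, target))
```
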
## Validation

Scatter plots comparing the full and 128-dim embeddings to the original embeddings, using pairs from the test set: https://colab.research.google.com/drive/1hm4IIMXaLt_7QYRNvkiXl5BqmsHdC1Ue?usp=sharing

## Finetuning / Tasks

One of the more popular evaluations is [Tasks Assessing Protein Embeddings (TAPE)](https://github.com/songlab-cal/tape).

Example using scikit-learn to train on Fluorescence, a regression task from TAPE: https://colab.research.google.com/drive/1cH9jOBSC56mqJHU_6ztQPp6qWJguNjAn?usp=sharing

Example using scikit-learn to train on a classification task from [greenbeing-binary](https://huggingface.co/datasets/monsoon-nlp/greenbeing-binary): https://colab.research.google.com/drive/1MCTn8f3oeIKpB6n_8mPumet3ukm7GD8a?usp=sharing

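The general pattern in such notebooks is to treat the embeddings as fixed features and fit a scikit-learn estimator on top of them. The sketch below shows that pattern for a regression target. It uses random placeholder features and a synthetic target rather than TAPE data, and `Ridge` is chosen here only for illustration; the linked notebooks may use different estimators and preprocessing.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder features in place of `model.encode(sequences)`: 200 proteins, 128 dims.
X = rng.normal(size=(200, 128))
# Synthetic regression target, standing in for a measured property.
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit a regularized linear model on the fixed embedding features.
regressor = Ridge(alpha=1.0)
regressor.fit(X_train, y_train)

predictions = regressor.predict(X_test)
print("R^2 on held-out data:", regressor.score(X_test, y_test))
```

For a classification task, the same structure applies with a classifier such as `LogisticRegression` in place of `Ridge`.
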
## Future

This page will be updated when I have examples using it on protein classification tasks.

I'm interested in whether [embedding quantization](https://huggingface.co/blog/embedding-quantization) could be even more efficient.

If you want to collaborate on future projects or have resources to train longer on more embeddings, please get in touch.