AmelieSchreiber committed · Commit 49d3c66 · 1 Parent(s): 9e90ac7

Update README.md

README.md CHANGED
@@ -1,3 +1,95 @@
 ---
 license: mit
+datasets:
+- AmelieSchreiber/general_binding_sites
+language:
+- en
+metrics:
+- precision
+- recall
+- f1
+library_name: transformers
+tags:
+- biology
+- esm
+- esm2
+- ESM-2
+- protein language model
 ---
+
+# ESM-2 for General Protein Binding Site Prediction
+
+This model is trained to predict general binding sites of proteins using only the sequence. It is a fine-tuned version of
+`esm2_t6_8M_UR50D`, trained on [this dataset](https://huggingface.co/datasets/AmelieSchreiber/general_binding_sites). The data is
+not filtered by family, so the model may be overfit to some degree.
+
+## Training
+
+```
+epoch 3:
+'eval_loss': 0.08215777575969696,
+'eval_precision': 0.4673852829840273,
+'eval_recall': 0.9587594696969697,
+'eval_f1': 0.6284215753212091,
+'eval_auc': 0.9730582015280457
+```
+
+## Using the Model
+
+Try pasting a protein sequence into the cell on the right and clicking on "Compute". For example, try
+
+```
+MKVEEILEKALELVIPDEEEVRKGREAEEELRRRLDELGVEYVFVGSYARNTWLKGSLEIDVFLLFPEEFSKEELRERGLEIGKAVLDSYEIRYAEHPYVHGVVKGVEVDVVPCYKLKEPKNIKSAVDRTPFHHKWLEGRIKGKENEVRLLKGFLKANGIYGAEYKVRGFSGYLCELLIVFYGSFLETVKNARRWTRRTVIDVAKGEVRKGEEFFVVDPVDEKRNVAANLSLDNLARFVHLCREFMEAPSLGFFKPKHPLEIEPERLRKIVEERGTAVFAVKFRKPDIVDDNLYPQLERASRKIFEFLERENFMPLRSAFKASEEFCYLLFECQIKEISRVFRRMGPQFEDERNVKKFLSRNRAFRPFIENGRWWAFEMRKFTTPEEGVRSYASTHWHTLGKNVGESIREYFEIISGEKLFKEPVTAELCEMMGVKD
+```
+
+To use the model, try running:
+```python
+import torch
+from transformers import AutoModelForTokenClassification, AutoTokenizer
+
+def predict_binding_sites(model_path, protein_sequences):
+    """
+    Predict binding sites for a collection of protein sequences.
+
+    Parameters:
+    - model_path (str): Path to the saved model.
+    - protein_sequences (List[str]): List of protein sequences.
+
+    Returns:
+    - List[List[str]]: Predicted labels for each sequence.
+    """
+
+    # Load tokenizer and model
+    tokenizer = AutoTokenizer.from_pretrained(model_path)
+    model = AutoModelForTokenClassification.from_pretrained(model_path)
+
+    # Ensure model is in evaluation mode
+    model.eval()
+
+    # Tokenize sequences
+    inputs = tokenizer(protein_sequences, return_tensors="pt", padding=True, truncation=True)
+
+    # Obtain logits without tracking gradients (model and inputs both stay on the default device)
+    with torch.no_grad():
+        logits = model(**inputs).logits
+
+    # Obtain predicted labels
+    predicted_labels = torch.argmax(logits, dim=-1).cpu().numpy()
+
+    # Convert label IDs to human-readable labels
+    id2label = model.config.id2label
+    human_readable_labels = [[id2label[label_id] for label_id in sequence] for sequence in predicted_labels]
+
+    return human_readable_labels
+
+# Usage:
+model_path = "AmelieSchreiber/esm2_t6_8M_general_binding_sites" # Replace with your model's path
+unseen_proteins = [
+    "MKVEEILEKALELVIPDEEEVRKGREAEEELRRRLDELGVEYVFVGSYARNTWLKGSLEIDVFLLFPEEFSKEELRERGLEIGKAVLDSYEIRYAEHPYVHGVVKGVEVDVVPCYKLKEPKNIKSAVDRTPFHHKWLEGRIKGKENEVRLLKGFLKANGIYGAEYKVRGFSGYLCELLIVFYGSFLETVKNARRWTRRTVIDVAKGEVRKGEEFFVVDPVDEKRNVAANLSLDNLARFVHLCREFMEAPSLGFFKPKHPLEIEPERLRKIVEERGTAVFAVKFRKPDIVDDNLYPQLERASRKIFEFLERENFMPLRSAFKASEEFCYLLFECQIKEISRVFRRMGPQFEDERNVKKFLSRNRAFRPFIENGRWWAFEMRKFTTPEEGVRSYASTHWHTLGKNVGESIREYFEIISGEKLFKEPVTAELCEMMGVKD",
+    "MKVEEILEKALELVIPDEEEVRKGREAEEELRRRLDELGVEYVFVGSYARNTWLKGSLEIAVFLLFPEEFSKEELRERGLEIGKAVLDSYEIRYAEHPYVHGVVKGVEVDVVPCYKLKEPKNIKSAVDRTPFHHKWLEGRIKGKENEVRLLKGFLKANGIYGAEYKVRGFSGYLCELLIVFYGSFLETVKNARRWTRRTVIDVAKGEVRKGEEFFVVDPVDEKRNVAANLSLDNLARFVHLCREFMEAPSLGFFKVKHPLEIEPERLRKIVEERGTAVFAVKFRKPDIVDDNLYPQLERASRKIFEFLERENFMPLRSAFKASEEFCYLLFECQIKEISRVFRRMGPQFEDERNVKKFLSRNRAFRPFIENGRWWAFEMRKFTTPEEGVRSYASTHWHTLGKNVGESIREYFEIISGEKLFKEPVTAELCEMMGVKD",
+    "MKVEEILEKALELVIPDEEEVRKGREAEEELRRRLDELGVEAVFVGSYARNTWLKGSLEIAVFLLFPEEFSKEELRERGLEIEKAVLDSYEIRYAEHPYVHGVVKGVEVDVVPCYKLKEPKNIKSAVDRTPFHHKELEGRIKGKENEVRLLKGFLKANGIYGAEYAVRGFSGYLCELLIVFYGSFLETVKNARRWTRRTVIDVAKGEVRKGEEFFVVDPVDEKRNVAANLSLDNLARFVHLCREFMEAPSLGFFKVKHPLEIEPERLRKIVEERGTAVFMVKFRKPDIVDDNLYPQLRRASRKIFEFLERNNFMPLRSAFKASEEFCYLLFECQIKEISDVFRRMGPLFEDERNVKKFLSRNRALRPFIENGRWWIFEMRKFTTPEEGVRSYASTHWHTLGKNVGESIREYFEIISGEKLFKEPVTAELCRMMGVKD",
+    "MKVEEILEKALELVIPDEEEVRKGREAEEELRRRLDELGVEAVFVGSYARNTWLKGSLEIAVFLLFPEEFSKEELRERGLEIEKAVLDSYGIRYAEHPYVHGVVKGVELDVVPCYKLKEPKNIKSAVDRTPFHHKELEGRIKGKENEYRSLKGFLKANGIYGAEYAVRGFSGYLCELLIVFYGSFLETVKNARRWTRKTVIDVAKGEVRKGEEFFVVDPVDEKRNVAALLSLDNLARFVHLCREFMEAPSLGFFKVKHPLEIEPERLRKIVEERGTAVFMVKFRKPDIVDDNLYPQLRRASRKIFEFLERNNFMPLRRAFKASEEFCYLLFEQQIKEISDVFRRMGPLFEDERNVKKFLSRNRALRPFIENGRWWIFEMRKFTTPEEGVRSYASTHWHTLGKNVGESIREYFEIIEGEKLFKEPVTAELCRMMGVKD"
+] # Replace with your unseen protein sequences
+predictions = predict_binding_sites(model_path, unseen_proteins)
+predictions
+```
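
A note on reading the output of `predict_binding_sites` from the README added in this commit: it takes the argmax over every token position, so each returned list has one label per token rather than one per residue. The sketch below shows one way to trim such a list down to residues. It is not part of the commit. The helper name `residue_labels` is hypothetical, the label strings in the toy example are placeholders, and the trimming assumes the tokenizer adds one special token before and one after each sequence and pads on the right, which is worth checking against the tokenizer you actually load.

```python
from typing import List, Sequence


def residue_labels(
    token_labels: Sequence[str],
    attention_mask: Sequence[int],
    n_prefix: int = 1,
    n_suffix: int = 1,
) -> List[str]:
    """Trim per-token labels down to one label per residue.

    Assumes right-side padding (mask value 0) plus `n_prefix` special tokens
    before and `n_suffix` special tokens after the residues.
    """
    # Keep only positions the attention mask marks as real tokens.
    real = [label for label, keep in zip(token_labels, attention_mask) if keep]
    # Drop the assumed special tokens at both ends.
    end = len(real) - n_suffix
    return real[n_prefix:end]


# Toy example: <cls> A C D <eos> <pad> <pad>, with placeholder label names.
labels = ["no", "no", "site", "no", "site", "no", "no"]
mask = [1, 1, 1, 1, 1, 0, 0]
print(residue_labels(labels, mask))  # ['no', 'site', 'no']
```

To apply this to real predictions, the attention mask would need to come from the same tokenizer call that produced the logits (for example `inputs["attention_mask"]` inside the function), converted to a plain list per sequence.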