AmelieSchreiber committed on
Commit
49d3c66
·
1 Parent(s): 9e90ac7

Update README.md

  1. README.md +92 -0
README.md CHANGED
@@ -1,3 +1,95 @@
  ---
  license: mit
+ datasets:
+ - AmelieSchreiber/general_binding_sites
+ language:
+ - en
+ metrics:
+ - precision
+ - recall
+ - f1
+ library_name: transformers
+ tags:
+ - biology
+ - esm
+ - esm2
+ - ESM-2
+ - protein language model
  ---
+
+ # ESM-2 for General Protein Binding Site Prediction
+
+ This model is trained to predict general binding sites of proteins using only the sequence. It is a fine-tuned version of
+ `esm2_t6_8M_UR50D`, trained on [this dataset](https://huggingface.co/datasets/AmelieSchreiber/general_binding_sites). The data is
+ not filtered by family, so the model may be overfit to some degree.
+
+ ## Training
+
+ ```
+ epoch 3:
+ 'eval_loss': 0.08215777575969696,
+ 'eval_precision': 0.4673852829840273,
+ 'eval_recall': 0.9587594696969697,
+ 'eval_f1': 0.6284215753212091,
+ 'eval_auc': 0.9730582015280457
+ ```
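As a quick sanity check on the numbers above, the reported F1 can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The sketch below only copies the three values from the evaluation output; it assumes that binding-site residues are the positive class, which the commit does not state explicitly.

```python
# Values copied from the epoch 3 evaluation output.
precision = 0.4673852829840273
recall = 0.9587594696969697
reported_f1 = 0.6284215753212091

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"recomputed F1: {f1:.4f}")
print(f"reported F1:   {reported_f1:.4f}")
```

Read this way, the model recovers most annotated binding residues (high recall), while fewer than half of the residues it flags are annotated as binding sites (low precision).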
+
+ ## Using the Model
+
+ Try pasting a protein sequence into the cell on the right and clicking "Compute". For example, try:
+
+ ```
+ MKVEEILEKALELVIPDEEEVRKGREAEEELRRRLDELGVEYVFVGSYARNTWLKGSLEIDVFLLFPEEFSKEELRERGLEIGKAVLDSYEIRYAEHPYVHGVVKGVEVDVVPCYKLKEPKNIKSAVDRTPFHHKWLEGRIKGKENEVRLLKGFLKANGIYGAEYKVRGFSGYLCELLIVFYGSFLETVKNARRWTRRTVIDVAKGEVRKGEEFFVVDPVDEKRNVAANLSLDNLARFVHLCREFMEAPSLGFFKPKHPLEIEPERLRKIVEERGTAVFAVKFRKPDIVDDNLYPQLERASRKIFEFLERENFMPLRSAFKASEEFCYLLFECQIKEISRVFRRMGPQFEDERNVKKFLSRNRAFRPFIENGRWWAFEMRKFTTPEEGVRSYASTHWHTLGKNVGESIREYFEIISGEKLFKEPVTAELCEMMGVKD
+ ```
+
+ To use the model, try running:
+ ```python
+ import torch
+ from transformers import AutoModelForTokenClassification, AutoTokenizer
+
+ def predict_binding_sites(model_path, protein_sequences):
+     """
+     Predict binding sites for a collection of protein sequences.
+
+     Parameters:
+     - model_path (str): Path to the saved model.
+     - protein_sequences (List[str]): List of protein sequences.
+
+     Returns:
+     - List[List[str]]: Predicted labels for each sequence, one per token
+       (including any special and padding tokens the tokenizer adds).
+     """
+
+     # Load tokenizer and model
+     tokenizer = AutoTokenizer.from_pretrained(model_path)
+     model = AutoModelForTokenClassification.from_pretrained(model_path)
+
+     # Ensure model is in evaluation mode
+     model.eval()
+
+     # Tokenize sequences
+     inputs = tokenizer(protein_sequences, return_tensors="pt", padding=True, truncation=True)
+
+     # Move inputs to the same device as the model and obtain logits
+     inputs = inputs.to(model.device)
+     with torch.no_grad():
+         logits = model(**inputs).logits
+
+     # Obtain predicted label IDs as plain Python ints
+     predicted_labels = torch.argmax(logits, dim=-1).cpu().tolist()
+
+     # Convert label IDs to human-readable labels
+     id2label = model.config.id2label
+     human_readable_labels = [[id2label[label_id] for label_id in sequence] for sequence in predicted_labels]
+
+     return human_readable_labels
+
+ # Usage:
+ model_path = "AmelieSchreiber/esm2_t6_8M_general_binding_sites" # Replace with your model's path
+ unseen_proteins = [
+     "MKVEEILEKALELVIPDEEEVRKGREAEEELRRRLDELGVEYVFVGSYARNTWLKGSLEIDVFLLFPEEFSKEELRERGLEIGKAVLDSYEIRYAEHPYVHGVVKGVEVDVVPCYKLKEPKNIKSAVDRTPFHHKWLEGRIKGKENEVRLLKGFLKANGIYGAEYKVRGFSGYLCELLIVFYGSFLETVKNARRWTRRTVIDVAKGEVRKGEEFFVVDPVDEKRNVAANLSLDNLARFVHLCREFMEAPSLGFFKPKHPLEIEPERLRKIVEERGTAVFAVKFRKPDIVDDNLYPQLERASRKIFEFLERENFMPLRSAFKASEEFCYLLFECQIKEISRVFRRMGPQFEDERNVKKFLSRNRAFRPFIENGRWWAFEMRKFTTPEEGVRSYASTHWHTLGKNVGESIREYFEIISGEKLFKEPVTAELCEMMGVKD",
+     "MKVEEILEKALELVIPDEEEVRKGREAEEELRRRLDELGVEYVFVGSYARNTWLKGSLEIAVFLLFPEEFSKEELRERGLEIGKAVLDSYEIRYAEHPYVHGVVKGVEVDVVPCYKLKEPKNIKSAVDRTPFHHKWLEGRIKGKENEVRLLKGFLKANGIYGAEYKVRGFSGYLCELLIVFYGSFLETVKNARRWTRRTVIDVAKGEVRKGEEFFVVDPVDEKRNVAANLSLDNLARFVHLCREFMEAPSLGFFKVKHPLEIEPERLRKIVEERGTAVFAVKFRKPDIVDDNLYPQLERASRKIFEFLERENFMPLRSAFKASEEFCYLLFECQIKEISRVFRRMGPQFEDERNVKKFLSRNRAFRPFIENGRWWAFEMRKFTTPEEGVRSYASTHWHTLGKNVGESIREYFEIISGEKLFKEPVTAELCEMMGVKD",
+     "MKVEEILEKALELVIPDEEEVRKGREAEEELRRRLDELGVEAVFVGSYARNTWLKGSLEIAVFLLFPEEFSKEELRERGLEIEKAVLDSYEIRYAEHPYVHGVVKGVEVDVVPCYKLKEPKNIKSAVDRTPFHHKELEGRIKGKENEVRLLKGFLKANGIYGAEYAVRGFSGYLCELLIVFYGSFLETVKNARRWTRRTVIDVAKGEVRKGEEFFVVDPVDEKRNVAANLSLDNLARFVHLCREFMEAPSLGFFKVKHPLEIEPERLRKIVEERGTAVFMVKFRKPDIVDDNLYPQLRRASRKIFEFLERNNFMPLRSAFKASEEFCYLLFECQIKEISDVFRRMGPLFEDERNVKKFLSRNRALRPFIENGRWWIFEMRKFTTPEEGVRSYASTHWHTLGKNVGESIREYFEIISGEKLFKEPVTAELCRMMGVKD",
+     "MKVEEILEKALELVIPDEEEVRKGREAEEELRRRLDELGVEAVFVGSYARNTWLKGSLEIAVFLLFPEEFSKEELRERGLEIEKAVLDSYGIRYAEHPYVHGVVKGVELDVVPCYKLKEPKNIKSAVDRTPFHHKELEGRIKGKENEYRSLKGFLKANGIYGAEYAVRGFSGYLCELLIVFYGSFLETVKNARRWTRKTVIDVAKGEVRKGEEFFVVDPVDEKRNVAALLSLDNLARFVHLCREFMEAVSLGFFKVKHPLEIEPERLRKIVEERGTAVFMVKFRKPDIVDDNLYPQLRRASRKIFEFLERNNFMPLRRAFKASEEFCYLLFEQQIKEISDVFRRMGPLFEDERNVKKFLSRNRALRPFIENGRWWIFEMRKFTTPEEGVRSYASTHWHTLGKNVGESIREYFEIIEGEKLFKEPVTAELCRMMGVKD"
+ ] # Replace with your unseen protein sequences
+ predictions = predict_binding_sites(model_path, unseen_proteins)
+ print(predictions)
+ ```
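The function in the README returns one label per token, not per residue, so the output has to be lined up with the input sequence before it can be read as binding-site positions. The sketch below shows one way to do that. It is not part of the commit and rests on two assumptions: the tokenizer emits one token per amino acid, and it prepends exactly one special token, with any end-of-sequence and padding tokens coming after the residues. The helper names `align_labels_to_residues` and `binding_positions` and the label strings `LABEL_0` / `LABEL_1` are hypothetical; check `model.config.id2label` for the real label names.

```python
from typing import List, Tuple


def align_labels_to_residues(sequence: str, token_labels: List[str]) -> List[Tuple[int, str, str]]:
    """Pair each residue with its predicted label as (1-based position, residue, label).

    Assumes token_labels[0] belongs to a single leading special token and that
    the next len(sequence) labels map one-to-one onto the residues.
    """
    if len(token_labels) < len(sequence) + 1:
        # Happens, for example, if the tokenizer truncated the sequence.
        raise ValueError("Fewer token labels than residues; was the sequence truncated?")
    # Skip the leading special token; ignore trailing special and padding tokens.
    residue_labels = token_labels[1 : len(sequence) + 1]
    return [(i + 1, aa, label) for i, (aa, label) in enumerate(zip(sequence, residue_labels))]


def binding_positions(sequence: str, token_labels: List[str], positive_label: str) -> List[int]:
    """Return the 1-based positions whose label equals positive_label."""
    aligned = align_labels_to_residues(sequence, token_labels)
    return [pos for pos, _, label in aligned if label == positive_label]


# Toy example with made-up labels: leading special token, 4 residues,
# one end-of-sequence token, one padding token.
toy_sequence = "MKVE"
toy_labels = ["LABEL_0", "LABEL_0", "LABEL_1", "LABEL_1", "LABEL_0", "LABEL_0", "LABEL_0"]
print(binding_positions(toy_sequence, toy_labels, positive_label="LABEL_1"))
```

With real predictions, each entry of `predictions` would be passed together with the matching input sequence.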