Niksa Praljak committed on
Commit 597c791 · 1 Parent(s): d85f01b

Update README.md and PenCL inference with a new set of prompt inputs

Files changed (2)
  1. README.md +38 -28
  2. run_PenCL_inference.py +16 -11
README.md CHANGED
@@ -119,30 +119,40 @@ The script provides the following outputs:
 
  #### Sample Output
  ```plaintext
- === Inference Results ===
- Shape of z_p (protein latent): torch.Size([2, 512])
- Shape of z_t (text latent): torch.Size([2, 512])
 
- Magnitudes of z_p vectors: tensor([5.3376, 4.8237])
- Magnitudes of z_t vectors: tensor([29.6971, 27.6714])
 
  === Dot Product Scores Matrix ===
- tensor([[ 7.3152,  1.8080],
-         [ 3.3922, 16.6157]])
 
  === Normalized Probabilities ===
- Protein-Normalized Probabilities:
- tensor([[9.8060e-01, 3.7078e-07],
-         [1.9398e-02, 1.0000e+00]])
-
- Text-Normalized Probabilities:
- tensor([[9.9596e-01, 4.0412e-03],
-         [1.8076e-06, 1.0000e+00]])
 
  === Homology Matrix (Dot Product of Normalized z_p) ===
- tensor([[1.0000, 0.1840],
-         [0.1840, 1.0000]])
-
  ```
 
  ## Stage 2: Facilitator Sampling
@@ -162,7 +172,7 @@ Before running the model, ensure you have:
  1. Run sampling:
  ```bash
  python run_Facilitator_sample.py \
- --json_path "stage2_facilitator_config.json" \
  --model_path "./weights/Facilitator/BioM3_Facilitator_epoch20.bin" \
  --input_data_path "test_PenCL_embeddings.pt" \
  --output_data_path "test_Facilitator_embeddings.pt"
@@ -198,22 +208,22 @@ The script provides the following outputs:
 
  ```plaintext
  === Facilitator Model Output ===
- Shape of z_t (Text Embeddings): torch.Size([2, 512])
- Shape of z_p (Protein Embeddings): torch.Size([2, 512])
- Shape of z_c (Facilitated Embeddings): torch.Size([2, 512])
 
  === Norm (L2 Magnitude) Results for Batch Index 0 ===
- Norm of z_t (Text Embedding): 29.697054
- Norm of z_p (Protein Embedding): 5.337610
- Norm of z_c (Facilitated Embedding): 3.244318
 
  === Mean Squared Error (MSE) Results ===
- MSE between Facilitated Embeddings (z_c) and Protein Embeddings (z_p): 0.069909
- MSE between Text Embeddings (z_t) and Protein Embeddings (z_p): 1.612812
 
  === Max Mean Discrepancy (MMD) Results ===
- MMD between Facilitated Embeddings (z_c) and Protein Embeddings (z_p): 0.000171
- MMD between Text Embeddings (z_t) and Protein Embeddings (z_p): 0.005172
  ```
 
  ### What the Output Means
 
 
  #### Sample Output
  ```plaintext
+ Shape of z_p (protein latent): torch.Size([5, 512])
+ Shape of z_t (text latent): torch.Size([5, 512])
 
+ Magnitudes of z_p vectors: tensor([4.2894, 4.0314, 4.2747, 4.0478, 3.9959])
+ Magnitudes of z_t vectors: tensor([33.3649, 32.5055, 31.6935, 33.3630, 29.6486])
 
  === Dot Product Scores Matrix ===
+ tensor([[28.8613, -3.3248, -0.4564,  7.5766,  3.3064],
+         [-0.7815, 28.2294, 10.3146,  3.9422, 11.2805],
+         [-2.7591, 12.8974, 30.3760, -0.2481,  2.5218],
+         [10.4455,  3.6447, -3.9202, 30.2053,  7.3378],
+         [ 5.3883, 10.0869, -1.4182,  8.1128, 27.7488]])
 
  === Normalized Probabilities ===
+ Protein-Normalized Probabilities (Softmax across Proteins for each Text):
+ tensor([[1.0000e+00, 1.9778e-14, 4.0705e-14, 1.4876e-10, 2.4255e-11],
+         [1.3374e-13, 1.0000e+00, 1.9384e-09, 3.9271e-12, 7.0454e-08],
+         [1.8511e-14, 2.1949e-07, 1.0000e+00, 5.9466e-14, 1.1068e-11],
+         [1.0049e-08, 2.1039e-11, 1.2746e-15, 1.0000e+00, 1.3665e-09],
+         [6.3943e-11, 1.3208e-08, 1.5558e-14, 2.5430e-10, 1.0000e+00]])
+
+ Text-Normalized Probabilities (Softmax across Texts for each Protein):
+ tensor([[1.0000e+00, 1.0513e-14, 1.8512e-13, 5.7037e-10, 7.9733e-12],
+         [2.5160e-13, 1.0000e+00, 1.6584e-08, 2.8327e-11, 4.3569e-08],
+         [4.0702e-15, 2.5655e-08, 1.0000e+00, 5.0136e-14, 7.9997e-13],
+         [2.6208e-09, 2.9167e-12, 1.5118e-15, 1.0000e+00, 1.1715e-10],
+         [1.9452e-10, 2.1357e-08, 2.1524e-13, 2.9662e-09, 1.0000e+00]])
 
  === Homology Matrix (Dot Product of Normalized z_p) ===
+ tensor([[ 1.0000, -0.0706, -0.1477,  0.1752,  0.1810],
+         [-0.0706,  1.0000,  0.1573,  0.0197,  0.2951],
+         [-0.1477,  0.1573,  1.0000,  0.0767, -0.0990],
+         [ 0.1752,  0.0197,  0.0767,  1.0000,  0.2231],
+         [ 0.1810,  0.2951, -0.0990,  0.2231,  1.0000]])
  ```
 
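The two "Normalized Probabilities" matrices in the five-sample output above can be reproduced from the dot-product scores alone. The sketch below is not taken from `run_PenCL_inference.py` (which works on PyTorch tensors); it uses only the standard library, and it assumes that rows index proteins and columns index texts, an orientation inferred from the printed numbers rather than stated in the README. The helper names `softmax` and `transpose` are illustrative.

```python
import math

# Dot-product scores copied from the sample output (rounded to 4 decimals).
# Assumed orientation: rows are proteins, columns are texts.
scores = [
    [28.8613, -3.3248, -0.4564, 7.5766, 3.3064],
    [-0.7815, 28.2294, 10.3146, 3.9422, 11.2805],
    [-2.7591, 12.8974, 30.3760, -0.2481, 2.5218],
    [10.4455, 3.6447, -3.9202, 30.2053, 7.3378],
    [5.3883, 10.0869, -1.4182, 8.1128, 27.7488],
]


def softmax(values):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


# Softmax across proteins for each text: normalize every column.
protein_normalized = transpose([softmax(col) for col in transpose(scores)])

# Softmax across texts for each protein: normalize every row.
text_normalized = [softmax(row) for row in scores]

print(f"protein-normalized [0][1]: {protein_normalized[0][1]:.4e}")
print(f"text-normalized    [0][1]: {text_normalized[0][1]:.4e}")
```

Because the inputs are rounded, the printed values should land close to, but not necessarily exactly on, the `1.9778e-14` and `1.0513e-14` entries in the README. Each column of `protein_normalized` and each row of `text_normalized` sums to 1.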
  ## Stage 2: Facilitator Sampling
 
  1. Run sampling:
  ```bash
  python run_Facilitator_sample.py \
+ --json_path "stage2_config.json" \
  --model_path "./weights/Facilitator/BioM3_Facilitator_epoch20.bin" \
  --input_data_path "test_PenCL_embeddings.pt" \
  --output_data_path "test_Facilitator_embeddings.pt"
 
 
  ```plaintext
  === Facilitator Model Output ===
+ Shape of z_t (Text Embeddings): torch.Size([5, 512])
+ Shape of z_p (Protein Embeddings): torch.Size([5, 512])
+ Shape of z_c (Facilitated Embeddings): torch.Size([5, 512])
 
  === Norm (L2 Magnitude) Results for Batch Index 0 ===
+ Norm of z_t (Text Embedding): 33.364857
+ Norm of z_p (Protein Embedding): 4.289446
+ Norm of z_c (Facilitated Embedding): 3.976427
 
  === Mean Squared Error (MSE) Results ===
+ MSE between Facilitated Embeddings (z_c) and Protein Embeddings (z_p): 0.013486
+ MSE between Text Embeddings (z_t) and Protein Embeddings (z_p): 1.937837
 
  === Max Mean Discrepancy (MMD) Results ===
+ MMD between Facilitated Embeddings (z_c) and Protein Embeddings (z_p): 0.000009
+ MMD between Text Embeddings (z_t) and Protein Embeddings (z_p): 0.004736
  ```
 
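As a consistency check on the updated numbers, the text-versus-protein MSE in the Facilitator output can be recovered from the Stage 1 sample output using the identity ‖a − b‖² = ‖a‖² + ‖b‖² − 2·a·b. The sketch below rests on two assumptions that the diff does not confirm: that Stage 2 was run on the same five embeddings printed in Stage 1 (the batch-index-0 norms match), and that the reported MSE is averaged over all batch × dimension elements. The name `squared_distance` is illustrative.

```python
# Values copied from the Stage 1 sample output (rounded to 4 decimals).
z_t_norms = [33.3649, 32.5055, 31.6935, 33.3630, 29.6486]
z_p_norms = [4.2894, 4.0314, 4.2747, 4.0478, 3.9959]
# Diagonal of the dot-product scores matrix: each protein with its own text.
paired_dots = [28.8613, 28.2294, 30.3760, 30.2053, 27.7488]
embedding_dim = 512  # from torch.Size([5, 512])


def squared_distance(norm_a, norm_b, dot_ab):
    """||a - b||^2 expressed through the two norms and the dot product."""
    return norm_a ** 2 + norm_b ** 2 - 2.0 * dot_ab


total = sum(
    squared_distance(t, p, d)
    for t, p, d in zip(z_t_norms, z_p_norms, paired_dots)
)

# Assumption: MSE is the mean over every element of the (5, 512) difference.
mse_text_vs_protein = total / (len(paired_dots) * embedding_dim)
print(f"MSE(z_t, z_p) reconstructed: {mse_text_vs_protein:.4f}")
```

The reconstructed value comes out near the reported `1.937837`, which suggests the Stage 1 and Stage 2 sample outputs in the README were generated from the same inputs. The `z_c` figures cannot be checked this way because no dot products involving `z_c` are printed.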
  ### What the Output Means
run_PenCL_inference.py CHANGED
@@ -33,18 +33,22 @@ def prepare_model(config_args, model_path) -> nn.Module:
 
  # Step 4: Prepare Test Dataset
  def load_test_dataset(config_args):
  test_dict = {
- 'primary_Accession': ['A0A009IHW8', 'A0A023I7E1'],
- 'protein_sequence': [
- "MSLEQKKGADIISKILQIQNSIGKTTSPSTLKTKLSEISRKEQENARIQSKL...",
- "MRFQVIVAAATITMITSYIPGVASQSTSDGDDLFVPVSNFDPKSIFPEIKHP..."
- ],
- '[final]text_caption': [
- "PROTEIN NAME: 2' cyclic ADP-D-ribose synthase AbTIR...",
- "PROTEIN NAME: Glucan endo-1,3-beta-D-glucosidase 1..."
- ],
- 'pfam_label': ["['PF13676']", "['PF17652','PF03639']"]
- }
  test_df = pd.DataFrame(test_dict)
  test_dataset = prep.TextSeqPairing_Dataset(args=config_args, df=test_df)
  return test_dataset
@@ -77,6 +81,7 @@ def compute_homology_matrix(z_p_tensor):
 
  # Main Execution
  if __name__ == '__main__':
  # Parse arguments
  config_args_parser = parse_arguments()
 
 
 
 
  # Step 4: Prepare Test Dataset
  def load_test_dataset(config_args):
+
  test_dict = {
+ 'primary_Accession': ["P69222", "B5XIP6", "B5XJL3", "B5Y368", "B5YH59"],
+ 'protein_sequence': ["MAKEDNIEMQGTVLETLPNTMFRVELENGHVVTAHISGKMRKNYIRILTGDKVTVELTPYDLSKGRIVFRSR",
+ "MVKMIVGLGNPGSKYEKTKHNIGFMAIDNIVKNLDVTFTDDKNFKAQIGSTFINHEKVYFVKPTTFMNNSGIAVKALLTYYNIDITDLIVIYDDLDMEVSKLRLRSKGSAGGHNGIKSIIAHIGTQEFNRIKVGIGRPLKGMTVINHVMGQFNTEDNIAISLTLDRVVNAVKFYLQENDFEKTMQKFNG",
+ "MTDYPIKYRLIKTEKHTGARLGEIITPHGTFPTPMFMPVGTQATVKTQSPEELKAIGSGIILSNTYHLWLRPGDELIARSGGLHKFMNWDQPILTDSGGFQVYSLADSRNITEEGVTFKNHLNGSKMFLSPEKAISIQNNLGSDIMMSFDECPQFYQPYDYVKKSIERTSRWAERGLKAHRRPHDQGLFGIVQGAGFEDLRRQSAADLVAMDFPGYSIGGLAVGESHEEMNAVLDFTTPLLPENKPRYLMGVGAPDSLIDGVIRGVDMFDCVLPTRIARNGTCMTSEGRLVVKNAKFAEDFTPLDHDCDCYTCQNYSRAYIRHLLKADETFGIRLTSYHNLYFLVNLMKKVRQAIMDDNLLEFRQDFLERYGYNKSNRNF",
+ "MAAKDVKFGNDARVKMLRGVNVLADAVKVTLGPKGRNVVLDKSFGAPTITKDGVSVAREIELEDKFENMGAQMVKEVASKANDAAGDGTTTATVLAQAIVNEGLKAVAAGMNPMDLKRGIDKAVIAAVEELKALSVPCSDSKAIAQVGTISANSDETVGKLIAEAMDKVGKEGVITVEDGTGLEDELDVVEGMQFDRGYLSPYFINKPDTGAVELESPFILLADKKISNIREMLPVLEAVAKAGKPLVIIAEDVEGEALATLVVNTMRGIVKVAAVKAPGFGDRRKAMLQDIATLTGGTVISEEIGMELEKATLEDLGQAKRVVINKDTTTIIDGVGEESAIQGRVAQIRKQIEEATSDYDREKLQERVAKLAGGVAVIKVGAATEVEMKEKKARVDDALHATRAAVEEGVVAGGGVALVRVAAKLAGLTGQNEDQNVGIKVALRAMEAPLRQIVSNAGEEPSVVANNVKAGDGNYGYNAATEEYGNMIDFGILDPTKVTRSALQYAASVAGLMITTECMVTDLPKGDAPDLGAAGGMGGMGGMGGMM",
+ "MGKAIGIDLGTTNSVVAVVVGGEPVVIPNQEGQRTTPSVVAFTDKGERLVGQVAKRQAITNPENTIFSIKRLMGRKYNSQEVQEAKKRLPYKIVEAPNGDAHVEIMGKRYSPPEISAMILQKLKQAAEDYLGEPVTEAVITVPAYFDDSQRQATKDAGRIAGLNVLRIINEPTAAALAYGLDKKKEEKIAVYDLGGGTFDISILEIGEGVIEVKATNGDTYLGGDDFDIRVMDWLIEEFKKQEGIDLRKDRMALQRLKEAAERAKIELSSAMETEINLPFITADASGPKHLLMKLTRAKLEQLVDDLIQKSLEPCKKALSDAGLSQSQIDEVILVGGQTRTPKVQKVVQDFFGKEPHKGVNPDEVVAVGAAIQAAILKGEVKEVLLLDVTPLSLGIETLGGVFTKIIERNTTIPTKKSQIFTTAADNQTAVTIKVYQGEREMAADNKLLGVFELVGIPPAPRGIPQIEVTFDIDANGILHVSAKDLATGKEQSIRITASSGLSEEEIKKMIREAEAHAEEDRRKKQIAEARNEADNMIYTVEKTLRDMGDRISEDERKRIEEAIEKCRRIKDTSNDVNEIKAAVEELAKASHRVAEELYKKAGASQQGAGSTTQSKKEEDVIEAEVEDKDNK"],
+ '[final]text_caption': ["PROTEIN NAME: Translation initiation factor IF-1. FUNCTION: One of the essential components for the initiation of protein synthesis. Binds in the vicinity of the A-site. Stabilizes the binding of IF-2 and IF-3 on the 30S subunit to which N-formylmethionyl-tRNA(fMet) subsequently binds. Helps modulate mRNA selection, yielding the 30S pre-initiation complex (PIC). Upon addition of the 50S ribosomal subunit, IF-1, IF-2 and IF-3 are released leaving the mature 70S translation initiation complex. SUBUNIT: Component of the 30S ribosomal translation pre-initiation complex which assembles on the 30S ribosome in the order IF-2 and IF-3, IF-1 and N-formylmethionyl-tRNA(fMet); mRNA recruitment can occur at any time during PIC assembly. SUBCELLULAR LOCATION: Cytoplasm. SIMILARITY: Belongs to the IF-1 family. LINEAGE: The organism lineage is Bacteria, Pseudomonadota, Gammaproteobacteria, Enterobacterales, Enterobacteriaceae, Escherichia. FAMILY NAMES: Family names are Translation initiation factor 1A / IF-1.",
+ "PROTEIN NAME: Peptidyl-tRNA hydrolase. FUNCTION: The natural substrate for this enzyme may be peptidyl-tRNAs which drop off the ribosome during protein synthesis. CATALYTIC ACTIVITY: an N-acyl-L-alpha-aminoacyl-tRNA + H2O = a tRNA + an N-acyl-L-amino acid + H(+). SUBUNIT: Monomer. SUBCELLULAR LOCATION: Cytoplasm. SIMILARITY: Belongs to the PTH family. LINEAGE: The organism lineage is Bacteria, Bacillota, Bacilli, Lactobacillales, Streptococcaceae, Streptococcus. FAMILY NAMES: Family names are Peptidyl-tRNA hydrolase.",
+ "PROTEIN NAME: Queuine tRNA-ribosyltransferase. FUNCTION: Catalyzes the base-exchange of a guanine (G) residue with the queuine precursor 7-aminomethyl-7-deazaguanine (PreQ1) at position 34 (anticodon wobble position) in tRNAs with GU(N) anticodons (tRNA-Asp, -Asn, -His and -Tyr). Catalysis occurs through a double-displacement mechanism. The nucleophile active site attacks the C1' of nucleotide 34 to detach the guanine base from the RNA, forming a covalent enzyme-RNA intermediate. The proton acceptor active site deprotonates the incoming PreQ1, allowing a nucleophilic attack on the C1' of the ribose to form the product. After dissociation, two additional enzymatic reactions on the tRNA convert PreQ1 to queuine (Q), resulting in the hypermodified nucleoside queuosine (7-(((4,5-cis-dihydroxy-2-cyclopenten-1-yl)amino)methyl)-7-deazaguanosine). CATALYTIC ACTIVITY: 7-aminomethyl-7-carbaguanine + guanosine(34) in tRNA = 7-aminomethyl-7-carbaguanosine(34) in tRNA + guanine. COFACTOR: Binds 1 zinc ion per subunit. PATHWAY: tRNA modification; tRNA-queuosine biosynthesis. SUBUNIT: Homodimer. Within each dimer, one monomer is responsible for RNA recognition and catalysis, while the other monomer binds to the replacement base PreQ1. SIMILARITY: Belongs to the queuine tRNA-ribosyltransferase family. LINEAGE: The organism lineage is Bacteria, Bacillota, Bacilli, Lactobacillales, Streptococcaceae, Streptococcus. FAMILY NAMES: Family names are Queuine tRNA-ribosyltransferase.",
+ "PROTEIN NAME: Chaperonin GroEL. FUNCTION: Together with its co-chaperonin GroES, plays an essential role in assisting protein folding. The GroEL-GroES system forms a nano-cage that allows encapsulation of the non-native substrate proteins and provides a physical environment optimized to promote and accelerate protein folding. CATALYTIC ACTIVITY: ATP + H2O + a folded polypeptide = ADP + phosphate + an unfolded polypeptide. SUBUNIT: Forms a cylinder of 14 subunits composed of two heptameric rings stacked back-to-back. Interacts with the co-chaperonin GroES. SUBCELLULAR LOCATION: Cytoplasm. SIMILARITY: Belongs to the chaperonin (HSP60) family. LINEAGE: The organism lineage is Bacteria, Pseudomonadota, Gammaproteobacteria, Enterobacterales, Enterobacteriaceae, Klebsiella/Raoultella group, Klebsiella. FAMILY NAMES: Family names are TCP-1/cpn60 chaperonin family.",
+ "PROTEIN NAME: Chaperone protein DnaK. FUNCTION: Acts as a chaperone. INDUCTION: By stress conditions e.g. heat shock. SIMILARITY: Belongs to the heat shock protein 70 family. LINEAGE: The organism lineage is Bacteria, Nitrospirae, Thermodesulfovibrionia, Thermodesulfovibrionales, Thermodesulfovibrionaceae, Thermodesulfovibrio. FAMILY NAMES: Family names are Hsp70 protein."],
+ "pfam_label": ["['PF01176']", "['PF01195']", "['PF01702']", "['PF00118']", "['PF00012']"]
+ }
+
  test_df = pd.DataFrame(test_dict)
  test_dataset = prep.TextSeqPairing_Dataset(args=config_args, df=test_df)
  return test_dataset
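Each `pfam_label` entry in `test_dict` is a Python list literal stored as a string. How `prep.TextSeqPairing_Dataset` consumes these strings is not visible in this diff. If downstream code needs real lists, one option is `ast.literal_eval`, sketched below with a hypothetical helper named `parse_pfam_label`. Note that `literal_eval` only accepts straight ASCII quotes, so a typographic quote inside a label string would be rejected.

```python
import ast


def parse_pfam_label(raw):
    """Decode a string such as "['PF17652','PF03639']" into a list of Pfam IDs.

    Hypothetical helper for illustration; not part of the repository.
    """
    try:
        labels = ast.literal_eval(raw)
    except (SyntaxError, ValueError) as err:
        raise ValueError(f"Malformed pfam_label: {raw!r}") from err
    if not isinstance(labels, list) or not all(isinstance(x, str) for x in labels):
        raise ValueError(f"Expected a list of strings, got: {raw!r}")
    return labels


print(parse_pfam_label("['PF17652','PF03639']"))
```

Validating the labels when the test DataFrame is built would surface a quoting mistake immediately instead of deep inside the dataset code.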
 
 
  # Main Execution
  if __name__ == '__main__':
+
  # Parse arguments
  config_args_parser = parse_arguments()
 
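The second hunk header names `compute_homology_matrix(z_p_tensor)`, but its body is outside this diff. Going only by the README heading "Homology Matrix (Dot Product of Normalized z_p)", the computation is presumably a cosine-similarity matrix over the protein embeddings. The following is a standard-library sketch of that idea, not the repository's implementation; `l2_normalize`, `homology_matrix`, and the toy vectors are illustrative, and every input vector is assumed to be non-zero.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit L2 norm (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def homology_matrix(z_p):
    """Pairwise dot products of unit-normalized rows (cosine similarity)."""
    unit = [l2_normalize(v) for v in z_p]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


# Toy 2-D "embeddings" standing in for the 512-D protein latents.
toy_z_p = [[3.0, 4.0], [4.0, 3.0], [-3.0, -4.0]]
for row in homology_matrix(toy_z_p):
    print([round(x, 4) for x in row])
```

The result is symmetric with ones on the diagonal, the same pattern as the matrix in the README sample output, and entries range from -1 (opposite direction) to 1 (same direction).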