Niksa Praljak committed
Commit: eca78a8
Parent(s): 0655b48

update README.md and include PenCL inference script

Browse files:
- README.md +134 -0
- run_PenCL_inference.py +120 -0
README.md
CHANGED
@@ -1,3 +1,137 @@
---
license: apache-2.0
---

# BioM3: Protein Language Model Pipeline

## Citation

If you use this code, please cite:

```bibtex
@article{biom3_2024,
  title   = {Natural Language Prompts Guide the Design of Novel Functional Protein Sequences},
  journal = {bioRxiv},
  year    = {2024},
  doi     = {10.1101/2024.11.11.622734}
}
```

[Read the paper on bioRxiv](https://www.biorxiv.org/content/10.1101/2024.11.11.622734v1)

## Software Requirements

### Required Dependencies

- Python 3.8 or later
- PyTorch (latest stable version)
- PyTorch Lightning
- pandas
- pyyaml

### Installation

Create and activate a conda environment:

```bash
conda create -n BioM3_env python=3.8
conda activate BioM3_env
```

Install the required packages:

```bash
conda install pytorch pytorch-lightning pandas pyyaml -c pytorch -c conda-forge
```
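
To confirm the environment resolves correctly, an optional sanity check (not part of the original instructions) is to import the core dependencies:

```python
# Optional: verify that the core dependencies import cleanly.
import torch
import pytorch_lightning
import pandas
import yaml

print("torch:", torch.__version__)
print("pytorch-lightning:", pytorch_lightning.__version__)
print("pandas:", pandas.__version__)
```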

## Stage 1: PenCL Inference

### Overview

This stage demonstrates how to perform inference with the **BioM3 PenCL model**, which aligns protein sequences with natural-language text descriptions. The model computes a latent embedding for each input and scores sequence-text pairs by their **dot products**, which are then normalized into probabilities with a softmax.
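
To make the scoring concrete, here is a minimal sketch of the similarity computation (illustrative only; random tensors stand in for the embeddings the model actually produces):

```python
import torch
import torch.nn.functional as F

# Stand-ins for the model's joint latents: one row per sample.
z_p = torch.randn(2, 512)  # protein embeddings
z_t = torch.randn(2, 512)  # text embeddings

scores = z_p @ z_t.T                      # (2, 2) dot-product similarity matrix
protein_probs = F.softmax(scores, dim=0)  # normalized over proteins, per text
text_probs = F.softmax(scores, dim=1)     # normalized over texts, per protein
```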

### Model Weights

Before running the model, ensure you have:
- Configuration file: `stage1_config.json`
- Pre-trained weights: `BioM3_PenCL_epoch20.bin`
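
If these files are hosted on the Hugging Face Hub, they can also be fetched programmatically. A minimal sketch (the repo id below is the same placeholder used in the clone step):

```python
from huggingface_hub import hf_hub_download

# Placeholder repo id: substitute the actual BioM3 repository.
config_path = hf_hub_download(repo_id="your_username/BioM3_PenCL",
                              filename="stage1_config.json")
weights_path = hf_hub_download(repo_id="your_username/BioM3_PenCL",
                               filename="BioM3_PenCL_epoch20.bin")
```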

### Running the Model

1. Clone the repository:

```bash
git clone https://huggingface.co/your_username/BioM3_PenCL
cd BioM3_PenCL
```

2. Run inference:

```bash
python run_PenCL_inference.py \
    --json_path "stage1_config.json" \
    --model_path "BioM3_PenCL_epoch20.bin"
```

### Expected Output

The script prints the following:

1. **Latent Embedding Shapes**
   - `z_p`: protein sequence embeddings
   - `z_t`: text description embeddings

2. **Vector Magnitudes**
   - L2 norms of both embedding types

3. **Dot Product Scores**
   - Similarity matrix between embeddings (rows: proteins, columns: texts)

4. **Normalized Probabilities**
   - Protein-normalized (softmax over the protein axis, so each column sums to 1)
   - Text-normalized (softmax over the text axis, so each row sums to 1)
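
As a quick check of the normalization convention, here is a sketch using the score matrix from the sample output below:

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[ 7.3152,  1.8080],
                       [ 3.3922, 16.6157]])

print(F.softmax(scores, dim=0).sum(dim=0))  # columns sum to 1 (protein-normalized)
print(F.softmax(scores, dim=1).sum(dim=1))  # rows sum to 1 (text-normalized)
```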

#### Sample Output

```plaintext
=== Inference Results ===
Shape of z_p (protein latent): torch.Size([2, 512])
Shape of z_t (text latent): torch.Size([2, 512])

Magnitudes of z_p vectors: tensor([5.3376, 4.8237])
Magnitudes of z_t vectors: tensor([29.6971, 27.6714])

=== Dot Product Scores Matrix ===
tensor([[ 7.3152,  1.8080],
        [ 3.3922, 16.6157]])

=== Normalized Probabilities ===
Protein-Normalized Probabilities:
tensor([[9.8060e-01, 3.7078e-07],
        [1.9398e-02, 1.0000e+00]])

Text-Normalized Probabilities:
tensor([[9.9596e-01, 4.0412e-03],
        [1.8076e-06, 1.0000e+00]])
```

The large diagonal entries show that each protein scores highest against its own description, which is the expected behavior for a well-aligned joint embedding.

## Stage 2: Facilitator Sampling

🚧 **Coming Soon** 🚧

This stage will contain scripts and models for the Facilitator Sampling process. Check back for:
- Configuration files
- Model weights
- Running instructions
- Output examples

## Stage 3: ProteoScribe

🚧 **Coming Soon** 🚧

This stage will contain scripts and models for the ProteoScribe process. Check back for:
- Configuration files
- Model weights
- Running instructions
- Output examples

## Support

For questions or issues:
- Open an issue in this repository
- Contact: [Your contact information]

---
Repository maintained by the BioM3 Team
run_PenCL_inference.py
ADDED
@@ -0,0 +1,120 @@
import argparse
import json
from argparse import Namespace

import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F

import Stage1_source.preprocess as prep
import Stage1_source.model as mod

# Step 1: Load the JSON configuration file
def load_json_config(json_path):
    with open(json_path, "r") as f:
        config = json.load(f)
    return config

# Step 2: Recursively convert a (possibly nested) config dict to a Namespace
def convert_to_namespace(config_dict):
    for key, value in config_dict.items():
        if isinstance(value, dict):
            config_dict[key] = convert_to_namespace(value)
    return Namespace(**config_dict)
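
# For example (illustrative): convert_to_namespace({"model": {"dim": 512}})
# yields a Namespace `cfg` with cfg.model.dim == 512.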

# Step 3: Build the PenCL model and load the pre-trained weights (mapped to CPU)
def prepare_model(config_args, model_path) -> nn.Module:
    model = mod.pfam_PEN_CL(args=config_args)
    model.load_state_dict(torch.load(model_path, map_location="cpu"))
    model.eval()
    print("Model loaded successfully with weights!")
    return model

# Step 4: Prepare a small demo dataset of paired sequences and text captions
# (sequences and captions appear truncated with "..." in this example)
def load_test_dataset(config_args):
    test_dict = {
        'primary_Accession': ['A0A009IHW8', 'A0A023I7E1'],
        'protein_sequence': [
            "MSLEQKKGADIISKILQIQNSIGKTTSPSTLKTKLSEISRKEQENARIQSKL...",
            "MRFQVIVAAATITMITSYIPGVASQSTSDGDDLFVPVSNFDPKSIFPEIKHP..."
        ],
        '[final]text_caption': [
            "PROTEIN NAME: 2' cyclic ADP-D-ribose synthase AbTIR...",
            "PROTEIN NAME: Glucan endo-1,3-beta-D-glucosidase 1..."
        ],
        'pfam_label': ["['PF13676']", "['PF17652','PF03639']"]
    }
    test_df = pd.DataFrame(test_dict)
    test_dataset = prep.TextSeqPairing_Dataset(args=config_args, df=test_df)
    return test_dataset

# Step 5: Parse command-line arguments
def parse_arguments():
    parser = argparse.ArgumentParser(description="BioM3 Inference Script (Stage 1)")
    parser.add_argument('--json_path', type=str, required=True,
                        help="Path to the JSON configuration file (e.g. stage1_config.json)")
    parser.add_argument('--model_path', type=str, required=True,
                        help="Path to the pre-trained model weights (e.g. BioM3_PenCL_epoch20.bin)")
    return parser.parse_args()
60 |
+
|
61 |
+
# Main Execution
|
62 |
+
if __name__ == '__main__':
|
63 |
+
# Parse arguments
|
64 |
+
config_args_parser = parse_arguments()
|
65 |
+
|
66 |
+
# Load configuration
|
67 |
+
config_dict = load_json_config(config_args_parser.json_path)
|
68 |
+
config_args = convert_to_namespace(config_dict)
|
69 |
+
|
70 |
+
# Load model
|
71 |
+
model = prepare_model(config_args=config_args, model_path=config_args_parser.model_path)
|
72 |
+
|
73 |
+
# Load test dataset
|
74 |
+
test_dataset = load_test_dataset(config_args)
|
75 |
+
|
76 |
+
# Run inference and store z_t, z_p
|
77 |
+
z_t_list = []
|
78 |
+
z_p_list = []
|
79 |
+
|
80 |
+
with torch.no_grad():
|
81 |
+
for idx in range(len(test_dataset)):
|
82 |
+
batch = test_dataset[idx]
|
83 |
+
x_t, x_p = batch
|
84 |
+
outputs = model(x_t, x_p, compute_masked_logits=False) # Infer Joint-Embeddings
|
85 |
+
z_t = outputs['text_joint_latent'] # Text latent
|
86 |
+
z_p = outputs['seq_joint_latent'] # Protein latent
|
87 |
+
z_t_list.append(z_t)
|
88 |
+
z_p_list.append(z_p)
|
89 |
+
|
90 |
+
# Stack all latent vectors
|
91 |
+
z_t_tensor = torch.vstack(z_t_list) # Shape: (num_samples, latent_dim)
|
92 |
+
z_p_tensor = torch.vstack(z_p_list) # Shape: (num_samples, latent_dim)
|
93 |
+
|
94 |
+
# Compute Dot Product scores
|
95 |
+
dot_product_scores = torch.matmul(z_p_tensor, z_t_tensor.T) # Dot product
|
96 |
+
|
97 |
+
# Normalize scores into probabilities
|
98 |
+
protein_given_text_probs = F.softmax(dot_product_scores, dim=0) # Normalize across rows (proteins), for each text
|
99 |
+
text_given_protein_probs = F.softmax(dot_product_scores, dim=1) # Normalize across columns (texts), for each protein
|
100 |
+
|
101 |
+
# Compute magnitudes (L2 norms) for z_t and z_p
|
102 |
+
z_p_magnitude = torch.norm(z_p_tensor, dim=1) # L2 norm for each protein latent vector
|
103 |
+
z_t_magnitude = torch.norm(z_t_tensor, dim=1) # L2 norm for each text latent vector
|
104 |
+
|
105 |
+
# Print results
|
106 |
+
print("\n=== Inference Results ===")
|
107 |
+
print(f"Shape of z_p (protein latent): {z_p_tensor.shape}")
|
108 |
+
print(f"Shape of z_t (text latent): {z_t_tensor.shape}")
|
109 |
+
print(f"\nMagnitudes of z_p vectors: {z_p_magnitude}")
|
110 |
+
print(f"Magnitudes of z_t vectors: {z_t_magnitude}")
|
111 |
+
|
112 |
+
print("\n=== Dot Product Scores Matrix ===")
|
113 |
+
print(dot_product_scores)
|
114 |
+
|
115 |
+
print("\n=== Normalized Probabilities ===")
|
116 |
+
print("Protein-Normalized Probabilities (Softmax across Proteins for each Text):")
|
117 |
+
print(protein_given_text_probs)
|
118 |
+
|
119 |
+
print("\nText-Normalized Probabilities (Softmax across Texts for each Protein):")
|
120 |
+
print(text_given_protein_probs)
|