---
license: apache-2.0
---

# BioM3: Protein Language Model Pipeline

## Citation

If you use this code, please cite:

```bibtex
Natural Language Prompts Guide the Design of Novel Functional Protein Sequences
bioRxiv 2024.11.11.622734
doi: https://doi.org/10.1101/2024.11.11.622734
```

[Read the paper on bioRxiv](https://www.biorxiv.org/content/10.1101/2024.11.11.622734v1)

## Software Requirements

### Required Dependencies

- Python 3.8 or later
- PyTorch (latest stable version)
- PyTorch Lightning
- pandas
- pyyaml

### Installation

Create and activate a conda environment:

```bash
conda create -n BioM3_env python=3.8
conda activate BioM3_env
```

Install the required packages:

```bash
conda install pytorch pytorch-lightning pandas pyyaml -c pytorch -c conda-forge
```

## Stage 1: PenCL Inference

### Overview

This stage demonstrates how to run inference with the **BioM3 PenCL model**, which aligns protein sequences with text descriptions. The model computes latent embeddings for the given inputs, calculates **dot product scores** (similarities) between them, and normalizes those scores with a softmax.

### Model Weights

Before running the model, ensure you have:

- Configuration file: `stage1_config.json`
- Pre-trained weights: `BioM3_PenCL_epoch20.bin`

### Running the Model

1. Clone the repository:

   ```bash
   git clone https://huggingface.co/your_username/BioM3_PenCL
   cd BioM3_PenCL
   ```

2. Run inference:

   ```bash
   python run_PenCL_inference.py \
       --json_path "stage1_config.json" \
       --model_path "BioM3_PenCL_epoch20.bin"
   ```

### Expected Output

The script prints the following:

1. **Latent Embedding Shapes**
   - `z_p`: Protein sequence embeddings
   - `z_t`: Text description embeddings
2. **Vector Magnitudes**
   - L2 norms of both embedding types
3. **Dot Product Scores**
   - Similarity matrix between embeddings
4. **Normalized Probabilities**
   - Protein-normalized (softmax over rows: each column of the matrix sums to 1)
   - Text-normalized (softmax over columns: each row of the matrix sums to 1)

#### Sample Output

```plaintext
=== Inference Results ===
Shape of z_p (protein latent): torch.Size([2, 512])
Shape of z_t (text latent): torch.Size([2, 512])

Magnitudes of z_p vectors: tensor([5.3376, 4.8237])
Magnitudes of z_t vectors: tensor([29.6971, 27.6714])

=== Dot Product Scores Matrix ===
tensor([[ 7.3152,  1.8080],
        [ 3.3922, 16.6157]])

=== Normalized Probabilities ===
Protein-Normalized Probabilities:
tensor([[9.8060e-01, 3.7078e-07],
        [1.9398e-02, 1.0000e+00]])

Text-Normalized Probabilities:
tensor([[9.9596e-01, 4.0412e-03],
        [1.8076e-06, 1.0000e+00]])
```

## Stage 2: Facilitator Sampling

🚧 **Coming Soon** 🚧

This stage will contain scripts and models for the Facilitator Sampling process. Check back for:

- Configuration files
- Model weights
- Running instructions
- Output examples

## Stage 3: ProteoScribe

🚧 **Coming Soon** 🚧

This stage will contain scripts and models for the ProteoScribe process. Check back for:

- Configuration files
- Model weights
- Running instructions
- Output examples

## Support

For questions or issues:

- Open an issue in this repository
- Contact: [Your contact information]

---

Repository maintained by the BioM3 Team
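
## Appendix: Reproducing the Normalized Probabilities

The two normalizations listed under Stage 1's Expected Output can be checked by hand from the dot product scores in the sample output. The sketch below is not the code in `run_PenCL_inference.py`; it is a minimal pure-Python illustration, and the helper names `softmax` and `normalize_scores` are hypothetical. It assumes that rows of the score matrix index proteins and columns index text descriptions, which is consistent with the sample output but is not stated by the script itself.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def normalize_scores(scores):
    """Return (protein_normalized, text_normalized) for a score matrix.

    Assumption: rows index proteins, columns index text descriptions.
    """
    # Text-normalized: softmax within each row, so every row sums to 1.
    text_norm = [softmax(row) for row in scores]
    # Protein-normalized: softmax within each column, so every column sums to 1.
    columns = [softmax(list(col)) for col in zip(*scores)]
    protein_norm = [list(row) for row in zip(*columns)]
    return protein_norm, text_norm


# Dot product scores copied from the sample output.
scores = [[7.3152, 1.8080], [3.3922, 16.6157]]
protein_norm, text_norm = normalize_scores(scores)

print("Protein-normalized:", protein_norm)
print("Text-normalized:", text_norm)
```

The printed matrices should agree with the sample output to roughly four significant digits. In PyTorch, the same two results would typically come from `torch.softmax` applied along dimension 0 and dimension 1 of the score tensor, respectively.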