---
license: apache-2.0
---
# BioM3: Protein Language Model Pipeline
## Citation
If you use this code, please cite:
```plaintext
Natural Language Prompts Guide the Design of Novel Functional Protein Sequences
bioRxiv 2024.11.11.622734
doi: https://doi.org/10.1101/2024.11.11.622734
```
[Read the paper on bioRxiv](https://www.biorxiv.org/content/10.1101/2024.11.11.622734v1)
## Software Requirements
### Required Dependencies
- Python 3.8 or later
- PyTorch (a stable release compatible with your Python version)
- PyTorch Lightning
- pandas
- pyyaml
### Installation
Create and activate a conda environment:
```bash
conda create -n BioM3_env python=3.8
conda activate BioM3_env
```
Install the required packages:
```bash
conda install pytorch pytorch-lightning pandas pyyaml -c pytorch -c conda-forge
```
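To confirm the environment is ready, a quick check along the following lines may help. This is a minimal sketch rather than part of the BioM3 code: the helper name `find_missing` is hypothetical, and the only assumption is that the packages listed above are imported under their usual module names (`pyyaml` is imported as `yaml`).

```python
import importlib.util

# Import names for the dependencies listed above.
# Note: the pyyaml package is imported as "yaml".
REQUIRED_MODULES = ["torch", "pytorch_lightning", "pandas", "yaml"]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    # find_spec looks a top-level module up without importing it.
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Run it inside the activated `BioM3_env` environment; any name it reports as missing can be installed with the `conda install` command above.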
## Stage 1: PenCL Inference
### Overview
This stage demonstrates inference with the **BioM3 PenCL model**, which aligns protein sequences with text descriptions. The model computes latent embeddings for both inputs, calculates **dot product scores** (similarities) between them, and normalizes those scores into probabilities with softmax.
### Model Weights
Before running the model, ensure you have:
- Configuration file: `stage1_config.json`
- Pre-trained weights: `BioM3_PenCL_epoch20.bin`
### Running the Model
1. Clone the repository:
```bash
git clone https://huggingface.co/your_username/BioM3_PenCL
cd BioM3_PenCL
```
2. Run inference:
```bash
python run_PenCL_inference.py \
--json_path "stage1_config.json" \
--model_path "BioM3_PenCL_epoch20.bin"
```
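Before launching inference, it can be useful to verify that both input files are in place. The sketch below is an illustration only and is not shipped with the repository: `check_inputs` is a hypothetical helper that confirms the configuration file exists and parses as JSON and that the weights file exists. It does not load or validate the weights themselves.

```python
import json
from pathlib import Path


def check_inputs(json_path, model_path):
    """Return a list of problems with the two input files (empty if none)."""
    problems = []
    config_file = Path(json_path)
    weights_file = Path(model_path)

    if not config_file.is_file():
        problems.append(f"config not found: {config_file}")
    else:
        try:
            # Only checks that the file is syntactically valid JSON.
            json.loads(config_file.read_text())
        except json.JSONDecodeError as err:
            problems.append(f"config is not valid JSON: {err}")

    if not weights_file.is_file():
        problems.append(f"weights not found: {weights_file}")
    return problems


if __name__ == "__main__":
    for problem in check_inputs("stage1_config.json", "BioM3_PenCL_epoch20.bin"):
        print(problem)
```

An empty result means both files were found and the configuration parsed cleanly.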
### Expected Output
The script provides the following outputs:
1. **Latent Embedding Shapes**
- `z_p`: Protein sequence embeddings
- `z_t`: Text description embeddings
2. **Vector Magnitudes**
- L2 norms of both embedding types
3. **Dot Product Scores**
- Similarity matrix between embeddings
4. **Normalized Probabilities**
- Protein-normalized: softmax taken across rows, so each column sums to 1 (as in the sample output below)
- Text-normalized: softmax taken across columns, so each row sums to 1
#### Sample Output
```plaintext
=== Inference Results ===
Shape of z_p (protein latent): torch.Size([2, 512])
Shape of z_t (text latent): torch.Size([2, 512])
Magnitudes of z_p vectors: tensor([5.3376, 4.8237])
Magnitudes of z_t vectors: tensor([29.6971, 27.6714])
=== Dot Product Scores Matrix ===
tensor([[ 7.3152, 1.8080],
[ 3.3922, 16.6157]])
=== Normalized Probabilities ===
Protein-Normalized Probabilities:
tensor([[9.8060e-01, 3.7078e-07],
[1.9398e-02, 1.0000e+00]])
Text-Normalized Probabilities:
tensor([[9.9596e-01, 4.0412e-03],
[1.8076e-06, 1.0000e+00]])
```
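The two normalized matrices can be reproduced from the dot product scores by hand. The sketch below uses plain Python rather than PyTorch and is not taken from `run_PenCL_inference.py`. The choice of axis for each normalization is an assumption inferred from the sample numbers, where the protein-normalized columns and the text-normalized rows each sum to 1.

```python
import math

# Dot product scores copied from the sample output above.
scores = [
    [7.3152, 1.8080],
    [3.3922, 16.6157],
]


def softmax(values):
    """Numerically stable softmax over a flat list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Text-normalized: softmax within each row, so every row sums to 1.
text_norm = [softmax(row) for row in scores]

# Protein-normalized: softmax within each column, so every column sums to 1.
# Transpose, normalize each column, then transpose back.
columns = [list(col) for col in zip(*scores)]
protein_norm = [list(row) for row in zip(*[softmax(col) for col in columns])]

if __name__ == "__main__":
    print("Protein-normalized:", protein_norm)
    print("Text-normalized:", text_norm)
```

Up to rounding, the results agree with the sample output: for example, the top-left entries are about 0.9806 (protein-normalized) and 0.9960 (text-normalized).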
## Stage 2: Facilitator Sampling
🚧 **Coming Soon** 🚧
This stage will contain scripts and models for the Facilitator Sampling process. Check back for:
- Configuration files
- Model weights
- Running instructions
- Output examples
## Stage 3: ProteoScribe
🚧 **Coming Soon** 🚧
This stage will contain scripts and models for the ProteoScribe process. Check back for:
- Configuration files
- Model weights
- Running instructions
- Output examples
## Support
For questions or issues:
- Open an issue in this repository
- Contact: [Your contact information]
---
Repository maintained by the BioM3 Team