---
license: apache-2.0
---


# BioM3: Biological Multi-Modal Model for Protein Design

## Citation

If you use this code, please cite:

```plaintext
Natural Language Prompts Guide the Design of Novel Functional Protein Sequences
bioRxiv 2024.11.11.622734
doi: https://doi.org/10.1101/2024.11.11.622734
```

[Read the paper on bioRxiv](https://www.biorxiv.org/content/10.1101/2024.11.11.622734v1)

## Software Requirements

### Required Dependencies
- Python 3.8 or later
- PyTorch (latest stable version)
- PyTorch Lightning
- pandas
- pyyaml

### Installation

Create and activate a conda environment:
```bash
conda create -n BioM3_env python=3.8
conda activate BioM3_env
```

Install the required packages:
```bash
conda install pytorch pytorch-lightning pandas pyyaml -c pytorch -c conda-forge
```
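To confirm the environment is complete before running anything, a short check like the one below can help. It is a minimal sketch, not part of the repository: `check_dependencies` is a hypothetical helper, and the import names (`torch`, `pytorch_lightning`, `pandas`, `yaml`) are the usual module names for the packages listed above.

```python
import importlib.util

# Import names for the packages listed under "Required Dependencies".
REQUIRED_MODULES = {
    "PyTorch": "torch",
    "PyTorch Lightning": "pytorch_lightning",
    "pandas": "pandas",
    "pyyaml": "yaml",
}


def check_dependencies(modules=REQUIRED_MODULES):
    """Return {package name: bool} saying whether each module can be located."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in modules.items()
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

The check only locates each module without importing it, so it runs quickly even when PyTorch is installed.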

## Stage 1: PenCL Inference

### Overview

This stage shows how to run inference with the **BioM3 PenCL model**, which aligns protein sequences with text descriptions. The model computes a latent embedding for each input, scores every protein-text pair by **dot product**, and normalizes the scores with a softmax.

### Model Weights

Before running the model, ensure you have:
- Configuration file: `stage1_config.json`
- Pre-trained weights: `BioM3_PenCL_epoch20.bin`
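A small pre-flight check can catch a missing file before the model is loaded. The sketch below is not part of the repository: `missing_files` is a hypothetical helper, and the two paths are the ones passed to the run command in this README, so adjust them if your layout differs.

```python
from pathlib import Path


def missing_files(config_path, weights_path):
    """Return the given paths (as strings) that do not exist as regular files."""
    paths = (Path(config_path), Path(weights_path))
    return [str(p) for p in paths if not p.is_file()]


if __name__ == "__main__":
    missing = missing_files(
        "stage1_config.json",
        "./weights/PenCL/BioM3_PenCL_epoch20.bin",
    )
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Config and weights found.")
```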

### Running the Model

1. Clone the repository:
```bash
git clone https://huggingface.co/your_username/BioM3_PenCL
cd BioM3_PenCL
```

2. Run inference:
```bash
python run_PenCL_inference.py \
    --json_path "stage1_config.json" \
    --model_path "./weights/PenCL/BioM3_PenCL_epoch20.bin"
```

### Example Input Data

The script demonstrates inference using two protein-text pairs from the SwissProt dataset:

**Pair 1:**
- **Protein Sequence:** MSLEQKKGADIISKILQIQNSIGKTTSPSTLKTKLSEISRKEQENARIQSKL...
- **Text Description:** PROTEIN NAME: 2' cyclic ADP-D-ribose synthase AbTIR...

**Pair 2:**
- **Protein Sequence:** MRFQVIVAAATITMITSYIPGVASQSTSDGDDLFVPVSNFDPKSIFPEIKHP...
- **Text Description:** PROTEIN NAME: Glucan endo-1,3-beta-D-glucosidase 1...

These pairs show how the model aligns protein sequences with their functional descriptions. The model embeds each sequence and each description, then scores every sequence against every description by dot product. In the sample output, the matched pairs (the diagonal of the score matrix) receive the highest scores.
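The scoring step can be illustrated without the model itself. The sketch below uses hypothetical 3-dimensional toy vectors in place of the real 512-dimensional embeddings and plain Python in place of PyTorch. It assumes rows index proteins and columns index texts, which this README does not state, and it is not the code in `run_PenCL_inference.py`.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_norm(v):
    """Euclidean length (magnitude) of a vector."""
    return math.sqrt(dot(v, v))


def unit(v):
    """Scale a vector to length 1."""
    n = l2_norm(v)
    return [x / n for x in v]


def score_matrix(rows, cols):
    """scores[i][j] = dot(rows[i], cols[j])."""
    return [[dot(r, c) for c in cols] for r in rows]


# Hypothetical toy embeddings (assumption: two proteins, two texts).
z_p = [[3.0, 4.0, 0.0], [0.0, 1.0, 2.0]]
z_t = [[6.0, 8.0, 1.0], [0.0, 2.0, 5.0]]

scores = score_matrix(z_p, z_t)            # protein-text dot products
magnitudes = [l2_norm(v) for v in z_p]     # L2 norm of each protein vector
unit_p = [unit(v) for v in z_p]
homology = score_matrix(unit_p, unit_p)    # dot products of normalized z_p

print(scores)  # [[50.0, 8.0], [10.0, 12.0]]
```

Because the homology matrix is built from unit-length vectors, its diagonal is 1 and it is symmetric, as in the sample output.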

### Expected Output

The script prints the following:

1. **Latent Embedding Shapes**
   - `z_p`: Protein sequence embeddings
   - `z_t`: Text description embeddings

2. **Vector Magnitudes**
   - L2 norms of both embedding types

3. **Dot Product Scores**
   - Similarity matrix between embeddings

4. **Normalized Probabilities**
   - Protein-normalized: softmax along dimension 0, so each column sums to 1
   - Text-normalized: softmax along dimension 1, so each row sums to 1

5. **Homology Matrix**
   - Dot products between normalized `z_p` vectors

#### Sample Output
```plaintext
=== Inference Results ===
Shape of z_p (protein latent): torch.Size([2, 512])
Shape of z_t (text latent): torch.Size([2, 512])

Magnitudes of z_p vectors: tensor([5.3376, 4.8237])
Magnitudes of z_t vectors: tensor([29.6971, 27.6714])

=== Dot Product Scores Matrix ===
tensor([[ 7.3152,  1.8080],
        [ 3.3922, 16.6157]])

=== Normalized Probabilities ===
Protein-Normalized Probabilities:
tensor([[9.8060e-01, 3.7078e-07],
        [1.9398e-02, 1.0000e+00]])

Text-Normalized Probabilities:
tensor([[9.9596e-01, 4.0412e-03],
        [1.8076e-06, 1.0000e+00]])

=== Homology Matrix (Dot Product of Normalized z_p) ===
tensor([[1.0000, 0.1840],
        [0.1840, 1.0000]])
```
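The two normalized matrices in the sample output can be reproduced from the dot product scores alone, which makes the direction of each softmax explicit. The sketch below is a dependency-free illustration in plain Python, not the script's own implementation, and the helper names are hypothetical. In PyTorch the equivalent calls would presumably be `torch.softmax(scores, dim=0)` and `torch.softmax(scores, dim=1)`.

```python
import math


def softmax(values):
    """Numerically stable softmax of a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_dim1(matrix):
    """Softmax within each row, so every row sums to 1."""
    return [softmax(row) for row in matrix]


def softmax_dim0(matrix):
    """Softmax within each column, so every column sums to 1."""
    columns = [softmax(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]


# Dot product scores copied from the sample output above.
scores = [[7.3152, 1.8080], [3.3922, 16.6157]]

protein_normalized = softmax_dim0(scores)  # matches "Protein-Normalized"
text_normalized = softmax_dim1(scores)     # matches "Text-Normalized"

for name, matrix in [("protein", protein_normalized), ("text", text_normalized)]:
    print(name, [[f"{x:.4e}" for x in row] for row in matrix])
```

In both matrices the diagonal entries dominate, meaning each protein is most strongly associated with its own description and vice versa.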

## Stage 2: Facilitator Sampling

🚧 **Coming Soon** 🚧

This stage will contain scripts and models for the Facilitator Sampling process. Check back for:
- Configuration files
- Model weights
- Running instructions
- Output examples

## Stage 3: ProteoScribe

🚧 **Coming Soon** 🚧

This stage will contain scripts and models for the ProteoScribe process. Check back for:
- Configuration files
- Model weights
- Running instructions
- Output examples

## Support

For questions or issues:
- Open an issue in this repository
- Contact: [Your contact information]

---
Repository maintained by the BioM3 Team