|
---
|
|
license: apache-2.0
|
|
---
|
|
|
|
# BioM3: Protein Language Model Pipeline
|
|
|
|
## Citation
|
|
|
|
If you use this code, please cite:
|
|
|
|
```bibtex
|
|
Natural Language Prompts Guide the Design of Novel Functional Protein Sequences
|
|
bioRxiv 2024.11.11.622734
|
|
doi: https://doi.org/10.1101/2024.11.11.622734
|
|
```
|
|
|
|
[Read the paper on bioRxiv](https://www.biorxiv.org/content/10.1101/2024.11.11.622734v1)
|
|
|
|
## Software Requirements
|
|
|
|
### Required Dependencies
|
|
- Python 3.8 or later
|
|
- PyTorch (latest stable version)
|
|
- PyTorch Lightning
|
|
- pandas
|
|
- pyyaml
|
|
|
|
### Installation
|
|
|
|
Create and activate a conda environment:
|
|
```bash
|
|
conda create -n BioM3_env python=3.8
|
|
conda activate BioM3_env
|
|
```
|
|
|
|
Install the required packages:
|
|
```bash
|
|
conda install pytorch pytorch-lightning pandas pyyaml -c pytorch -c conda-forge
|
|
```
|
|
|
|
## Stage 1: PenCL Inference
|
|
|
|
### Overview
|
|
|
|
This stage demonstrates how to perform inference using the **BioM3 PenCL model** for aligning protein sequences and text descriptions. The model computes latent embeddings for the given inputs and calculates **dot product scores** (similarities) with normalization.
|
|
|
|
### Model Weights
|
|
|
|
Before running the model, ensure you have:
|
|
- Configuration file: `stage1_config.json`
|
|
- Pre-trained weights: `BioM3_PenCL_epoch20.bin`
|
|
|
|
### Running the Model
|
|
|
|
1. Clone the repository:
|
|
```bash
|
|
git clone https://huggingface.co/your_username/BioM3_PenCL
|
|
cd BioM3_PenCL
|
|
```
|
|
|
|
2. Run inference:
|
|
```bash
|
|
python run_PenCL_inference.py \
|
|
--json_path "stage1_config.json" \
|
|
--model_path "BioM3_PenCL_epoch20.bin"
|
|
```
|
|
|
|
### Expected Output
|
|
|
|
The script provides the following outputs:
|
|
|
|
1. **Latent Embedding Shapes**
|
|
- `z_p`: Protein sequence embeddings
|
|
- `z_t`: Text description embeddings
|
|
|
|
2. **Vector Magnitudes**
|
|
- L2 norms of both embedding types
|
|
|
|
3. **Dot Product Scores**
|
|
- Similarity matrix between embeddings
|
|
|
|
4. **Normalized Probabilities**
|
|
- Protein-normalized (softmax over rows)
|
|
- Text-normalized (softmax over columns)
|
|
|
|
#### Sample Output
|
|
```plaintext
|
|
=== Inference Results ===
|
|
Shape of z_p (protein latent): torch.Size([2, 512])
|
|
Shape of z_t (text latent): torch.Size([2, 512])
|
|
|
|
Magnitudes of z_p vectors: tensor([5.3376, 4.8237])
|
|
Magnitudes of z_t vectors: tensor([29.6971, 27.6714])
|
|
|
|
=== Dot Product Scores Matrix ===
|
|
tensor([[ 7.3152, 1.8080],
|
|
[ 3.3922, 16.6157]])
|
|
|
|
=== Normalized Probabilities ===
|
|
Protein-Normalized Probabilities:
|
|
tensor([[9.8060e-01, 3.7078e-07],
|
|
[1.9398e-02, 1.0000e+00]])
|
|
|
|
Text-Normalized Probabilities:
|
|
tensor([[9.9596e-01, 4.0412e-03],
|
|
[1.8076e-06, 1.0000e+00]])
|
|
```
|
|
|
|
## Stage 2: Facilitator Sampling
|
|
|
|
π§ **Coming Soon** π§
|
|
|
|
This stage will contain scripts and models for the Facilitator Sampling process. Check back for:
|
|
- Configuration files
|
|
- Model weights
|
|
- Running instructions
|
|
- Output examples
|
|
|
|
## Stage 3: ProteoScribe
|
|
|
|
π§ **Coming Soon** π§
|
|
|
|
This stage will contain scripts and models for the ProteoScribe process. Check back for:
|
|
- Configuration files
|
|
- Model weights
|
|
- Running instructions
|
|
- Output examples
|
|
|
|
## Support
|
|
|
|
For questions or issues:
|
|
- Open an issue in this repository
|
|
- Contact: [Your contact information]
|
|
|
|
---
|
|
Repository maintained by the BioM3 Team
|
|
|
|
|