Niksa Praljak committed on
Commit 013cb42 · 1 Parent(s): 966f016

Add instruction loading pretrained weights

Files changed (1)
  1. weights/ProteoScribe/README.md +128 -13
weights/ProteoScribe/README.md CHANGED
@@ -4,32 +4,147 @@
  ### **`weights/ProteoScribe/README.md`**

  ```markdown
  # ProteoScribe Pre-trained Weights

- This folder will contain the pre-trained weights for the **ProteoScribe** model. ProteoScribe enables advanced functional annotation or protein generation tasks.

  ---

- ## **Downloading Pre-trained Weights**

- The Google Drive link for downloading the ProteoScribe pre-trained weights will be added here soon.

  ---

- ## **File Details**

- - **File Name**: ProteoScribe pre-trained weights (TBD).
- - **Description**: Pre-trained weights for the ProteoScribe model.

  ---

- ## **Usage**

- Once available, you can load the weights into your model using PyTorch:

- ```python
- import torch
- model = YourProteoScribeModel() # Replace with your model class
- model.load_state_dict(torch.load("weights/ProteoScribe/ProteoScribe_weights.bin", map_location="cpu"))
- model.eval()

  ### **`weights/ProteoScribe/README.md`**

  ```markdown
+
  # ProteoScribe Pre-trained Weights
+ This folder contains the pre-trained weights for the **ProteoScribe** model (Stage 3 of BioM3). The ProteoScribe model generates protein sequences from conditioned latent embeddings.

+ ---
+ ## **Downloading Pre-trained Weights**
+ To download the **ProteoScribe epoch 20 pre-trained weights** as a `.bin` file from Google Drive, use the following commands:
+ ```bash
+ pip install gdown
+ gdown --id 1c3CwvbOP_kp3FpLL1wPrjO6qtY-XiT26 -O BioM3_ProteoScribe_pfam_epoch20_v1.bin
+ ```
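The same download can also be scripted from Python. The following is a minimal sketch, not part of the commit: it assumes `gdown` is installed, reuses the file ID and output name from the command above, and the helper names (`drive_url`, `download_weights`, `sha256_of`) are hypothetical. The README does not publish a checksum, so the hash is only useful for comparing two copies of the file.

```python
import hashlib
from pathlib import Path
from typing import Optional

FILE_ID = "1c3CwvbOP_kp3FpLL1wPrjO6qtY-XiT26"  # ID taken from the gdown command above
OUTPUT = "BioM3_ProteoScribe_pfam_epoch20_v1.bin"


def drive_url(file_id: str) -> str:
    """Build the Google Drive URL form that gdown accepts."""
    return f"https://drive.google.com/uc?id={file_id}"


def download_weights(file_id: str = FILE_ID, output: str = OUTPUT) -> Optional[str]:
    """Download the weights through gdown's Python API (hypothetical helper)."""
    import gdown  # imported lazily so the other helpers work without gdown

    return gdown.download(drive_url(file_id), output, quiet=False)


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large checkpoint is never read into memory at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `download_weights()` followed by `print(sha256_of(OUTPUT))` gives a fingerprint that can be compared across machines to spot a truncated download.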

  ---
+ ## **Usage**
+ Once downloaded, the pre-trained weights can be loaded as follows:
+ ```python
+ import json
+ import torch
+ import torch.nn as nn
+ from argparse import Namespace
+ import Stage3_source.cond_diff_transformer_layer as Stage3_mod

+ # Step 1: Load JSON Configuration
+ def load_json_config(json_path):
+     """
+     Load a JSON configuration file and return it as a dictionary.
+     """
+     with open(json_path, "r") as f:
+         config = json.load(f)
+     return config

+ # Step 2: Convert JSON Dictionary to Namespace
+ def convert_to_namespace(config_dict):
+     """
+     Recursively convert a dictionary to an argparse Namespace.
+     """
+     for key, value in config_dict.items():
+         if isinstance(value, dict):
+             config_dict[key] = convert_to_namespace(value)
+     return Namespace(**config_dict)
+
+ # Step 3: Model Loading Function
+ def prepare_model(model_path, config_args) -> nn.Module:
+     """
+     Initialize and load the ProteoScribe model with pre-trained weights.
+     """
+     # Initialize the model graph
+     model = Stage3_mod.get_model(
+         args=config_args,
+         data_shape=(config_args.image_size, config_args.image_size),
+         num_classes=config_args.num_classes
+     )
+
+     # Load pre-trained weights
+     model.load_state_dict(torch.load(model_path, map_location=config_args.device))
+     model.eval()
+
+     return model
+
+ if __name__ == '__main__':
+     # Path to configuration and weights
+     config_path = "stage3_config.json"
+     model_weights_path = "weights/ProteoScribe/BioM3_ProteoScribe_pfam_epoch20_v1.bin"
+
+     # Load Configuration
+     print("Loading configuration...")
+     config_dict = load_json_config(config_path)
+     config_args = convert_to_namespace(config_dict)
+
+     # Set device if not specified in config
+     if not hasattr(config_args, 'device'):
+         config_args.device = 'cuda' if torch.cuda.is_available() else 'cpu'
+
+     # Load Model
+     print("Loading pre-trained model weights...")
+     model = prepare_model(model_weights_path, config_args)
+     print(f"Model loaded successfully with weights! (Device: {config_args.device})")
+ ```
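To make Step 2 concrete, here is a small sketch of what the dictionary-to-`Namespace` conversion produces. It is not part of the commit: `to_namespace` is a hypothetical non-mutating variant of `convert_to_namespace` (the version above replaces nested dictionaries inside the input in place), and the config values and the nested `model` / `hidden_dim` keys are invented for illustration.

```python
from argparse import Namespace


def to_namespace(config_dict: dict) -> Namespace:
    """Non-mutating variant of convert_to_namespace: builds a new mapping."""
    return Namespace(**{
        key: to_namespace(value) if isinstance(value, dict) else value
        for key, value in config_dict.items()
    })


# Placeholder values; only image_size, num_classes and device are named in the README.
example = {
    "image_size": 32,
    "num_classes": 29,
    "device": "cpu",
    "model": {"hidden_dim": 512},  # hypothetical nested section
}
args = to_namespace(example)

# Nested keys become attribute chains, which is the access style prepare_model uses.
print(args.image_size, args.model.hidden_dim)
# The original dictionary still holds a plain dict under "model".
print(isinstance(example["model"], dict))
```

Leaving the input dictionary untouched matters mainly when the same config is later written back to disk or logged as JSON.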

  ---
+ ## **Model Structure**
+ The ProteoScribe model is structured as a conditional diffusion transformer that generates protein sequences based on facilitated embeddings. The model consists of:

+ 1. A transformer-based architecture for sequence generation
+ 2. Conditional diffusion layers for embedding processing
+ 3. Output layers for amino acid sequence prediction

+ ---
+ ## **Configuration Requirements**
+ The `stage3_config.json` file should contain the following key parameters:
+
+ ```json
+ {
+     "image_size": [required_size],
+     "num_classes": [num_amino_acids],
+     "device": "cuda", // or "cpu"
+     // Additional model-specific parameters
+ }
+ ```
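Note that the template above is not literal JSON: the bracketed placeholders must be replaced with real values and the `//` comments removed, because `json.load` rejects comments. A minimal pre-flight check could look like the sketch below. It is not part of the commit; `validate_config` is a hypothetical helper, the sample values are placeholders, and the assumption that `image_size` and `num_classes` are positive integers is inferred from how `prepare_model` uses them.

```python
import json

REQUIRED_KEYS = ("image_size", "num_classes")  # read directly by prepare_model


def validate_config(config: dict) -> list:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    for key in REQUIRED_KEYS:
        value = config.get(key)
        if key in config and not (isinstance(value, int) and value > 0):
            problems.append(f"{key} should be a positive integer, got {value!r}")
    # Assumption: only CPU and CUDA devices are expected (e.g. "cpu", "cuda", "cuda:0").
    device = config.get("device", "cpu")
    if str(device).split(":")[0] not in ("cpu", "cuda"):
        problems.append(f"unexpected device: {device!r}")
    return problems


# Placeholder values for illustration only; use the ones shipped with BioM3.
sample = json.loads('{"image_size": 32, "num_classes": 29, "device": "cpu"}')
print(validate_config(sample))
print(validate_config({"image_size": "32"}))
```

Running the check before `prepare_model` turns a late `AttributeError` on a missing key into an explicit message.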

  ---
+ ## **Dependencies**
+ Ensure you have the following dependencies installed:
+ - PyTorch (latest stable version)
+ - Stage3_source module (included in the BioM3 repository)

+ ---
+ ## **Important Notes**
+ 1. The model expects facilitated embeddings (z_c) as input, typically generated from Stage 2 (Facilitator)
+ 2. Model weights are optimized for protein sequence generation tasks
+ 3. Use a CUDA-enabled GPU for optimal performance (if available)
+ 4. Default configuration is tuned for the Pfam database

+ ---
+ ## **Troubleshooting**
+ Common issues and solutions:

+ 1. **CUDA Out of Memory**
+    - Reduce the batch size in the configuration
+    - Use the CPU if GPU memory is insufficient

+ 2. **Module Import Errors**
+    - Ensure Stage3_source is on the Python path
+    - Check that all dependencies are installed
+
+ 3. **Weight Loading Issues**
+    - Verify that the downloaded weights file is complete
+    - Check that the model configuration matches the pre-trained architecture
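Items 2 and 3 above can be checked programmatically. The sketch below is not part of the commit and both helper names are hypothetical: `add_repo_to_path` puts the directory containing `Stage3_source` on `sys.path`, and `diff_state_dict_keys` compares parameter names so that an architecture mismatch is visible before `load_state_dict` raises. It assumes the `.bin` file holds a plain state dict, as the `load_state_dict` call in the Usage section implies.

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_root: str) -> None:
    """Put the directory that contains Stage3_source on sys.path (idempotent)."""
    root = str(Path(repo_root).resolve())
    if root not in sys.path:
        sys.path.insert(0, root)


def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists for a model/checkpoint pair."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)     # model expects, file lacks
    unexpected = sorted(checkpoint_keys - model_keys)  # file has, model does not
    return missing, unexpected
```

With PyTorch available, a typical call would be `diff_state_dict_keys(model.state_dict().keys(), torch.load(path, map_location="cpu").keys())`; two empty lists suggest the configuration matches the checkpoint's parameter names, although tensor shapes can still differ.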
+
+ For additional support or issues:
+ - Open an issue in the BioM3 repository
+ - Check the documentation for updates
+
+ ---
+ ## **Citation**
+ If you use these weights in your research, please cite:
+ ```bibtex
+ Natural Language Prompts Guide the Design of Novel Functional Protein Sequences
+ bioRxiv 2024.11.11.622734
+ doi: https://doi.org/10.1101/2024.11.11.622734
+ ```
+
+ ---
+ Repository maintained by the BioM3 Team