niksapraljak1
/

BioM3

Model card Files Files and versions Community

Niksa Praljak commited on 27 days ago

Commit

013cb42

1 Parent(s): 966f016

Add instruction loading pretrained weights

Browse files

Files changed (1) hide show

weights/ProteoScribe/README.md +128 -13

weights/ProteoScribe/README.md CHANGED Viewed

@@ -4,32 +4,147 @@
 ### **`weights/ProteoScribe/README.md`**
 ```markdown
 # ProteoScribe Pre-trained Weights
-This folder will contain the pre-trained weights for the **ProteoScribe** model. ProteoScribe enables advanced functional annotation or protein generation tasks.
 ---
-## **Downloading Pre-trained Weights**
-The Google Drive link for downloading the ProteoScribe pre-trained weights will be added here soon.
 ---
-## **File Details**
-- **File Name**: ProteoScribe pre-trained weights (TBD).
-- **Description**: Pre-trained weights for the ProteoScribe model.
 ---
-## **Usage**
-Once available, you can load the weights into your model using PyTorch:
-```python
-import torch
-model = YourProteoScribeModel()  # Replace with your model class
-model.load_state_dict(torch.load("weights/ProteoScribe/ProteoScribe_weights.bin", map_location="cpu"))
-model.eval()

 ### **`weights/ProteoScribe/README.md`**
 ```markdown
 # ProteoScribe Pre-trained Weights
+This folder contains the pre-trained weights for the **ProteoScribe** model (Stage 3 of BioM3). The ProteoScribe model generates protein sequences from conditioned latent embeddings.
+---
+## **Downloading Pre-trained Weights**
+To download the **ProteoScribe epoch 20 pre-trained weights** as a `.bin` file from Google Drive, use the following command:
+```bash
+pip install gdown
+gdown --id 1c3CwvbOP_kp3FpLL1wPrjO6qtY-XiT26 -O BioM3_ProteoScribe_pfam_epoch20_v1.bin
+```
 ---
+## **Usage**
+Once available, the pre-trained weights can be loaded as follows:
+```python
+import json
+import torch
+import torch.nn as nn
+from argparse import Namespace
+import Stage3_source.cond_diff_transformer_layer as Stage3_mod
+# Step 1: Load JSON Configuration
+def load_json_config(json_path):
+    """
+    Load a JSON configuration file and return it as a dictionary.
+    """
+    with open(json_path, "r") as f:
+        config = json.load(f)
+    return config
+# Step 2: Convert JSON Dictionary to Namespace
+def convert_to_namespace(config_dict):
+    """
+    Recursively convert a dictionary to an argparse Namespace.
+    """
+    for key, value in config_dict.items():
+        if isinstance(value, dict):
+            config_dict[key] = convert_to_namespace(value)
+    return Namespace(**config_dict)
+# Step 3: Model Loading Function
+def prepare_model(model_path, config_args) -> nn.Module:
+    """
+    Initialize and load the ProteoScribe model with pre-trained weights.
+    """
+    # Initialize the model graph
+    model = Stage3_mod.get_model(
+        args=config_args,
+        data_shape=(config_args.image_size, config_args.image_size),
+        num_classes=config_args.num_classes
+    )
+    # Load pre-trained weights
+    model.load_state_dict(torch.load(model_path, map_location=config_args.device))
+    model.eval()
+    return model
+if __name__ == '__main__':
+    # Path to configuration and weights
+    config_path = "stage3_config.json"
+    model_weights_path = "weights/ProteoScribe/BioM3_ProteoScribe_pfam_epoch20_v1.bin"
+    # Load Configuration
+    print("Loading configuration...")
+    config_dict = load_json_config(config_path)
+    config_args = convert_to_namespace(config_dict)
+    # Set device if not specified in config
+    if not hasattr(config_args, 'device'):
+        config_args.device = 'cuda' if torch.cuda.is_available() else 'cpu'
+    # Load Model
+    print("Loading pre-trained model weights...")
+    model = prepare_model(model_weights_path, config_args)
+    print(f"Model loaded successfully with weights! (Device: {config_args.device})")
+```
 ---
+## **Model Structure**
+The ProteoScribe model is structured as a conditional diffusion transformer that generates protein sequences based on facilitated embeddings. The model consists of:
+1. A transformer-based architecture for sequence generation
+2. Conditional diffusion layers for embedding processing
+3. Output layers for amino acid sequence prediction
+---
+## **Configuration Requirements**
+The `stage3_config.json` file should contain the following key parameters:
+```json
+{
+    "image_size": [required_size],
+    "num_classes": [num_amino_acids],
+    "device": "cuda",  // or "cpu"
+    // Additional model-specific parameters
+}
+```
 ---
+## **Dependencies**
+Ensure you have the following dependencies installed:
+- PyTorch (latest stable version)
+- Stage3_source module (included in the BioM3 repository)
+---
+## **Important Notes**
+1. The model expects facilitated embeddings (z_c) as input, typically generated from Stage 2 (Facilitator)
+2. Model weights are optimized for protein sequence generation tasks
+3. Use CUDA-enabled GPU for optimal performance (if available)
+4. Default configuration is tuned for the Pfam database
+---
+## **Troubleshooting**
+Common issues and solutions:
+1. **CUDA Out of Memory**
+   - Reduce batch size in configuration
+   - Use CPU if GPU memory is insufficient
+2. **Module Import Errors**
+   - Ensure Stage3_source is in Python path
+   - Check all dependencies are installed
+3. **Weight Loading Issues**
+   - Verify the downloaded weights file is complete
+   - Check model configuration matches pre-trained architecture
+For additional support or issues:
+- Open an issue in the BioM3 repository
+- Check the documentation for updates
+---
+## **Citation**
+If you use these weights in your research, please cite:
+```bibtex
+Natural Language Prompts Guide the Design of Novel Functional Protein Sequences
+bioRxiv 2024.11.11.622734
+doi: https://doi.org/10.1101/2024.11.11.622734
+```
+---
+Repository maintained by the BioM3 Team