Update README.md
# **ZymCTRL**

ZymCTRL (Enzyme Control) ([see preprint](https://www.biorxiv.org/content/10.1101/2024.05.03.592223v1))
is a conditional language model for the generation of artificial functional enzymes.
It was trained on the UniProt database of sequences carrying Enzyme Commission (EC) annotations, comprising over 37 million sequences.
Given a user-defined EC number, the model generates protein sequences that perform the corresponding catalytic reaction.
The generated sequences are ordered, globular, and distant from natural ones, while their intended catalytic properties match those defined by users.

If you don't know the EC number of your protein of interest, have a look, for example, at the BRENDA EC explorer: https://www.brenda-enzymes.org/ecexplorer.php?browser=1
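As a quick illustration of how EC numbers are organized (the helper below is hypothetical, not part of the README's scripts): the four dot-separated fields progressively narrow down the reaction, from broad class to a specific enzyme.

```python
# Hypothetical helper: split an EC number into its four hierarchy levels.
# For EC 3.5.5.1 (nitrilases): class 3 = hydrolases, subclass 5 = acting on
# carbon-nitrogen bonds other than peptide bonds, sub-subclass 5 = in nitriles,
# serial 1 = the specific enzyme within that group.
def ec_levels(ec_number: str) -> dict:
    enzyme_class, subclass, sub_subclass, serial = ec_number.split(".")
    return {
        "class": enzyme_class,      # broad reaction type
        "subclass": subclass,
        "sub-subclass": sub_subclass,
        "serial": serial,           # specific enzyme within the group
    }

print(ec_levels("3.5.5.1"))
```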
Since ZymCTRL was trained with the classical language-modelling objective on enzyme sequences,
it particularly excels at generating enzyme sequences given a user-defined EC class, such as alcohol dehydrogenases ('1.1.1.2').

The model can generate in two ways: in a zero-shot fashion, i.e., directly generating from the checkpoint weights, or after fine-tuning.
Fine-tuning allows augmenting the specific EC datasets that were used during training, for example,
if you have a curated internal dataset or a set of ancestrally reconstructed sequences. This is entirely optional: one advantage of
running the model zero-shot is that it doesn't require any further training.

### **Example 1: Generating nitrilases (EC 3.5.5.1)**

The script below can be used to generate sequences for any EC class in a zero-shot fashion;
here we showcase the generation of novel nitrilases.

To run this script, you should download ZymCTRL to a local folder on your workstation.
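One way to get the weights locally is via a Git LFS clone (a sketch; the repository id `nferruz/ZymCTRL` and the target path are assumptions, adjust both to wherever the model is hosted and stored):

```shell
# Assumed repository id and local path -- adjust both to your setup.
git lfs install                                    # model weights are stored via Git LFS
git clone https://huggingface.co/nferruz/ZymCTRL /path/to/zymCTRL
```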
```python
if __name__ == '__main__':
    ...
    model = GPT2LMHeadModel.from_pretrained('/path/to/zymCTRL').to(device)  # change to ZymCTRL location
    special_tokens = ['<start>', '<end>', '<|endoftext|>', '<pad>', ' ', '<sep>']

    # change to the appropriate EC classes
    labels = ['3.5.5.1']  # nitrilases. You can put as many labels as you want.

    for label in tqdm(labels):
        ...
```
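After generation, the raw text still contains the conditioning tag and special tokens. Below is a minimal post-processing sketch, assuming the generated string follows the layout implied by the special-token list above (`EC<sep><start>SEQUENCE<end><|endoftext|>`); the `extract_sequence` helper is illustrative, not part of the original script.

```python
def extract_sequence(generated: str) -> str:
    """Pull the amino-acid sequence out of one raw generated string."""
    # keep everything after <start> (if present), stop at <end>
    seq = generated.split("<start>", 1)[1] if "<start>" in generated else generated
    seq = seq.split("<end>", 1)[0]
    for token in ("<pad>", "<sep>", "<|endoftext|>"):  # drop stray special tokens
        seq = seq.replace(token, "")
    return seq.strip()

print(extract_sequence("3.5.5.1<sep><start>MKLVNIT<end><|endoftext|>"))  # → MKLVNIT
```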
To fine-tune ZymCTRL, you can use the script below to process your sequences. The only requisite is to start with an input file,
'sequences.fasta', which contains all the sequences in fasta format. Please follow the format below: there should be no newlines '\n' or
any separator between sequences. In the script, change the variable ec_label to the specific EC class you'd like to fine-tune on.
The script will produce a file called {ec_label}_processed.txt and a folder with the training and validation datasets (10% held out for validation).
```
>Sequence1
```
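The preprocessing above can be sketched as follows. This is a minimal illustration, assuming each training example is laid out as `ec_label<sep><start>SEQUENCE<end><|endoftext|>` (the layout implied by the special-token list earlier); the README's own processing script is the authoritative version, and `process_fasta` is a hypothetical helper name.

```python
import random

def process_fasta(fasta_text: str, ec_label: str, val_fraction: float = 0.1):
    """Format fasta records for fine-tuning and split off a validation set."""
    # collect sequences, joining wrapped lines under each '>' header
    sequences, current = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current:
                sequences.append("".join(current))
                current = []
        elif line:
            current.append(line)
    if current:
        sequences.append("".join(current))

    # prepend the EC conditioning tag and the start/end special tokens
    examples = [f"{ec_label}<sep><start>{seq}<end><|endoftext|>" for seq in sequences]
    random.shuffle(examples)
    n_val = max(1, int(len(examples) * val_fraction))
    return examples[n_val:], examples[:n_val]  # train, validation

train, val = process_fasta(">s1\nMKL\n>s2\nMVA\n", "3.5.5.1")
print(len(train), len(val))  # → 1 1
```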