nferruz committed
Commit: f83e84d (1 parent: 7e8a9c6)

Update README.md

Files changed (1): README.md (+7 -6)
README.md CHANGED

````diff
@@ -11,8 +11,9 @@ inference:
 # **ZymCTRL**
 
 ZymCTRL (Enzyme Control) ([ see preprint ](https://www.biorxiv.org/content/10.1101/2024.05.03.592223v1))
-is a conditional language model for the generation of artificial functional enzymes. It was trained on Uniprot database of sequences containing EC annotations, comprising over 37 M sequences.
-Given a user-defined Enzymatic Commission (EC) number, the model generates protein sequences that fulfill that catalytic reaction.
+is a conditional language model for the generation of artificial functional enzymes.
+It was trained on the UniProt database of sequences containing EC (Enzyme Commission) annotations, comprising over 37 M sequences.
+Given a user-defined Enzyme Commission (EC) number, the model generates protein sequences that fulfil that catalytic reaction.
 The generated sequences are ordered, globular, and distant to natural ones, while their intended catalytic properties match those defined by users.
 
 If you don't know the EC number of your protein of interest, have a look for example here: https://www.brenda-enzymes.org/ecexplorer.php?browser=1
@@ -48,14 +49,14 @@ Since ZymCTRL has been trained on the classical language model objective on enzy
 it particularly excels at generating enzyme sequences given a user-defined EC class, such as alcohol dehydrogenases ('1.1.1.2').
 
 The model can generate in two ways: in a zero-shot fashion, i.e., directly generating from the checkpoint weights, or after fine-tuning.
-Fine-tuning allows augmenting the BRENDA datasets that were used during training, for example,
+Fine-tuning allows augmenting the specific EC datasets that were used during training, for example,
 if you have a curated internal dataset or a set of ancestrally-reconstructed sequences. This is entirely optional. One advantage of
 running the model in zero-shot is that it doesn't require any further training.
 
 
 ### **Example 1: Generating nitrilases (EC 3.5.5.1)**
 
-The script below will be used for the generation of any BRENDA class in a zero-shot fashion,
+The script below can be used for the generation of any EC class in a zero-shot fashion;
 here we showcase the generation of novel nitrilases.
 
 To run this script, you should download ZymCTRL to a local folder in your workstation.
@@ -132,7 +133,7 @@ if __name__=='__main__':
     model = GPT2LMHeadModel.from_pretrained('/path/to/zymCTRL').to(device) # change to ZymCTRL location
     special_tokens = ['<start>', '<end>', '<|endoftext|>','<pad>',' ', '<sep>']
 
-    # change to the appropriate BRENDA EC classes
+    # change to the appropriate EC classes
     labels=['3.5.5.1'] # nitrilases. You can put as many labels as you want.
 
     for label in tqdm(labels):
@@ -162,7 +163,7 @@ as it will learn new properties from your dataset and potentially improve the ge
 
 To fine-tune ZymCTRL, you can use the script below to process your sequences. The only requisite is to start with an input file,
 'sequences.fasta' which contains all the sequences in a fasta format. Please follow the format below. There should not be new lines '\n' or
-any separator between sequences. In the script, change the variable ec_label to the specific BRENDA class you'd like to fine-tune.
+any separator between sequences. In the script, change the variable ec_label to the specific EC class you'd like to fine-tune on.
 The script will produce a file called {ec_label}_processed.txt and a folder with the training and validation datasets (split 10%)
 ```
 >Sequence1
````
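For context on the fine-tuning section touched by this diff, the preprocessing it describes (FASTA in, one conditional training example per sequence out) can be sketched in plain Python. The exact single-line layout `label<sep><start>sequence<end><|endoftext|>` is an assumption inferred from the special-token list in the diff, not text taken from this commit; `fasta_records` and `format_for_training` are hypothetical helper names.

```python
# Sketch: convert 'sequences.fasta' records into single-line training
# examples for a given EC label. The layout label<sep><start>seq<end><|endoftext|>
# is an ASSUMED convention based on ZymCTRL's special-token list.

def fasta_records(text):
    """Yield (header, sequence) pairs from a FASTA-formatted string."""
    header, seq = None, []
    for line in text.splitlines():
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(seq)
            header, seq = line[1:].strip(), []
        elif line.strip():
            seq.append(line.strip())  # join wrapped sequence lines
    if header is not None:
        yield header, ''.join(seq)

def format_for_training(ec_label, sequence):
    """One training example: EC label, then the sequence between control tokens."""
    return f"{ec_label}<sep><start>{sequence}<end><|endoftext|>"

if __name__ == '__main__':
    fasta = ">Sequence1\nMTEY\n>Sequence2\nMKLV"
    ec_label = '3.5.5.1'  # nitrilases, matching the example in the README
    for header, seq in fasta_records(fasta):
        print(format_for_training(ec_label, seq))
```

A real run would read `sequences.fasta` from disk and write the results to `{ec_label}_processed.txt` before splitting off the 10% validation set the README mentions.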