Update README.md
# **ZymCTRL**

ZymCTRL (Enzyme Control) ([see preprint](https://www.biorxiv.org/content/10.1101/2024.05.03.592223v1))
is a conditional language model for the generation of artificial functional enzymes.
It was trained on the UniProt database of sequences carrying Enzyme Commission (EC) annotations, comprising over 37 million sequences.
Given a user-defined EC number, the model generates protein sequences that perform the corresponding catalytic reaction.
The generated sequences are ordered, globular, and distant from natural ones, while their intended catalytic properties match those defined by users.

If you don't know the EC number of your protein of interest, have a look, for example, at the BRENDA EC explorer: https://www.brenda-enzymes.org/ecexplorer.php?browser=1
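As a quick illustration of how EC numbers are organized (the helper below is hypothetical, not part of the README's scripts): the four dot-separated fields progressively narrow down the reaction, from broad class to a specific enzyme.

```python
# Hypothetical helper: split an EC number into its four hierarchy levels.
# For EC 3.5.5.1 (nitrilases): class 3 = hydrolases, subclass 5 = acting on
# carbon-nitrogen bonds other than peptide bonds, sub-subclass 5 = in nitriles,
# serial 1 = the specific enzyme within that group.
def ec_levels(ec_number: str) -> dict:
    enzyme_class, subclass, sub_subclass, serial = ec_number.split(".")
    return {
        "class": enzyme_class,      # broad reaction type
        "subclass": subclass,
        "sub-subclass": sub_subclass,
        "serial": serial,           # specific enzyme within the group
    }

print(ec_levels("3.5.5.1"))
```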
Since ZymCTRL was trained with the classical language-modelling objective on enzyme sequences,
it particularly excels at generating enzyme sequences given a user-defined EC class, such as alcohol dehydrogenases ('1.1.1.2').

The model can generate in two ways: in a zero-shot fashion, i.e., directly generating from the checkpoint weights, or after fine-tuning.
Fine-tuning allows augmenting the specific EC datasets that were used during training, for example,
if you have a curated internal dataset or a set of ancestrally reconstructed sequences. This is entirely optional: one advantage of
running the model zero-shot is that it doesn't require any further training.

### **Example 1: Generating nitrilases (EC 3.5.5.1)**

The script below can be used to generate sequences for any EC class in a zero-shot fashion;
here we showcase the generation of novel nitrilases.

To run this script, you should download ZymCTRL to a local folder on your workstation.
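One way to get the weights locally is via a Git LFS clone (a sketch; the repository id `nferruz/ZymCTRL` and the target path are assumptions, adjust both to wherever the model is hosted and stored):

```shell
# Assumed repository id and local path -- adjust both to your setup.
git lfs install                                    # model weights are stored via Git LFS
git clone https://huggingface.co/nferruz/ZymCTRL /path/to/zymCTRL
```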
```python
if __name__ == '__main__':
    ...
    model = GPT2LMHeadModel.from_pretrained('/path/to/zymCTRL').to(device)  # change to ZymCTRL location
    special_tokens = ['<start>', '<end>', '<|endoftext|>', '<pad>', ' ', '<sep>']

    # change to the appropriate EC classes
    labels = ['3.5.5.1']  # nitrilases. You can put as many labels as you want.

    for label in tqdm(labels):
        ...
```
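After generation, the raw text still contains the conditioning tag and special tokens. Below is a minimal post-processing sketch, assuming the generated string follows the layout implied by the special-token list above (`EC<sep><start>SEQUENCE<end><|endoftext|>`); the `extract_sequence` helper is illustrative, not part of the original script.

```python
def extract_sequence(generated: str) -> str:
    """Pull the amino-acid sequence out of one raw generated string."""
    # keep everything after <start> (if present), stop at <end>
    seq = generated.split("<start>", 1)[1] if "<start>" in generated else generated
    seq = seq.split("<end>", 1)[0]
    for token in ("<pad>", "<sep>", "<|endoftext|>"):  # drop stray special tokens
        seq = seq.replace(token, "")
    return seq.strip()

print(extract_sequence("3.5.5.1<sep><start>MKLVNIT<end><|endoftext|>"))  # → MKLVNIT
```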
To fine-tune ZymCTRL, you can use the script below to process your sequences. The only requisite is to start with an input file,
'sequences.fasta', which contains all the sequences in fasta format. Please follow the format below: there should be no newlines '\n' or
any separator between sequences. In the script, change the variable ec_label to the specific EC class you'd like to fine-tune on.
The script will produce a file called {ec_label}_processed.txt and a folder with the training and validation datasets (10% held out for validation).
```
>Sequence1
```
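The preprocessing above can be sketched as follows. This is a minimal illustration, assuming each training example is laid out as `ec_label<sep><start>SEQUENCE<end><|endoftext|>` (the layout implied by the special-token list earlier); the README's own processing script is the authoritative version, and `process_fasta` is a hypothetical helper name.

```python
import random

def process_fasta(fasta_text: str, ec_label: str, val_fraction: float = 0.1):
    """Format fasta records for fine-tuning and split off a validation set."""
    # collect sequences, joining wrapped lines under each '>' header
    sequences, current = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current:
                sequences.append("".join(current))
                current = []
        elif line:
            current.append(line)
    if current:
        sequences.append("".join(current))

    # prepend the EC conditioning tag and the start/end special tokens
    examples = [f"{ec_label}<sep><start>{seq}<end><|endoftext|>" for seq in sequences]
    random.shuffle(examples)
    n_val = max(1, int(len(examples) * val_fraction))
    return examples[n_val:], examples[:n_val]  # train, validation

train, val = process_fasta(">s1\nMKL\n>s2\nMVA\n", "3.5.5.1")
print(len(train), len(val))  # → 1 1
```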