Update README.md
README.md
ZymCTRL ([Paper presented @ Machine Learning for Structural Biology workshop - December 2022](https://www.mlsb.io/papers_2022/ZymCTRL_a_conditional_language_model_for_the_controllable_generation_of_artificial_enzymes.pdf)) is a conditional language model for the generation of artificial functional enzymes. It was trained on the entire BRENDA database of enzymes, comprising over 37 million sequences. Given a user-defined Enzyme Commission (EC) number, the model generates protein sequences that fulfill that catalytic reaction. The generated sequences are ordered, globular, and distant from natural ones, while their intended catalytic properties match those defined by users.

If you don't know the EC number of your protein of interest, have a look at the BRENDA webpage: https://www.brenda-enzymes.org/ecexplorer.php?browser=1

See below for information about the model, how to generate sequences, and how to save and rank them by perplexity.

## **Model description**
ZymCTRL is based on the [CTRL Transformer](https://arxiv.org/abs/1909.05858) architecture (which in turn is very similar to ChatGPT) and contains 36 layers with a model dimensionality of 1280, totaling 738 million parameters.

ZymCTRL is a decoder-only transformer model pre-trained on the BRENDA database (version July 2022). The pre-training was done on the raw sequences without FASTA headers, with the EC classes prepended to each sequence.

ZymCTRL was trained with an autoregressive objective, i.e., the model learns to predict the next token given a sequence context. Because the first tokens of each sequence encode the EC numbers, the model learns the dependencies among EC classes and their corresponding sequences and is able to _speak_ the enzyme language.

There are stark differences in the number of members among EC classes, and for this reason we also tokenized the EC numbers. In this manner, EC numbers '2.7.1.1' and '2.7.1.2' share the first three tokens (six, including separators), and hence the model can infer that there are relationships between the two classes.
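
To make the shared-prefix idea concrete, the short snippet below loads the released tokenizer and prints how two related EC labels are split into tokens. The local folder path is a placeholder, and the exact token boundaries depend on the tokenizer shipped with this repository, so treat it as an illustrative check rather than a specification.

```
from transformers import AutoTokenizer

# Illustrative check of EC-number tokenization.
# '/path/to/zymCTRL' is a placeholder for the folder where you downloaded the model.
tokenizer = AutoTokenizer.from_pretrained('/path/to/zymCTRL')

for ec_label in ['2.7.1.1', '2.7.1.2']:
    print(ec_label, tokenizer.tokenize(ec_label))
# If each EC level and each '.' separator is its own token, the two labels share
# the six-token prefix 2 . 7 . 1 . and differ only in the last token.
```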

The figure below summarizes the process of training:

Since ZymCTRL has been trained with the classical language modeling objective on enzyme sequences with their EC annotation, it particularly excels at generating enzyme sequences given a user-defined EC class, such as alcohol dehydrogenases ('1.1.1.2').

The model can generate in two ways: in a zero-shot fashion, i.e., directly generating from the checkpoint weights, or after fine-tuning. Fine-tuning allows augmenting the BRENDA datasets that were used during training, for example, if you have a curated internal dataset or a set of ancestrally-reconstructed sequences. This is entirely optional. One advantage of running the model zero-shot is that it doesn't require any further training.

### **Example 1: Generating nitrilases (EC 3.5.5.1)**

The script below can be used to generate any BRENDA class in a zero-shot fashion; here we showcase the generation of novel nitrilases.

To run this script, you should download ZymCTRL to a local folder on your workstation. Then replace the placeholders in the script with your actual folder path.

You can run it directly from the command line (once you have Hugging Face Transformers installed) with the following command: `python generate.py`

The script will write each sequence to a FASTA file in the folder you specify. In the FASTA header, it will store the sequence's computed perplexity value. Perplexity is a measure of the model's confidence in that generation, with lower values being better. The sequences are ordered by perplexity before writing them out, so those that end in *_0.fasta and *_1.fasta will be the best ones per batch.

**Given that generation runs so fast, we recommend generating hundreds or thousands of sequences and then picking only the best 5%. With the script below, that would mean keeping only those that end in '_0.fasta'.**

```
import torch
from tqdm import tqdm
import math
# (remaining imports of generate.py unchanged and not shown in this hunk)

def remove_characters(sequence, char_list):
    "This function removes special tokens used during training."
    columns = sequence.split('<sep>')
    seq = columns[1]
    for char in char_list:
        seq = seq.replace(char, '')
    return seq

# ... (the rest of generate.py is unchanged and not shown in this hunk)
```
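
Because only the top of `generate.py` appears in this hunk, here is a compact, self-contained sketch of the same workflow: prompt the model with an EC label, sample a batch of sequences, score each one by perplexity, and write the best to FASTA files. The folder path, the sampling parameters (`top_k`, `repetition_penalty`, `max_length`, batch size), and the special-token cleanup list are assumptions for illustration only; the full `generate.py` in this repository is the reference implementation.

```
# Minimal zero-shot generation sketch; not the full generate.py.
import math
import torch
from transformers import GPT2LMHeadModel, AutoTokenizer

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_dir = '/path/to/zymCTRL'      # placeholder: your local checkpoint folder
ec_label = '3.5.5.1'                # nitrilases

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = GPT2LMHeadModel.from_pretrained(model_dir).to(device).eval()

inputs = tokenizer(ec_label, return_tensors='pt').to(device)
with torch.no_grad():
    samples = model.generate(inputs.input_ids,
                             do_sample=True,
                             top_k=9,               # illustrative sampling settings
                             repetition_penalty=1.2,
                             max_length=1024,
                             num_return_sequences=20)

# Perplexity = exp(mean cross-entropy); lower means higher model confidence.
scored = []
for ids in samples:
    with torch.no_grad():
        loss = model(ids.unsqueeze(0), labels=ids.unsqueeze(0)).loss
    scored.append((math.exp(loss.item()), tokenizer.decode(ids)))
scored.sort(key=lambda pair: pair[0])

# Keep the best generations and write one FASTA file per sequence,
# storing the perplexity in the header.
for rank, (ppl, text) in enumerate(scored[:2]):
    sequence = text.split('<sep>')[1] if '<sep>' in text else text
    for token in ['<start>', '<end>', '<pad>', ' ']:
        sequence = sequence.replace(token, '')
    with open(f'{ec_label}_{rank}.fasta', 'w') as handle:
        handle.write(f'>{ec_label}_{rank}, perplexity={ppl:.2f}\n{sequence}\n')
```
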
## **Example 2: Fine-tuning on a set of user-defined sequences**

This alternative to zero-shot generation allows updating ZymCTRL's weights with new sequences.

This strategy is not strictly necessary; in fact, we have observed good generations even for EC classes with only 1-2 representatives in nature. But you might have an internal set of sequences that you'd like to incorporate into the model: for example, internal datasets from protein engineering efforts, ancestrally-reconstructed sets, or hits from searches against metagenomics databases. In these cases, it is advisable to fine-tune ZymCTRL, as it will learn new properties from your dataset and potentially improve the generation quality (especially for poorly populated EC classes).

To fine-tune ZymCTRL, you will need to preprocess your sequences quite a bit; the scripts below do exactly that without any modifications. The only requirement is to start from an input file, 'sequences.fasta', which contains all your sequences in FASTA format. We recommend using at least 200 sequences to obtain the best results, but we have seen it work with fewer sequences, so if you don't have that many, still give it a go.
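
The full preprocessing and fine-tuning scripts follow later in this README. As a rough, hypothetical sketch of the very first step, the snippet below reads 'sequences.fasta' and prepends a chosen EC label plus the '<sep>' separator to each sequence. The label value, the output file name, and the handling of the remaining special tokens (start/end/padding, packing into 1024-token blocks) are assumptions here; defer to the full scripts for actual fine-tuning runs.

```
# Hypothetical first preprocessing step: one "EC<sep>SEQUENCE" line per entry.
# The real scripts in this README also handle start/end/pad tokens and pack the
# examples into fixed-size blocks before training.
ec_label = '3.5.5.1'   # assumed EC class you are fine-tuning towards

sequences, chunks = [], []
with open('sequences.fasta') as handle:
    for line in handle:
        line = line.strip()
        if line.startswith('>'):
            if chunks:
                sequences.append(''.join(chunks))
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        sequences.append(''.join(chunks))

with open('train.txt', 'w') as out:
    for sequence in sequences:
        out.write(f'{ec_label}<sep>{sequence}\n')
```
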
### **Training specs**
The model was trained on 48 NVIDIA A100 GPUs for eight epochs, using a block size of 1024 and a total batch size of 768. The optimizer used was Adam (beta1 = 0.9, beta2 = 0.999) with a learning rate of 0.8e-04.
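
For readers who want to relate these numbers to the Hugging Face Trainer, the sketch below maps them onto `TrainingArguments`. The per-device batch size shown is only one way of reaching the reported total of 768 sequences per step across 48 GPUs, and the output directory is a placeholder.

```
# Hypothetical mapping of the reported hyperparameters onto TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='zymctrl-training',     # placeholder
    num_train_epochs=8,
    per_device_train_batch_size=16,    # 16 x 48 A100 GPUs = 768 total batch size
    learning_rate=0.8e-4,              # as reported above
    adam_beta1=0.9,
    adam_beta2=0.999,
)
# Sequences are packed into blocks of 1024 tokens (the block_size argument of the
# Transformers language-modeling example scripts).
```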

### **Contact**

We are the AI for Protein Design group at the Institute of Molecular Biology of Barcelona (https://www.aiproteindesign.com/). For any questions, post an issue in this repository so that other people can benefit from the feedback, and we'll get back to you shortly. We are always open to collaborations; send an email to nfccri [at] ibmb [dot] csic [dot] es.