---
license: other
---
# AIDO.Protein2StructureToken-16B

**AIDO.Protein2StructureToken-16B** is a version of [AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B) fine-tuned for protein structure prediction.
It takes amino acid sequences as input and predicts tokens that can be decoded into 3D structures by [AIDO.StructureDecoder](https://huggingface.co/genbio-ai/AIDO.StructureDecoder).
It surpasses existing state-of-the-art models such as **ESM3-open** on structure prediction tasks, demonstrating its robustness and capability in this domain.

## Model Architecture Details

This model retains the architecture of AIDO.Protein-16B: a transformer encoder-only architecture in which the dense MLP layers are replaced by sparse Mixture-of-Experts (MoE) layers.
Each token activates 2 experts via a top-2 routing mechanism. A visual summary of the architecture is provided below:

<center>
  <img src="https://huggingface.co/genbio-ai/AIDO.Protein-16B/resolve/main/proteinmoe_architecture.png" alt="AIDO.Protein-16B Architecture" style="width:70%; height:auto;" />
</center>


### Key Differences
The final output linear layer has been adapted to support a new vocabulary size:
- **Input Vocabulary Size**: 44 (amino acids + special tokens)
- **Output Vocabulary Size**: 512 (structure tokens without special tokens)

### Architecture Parameters
| Component                     | Value |
|-------------------------------|-------|
| Number of Attention Heads     | 36    |
| Number of Hidden Layers       | 36    |
| Hidden Size                   | 2304  |
| Number of Experts per MoE Layer | 8     |
| Experts Activated per Token   | 2     |
| Input Vocabulary Size         | 44    |
| Output Vocabulary Size        | 512   |
| Context Length                | 1024  |
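The top-2 routing described above can be sketched as follows. This is a minimal illustration of the mechanism, not the model's actual implementation; the `Top2MoE` class name and the expert MLP shape are assumptions for the example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Minimal sparse MoE layer with top-2 routing (illustrative only)."""
    def __init__(self, hidden_size=2304, num_experts=8, ffn_size=4096):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.GELU(),
                          nn.Linear(ffn_size, hidden_size))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, hidden_size)
        logits = self.router(x)                # (num_tokens, num_experts)
        weights, idx = logits.topk(2, dim=-1)  # each token picks its 2 best experts
        weights = F.softmax(weights, dim=-1)   # normalize the 2 routing weights
        out = torch.zeros_like(x)
        for k in range(2):                     # for each of the 2 selected slots
            for e in range(len(self.experts)):
                mask = idx[:, k] == e          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
        return out
```

With 8 experts and top-2 routing, each token pays the compute cost of only 2 expert MLPs while the layer's total capacity scales with all 8.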

## Training Details

The fine-tuning process used **0.4 trillion tokens**, drawn from the AlphaFold database (**170M samples**) and the PDB database (**0.4M samples**), making the model highly specialized for structure prediction. Training took around 20 days on 64 A100 GPUs.

- **Batch Size**: Global batch size of 2048
- **Context Length**: 1024
- **Precision**: FP16
- **Hardware**: 64 NVIDIA A100 80GB GPUs
- **Learning Rate**: Max learning rate of 1e-4
- **Scheduler**: Cosine decay with 2.5% warmup
- **Tokens Trained**: 0.4T tokens
- **Training steps**: 200k steps
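The figures above are mutually consistent, which a quick back-of-the-envelope check confirms (assuming every step processes full-length sequences):

```python
# Consistency check: batch size x context length x steps ~= reported token count
global_batch = 2048   # sequences per optimizer step
context_len = 1024    # tokens per sequence
steps = 200_000       # training steps

tokens = global_batch * context_len * steps
print(f"{tokens / 1e12:.2f}T tokens")  # ~0.42T, matching the reported 0.4T
```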

## Tokenization

The input should be a single-chain amino acid sequence.

- **Input Tokenization**: The sequences are tokenized at the amino acid level and terminated with a `[SEP]` token (id=34).
- **Output Tokenization**: Each input token is converted into a structure token. The output can be decoded into 3D structures in PDB format using [AIDO.StructureDecoder](https://huggingface.co/genbio-ai/AIDO.StructureDecoder).
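Conceptually, input tokenization maps each residue to an id and appends `[SEP]`. The sketch below is illustrative only: the per-residue id assignments are placeholders, not the model's real 44-token vocabulary; only the `[SEP]` id of 34 comes from the description above.

```python
SEP_ID = 34  # [SEP] token id, as stated above

# Placeholder residue ids for illustration; the model's real vocabulary
# (44 tokens: amino acids + special tokens) ships with the model.
AA_TO_ID = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

def tokenize(seq: str) -> list[int]:
    """Amino-acid-level tokenization terminated with [SEP]."""
    return [AA_TO_ID[aa] for aa in seq] + [SEP_ID]

print(tokenize("KEFW"))  # [8, 3, 4, 18, 34] under this placeholder mapping
```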

## Results

<center><img src="StructurePrediction.PNG" alt="Structure Prediction Results" style="width:40%; height:auto;" /></center>

## How to Use

### Structure Prediction

To reproduce the structure prediction results described above, follow these steps:

1. Install the [Model Generator package](https://github.com/genbio-ai/ModelGenerator/).

2. Run the prediction command:

   ```bash
   mgen predict --config experiments/AIDO.StructureTokenizer/protein2structoken_16b.yaml
   ```
   This will pull the CASP14, CASP15, and CAMEO datasets from [genbio-ai/casp14-casp15-cameo-test-proteins](https://huggingface.co/datasets/genbio-ai/casp14-casp15-cameo-test-proteins), and predict the structure tokens from the amino acid sequences.

3. Convert the output `.tsv` to `.pt` and extract the model's codebook:

   ```bash
   # convert the predicted structures in tsv into one pt file
   python experiments/AIDO.StructureTokenizer/struct_token_format_conversion.py logs/protein2structoken_16b/predict_predictions.tsv logs/protein2structoken_16b/predict_predictions.pt
   # extract the codebook of the structure tokenizer
   python experiments/AIDO.StructureTokenizer/extract_structure_tokenizer_codebook.py --output_path logs/protein2structoken_16b/codebook.pt
   ```
4. Run the decoding command to get 3D structures in PDB format (currently this script only supports single-GPU inference):
   ```bash
   CUDA_VISIBLE_DEVICES=0 mgen predict --config experiments/AIDO.StructureTokenizer/decode.yaml \
     --data.init_args.config.struct_tokens_datasets_configs.name=protein2structoken_16b \
     --data.init_args.config.struct_tokens_datasets_configs.struct_tokens_path=logs/protein2structoken_16b/predict_predictions.pt \
     --data.init_args.config.struct_tokens_datasets_configs.codebook_path=logs/protein2structoken_16b/codebook.pt
   ```
   The outputs are written to `logs/protstruct_decode/protein2structoken_16b_pdb_files/`.
5. You can compare the predicted structures with the ground-truth PDBs in [genbio-ai/casp14-casp15-cameo-test-proteins](https://huggingface.co/datasets/genbio-ai/casp14-casp15-cameo-test-proteins/tree/main).

Alternatively, you can provide your own input amino acid sequences in a CSV file. An example is provided at `experiments/AIDO.StructureTokenizer/protein2structoken_example_input.csv` in `ModelGenerator`:
```
idx,aa_seq
example,KEFWNLDKNLQLRLGIVFLG
```
Here, `idx` is a unique name and `aa_seq` is the amino acid sequence. To use a customized CSV file, replace the second step with:
```bash
mgen predict --config experiments/AIDO.StructureTokenizer/protein2structoken_16b.yaml \
 --data.init_args.path=experiments/AIDO.StructureTokenizer/ \
 --data.init_args.test_split_files=[protein2structoken_example_input.csv]
```

### Build any downstream models from this backbone with ModelGenerator
For more information, visit: [Model Generator](https://github.com/genbio-ai/modelgenerator)
```bash
mgen fit --model SequenceClassification --model.backbone aido_protein_16b --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>
mgen test --model SequenceClassification --model.backbone aido_protein_16b --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>
```
The usage of this model is the same as [AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B).
You only need to change `model.backbone` to `aido_protein2structoken_16b`.


### Or use directly in Python
#### Embedding
```python
from modelgenerator.tasks import Embed
model = Embed.from_config({"model.backbone": "aido_protein2structoken_16b"}).eval()
collated_batch = model.collate({"sequences": ["HELLQ", "WRLD"]})
embedding = model(collated_batch)
print(embedding.shape)
print(embedding)
```
#### Sequence Level Classification
```python
import torch
from modelgenerator.tasks import SequenceClassification
model = SequenceClassification.from_config({"model.backbone": "aido_protein2structoken_16b", "model.n_classes": 2}).eval()
collated_batch = model.collate({"sequences": ["HELLQ", "WRLD"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```
#### Token Level Classification
```python
import torch
from modelgenerator.tasks import TokenClassification
model = TokenClassification.from_config({"model.backbone": "aido_protein2structoken_16b", "model.n_classes": 3}).eval()
collated_batch = model.collate({"sequences": ["HELLQ", "WRLD"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```
#### Regression
```python
from modelgenerator.tasks import SequenceRegression
model = SequenceRegression.from_config({"model.backbone": "aido_protein2structoken_16b"}).eval()
collated_batch = model.collate({"sequences": ["HELLQ", "WRLD"]})
logits = model(collated_batch)
print(logits)
```

## Citation
Please cite AIDO.Protein and AIDO.StructureTokenizer using the following BibTeX entries:
```
@inproceedings{zhang_balancing_2024,
  title = {Balancing Locality and Reconstruction in Protein Structure Tokenizer},
  url = {https://www.biorxiv.org/content/10.1101/2024.12.02.626366v2},
  doi = {10.1101/2024.12.02.626366},
  publisher = {bioRxiv},
  author = {Zhang, Jiayou and Meynard-Piganeau, Barthelemy and Gong, James and Cheng, Xingyi and Luo, Yingtao and Ly, Hugo and Song, Le and Xing, Eric},
  year = {2024},
  booktitle = {NeurIPS 2024 Workshop on Machine Learning in Structural Biology (MLSB)},
}

@inproceedings{sun_mixture_2024,
  title = {Mixture of Experts Enable Efficient and Effective Protein Understanding and Design},
  url = {https://www.biorxiv.org/content/10.1101/2024.11.29.625425v1},
  doi = {10.1101/2024.11.29.625425},
  publisher = {bioRxiv},
  author = {Sun, Ning and Zou, Shuxian and Tao, Tianhua and Mahbub, Sazan and Li, Dian and Zhuang, Yonghao and Wang, Hongyi and Cheng, Xingyi and Song, Le and Xing, Eric P.},
  year = {2024},
  booktitle = {NeurIPS 2024 Workshop on AI for New Drug Modalities},
}
```