---
license: other
---
# AIDO.Protein2StructureToken-16B

**AIDO.Protein2StructureToken-16B** is a version of [AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B) fine-tuned for protein structure prediction.
It takes amino acid sequences as input and predicts tokens that can be decoded into 3D structures by [AIDO.StructureDecoder](https://huggingface.co/genbio-ai/AIDO.StructureDecoder).
It surpasses existing state-of-the-art models such as **ESM3-open** on structure prediction tasks, demonstrating its robustness and capability in this domain.

## Model Architecture Details

This model retains the architecture of AIDO.Protein-16B: a transformer encoder-only architecture in which the dense MLP layers are replaced by sparse Mixture-of-Experts (MoE) layers.
Each token activates 2 experts via a top-2 routing mechanism. A visual summary of the architecture is provided below:

<center>
  <img src="https://huggingface.co/genbio-ai/AIDO.Protein-16B/resolve/main/proteinmoe_architecture.png" alt="AIDO.Protein-16B Architecture" style="width:70%; height:auto;" />
</center>


### Key Differences
The final output linear layer has been adapted to support a new vocabulary size:
- **Input Vocabulary Size**: 44 (amino acids + special tokens)
- **Output Vocabulary Size**: 512 (structure tokens without special tokens)

### Architecture Parameters
| Component                     | Value |
|-------------------------------|-------|
| Number of Attention Heads     | 36    |
| Number of Hidden Layers       | 36    |
| Hidden Size                   | 2304  |
| Number of Experts per MoE Layer | 8     |
| Experts Activated per Token   | 2     |
| Input Vocabulary Size         | 44    |
| Output Vocabulary Size        | 512   |
| Context Length                | 1024  |
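The top-2 routing described above can be sketched as follows. This is a minimal illustration of the mechanism, not the model's actual implementation; the `Top2MoE` class name and the expert MLP shape are assumptions for the example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Minimal sparse MoE layer with top-2 routing (illustrative only)."""
    def __init__(self, hidden_size=2304, num_experts=8, ffn_size=4096):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.GELU(),
                          nn.Linear(ffn_size, hidden_size))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, hidden_size)
        logits = self.router(x)                # (num_tokens, num_experts)
        weights, idx = logits.topk(2, dim=-1)  # each token picks its 2 best experts
        weights = F.softmax(weights, dim=-1)   # normalize the 2 routing weights
        out = torch.zeros_like(x)
        for k in range(2):                     # for each of the 2 selected slots
            for e in range(len(self.experts)):
                mask = idx[:, k] == e          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
        return out
```

With 8 experts and top-2 routing, each token pays the compute cost of only 2 expert MLPs while the layer's total capacity scales with all 8.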

## Training Details

The fine-tuning process used **0.4 trillion tokens**, drawn from the AlphaFold database (**170M samples**) and the PDB database (**0.4M samples**), making the model highly specialized for structure prediction. Training took around 20 days on 64 A100 GPUs.

- **Batch Size**: Global batch size of 2048
- **Context Length**: 1024
- **Precision**: FP16
- **Hardware**: 64 NVIDIA A100 80GB GPUs
- **Learning Rate**: Max learning rate of 1e-4
- **Scheduler**: Cosine decay with 2.5% warmup
- **Tokens Trained**: 0.4T tokens
- **Training steps**: 200k steps
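The figures above are mutually consistent, which a quick back-of-the-envelope check confirms (assuming every step processes full-length sequences):

```python
# Consistency check: batch size x context length x steps ~= reported token count
global_batch = 2048   # sequences per optimizer step
context_len = 1024    # tokens per sequence
steps = 200_000       # training steps

tokens = global_batch * context_len * steps
print(f"{tokens / 1e12:.2f}T tokens")  # ~0.42T, matching the reported 0.4T
```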

## Tokenization

The input should be a single-chain amino acid sequence.

- **Input Tokenization**: The sequences are tokenized at the amino acid level and terminated with a `[SEP]` token (id=34).
- **Output Tokenization**: Each input token is converted into a structure token. The output can be decoded into 3D structures in PDB format using [AIDO.StructureDecoder](https://huggingface.co/genbio-ai/AIDO.StructureDecoder).
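Conceptually, input tokenization maps each residue to an id and appends `[SEP]`. The sketch below is illustrative only: the per-residue id assignments are placeholders, not the model's real 44-token vocabulary; only the `[SEP]` id of 34 comes from the description above.

```python
SEP_ID = 34  # [SEP] token id, as stated above

# Placeholder residue ids for illustration; the model's real vocabulary
# (44 tokens: amino acids + special tokens) ships with the model.
AA_TO_ID = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

def tokenize(seq: str) -> list[int]:
    """Amino-acid-level tokenization terminated with [SEP]."""
    return [AA_TO_ID[aa] for aa in seq] + [SEP_ID]

print(tokenize("KEFW"))  # [8, 3, 4, 18, 34] under this placeholder mapping
```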

## Results

<center><img src="StructurePrediction.PNG" alt="Structure Prediction Results" style="width:40%; height:auto;" /></center>

## How to Use

### Structure Prediction

To reproduce the structure prediction results described above, follow these steps:

1. Install the [Model Generator package](https://github.com/genbio-ai/ModelGenerator/).

2. Run the prediction command:

   ```bash
   mgen predict --config experiments/AIDO.StructureTokenizer/protein2structoken_16b.yaml
   ```
   This will pull the CASP14, CASP15, and CAMEO datasets from [genbio-ai/casp14-casp15-cameo-test-proteins](https://huggingface.co/datasets/genbio-ai/casp14-casp15-cameo-test-proteins), and predict the structure tokens from the amino acid sequences.

3. Convert the output `.tsv` to `.pt` and extract the model's codebook:

   ```bash
   # convert the predicted structures in tsv into one pt file
   python experiments/AIDO.StructureTokenizer/struct_token_format_conversion.py logs/protein2structoken_16b/predict_predictions.tsv logs/protein2structoken_16b/predict_predictions.pt
   # extract the codebook of the structure tokenizer
   python experiments/AIDO.StructureTokenizer/extract_structure_tokenizer_codebook.py --output_path logs/protein2structoken_16b/codebook.pt
   ```
4. Run the decoding command to get 3D structures in PDB format (currently this script only supports single-GPU inference):
   ```bash
   CUDA_VISIBLE_DEVICES=0 mgen predict --config experiments/AIDO.StructureTokenizer/decode.yaml \
     --data.init_args.config.struct_tokens_datasets_configs.name=protein2structoken_16b \
     --data.init_args.config.struct_tokens_datasets_configs.struct_tokens_path=logs/protein2structoken_16b/predict_predictions.pt \
     --data.init_args.config.struct_tokens_datasets_configs.codebook_path=logs/protein2structoken_16b/codebook.pt
   ```
   The outputs are written to `logs/protstruct_decode/protein2structoken_16b_pdb_files/`.
5. You can compare the predicted structures with the ground-truth PDBs in [genbio-ai/casp14-casp15-cameo-test-proteins](https://huggingface.co/datasets/genbio-ai/casp14-casp15-cameo-test-proteins/tree/main).

Alternatively, you can provide your own input amino acid sequences in a CSV file. An example is provided at `experiments/AIDO.StructureTokenizer/protein2structoken_example_input.csv` in `ModelGenerator`:
```
idx,aa_seq
example,KEFWNLDKNLQLRLGIVFLG
```
Here, `idx` is a unique name and `aa_seq` is the amino acid sequence. To use a customized CSV file, replace the second step with:
```bash
mgen predict --config experiments/AIDO.StructureTokenizer/protein2structoken_16b.yaml \
 --data.init_args.path=experiments/AIDO.StructureTokenizer/ \
 --data.init_args.test_split_files=[protein2structoken_example_input.csv]
```

### Build any downstream models from this backbone with ModelGenerator
For more information, visit: [Model Generator](https://github.com/genbio-ai/modelgenerator)
```bash
mgen fit --model SequenceClassification --model.backbone aido_protein_16b --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>
mgen test --model SequenceClassification --model.backbone aido_protein_16b --data SequenceClassificationDataModule --data.path <hf_or_local_path_to_your_dataset>
```
The usage of this model is the same as [AIDO.Protein-16B](https://huggingface.co/genbio-ai/AIDO.Protein-16B).
You only need to change `model.backbone` to `aido_protein2structoken_16b`.


### Or use directly in Python
#### Embedding
```python
from modelgenerator.tasks import Embed
model = Embed.from_config({"model.backbone": "aido_protein2structoken_16b"}).eval()
collated_batch = model.collate({"sequences": ["HELLQ", "WRLD"]})
embedding = model(collated_batch)
print(embedding.shape)
print(embedding)
```
#### Sequence Level Classification
```python
import torch
from modelgenerator.tasks import SequenceClassification
model = SequenceClassification.from_config({"model.backbone": "aido_protein2structoken_16b", "model.n_classes": 2}).eval()
collated_batch = model.collate({"sequences": ["HELLQ", "WRLD"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```
#### Token Level Classification
```python
import torch
from modelgenerator.tasks import TokenClassification
model = TokenClassification.from_config({"model.backbone": "aido_protein2structoken_16b", "model.n_classes": 3}).eval()
collated_batch = model.collate({"sequences": ["HELLQ", "WRLD"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```
#### Regression
```python
from modelgenerator.tasks import SequenceRegression
model = SequenceRegression.from_config({"model.backbone": "aido_protein2structoken_16b"}).eval()
collated_batch = model.collate({"sequences": ["HELLQ", "WRLD"]})
logits = model(collated_batch)
print(logits)
```

## Citation
Please cite AIDO.Protein and AIDO.StructureTokenizer using the following BibTeX entries:
```
@inproceedings{zhang_balancing_2024,
  title = {Balancing Locality and Reconstruction in Protein Structure Tokenizer},
  url = {https://www.biorxiv.org/content/10.1101/2024.12.02.626366v2},
  doi = {10.1101/2024.12.02.626366},
  publisher = {bioRxiv},
  author = {Zhang, Jiayou and Meynard-Piganeau, Barthelemy and Gong, James and Cheng, Xingyi and Luo, Yingtao and Ly, Hugo and Song, Le and Xing, Eric},
  year = {2024},
  booktitle = {NeurIPS 2024 Workshop on Machine Learning in Structural Biology (MLSB)},
}

@inproceedings{sun_mixture_2024,
  title = {Mixture of Experts Enable Efficient and Effective Protein Understanding and Design},
  url = {https://www.biorxiv.org/content/10.1101/2024.11.29.625425v1},
  doi = {10.1101/2024.11.29.625425},
  publisher = {bioRxiv},
  author = {Sun, Ning and Zou, Shuxian and Tao, Tianhua and Mahbub, Sazan and Li, Dian and Zhuang, Yonghao and Wang, Hongyi and Cheng, Xingyi and Song, Le and Xing, Eric P.},
  year = {2024},
  booktitle = {NeurIPS 2024 Workshop on AI for New Drug Modalities},
}
```