AIDO.StructureTokenizer
AIDO.StructureTokenizer is a VQ-VAE-based tokenizer designed for protein structure prediction and tokenization. It encodes amino-acid-agnostic backbone structures into discrete tokens and reconstructs the full atomic-level structures, including side chains. This tokenizer facilitates the integration of 3D protein structure data with sequence-based language models, enabling efficient and accurate multimodal protein modeling.
Model Description
AIDO.StructureTokenizer is built on a Vector Quantized Variational Autoencoder (VQ-VAE) architecture with the following components:
- Equivariant Encoder (6M): Encodes backbone structures into a latent space that maintains rotational and translational symmetries using the Equiformer architecture.
- Discrete Codebook: Maps continuous latent vectors into 512 discrete structural tokens.
- Invariant Decoder (300M): Reconstructs full 3D structures, including side chains, from the structural tokens using an architecture adapted from ESMFold.
This model strikes a balance between reconstruction fidelity and structural locality, optimizing its suitability for downstream tasks such as structure prediction, homology detection, and multimodal protein modeling.
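To make the role of the discrete codebook concrete, here is a minimal sketch of the nearest-neighbor lookup that a VQ-VAE codebook typically performs. It is not the model's actual implementation: the codebook values are random, the latent width of 4 is illustrative only, and one latent vector per residue is assumed. Only the codebook size of 512 comes from the description above.

```python
import random

def quantize(latent, codebook):
    """Return the index of the codebook entry closest to `latent` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(latent, codebook[k]))

# 512 codes as stated in the model description; the latent width (4) and the
# random values are hypothetical and chosen only for illustration.
rng = random.Random(0)
codebook = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(512)]

# Assumed: the encoder emits one continuous latent vector per residue.
latents = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(10)]

# Each latent vector becomes one integer token in the range [0, 512).
tokens = [quantize(z, codebook) for z in latents]
```

A protein of N residues would then be represented as N integer tokens, which is the form a sequence-based language model can consume.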
Key Features
- Encoding Structures into Tokens (See genbio-ai/AIDO.StructureEncoder)
- Decoding Tokens into Structures (See genbio-ai/AIDO.StructureDecoder)
- Reconstructing Structures (See below)
- Structure Prediction (See this section in genbio-ai/AIDO.Protein2StructureToken-16B)
Results
Reconstructing Structures
Homology Detection
Structure Prediction
How to Use
Please see experiments/AIDO.StructureTokenizer in Model Generator for more details.
Setup
Install Model Generator
Data preparation
To reproduce the reconstruction results in the paper, we provide a preprocessed CASP15 dataset at genbio-ai/sample-structure-dataset. It can be downloaded with:
huggingface-cli download genbio-ai/sample-structure-dataset --repo-type dataset --local-dir ./data/protstruct_sample_data/
This dataset is based on the CASP15 dataset, which can be referenced at:
The downloaded directory includes:
- A registries folder containing a CSV file with metadata such as filenames and PDB IDs.
- A CASP15_merged folder containing PDB files, where domains are merged in the same way as described in Bhattacharya-Lab/CASP15.
To use customized data, you can prepare a dataset with the following structure:
- A folder containing PDB files (supported formats: cif.gz, cif, ent.gz, pdb).
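Before registering a custom folder, it can help to check which of the supported formats it actually contains. The following is a minimal sketch and not part of Model Generator: the helper names detect_format and group_by_format are hypothetical, and matching is assumed to be case-sensitive on the file name suffix.

```python
from pathlib import Path

# Extensions listed in this README as supported structure formats.
# Longer suffixes come first so "x.cif.gz" is reported as cif.gz.
SUPPORTED_SUFFIXES = ("cif.gz", "ent.gz", "cif", "pdb")

def detect_format(filename):
    """Return the supported format a file name ends with, or None."""
    for suffix in SUPPORTED_SUFFIXES:
        if filename.endswith("." + suffix):
            return suffix
    return None

def group_by_format(folder):
    """Map each supported format to the sorted file names found in `folder`."""
    groups = {}
    for path in sorted(Path(folder).iterdir()):
        fmt = detect_format(path.name) if path.is_file() else None
        if fmt is not None:
            groups.setdefault(fmt, []).append(path.name)
    return groups
```

Calling group_by_format("/path/to/folder_path") returns a dictionary such as one keyed by cif.gz, which indicates the value to pass to the --format flag of the registration script.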
Then, you need to prepare a registry file in CSV format using the following command:
python experiments/AIDO.StructureTokenizer/register_dataset.py \
--folder_path /path/to/folder_path \
--format cif.gz \
--output_file /path/to/output_file.csv
In the following steps, replace folder_path and registry_path with the paths to your own folder and registry file.
Running Encoding and Decoding Task
If you use the provided CASP15 dataset, you can run the combined encoding and decoding task using the following command:
CUDA_VISIBLE_DEVICES=0 mgen predict --config=experiments/AIDO.StructureTokenizer/encode_decode.yaml
If you use your own dataset, you need to update the folder_path and the registry_path in the encode_decode.yaml configuration file or override them when running the command. Example:
CUDA_VISIBLE_DEVICES=0 mgen predict --config experiments/AIDO.StructureTokenizer/encode_decode.yaml \
--data.init_args.config.proteins_datasets_configs.name="your_dataset_name" \
--data.init_args.config.proteins_datasets_configs.registry_path="your_dataset_registry_path" \
--data.init_args.config.proteins_datasets_configs.folder_path="your_dataset_folder_path" \
--trainer.callbacks.dict_kwargs.output_dir="your_output_dir"
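When running over several datasets, the override flags can be assembled in a script instead of typed by hand. The sketch below only builds the argument list from the flags shown in the example command; the function name build_predict_command and the sample values are hypothetical, and it assumes registry_path points to the CSV registry file while folder_path points to the folder of structure files.

```python
import shlex

CONFIG = "experiments/AIDO.StructureTokenizer/encode_decode.yaml"
DATASET_PREFIX = "--data.init_args.config.proteins_datasets_configs"

def build_predict_command(name, folder_path, registry_path, output_dir):
    """Assemble the `mgen predict` argument list with dataset overrides."""
    return [
        "mgen", "predict", "--config", CONFIG,
        f"{DATASET_PREFIX}.name={name}",
        f"{DATASET_PREFIX}.registry_path={registry_path}",
        f"{DATASET_PREFIX}.folder_path={folder_path}",
        f"--trainer.callbacks.dict_kwargs.output_dir={output_dir}",
    ]

# Illustrative values only; replace them with your own paths.
cmd = build_predict_command(
    name="my_dataset",
    folder_path="/path/to/folder_path",
    registry_path="/path/to/output_file.csv",
    output_dir="logs/my_run",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run, with CUDA_VISIBLE_DEVICES set in the environment to select a single GPU.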
The input and the output can be summarized as follows:
Input:
- The PDB files in the dataset folder.
- The registry file in CSV format indicating the metadata of the dataset.
Output:
- The decoded structures and their corresponding original structures will be saved in the output directory specified in the configuration file. By default, they are saved in logs/protstruct_model/.
- The decoded structures end with output.pdb.
- The original structures end with input.pdb.
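To collect the pairs for later comparison, the naming convention above can be used to match each original structure with its decoded counterpart. This is a minimal sketch rather than a tool shipped with Model Generator: the function name pair_structures is hypothetical, and it assumes the two files of a pair share the same name prefix and sit in the same directory, which the source does not state explicitly.

```python
from pathlib import Path

def pair_structures(output_dir):
    """Pair files ending in input.pdb with files ending in output.pdb.

    Assumption: both files of a pair share the same name prefix and live
    in the same directory. Originals without a decoded file are skipped.
    """
    pairs = []
    for original in sorted(Path(output_dir).rglob("*input.pdb")):
        prefix = original.name[: -len("input.pdb")]
        decoded = original.with_name(prefix + "output.pdb")
        if decoded.exists():
            pairs.append((original, decoded))
    return pairs
```

For example, pair_structures("logs/protstruct_model/") returns a list of (original, decoded) path tuples that can be fed to a structure comparison script.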
Notes:
- Decoding the structures could take a long time even when using a GPU.
- Currently, this function only supports single GPU inference due to the file saving mechanism. We plan to support multi-GPU inference in the future.
- The reconstructed structures are aligned to the original structures using the Kabsch algorithm. This makes it easier to visualize and compare the structures.
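For readers unfamiliar with the Kabsch algorithm mentioned in the notes, the sketch below shows the standard formulation with NumPy: center both point sets, take the SVD of their covariance matrix, and build the rotation that minimizes RMSD. It is not the code used by Model Generator, and the source does not say which atoms are used for the alignment; the function name kabsch_align is hypothetical.

```python
import numpy as np

def kabsch_align(mobile, target):
    """Rigidly superimpose `mobile` onto `target` (both N x 3 arrays).

    Returns the aligned copy of `mobile` and the RMSD after alignment.
    """
    mobile = np.asarray(mobile, dtype=float)
    target = np.asarray(target, dtype=float)
    mobile_center = mobile.mean(axis=0)
    target_center = target.mean(axis=0)
    P = mobile - mobile_center
    Q = target - target_center

    # SVD of the 3x3 covariance matrix yields the optimal rotation.
    H = P.T @ Q
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a proper rotation
    # (determinant +1) rather than a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Rotate the centered points, then move them onto the target's centroid.
    aligned = P @ R.T + target_center
    rmsd = float(np.sqrt(((aligned - target) ** 2).sum(axis=1).mean()))
    return aligned, rmsd
```

Given matching coordinate arrays from a decoded and an original structure, the returned RMSD gives a single-number summary of reconstruction error after superposition.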
Visualizing the Reconstructed Structures
We use VS Code + Protein Viewer Extension to visualize the protein structures. It's a beginner-friendly tool for VS Code users. You could also use your preferred protein structure viewer to visualize the structures (e.g., PyMOL, ChimeraX, etc.), but here we focus on this extension.
If you have completed the steps in Running Encoding and Decoding Task, you can find the decoded structures and their corresponding original structures in the output directory. You can visualize them as follows.
- Find the desired output.pdb and input.pdb pair in the side panel. Select both files while holding the Ctrl key (for Mac users, hold the Cmd key).
- Right-click on the selected files and choose "Launch Protein Viewer".
- A new tab will open with the protein structures displayed. You can interact with the structures using the Protein Viewer extension. Because we have aligned the reconstructed structures to the original structures using the Kabsch algorithm, the two structures should appear superimposed, with each file shown in a different color.
Citation
Please cite AIDO.StructureTokenizer using the following BibTeX entry:
@inproceedings{zhang_balancing_2024,
title = {Balancing Locality and Reconstruction in Protein Structure Tokenizer},
url = {https://www.biorxiv.org/content/10.1101/2024.12.02.626366v2},
doi = {10.1101/2024.12.02.626366},
publisher = {bioRxiv},
author = {Zhang, Jiayou and Meynard-Piganeau, Barthelemy and Gong, James and Cheng, Xingyi and Luo, Yingtao and Ly, Hugo and Song, Le and Xing, Eric},
year = {2024},
booktitle={NeurIPS 2024 Workshop on Machine Learning in Structural Biology (MLSB)},
}