AIDO.StructureTokenizer

AIDO.StructureTokenizer is a VQ-VAE-based tokenizer designed for protein structure prediction and tokenization. It encodes amino-acid-agnostic backbone structures into discrete tokens and reconstructs the full atomic-level structures, including side chains. This tokenizer facilitates the integration of 3D protein structure data with sequence-based language models, enabling efficient and accurate multimodal protein modeling.

Model Description

Model Architecture

AIDO.StructureTokenizer is built on a Vector Quantized Variational Autoencoder (VQ-VAE) architecture with the following components:

  • Equivariant Encoder (6M parameters): Encodes backbone structures into a latent space that preserves rotational and translational symmetries, using the Equiformer architecture.
  • Discrete Codebook: Maps continuous latent vectors to one of 512 discrete structural tokens.
  • Invariant Decoder (300M parameters): Reconstructs full 3D structures, including side chains, from the structural tokens using an architecture adapted from ESMFold.

This model balances reconstruction fidelity against structural locality, which makes it suitable for downstream tasks such as structure prediction, homology detection, and multimodal protein modeling.
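The codebook step can be illustrated with a minimal sketch of standard VQ-VAE quantization: each continuous latent vector is replaced by the index of its nearest codebook entry. This is not the model's actual code. The Euclidean distance, the toy 4-entry codebook, and the names quantize and dequantize are assumptions made for illustration; the real codebook has 512 entries, and its latent dimension is not stated here.

```python
def quantize(latents, codebook):
    """Map each continuous latent vector to the index of its nearest codebook entry."""
    tokens = []
    for z in latents:
        # Squared Euclidean distance from z to every code (assumed metric).
        distances = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook]
        # The token id is the index of the closest code.
        tokens.append(min(range(len(codebook)), key=distances.__getitem__))
    return tokens


def dequantize(tokens, codebook):
    """Look up the code vector for each token id (the decoder's input in a VQ-VAE)."""
    return [codebook[t] for t in tokens]


# Toy codebook: 4 entries of dimension 2 (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

tokens = quantize(latents, codebook)
print(tokens)  # [0, 3, 2]
```

In this picture, a protein of N residues becomes a sequence of N integer tokens, which is the form a sequence-based language model can consume.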

Key Features

Results

Reconstructing Structures

Reconstruction Results

Homology Detection

Homology Detection Results

Structure Prediction

Structure Prediction Results

How to Use

Please see experiments/AIDO.StructureTokenizer in Model Generator for more details.

Setup

Install Model Generator

Data preparation

To reproduce the reconstruction results in the paper, we provide a preprocessed CASP15 dataset at genbio-ai/sample-structure-dataset. It can be downloaded with:

huggingface-cli download genbio-ai/sample-structure-dataset --repo-type dataset --local-dir ./data/protstruct_sample_data/

This dataset is based on the CASP15 dataset.

The downloaded directory includes:

  • A registries folder containing a CSV file with metadata such as filenames and PDB IDs.
  • A CASP15_merged folder containing PDB files, where domains are merged in the same way as described in Bhattacharya-Lab/CASP15.
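After downloading, it can be useful to check what the registry actually contains. The following is a minimal sketch that prints the header and first rows of every CSV file in the registries folder. The function name preview_registries is hypothetical, and the column names are not assumed; the script simply reports whatever it finds.

```python
import csv
from itertools import islice
from pathlib import Path


def preview_registries(data_dir, max_rows=3):
    """Return {csv_filename: (header, first_rows)} for each CSV in <data_dir>/registries."""
    previews = {}
    registry_dir = Path(data_dir) / "registries"
    if not registry_dir.is_dir():
        return previews
    for csv_path in sorted(registry_dir.glob("*.csv")):
        with csv_path.open(newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])            # first line holds the column names
            rows = list(islice(reader, max_rows))  # a few data rows as a sample
        previews[csv_path.name] = (header, rows)
    return previews


if __name__ == "__main__":
    # Path taken from the download command above.
    for name, (header, rows) in preview_registries("./data/protstruct_sample_data").items():
        print(name, header)
        for row in rows:
            print("  ", row)
```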

To use customized data, you can prepare a dataset with the following structure:

  • A folder containing PDB files (supported formats: cif.gz, cif, ent.gz, pdb).

Then, you need to prepare a registry file in CSV format using the following command:

python experiments/AIDO.StructureTokenizer/register_dataset.py \
    --folder_path /path/to/folder_path \
    --format cif.gz \
    --output_file /path/to/output_file.csv

In the following steps, set folder_path to the folder containing your structure files and registry_path to the generated CSV file.
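When preparing your own folder, a quick check of which file formats it contains can help you pick the value for --format. The sketch below counts files by the supported suffixes listed above. It is an illustration only and does not reproduce what register_dataset.py does; in particular, it scans just the top level of the folder, and whether the registration script recurses into subfolders is not stated here.

```python
from collections import Counter
from pathlib import Path

# Suffixes listed as supported in this document.
SUPPORTED_FORMATS = ("cif.gz", "cif", "ent.gz", "pdb")


def count_formats(folder_path):
    """Count top-level files in folder_path by supported structure-file suffix."""
    counts = Counter()
    for path in Path(folder_path).iterdir():
        if not path.is_file():
            continue
        for fmt in SUPPORTED_FORMATS:
            if path.name.endswith("." + fmt):
                counts[fmt] += 1
                break  # each file is counted under at most one format
    return counts
```

For example, count_formats("/path/to/folder_path").most_common(1) returns the dominant format, which is a reasonable candidate for the --format argument.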

Running Encoding and Decoding Task

If you use the provided CASP15 dataset, you can run the combined encoding and decoding task using the following command:

CUDA_VISIBLE_DEVICES=0 mgen predict --config=experiments/AIDO.StructureTokenizer/encode_decode.yaml

If you use your own dataset, you need to update the folder_path and the registry_path in the encode_decode.yaml configuration file or override them when running the command. Example:

CUDA_VISIBLE_DEVICES=0 mgen predict --config experiments/AIDO.StructureTokenizer/encode_decode.yaml \
    --data.init_args.config.proteins_datasets_configs.name="your_dataset_name" \
    --data.init_args.config.proteins_datasets_configs.registry_path="your_dataset_registry_path" \
    --data.init_args.config.proteins_datasets_configs.folder_path="your_dataset_folder_path" \
    --trainer.callbacks.dict_kwargs.output_dir="your_output_dir"

The input and the output can be summarized as follows:

Input:

  • The PDB files in the dataset folder.
  • The registry file in CSV format indicating the metadata of the dataset.

Output:

  • The decoded structures and their corresponding original structures will be saved in the output directory specified in the configuration file. By default it is saved in logs/protstruct_model/.
  • The decoded structures end with output.pdb.
  • The original structures end with input.pdb.
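To work with the results programmatically, the decoded and original files need to be matched up. The sketch below pairs them under the assumption that each pair shares the same filename prefix before the input.pdb or output.pdb ending and sits in the same directory. Only the endings are stated in this document, so the shared-prefix convention and the function name pair_structures are assumptions.

```python
from pathlib import Path


def pair_structures(output_dir):
    """Return (original, decoded) path pairs found anywhere under output_dir."""
    pairs = []
    for original in sorted(Path(output_dir).rglob("*input.pdb")):
        # Strip the "input.pdb" ending to recover the assumed shared prefix.
        prefix = original.name[: -len("input.pdb")]
        decoded = original.with_name(prefix + "output.pdb")
        if decoded.exists():
            pairs.append((original, decoded))
    return pairs


if __name__ == "__main__":
    # Default output directory mentioned above.
    for original, decoded in pair_structures("logs/protstruct_model/"):
        print(original.name, "<->", decoded.name)
```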

Notes:

  • Decoding the structures could take a long time even when using a GPU.
  • Currently, this function only supports single GPU inference due to the file saving mechanism. We plan to support multi-GPU inference in the future.
  • The reconstructed structures are aligned to the original structures using the Kabsch algorithm. This makes it easier to visualize and compare the structures.
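For readers unfamiliar with the Kabsch algorithm mentioned in the notes, here is a minimal NumPy sketch of it: it finds the rotation and translation that best superimpose one set of points on another in the least-squares sense, then reports the RMSD. This is a generic textbook version and not the repository's implementation; which atoms the tokenizer pipeline uses for alignment is not stated here, and the function name kabsch_align is hypothetical.

```python
import numpy as np


def kabsch_align(mobile, target):
    """Rigidly align mobile (N x 3) onto target (N x 3); return aligned coords and RMSD."""
    mobile = np.asarray(mobile, dtype=float)
    target = np.asarray(target, dtype=float)
    # Center both point sets on their centroids.
    target_center = target.mean(axis=0)
    P = mobile - mobile.mean(axis=0)
    Q = target - target_center
    # SVD of the 3 x 3 covariance matrix gives the optimal rotation.
    U, _, Vt = np.linalg.svd(P.T @ Q)
    # Flip the last axis if needed so the result is a proper rotation, not a reflection.
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Rotate the centered mobile points and move them onto the target centroid.
    aligned = P @ R.T + target_center
    rmsd = float(np.sqrt(((aligned - target) ** 2).sum(axis=1).mean()))
    return aligned, rmsd


# Toy check: mobile is target rotated 90 degrees about z and shifted,
# so the RMSD after alignment is numerically zero.
target = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3]], dtype=float)
rotation = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
mobile = target @ rotation.T + np.array([5.0, -2.0, 1.0])
aligned, rmsd = kabsch_align(mobile, target)
print(f"RMSD after alignment: {rmsd:.6f}")
```

Applied to matching atom coordinates from an input.pdb and output.pdb pair, the returned RMSD gives a single-number measure of reconstruction error.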

Visualizing the Reconstructed Structures

We use VS Code + Protein Viewer Extension to visualize the protein structures. It's a beginner-friendly tool for VS Code users. You could also use your preferred protein structure viewer to visualize the structures (e.g., PyMOL, ChimeraX, etc.), but here we focus on this extension.

After running the encoding and decoding task described above, you can find the decoded structures and their corresponding original structures in the output directory. You can visualize them as follows:

  • Find the desired output.pdb and input.pdb pair in the side panel. Select both files while holding the Ctrl key (on a Mac, hold the Cmd key).
  • Right-click on the selected files and choose "Launch Protein Viewer".
  • A new tab will open with the protein structures displayed. You can interact with the structures using the Protein Viewer extension. Because the reconstructed structures have been aligned to the original structures using the Kabsch algorithm, the two structures should appear overlaid, with each file shown in a different color.

Citation

Please cite AIDO.StructureTokenizer using the following BibTeX entry:

@inproceedings{zhang_balancing_2024,
    title = {Balancing Locality and Reconstruction in Protein Structure Tokenizer},
    url = {https://www.biorxiv.org/content/10.1101/2024.12.02.626366v2},
    doi = {10.1101/2024.12.02.626366},
    publisher = {bioRxiv},
    author = {Zhang, Jiayou and Meynard-Piganeau, Barthelemy and Gong, James and Cheng, Xingyi and Luo, Yingtao and Ly, Hugo and Song, Le and Xing, Eric},
    year = {2024},
    booktitle={NeurIPS 2024 Workshop on Machine Learning in Structural Biology (MLSB)},
}