---
tags:
- biology
- genbio
- model_hub_mixin
- protein
- pytorch_model_hub_mixin
license: other
---

# AIDO.StructureTokenizer

AIDO.StructureTokenizer is a VQ-VAE-based tokenizer designed for protein structure prediction and tokenization. It encodes amino-acid-agnostic backbone structures into discrete tokens and reconstructs the full atomic-level structures, including side chains. This tokenizer facilitates the integration of 3D protein structure data with sequence-based language models, enabling efficient and accurate multimodal protein modeling.

## Model Description

![Model Architecture](./assets/images/architecture.png)

**AIDO.StructureTokenizer** is built on a Vector Quantized Variational Autoencoder (VQ-VAE) architecture with the following components:
- Equivariant Encoder (6M): Encodes backbone structures into a latent space that maintains rotational and translational symmetries using the Equiformer architecture.
- Discrete Codebook: Maps continuous latent vectors into 512 discrete structural tokens.
- Invariant Decoder (300M): Reconstructs full 3D structures, including side chains, from the structural tokens using an architecture adapted from ESMFold.

This model strikes a balance between reconstruction fidelity and structural locality, optimizing its suitability for downstream tasks such as structure prediction, homology detection, and multimodal protein modeling.
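To make the codebook step concrete, the following is a minimal sketch of nearest-neighbor quantization, the standard lookup used in VQ-VAE models. It is not the AIDO.StructureTokenizer implementation: the function name `quantize`, the 3-dimensional latent space, and the grid-shaped toy codebook are assumptions chosen purely for illustration. Only the codebook size of 512 comes from the description above.

```python
def quantize(latents, codebook):
    """Return, for each latent vector, the index of the nearest codebook entry."""
    tokens = []
    for z in latents:
        # Squared Euclidean distance from z to every code vector.
        dists = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


# Toy codebook (assumption): 512 codes placed on an 8 x 8 x 8 integer grid.
codebook = [[i // 64, (i // 8) % 8, i % 8] for i in range(512)]

# One made-up latent vector per residue.
latents = [[0.1, 0.9, 7.2], [4.0, 4.4, 3.6], [7.3, 0.0, 0.2]]

tokens = quantize(latents, codebook)
print(tokens)  # [15, 292, 448]
```

Each residue ends up represented by a single integer in `[0, 512)`, which is what allows structures to be fed to a sequence-based language model as a token stream.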

### Key Features

- Encoding Structures into Tokens (See [genbio-ai/AIDO.StructureEncoder](https://huggingface.co/genbio-ai/AIDO.StructureEncoder))
- Decoding Tokens into Structures (See [genbio-ai/AIDO.StructureDecoder](https://huggingface.co/genbio-ai/AIDO.StructureDecoder))
- Reconstructing Structures (See [below](#how-to-use))
- Structure Prediction (See [this section](https://huggingface.co/genbio-ai/AIDO.Protein2StructureToken-16B/blob/main/README.md#structure-prediction) in genbio-ai/AIDO.Protein2StructureToken-16B)

## Results

### Reconstructing Structures
![Reconstruction Results](./assets/images/reconstruction.png)
### Homology Detection
![Homology Detection Results](./assets/images/homology_detection.png)
### Structure Prediction
![Structure Prediction Results](./assets/images/structure_prediction.png)

## How to Use
Please see `experiments/AIDO.StructureTokenizer` in [Model Generator](https://github.com/genbio-ai/modelgenerator) for more details.

### Setup
Install [Model Generator](https://github.com/genbio-ai/modelgenerator).

### Data preparation

To reproduce the reconstruction results in the paper, we provide a preprocessed CASP15 dataset at [genbio-ai/sample-structure-dataset](https://huggingface.co/datasets/genbio-ai/sample-structure-dataset). It can be downloaded with:
```bash
huggingface-cli download genbio-ai/sample-structure-dataset --repo-type dataset --local-dir ./data/protstruct_sample_data/
```

This dataset is derived from CASP15. For the original data, see:
- [CASP15 Prediction Center](https://predictioncenter.org/casp15/)
- [Bhattacharya-Lab/CASP15](https://github.com/Bhattacharya-Lab/CASP15)

The downloaded directory includes:
- A `registries` folder containing a CSV file with metadata such as filenames and PDB IDs.
- A `CASP15_merged` folder containing PDB files, where domains are merged in the same way as described in [Bhattacharya-Lab/CASP15](https://github.com/Bhattacharya-Lab/CASP15).

To use your own data, prepare a folder of structure files (supported formats: `cif.gz`, `cif`, `ent.gz`, `pdb`).

Then generate a registry file in CSV format with the following command:
```bash
python experiments/AIDO.StructureTokenizer/register_dataset.py \
    --folder_path /path/to/folder_path \
    --format cif.gz \
    --output_file /path/to/output_file.csv
```

In the following steps, set `folder_path` to the folder containing your structure files and `registry_path` to the generated CSV file.
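Before registering a custom dataset, it can help to check which files in the folder actually match the format you intend to pass to `--format`. The sketch below is a standalone pre-check and does not reproduce what `register_dataset.py` does internally; the helper name `list_structure_files` and the demo file names are hypothetical.

```python
from pathlib import Path
import tempfile

# Formats listed as supported in this README.
SUPPORTED_FORMATS = ("cif.gz", "cif", "ent.gz", "pdb")


def list_structure_files(folder, fmt):
    """Return the sorted names of files in `folder` that end with `.<fmt>`."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    suffix = "." + fmt
    return sorted(
        p.name for p in Path(folder).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )


# Demo on a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.cif.gz", "b.cif", "c.pdb", "notes.txt"):
        (Path(tmp) / name).touch()
    print(list_structure_files(tmp, "cif.gz"))  # ['a.cif.gz']
    print(list_structure_files(tmp, "cif"))     # ['b.cif']
```

An empty result suggests that the folder path or the format string does not match your files.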

### Running Encoding and Decoding Task

If you use the provided CASP15 dataset, you can run the combined encoding and decoding task using the following command:
```bash
CUDA_VISIBLE_DEVICES=0 mgen predict --config=experiments/AIDO.StructureTokenizer/encode_decode.yaml
```

If you use your own dataset, you need to update the `folder_path` and the `registry_path` in the `encode_decode.yaml` configuration file or override them when running the command. Example:
```bash
CUDA_VISIBLE_DEVICES=0 mgen predict --config experiments/AIDO.StructureTokenizer/encode_decode.yaml \
    --data.init_args.config.proteins_datasets_configs.name="your_dataset_name" \
    --data.init_args.config.proteins_datasets_configs.registry_path="your_dataset_registry_path" \
    --data.init_args.config.proteins_datasets_configs.folder_path="your_dataset_folder_path" \
    --trainer.callbacks.dict_kwargs.output_dir="your_output_dir"
```

The input and the output can be summarized as follows:

**Input:**
- The PDB files in the dataset folder.
- The registry file in CSV format indicating the metadata of the dataset.

**Output:**
- The decoded structures and their corresponding original structures are saved in the output directory specified in the configuration file (by default, `logs/protstruct_model/`).
- The decoded structures end with `output.pdb`.
- The original structures end with `input.pdb`.
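To collect matching files for comparison, the original and decoded structures can be paired by file name. The sketch below assumes that each pair shares everything in its name except the trailing `input.pdb` or `output.pdb`; that assumption, the helper name `pair_structures`, and the demo file names are illustrative and not taken from Model Generator.

```python
from pathlib import Path
import tempfile

INPUT_SUFFIX = "input.pdb"
OUTPUT_SUFFIX = "output.pdb"


def pair_structures(output_dir):
    """Map each shared name prefix to its (original, decoded) file pair."""
    root = Path(output_dir)
    pairs = {}
    for original in sorted(root.rglob("*" + INPUT_SUFFIX)):
        stem = original.name[: -len(INPUT_SUFFIX)]
        decoded = original.with_name(stem + OUTPUT_SUFFIX)
        # Keep only originals whose decoded counterpart exists.
        if decoded.exists():
            key = str(original.relative_to(root))[: -len(INPUT_SUFFIX)]
            pairs[key] = (original, decoded)
    return pairs


# Demo on a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("T1_input.pdb", "T1_output.pdb", "T2_input.pdb"):
        (Path(tmp) / name).touch()
    print(sorted(pair_structures(tmp)))  # ['T1_']
```

For a real run, call `pair_structures` on your own output directory.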

**Notes:**
- Decoding the structures can take a long time, even on a GPU.
- Currently, this function supports only single-GPU inference due to the file saving mechanism. We plan to support multi-GPU inference in the future.
- The reconstructed structures are aligned to the original structures using the Kabsch algorithm. This makes it easier to visualize and compare the structures.
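
For readers unfamiliar with the Kabsch algorithm mentioned in the last note, here is a minimal NumPy sketch of the general technique: it finds the rigid rotation and translation that best superimpose one set of coordinates onto another. This is not the code used by Model Generator; the function names `kabsch_align` and `rmsd` and the synthetic coordinates are assumptions for illustration.

```python
import numpy as np


def kabsch_align(mobile, target):
    """Rigidly superimpose `mobile` onto `target` (both N x 3 arrays)."""
    mobile_center = mobile.mean(axis=0)
    target_center = target.mean(axis=0)
    P = mobile - mobile_center
    Q = target - target_center
    # Cross-covariance matrix and its singular value decomposition.
    H = P.T @ Q
    U, _, Vt = np.linalg.svd(H)
    # Flip one axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Rotate the centered coordinates, then move them onto the target centroid.
    return P @ R.T + target_center


def rmsd(a, b):
    """Root-mean-square deviation between two N x 3 coordinate arrays."""
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))


# Synthetic check: rotate and translate a random point set, then align it back.
rng = np.random.default_rng(0)
original = rng.normal(size=(10, 3))
theta = 0.7
rot_z = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
decoded = original @ rot_z.T + np.array([1.0, -2.0, 3.0])

aligned = kabsch_align(decoded, original)
print(rmsd(decoded, original), rmsd(aligned, original))
```

In this synthetic case the second printed value is close to zero, because the two point sets differ only by a rigid motion. For real reconstructions, the remaining RMSD after alignment reflects the actual reconstruction error.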

### Visualizing the Reconstructed Structures

We use VS Code with the [Protein Viewer Extension](https://marketplace.visualstudio.com/items?itemName=ArianJamasb.protein-viewer) to visualize the protein structures; it is a beginner-friendly tool for VS Code users. Any other protein structure viewer (e.g., PyMOL or ChimeraX) also works, but this section focuses on the extension.

If you have run the [Running Encoding and Decoding Task](#running-encoding-and-decoding-task) step, you can find the decoded structures and their corresponding original structures in the output directory. Visualize them as follows:
- Find the desired `output.pdb` and `input.pdb` pair in the side panel. Select both files while holding the `Ctrl` key (for Mac users, hold the `Cmd` key). ![Select Files](./assets/images/structure_tokenizer/select_files.png)
- Right-click on the selected files and choose "Launch Protein Viewer". ![Launch Protein Viewer from File(s)](./assets/images/structure_tokenizer/launch_protein_viewer.png)
- A new tab opens with both structures displayed, and you can interact with them using the Protein Viewer extension. Because the reconstructed structures have been aligned to the original structures with the Kabsch algorithm, the two should appear superimposed as shown here, with each color representing a different file. ![Visualize Reconstruction](./assets/images/structure_tokenizer/visualize_reconstruction.png)

## Citation
Please cite AIDO.StructureTokenizer using the following BibTeX entry:
```
@inproceedings{zhang_balancing_2024,
	title = {Balancing Locality and Reconstruction in Protein Structure Tokenizer},
	url = {https://www.biorxiv.org/content/10.1101/2024.12.02.626366v2},
	doi = {10.1101/2024.12.02.626366},
	publisher = {bioRxiv},
	author = {Zhang, Jiayou and Meynard-Piganeau, Barthelemy and Gong, James and Cheng, Xingyi and Luo, Yingtao and Ly, Hugo and Song, Le and Xing, Eric},
	year = {2024},
	booktitle = {NeurIPS 2024 Workshop on Machine Learning in Structural Biology (MLSB)},
}
```