---
tags:
- biology
---

# ADIO.Protein 16B

ADIO.Protein is, to date, the largest protein foundation model in the world, trained on 1.2 trillion amino acids sourced from UniRef90 and ColabFoldDB.

By leveraging mixture-of-experts (MoE) layers, ADIO.Protein scales efficiently to 16 billion parameters and delivers strong performance across a wide range of protein sequence understanding and sequence generation tasks. Notably, it does so despite being trained solely on single protein sequences. Across more than 280 deep mutational scanning (DMS) protein fitness prediction tasks, our model outperforms previous state-of-the-art protein sequence models that do not use multiple sequence alignments (MSA) and achieves 99% of the performance of models that do use MSA, highlighting the strength of its learned representations.

# Model Architecture Details

ADIO.Protein is a transformer encoder-only architecture in which the dense MLP layer in each transformer block is replaced by a sparse MoE layer. It uses single amino acid tokenization and is optimized with a masked language modeling (MLM) training objective. For each token, 2 experts are selectively activated by the top-2 routing mechanism.

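To make the top-2 routing idea concrete, here is a minimal sketch in plain Python. It is not the model's implementation: the function names `top2_route` and `moe_layer`, the toy experts, and the choice to renormalize the two gate weights are all assumptions made for illustration.

```python
import math


def top2_route(router_logits):
    """Pick the two highest-scoring experts for one token.

    router_logits: one raw router score per expert (illustrative input).
    Returns [(expert_index, gate_weight), ...] with weights summing to 1.
    """
    # Softmax over all experts, shifted by the max for numerical stability.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the two most probable experts.
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]

    # Renormalize the two gate weights (an assumption; some MoE variants skip this).
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]


def moe_layer(token, experts, router_logits):
    """Combine the outputs of the two selected experts for one token vector."""
    output = [0.0] * len(token)
    for index, weight in top2_route(router_logits):
        expert_out = experts[index](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


# Four toy "experts" that only scale the input; real experts are MLPs.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
routes = top2_route([0.1, 2.0, -1.0, 2.0])
mixed = moe_layer([1.0, 1.0], experts, [0.1, 2.0, -1.0, 2.0])
print(routes)
print(mixed)
```

Only the two selected experts run for each token, which is why the total parameter count can grow without a matching increase in per-token compute.
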
More architecture details are shown below:

| Model Arch Component | Value |
| --- | --- |
| Num Attention Head | |
| Num Hidden Layer | |
| Hidden Size | |
| Intermediate Size | |
| Num MoE Layer | |
| Vocab Size | |
| Context Length | |

# Pre-training of ADIO.Protein 16B

Here we briefly introduce the pre-training details of ADIO.Protein 16B. For more information, please refer to <our paper>.

## Data

## Training Details

## Tokenization

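The architecture description states that the model uses single amino acid tokenization and an MLM objective. The sketch below illustrates both ideas under stated assumptions: the vocabulary, the special-token names `[PAD]` and `[MASK]`, the 15% default mask rate, and the `-100` ignore label are common conventions chosen for illustration, not values confirmed for ADIO.Protein.

```python
import random

# Hypothetical vocabulary: the 20 standard amino acids plus two special tokens.
# The model's real vocabulary and special-token names may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIAL_TOKENS = ["[PAD]", "[MASK]"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}


def tokenize(sequence):
    """Map each residue (one letter) to one token id."""
    return [VOCAB[residue] for residue in sequence.upper()]


def mask_for_mlm(token_ids, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with [MASK] and record the targets.

    mask_rate=0.15 is the common BERT-style default, assumed here.
    Returns (masked_ids, labels); labels is -100 at unmasked positions,
    the ignore-index convention used by many training loops.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_rate:
            masked.append(VOCAB["[MASK]"])
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(-100)
    return masked, labels


ids = tokenize("MKTAYIAKQR")
masked_ids, labels = mask_for_mlm(ids, mask_rate=0.3)
print(ids)
print(masked_ids)
print(labels)
```

During training, the model would be asked to predict the original residue at each masked position, and the loss would be computed only where the label is not `-100`.
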
# Evaluation of ADIO.Protein 16B

# Results

## How to Use

### Build any downstream models from this backbone

#### Embedding

```python
from genbio_finetune.tasks import Embed

# Load the embedding task on top of the protein backbone, in inference mode.
model = Embed.from_config({"model.backbone": "proteinfm"}).eval()
# Prepare a batch from raw sequences.
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
embedding = model(collated_batch)
print(embedding.shape)
print(embedding)
```

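Downstream models often need one vector per protein rather than one per residue. The sketch below shows masked mean pooling on plain nested lists. It assumes the embedding has shape (batch, length, hidden), which this document does not state, and the helper name `mean_pool` and the padding mask are hypothetical.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average per-residue vectors into one vector per sequence.

    token_embeddings: nested lists shaped [batch][length][hidden]
    attention_mask:   nested lists shaped [batch][length], 1 = residue, 0 = padding
    """
    pooled = []
    for vectors, mask in zip(token_embeddings, attention_mask):
        # Drop padded positions so they do not dilute the average.
        kept = [vec for vec, keep in zip(vectors, mask) if keep]
        if not kept:
            raise ValueError("sequence has no unmasked positions")
        hidden = len(kept[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(hidden)])
    return pooled


# Toy batch: two sequences, three positions, hidden size 2.
toy_embeddings = [
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],  # last position is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
]
toy_mask = [[1, 1, 0], [1, 1, 1]]
pooled = mean_pool(toy_embeddings, toy_mask)
print(pooled)
```

With real model outputs, the same operation would typically be done on tensors; the list version here only illustrates the arithmetic.
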
#### Sequence Level Classification

```python
import torch
from genbio_finetune.tasks import SequenceClassification

# Binary classifier head on top of the protein backbone, in inference mode.
model = SequenceClassification.from_config({"model.backbone": "proteinfm", "model.n_classes": 2}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
# Predicted class index for each sequence.
print(torch.argmax(logits, dim=-1))
```

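Raw logits are not probabilities. A minimal sketch of turning per-sequence logits into a predicted label with a confidence score is shown below, using plain Python lists in place of tensors. The helper names and the class names `"negative"` and `"positive"` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(batch_logits, class_names):
    """Return (label, probability) for the most likely class of each row."""
    results = []
    for row in batch_logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((class_names[best], probs[best]))
    return results


# Toy logits for two sequences and two classes.
toy_logits = [[2.0, -1.0], [0.3, 0.9]]
predictions = predict(toy_logits, ["negative", "positive"])
print(predictions)
```
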
#### Token Level Classification

```python
import torch
from genbio_finetune.tasks import TokenClassification

# Three-class per-residue classifier head on top of the protein backbone.
model = TokenClassification.from_config({"model.backbone": "proteinfm", "model.n_classes": 3}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
# Predicted class index for each token.
print(torch.argmax(logits, dim=-1))
```

#### Regression

```python
from genbio_finetune.tasks import SequenceRegression

# Regression head on top of the protein backbone, in inference mode.
model = SequenceRegression.from_config({"model.backbone": "proteinfm"}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
# The output is a predicted value per sequence rather than class logits.
predictions = model(collated_batch)
print(predictions)
```

#### Protein-Protein Interaction

#### Or use our one-liner CLI to finetune or evaluate any of the above!

```
gbft fit --model SequenceClassification --model.backbone proteinfm --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
gbft test --model SequenceClassification --model.backbone proteinfm --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
```

For more information, visit: [Model Generator](https://github.com/genbio-ai/modelgenerator)

# Citation

Please cite ADIO.Protein using the following BibTeX entry:

```
@inproceedings{Sun2024mixture,
    title={Mixture of Experts Enable Efficient and Effective Protein Understanding and Design},
    author={Ning Sun and Shuxian Zou and Tianhua Tao and Sazan Mahbub and Dian Li and Yonghao Zhuang and Hongyi Wang and Le Song and Eric P. Xing},
    booktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},
    year={2024}
}
```