# genbio-ai/proteinMoE-16b-Petal

**genbio-ai/proteinMoE-16b-Petal** is a fine-tuned version of **genbio-ai/proteinMoE-16b**, specifically designed for protein structure prediction. It takes amino acid sequences as input and predicts tokens that can be decoded into 3D structures by **genbio-ai/petal-decoder**, and it surpasses existing state-of-the-art models such as **ESM3-open** on structure prediction tasks.

## Model Architecture Details

This model retains the architecture of AIDO.Protein 16B: a transformer encoder-only model in which the dense MLP layers are replaced by sparse Mixture of Experts (MoE) layers. Each token activates 2 experts via a top-2 routing mechanism. A visual summary of the architecture is provided below:

<center>(Model architecture diagram)</center>
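
For intuition, here is a minimal, self-contained sketch of a top-2 MoE feed-forward block. It is illustrative only and not this model's implementation: the layer sizes and names (`Top2MoE`, `hidden_dim`, `ffn_dim`, `num_experts`) are placeholders, not values taken from this card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top2MoE(nn.Module):
    """Sketch of a sparse MoE feed-forward block with top-2 routing."""

    def __init__(self, hidden_dim=64, ffn_dim=128, num_experts=8):
        super().__init__()
        self.router = nn.Linear(hidden_dim, num_experts)  # scores each token against each expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, hidden_dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                               # x: (batch, seq_len, hidden_dim)
        scores = self.router(x)                         # (batch, seq_len, num_experts)
        top2_scores, top2_idx = scores.topk(2, dim=-1)  # each token keeps only its 2 best experts
        weights = F.softmax(top2_scores, dim=-1)        # mixing weights over the 2 chosen experts
        out = torch.zeros_like(x)
        for slot in range(2):                           # accumulate the two selected experts per token
            for e, expert in enumerate(self.experts):
                routed = (top2_idx[..., slot] == e).unsqueeze(-1).to(x.dtype)  # tokens that chose expert e
                out = out + routed * weights[..., slot:slot + 1] * expert(x)
        return out


# Example: 2 sequences of 10 tokens with a toy hidden size of 64
print(Top2MoE()(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```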

### Key Differences

The final output linear layer has been adapted to support a new vocabulary size (a minimal sketch follows the list):
- **Input Vocabulary Size**: 44 (amino acids + special tokens)
- **Output Vocabulary Size**: 512 (structure tokens, without special tokens)
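
The change amounts to swapping the output projection while keeping the 44-token input embedding. A minimal sketch, assuming a placeholder hidden size (only the 44 and 512 vocabulary sizes come from this card):

```python
import torch.nn as nn

HIDDEN_DIM = 2048          # placeholder; substitute the backbone's actual hidden size
INPUT_VOCAB_SIZE = 44      # amino acids + special tokens (from this card)
OUTPUT_VOCAB_SIZE = 512    # structure tokens, no special tokens (from this card)

embedding = nn.Embedding(INPUT_VOCAB_SIZE, HIDDEN_DIM)  # amino-acid token ids -> hidden states
lm_head = nn.Linear(HIDDEN_DIM, OUTPUT_VOCAB_SIZE)      # hidden states -> structure-token logits
```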

### Architecture Parameters

| Component                 | Value |
|---------------------------|-------|
| Number of Attention Heads | 36    |
| Output Vocabulary Size    | 512   |
| Context Length            | 1024  |

## Training Details

The fine-tuning process used **0.4 trillion tokens**, drawn from the AlphaFold database (**170M samples**) and the PDB database (**0.4M samples**), making the model highly specialized for structure prediction. Training took around 20 days on 64 A100 GPUs.

- **Tokens Trained**: 4T tokens
- **Training steps**: 200k steps

## Tokenization

The input should be a single-chain amino acid sequence.

- **Input Tokenization**: Sequences are tokenized at the amino acid level and terminated with a `[SEP]` token (id=34).
- **Output Tokenization**: Each input token is converted into a structure token. The output can be decoded into 3D structures in PDB format using **genbio-ai/petal-decoder** (see the shape sketch after this list).
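
As a rough illustration of the shapes involved (this is not the actual tokenizer API; the `aa_to_id` mapping and example sequence are stand-ins, and only the `[SEP]` id of 34 and the 512-token structure vocabulary come from this card):

```python
import torch

SEP_ID = 34                    # [SEP] terminator id (from this card)
STRUCT_VOCAB_SIZE = 512        # number of structure tokens (from this card)

sequence = "MKTAYIAKQR"        # a single-chain amino acid sequence
# Stand-in per-residue mapping; the real tokenizer defines the 44-token input vocabulary.
aa_to_id = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

input_ids = torch.tensor([aa_to_id[aa] for aa in sequence] + [SEP_ID])
print(input_ids.shape)         # torch.Size([11]) -> one token per residue, plus [SEP]

# The model emits one structure token per input position; these ids (0..511) are what
# genbio-ai/petal-decoder turns into a 3D structure in PDB format.
dummy_logits = torch.randn(input_ids.shape[0], STRUCT_VOCAB_SIZE)
structure_tokens = dummy_logits.argmax(dim=-1)
print(structure_tokens.shape)  # torch.Size([11])
```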

## Results

<center><img src="StructurePrediction.PNG" alt="Structure Prediction Results" style="width:40%; height:auto;" /></center>

## How to use

### Build any downstream models from this backbone

The usage of this model is the same as **genbio-ai/proteinMoE-16b**. You only need to change the `model.backbone` to `proteinfm_struct_token`.

#### Embedding

```python
from genbio_finetune.tasks import Embed

model = Embed.from_config({"model.backbone": "proteinfm_struct_token"}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
embedding = model(collated_batch)
print(embedding.shape)
print(embedding)
```

#### Sequence Level Classification

```python
import torch
from genbio_finetune.tasks import SequenceClassification

model = SequenceClassification.from_config({"model.backbone": "proteinfm_struct_token", "model.n_classes": 2}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```

#### Token Level Classification

```python
import torch
from genbio_finetune.tasks import TokenClassification

model = TokenClassification.from_config({"model.backbone": "proteinfm_struct_token", "model.n_classes": 3}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
```

#### Regression

```python
from genbio_finetune.tasks import SequenceRegression

model = SequenceRegression.from_config({"model.backbone": "proteinfm_struct_token"}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
```

#### Or use our one-liner CLI to finetune or evaluate any of the above!

```bash
gbft fit --model SequenceClassification --model.backbone proteinfm_struct_token --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
gbft test --model SequenceClassification --model.backbone proteinfm_struct_token --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
```

For more information, visit: [Model Generator](https://github.com/genbio-ai/modelgenerator)