---
tags:
- biology
- ibm
- mammal
- pytorch
- transformers
library_name: mammal
license: apache-2.0
---

## Model Summary

- **Developers:** IBM Research
- **GitHub Repository:** [TBD](TBD)
- **Paper:** [TBD](https://arxiv.org/abs/TBD)
- **Release Date:** Oct ?th, 2024
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Usage

Using `MAMMAL` requires installing the [TBD](https://github.com/TBD) package:

```
pip install TBD
```
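
After installing, a quick import check confirms that the package and its `fuse` dependency are available. This is a minimal sanity-check sketch, not part of the official API:

```python
# Sanity check: these are the same imports used in the example below.
from fuse.data.tokenizers.modular_tokenizer.op import ModularTokenizerOp
from mammal.model import Mammal

print("MAMMAL imports OK")
```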

A simple example, querying the model for the binding-affinity class of two proteins (calmodulin and calcineurin):

```python
import torch
from fuse.data.tokenizers.modular_tokenizer.op import ModularTokenizerOp

from mammal.model import Mammal
from mammal.keys import (
    CLS_PRED,
    ENCODER_INPUTS_ATTENTION_MASK,
    ENCODER_INPUTS_STR,
    ENCODER_INPUTS_TOKENS,
)

# Load Model
model = Mammal.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-400m")

# Load Tokenizer
tokenizer_op = ModularTokenizerOp.from_pretrained("ibm/biomed.omics.bl.sm.ma-ted-400m")

# Prepare Input Prompt
protein_calmodulin = "MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMISELDQDGFIDKEDLHDGDGKISFEEFLNLVNKEMTADVDGDGQVNYEEFVTMMTSK"
protein_calcineurin = "MSSKLLLAGLDIERVLAEKNFYKEWDTWIIEAMNVGDEEVDRIKEFKEDEIFEEAKTLGTAEMQEYKKQKLEEAIEGAFDIFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIRQMWDQNGDWDRIKELKFGEIKKLSAKDTRGTIFIKVFENLGTGVDSEYEDVSKYMLKHQ"

# Create and load sample
sample_dict = dict()

# Formatting prompt to match pre-training syntax
sample_dict[ENCODER_INPUTS_STR] = f"<@TOKENIZER-TYPE=AA><BINDING_AFFINITY_CLASS><SENTINEL_ID_0><MOLECULAR_ENTITY><MOLECULAR_ENTITY_GENERAL_PROTEIN><SEQUENCE_NATURAL_START>{protein_calmodulin}<SEQUENCE_NATURAL_END><MOLECULAR_ENTITY><MOLECULAR_ENTITY_GENERAL_PROTEIN><SEQUENCE_NATURAL_START>{protein_calcineurin}<SEQUENCE_NATURAL_END><EOS>"
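# A rough reading of the prompt components (see the repository for the authoritative syntax):
#   <@TOKENIZER-TYPE=AA>                    - selects the amino-acid sub-tokenizer of the modular tokenizer
#   <BINDING_AFFINITY_CLASS><SENTINEL_ID_0> - the task token plus a sentinel to be filled with the predicted class
#   <MOLECULAR_ENTITY>...<SEQUENCE_NATURAL_START>{seq}<SEQUENCE_NATURAL_END>
#                                           - wraps each protein sequence as a molecular entity
#   <EOS>                                   - closes the encoder input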

# Tokenize
tokenizer_op(
    sample_dict=sample_dict,
    key_in=ENCODER_INPUTS_STR,
    key_out_tokens_ids=ENCODER_INPUTS_TOKENS,
    key_out_attention_mask=ENCODER_INPUTS_ATTENTION_MASK,
)
sample_dict[ENCODER_INPUTS_TOKENS] = torch.tensor(sample_dict[ENCODER_INPUTS_TOKENS])
sample_dict[ENCODER_INPUTS_ATTENTION_MASK] = torch.tensor(sample_dict[ENCODER_INPUTS_ATTENTION_MASK])

# Generate Prediction
batch_dict = model.generate(
    [sample_dict],
    output_scores=True,
    return_dict_in_generate=True,
    max_new_tokens=5,
)
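
# `batch_dict` is dictionary-like: with `return_dict_in_generate=True` it holds the generated
# token ids (decoded below via CLS_PRED) and, since `output_scores=True` was requested,
# the per-step generation scores as well.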

# Get output
generated_output = tokenizer_op._tokenizer.decode(batch_dict[CLS_PRED][0])
print(f"{generated_output=}")
```
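
The same steps can be wrapped into a small helper for scoring additional protein pairs. This is only a convenience sketch assembled from the calls shown above; the function name `predict_binding_class` and its return value are illustrative, not part of the MAMMAL API:

```python
def predict_binding_class(model, tokenizer_op, protein_a: str, protein_b: str) -> str:
    """Run the binding-affinity prompt from the example above for one protein pair."""
    sample_dict = dict()
    sample_dict[ENCODER_INPUTS_STR] = (
        "<@TOKENIZER-TYPE=AA><BINDING_AFFINITY_CLASS><SENTINEL_ID_0>"
        "<MOLECULAR_ENTITY><MOLECULAR_ENTITY_GENERAL_PROTEIN>"
        f"<SEQUENCE_NATURAL_START>{protein_a}<SEQUENCE_NATURAL_END>"
        "<MOLECULAR_ENTITY><MOLECULAR_ENTITY_GENERAL_PROTEIN>"
        f"<SEQUENCE_NATURAL_START>{protein_b}<SEQUENCE_NATURAL_END><EOS>"
    )

    # Tokenize the prompt in place, then convert the outputs to tensors
    tokenizer_op(
        sample_dict=sample_dict,
        key_in=ENCODER_INPUTS_STR,
        key_out_tokens_ids=ENCODER_INPUTS_TOKENS,
        key_out_attention_mask=ENCODER_INPUTS_ATTENTION_MASK,
    )
    sample_dict[ENCODER_INPUTS_TOKENS] = torch.tensor(sample_dict[ENCODER_INPUTS_TOKENS])
    sample_dict[ENCODER_INPUTS_ATTENTION_MASK] = torch.tensor(sample_dict[ENCODER_INPUTS_ATTENTION_MASK])

    # Generate and decode the predicted class tokens
    batch_dict = model.generate(
        [sample_dict],
        output_scores=True,
        return_dict_in_generate=True,
        max_new_tokens=5,
    )
    return tokenizer_op._tokenizer.decode(batch_dict[CLS_PRED][0])


# Example usage with the sequences defined above:
# print(predict_binding_class(model, tokenizer_op, protein_calmodulin, protein_calcineurin))
```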

For more advanced usage, see our detailed example at: <LINK>

## Citation

If you find our work useful, please consider giving the repository a star and citing our paper:

```
@article{TBD,
  title={TBD},
  author={IBM Research Team},
  journal={arXiv preprint arXiv:TBD},
  year={2024}
}
```