metadata

tags:
  - Causal Language Modeling
  - GPT2
  - ESM2
  - Proteins
  - GNN
library_name: transformers
pipeline_tag: text-generation
language:
  - en
license: cc-by-nc-4.0

Prot2Text Model Card

Model Information

Model Page: Prot2Text
Paper: https://arxiv.org/abs/2307.14367
Github: https://github.com/hadi-abdine/Prot2Text
Authors: Hadi Abdine⁽¹⁾, Michail Chatzianastasis⁽¹⁾, Costas Bouyioukos⁽²,³⁾, Michalis Vazirgiannis⁽¹⁾
⁽¹⁾DaSciM, LIX, École Polytechnique, Institut Polytechnique de Paris, France.
⁽²⁾Epigenetics and Cell Fate, CNRS UMR7216, Université Paris Cité, Paris, France.
⁽³⁾Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, Nicosia, Cyprus.

The Prot2Text paper was published at AAAI 2024. Preliminary versions of the paper were accepted as spotlights at DGM4H@NeurIPS 2023 and AI4Science@NeurIPS 2023.

@inproceedings{abdine2024prot2text,
  title={Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers},
  author={Abdine, Hadi and Chatzianastasis, Michail and Bouyioukos, Costas and Vazirgiannis, Michalis},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  pages={10757--10765},
  year={2024}
}

Description

Prot2Text is a family of models that predict a protein's function as free text, moving beyond conventional binary or categorical classification. By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs) in an encoder-decoder framework, Prot2Text integrates diverse data types, including protein sequence, structure, and textual annotation and description. This multimodal approach allows for a holistic representation of proteins' functions, enabling the generation of detailed and accurate functional descriptions.

Prot2Text is trained on a multimodal dataset of 256,690 proteins. For each protein, we have three pieces of information: the corresponding sequence, the AlphaFold accession ID, and the textual description. To build this dataset, we used the SwissProt database, the only curated protein knowledge base with full textual descriptions of proteins, included in the UniProtKB (Consortium, 2016) Release 2022_04.

Models and Results

Model	#params	BLEU Score	ROUGE-1	ROUGE-2	ROUGE-L	BERT Score	Link
Prot2Text_SMALL	256M	30.01	45.78	38.08	43.97	82.60	v1.0- v1.1
Prot2Text_BASE	283M	35.11	50.59	42.71	48.49	84.30	v1.0- v1.1
Prot2Text_MEDIUM	398M	36.51	52.13	44.17	50.04	84.83	v1.0- v1.1
Prot2Text_LARGE	898M	36.29	53.68	45.60	51.40	85.20	v1.0- v1.1
Esm2Text_BASE	225M	32.11	47.46	39.18	45.31	83.21	v1.0- v1.1

The reported results were computed using the v1.0 models.

Usage

Below are some code snippets to help you get started quickly with running the model. First, install the Transformers library, Graphein, DSSP, PyTorch, and PyTorch Geometric with:

pip install -U transformers
git clone https://github.com/a-r-j/graphein.git
pip install -e graphein/
pip install torch
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv
sudo apt-get install dssp
sudo ln -s /usr/bin/mkdssp /usr/bin/dssp

You might need to install different versions or variants of these packages depending on your environment.
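
After installing, it can help to confirm that the pieces are visible from Python before loading a model. The following is a minimal sketch and not part of the Prot2Text repository: the helper names `missing_packages`, `missing_binaries`, and `pick_device` are hypothetical, and the import names and the `dssp` binary name are assumed from the install commands above. The `'cpu'` fallback is also an assumption, since this card only mentions `'cuda'` and `'mps'` as values for the `device` argument.

```python
import importlib.util
import shutil

# Import names assumed from the pip commands above (hypothetical list).
REQUIRED_PACKAGES = ["transformers", "graphein", "torch", "torch_geometric"]
# The symlink created above exposes mkdssp under the name dssp.
REQUIRED_BINARIES = ["dssp"]


def missing_packages(names):
    """Return the top-level packages from `names` that cannot be located."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def missing_binaries(names):
    """Return the executables from `names` that are not on PATH."""
    return [n for n in names if shutil.which(n) is None]


def pick_device():
    """Prefer 'cuda', then 'mps', then fall back to 'cpu' (assumed to be accepted)."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older torch builds may not ship the mps backend at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print("missing packages:", missing_packages(REQUIRED_PACKAGES))
    print("missing binaries:", missing_binaries(REQUIRED_BINARIES))
    print("device:", pick_device())
```

Two empty lists mean the install commands took effect. The string returned by `pick_device()` could then be passed as `device=` instead of hard-coding `'cuda'`.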

Then, copy the snippet from the section that is relevant to your use case.

Running Prot2Text to generate a protein's function using both its structure and sequence

To generate a protein's function using both its structure and amino-acid sequence, load one of the Prot2Text models and pass the AlphaFold database ID of the protein.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('habdine/Prot2Text-Base-v1-1', 
                                            trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('habdine/Prot2Text-Base-v1-1', 
                                            trust_remote_code=True)

function = model.generate_protein_description(protein_pdbID='Q10MK9', 
                                              tokenizer=tokenizer, 
                                              device='cuda' # replace with 'mps' to run on a Mac device
                                              )

print(function)
# 'Carboxylate--CoA ligase that may use 4-coumarate as substrate. Follows a two-step reaction mechanism, wherein the carboxylate substrate first undergoes adenylation by ATP, followed by a thioesterification in the presence of CoA to yield the final CoA thioester.'
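
The `protein_pdbID` argument above takes an AlphaFold database ID such as `'Q10MK9'`. AlphaFold DB entries are keyed by UniProt accession, so a quick format check can catch typos before the ID is passed to the model. This is a minimal sketch, assuming the ID is a UniProt accession: the helper `looks_like_uniprot_accession` is hypothetical and not part of the model's API, and the pattern follows the accession format documented by UniProt. It checks only the shape of the ID, not whether a predicted structure exists for it.

```python
import re

# Accession format documented by UniProt (6 or 10 characters).
_UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def looks_like_uniprot_accession(protein_id: str) -> bool:
    """Return True if `protein_id` has the shape of a UniProt accession."""
    return _UNIPROT_ACCESSION.fullmatch(protein_id) is not None


print(looks_like_uniprot_accession("Q10MK9"))  # the ID used in the snippet above
print(looks_like_uniprot_accession("1ABC"))    # a 4-character PDB-style code
```

A 4-character PDB entry code does not match this pattern. That is worth knowing because the parameter name contains `pdbID`, while the value this card passes to it is an AlphaFold database ID.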

Running Esm2Text to generate a protein's function using only its sequence

To generate a protein's function using only its amino-acid sequence, load the Esm2Text-Base model and pass an amino-acid sequence.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('habdine/Esm2Text-Base-v1-1', 
                                            trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained('habdine/Esm2Text-Base-v1-1', 
                                            trust_remote_code=True)

function = model.generate_protein_description(protein_sequence='AEQAERYEEMVEFMEKL', 
                                              tokenizer=tokenizer, 
                                              device='cuda' # replace with 'mps' to run on a Mac device
                                              )

print(function)
# 'A cytochrome b6-f complex catalyzes the calcium-dependent hydrolysis of the 2-acyl groups in 3-sn-phosphoglycerides. Its physiological function is not known.'
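
Since Esm2Text takes a raw amino-acid string, it can be worth normalizing and checking the sequence before calling the model. This is a minimal sketch: `clean_sequence` is a hypothetical helper, not part of the model's API, and restricting input to the 20 standard one-letter residue codes is an assumption, since the model's tokenizer may accept additional symbols.

```python
# The 20 standard one-letter amino-acid codes.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw: str) -> str:
    """Strip whitespace, upper-case, and reject non-standard residue codes."""
    sequence = "".join(raw.split()).upper()
    if not sequence:
        raise ValueError("empty sequence")
    unexpected = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if unexpected:
        raise ValueError(f"unexpected residue codes: {unexpected}")
    return sequence


# Whitespace and lower-case input, as often found in pasted FASTA bodies.
print(clean_sequence(" aeqaeryeemvefmekl\n"))
```

The cleaned string could then be passed as `protein_sequence=` in the snippet above.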

Notice

THE INFORMATION PROVIDED IS THEORETICAL MODELLING ONLY AND CAUTION SHOULD BE EXERCISED IN ITS USE. IT IS PROVIDED "AS-IS" WITHOUT ANY WARRANTY OF ANY KIND, WHETHER EXPRESSED OR IMPLIED. NO WARRANTY IS GIVEN THAT USE OF THE INFORMATION SHALL NOT INFRINGE THE RIGHTS OF ANY THIRD PARTY. THE INFORMATION IS NOT INTENDED TO BE A SUBSTITUTE FOR PROFESSIONAL MEDICAL ADVICE, DIAGNOSIS, OR TREATMENT, AND DOES NOT CONSTITUTE MEDICAL OR OTHER PROFESSIONAL ADVICE.