---
license: cc-by-nc-nd-4.0
extra_gated_fields:
  Name: text
  Company: text
  Country: country
  Specific date: date_picker
  I want to use this model for:
    type: select
    options:
      - Research
      - Education
      - label: Other
        value: other
  I agree to share generated sequences and associated data with authors before publishing: checkbox
  I agree not to file patents on any sequences generated by this model: checkbox
  I agree to use this model for non-commercial use ONLY: checkbox
base_model:
  - facebook/esm2_t30_150M_UR50D
pipeline_tag: fill-mask
---

# MeMDLM: De Novo Membrane Protein Design with Masked Diffusion Language Models


Masked Diffusion Language Models (MDLMs), introduced by Sahoo et al. (arxiv.org/pdf/2406.07524), equip BERT-style models with strong generative capabilities. In this work, we pre-train and fine-tune ESM-2-150M on the MDLM objective to scaffold functional motifs and to unconditionally generate realistic, high-quality membrane protein sequences.
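To give a sense of how the MDLM objective turns a masked LM into a generator, the sketch below runs a simplified iterative-unmasking loop: start from a fully masked sequence and commit a subset of positions at each step. This is a conceptual illustration, not the authors' released sampler; the sequence length, step count, and random unmasking schedule are placeholder assumptions.

```python
# Minimal sketch of MDLM-style generation by iterative unmasking.
# NOTE: illustrative only -- not the authors' released sampler; the
# length, number of steps, and unmasking schedule are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ChatterjeeLab/MeMDLM")
model = AutoModelForMaskedLM.from_pretrained("ChatterjeeLab/MeMDLM")
model.eval()

seq_len, num_steps = 100, 10
mask_id = tokenizer.mask_token_id

# start from a fully masked sequence
ids = torch.full((1, seq_len), mask_id, dtype=torch.long)

for step in range(num_steps):
    with torch.no_grad():
        logits = model(input_ids=ids).logits
    probs = torch.softmax(logits, dim=-1)

    rows, cols = (ids == mask_id).nonzero(as_tuple=True)
    if rows.numel() == 0:
        break

    # commit a random subset of the still-masked positions this step
    n_unmask = max(1, rows.numel() // (num_steps - step))
    pick = torch.randperm(rows.numel())[:n_unmask]
    r, c = rows[pick], cols[pick]
    ids[r, c] = torch.multinomial(probs[r, c], num_samples=1).squeeze(-1)

# drop any whitespace the tokenizer inserts between residues
generated_seq = tokenizer.decode(ids.squeeze(), skip_special_tokens=True).replace(" ", "")
print(generated_seq)
```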

## Model Usage

MeMDLM leverages an internal backbone model, a fine-tuned ESM-2 (150M). The backbone can be used directly through this repository:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ChatterjeeLab/MeMDLM")
model = AutoModelForMaskedLM.from_pretrained("ChatterjeeLab/MeMDLM")

input_sequence = "QMMALTFITYIGCGLSSIFLSVTLVILIQLCAALLLLNLIFLLDSWIALYnTRGFCIAVAVFLHYFLLVSFTWMGLEAFHMYLKFCIVGWGIPAVVVSIVLTISPDNYGidFCWINSNVVFYITVVGYFCVIFLLNVSMFIVVLVQLCRIKKKKQLGDL"

inputs = tokenizer(input_sequence, return_tensors="pt")
outputs = model(**inputs)

# the model returns logits; take the most likely residue at each position
predicted_ids = outputs.logits.argmax(dim=-1)

# contains the output protein sequence with filled mask tokens
filled_protein_seq = tokenizer.decode(predicted_ids.squeeze(), skip_special_tokens=True)
```
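
To scaffold around a motif rather than refine a full sequence, you can mask the positions to be designed with the tokenizer's mask token. The sketch below is a hypothetical usage pattern (the motif and the number of masked positions are arbitrary choices, not from the MeMDLM paper), reusing the `tokenizer` and `model` loaded above:

```python
import torch

# Hypothetical motif-scaffolding pattern: keep a fixed motif and mask the
# region to be designed. The motif and masked-region length are arbitrary.
motif = "QMMALTFITYIGCGLSSIFLSVTLVILIQLC"
masked_region = tokenizer.mask_token * 20  # 20 positions to design
inputs = tokenizer(motif + masked_region, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = logits.argmax(dim=-1)

# only replace the masked positions; keep the motif untouched
input_ids = inputs["input_ids"]
filled_ids = torch.where(input_ids == tokenizer.mask_token_id, predicted_ids, input_ids)
designed_seq = tokenizer.decode(filled_ids.squeeze(), skip_special_tokens=True).replace(" ", "")
```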

This backbone model can be integrated with the MDLM formulation by setting the model backbone type to `hf_dit` and the Hugging Face model ID to `ChatterjeeLab/MeMDLM`.
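
As a concrete (hypothetical) example, with the reference MDLM codebase from Sahoo et al. (github.com/kuleshov-group/mdlm), the integration might look like the Hydra-style overrides below; the exact option names are assumptions and should be checked against the version of the repository you use:

```bash
# Hypothetical MDLM-codebase invocation; override names are assumptions.
python main.py \
  mode=sample_eval \
  backbone=hf_dit \
  eval.checkpoint_path=ChatterjeeLab/MeMDLM
```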