---
license: cc-by-nc-nd-4.0
extra_gated_fields:
  Name: text
  Company: text
  Country: country
  Specific date: date_picker
  I want to use this model for:
    type: select
    options:
    - Research
    - Education
    - label: Other
      value: other
  I agree to share generated sequences and associated data with authors before publishing: checkbox
  I agree not to file patents on any sequences generated by this model: checkbox
  I agree to use this model for non-commercial use ONLY: checkbox
base_model:
- facebook/esm2_t30_150M_UR50D
pipeline_tag: fill-mask
---

# MeMDLM: De Novo Membrane Protein Design with Masked Diffusion Language Models

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65bbea9a26c639b000501321/uWW6xnJZwQFWDS1QZNQTm.png)

Masked Diffusion Language Models (MDLMs), introduced by Sahoo et al. ([arXiv:2406.07524](https://arxiv.org/abs/2406.07524)), equip BERT-style models with strong generative capabilities. In this work, we pre-train and fine-tune ESM-2-150M on the MDLM objective to scaffold functional motifs and to unconditionally generate realistic, high-quality membrane protein sequences.
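
At a high level, the MDLM objective trains a masked LM as the denoiser of an absorbing-state diffusion process: a masking rate is sampled per sequence, tokens are masked at that rate, and the model is trained to recover them with a time-dependent loss weight. The sketch below illustrates one such training step for a Hugging Face masked-LM backbone; `mdlm_training_step` and its exact weighting are illustrative assumptions, not the authors' training code (see the MDLM paper and repo for the precise formulation).

```python
import torch
import torch.nn.functional as F

def mdlm_training_step(model, input_ids, mask_token_id, pad_token_id):
    """One MDLM-style training step (illustrative sketch)."""
    B, L = input_ids.shape
    # Sample a masking rate t ~ U(0, 1] per sequence (the "diffusion time").
    t = torch.rand(B, 1, device=input_ids.device).clamp(min=1e-3)
    # Mask each non-pad token independently with probability t.
    is_masked = (torch.rand(B, L, device=input_ids.device) < t) & (input_ids != pad_token_id)
    noisy_ids = torch.where(is_masked, torch.full_like(input_ids, mask_token_id), input_ids)
    logits = model(input_ids=noisy_ids).logits  # (B, L, vocab)
    # Cross-entropy on masked positions only, reweighted by 1/t -- the MDLM
    # ELBO weighting under a linear noise schedule.
    ce = F.cross_entropy(logits.transpose(1, 2), input_ids, reduction="none")  # (B, L)
    return (ce / t)[is_masked].mean()
```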

## Model Usage

MeMDLM is built around an internal backbone model, a fine-tuned ESM-2 (150M). The backbone can be loaded directly from this repository:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ChatterjeeLab/MeMDLM")
model = AutoModelForMaskedLM.from_pretrained("ChatterjeeLab/MeMDLM")

input_sequence = "QMMALTFITYIGCGLSSIFLSVTLVILIQLCAALLLLNLIFLLDSWIALYnTRGFCIAVAVFLHYFLLVSFTWMGLEAFHMYLKFCIVGWGIPAVVVSIVLTISPDNYGidFCWINSNVVFYITVVGYFCVIFLLNVSMFIVVLVQLCRIKKKKQLGDL"

inputs = tokenizer(input_sequence, return_tensors="pt")
logits = model(**inputs).logits  # (batch, seq_len, vocab_size)

# Take the most likely residue at each position; masked positions are
# replaced by the model's predictions.
predicted_ids = logits.argmax(dim=-1)
# ESM's character-level tokenizer decodes with spaces between residues, so strip them.
filled_protein_seq = tokenizer.decode(predicted_ids.squeeze(), skip_special_tokens=True).replace(" ", "")
```
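
Because the model exposes a fill-mask head, specific residues can be redesigned by replacing them with the tokenizer's mask token before encoding. The masked span below (positions 10-14) is hypothetical, chosen only for illustration:

```python
# Redesign residues 10-14 by masking them (hypothetical positions; pick
# the region relevant to your own design task).
masked_sequence = input_sequence[:10] + tokenizer.mask_token * 5 + input_sequence[15:]

inputs = tokenizer(masked_sequence, return_tensors="pt")
logits = model(**inputs).logits
predicted_ids = logits.argmax(dim=-1)
print(tokenizer.decode(predicted_ids.squeeze(), skip_special_tokens=True).replace(" ", ""))
```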

This backbone model can also be integrated with the [MDLM formulation](https://github.com/kuleshov-group/mdlm) by setting the model backbone type to `"hf_dit"` and the Hugging Face model ID to `"ChatterjeeLab/MeMDLM"`.
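
For example, with the MDLM repository's Hydra-based CLI, an invocation along the following lines should work; the flag names are taken from that repository's README and may change, so verify against the current version:

```bash
# Hypothetical invocation of the MDLM repo's entry point with this backbone.
python main.py \
  mode=sample_eval \
  backbone=hf_dit \
  eval.checkpoint_path=ChatterjeeLab/MeMDLM
```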