IgBert unpaired model

Model pretrained on protein and antibody sequences using a masked language modeling (MLM) objective. It was introduced in the paper Large scale paired antibody language models.

The model is finetuned from ProtBert-BFD using unpaired antibody sequences from the Observed Antibody Space.

Use

The model and tokeniser can be loaded using the transformers library

from transformers import BertModel, BertTokenizer

tokeniser = BertTokenizer.from_pretrained("Exscientia/IgBert_unpaired", do_lower_case=False)
model = BertModel.from_pretrained("Exscientia/IgBert_unpaired", add_pooling_layer=False)

The tokeniser is used to prepare batch inputs

# single chain sequences
sequences = [
    "EVVMTQSPASLSVSPGERATLSCRARASLGISTDLAWYQQRPGQAPRLLIYGASTRATGIPARFSGSGSGTEFTLTISSLQSEDSAVYYCQQYSNWPLTFGGGTKVEIK",
    "ALTQPASVSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLMIYDVSKRPSGVSNRFSGSKSGNTASLTISGLQSEDEADYYCNSLTSISTWVFGGGTKLTVL"
]

# The tokeniser expects input of the form ["E V V M...", "A L T Q..."]
sequences = [' '.join(sequence) for sequence in sequences] 

tokens = tokeniser.batch_encode_plus(
    sequences, 
    add_special_tokens=True, 
    pad_to_max_length=True, 
    return_tensors="pt",
    return_special_tokens_mask=True
) 

Note that the tokeniser adds a [CLS] token at the beginning of each sequence, a [SEP] token at the end of each sequence and pads using the [PAD] token. For example a batch containing sequences E V V M, A L will be tokenised to [CLS] E V V M [SEP] and [CLS] A L [SEP] [PAD] [PAD].

Sequence embeddings are generated by feeding tokens through the model

output = model(
    input_ids=tokens['input_ids'], 
    attention_mask=tokens['attention_mask']
)

residue_embeddings = output.last_hidden_state

To obtain a sequence representation, the residue tokens can be averaged over like so

import torch

# mask special tokens before summing over embeddings
residue_embeddings[tokens["special_tokens_mask"] == 1] = 0
sequence_embeddings_sum = residue_embeddings.sum(1)

# average embedding by dividing sum by sequence lengths
sequence_lengths = torch.sum(tokens["special_tokens_mask"] == 0, dim=1)
sequence_embeddings = sequence_embeddings_sum / sequence_lengths.unsqueeze(1)

For sequence level fine-tuning the model can be loaded with a pooling head by setting add_pooling_layer=True and using output.pooler_output in the down-stream task.

Downloads last month
502
Safetensors
Model size
420M params
Tensor type
F32
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for Exscientia/IgBert_unpaired

Finetuned
(2)
this model
Finetunes
1 model