---
license: cc-by-nc-sa-4.0
library_name: transformers
tags:
- biology
- immunology
- seq2seq
pipeline_tag: text2text-generation
base_model:
- dkarthikeyan1/tcrt5_pre_tcrdb
---

# TCRT5 model (finetuned)


## Model description

TCRT5 is a seq2seq model designed for the conditional generation of T-cell receptor (TCR) sequences given a target peptide-MHC (pMHC). It is a transformers model that 
is built on the [T5 architecture](https://github.com/google-research/text-to-text-transfer-transformer/tree/main/t5) operationalized by the associated 
HuggingFace [abstraction](https://huggingface.co/docs/transformers/v4.46.2/en/model_doc/t5#transformers.T5ForConditionalGeneration). 
It is released along with [this paper](google.com). 

## Intended uses & limitations

This model is designed for auto-regressively generating CDR3 \\(\beta\\) sequences against a pMHC of interest. 
This means that the model assumes a plausible pMHC is provided as input. We have not tested the model on peptide and MHC sequences
where the peptide-MHC binding affinity is low, and we do not expect the model to adjust its predictions in that case.
This model is intended for academic purposes and should not be used in a clinical setting. 

### How to use

You can use this model directly for conditional CDR3 \\(\beta\\) generation:

```python
import re
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('dkarthikeyan1/tcrt5_ft_tcrdb')
tcrt5 = T5ForConditionalGeneration.from_pretrained("dkarthikeyan1/tcrt5_ft_tcrdb")
pmhc = "[PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]"
encoded_pmhc = tokenizer(pmhc, return_tensors='pt')

# Define the number of TCRs you would like to generate
num_tcrs = 10
# Define the number of beams to explore (recommended: 3x the number of TCRs)
num_beams = 30

outputs = tcrt5.generate(
    **encoded_pmhc,
    max_new_tokens=25,
    num_return_sequences=num_tcrs,
    num_beams=num_beams,
    return_dict_in_generate=True,
)

# Strip the bracketed [TCR] tag from each decoded sequence
decoded = tokenizer.batch_decode(outputs['sequences'], skip_special_tokens=True)
cdr3b_sequences = [re.sub(r'\[.*\]', '', x) for x in decoded]

>>> cdr3b_sequences

['CASSLGTGGTDTQYF',
 'CASSPGTGGTDTQYF',
 'CASSLGQGGTEAFF',
 'CASSVGTGGTDTQYF',
 'CASSLGTGGSYEQYF',
 'CASSPGQGGTEAFF',
 'CASSSGTGGTDTQYF',
 'CASSLGGGGTDTQYF',
 'CASSLGGGSYEQYF',
 'CASSLGTGGNQPQHF']
```

This model can also be used for unconditional generation of CDR3 \\(\beta\\) sequences:

```python
import re
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('dkarthikeyan1/tcrt5_ft_tcrdb')
tcrt5 = T5ForConditionalGeneration.from_pretrained("dkarthikeyan1/tcrt5_ft_tcrdb")


# Define the number of TCRs you would like to generate
num_tcrs = 10
# Define the number of beams to explore (recommended: 3x the number of TCRs)
num_beams = 30

# No encoder input is passed, so generation is unconditional
unconditional_outputs = tcrt5.generate(
    max_new_tokens=25,
    num_return_sequences=num_tcrs,
    num_beams=num_beams,
    return_dict_in_generate=True,
)

# Strip the bracketed [TCR] tag from each decoded sequence
decoded = tokenizer.batch_decode(unconditional_outputs['sequences'], skip_special_tokens=True)
uncond_cdr3b_sequences = [re.sub(r'\[.*\]', '', x) for x in decoded]

>>> uncond_cdr3b_sequences

['CASSLGGETQYF',
 'CASSLGQGNTEAFF',
 'CASSLGQGNTGELFF',
 'CASSLGTSGTDTQYF',
 'CASSLGLAGSYNEQFF',
 'CASSLGLAGTDTQYF',
 'CASSLGQGYEQYF',
 'CASSLGLAGGNTGELFF',
 'CASSLGGTGELFF',
 'CASSLGQGAYEQYF']
```

**Note:** For conditional generation, we found that model performance was greatest with beam search decoding. However, we also report
a reduction in sequence diversity with this decoding method. If you would like to generate more diverse sequences, TCRT5 supports
a range of alternative decoding strategies, which are described [here](https://huggingface.co/docs/transformers/generation_strategies) and
[here](https://huggingface.co/blog/how-to-generate).

### Limitations and bias

One known bias of TCRT5's predictions is a preference for sequences with a high V(D)J recombination probability, as computed by [OLGA](https://github.com/statbiophys/OLGA). 
This bias can be attenuated by using alternative decoding methods such as ancestral sampling.
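
As a rough sketch of what ancestral sampling could look like with the `generate` API, the snippet below replaces beam search with multinomial sampling. The helper names `strip_tags` and `sample_cdr3b`, the default temperature, and the choice of `top_k=0` and `num_beams=1` (to sample from the full distribution without beams) are illustrative assumptions, not the settings used for the reported results.

```python
import re

TAG_PATTERN = re.compile(r"\[[^\]]*\]")  # matches bracketed tags such as [TCR]


def strip_tags(decoded):
    """Remove bracketed tags and surrounding whitespace from a decoded string."""
    return TAG_PATTERN.sub("", decoded).strip()


def sample_cdr3b(pmhc, num_tcrs=10, temperature=1.0):
    """Hypothetical helper: draw CDR3 beta sequences by ancestral sampling."""
    # Imported lazily so strip_tags can be used without transformers installed.
    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tokenizer = T5Tokenizer.from_pretrained("dkarthikeyan1/tcrt5_ft_tcrdb")
    tcrt5 = T5ForConditionalGeneration.from_pretrained("dkarthikeyan1/tcrt5_ft_tcrdb")
    encoded_pmhc = tokenizer(pmhc, return_tensors="pt")
    outputs = tcrt5.generate(
        **encoded_pmhc,
        max_new_tokens=25,
        do_sample=True,           # sample from the model instead of beam search
        num_beams=1,              # assumption: no beams, plain sampling
        top_k=0,                  # assumption: disable top-k filtering
        temperature=temperature,  # assumption: 1.0 leaves the distribution unchanged
        num_return_sequences=num_tcrs,
    )
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Sampling can return duplicates; keep unique sequences in first-seen order.
    return list(dict.fromkeys(strip_tags(x) for x in decoded))
```

Calling `sample_cdr3b("[PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]")` downloads the model and returns up to `num_tcrs` unique sequences. Because sampling is stochastic, the output differs between runs unless a seed is fixed.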

## Training data

TCRT5 was pre-trained on masked span reconstruction of ~14M TCR sequences from [TCRdb](http://bioinfo.life.hust.edu.cn/TCRdb/) 
as well as ~780k peptide-pseudosequence pairs taken from [IEDB](https://www.iedb.org/). Finetuning was done using a parallel
corpus of ~330k TCR:peptide-pseudosequence pairs taken from [VDJdb](https://vdjdb.cdr3.net/), [IEDB](https://www.iedb.org/), 
[McPAS](https://friedmanlab.weizmann.ac.il/McPAS-TCR/), and semi-synthetic examples from [MIRA](https://pmc.ncbi.nlm.nih.gov/articles/PMC7418738/).

## Training procedure

### Preprocessing

All amino acid sequences and V/J gene names were standardized using the [`tidytcells`](https://pmc.ncbi.nlm.nih.gov/articles/PMC10634431/) package. MHC 
allele information was standardized using [`mhcgnomes`](https://pypi.org/project/mhcgnomes/) before being mapped to the MHC pseudo-sequence
as defined in [NetMHCpan](https://pmc.ncbi.nlm.nih.gov/articles/PMC3319061/).

### Pre-training

TCRT5 was pretrained with masked language modeling (MLM), specifically span reconstruction similar to the original training objective 
of the T5 paper. For a given sequence, 15% of the tokens are masked using contiguous spans of random length
between 1 and 3. Each span is replaced by one of the sentinel tokens introduced in the T5 paper. The entire masked sequence is then passed into
the model, which is trained to reconstruct a concatenated sequence consisting of the sentinel tokens, each followed by the tokens it replaced.
This forces the model to learn richer k-mer dependencies of the masked sequences.

```
Masks 'mlm_probability' tokens grouped into spans of size up to 'max_span_length' according to the following algorithm:
        * Randomly generate span lengths that add up to round(mlm_probability*seq_len) (ignoring pad tokens) for each sequence.
        * Ensure that the spans are not directly adjacent so that max_span_length is respected.
        * Once the span masks are generated according to T5 standards, mask the inputs and generate the targets.
    
    
    Example Input:
    
    CASSLGQGYEQYF
    
    Masked Input:
    
    CASSLG[X]GY[Y]F
    
    Target:
    
    [X]Q[Y]EQY[Z].

```
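
The masking procedure above can be sketched in plain Python. This is a minimal illustration rather than the training code: the function names `sample_spans` and `apply_span_mask` are hypothetical, the sentinels `[X]`, `[Y]`, `[Z]` are the placeholder names from the example (the real sentinel tokens may differ), and the way spans are placed along the sequence is an assumption.

```python
import random

# Placeholder sentinel names taken from the example; the real tokens may differ.
SENTINELS = ("[X]", "[Y]", "[Z]")


def sample_spans(seq_len, mlm_probability=0.15, max_span_length=3, rng=random):
    """Sample non-adjacent half-open (start, end) spans covering the mask budget."""
    budget = round(mlm_probability * seq_len)
    lengths = []
    while budget > 0:
        length = rng.randint(1, min(max_span_length, budget))
        lengths.append(length)
        budget -= length
    # Positions left over after the spans and one mandatory gap between neighbours.
    slack = seq_len - sum(lengths) - max(len(lengths) - 1, 0)
    if slack < 0:
        raise ValueError("sequence too short for the requested masking")
    cuts = sorted(rng.randint(0, slack) for _ in lengths)
    spans, cursor, prev_cut = [], 0, 0
    for i, (length, cut) in enumerate(zip(lengths, cuts)):
        # Extra random gap, plus one unmasked residue separating adjacent spans.
        cursor += (cut - prev_cut) + (1 if i > 0 else 0)
        spans.append((cursor, cursor + length))
        cursor += length
        prev_cut = cut
    return spans


def apply_span_mask(seq, spans, sentinels=SENTINELS):
    """Replace each span with a sentinel; return (masked_input, target)."""
    if len(spans) + 1 > len(sentinels):
        raise ValueError("not enough sentinel tokens for the number of spans")
    masked, target, cursor = [], [], 0
    for sentinel, (start, end) in zip(sentinels, spans):
        masked.append(seq[cursor:start] + sentinel)
        target.append(sentinel + seq[start:end])
        cursor = end
    masked.append(seq[cursor:])
    target.append(sentinels[len(spans)])  # closing sentinel ends the target
    return "".join(masked), "".join(target)


# Reproduces the example above with explicitly chosen spans.
masked_input, target = apply_span_mask("CASSLGQGYEQYF", [(6, 7), (9, 12)])
```

Here `masked_input` is `CASSLG[X]GY[Y]F` and `target` is `[X]Q[Y]EQY[Z]`. Passing the result of `sample_spans(len(seq))` instead of fixed spans gives a randomly masked training pair.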

### Finetuning

TCRT5 was finetuned on peptide-pseudosequence -> CDR3 \\(\beta\\) source:target pairs using the standard cross-entropy loss.


``` 
    Example Input:
    
    [PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]
    
    
    Target:
 
    [TCR]CASSLGYNEQFF[EOS].

```
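
For building inputs in this format programmatically, a small helper along the following lines may be useful. The function names `format_pmhc` and `format_tcr` are hypothetical, and the validation rules (20 standard amino acids, a 34-residue NetMHCpan-style pseudo-sequence) are assumptions inferred from the example rather than checks performed by the model or tokenizer.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
PSEUDOSEQUENCE_LENGTH = 34  # assumption: NetMHCpan-style pseudo-sequence length


def _check_residues(name, sequence):
    """Raise ValueError if the sequence is empty or has non-standard residues."""
    unknown = set(sequence) - AMINO_ACIDS
    if not sequence or unknown:
        raise ValueError(
            f"{name} {sequence!r} is empty or has non-standard residues: {sorted(unknown)}"
        )


def format_pmhc(peptide, pseudosequence):
    """Build the source string: [PMHC]<peptide>[SEP]<pseudo-sequence>[EOS]."""
    _check_residues("peptide", peptide)
    _check_residues("pseudosequence", pseudosequence)
    if len(pseudosequence) != PSEUDOSEQUENCE_LENGTH:
        raise ValueError(
            f"expected a {PSEUDOSEQUENCE_LENGTH}-residue pseudo-sequence, "
            f"got {len(pseudosequence)}"
        )
    return f"[PMHC]{peptide}[SEP]{pseudosequence}[EOS]"


def format_tcr(cdr3b):
    """Build the target string: [TCR]<CDR3 beta>[EOS]."""
    _check_residues("cdr3b", cdr3b)
    return f"[TCR]{cdr3b}[EOS]"


source = format_pmhc("KLGGALQAK", "YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY")
target = format_tcr("CASSLGYNEQFF")
```

The resulting `source` and `target` strings match the example pair shown above, and `source` can be passed directly to the tokenizer.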

## Results

This fine-tuned model was evaluated on conditional CDR3 \\(\beta\\) generation using our validation set, which consists of the 20 peptide-MHCs with the most known TCRs (listed in alphabetical order):

1. AVFDRKSDAK_**A*11:01**
2. CRVRLCCYVL_**C*07:02**
3. EAAGIGILTV_**A*02:01**
4. ELAGIGILTV_**A*02:01**
5. GILGFVFTL_**A*02:01**
6. GLCTLVAML_**A*02:01**
7. IVTDFSVIK_**A*11:01**
8. KLGGALQAK_**A*03:01**
9. LLLDRLNQL_**A*02:01**
10. LLWNGPMAV_**A*02:01**
11. LPRRSGAAGA_**B*07:02**
12. LVVDFSQFSR_**A*11:01**
13. NLVPMVATV_**A*02:01**
14. RAKFKQLL_**B*08:01**
15. SPRWYFYYL_**B*07:02**
16. STLPETAAVRR_**A*11:01**
17. TPRVTGGGAM_**B*07:02**
18. TTDPSFLGRY_**A*01:01**
19. YLQPRTFLL_**A*02:01**
20. YVLDHLIVV_**A*02:01**

Benchmark results:

| Metric | Char-BLEU | F@100 | SeqRec% | Diversity (num_seq) | Avg. Jaccard Dissimilarity | Perplexity |
|:------:|:---------:|:-----:|:-------:|:-------------------:|:--------------------------:|:----------:|
| Value  |    96.4   |  0.09 |   89.2  |   1300 (2000 max)   |          94.4/100          |    2.48    |
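
The exact metric definitions are not given in this card. As one plausible reading, offered purely as an assumption, diversity could be the number of unique sequences across all generations (2000 would then correspond to 100 sequences for each of the 20 pMHCs), and Jaccard dissimilarity could compare the sets of sequences generated for two different pMHCs. The function names and the toy data below are hypothetical.

```python
from itertools import combinations


def diversity(generations):
    """Count unique sequences across all pMHCs (assumed definition)."""
    return len({seq for seqs in generations.values() for seq in seqs})


def jaccard_dissimilarity(a, b):
    """Return 1 - |A & B| / |A | B| for two collections of sequences."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_jaccard_dissimilarity(generations):
    """Average the pairwise dissimilarity over all pairs of pMHCs (assumed)."""
    pairs = list(combinations(generations.values(), 2))
    if not pairs:
        raise ValueError("need generations for at least two pMHCs")
    return sum(jaccard_dissimilarity(a, b) for a, b in pairs) / len(pairs)


# Toy data for illustration only; these are not model outputs.
toy = {
    "pmhc_1": ["CASSLGQGYEQYF", "CASSLGGETQYF"],
    "pmhc_2": ["CASSLGGETQYF", "CASSLGGTGELFF"],
    "pmhc_3": ["CASSLGLAGTDTQYF"],
}
num_unique = diversity(toy)
avg_dissimilarity = average_jaccard_dissimilarity(toy)
```

Under these assumed definitions, a higher average dissimilarity means the model produces more distinct sets of sequences for different pMHCs.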

### BibTeX entry and citation info

```bibtex
@article{dkarthikeyan2024tcrtranslate,
  title={TCR-TRANSLATE: Conditional Generation of Real Antigen Specific T-cell Receptor Sequences},
  author={Dhuvarakesh Karthikeyan and Colin Raffel and Benjamin Vincent and Alex Rubinsteyn},
  journal={bioRxiv},
  year={2024},
}
```