---
license: cc-by-nc-sa-4.0
library_name: transformers
tags:
- biology
- immunology
- seq2seq
pipeline_tag: text2text-generation
base_model:
- dkarthikeyan1/tcrt5_pre_tcrdb
---
# TCRT5 model (finetuned)
## Model description
TCRT5 is a seq2seq model designed for the conditional generation of T-cell receptor (TCR) sequences given a target peptide-MHC (pMHC). It is a transformers model
built on the [T5 architecture](https://github.com/google-research/text-to-text-transfer-transformer/tree/main/t5) and implemented with the associated
HuggingFace [abstraction](https://huggingface.co/docs/transformers/v4.46.2/en/model_doc/t5#transformers.T5ForConditionalGeneration).
It is released along with [this paper](google.com).
## Intended uses & limitations
This model is designed for auto-regressively generating CDR3 \\(\beta\\) sequences against a pMHC of interest.
This means that the model assumes a plausible pMHC is provided as input. We have not tested the model on peptide and MHC sequences
where the peptide-MHC binding affinity is low, and we do not expect the model to adjust its predictions in that case.
This model is intended for academic purposes and should not be used in a clinical setting.
### How to use
You can use this model directly for conditional CDR3 \\(\beta\\) generation:
```python
import re
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('dkarthikeyan1/tcrt5_ft_tcrdb')
tcrt5 = T5ForConditionalGeneration.from_pretrained("dkarthikeyan1/tcrt5_ft_tcrdb")
pmhc = "[PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]"
encoded_pmhc = tokenizer(pmhc, return_tensors='pt')
# Define the number of TCRs you would like to generate
num_tcrs = 10
# Define the number of beams to explore (recommended: 3x the number of TCRs)
num_beams = 30
outputs = tcrt5.generate(
    **encoded_pmhc,
    max_new_tokens=25,
    num_return_sequences=num_tcrs,
    num_beams=num_beams,
    return_dict_in_generate=True,
)
# Strip bracketed tags such as [TCR] (non-greedy, so only the tags themselves are removed)
decoded = tokenizer.batch_decode(outputs['sequences'], skip_special_tokens=True)
cdr3b_sequences = [re.sub(r'\[.*?\]', '', x) for x in decoded]
>>> cdr3b_sequences
['CASSLGTGGTDTQYF',
'CASSPGTGGTDTQYF',
'CASSLGQGGTEAFF',
'CASSVGTGGTDTQYF',
'CASSLGTGGSYEQYF',
'CASSPGQGGTEAFF',
'CASSSGTGGTDTQYF',
'CASSLGGGGTDTQYF',
'CASSLGGGSYEQYF',
'CASSLGTGGNQPQHF']
```
This model can also be used for unconditional generation of CDR3 \\(\beta\\) sequences:
```python
import re
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('dkarthikeyan1/tcrt5_ft_tcrdb')
tcrt5 = T5ForConditionalGeneration.from_pretrained("dkarthikeyan1/tcrt5_ft_tcrdb")
# Define the number of TCRs you would like to generate
num_tcrs = 10
# Define the number of beams to explore (recommended: 3x the number of TCRs)
num_beams = 30
unconditional_outputs = tcrt5.generate(
    max_new_tokens=25,
    num_return_sequences=num_tcrs,
    num_beams=num_beams,
    return_dict_in_generate=True,
)
# Strip bracketed tags such as [TCR] (non-greedy, so only the tags themselves are removed)
decoded = tokenizer.batch_decode(unconditional_outputs['sequences'], skip_special_tokens=True)
uncond_cdr3b_sequences = [re.sub(r'\[.*?\]', '', x) for x in decoded]
>>> uncond_cdr3b_sequences
['CASSLGGETQYF',
'CASSLGQGNTEAFF',
'CASSLGQGNTGELFF',
'CASSLGTSGTDTQYF',
'CASSLGLAGSYNEQFF',
'CASSLGLAGTDTQYF',
'CASSLGQGYEQYF',
'CASSLGLAGGNTGELFF',
'CASSLGGTGELFF',
'CASSLGQGAYEQYF']
```
**Note:** For conditional generation, we found that model performance was greatest with beam search decoding. However, we also report
a reduction in sequence diversity with this decoding method. If you would like to generate more diverse sequences, TCRT5 supports
a range of alternative decoding strategies, described in the HuggingFace [generation strategies guide](https://huggingface.co/docs/transformers/generation_strategies) and
the [how-to-generate blog post](https://huggingface.co/blog/how-to-generate).
### Limitations and bias
One known bias of TCRT5's predictions is a preference for sequences with high V(D)J recombination probability, as computed by [OLGA](https://github.com/statbiophys/OLGA).
This bias can be attenuated by using alternative decoding methods such as ancestral sampling.
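As a rough illustration of the ancestral sampling option mentioned above, the sketch below assembles keyword arguments for `generate` that turn off beam search and sample from the full predicted distribution. `do_sample`, `num_beams`, `top_k`, `temperature`, `num_return_sequences`, `max_new_tokens`, and `return_dict_in_generate` are standard `transformers` generation parameters; the helper name `ancestral_sampling_kwargs` and its default values are assumptions made for illustration, not settings taken from the paper.

```python
def ancestral_sampling_kwargs(num_tcrs, temperature=1.0, max_new_tokens=25):
    """Build generate() keyword arguments for plain ancestral sampling.

    Hypothetical helper: the name and the defaults are illustrative assumptions.
    """
    if num_tcrs < 1:
        raise ValueError("num_tcrs must be at least 1")
    return {
        "do_sample": True,                 # draw tokens from the predicted distribution
        "num_beams": 1,                    # no beam search
        "top_k": 0,                        # disable top-k filtering (use the full vocabulary)
        "temperature": temperature,        # 1.0 leaves the distribution unchanged
        "num_return_sequences": num_tcrs,
        "max_new_tokens": max_new_tokens,
        "return_dict_in_generate": True,
    }


kwargs = ancestral_sampling_kwargs(num_tcrs=10)
# With the model and inputs from "How to use":
# outputs = tcrt5.generate(**encoded_pmhc, **kwargs)
```

Sampled sequences are not ranked by likelihood the way beam search outputs are, so it may be worth drawing more sequences than needed and deduplicating them afterwards.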
## Training data
TCRT5 was pre-trained on masked span reconstruction of ~14M TCR sequences from [TCRdb](http://bioinfo.life.hust.edu.cn/TCRdb/)
as well as ~780k peptide-pseudosequence pairs taken from [IEDB](https://www.iedb.org/). Finetuning was done using a parallel
corpus of ~330k TCR:peptide-pseudosequence pairs taken from [VDJdb](https://vdjdb.cdr3.net/), [IEDB](https://www.iedb.org/),
[McPAS](https://friedmanlab.weizmann.ac.il/McPAS-TCR/), and semi-synthetic examples from [MIRA](https://pmc.ncbi.nlm.nih.gov/articles/PMC7418738/).
## Training procedure
### Preprocessing
All amino acid sequences and V/J gene names were standardized using the `tidytcells` package (described [here](https://pmc.ncbi.nlm.nih.gov/articles/PMC10634431/)). MHC
allele information was standardized using `mhcgnomes` (available [here](https://pypi.org/project/mhcgnomes/)) before being mapped to the MHC pseudo-sequence
as defined in [NetMHCpan](https://pmc.ncbi.nlm.nih.gov/articles/PMC3319061/).
### Pre-training
TCRT5 was pretrained with masked language modeling (MLM), specifically span reconstruction similar to the original training objective
of the T5 paper. For a given sequence, 15% of the tokens are masked using contiguous spans of random length
between 1 and 3. Each span is replaced by one of the sentinel tokens introduced in the T5 paper. The entire masked sequence is then passed into
the model, which is trained to reconstruct a concatenated sequence composed of the sentinel tokens, each followed by the tokens it masked.
This forces the model to learn richer k-mer dependencies of the masked sequences.
```
Masks 'mlm_probability' tokens grouped into spans of at most 'max_span_length' according to the following algorithm:
* Randomly generate span lengths that add up to round(mlm_probability*seq_len) (ignoring pad tokens) for each sequence.
* Ensure that the spans are not directly adjacent, so that max_span_length is respected.
* Once the span masks are generated according to T5 standards, mask the inputs and generate the targets.
Example Input:
CASSLGQGYEQYF
Masked Input:
CASSLG[X]GY[Y]F
Target:
[X]Q[Y]EQY[Z].
```
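The masking steps above can be sketched in plain Python. This is a minimal illustration of the described procedure, not the authors' training code: the function names `sample_spans` and `apply_span_mask`, the way gaps are distributed between spans, and the `[S0]`, `[S1]`, ... placeholder sentinels are all assumptions. Passing `[X]`, `[Y]`, `[Z]` as sentinels reproduces the example input and target shown above.

```python
import random
from typing import List, Optional, Sequence, Tuple


def sample_spans(seq_len: int, mlm_probability: float = 0.15,
                 max_span_length: int = 3,
                 rng: Optional[random.Random] = None) -> List[Tuple[int, int]]:
    """Sample non-adjacent (start, length) spans covering round(p * seq_len) tokens."""
    if rng is None:
        rng = random.Random()
    remaining = round(mlm_probability * seq_len)
    lengths = []
    while remaining > 0:
        length = rng.randint(1, min(max_span_length, remaining))
        lengths.append(length)
        remaining -= length
    if not lengths:
        return []
    # Unmasked tokens left after reserving one separator between neighbouring spans.
    free = seq_len - sum(lengths) - (len(lengths) - 1)
    if free < 0:
        raise ValueError("sequence too short for the requested masking rate")
    # Split the free tokens at random into len(lengths) + 1 gaps.
    cuts = sorted(rng.randint(0, free) for _ in lengths)
    gaps = [b - a for a, b in zip([0] + cuts, cuts + [free])]
    spans, pos = [], 0
    for i, length in enumerate(lengths):
        pos += gaps[i] + (1 if i > 0 else 0)  # the +1 keeps spans from touching
        spans.append((pos, length))
        pos += length
    return spans


def apply_span_mask(seq: str, spans: Sequence[Tuple[int, int]],
                    sentinels: Optional[Sequence[str]] = None) -> Tuple[str, str]:
    """Replace each span with a sentinel and build the matching target string."""
    if sentinels is None:
        # Placeholder names; a real tokenizer defines its own sentinel tokens.
        sentinels = [f"[S{i}]" for i in range(len(spans) + 1)]
    masked, target, prev_end = [], [], 0
    for i, (start, length) in enumerate(sorted(spans)):
        masked.append(seq[prev_end:start] + sentinels[i])
        target.append(sentinels[i] + seq[start:start + length])
        prev_end = start + length
    masked.append(seq[prev_end:])
    target.append(sentinels[len(spans)])  # a closing sentinel ends the target
    return "".join(masked), "".join(target)


# Reproduce the example above: mask "Q" (index 6) and "EQY" (indices 9-11).
masked, target = apply_span_mask("CASSLGQGYEQYF", [(6, 1), (9, 3)],
                                 sentinels=["[X]", "[Y]", "[Z]"])
print(masked)  # CASSLG[X]GY[Y]F
print(target)  # [X]Q[Y]EQY[Z]

# Random masking at the 15% rate with spans of length 1-3.
random_spans = sample_spans(len("CASSLGQGYEQYF"), rng=random.Random(0))
print(apply_span_mask("CASSLGQGYEQYF", random_spans))
```

Separating span sampling from span application keeps the random part small and makes the deterministic part easy to check against a known input and target pair.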
### Finetuning
TCRT5 was finetuned on peptide-pseudo sequence -> CDR3 \\(\beta\\) source:target pairs using the canonical cross entropy loss.
```
Example Input:
[PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]
Target:
[TCR]CASSLGYNEQFF[EOS].
```
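The source and target strings above follow a fixed template, which can be assembled with a small helper. This is a sketch only: the function names `format_source`, `format_target`, and `strip_tags` are hypothetical, and the alphabetic-only input check is an assumption rather than a documented requirement of the tokenizer.

```python
import re


def format_source(peptide: str, pseudosequence: str) -> str:
    """Build the encoder input: [PMHC]<peptide>[SEP]<MHC pseudo-sequence>[EOS]."""
    for name, seq in (("peptide", peptide), ("pseudosequence", pseudosequence)):
        # Assumed sanity check: inputs are non-empty strings of letters.
        if not seq.isalpha():
            raise ValueError(f"{name} must be a non-empty string of letters")
    return f"[PMHC]{peptide.upper()}[SEP]{pseudosequence.upper()}[EOS]"


def format_target(cdr3b: str) -> str:
    """Build the decoder target: [TCR]<CDR3 beta>[EOS]."""
    return f"[TCR]{cdr3b.upper()}[EOS]"


def strip_tags(decoded: str) -> str:
    """Remove bracketed tags such as [TCR] or [EOS] from a decoded string."""
    return re.sub(r"\[.*?\]", "", decoded)


source = format_source("KLGGALQAK", "YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY")
print(source)
print(format_target("CASSLGYNEQFF"))
print(strip_tags(format_target("CASSLGYNEQFF")))  # CASSLGYNEQFF
```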
## Results
This fine-tuned model was evaluated on conditional CDR3 \\(\beta\\) generation using our validation set of the 20 peptide-MHCs with the most known TCRs (listed in alphabetical order):
1. AVFDRKSDAK_**A*11:01**
2. CRVRLCCYVL_**C*07:02**
3. EAAGIGILTV_**A*02:01**
4. ELAGIGILTV_**A*02:01**
5. GILGFVFTL_**A*02:01**
6. GLCTLVAML_**A*02:01**
7. IVTDFSVIK_**A*11:01**
8. KLGGALQAK_**A*03:01**
9. LLLDRLNQL_**A*02:01**
10. LLWNGPMAV_**A*02:01**
11. LPRRSGAAGA_**B*07:02**
12. LVVDFSQFSR_**A*11:01**
13. NLVPMVATV_**A*02:01**
14. RAKFKQLL_**B*08:01**
15. SPRWYFYYL_**B*07:02**
16. STLPETAAVRR_**A*11:01**
17. TPRVTGGGAM_**B*07:02**
18. TTDPSFLGRY_**A*01:01**
19. YLQPRTFLL_**A*02:01**
20. YVLDHLIVV_**A*02:01**
Benchmark results:
| Metric | Char-BLEU | F@100 | SeqRec% | Diversity (num_seq) | Avg. Jaccard Dissimilarity | Perplexity |
|:------:|:---------:|:-----:|:-------:|:-------------------:|:--------------------------:|:----------:|
| Value | 96.4 | 0.09 | 89.2 | 1300 (2000 max) | 94.4/100 | 2.48 |
### BibTeX entry and citation info
```bibtex
@article{dkarthikeyan2024tcrtranslate,
title={TCR-TRANSLATE: Conditional Generation of Real Antigen Specific T-cell Receptor Sequences},
author={Dhuvarakesh Karthikeyan and Colin Raffel and Benjamin Vincent and Alex Rubinsteyn},
journal={bioRxiv},
year={2024},
}
```