|
--- |
|
license: cc-by-nc-sa-4.0 |
|
library_name: transformers |
|
tags: |
|
- biology |
|
- immunology |
|
- seq2seq |
|
pipeline_tag: text2text-generation |
|
base_model: |
|
- dkarthikeyan1/tcrt5_pre_tcrdb |
|
--- |
|
|
|
# TCRT5 model (finetuned) |
|
|
|
|
|
## Model description |
|
|
|
TCRT5 is a seq2seq model designed for the conditional generation of T-cell receptor (TCR) sequences given a target peptide-MHC (pMHC). It is a transformers model built on the [T5 architecture](https://github.com/google-research/text-to-text-transfer-transformer/tree/main/t5) and operationalized by the associated HuggingFace [abstraction](https://huggingface.co/docs/transformers/v4.46.2/en/model_doc/t5#transformers.T5ForConditionalGeneration). It is released along with [this paper](google.com).
|
|
|
## Intended uses & limitations |
|
|
|
This model is designed for auto-regressively generating CDR3 \\(\beta\\) sequences against a pMHC of interest. This means that the model assumes a plausible pMHC is provided as input. We have not tested the model on peptide-MHC pairs where the binding affinity between the peptide and MHC is low, and we do not expect the model to adjust its predictions to account for this. This model is intended for academic purposes and should not be used in a clinical setting.
|
|
|
### How to use |
|
|
|
You can use this model directly for conditional CDR3 \\(\beta\\) generation: |
|
|
|
```python |
|
import re |
|
from transformers import T5Tokenizer, T5ForConditionalGeneration |
|
tokenizer = T5Tokenizer.from_pretrained('dkarthikeyan1/tcrt5_ft_tcrdb') |
|
tcrt5 = T5ForConditionalGeneration.from_pretrained("dkarthikeyan1/tcrt5_ft_tcrdb") |
|
pmhc = "[PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]" |
|
encoded_pmhc = tokenizer(pmhc, return_tensors='pt') |
|
|
|
# Define the number of TCRs you would like to generate
|
num_tcrs = 10 |
|
# Define the number of beams to explore (recommended: 3x the number of TCRs) |
|
num_beams = 30 |
|
|
|
outputs = tcrt5.generate(**encoded_pmhc, max_new_tokens=25, num_return_sequences=num_tcrs, num_beams=num_beams, return_dict_in_generate=True) |
|
|
|
# Use a regex to strip the special tags (e.g. [TCR]) from the decoded sequences
|
cdr3b_sequences = [re.sub(r'\[.*\]', '', x) for x in tokenizer.batch_decode(outputs['sequences'], skip_special_tokens=True)] |
|
|
|
>>> cdr3b_sequences |
|
|
|
['CASSLGTGGTDTQYF', |
|
'CASSPGTGGTDTQYF', |
|
'CASSLGQGGTEAFF', |
|
'CASSVGTGGTDTQYF', |
|
'CASSLGTGGSYEQYF', |
|
'CASSPGQGGTEAFF', |
|
'CASSSGTGGTDTQYF', |
|
'CASSLGGGGTDTQYF', |
|
'CASSLGGGSYEQYF', |
|
'CASSLGTGGNQPQHF'] |
|
``` |
|
|
|
This model can also be used for unconditional generation of CDR3 \\(\beta\\) sequences: |
|
|
|
```python |
|
import re |
|
from transformers import T5Tokenizer, T5ForConditionalGeneration |
|
tokenizer = T5Tokenizer.from_pretrained('dkarthikeyan1/tcrt5_ft_tcrdb') |
|
tcrt5 = T5ForConditionalGeneration.from_pretrained("dkarthikeyan1/tcrt5_ft_tcrdb") |
|
|
|
|
|
# Define the number of TCRs you would like to generate
|
num_tcrs = 10 |
|
# Define the number of beams to explore (recommended: 3x the number of TCRs) |
|
num_beams = 30 |
|
|
|
unconditional_outputs = tcrt5.generate(max_new_tokens=25, num_return_sequences=num_tcrs, num_beams=num_beams, return_dict_in_generate=True) |
|
|
|
# Use a regex to strip the special tags (e.g. [TCR]) from the decoded sequences
|
uncond_cdr3b_sequences = [re.sub(r'\[.*\]', '', x) for x in tokenizer.batch_decode(unconditional_outputs['sequences'], skip_special_tokens=True)] |
|
|
|
>>> uncond_cdr3b_sequences |
|
|
|
['CASSLGGETQYF', |
|
'CASSLGQGNTEAFF', |
|
'CASSLGQGNTGELFF', |
|
'CASSLGTSGTDTQYF', |
|
'CASSLGLAGSYNEQFF', |
|
'CASSLGLAGTDTQYF', |
|
'CASSLGQGYEQYF', |
|
'CASSLGLAGGNTGELFF', |
|
'CASSLGGTGELFF', |
|
'CASSLGQGAYEQYF'] |
|
``` |
|
|
|
**Note:** For conditional generation, we found that model performance was greatest using beam search decoding. However, we also observed a reduction in sequence diversity with this decoding method. If you would like to generate more diverse sequences, TCRT5 supports a range of alternative decoding strategies, which are described [here](https://huggingface.co/docs/transformers/generation_strategies) and [here](https://huggingface.co/blog/how-to-generate).
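
For example, diversity can be increased by switching from beam search to sampling-based decoding. The sketch below reuses the `tokenizer`, `tcrt5`, and `encoded_pmhc` objects from the conditional-generation example above; the `top_p` and `temperature` values are illustrative assumptions, not tuned settings.

```python
import re

# Nucleus (top-p) sampling trades some fidelity for diversity relative to beam search.
sampled_outputs = tcrt5.generate(
    **encoded_pmhc,
    max_new_tokens=25,
    do_sample=True,            # sample instead of beam search
    top_p=0.95,                # nucleus sampling threshold (assumed value)
    temperature=1.0,           # assumed value; raise for more diversity
    num_return_sequences=10,
    return_dict_in_generate=True,
)

sampled_cdr3b_sequences = [
    re.sub(r'\[.*\]', '', x)
    for x in tokenizer.batch_decode(sampled_outputs['sequences'], skip_special_tokens=True)
]
```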
|
|
|
### Limitations and bias |
|
|
|
A known bias of TCRT5's predictions is its preference for generating sequences with high V(D)J recombination probability, as computed by [OLGA](https://github.com/statbiophys/OLGA). This can be attenuated by using alternative decoding methods such as ancestral sampling (see the sampling example above).
|
|
|
## Training data |
|
|
|
TCRT5 was pre-trained on masked span reconstruction of ~14M TCR sequences from [TCRdb](http://bioinfo.life.hust.edu.cn/TCRdb/) as well as ~780k peptide-pseudosequence pairs taken from [IEDB](https://www.iedb.org/). Finetuning was done using a parallel corpus of ~330k TCR:peptide-pseudosequence pairs taken from [VDJdb](https://vdjdb.cdr3.net/), [IEDB](https://www.iedb.org/), [McPAS](https://friedmanlab.weizmann.ac.il/McPAS-TCR/), and semi-synthetic examples from [MIRA](https://pmc.ncbi.nlm.nih.gov/articles/PMC7418738/).
|
|
|
## Training procedure |
|
|
|
### Preprocessing |
|
|
|
All amino acid sequences and V/J gene names were standardized using the [`tidytcells`](https://pmc.ncbi.nlm.nih.gov/articles/PMC10634431/) package. MHC allele information was standardized using [`mhcgnomes`](https://pypi.org/project/mhcgnomes/) before mapping each allele to the MHC pseudo-sequence defined in [NetMHCpan](https://pmc.ncbi.nlm.nih.gov/articles/PMC3319061/).
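
The snippet below is a minimal sketch of this kind of standardization. The module and function names are assumptions based on recent releases of `tidytcells` (v2) and `mhcgnomes`; it is not the project's actual preprocessing code, so check each package's documentation for your installed version.

```python
import tidytcells as tt   # assumed v2-style module layout (tt.tr, tt.junction)
import mhcgnomes

# Standardize a CDR3 beta junction sequence (returns None if it cannot be standardized).
cdr3b = tt.junction.standardize("casslgqgyeqyf")
print(cdr3b)  # expected: 'CASSLGQGYEQYF'

# Standardize a TCR V gene name to IMGT nomenclature.
v_gene = tt.tr.standardize("TCRBV07-09")
print(v_gene)  # expected: an IMGT-style symbol such as 'TRBV7-9'

# Parse and normalize an MHC allele name before looking up its NetMHCpan pseudo-sequence.
allele = mhcgnomes.parse("HLA-A*0301")
print(allele.to_string())  # expected: a normalized name such as 'HLA-A*03:01'
```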
|
|
|
### Pre-training |
|
|
|
TCRT5 was pre-trained with a masked span reconstruction objective similar to the original T5 pre-training loss. For a given sequence, the model masks 15% of the tokens using contiguous spans of random length between 1 and 3, marked with the sentinel tokens introduced in the T5 paper. The masked sequence is then passed to the model, which is trained to reconstruct a concatenated sequence comprised of the sentinel tokens followed by the corresponding masked tokens. This forces the model to learn richer k-mer dependencies within the masked sequences.
|
|
|
``` |
|
Masks 'mlm_probability' of the tokens, grouped into spans of at most 'max_span_length', according to the following algorithm:
* Randomly generate span lengths that add up to round(mlm_probability * seq_len) for each sequence (ignoring pad tokens).
* Ensure that spans are not directly adjacent so that max_span_length is respected.
* Once the span masks are generated according to T5 conventions, mask the inputs and generate the targets.

Example Input:

CASSLGQGYEQYF

Masked Input:

CASSLG[X]GY[Y]F

Target:

[X]Q[Y]EQY[Z]
|
|
|
``` |
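
The following is a minimal, self-contained sketch of the span-masking scheme described above, written for illustration only. It is not the project's data collator; the sentinel strings and the simplified placement logic are assumptions.

```python
import random

SENTINELS = ["[X]", "[Y]", "[Z]", "[W]"]  # illustrative sentinel tokens (enough for short CDR3s)

def mask_spans(seq, mlm_probability=0.15, max_span_length=3):
    """Return (masked_input, target) for one sequence, following the span scheme above."""
    n = len(seq)
    num_to_mask = max(1, round(mlm_probability * n))

    # Draw span lengths in [1, max_span_length] that sum to num_to_mask.
    span_lengths = []
    while sum(span_lengths) < num_to_mask:
        remaining = num_to_mask - sum(span_lengths)
        span_lengths.append(random.randint(1, min(max_span_length, remaining)))

    # Place spans left to right, keeping at least one unmasked position between them
    # so adjacent spans never merge into one longer than max_span_length.
    spans, cursor = [], 0
    for i, length in enumerate(span_lengths):
        # latest start that still leaves room for the remaining spans (each needs length + 1 slots)
        room_needed = sum(l + 1 for l in span_lengths[i + 1:])
        latest = n - length - room_needed
        start = random.randint(cursor, max(cursor, latest))
        spans.append((start, start + length))
        cursor = start + length + 1  # the +1 gap enforces non-adjacency

    # Build the masked input and the target from the chosen spans.
    masked, target, prev_end = [], [], 0
    for i, (s, e) in enumerate(spans):
        masked.append(seq[prev_end:s] + SENTINELS[i])
        target.append(SENTINELS[i] + seq[s:e])
        prev_end = e
    masked.append(seq[prev_end:])
    target.append(SENTINELS[len(spans)])  # final sentinel marks the end of the targets
    return "".join(masked), "".join(target)

random.seed(0)
print(mask_spans("CASSLGQGYEQYF"))  # prints a (masked_input, target) pair like the example above
```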
|
|
|
### Finetuning |
|
|
|
TCRT5 was finetuned on peptide-pseudosequence -> CDR3 \\(\beta\\) source:target pairs using the standard cross-entropy loss.
|
|
|
|
|
``` |
|
Example Input: |
|
|
|
[PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS] |
|
|
|
|
|
Target: |
|
|
|
[TCR]CASSLGYNEQFF[EOS]
|
|
|
``` |
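
Concretely, each source/target pair can be passed to `T5ForConditionalGeneration` with the target supplied as `labels`, which makes the model compute the token-level cross-entropy loss internally. The sketch below uses the public checkpoint and a single pair for illustration; it is not the project's actual training script, and no training hyperparameters are implied.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("dkarthikeyan1/tcrt5_ft_tcrdb")
model = T5ForConditionalGeneration.from_pretrained("dkarthikeyan1/tcrt5_ft_tcrdb")

source = "[PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]"
target = "[TCR]CASSLGYNEQFF[EOS]"

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# Supplying `labels` makes the model compute the seq2seq cross-entropy loss internally.
outputs = model(**inputs, labels=labels)
print(outputs.loss)       # scalar cross-entropy loss for this single pair
outputs.loss.backward()   # gradients for one (sketched) fine-tuning step
```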
|
|
|
## Results |
|
|
|
This fine-tuned model achieves the following results for conditional CDR3 \\(\beta\\) generation on our validation set of the top-20 peptide-MHCs with the most abundant known TCRs (listed in alphabetical order):
|
|
|
1. AVFDRKSDAK_**A*11:01** |
|
2. CRVRLCCYVL_**C*07:02** |
|
3. EAAGIGILTV_**A*02:01** |
|
4. ELAGIGILTV_**A*02:01** |
|
5. GILGFVFTL_**A*02:01** |
|
6. GLCTLVAML_**A*02:01** |
|
7. IVTDFSVIK_**A*11:01** |
|
8. KLGGALQAK_**A*03:01** |
|
9. LLLDRLNQL_**A*02:01** |
|
10. LLWNGPMAV_**A*02:01** |
|
11. LPRRSGAAGA_**B*07:02** |
|
12. LVVDFSQFSR_**A*11:01** |
|
13. NLVPMVATV_**A*02:01** |
|
14. RAKFKQLL_**B*08:01** |
|
15. SPRWYFYYL_**B*07:02** |
|
16. STLPETAAVRR_**A*11:01** |
|
17. TPRVTGGGAM_**B*07:02** |
|
18. TTDPSFLGRY_**A*01:01** |
|
19. YLQPRTFLL_**A*02:01** |
|
20. YVLDHLIVV_**A*02:01** |
|
|
|
Benchmark results: |
|
|
|
| Char-BLEU | F@100 | SeqRec% | Diversity (num_seq) | Avg. Jaccard Dissimilarity | Perplexity |
|:---------:|:-----:|:-------:|:-------------------:|:--------------------------:|:----------:|
| 96.4 | 0.09 | 89.2 | 1300 (2000 max) | 94.4/100 | 2.48 |
|
|
|
### BibTeX entry and citation info |
|
|
|
```bibtex |
|
@article{dkarthikeyan2024tcrtranslate, |
|
title={TCR-TRANSLATE: Conditional Generation of Real Antigen Specific T-cell Receptor Sequences}, |
|
author={Dhuvarakesh Karthikeyan and Colin Raffel and Benjamin Vincent and Alex Rubinsteyn}, |
|
journal={bioRxiv},
|
year={2024}, |
|
} |
|
``` |