---
license: cc-by-nc-sa-4.0
library_name: transformers
tags:
- biology
- immunology
- seq2seq
pipeline_tag: text2text-generation
base_model:
- dkarthikeyan1/tcrt5_pre_tcrdb
---
# TCRT5 model (finetuned)
## Model description
TCRT5 is a seq2seq model designed for the conditional generation of T-cell receptor (TCR) sequences given a target peptide-MHC (pMHC). It is a transformers model
built on the [T5 architecture](https://github.com/google-research/text-to-text-transfer-transformer/tree/main/t5) and operationalized through the associated
HuggingFace [abstraction](https://huggingface.co/docs/transformers/v4.46.2/en/model_doc/t5#transformers.T5ForConditionalGeneration).
It is released along with [this paper](google.com).
## Intended uses & limitations
This model is designed to auto-regressively generate CDR3 \\(\beta\\) sequences against a pMHC of interest.
It therefore assumes that a plausible pMHC is provided as input. We have not tested the model on peptide-MHC pairs
with low binding affinity and do not expect the model to adjust its predictions accordingly.
This model is intended for academic purposes and should not be used in a clinical setting.
### How to use
You can use this model directly for conditional CDR3 \\(\beta\\) generation:
```python
import re
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('dkarthikeyan1/tcrt5_ft_tcrdb')
tcrt5 = T5ForConditionalGeneration.from_pretrained("dkarthikeyan1/tcrt5_ft_tcrdb")
pmhc = "[PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]"
encoded_pmhc = tokenizer(pmhc, return_tensors='pt')
# Define the number of TCRs you would like to generate
num_tcrs = 10
# Define the number of beams to explore (recommended: 3x the number of TCRs)
num_beams = 30
outputs = tcrt5.generate(**encoded_pmhc, max_new_tokens=25, num_return_sequences=num_tcrs, num_beams=num_beams, return_dict_in_generate=True)
# Use regex to strip the [TCR] tag from the decoded sequences
cdr3b_sequences = [re.sub(r'\[.*\]', '', x) for x in tokenizer.batch_decode(outputs['sequences'], skip_special_tokens=True)]
>>> cdr3b_sequences
['CASSLGTGGTDTQYF',
'CASSPGTGGTDTQYF',
'CASSLGQGGTEAFF',
'CASSVGTGGTDTQYF',
'CASSLGTGGSYEQYF',
'CASSPGQGGTEAFF',
'CASSSGTGGTDTQYF',
'CASSLGGGGTDTQYF',
'CASSLGGGSYEQYF',
'CASSLGTGGNQPQHF']
```
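For convenience, here is a tiny hypothetical helper (not part of the released code) for assembling the expected `[PMHC]peptide[SEP]pseudosequence[EOS]` input string:
```python
def format_pmhc(peptide: str, mhc_pseudosequence: str) -> str:
    """Build the input string TCRT5 expects for a peptide and its MHC pseudo-sequence."""
    return f"[PMHC]{peptide}[SEP]{mhc_pseudosequence}[EOS]"

# Same pMHC as in the example above.
pmhc = format_pmhc("KLGGALQAK", "YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY")
```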
This model can also be used for unconditional generation of CDR3 \\(\beta\\) sequences:
```python
import re
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('dkarthikeyan1/tcrt5_ft_tcrdb')
tcrt5 = T5ForConditionalGeneration.from_pretrained("dkarthikeyan1/tcrt5_ft_tcrdb")
# Define the number of TCRs you would like to generate
num_tcrs = 10
# Define the number of beams to explore (recommended: 3x the number of TCRs)
num_beams = 30
unconditional_outputs = tcrt5.generate(max_new_tokens=25, num_return_sequences=num_tcrs, num_beams=num_beams, return_dict_in_generate=True)
# Use regex to strip the [TCR] tag from the decoded sequences
uncond_cdr3b_sequences = [re.sub(r'\[.*\]', '', x) for x in tokenizer.batch_decode(unconditional_outputs['sequences'], skip_special_tokens=True)]
>>> uncond_cdr3b_sequences
['CASSLGGETQYF',
'CASSLGQGNTEAFF',
'CASSLGQGNTGELFF',
'CASSLGTSGTDTQYF',
'CASSLGLAGSYNEQFF',
'CASSLGLAGTDTQYF',
'CASSLGQGYEQYF',
'CASSLGLAGGNTGELFF',
'CASSLGGTGELFF',
'CASSLGQGAYEQYF']
```
**Note:** For conditional generation, we found that model performance was greatest with beam search decoding. However, we also report
a reduction in sequence diversity with this particular decoding method. If you would like to generate more diverse sequences (see the sampling sketch below), TCRT5 supports
the full range of alternative decoding strategies described [here](https://huggingface.co/docs/transformers/generation_strategies) and
[here](https://huggingface.co/blog/how-to-generate).
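As one example, here is a minimal sketch of sampling-based decoding (ancestral/nucleus sampling) that reuses the `tokenizer`, `tcrt5`, `encoded_pmhc`, and `num_tcrs` objects from the conditional-generation snippet above; the `top_p` and `temperature` values are illustrative, not tuned:
```python
# Ancestral / nucleus sampling instead of beam search (illustrative settings).
sampled_outputs = tcrt5.generate(
    **encoded_pmhc,
    max_new_tokens=25,
    do_sample=True,            # enable ancestral sampling
    top_p=0.95,                # nucleus sampling; set to 1.0 for pure ancestral sampling
    temperature=1.0,           # raise above 1.0 for a flatter, more diverse distribution
    num_return_sequences=num_tcrs,
    return_dict_in_generate=True,
)
sampled_cdr3b = [re.sub(r'\[.*\]', '', x)
                 for x in tokenizer.batch_decode(sampled_outputs['sequences'], skip_special_tokens=True)]
```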
### Limitations and bias
One of the known biases of TCRT5's predictions is its preference for sampling sequences with high V(D)J recombination probability, as computed by [OLGA](https://github.com/statbiophys/OLGA).
This can be attenuated with the use of alternative decoding methods such as ancestral sampling; a rough pgen-scoring sketch with OLGA is shown below.
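As one way to inspect this bias, generated CDR3 \\(\beta\\) sequences can be scored with OLGA's generation-probability (pgen) model. The sketch below follows the usage pattern from the OLGA README under a few assumptions: OLGA is installed, its packaged `default_models/human_T_beta` files are at the path constructed below (adjust for your install), and `cdr3b_sequences` is the list generated in the example above.
```python
import os
import olga.load_model as load_model
import olga.generation_probability as pgen

# Paths to OLGA's packaged human TRB model (placeholder; adjust to your install).
model_dir = os.path.join(os.path.dirname(load_model.__file__), 'default_models', 'human_T_beta')
params_file = os.path.join(model_dir, 'model_params.txt')
marginals_file = os.path.join(model_dir, 'model_marginals.txt')
v_anchor_file = os.path.join(model_dir, 'V_gene_CDR3_anchors.csv')
j_anchor_file = os.path.join(model_dir, 'J_gene_CDR3_anchors.csv')

# Load the genomic data and generative model.
genomic_data = load_model.GenomicDataVDJ()
genomic_data.load_igor_genomic_data(params_file, v_anchor_file, j_anchor_file)
generative_model = load_model.GenerativeModelVDJ()
generative_model.load_and_process_igor_model(marginals_file)

# Compute pgen for each generated CDR3 beta amino-acid sequence (marginalizing over V/J).
pgen_model = pgen.GenerationProbabilityVDJ(generative_model, genomic_data)
pgens = {seq: pgen_model.compute_aa_CDR3_pgen(seq) for seq in cdr3b_sequences}
```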
## Training data
TCRT5 was pre-trained on masked span reconstruction of ~14M TCR sequences from [TCRdb](http://bioinfo.life.hust.edu.cn/TCRdb/)
as well as ~780k peptide-pseudosequence pairs taken from [IEDB](https://www.iedb.org/). Finetuning was done using a parallel
corpus of ~330k TCR:peptide-pseudosequence pairs taken from [VDJdb](https://vdjdb.cdr3.net/), [IEDB](https://www.iedb.org/),
[McPAS](https://friedmanlab.weizmann.ac.il/McPAS-TCR/), and semi-synthetic examples from [MIRA](https://pmc.ncbi.nlm.nih.gov/articles/PMC7418738/).
## Training procedure
### Preprocessing
All amino acid sequences and V/J gene names were standardized using the `tidytcells` package ([reference](https://pmc.ncbi.nlm.nih.gov/articles/PMC10634431/)). MHC
allele names were standardized using `mhcgnomes` ([PyPI](https://pypi.org/project/mhcgnomes/)) before mapping each allele to its MHC pseudo-sequence
as defined by [NetMHCpan](https://pmc.ncbi.nlm.nih.gov/articles/PMC3319061/). A short standardization sketch is shown below.
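Here is a minimal sketch of this kind of standardization, assuming `tidytcells` (v2-style `tr`/`junction` modules) and `mhcgnomes` are installed; the allele-to-pseudo-sequence lookup is a placeholder, since that table comes from NetMHCpan and is not bundled with either package:
```python
import tidytcells as tt
from mhcgnomes import parse

# Standardize a CDR3 junction sequence and V/J gene names (tidytcells v2-style API).
cdr3b = tt.junction.standardize("casslgtggtdtqyf")   # expected: 'CASSLGTGGTDTQYF'
v_gene = tt.tr.standardize("TRBV28")
j_gene = tt.tr.standardize("TRBJ2-3")

# Standardize an MHC allele name with mhcgnomes.
allele_name = parse("HLA-A*03:01").to_string()

# Placeholder: map the standardized allele to its NetMHCpan pseudo-sequence
# (pseudo-sequence reproduced from the usage example above).
pseudo_sequences = {"HLA-A*03:01": "YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY"}
mhc_pseudo = pseudo_sequences.get(allele_name)
```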
### Pre-training
TCRT5 was pretrained with masked language modeling (MLM) via span reconstruction, similar to the original T5 training objective.
For a given sequence, 15% of the tokens are masked using contiguous spans of random length between 1 and 3,
with each span replaced by one of the sentinel tokens introduced in the T5 paper. The masked sequence is then passed to the model,
which is trained to reconstruct a concatenation of each sentinel token followed by the tokens it masked.
This forces the model to learn richer k-mer dependencies within the masked sequences.
```
Masks 'mlm_probability' tokens grouped into spans of size up to 'max_span_length' according to the following algorithm:
* Randomly generate span lengths that add up to round(mlm_probability * seq_len) (ignoring pad tokens) for each sequence.
* Ensure that spans are not directly adjacent so that max_span_length is respected.
* Once the span masks are generated according to T5 conventions, mask the inputs and generate the targets.
Example Input:
CASSLGQGYEQYF
Masked Input:
CASSLG[X]GY[Y]F
Target:
[X]Q[Y]EQY[Z]
```
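Here is a minimal, self-contained sketch of this span-corruption scheme on a toy sequence, using the simplified [X]/[Y]/[Z] sentinel notation from the example above. It is an illustration of the described algorithm, not the actual data collator used for training:
```python
import random

def t5_span_corrupt(seq, mlm_probability=0.15, max_span_length=3, seed=0):
    """Mask ~mlm_probability of the residues in non-adjacent spans of length 1..max_span_length
    and build the T5-style target (each sentinel followed by the residues it masked)."""
    rng = random.Random(seed)
    sentinels = [f"[{c}]" for c in "XYZABCDEFGH"]  # simplified sentinel vocabulary

    n_to_mask = max(1, round(mlm_probability * len(seq)))
    # Randomly draw span lengths that add up to n_to_mask.
    span_lengths = []
    while sum(span_lengths) < n_to_mask:
        span_lengths.append(min(rng.randint(1, max_span_length), n_to_mask - sum(span_lengths)))

    # Place spans left to right, leaving at least one unmasked residue between spans
    # and enough room for the spans that still need to be placed.
    starts, cursor = [], 0
    for i, length in enumerate(span_lengths):
        remaining = span_lengths[i + 1:]
        room_needed = sum(remaining) + len(remaining)  # later spans plus one-residue gaps
        start = rng.randint(cursor, len(seq) - length - room_needed)
        starts.append(start)
        cursor = start + length + 1  # +1 keeps consecutive spans non-adjacent

    # Build the masked input and the concatenated target.
    masked, target, prev_end = [], [], 0
    for sentinel, start, length in zip(sentinels, starts, span_lengths):
        masked.append(seq[prev_end:start] + sentinel)
        target.append(sentinel + seq[start:start + length])
        prev_end = start + length
    masked.append(seq[prev_end:])
    target.append(sentinels[len(starts)])  # closing sentinel
    return "".join(masked), "".join(target)

masked_input, target = t5_span_corrupt("CASSLGQGYEQYF")
print(masked_input, "->", target)
```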
### Finetuning
TCRT5 was finetuned on peptide-pseudosequence -> CDR3 \\(\beta\\) source:target pairs using the standard cross-entropy loss (a minimal training-step sketch follows the example below).
```
Example Input:
[PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]
Target:
[TCR]CASSLGYNEQFF[EOS]
```
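Here is a minimal sketch of a single finetuning step with the standard HuggingFace seq2seq cross-entropy loss, using the source/target format shown above. It assumes the pretrained base checkpoint `dkarthikeyan1/tcrt5_pre_tcrdb` (listed under `base_model`) ships with the same tokenizer; the optimizer and learning rate are illustrative, not the values used in the paper:
```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('dkarthikeyan1/tcrt5_pre_tcrdb')
model = T5ForConditionalGeneration.from_pretrained('dkarthikeyan1/tcrt5_pre_tcrdb')
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # illustrative hyperparameters

source = "[PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]"
target = "[TCR]CASSLGYNEQFF[EOS]"

inputs = tokenizer(source, return_tensors='pt')
labels = tokenizer(target, return_tensors='pt').input_ids

# T5ForConditionalGeneration returns the cross-entropy loss when `labels` are provided.
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```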
## Results
This fine-tuned model achieves the following results for conditional CDR3 \\(\beta\\) generation on our validation set of the 20 peptide-MHCs with the most abundant known TCRs (listed in alphabetical order):
1. AVFDRKSDAK_**A*11:01**
2. CRVRLCCYVL_**C*07:02**
3. EAAGIGILTV_**A*02:01**
4. ELAGIGILTV_**A*02:01**
5. GILGFVFTL_**A*02:01**
6. GLCTLVAML_**A*02:01**
7. IVTDFSVIK_**A*11:01**
8. KLGGALQAK_**A*03:01**
9. LLLDRLNQL_**A*02:01**
10. LLWNGPMAV_**A*02:01**
11. LPRRSGAAGA_**B*07:02**
12. LVVDFSQFSR_**A*11:01**
13. NLVPMVATV_**A*02:01**
14. RAKFKQLL_**B*08:01**
15. SPRWYFYYL_**B*07:02**
16. STLPETAAVRR_**A*11:01**
17. TPRVTGGGAM_**B*07:02**
18. TTDPSFLGRY_**A*01:01**
19. YLQPRTFLL_**A*02:01**
20. YVLDHLIVV_**A*02:01**
Benchmark results:
| Char-BLEU | F@100 | SeqRec% | Diversity (num_seq) | Avg. Jaccard Dissimilarity | Perplexity |
|:---------:|:-----:|:-------:|:-------------------:|:--------------------------:|:----------:|
| 96.4      | 0.09  | 89.2    | 1300 (2000 max)     | 94.4/100                   | 2.48       |
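The exact definitions of these metrics (Char-BLEU, F@100, SeqRec%, diversity, Jaccard dissimilarity, perplexity) are given in the paper. As a rough, hypothetical illustration of the character-level BLEU idea only, one could score generated sequences against known binders with NLTK; the sequences below are toy examples, not the validation data:
```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Toy example: character-level BLEU of generated CDR3b sequences against known binders.
references = [["CASSLGTGGTDTQYF", "CASSLGQGYEQYF"],   # reference set for generated sequence 1
              ["CASSPGQGGTEAFF", "CASSLGQGNTEAFF"]]   # reference set for generated sequence 2
hypotheses = ["CASSLGTGGSYEQYF", "CASSPGTGGTDTQYF"]

refs_chars = [[list(ref) for ref in refs] for refs in references]
hyps_chars = [list(hyp) for hyp in hypotheses]
score = corpus_bleu(refs_chars, hyps_chars, smoothing_function=SmoothingFunction().method1)
print(f"char-BLEU: {100 * score:.1f}")
```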
### BibTeX entry and citation info
```bibtex
@article{dkarthikeyan2024tcrtranslate,
title={TCR-TRANSLATE: Conditional Generation of Real Antigen Specific T-cell Receptor Sequences},
author={Dhuvarakesh Karthikeyan and Colin Raffel and Benjamin Vincent and Alex Rubinsteyn},
journal={bioRxiv},
year={2024},
}
```