---
license: cc-by-nc-sa-4.0
library_name: transformers
tags:
- biology
- immunology
- seq2seq
pipeline_tag: text2text-generation
base_model:
- dkarthikeyan1/tcrt5_pre_tcrdb
---
# TCRT5 model (finetuned)
## Model description
TCRT5 is a seq2seq model designed for the conditional generation of T-cell receptor (TCR) sequences given a target peptide-MHC (pMHC). It is a transformers model
built on the [T5 architecture](https://github.com/google-research/text-to-text-transfer-transformer/tree/main/t5) and operationalized through the associated
HuggingFace [abstraction](https://huggingface.co/docs/transformers/v4.46.2/en/model_doc/t5#transformers.T5ForConditionalGeneration).
It is released along with [this paper](google.com).
## Intended uses & limitations
This model is designed to auto-regressively generate CDR3 \\(\beta\\) sequences against a pMHC of interest.
It therefore assumes that a plausible pMHC is provided as input. We have not tested the model on peptide-MHC pairs
with low binding affinity and do not expect the model to adjust its predictions accordingly.
This model is intended for academic purposes and should not be used in a clinical setting.
### How to use
You can use this model directly for conditional CDR3 \\(\beta\\) generation:
```python
import re
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('dkarthikeyan1/tcrt5_ft_tcrdb')
tcrt5 = T5ForConditionalGeneration.from_pretrained("dkarthikeyan1/tcrt5_ft_tcrdb")
pmhc = "[PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]"
encoded_pmhc = tokenizer(pmhc, return_tensors='pt')
# Define the number of TCRs you would like to generate
num_tcrs = 10
# Define the number of beams to explore (recommended: 3x the number of TCRs)
num_beams = 30
outputs = tcrt5.generate(**encoded_pmhc, max_new_tokens=25, num_return_sequences=num_tcrs, num_beams=num_beams, return_dict_in_generate=True)
# Use regex to strip the [TCR] tag from the decoded sequences
cdr3b_sequences = [re.sub(r'\[.*\]', '', x) for x in tokenizer.batch_decode(outputs['sequences'], skip_special_tokens=True)]
>>> cdr3b_sequences
['CASSLGTGGTDTQYF',
'CASSPGTGGTDTQYF',
'CASSLGQGGTEAFF',
'CASSVGTGGTDTQYF',
'CASSLGTGGSYEQYF',
'CASSPGQGGTEAFF',
'CASSSGTGGTDTQYF',
'CASSLGGGGTDTQYF',
'CASSLGGGSYEQYF',
'CASSLGTGGNQPQHF']
```
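For convenience, here is a tiny hypothetical helper (not part of the released code) for assembling the expected `[PMHC]peptide[SEP]pseudosequence[EOS]` input string:
```python
def format_pmhc(peptide: str, mhc_pseudosequence: str) -> str:
    """Build the input string TCRT5 expects for a peptide and its MHC pseudo-sequence."""
    return f"[PMHC]{peptide}[SEP]{mhc_pseudosequence}[EOS]"

# Same pMHC as in the example above.
pmhc = format_pmhc("KLGGALQAK", "YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY")
```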
This model can also be used for unconditional generation of CDR3 \\(\beta\\) sequences:
```python
import re
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('dkarthikeyan1/tcrt5_ft_tcrdb')
tcrt5 = T5ForConditionalGeneration.from_pretrained("dkarthikeyan1/tcrt5_ft_tcrdb")
# Define the number of TCRs you would like to generate
num_tcrs = 10
# Define the number of beams to explore (recommended: 3x the number of TCRs)
num_beams = 30
unconditional_outputs = tcrt5.generate(max_new_tokens=25, num_return_sequences=num_tcrs, num_beams=num_beams, return_dict_in_generate=True)
# Use regex to strip the [TCR] tag from the decoded sequences
uncond_cdr3b_sequences = [re.sub(r'\[.*\]', '', x) for x in tokenizer.batch_decode(unconditional_outputs['sequences'], skip_special_tokens=True)]
>>> uncond_cdr3b_sequences
['CASSLGGETQYF',
'CASSLGQGNTEAFF',
'CASSLGQGNTGELFF',
'CASSLGTSGTDTQYF',
'CASSLGLAGSYNEQFF',
'CASSLGLAGTDTQYF',
'CASSLGQGYEQYF',
'CASSLGLAGGNTGELFF',
'CASSLGGTGELFF',
'CASSLGQGAYEQYF']
```
**Note:** For conditional generation, we found that model performance was greatest with beam search decoding. However, we also report
a reduction in sequence diversity with this particular decoding method. If you would like to generate more diverse sequences (see the sampling sketch below), TCRT5 supports
the full range of alternative decoding strategies described [here](https://huggingface.co/docs/transformers/generation_strategies) and
[here](https://huggingface.co/blog/how-to-generate).
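As one example, here is a minimal sketch of sampling-based decoding (ancestral/nucleus sampling) that reuses the `tokenizer`, `tcrt5`, `encoded_pmhc`, and `num_tcrs` objects from the conditional-generation snippet above; the `top_p` and `temperature` values are illustrative, not tuned:
```python
# Ancestral / nucleus sampling instead of beam search (illustrative settings).
sampled_outputs = tcrt5.generate(
    **encoded_pmhc,
    max_new_tokens=25,
    do_sample=True,            # enable ancestral sampling
    top_p=0.95,                # nucleus sampling; set to 1.0 for pure ancestral sampling
    temperature=1.0,           # raise above 1.0 for a flatter, more diverse distribution
    num_return_sequences=num_tcrs,
    return_dict_in_generate=True,
)
sampled_cdr3b = [re.sub(r'\[.*\]', '', x)
                 for x in tokenizer.batch_decode(sampled_outputs['sequences'], skip_special_tokens=True)]
```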
### Limitations and bias
One of the known biases of TCRT5's predictions is its preference for sampling sequences with high V(D)J recombination probability, as computed by [OLGA](https://github.com/statbiophys/OLGA).
This can be attenuated with the use of alternative decoding methods such as ancestral sampling; a rough pgen-scoring sketch with OLGA is shown below.
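As one way to inspect this bias, generated CDR3 \\(\beta\\) sequences can be scored with OLGA's generation-probability (pgen) model. The sketch below follows the usage pattern from the OLGA README under a few assumptions: OLGA is installed, its packaged `default_models/human_T_beta` files are at the path constructed below (adjust for your install), and `cdr3b_sequences` is the list generated in the example above.
```python
import os
import olga.load_model as load_model
import olga.generation_probability as pgen

# Paths to OLGA's packaged human TRB model (placeholder; adjust to your install).
model_dir = os.path.join(os.path.dirname(load_model.__file__), 'default_models', 'human_T_beta')
params_file = os.path.join(model_dir, 'model_params.txt')
marginals_file = os.path.join(model_dir, 'model_marginals.txt')
v_anchor_file = os.path.join(model_dir, 'V_gene_CDR3_anchors.csv')
j_anchor_file = os.path.join(model_dir, 'J_gene_CDR3_anchors.csv')

# Load the genomic data and generative model.
genomic_data = load_model.GenomicDataVDJ()
genomic_data.load_igor_genomic_data(params_file, v_anchor_file, j_anchor_file)
generative_model = load_model.GenerativeModelVDJ()
generative_model.load_and_process_igor_model(marginals_file)

# Compute pgen for each generated CDR3 beta amino-acid sequence (marginalizing over V/J).
pgen_model = pgen.GenerationProbabilityVDJ(generative_model, genomic_data)
pgens = {seq: pgen_model.compute_aa_CDR3_pgen(seq) for seq in cdr3b_sequences}
```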
## Training data
TCRT5 was pre-trained on masked span reconstruction of ~14M TCR sequences from [TCRdb](http://bioinfo.life.hust.edu.cn/TCRdb/)
as well as ~780k peptide-pseudosequence pairs taken from [IEDB](https://www.iedb.org/). Finetuning was done using a parallel
corpus of ~330k TCR:peptide-pseudosequence pairs taken from [VDJdb](https://vdjdb.cdr3.net/), [IEDB](https://www.iedb.org/),
[McPAS](https://friedmanlab.weizmann.ac.il/McPAS-TCR/), and semi-synthetic examples from [MIRA](https://pmc.ncbi.nlm.nih.gov/articles/PMC7418738/).
## Training procedure
### Preprocessing
All amino acid sequences and V/J gene names were standardized using the `tidytcells` package ([reference](https://pmc.ncbi.nlm.nih.gov/articles/PMC10634431/)). MHC
allele names were standardized using `mhcgnomes` ([PyPI](https://pypi.org/project/mhcgnomes/)) before mapping each allele to its MHC pseudo-sequence
as defined by [NetMHCpan](https://pmc.ncbi.nlm.nih.gov/articles/PMC3319061/). A short standardization sketch is shown below.
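Here is a minimal sketch of this kind of standardization, assuming `tidytcells` (v2-style `tr`/`junction` modules) and `mhcgnomes` are installed; the allele-to-pseudo-sequence lookup is a placeholder, since that table comes from NetMHCpan and is not bundled with either package:
```python
import tidytcells as tt
from mhcgnomes import parse

# Standardize a CDR3 junction sequence and V/J gene names (tidytcells v2-style API).
cdr3b = tt.junction.standardize("casslgtggtdtqyf")   # expected: 'CASSLGTGGTDTQYF'
v_gene = tt.tr.standardize("TRBV28")
j_gene = tt.tr.standardize("TRBJ2-3")

# Standardize an MHC allele name with mhcgnomes.
allele_name = parse("HLA-A*03:01").to_string()

# Placeholder: map the standardized allele to its NetMHCpan pseudo-sequence
# (pseudo-sequence reproduced from the usage example above).
pseudo_sequences = {"HLA-A*03:01": "YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY"}
mhc_pseudo = pseudo_sequences.get(allele_name)
```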
### Pre-training
TCRT5 was pretrained with masked language modeling (MLM) via span reconstruction, similar to the original T5 training objective.
For a given sequence, 15% of the tokens are masked using contiguous spans of random length between 1 and 3,
with each span replaced by one of the sentinel tokens introduced in the T5 paper. The masked sequence is then passed to the model,
which is trained to reconstruct a concatenation of each sentinel token followed by the tokens it masked.
This forces the model to learn richer k-mer dependencies within the masked sequences.
```
Masks 'mlm_probability' tokens grouped into spans of size up to 'max_span_length' according to the following algorithm:
* Randomly generate span lengths that add up to round(mlm_probability * seq_len) (ignoring pad tokens) for each sequence.
* Ensure that spans are not directly adjacent so that max_span_length is respected.
* Once the span masks are generated according to T5 conventions, mask the inputs and generate the targets.
Example Input:
CASSLGQGYEQYF
Masked Input:
CASSLG[X]GY[Y]F
Target:
[X]Q[Y]EQY[Z]
```
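Here is a minimal, self-contained sketch of this span-corruption scheme on a toy sequence, using the simplified [X]/[Y]/[Z] sentinel notation from the example above. It is an illustration of the described algorithm, not the actual data collator used for training:
```python
import random

def t5_span_corrupt(seq, mlm_probability=0.15, max_span_length=3, seed=0):
    """Mask ~mlm_probability of the residues in non-adjacent spans of length 1..max_span_length
    and build the T5-style target (each sentinel followed by the residues it masked)."""
    rng = random.Random(seed)
    sentinels = [f"[{c}]" for c in "XYZABCDEFGH"]  # simplified sentinel vocabulary

    n_to_mask = max(1, round(mlm_probability * len(seq)))
    # Randomly draw span lengths that add up to n_to_mask.
    span_lengths = []
    while sum(span_lengths) < n_to_mask:
        span_lengths.append(min(rng.randint(1, max_span_length), n_to_mask - sum(span_lengths)))

    # Place spans left to right, leaving at least one unmasked residue between spans
    # and enough room for the spans that still need to be placed.
    starts, cursor = [], 0
    for i, length in enumerate(span_lengths):
        remaining = span_lengths[i + 1:]
        room_needed = sum(remaining) + len(remaining)  # later spans plus one-residue gaps
        start = rng.randint(cursor, len(seq) - length - room_needed)
        starts.append(start)
        cursor = start + length + 1  # +1 keeps consecutive spans non-adjacent

    # Build the masked input and the concatenated target.
    masked, target, prev_end = [], [], 0
    for sentinel, start, length in zip(sentinels, starts, span_lengths):
        masked.append(seq[prev_end:start] + sentinel)
        target.append(sentinel + seq[start:start + length])
        prev_end = start + length
    masked.append(seq[prev_end:])
    target.append(sentinels[len(starts)])  # closing sentinel
    return "".join(masked), "".join(target)

masked_input, target = t5_span_corrupt("CASSLGQGYEQYF")
print(masked_input, "->", target)
```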
### Finetuning
TCRT5 was finetuned on peptide-pseudosequence -> CDR3 \\(\beta\\) source:target pairs using the standard cross-entropy loss (a minimal training-step sketch follows the example below).
```
Example Input:
[PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]
Target:
[TCR]CASSLGYNEQFF[EOS]
```
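Here is a minimal sketch of a single finetuning step with the standard HuggingFace seq2seq cross-entropy loss, using the source/target format shown above. It assumes the pretrained base checkpoint `dkarthikeyan1/tcrt5_pre_tcrdb` (listed under `base_model`) ships with the same tokenizer; the optimizer and learning rate are illustrative, not the values used in the paper:
```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('dkarthikeyan1/tcrt5_pre_tcrdb')
model = T5ForConditionalGeneration.from_pretrained('dkarthikeyan1/tcrt5_pre_tcrdb')
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # illustrative hyperparameters

source = "[PMHC]KLGGALQAK[SEP]YFAMYQENVAQTDVDTLYIIYRDYTWAELAYTWY[EOS]"
target = "[TCR]CASSLGYNEQFF[EOS]"

inputs = tokenizer(source, return_tensors='pt')
labels = tokenizer(target, return_tensors='pt').input_ids

# T5ForConditionalGeneration returns the cross-entropy loss when `labels` are provided.
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```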
## Results
This fine-tuned model achieves the following results for conditional CDR3 \\(\beta\\) generation on our validation set of the 20 peptide-MHCs with the most abundant known TCRs (listed in alphabetical order):
1. AVFDRKSDAK_**A*11:01**
2. CRVRLCCYVL_**C*07:02**
3. EAAGIGILTV_**A*02:01**
4. ELAGIGILTV_**A*02:01**
5. GILGFVFTL_**A*02:01**
6. GLCTLVAML_**A*02:01**
7. IVTDFSVIK_**A*11:01**
8. KLGGALQAK_**A*03:01**
9. LLLDRLNQL_**A*02:01**
10. LLWNGPMAV_**A*02:01**
11. LPRRSGAAGA_**B*07:02**
12. LVVDFSQFSR_**A*11:01**
13. NLVPMVATV_**A*02:01**
14. RAKFKQLL_**B*08:01**
15. SPRWYFYYL_**B*07:02**
16. STLPETAAVRR_**A*11:01**
17. TPRVTGGGAM_**B*07:02**
18. TTDPSFLGRY_**A*01:01**
19. YLQPRTFLL_**A*02:01**
20. YVLDHLIVV_**A*02:01**
Benchmark results:
| Char-BLEU | F@100 | SeqRec% | Diversity (num_seq) | Avg. Jaccard Dissimilarity | Perplexity |
|:---------:|:-----:|:-------:|:-------------------:|:--------------------------:|:----------:|
| 96.4      | 0.09  | 89.2    | 1300 (2000 max)     | 94.4/100                   | 2.48       |
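The exact definitions of these metrics (Char-BLEU, F@100, SeqRec%, diversity, Jaccard dissimilarity, perplexity) are given in the paper. As a rough, hypothetical illustration of the character-level BLEU idea only, one could score generated sequences against known binders with NLTK; the sequences below are toy examples, not the validation data:
```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Toy example: character-level BLEU of generated CDR3b sequences against known binders.
references = [["CASSLGTGGTDTQYF", "CASSLGQGYEQYF"],   # reference set for generated sequence 1
              ["CASSPGQGGTEAFF", "CASSLGQGNTEAFF"]]   # reference set for generated sequence 2
hypotheses = ["CASSLGTGGSYEQYF", "CASSPGTGGTDTQYF"]

refs_chars = [[list(ref) for ref in refs] for refs in references]
hyps_chars = [list(hyp) for hyp in hypotheses]
score = corpus_bleu(refs_chars, hyps_chars, smoothing_function=SmoothingFunction().method1)
print(f"char-BLEU: {100 * score:.1f}")
```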
### BibTeX entry and citation info
```bibtex
@article{dkarthikeyan2024tcrtranslate,
title={TCR-TRANSLATE: Conditional Generation of Real Antigen Specific T-cell Receptor Sequences},
author={Dhuvarakesh Karthikeyan and Colin Raffel and Benjamin Vincent and Alex Rubinsteyn},
journal={bioRxiv},
year={2024},
}
```