metadata

language:
  - en
license: mit
tags:
  - chemistry
  - SMILES
  - retrosynthesis
datasets:
  - ORD
metrics:
  - accuracy

Model Card for ReactionT5v2-retrosynthesis

This is a ReactionT5 model pre-trained to predict the reactants of reactions and fine-tuned on the train split of USPTO_50k. The base model before fine-tuning is sagawa/ReactionT5v2-retrosynthesis.

Model Sources

Repository: https://github.com/sagawatatsuya/ReactionT5v2
Paper: https://arxiv.org/abs/2311.06708
Demo: https://huggingface.co/spaces/sagawa/ReactionT5_task_retrosynthesis

Uses

You can use this model for retrosynthesis prediction as is, or fine-tune it on your own dataset.

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("sagawa/ReactionT5v2-retrosynthesis-USPTO_50k")
model = AutoModelForSeq2SeqLM.from_pretrained("sagawa/ReactionT5v2-retrosynthesis-USPTO_50k")

# Encode the product SMILES as PyTorch tensors.
inp = tokenizer('CCN(CC)CCNC(=S)NC1CCCc2cc(C)cnc21', return_tensors='pt')
# Greedy decoding (num_beams=1) returns a single predicted set of reactants.
output = model.generate(**inp, num_beams=1, num_return_sequences=1, return_dict_in_generate=True, output_scores=True)
# Decode, then remove the spaces between tokens and any trailing '.' separator.
output = tokenizer.decode(output['sequences'][0], skip_special_tokens=True).replace(' ', '').rstrip('.')
output # 'CCN(CC)CCN=C=S.Cc1cnc2c(c1)CCCC2N'
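
The example above returns a single greedy prediction, whereas the Top-k figures under Results presuppose several candidates per product. One way to obtain them is beam search with `num_return_sequences`. The following is a minimal sketch, assuming the `tokenizer` and `model` objects loaded above; `predict_top_k` is a hypothetical helper, not part of the repository, and the beam width used by the authors is not stated in this card.

```python
from typing import List


def predict_top_k(model, tokenizer, product_smiles: str, k: int = 5) -> List[str]:
    """Return k candidate reactant strings for one product SMILES.

    Hypothetical helper: `model` and `tokenizer` are the objects loaded above.
    """
    inputs = tokenizer(product_smiles, return_tensors="pt")
    # Beam search keeps k hypotheses; num_return_sequences must not exceed num_beams.
    sequences = model.generate(**inputs, num_beams=k, num_return_sequences=k)
    decoded = tokenizer.batch_decode(sequences, skip_special_tokens=True)
    # Same clean-up as the single-prediction example: drop spaces and trailing '.'.
    return [text.replace(" ", "").rstrip(".") for text in decoded]
```

Calling `predict_top_k(model, tokenizer, 'CCN(CC)CCNC(=S)NC1CCCc2cc(C)cnc21', k=5)` would then yield five candidates ordered by beam score. Beam search can return duplicate or chemically invalid strings, so downstream filtering may still be needed.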

Training Details

Training Procedure

We fine-tuned the model on the USPTO_50k dataset with the command below. For more information, please refer to the paper and the GitHub repository.

cd task_retrosynthesis
python finetune.py \
    --output_dir='t5' \
    --epochs=20 \
    --lr=2e-5 \
    --batch_size=32 \
    --input_max_len=150 \
    --target_max_len=150 \
    --weight_decay=0.01 \
    --evaluation_strategy='epoch' \
    --save_strategy='epoch' \
    --logging_strategy='epoch' \
    --save_total_limit=10 \
    --train_data_path='../data/USPTO_50k/train.csv' \
    --valid_data_path='../data/USPTO_50k/val.csv' \
    --disable_tqdm \
    --model_name_or_path='sagawa/ReactionT5v2-retrosynthesis'

Results

Model	Training set	Test set	Top-1 [% acc.]	Top-2 [% acc.]	Top-3 [% acc.]	Top-5 [% acc.]
Sequence-to-sequence	USPTO_50k	USPTO_50k	37.4	-	52.4	57.0
Molecular Transformer	USPTO_50k	USPTO_50k	43.5	-	60.5	-
SCROP	USPTO_50k	USPTO_50k	43.7	-	60.0	65.2
T5Chem	USPTO_50k	USPTO_50k	46.5	-	64.4	70.5
CompoundT5	USPTO_50k	USPTO_50k	44.2	55.2	61.4	67.3
ReactionT5	-	USPTO_50k	13.8	18.6	21.4	26.2
ReactionT5 (This model)	USPTO_50k	USPTO_50k	71.2	81.4	84.9	88.2

Performance comparison of CompoundT5, ReactionT5, and other models in retrosynthesis prediction.
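
Top-k accuracy counts a test reaction as correct when the reference reactants appear among the first k predictions. The sketch below illustrates the metric under a simplifying assumption: predictions and references are compared as exact strings, whereas a real evaluation would typically canonicalize the SMILES first. `top_k_accuracy` is a hypothetical helper and the data are toy values, not taken from USPTO_50k.

```python
from typing import Sequence


def top_k_accuracy(predictions: Sequence[Sequence[str]], targets: Sequence[str], k: int) -> float:
    """Percentage of targets found among the first k predictions for that example.

    Assumption: exact string comparison, with no SMILES canonicalization.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(target in list(preds)[:k] for preds, target in zip(predictions, targets))
    return 100.0 * hits / len(targets)


# Toy data: three products, two ranked candidates each.
preds = [["CCO", "CCN"], ["CC=O", "CCO"], ["c1ccccc1", "CCC"]]
gold = ["CCO", "CCO", "CCN"]
print(top_k_accuracy(preds, gold, k=1))  # one of three references is ranked first
print(top_k_accuracy(preds, gold, k=2))  # two of three appear within the top two
```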

Citation

arXiv link: https://arxiv.org/abs/2311.06708

@misc{sagawa2023reactiont5,  
      title={ReactionT5: a large-scale pre-trained model towards application of limited reaction data}, 
      author={Tatsuya Sagawa and Ryosuke Kojima},  
      year={2023},  
      eprint={2311.06708},  
      archivePrefix={arXiv},  
      primaryClass={physics.chem-ph}  
}