Finetuned model generates sequences far different from those in the fine-tuning training set
#11
by atqamar
We finetuned a ZymCTRL model using EC 4.2.1.1 (carbonic anhydrase) as the context label and 131 carbonic anhydrase sequences as training data; the sequences are all highly similar and roughly 190 residues long.
However, when we generate sequences with the finetuned model using EC 4.2.1.1 as the context label, the results differ significantly from the training set: the generated sequences have an average Levenshtein distance of ~62 from the training sequences.
What adjustments can we make to obtain generated sequences more similar to those used in the fine-tuning step?
Hi atqamar,
Sorry for the late response. I am surprised this is the case. How long did you train for? Are the training curves looking good?