Reason behind not using special tokens in the prompt format?
Hello, hobbyist model finetuner here. Thanks for sharing your training hyperparameters!
I was just curious whether there was a specific reason for not using dedicated special tokens for the role headers in the prompt format (such as the ones already defined in the Llama 3 tokenizer, e.g. <|start_header_id|>, etc.)?
It appears that the <|system|>, <|user|>, and <|assistant|> headers used in the prompt format are not defined as special tokens, so in theory they could be tokenized variably into different combinations of substrings during training/inference.
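For example (a minimal sketch using the Hugging Face `transformers` tokenizer; the checkpoint name is just for illustration, and the exact splits depend on the vocab):

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any Llama 3 tokenizer behaves the same way here.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# A defined special token always encodes atomically to a single ID
# (<|start_header_id|> is one token in the stock Llama 3 vocab).
print(tok.encode("<|start_header_id|>", add_special_tokens=False))

# An undeclared header string is just ordinary text: BPE splits it into
# subword pieces, and the split can shift with the surrounding characters.
print(tok.tokenize("<|user|>"))
print(tok.tokenize("\n<|user|>\n"))
```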
From the paper it seems like some empirical testing of the prompt format was done - was this also attempted with the tokens above defined as special?
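For context, the variant I mean would look roughly like this (a hypothetical sketch using standard `transformers` APIs, not a claim about how the training code is set up):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the actual training setup may differ.
name = "meta-llama/Meta-Llama-3-8B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Register the role headers as dedicated special tokens so they always
# encode atomically instead of being split into subwords.
tok.add_special_tokens(
    {"additional_special_tokens": ["<|system|>", "<|user|>", "<|assistant|>"]}
)
# Grow the embedding matrix to cover the new IDs; the new rows are
# randomly initialized and would need to be learned during finetuning.
model.resize_token_embeddings(len(tok))
```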
I just found out about this and I'm curious as well.
@Doctor-Shotgun and @sszymczyk -- it's because we hard-set the chat template in open-instruct to be the same for every model. It's not necessarily optimal, but it's a simple approach we've been using for a few years, since the goal of our efforts is to make it easy to translate recipes and code to OLMo.
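For illustration, hard-setting one template for every model looks roughly like this (a sketch only; the exact Jinja string used in open-instruct may differ):

```python
from transformers import AutoTokenizer

# One fixed template, applied no matter which base model is loaded.
FIXED_TEMPLATE = (
    "{% for message in messages %}"
    "{{ '<|' + message['role'] + '|>\n' + message['content'] + '\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|assistant|>\n' }}{% endif %}"
)

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tok.chat_template = FIXED_TEMPLATE

messages = [{"role": "user", "content": "Hello!"}]
# Renders: <|user|>\nHello!\n<|assistant|>\n
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```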