Reason behind not using special tokens in the prompt format?

#2 · opened by Doctor-Shotgun

Hello, hobbyist model finetuner here. Thanks for sharing your training hyperparameters!

I was just curious whether there was a specific reason behind not using dedicated special tokens for the role headers in the prompt format (such as the ones already defined in the Llama 3 tokenizer, i.e. <|start_header_id|> etc.)?
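For reference, Llama 3's own instruct format wraps each role name in those dedicated tokens, roughly:

```
<|start_header_id|>user<|end_header_id|>

{message}<|eot_id|>
```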

It appears that the <|system|>, <|user|>, and <|assistant|> headers used in the prompt format are not defined as special tokens, so in theory they could be tokenized into varying combinations of substrings during training/inference.
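A quick way to check this (a minimal sketch; the checkpoint name is just one example of a Llama 3-based tokenizer, and it assumes the tokenizer keeps Llama 3's reserved special tokens):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/Llama-3.1-Tulu-3-8B")

# A token defined as special encodes to a single id:
print(tok.convert_ids_to_tokens(
    tok.encode("<|start_header_id|>", add_special_tokens=False)))
# -> ['<|start_header_id|>']

# The plain-text role header is split into ordinary subword pieces:
print(tok.convert_ids_to_tokens(
    tok.encode("<|user|>", add_special_tokens=False)))
# -> e.g. ['<|', 'user', '|>'] (exact pieces depend on the vocab/merges)
```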

From the paper it seems like some empirical testing of prompt formats was done - was this also attempted with the tokens above defined as special?

I just found out about this and I'm curious as well.

@Doctor-Shotgun and @sszymczyk -- it's because we hard-set the chat template in open-instruct to be the same for every model. It's not necessarily optimal, but it's a simple approach we've been using for a few years, as the goal of our efforts is to easily translate the recipe and code to OLMo.
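For anyone reading along, here's a rough sketch of what that fixed, plain-text template looks like (illustrative only, not the exact open-instruct Jinja source; `render_tulu` is a hypothetical helper):

```python
# Hypothetical renderer for the Tulu-style template: role headers are
# ordinary text, so the same format applies to any base model's tokenizer.
def render_tulu(messages):
    out = ""
    for m in messages:
        out += f"<|{m['role']}|>\n{m['content']}\n"
    # Generation prompt: cue the model to respond as the assistant.
    return out + "<|assistant|>\n"

print(render_tulu([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
]))
```

The upside of plain-text headers is that the template ports unchanged across tokenizers and base models; the trade-off is exactly the variable tokenization raised above.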

natolambert changed discussion status to closed
