Axolotl prompt format (sharegpt, chatml) could differ from yours
Hi @teknium,
I believe you trained using Axolotl with this dataset config:
datasets:
  - path: /data/chat_data/full_dataset_chat.jsonl
    type: sharegpt
    conversation: chatml
dataset_prepared_path: last_run_prepared
Did you realise that Axolotl actually adds an extra line break (somehow), so the end-of-turn marker becomes <|im_end|>\n\n? Or did you create your own custom dataset and dataloader? I hope to see a release of your configuration file and dataset format.
I found this by stepping through the repo with a debugger: the last few label tokens are always [....., 28766, 321, 28730, 416, 28766, 28767, 13, 13, 2], which decodes to <|im_end|>\n\n</s>.
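For anyone who wants to reproduce the check, here is a minimal sketch. The checkpoint name is only an assumption for illustration (the thread doesn't confirm which tokenizer was used); swap in whichever one you actually trained with:

```python
# Sketch: decode the observed trailing label ids to make the extra "\n" visible.
# The checkpoint name below is an assumption, not confirmed in this thread.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")
tail_ids = [28766, 321, 28730, 416, 28766, 28767, 13, 13, 2]  # trailing label ids observed above
print(repr(tok.decode(tail_ids)))  # repr makes the double "\n" before </s> explicit
```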
The issue could be here (extra \n in sep): https://github.com/OpenAccess-AI-Collective/axolotl/blob/a48dbf6561cc74c275a48070f397334a2c367dd5/src/axolotl/prompt_strategies/sharegpt.py#L16
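To illustrate how a separator like that could produce the doubled newline, here is a hypothetical reconstruction (not Axolotl's actual code): if each chatml turn already ends with "<|im_end|>\n" and the conversation template then appends a sep of "\n", the rendered text ends with "<|im_end|>\n\n":

```python
# Hypothetical reconstruction of the formatting step, for illustration only.
turn = "<|im_start|>assistant\nHello!<|im_end|>\n"  # chatml turn already ends with "\n"
sep = "\n"  # the extra separator flagged at sharegpt.py#L16
print(repr(turn + sep))  # -> '<|im_start|>assistant\nHello!<|im_end|>\n\n'
```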
I believe they changed how the chatml format is handled after this was trained.