arxiv:2502.11330

System Message Generation for User Preferences using Open-Source Models

Published on Feb 17

· Submitted by

Minbyul on Feb 18

Upvote

Authors:

Minbyul Jeong ,

Jungho Cho ,

Dawoon Jung ,

Teakgyu Hong

Abstract

System messages play a crucial role in interactions with large language models (LLMs), often serving as prompts to initiate conversations. Through system messages, users can assign specific roles, perform intended tasks, incorporate background information, specify various output formats and communication styles. Despite such versatility, publicly available data are often lack system messages and subject to strict license constraints in the industry field. Manual labeling of publicly available data with system messages that align with user instructions demands significant resources. In view of such challenges, our work introduces SysGen, a pipeline for generating system messages with better aligned assistant responses from the supervised fine-tuning dataset without system messages. Training on SysGen data has demonstrated substantial improvements in the alignment of model responses with system messages and user instructions, as demonstrated across various open-source models on the Multifacet benchmark, while maintaining minimal impact on other unseen benchmarks such as Open LLM Leaderboard 2. Our qualitative analysis highlights the importance of diverse system messages to ensure better adaptability across different contexts.

View arXiv page View PDF Add to collection

Community

Minbyul

Paper author Paper submitter 4 days ago

System messages, also known as initial prompt, play a crucial role in steering LLM behaviors during conversations, shaping their behavior by providing roles, background information, task instructions, and communication styles.

※ Despite their versatility, publicly available datasets rarely include system messages, and those that do often contain only generic ones (e.g., You are a helpful AI assistant). Moreover, in the industry, licensing constraints limit the use of existing datasets and models, making it challenging to develop well-aligned AI assistants.

» To address these challenges, we introduce SYSGEN—a novel pipeline for generating system messages from supervised fine-tuning (SFT) datasets without system messages.
Open-source models fine-tuned with SYSGEN data produce well-aligned assistant responses with both system messages and user instructions, while minimizing performance degradation on the unseen benchmark.

If you’re already leveraging the SFT datasets, we highly recommend applying SYSGEN on top of it for enhanced alignment and usability.