Preference Dataset
Introduction
XTuner's Reward Model, as well as DPO, ORPO, and other algorithms that rely on preference data, all use the same data format. Each training sample in a preference dataset must contain the following three fields: prompt, chosen, and rejected. The value of each field follows the OpenAI chat message format. A concrete example is shown below:
{
  "prompt": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Who won the world series in 2020?"
    },
    {
      "role": "assistant",
      "content": "The Los Angeles Dodgers won the World Series in 2020."
    },
    {
      "role": "user",
      "content": "Where was it played?"
    }
  ],
  "chosen": [
    {
      "role": "assistant",
      "content": "The 2020 World Series was played at Globe Life Field in Arlington, Texas."
    }
  ],
  "rejected": [
    {
      "role": "assistant",
      "content": "I don't know."
    }
  ]
}
When training a Reward Model or running DPO training, xtuner processes the preference dataset into different training labels depending on the type of training task.
For Reward Model training, we follow ChatGPT's training approach: a special <|reward|> token is appended to the end of the conversation, and the loss is computed only on the logits output at that token. For DPO-family algorithms, the prompt tokens are masked out and the loss is computed only on the chosen and rejected response parts. In the config file, the type of the dataset is controlled by the two fields is_reward and is_dpo in the dataset.
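As a rough sketch (the exact placement of the two flags is an assumption here; compare against the full config snippets later on this page), switching between the two training modes might look like this:

train_dataset = dict(
    type=build_preference_dataset,
    dataset=dict(
        type=load_jsonl_dataset,
        data_files=['/your/jsonl/path/here.jsonl']),
    is_dpo=True,      # DPO-style training: loss only on the chosen/rejected responses
    is_reward=False,  # set is_reward=True (and is_dpo=False) for Reward Model training
)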
Training with Custom Data
Once you understand the format of a single sample, you only need to save the converted data in JSON Lines (jsonl) format, where each line of the file is one sample.
{"prompt": [{"content": "How can I learn to better regulate my emotions as a grown adult?", "role": "user"}], "chosen": [{"content": "You could do emotional regulation exercises with me, and try to notice and practice ways that you could better control your emotions in real life.", "role": "assistant"}], "rejected": [{"content": "Do you mean how can you learn to be more emotionally mature?", "role": "assistant"}]}
{"prompt": [{"content": "What are some interesting things I can add to a grilled cheese sandwich?", "role": "user"}, {"content": "A tomato, a slice of pickle, some chopped pickles, some chopped green onions. Also, some mayonnaise if you like.", "role": "assistant"}, {"content": "Sounds great. Anything else?", "role": "user"}], "chosen": [{"content": "If you’d like something a little different, I recommend grating some horseradish on top of the sandwich.", "role": "assistant"}], "rejected": [{"content": "Maybe some nice fresh basil?", "role": "assistant"}]}
{"prompt": [{"content": "How do I bob for apples?", "role": "user"}], "chosen": [{"content": "Ah! Do you mean the game of “applesauce,” or do you mean the idea of “to bob for apples”?", "role": "assistant"}], "rejected": [{"content": "Sorry, I don’t know that term.", "role": "assistant"}]}
......
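If your converted samples live in a Python list, a minimal sketch of writing them out as JSON Lines (the file name and the samples variable are hypothetical) is:

import json

# samples: a list of dicts, each containing "prompt", "chosen" and "rejected"
# in the OpenAI chat message format described above (hypothetical variable).
samples = [...]
with open('my_preference_data.jsonl', 'w', encoding='utf-8') as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + '\n')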
After preparing your custom dataset, fill the path(s) of the saved data into the data_files field of the config file; multiple jsonl files can be loaded for training at the same time.
#######################################################################
#                      PART 3  Dataset & Dataloader                   #
#######################################################################
train_dataset = dict(
    type=build_preference_dataset,
    dataset=dict(
        type=load_jsonl_dataset,
        data_files=[
            '/your/jsonl/path/here.jsonl',
            '/your/another/jsonl/path/here.jsonl'
        ]),
)
Training with Open-Source Datasets
As with configuring SFT data in XTuner, when using an open-source dataset from Hugging Face, you only need to define a mapping function map_fn that converts the open-source dataset format into XTuner's data format.
Here we take Intel/orca_dpo_pairs as an example. This dataset has four fields: system, question, chosen, and rejected, and each field's value is plain text rather than the OpenAI chat message format. We therefore need to define a map_fn for this dataset:
def intel_orca_dpo_map_fn(example):
    prompt = [{
        'role': 'system',
        'content': example['system']
    }, {
        'role': 'user',
        'content': example['question']
    }]
    chosen = [{'role': 'assistant', 'content': example['chosen']}]
    rejected = [{'role': 'assistant', 'content': example['rejected']}]
    return {'prompt': prompt, 'chosen': chosen, 'rejected': rejected}
As the code shows, intel_orca_dpo_map_fn processes the four fields of the original data and converts them into the three fields prompt, chosen, and rejected, each in the OpenAI chat message format, which keeps the subsequent data-processing pipeline unified.
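For example, applying the function to a single (hypothetical, shortened) record illustrates the conversion:

# A made-up raw record with the same four fields as Intel/orca_dpo_pairs.
example = {
    'system': 'You are a helpful assistant.',
    'question': 'Who won the world series in 2020?',
    'chosen': 'The Los Angeles Dodgers won the World Series in 2020.',
    'rejected': "I don't know.",
}
print(intel_orca_dpo_map_fn(example))
# {'prompt': [{'role': 'system', 'content': 'You are a helpful assistant.'},
#             {'role': 'user', 'content': 'Who won the world series in 2020?'}],
#  'chosen': [{'role': 'assistant', 'content': 'The Los Angeles Dodgers won the World Series in 2020.'}],
#  'rejected': [{'role': 'assistant', 'content': "I don't know."}]}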
After defining the map_fn, you need to import it in the config file and set it in the dataset_map_fn field.
train_dataset = dict(
    type=build_preference_dataset,
    dataset=dict(
        type=load_dataset,
        path='Intel/orca_dpo_pairs'),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=intel_orca_dpo_map_fn,
)