
# Modify DPO Training Configuration

This section introduces config parameters related to DPO (Direct Preference Optimization) training. For more details on XTuner config files, please refer to Modifying Training Configuration.

## Loss Function

In DPO training, you can choose different types of loss functions according to your needs. XTuner provides various loss function options, such as `sigmoid`, `hinge`, and `ipo`. You can select the desired loss function type by setting the `dpo_loss_type` parameter.

Additionally, you can control the temperature coefficient of the loss function with the `loss_beta` parameter, and apply label smoothing with the `label_smoothing` parameter.

```python
#######################################################################
#                          PART 1  Settings                           #
#######################################################################
# Model
dpo_loss_type = 'sigmoid'  # One of ['sigmoid', 'hinge', 'ipo', 'kto_pair', 'sppo_hard', 'nca_pair', 'robust']
loss_beta = 0.1
label_smoothing = 0.0
```
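
For intuition, below is a minimal sketch of how the default `sigmoid` loss is typically computed from policy and reference log-probabilities. It illustrates the standard DPO formulation, not XTuner's internal implementation:

```python
import torch
import torch.nn.functional as F

def sigmoid_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps,
                     beta=0.1, label_smoothing=0.0):
    # Implicit reward margin: policy-vs-reference log-ratio,
    # chosen response minus rejected response.
    logits = ((policy_chosen_logps - ref_chosen_logps)
              - (policy_rejected_logps - ref_rejected_logps))
    # Standard sigmoid DPO loss; label_smoothing > 0 gives the
    # conservative (cDPO) variant that tolerates noisy labels.
    loss = (-F.logsigmoid(beta * logits) * (1 - label_smoothing)
            - F.logsigmoid(-beta * logits) * label_smoothing)
    return loss.mean()
```

Here `beta` plays the role of `loss_beta`: larger values make the loss more sensitive to the margin between chosen and rejected responses.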

## Modifying the Model

Users can modify `pretrained_model_name_or_path` to change the pretrained model.

```python
#######################################################################
#                          PART 1  Settings                           #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft'
```

## Training Data

In DPO training, you can specify the maximum number of tokens for a single sample sequence using the `max_length` parameter. XTuner will automatically truncate or pad the data to this length.

```python
# Data
max_length = 2048
```

In the configuration file, we use the `train_dataset` field to specify the training dataset. You can specify the dataset loading method using the `dataset` field and the dataset mapping function using the `dataset_map_fn` field.

```python
#######################################################################
#                      PART 3  Dataset & Dataloader                   #
#######################################################################
sampler = SequenceParallelSampler \
    if sequence_parallel_size > 1 else DefaultSampler

train_dataset = dict(
    type=build_preference_dataset,
    dataset=dict(type=load_dataset, path='mlabonne/orpo-dpo-mix-40k'),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=orpo_dpo_mix_40k_map_fn,
    is_dpo=True,
    is_reward=False,
    reward_token_id=-1,
    num_proc=32,
    use_varlen_attn=use_varlen_attn,
    max_packed_length=max_packed_length,
    shuffle_before_pack=True,
)

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=train_dataset,
    sampler=dict(type=sampler, shuffle=True),
    collate_fn=dict(
        type=preference_collate_fn, use_varlen_attn=use_varlen_attn))
```

In the above configuration, we use `load_dataset` to load the `mlabonne/orpo-dpo-mix-40k` dataset from the Hugging Face Hub and use `orpo_dpo_mix_40k_map_fn` as the dataset mapping function.
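
If your preference data lives in a local file instead, the `dataset` field can point `load_dataset` at it. The snippet below is a sketch: the file path is a placeholder, and `my_preference_map_fn` is a hypothetical custom mapping function sketched after the next paragraph.

```python
train_dataset = dict(
    type=build_preference_dataset,
    # Load a local JSON file rather than a Hugging Face Hub dataset;
    # the path is a placeholder.
    dataset=dict(
        type=load_dataset,
        path='json',
        data_files=dict(train='/path/to/preference_data.json')),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=my_preference_map_fn,  # hypothetical custom map fn
    is_dpo=True,
    is_reward=False,
    reward_token_id=-1,
    num_proc=32,
    use_varlen_attn=use_varlen_attn,
    max_packed_length=max_packed_length,
    shuffle_before_pack=True,
)
```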

For more information on handling datasets and writing dataset mapping functions, please refer to the Preference Dataset Section.
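
As a rough illustration, a mapping function converts raw dataset rows into the prompt/chosen/rejected message format consumed by the preference dataset. The raw field names below (`question`, `good_answer`, `bad_answer`) are hypothetical; consult the Preference Dataset Section for the exact expected format.

```python
def my_preference_map_fn(example):
    # Hypothetical raw schema:
    # {'question': ..., 'good_answer': ..., 'bad_answer': ...}
    return {
        'prompt': [{'role': 'user', 'content': example['question']}],
        'chosen': [{'role': 'assistant', 'content': example['good_answer']}],
        'rejected': [{'role': 'assistant', 'content': example['bad_answer']}],
    }
```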

## Accelerating Training

When training with preference data, we recommend enabling the Variable-Length Attention Mechanism to avoid the memory waste caused by length differences between the chosen and rejected samples within a single preference pair. You can enable it by setting `use_varlen_attn=True`.
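
In the config, this is a one-line change in the settings section. The `max_packed_length` value below is illustrative: it bounds the total length of a packed sequence and is commonly set to a small multiple of `max_length`.

```python
#######################################################################
#                          PART 1  Settings                           #
#######################################################################
use_varlen_attn = True
max_packed_length = max_length * 2  # illustrative packing budget
```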

XTuner also supports many training acceleration methods. For details on how to use them, please refer to the Acceleration Strategies Section.