# Regularized-Preference-Optimization

A collection of 4 models trained in https://github.com/YSLIU627/Regularized-Preference-Optimization
This model is a fine-tuned version of HuggingFaceH4/zephyr-7b-gemma-sft-v0.1 on the argilla/dpo-mix-7k dataset. It achieves the following results on the evaluation set:
## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed
## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

More information needed

### Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | Eta |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5915 | 0.9479 | 50 | 0.5745 | -0.7212 | -1.8546 | 0.7021 | 1.1334 | -458.1599 | -408.3000 | 171.6370 | 174.0368 | 0.0050 |
| 0.2599 | 1.8957 | 100 | 0.5899 | -0.7561 | -2.1676 | 0.7234 | 1.4115 | -464.4193 | -408.9966 | 161.6372 | 163.3166 | 0.0050 |
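The reward columns in the table above follow the usual DPO convention: the implicit reward of a response is the policy-vs-reference log-probability gap scaled by a temperature, rewards/margins is the chosen-minus-rejected reward, and rewards/accuracies is the fraction of pairs where the chosen response gets the higher reward (the Eta column is presumably the method's regularization weight and is not modeled here). A minimal sketch of how these metrics relate, assuming summed per-sequence log-probs as inputs; the function name and the beta value are illustrative, not taken from this repository:

```python
import math

def dpo_eval_metrics(policy_logps_chosen, policy_logps_rejected,
                     ref_logps_chosen, ref_logps_rejected, beta=0.05):
    """Compute DPO-style evaluation metrics from summed sequence log-probs.

    The implicit reward of a response y is beta * (log pi(y|x) - log pi_ref(y|x)).
    """
    n = len(policy_logps_chosen)
    rewards_chosen = [beta * (p - r)
                      for p, r in zip(policy_logps_chosen, ref_logps_chosen)]
    rewards_rejected = [beta * (p - r)
                        for p, r in zip(policy_logps_rejected, ref_logps_rejected)]
    # rewards/margins: per-pair gap between chosen and rejected implicit rewards
    margins = [c - j for c, j in zip(rewards_chosen, rewards_rejected)]
    # rewards/accuracies: fraction of pairs where the chosen response wins
    accuracy = sum(m > 0 for m in margins) / n
    # per-pair DPO loss: -log(sigmoid(margin))
    losses = [-math.log(1.0 / (1.0 + math.exp(-m))) for m in margins]
    return {
        "rewards/chosen": sum(rewards_chosen) / n,
        "rewards/rejected": sum(rewards_rejected) / n,
        "rewards/margins": sum(margins) / n,
        "rewards/accuracies": accuracy,
        "loss": sum(losses) / n,
    }
```

For example, a pair where the policy raises the chosen log-prob and lowers the rejected one relative to the reference yields a positive margin and an accuracy of 1.0 for that pair, which is the pattern the table's positive rewards/margins and ~0.7 rewards/accuracies reflect in aggregate.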
Base model: google/gemma-7b