wzhouad/Llama3-Instruct-8B-WPO-HB-v2

metadata

library_name: transformers
tags: []

Description

A Llama3-Instruct-8B model fine-tuned with hybrid WPO (GPT-4-turbo + on-policy sampling + Ultrafeedback). For details, see the paper "WPO: Enhancing RLHF with Weighted Preference Optimization".

Compared with the Llama3-Instruct-8B-WPO-HB model, this version uses a refined method for constructing preference data:

- The response with the minimum score is used as the rejected one.
- When multiple outputs share the highest score, the shortest one is selected as the chosen response.
- When multiple outputs share the minimum score, the one whose length is closest to that of the chosen output is selected as the rejected response.
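
The selection rules above can be sketched in a few lines of Python. This is only an illustration of the described tie-breaking, not the authors' actual data pipeline: the `Candidate` type and the `select_pair` function are hypothetical names, length is assumed to be measured in characters (the real pipeline may count tokens), and skipping prompts where every output has the same score is an assumption as well.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Candidate(NamedTuple):
    """Hypothetical container for one sampled output and its score."""
    text: str
    score: float


def select_pair(
    candidates: Sequence[Candidate],
) -> Optional[Tuple[Candidate, Candidate]]:
    """Return (chosen, rejected) following the three rules, or None."""
    if len(candidates) < 2:
        return None

    max_score = max(c.score for c in candidates)
    min_score = min(c.score for c in candidates)
    if max_score == min_score:
        # Assumption: identical scores carry no preference signal.
        return None

    # Highest score wins; ties are broken by the shortest output.
    chosen = min(
        (c for c in candidates if c.score == max_score),
        key=lambda c: len(c.text),
    )
    # Lowest score is rejected; ties are broken by the smallest
    # length difference from the chosen output.
    rejected = min(
        (c for c in candidates if c.score == min_score),
        key=lambda c: abs(len(c.text) - len(chosen.text)),
    )
    return chosen, rejected


if __name__ == "__main__":
    outputs = [
        Candidate("aaaa", 9),
        Candidate("bb", 9),
        Candidate("cccccccc", 1),
        Candidate("ddd", 1),
        Candidate("eeeee", 5),
    ]
    pair = select_pair(outputs)
    if pair is not None:
        chosen, rejected = pair
        print(chosen.text, rejected.text)  # prints "bb ddd"
```

Preferring the shortest top-scored output and the length-matched bottom-scored output keeps the chosen and rejected responses similar in length, which plausibly limits the length bias a preference pair can introduce.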

This model is licensed under the Zoom software license and is permitted for use only for noncommercial, educational, or academic research purposes.