Training method?

#3
by noamgat - opened

Hi,
I was wondering what training method you used to train the model; I couldn't find it specified anywhere.
The model architecture is a simple linear layer that translates from the hidden dimension to 1 (reward score), correct?
If so, what loss did you use? Did you use a regression loss that pushes accepted -> 1 and rejected -> 0, or did you just maximize the margin between the accepted and rejected scores (something like minimizing sigmoid(rejected_score - accepted_score))?
Are there any details on this?

Skywork org

Hi,

We are currently preparing our technical report, which will be released soon.

Regarding the questions above:

  1. Yes, the last layer is a linear transformation from dimension D to 1. Here, D represents the dimension of the last token's hidden state in the penultimate layer.
  2. We use the standard Bradley-Terry model (i.e., binary ranking loss) for reward modeling.
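For concreteness, the setup described above can be sketched as follows. This is a minimal illustration, not the actual Skywork training code: the linear head maps the last token's penultimate-layer hidden state (dimension D) to a scalar reward, and the Bradley-Terry binary ranking loss is -log sigmoid(r_chosen - r_rejected). The use of numpy and all variable names here are my own assumptions for the sketch.

```python
import numpy as np

def reward_head(h, w, b):
    # Hypothetical linear reward head: hidden state (D,) -> scalar reward.
    # In the real model this is a trained nn.Linear(D, 1) on top of the LM.
    return float(h @ w + b)

def bt_loss(r_chosen, r_rejected):
    # Bradley-Terry binary ranking loss: -log sigmoid(r_chosen - r_rejected).
    # Minimized when the chosen response's reward exceeds the rejected one's.
    margin = r_chosen - r_rejected
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))

# Toy usage with random hidden states (D = 8 is arbitrary).
rng = np.random.default_rng(0)
D = 8
w, b = rng.normal(size=D), 0.0
h_chosen, h_rejected = rng.normal(size=D), rng.normal(size=D)
loss = bt_loss(reward_head(h_chosen, w, b), reward_head(h_rejected, w, b))
```

Note that the loss at a zero margin is log 2, and it decreases monotonically as the chosen reward pulls ahead of the rejected one.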
chrisliu298 changed discussion status to closed
