Training method?

#3
by noamgat - opened

Hi,
I was wondering what training method you used to train the model; I couldn't find it specified anywhere.
The model architecture is a simple linear layer that translates from the hidden dimension to 1 (reward score), correct?
If so, what loss did you use? Did you use a regression loss that pushes accepted -> 1 and rejected -> 0, or did you just maximize the margin between the accepted and rejected scores (something like minimizing sigmoid(rejected_score - accepted_score))?
Are there any details on this?

Skywork org

Hi,

We are currently preparing our technical report, which will be released soon.

Regarding the questions above:

  1. Yes, the last layer is a linear transformation from dimension D to 1. Here, D represents the dimension of the last token's hidden state in the penultimate layer.
  2. We use the standard Bradley-Terry model (i.e., binary ranking loss) for reward modeling.
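For concreteness, the setup described above can be sketched as follows. This is a minimal illustration, not the actual Skywork training code: the linear head maps the last token's penultimate-layer hidden state (dimension D) to a scalar reward, and the Bradley-Terry binary ranking loss is -log sigmoid(r_chosen - r_rejected). The use of numpy and all variable names here are my own assumptions for the sketch.

```python
import numpy as np

def reward_head(h, w, b):
    # Hypothetical linear reward head: hidden state (D,) -> scalar reward.
    # In the real model this is a trained nn.Linear(D, 1) on top of the LM.
    return float(h @ w + b)

def bt_loss(r_chosen, r_rejected):
    # Bradley-Terry binary ranking loss: -log sigmoid(r_chosen - r_rejected).
    # Minimized when the chosen response's reward exceeds the rejected one's.
    margin = r_chosen - r_rejected
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))

# Toy usage with random hidden states (D = 8 is arbitrary).
rng = np.random.default_rng(0)
D = 8
w, b = rng.normal(size=D), 0.0
h_chosen, h_rejected = rng.normal(size=D), rng.normal(size=D)
loss = bt_loss(reward_head(h_chosen, w, b), reward_head(h_rejected, w, b))
```

Note that the loss at a zero margin is log 2, and it decreases monotonically as the chosen reward pulls ahead of the rejected one.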
chrisliu298 changed discussion status to closed
