wangclnlp committed · Commit 881d4fd · verified · 1 Parent(s): fca84b3

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -16,7 +16,7 @@ This repository contains the released models for the paper [GRAM: A Generative F
 
 <img src="https://raw.githubusercontent.com/wangclnlp/GRAM/refs/heads/main/gram.png" width="1000px"></img>
 
-This training process is introduced above. Traditionally, these models are trained using labeled data, which can limit their potential. In this study, we propose a new method that combines both labeled and unlabeled data for training reward models. We introduce a generative reward model that first learns from a large amount of unlabeled data and is then fine-tuned with supervised data. Additionally, we demonstrate that using label smoothing during training improves performance by optimizing a regularized ranking loss. This approach bridges generative and discriminative models, offering a new perspective on training reward models. Our model can be easily applied to various tasks without the need for extensive fine-tuning. This means that when aligning LLMs, there is no longer a need to train a reward model from scratch with large amounts of task-specific labeled data. Instead, **you can directly apply our reward model or adapt it to align your LLM based on our [code](https://github.com/wangclnlp/GRAM/tree/main)**.
+This training process is introduced above. Traditionally, these models are trained using labeled data, which can limit their potential. In this study, we propose a new method that combines both labeled and unlabeled data for training reward models. We introduce a generative reward model that first learns from a large amount of unlabeled data and is then fine-tuned with supervised data. Additionally, we demonstrate that using label smoothing during training improves performance by optimizing a regularized ranking loss. This approach bridges generative and discriminative models, offering a new perspective on training reward models. Our model can be easily applied to various tasks without the need for extensive fine-tuning. This means that when aligning LLMs, there is no longer a need to train a reward model from scratch with large amounts of task-specific labeled data. Instead, **you can directly apply our reward model or adapt it to align your LLM based on our [code](https://github.com/NiuTrans/GRAM)**.
 
 This reward model is fine-tuned from [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)
 
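
For reference, below is a minimal, unofficial sketch of how a generative reward model like this one might be queried for a pairwise preference with Hugging Face Transformers. The model ID, the judging prompt, and the "A"/"B" parsing are illustrative assumptions only; the actual prompt template and scoring code live in the [GRAM repository](https://github.com/NiuTrans/GRAM).

```python
# Minimal sketch (not the official GRAM interface): ask a generative reward
# model which of two responses it prefers by comparing next-token logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model ID; substitute the actual Hugging Face repo of this reward model.
model_id = "wangclnlp/GRAM-Llama3.2-3B-RewardModel"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def prefer(prompt: str, response_a: str, response_b: str) -> str:
    """Return 'A' or 'B' depending on which response the model scores higher."""
    # Assumed judging template; replace with the template from the GRAM repo.
    judge_prompt = (
        "Compare the two responses to the prompt and decide which is better.\n\n"
        f"Prompt: {prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response is better? Answer with 'A' or 'B'. Answer:"
    )
    inputs = tokenizer(judge_prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    a_id = tokenizer.encode("A", add_special_tokens=False)[0]
    b_id = tokenizer.encode("B", add_special_tokens=False)[0]
    return "A" if logits[a_id] > logits[b_id] else "B"

print(prefer(
    "Explain overfitting in one sentence.",
    "Overfitting is when a model memorizes training data and generalizes poorly.",
    "Overfitting is when a model is too small.",
))
```

The same pairwise judgment can be plugged into a best-of-n sampler or an RLHF/DPO-style alignment loop; the training and evaluation scripts in the linked repository show the intended end-to-end usage.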