Is RewardAnything also effective for model evaluation and comparison, like LLM-as-a-judge?
#2 opened by BILL-SUN318
Reward modeling and model evaluation are quite similar; some literature even uses RewardBench as a benchmark for model evaluation.
Yes, it could be used directly as an LLM-as-a-judge model. However, you should use it together with a prompt dataset from your desired domain, and make sure that dataset is effective, reliable, and sound. We'll include more details on this in our next version. A rough sketch of that workflow is below.
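A minimal sketch of that workflow, assuming the checkpoint loads with `transformers`; the model id, judging template, and example prompts are placeholders you'd replace with your own domain data and the model's expected input format:

```python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/your-reward-model"  # placeholder: substitute the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")

# Domain-specific prompts plus the candidate responses you want judged.
eval_set = [
    {"prompt": "Summarize this clinical note ...",
     "responses": {"model-1": "...", "model-2": "..."}},
]

for item in eval_set:
    # Hypothetical judging template -- adapt it to the input format the model expects.
    judge_input = (
        "Rate each response from 1 to 5 and answer in JSON.\n"
        f"Prompt: {item['prompt']}\n"
        f"Responses: {json.dumps(item['responses'])}\n"
        'Answer: {"scores": {'
    )
    inputs = tokenizer(judge_input, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    # Print only the newly generated judgment, not the prompt.
    print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```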
感谢回复。不过有篇论文Improving LLM-as-a-Judge Inference with the Judgment Distribution提到用用思维链(CoT)提示的llm-as-a-judge会使得使判断分布变得“尖锐”或“坍缩”,减少了分布的离散度,从而丢失了有价值的细微偏好信息 。我没在benchmark上面跑过,但是简单测试了一下您用大模型微调后的奖励模型,也同样会出现论文中提到的"scores": {"model-1": 下一个token在1,2,3,4,5上的logits会塌缩到一个具体值,使得它的prob几乎为1的情况,我把temperature调整为10才能让最高token的prob从1变为0.95,这似乎不太好解释。
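For reference, roughly how I probed this; a minimal sketch assuming the model loads with `transformers`, and the model id and prompt string are placeholders for the actual checkpoint and the judge prompt truncated right before the score token:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/your-reward-model"  # placeholder: substitute the tested checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")

# Judge prompt plus the partial output, cut right before the score token,
# e.g. ending with ... "scores": {"model-1":
prompt = '... "scores": {"model-1": '  # placeholder; use the real prompt + partial output
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    next_logits = model(**inputs).logits[0, -1]  # logits for the next token

# Token ids of the candidate score tokens "1".."5" (tokenization may differ per model).
score_ids = [tokenizer.encode(s, add_special_tokens=False)[0] for s in ["1", "2", "3", "4", "5"]]
score_logits = next_logits[score_ids]

# Compare the score distribution at temperature 1 and 10.
for T in (1.0, 10.0):
    probs = torch.softmax(score_logits / T, dim=-1)
    print(f"T={T}: " + ", ".join(f"{s}: {p:.3f}" for s, p in zip("12345", probs.tolist())))
```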