InternLM2-1.8B-Reward is a reward model trained on top of InternLM2-Chat-1.8B-SFT. It was trained on more than 2.4 million preference samples, both human-annotated and AI-synthesized, and achieves strong performance while balancing helpfulness and harmlessness.

Key Features:

  • Variety of Sizes Available: Our open-sourced reward models are available in sizes of 1.8B, 7B, and 20B, each performing strongly across various metrics. We hope that releasing models of different sizes will facilitate research on the scaling laws of reward models and provide useful insights to the community.
  • Comprehensive Coverage of Preference: The models were trained on 2.4 million preference pairs derived from both human annotation and AI synthesis, covering diverse areas such as dialogue, writing, poetry, summarization, coding, and mathematics. Training also maintains a balance between helpfulness and harmlessness.
  • Multilingual Support: InternLM2-Reward was trained on high-quality English and Chinese preference data, delivering robust performance in both languages.

This model was used in the RLHF training process of InternLM2-Chat. The reward model training techniques from the InternLM2 Technical Report have been open-sourced in XTuner, where you can try them out.

Performance Evaluation on RewardBench

Models                  Score  Chat  Chat Hard  Safety  Reasoning
InternLM2-20B-Reward    89.5   98.6  74.1       89.4    95.7
InternLM2-7B-Reward     86.6   98.6  66.7       88.3    92.8
InternLM2-1.8B-Reward   80.6   95.0  58.1       81.8    87.4

  • The evaluation is conducted on the RewardBench dataset.
  • For a fair comparison, the conditional system prompts proposed in our technical report were not included during testing.
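
As a reading aid for the table, here is a minimal sketch of how such numbers are typically produced, assuming the usual RewardBench setup in which each test item pairs a chosen response with a rejected one and the reward model is counted correct when it scores the chosen response higher. The helper `pairwise_accuracy` and the example scores are hypothetical and are not part of the released evaluation code. The second half checks an observation about the table itself: for the three rows above, the overall Score matches the unweighted mean of the four category columns to within rounding.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def pairwise_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Percentage of (chosen_score, rejected_score) pairs ranked correctly."""
    if not pairs:
        raise ValueError("need at least one scored pair")
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return 100.0 * correct / len(pairs)


# Hypothetical reward scores, for illustration only (not real RewardBench data).
example_pairs = [(0.77, -2.22), (1.50, 0.30), (-0.40, 0.10), (2.10, 1.90)]
accuracy = pairwise_accuracy(example_pairs)
print(f"pairwise accuracy: {accuracy:.1f}")

# Values copied from the table above: (Score, [Chat, Chat Hard, Safety, Reasoning]).
table: Dict[str, Tuple[float, List[float]]] = {
    "InternLM2-20B-Reward": (89.5, [98.6, 74.1, 89.4, 95.7]),
    "InternLM2-7B-Reward": (86.6, [98.6, 66.7, 88.3, 92.8]),
    "InternLM2-1.8B-Reward": (80.6, [95.0, 58.1, 81.8, 87.4]),
}
for name, (score, categories) in table.items():
    print(f"{name}: reported {score}, category mean {mean(categories):.2f}")
```

The Chat Hard column shows the widest spread across model sizes, so it is the main driver of the difference in overall Score.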

Demo Code

Basic Usage

We provide some user-friendly APIs for you to use the model. Here is an example of how to use the model to get the reward score of a chat, compare two chats, or rank multiple chats.

import torch
from transformers import AutoModel, AutoTokenizer

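# The loading arguments below are a typical configuration (an assumption);
# adjust the device and dtype to match your hardware.
model = AutoModel.from_pretrained(
    "internlm/internlm2-1_8b-reward",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-1_8b-reward", trust_remote_code=True)

chat_1 = [
    {"role": "user", "content": "Hello! What's your name?"},
    {"role": "assistant", "content": "My name is InternLM2! A helpful AI assistant. What can I do for you?"}
]
chat_2 = [
    {"role": "user", "content": "Hello! What's your name?"},
    {"role": "assistant", "content": "I have no idea."}
]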
# get reward score for a single chat
score1 = model.get_score(tokenizer, chat_1)
score2 = model.get_score(tokenizer, chat_2)
print("score1: ", score1)
print("score2: ", score2)
# >>> score1:  0.767578125
# >>> score2:  -2.22265625

# batch inference, get multiple scores at once
scores = model.get_scores(tokenizer, [chat_1, chat_2])
print("scores: ", scores)
# >>> scores:  [0.767578125, -2.22265625]

# compare whether chat_1 is better than chat_2
compare_res = model.compare(tokenizer, chat_1, chat_2)
print("compare_res: ", compare_res)
# >>> compare_res:  True

# rank multiple chats, it will return the ranking index of each chat
# the chat with the highest score will have ranking index as 0 
rank_res = model.rank(tokenizer, [chat_1, chat_2])
print("rank_res: ", rank_res)  # lower index means higher score
# >>> rank_res:  [0, 1]  
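
To make the return values of `compare` and `rank` concrete, the sketch below derives them from plain scores without loading the model. It only illustrates the behavior described in the comments above (the highest score gets ranking index 0). The helpers `is_better` and `rank_from_scores` are hypothetical and are not the model's own implementation.

```python
from typing import List, Sequence


def is_better(score_a: float, score_b: float) -> bool:
    """True when the first chat received the higher reward score."""
    return score_a > score_b


def rank_from_scores(scores: Sequence[float]) -> List[int]:
    """Return the ranking index of each item; the highest score gets index 0."""
    # Positions of the items, ordered from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, position in enumerate(order):
        ranks[position] = rank
    return ranks


# The two scores printed in the example above.
scores = [0.767578125, -2.22265625]
print(is_better(scores[0], scores[1]))     # True
print(rank_from_scores(scores))            # [0, 1]
print(rank_from_scores([-1.0, 3.0, 0.5]))  # [2, 0, 1]
```

Note that the result is aligned with the input order: entry i is the rank of chat i, not the index of the i-th best chat.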

Best of N Sampling

Here is an example of how to use the reward model to perform best of N sampling. The code below demonstrates how to select the best response from the candidates generated by the language model.

import torch
from transformers import AutoModel, AutoTokenizer

# prepare the llm model and tokenizer
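# (loading arguments are a typical configuration and an assumption; adjust to your hardware)
llm = AutoModel.from_pretrained(
    "internlm/internlm2-chat-7b",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
llm_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)

# prepare the reward model and tokenizer
reward = AutoModel.from_pretrained(
    "internlm/internlm2-1_8b-reward",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
reward_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-1_8b-reward", trust_remote_code=True)

# prepare the chat prompt
prompt = "Write an article about the artificial intelligence revolution."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = llm_tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = llm_tokenizer([text], return_tensors="pt").to("cuda")

# generate best of N candidates
num_candidates = 10  # N=10
candidates = []

# sampling settings below are illustrative values, not tuned recommendations
outputs = llm.generate(
    **model_inputs,
    max_new_tokens=512,
    num_return_sequences=num_candidates,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)
# keep only the newly generated tokens by dropping the prompt tokens
outputs = outputs[:, model_inputs["input_ids"].shape[1]:]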
for i in range(num_candidates):
    candidate = llm_tokenizer.decode(outputs[i], skip_special_tokens=True)
    candidates.append(messages + [{"role": "assistant", "content": candidate}])

rank_indices = reward.rank(reward_tokenizer, candidates)
sorted_candidates = sorted(zip(rank_indices, candidates), key=lambda x: x[0])

## print the ranked candidates
# for i, (rank_index, candidate) in enumerate(sorted_candidates):
#     print(f"------------Rank {i}------------: \n{candidate[-1]['content']}")

# print the best response (rank index 0 is the highest-scoring candidate)
best_response = sorted_candidates[0][1][-1]['content']
print(best_response)
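
The selection step of best of N sampling can also be sketched without a GPU. The block below is a minimal illustration under stated assumptions: `best_of_n` and `toy_score` are hypothetical names, and `toy_score` is a deliberately crude stand-in for a reward model (it simply counts words). In practice the scoring function would wrap a call such as `reward.get_score(reward_tokenizer, chat)` from the example above.

```python
from typing import Callable, List, Sequence, Tuple


def best_of_n(
    candidates: Sequence[str],
    score_fn: Callable[[str], float],
) -> Tuple[str, List[float]]:
    """Score every candidate and return the highest-scoring one with all scores."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [score_fn(candidate) for candidate in candidates]
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index], scores


def toy_score(response: str) -> float:
    """Hypothetical stand-in for a reward model: longer answers score higher."""
    return float(len(response.split()))


candidates = [
    "I have no idea.",
    "AI is reshaping industries from healthcare to logistics.",
    "AI.",
]
best, scores = best_of_n(candidates, toy_score)
print("scores:", scores)
print("best:", best)
```

Taking the maximum directly avoids sorting all candidates when only the best response is needed. The full ranking, as in the example above, is useful when the runner-up responses are also of interest.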

Open Source License

The code is licensed under Apache-2.0. The model weights are fully open for academic research and also allow free commercial use. To apply for a commercial license, please fill in the application form (English) / application form (Chinese). For other questions or collaborations, please contact [email protected].


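InternLM2-1.8B-Reward is a reward model trained on top of InternLM2-Chat-1.8B-SFT. It was trained on more than 2.4 million human-annotated and AI-synthesized preference samples covering dialogue, writing, poetry, summarization, coding, mathematics, and other domains. It achieves strong performance while balancing helpfulness and safety preferences.

Key features of InternLM2-Reward:

  • Variety of sizes available: Our open-sourced reward models come in three sizes, 1.8B, 7B, and 20B, each performing strongly. We hope these different-sized models will encourage community research on the scaling laws of reward models.
  • Comprehensive coverage of preference: The models were trained on 2.4 million preference samples from human annotation and AI synthesis, spanning dialogue, writing, poetry, summarization, coding, mathematics, and other domains, while maintaining a balance between helpfulness and safety preferences.
  • Multilingual support: InternLM2-Reward was trained on high-quality English and Chinese preference data, giving robust performance in both languages.

This model was used in the PPO training process of InternLM2-Chat. The reward model training techniques proposed in our technical report have been released in XTuner, where you can try them out.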

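Performance Evaluation on RewardBench

Models                  Score  Chat  Chat Hard  Safety  Reasoning
InternLM2-20B-Reward    89.5   98.6  74.1       89.4    95.7
InternLM2-7B-Reward     86.6   98.6  66.7       88.3    92.8
InternLM2-1.8B-Reward   80.6   95.0  58.1       81.8    87.4

  • The evaluation was conducted on the RewardBench dataset.
  • For a fair comparison, the "conditional system prompts" proposed in our technical report were not used during testing.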



We provide some user-friendly APIs for using the model. The examples below show how to use InternLM2-Reward to get the reward score of a chat, compare two chats, or rank multiple chats.

import torch
from transformers import AutoModel, AutoTokenizer

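# The loading arguments below are a typical configuration (an assumption);
# adjust the device and dtype to match your hardware.
model = AutoModel.from_pretrained(
    "internlm/internlm2-1_8b-reward",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-1_8b-reward", trust_remote_code=True)

chat_1 = [
    {"role": "user", "content": "Hello! What's your name?"},
    {"role": "assistant", "content": "My name is InternLM2! A helpful AI assistant. What can I do for you?"}
]
chat_2 = [
    {"role": "user", "content": "Hello! What's your name?"},
    {"role": "assistant", "content": "I have no idea."}
]

# get the reward score for a single chat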
score1 = model.get_score(tokenizer, chat_1)
score2 = model.get_score(tokenizer, chat_2)
print("score1: ", score1)
print("score2: ", score2)
# >>> score1:  0.767578125
# >>> score2:  -2.22265625

# batch inference, get multiple scores at once
scores = model.get_scores(tokenizer, [chat_1, chat_2])
print("scores: ", scores)
# >>> scores:  [0.767578125, -2.22265625]

# compare whether chat_1 is better than chat_2
compare_res = model.compare(tokenizer, chat_1, chat_2)
print("compare_res: ", compare_res)
# >>> compare_res:  True

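# rank multiple chats; this returns the ranking index of each chat
# the chat with the highest score has ranking index 0
rank_res = model.rank(tokenizer, [chat_1, chat_2])
print("rank_res: ", rank_res)  # a lower ranking index means a higher score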
# >>> rank_res:  [0, 1]  

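Best of N Sampling

Here is an example of how to use InternLM2-Reward to perform best of N sampling. The code below demonstrates how to select the best response from the candidates generated by the language model.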

import torch
from transformers import AutoModel, AutoTokenizer

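# prepare the language model and tokenizer
# (loading arguments are a typical configuration and an assumption; adjust to your hardware)
llm = AutoModel.from_pretrained(
    "internlm/internlm2-chat-7b",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
llm_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)

# prepare the reward model and tokenizer
reward = AutoModel.from_pretrained(
    "internlm/internlm2-1_8b-reward",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
reward_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-1_8b-reward", trust_remote_code=True)

# prepare the chat prompt
prompt = "Write an article about the artificial intelligence revolution."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = llm_tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = llm_tokenizer([text], return_tensors="pt").to("cuda")

# generate N candidates
num_candidates = 10  # N=10
candidates = []

# sampling settings below are illustrative values, not tuned recommendations
outputs = llm.generate(
    **model_inputs,
    max_new_tokens=512,
    num_return_sequences=num_candidates,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)
# keep only the newly generated tokens by dropping the prompt tokens
outputs = outputs[:, model_inputs["input_ids"].shape[1]:]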

for i in range(num_candidates):
    candidate = llm_tokenizer.decode(outputs[i], skip_special_tokens=True)
    candidates.append(messages + [{"role": "assistant", "content": candidate}])

rank_indices = reward.rank(reward_tokenizer, candidates)
sorted_candidates = sorted(zip(rank_indices, candidates), key=lambda x: x[0])

## print the ranked candidates
# for i, (rank_index, candidate) in enumerate(sorted_candidates):
#     print(f"------------Rank {i}------------: \n{candidate[-1]['content']}")

# print the best response (rank index 0 is the highest-scoring candidate)
best_response = sorted_candidates[0][1][-1]['content']
print(best_response)


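The code in this repository is open-sourced under the Apache-2.0 license. The model weights are fully open for academic research, and a free commercial-use license can also be applied for (application form). For other questions or collaborations, please contact [email protected].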

