---
library_name: transformers
base_model: meta-llama/Llama-3.1-70B-Instruct
datasets:
- infly/INF-ORM-Preference-Magnitude-80K
pipeline_tag: text-classification
---
<div align="center">
<img src="INF.jpg" width="300"/>

🤗 <a href="https://huggingface.co/infly" target="_blank">Hugging Face</a> 
<br>
<br>
<br>
</div>

# INF Outcome Reward Model
## Introduction

[**INF-ORM-Llama3.1-70B**](https://huggingface.co/infly/INF-ORM-Llama3.1-70B) is an outcome reward model built on the [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) architecture and trained on the dataset [INF-ORM-Preference-Magnitude-80K](https://huggingface.co/datasets/infly/INF-ORM-Preference-Magnitude-80K).

We made the following three changes to improve the model's performance.
### Data Pre-processing
We trained the model on the dataset [INF-ORM-Preference-Magnitude-80K](https://huggingface.co/datasets/infly/INF-ORM-Preference-Magnitude-80K), which is derived from the **decontaminated dataset** [Skywork/Skywork-Reward-Preference-80K-v0.2](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.2).

We use GPT-4o to rate how much better the chosen answer is than the rejected answer for each pair in [Skywork/Skywork-Reward-Preference-80K-v0.2](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.2), and store the rating in a new 'Magnitude' column.

The rating follows these rules:
1. If the chosen answer is much better than the rejected answer, set the 'Magnitude' value $d$ to 3.
2. If the chosen answer is better than the rejected answer, set the 'Magnitude' value $d$ to 2.
3. If the chosen answer is slightly better than the rejected answer, set the 'Magnitude' value $d$ to 1.

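After that, we train our model with the scaled Bradley-Terry (BT) loss, defined as:
$$\mathcal{L}_{\text{Scaled-BT}} = -\alpha \cdot d \cdot \log\left(\sigma\left(r_{\theta}(x, y_{c}) - r_{\theta}(x, y_{r})\right)\right)$$
where $\alpha$ is the scaling factor, $d$ is the 'Magnitude' value, $\sigma$ is the sigmoid function, and $r_{\theta}(x, y_{c})$ and $r_{\theta}(x, y_{r})$ are the rewards the model assigns to the chosen and rejected answers for prompt $x$. See [1](https://arxiv.org/pdf/2410.01257) for more details about the scaled BT loss.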
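
The formula above can be sketched in PyTorch as follows. This is a minimal illustration, not the training code used for the model: the function name `scaled_bt_loss`, the default `alpha=1.0`, and the mean reduction over the batch are all assumptions.

```python
import torch
import torch.nn.functional as F


def scaled_bt_loss(chosen_rewards, rejected_rewards, magnitude, alpha=1.0):
    """Scaled Bradley-Terry loss, averaged over the batch.

    chosen_rewards / rejected_rewards: shape (batch,), the rewards
    r_theta(x, y_c) and r_theta(x, y_r).
    magnitude: shape (batch,), the 'Magnitude' value d of each pair.
    alpha: scaling factor (default 1.0 is an assumption).
    """
    margin = chosen_rewards - rejected_rewards
    # logsigmoid is numerically more stable than log(sigmoid(margin)).
    per_pair = -alpha * magnitude * F.logsigmoid(margin)
    return per_pair.mean()  # mean reduction is an assumption


# Toy batch of three preference pairs.
chosen = torch.tensor([2.0, 0.5, 1.0])
rejected = torch.tensor([1.0, 0.0, 1.5])
d = torch.tensor([3.0, 1.0, 2.0])
loss = scaled_bt_loss(chosen, rejected, d)
print(loss.item())
```

With $d = 1$ and $\alpha = 1$ this reduces to the standard BT loss, so a pair with equal rewards contributes $\log 2$.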

> Here we look at the performance gains of the scaled BT loss from a different perspective than [1](https://arxiv.org/pdf/2410.01257). The scaled BT loss can be viewed as a form of cross-entropy in which the distribution of the differences between the logits produced by the model is sensitive to the distribution of the magnitudes. We therefore widened the spacing of the values in the 'Magnitude' column from 1, 2, 3 to 1, 3, 10, which gave better performance.
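
The remapping of 'Magnitude' values from 1, 2, 3 to 1, 3, 10 can be illustrated with a small lookup table. The names `MAGNITUDE_REMAP` and `remap_magnitude` are hypothetical and only serve this example; how the remapping was applied in the actual training pipeline is not stated in this card.

```python
# Hypothetical helper: widen the GPT-4o ratings 1, 2, 3 to 1, 3, 10.
MAGNITUDE_REMAP = {1: 1, 2: 3, 3: 10}


def remap_magnitude(d: int) -> int:
    """Return the widened weight for an original 'Magnitude' rating."""
    return MAGNITUDE_REMAP[d]


ratings = [1, 2, 3, 2]
weights = [remap_magnitude(d) for d in ratings]
print(weights)  # [1, 3, 10, 3]
```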

### Modified Score Head
We use a modified score head in place of the original score head.
```python
        # modified score head
        self.score = nn.Sequential(
            nn.Linear(config.hidden_size, config.hidden_size),
            nn.ReLU(),
            nn.Linear(config.hidden_size, 1)
        )
        # original score head, shown for comparison
        # (commented out so it does not overwrite the modified head above)
        # self.score = nn.Linear(config.hidden_size, 1)
```
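
To make the difference concrete, the following standalone sketch builds both heads with a toy `hidden_size` of 16 (an assumption for illustration; the real model uses `config.hidden_size`) and applies them to random hidden states. Both heads map each token's hidden state to a single scalar; the modified head adds one hidden layer with a ReLU in between.

```python
import torch
import torch.nn as nn

hidden_size = 16  # toy value for illustration only

# Modified head: two linear layers with a ReLU in between.
modified_head = nn.Sequential(
    nn.Linear(hidden_size, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, 1),
)
# Original head: a single linear projection.
original_head = nn.Linear(hidden_size, 1)

hidden_states = torch.randn(2, 5, hidden_size)  # (batch, seq_len, hidden)
modified_scores = modified_head(hidden_states)
original_scores = original_head(hidden_states)
print(modified_scores.shape, original_scores.shape)


def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


print(count_params(modified_head), count_params(original_head))
```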

### Model Merge
We trained two models and merged them with a weight of $0.5$.

| Model                 | Score | Chat  | Chat Hard | Safety | Reasoning |
| --------------------- | :---: | :---: | :-------: | :----: | :-------: |
| INF-ORM-v1            | 94.3  | 96.1  |   88.2    |  94.6  |   98.2    |
| INF-ORM-v2            | 94.4  | 95.5  |   90.8    |  93.0  |   99.1    |
| INF-ORM-v3 (Averaged) | 95.1  | 96.6  |   91.0    |  93.6  |   99.1    |
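
One common way to merge two checkpoints with a weight of 0.5 is to average their parameters key by key. The sketch below assumes that interpretation; the helper `average_state_dicts` is hypothetical and the tiny `nn.Linear` modules stand in for the two trained reward models.

```python
import torch


def average_state_dicts(state_a, state_b, weight=0.5):
    """Return weight * state_a + (1 - weight) * state_b, key by key.

    Assumes both state dicts have identical keys and floating-point tensors.
    """
    if state_a.keys() != state_b.keys():
        raise ValueError("state dicts must have identical keys")
    return {
        name: weight * state_a[name] + (1.0 - weight) * state_b[name]
        for name in state_a
    }


# Stand-ins for the two trained models (same architecture, different weights).
model_a = torch.nn.Linear(4, 1)
model_b = torch.nn.Linear(4, 1)

merged = torch.nn.Linear(4, 1)
merged.load_state_dict(
    average_state_dicts(model_a.state_dict(), model_b.state_dict())
)
```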



## RewardBench Leaderboard

We evaluate our model on [RewardBench](https://huggingface.co/spaces/allenai/reward-bench) using the [official test script](https://github.com/allenai/reward-bench) locally. As of December 2024, INF-ORM-Llama3.1-70B ranks first on the RewardBench leaderboard.

| Rank  | Model                                        | Model Type        | Score | Chat  | Chat Hard | Safety | Reasoning |
| :---: | -------------------------------------------- | ----------------- | :---: | :---: | :-------: | :----: | :-------: |
|   1   | **infly/INF-ORM-Llama3.1-70B**  | Seq. Classifier   | 95.1  | 96.6  |   91.0    |  93.6  |   99.1    |
|   2   | Skywork/Skywork-Reward-Gemma-2-27B-v0.2  | Seq. Classifier   | 94.3  | 96.1  |   89.9    |  93.0  |   98.1    |
|   3   | nvidia/Llama-3.1-Nemotron-70B-Reward         | Custom Classifier | 94.1  | 97.5  |   85.7    |  95.1  |   98.1    |
|   4   | Skywork/Skywork-Reward-Gemma-2-27B           | Seq. Classifier   | 93.8  | 95.8  |   91.4    |  91.9  |   96.1    |
|   5   | SF-Foundation/TextEval-Llama3.1-70B          | Generative        | 93.5  | 94.1  |   90.1    |  93.2  |   96.4    |
|   6   | meta-metrics/MetaMetrics-RM-v1.0             | Custom Classifier | 93.4  | 98.3  |   86.4    |  90.8  |   98.2    |
|   7   | Skywork/Skywork-Critic-Llama-3.1-70B         | Generative        | 93.3  | 96.6  |   87.9    |  93.1  |   95.5    |
|   8   | Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 | Seq. Classifier   | 93.1  | 94.7  |   88.4    |  92.7  |   96.7    |
|   9   | nicolinho/QRM-Llama3.1-8B                    | Seq. Classifier   | 93.1  | 94.4  |   89.7    |  92.3  |   95.8    |
|   10   | LxzGordon/URM-LLaMa-3.1-8B                   | Seq. Classifier   | 92.9  | 95.5  |   88.2    |  91.1  |   97.0    |

## Demo Code

The example below shows how to load INF-ORM-Llama3.1-70B and obtain the reward scores of two conversations.

```python
from typing import List, Optional

import torch
import torch.nn as nn
from transformers import LlamaPreTrainedModel, LlamaModel, PreTrainedTokenizerFast
from transformers.modeling_outputs import SequenceClassifierOutputWithPast

class INFORMForSequenceClassification(LlamaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = LlamaModel(config)
        self.score = nn.Sequential(
            nn.Linear(config.hidden_size, config.hidden_size),
            nn.ReLU(),
            nn.Linear(config.hidden_size, self.num_labels)
        )
        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ):

        transformer_outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
        )
        hidden_states = transformer_outputs[0]
        logits = self.score(hidden_states)

        if input_ids is not None:
            batch_size = input_ids.shape[0]
        else:
            batch_size = inputs_embeds.shape[0]

        if self.config.pad_token_id is None and batch_size != 1:
            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
        if self.config.pad_token_id is None:
            sequence_lengths = -1
        else:
            if input_ids is not None:
                # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
                sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
                sequence_lengths = sequence_lengths % input_ids.shape[-1]
                sequence_lengths = sequence_lengths.to(logits.device)
            else:
                sequence_lengths = -1

        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]

        loss = None
        return SequenceClassifierOutputWithPast(
            loss=loss,
            logits=pooled_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )

# Load model
model_name = "infly/INF-ORM-Llama3.1-70B"
orm = INFORMForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
    num_labels=1,
)

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_name)
conv1 = [ { "content": "Tom has 20 oranges. He gave 5 oranges to his friend Lisa, then he bought 3 more oranges. Finally, he divided all the oranges equally among himself and his 4 friends. How many oranges does each person get?", "role": "user" }, { "content": "Tom has 20 oranges. He gave 5 oranges to Lisa. 20 - 5 = 15. Tom now has 15 oranges. Tom bought 3 more oranges. 15 + 3 = 18. Tom now has 18 oranges. Tom divides the 18 oranges equally among himself and his 4 friends (a total of 5 people). 18 ÷ 5 = 3.6 oranges. Each person gets 3.6 oranges.", "role": "assistant" } ]
conv2 = [ { "content": "Tom has 20 oranges. He gave 5 oranges to his friend Lisa, then he bought 3 more oranges. Finally, he divided all the oranges equally among himself and his 4 friends. How many oranges does each person get?", "role": "user" }, { "content": "Tom has 20 oranges. He gave 5 oranges to his friend Lisa. 20 - 5 = 15. Tom now has 15 oranges. Tom bought 3 more oranges. 15 + 3 = 18. Tom now has 18 oranges. Tom divides the 18 oranges equally among his 4 friends (a total of 4 people). 18 ÷ 4 = 4.5 oranges. Each person gets 4.5 oranges.", "role": "assistant" } ]
conv1_tokenized = tokenizer.apply_chat_template(conv1, tokenize=True, return_tensors="pt").to("cuda")
conv2_tokenized = tokenizer.apply_chat_template(conv2, tokenize=True, return_tensors="pt").to("cuda")

# Inference
with torch.no_grad():
    score1 = orm(conv1_tokenized).logits[0][0].item()
    score2 = orm(conv2_tokenized).logits[0][0].item()
print(f"Score for response 1: {score1}")
print(f"Score for response 2: {score2}")

# Output:
# Score for response 1: 4.96875
# Score for response 2: 2.890625

```
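
Because the model is trained with a Bradley-Terry style loss, one way to read a pair of scores is to pass their difference through a sigmoid, which gives the probability that the first response is preferred under the BT model. The helper `preference_probability` below is a hypothetical illustration of that reading and is not part of the released code.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


# Scores taken from the example output above.
p = preference_probability(4.96875, 2.890625)
print(f"P(response 1 preferred) = {p:.3f}")
```

For the two scores above, the probability comes out at roughly 0.89 in favor of response 1, which is the response with the correct division by 5 people.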

## License Agreement
INF-ORM-Llama3.1-70B supports commercial applications under a permissive [License](https://huggingface.co/infly/INF-ORM-Llama3.1-70B/blob/main/LICENSE).

## Contact
If you have any questions, please feel free to reach us at Yang Minghao <[email protected]>, Qu Chao <[email protected]> and Tan Xiaoyu <[email protected]>.

## Acknowledgement 
This work was done during my internship at INF. I would like to thank my mentors (Qu Chao and Tan Xiaoyu) and the INF team for their support. Their insights and expertise contributed greatly to the successful completion of this work.