|
--- |
|
library_name: transformers |
|
base_model: meta-llama/Llama-3.1-70B-Instruct |
|
datasets: |
|
- infly/INF-ORM-Preference-Magnitude-80K |
|
pipeline_tag: text-classification |
|
--- |
|
|
|
|
|
# INF Outcome Reward Model |
|
## Introduction |
|
|
|
[**INF-ORM-Llama3.1-70B**](https://huggingface.co/infly/INF-ORM-Llama3.1-70B) is an outcome reward model built on the [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) architecture and trained on the [INF-ORM-Preference-Magnitude-80K](https://huggingface.co/datasets/infly/INF-ORM-Preference-Magnitude-80K) dataset.
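To get a quick look at the training data, the preference dataset can be inspected with the `datasets` library. This is a minimal sketch; the `"train"` split name is an assumption, and the exact record layout should be checked on the dataset card.

```python
from datasets import load_dataset

# Load the preference dataset used to train INF-ORM-Llama3.1-70B.
# The "train" split name is an assumption; see the dataset card for the actual splits and fields.
ds = load_dataset("infly/INF-ORM-Preference-Magnitude-80K", split="train")

print(ds.column_names)  # inspect the available fields
print(ds[0])            # look at one preference record
```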
|
|
|
**Note: Training details are coming soon!**
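While the training recipe has not been released yet, outcome reward models of this kind are commonly trained with a pairwise ranking objective over chosen/rejected responses. The sketch below shows a generic Bradley-Terry style loss with an optional margin term; it is illustrative only and is not confirmed to match the actual INF-ORM training setup.

```python
from typing import Optional

import torch
import torch.nn.functional as F

def pairwise_ranking_loss(
    chosen_scores: torch.Tensor,
    rejected_scores: torch.Tensor,
    margin: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    # Bradley-Terry style objective: push the chosen score above the rejected score.
    # The optional margin (e.g. derived from a preference-magnitude label) is purely
    # illustrative; the actual INF-ORM training objective has not been published.
    diff = chosen_scores - rejected_scores
    if margin is not None:
        diff = diff - margin
    return -F.logsigmoid(diff).mean()

# Toy example with two preference pairs of scalar reward-model scores.
loss = pairwise_ranking_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, -0.1]))
print(loss.item())
```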
|
|
|
## RewardBench Leaderboard |
|
|
|
We evaluate our model on [RewardBench](https://huggingface.co/spaces/allenai/reward-bench) using the [official test script](https://github.com/allenai/reward-bench) locally. As of December 2024, INF-ORM-Llama3.1-70B ranks first on the RewardBench leaderboard. |
|
|
|
| Rank | Model | Model Type | Score | Chat | Chat Hard | Safety | Reasoning |
| :---: | -------------------------------------------- | ----------------- | :---: | :---: | :-------: | :----: | :-------: |
| 1 | **infly/INF-ORM-Llama3.1-70B** | Custom Classifier | 95.2 | 96.9 | 91.0 | 93.8 | 99.1 |
| 2 | Skywork/Skywork-Reward-Gemma-2-27B-v0.2 | Seq. Classifier | 94.3 | 96.1 | 89.9 | 93.0 | 98.1 |
| 3 | nvidia/Llama-3.1-Nemotron-70B-Reward | Custom Classifier | 94.1 | 97.5 | 85.7 | 95.1 | 98.1 |
| 4 | Skywork/Skywork-Reward-Gemma-2-27B | Seq. Classifier | 93.8 | 95.8 | 91.4 | 91.9 | 96.1 |
| 5 | SF-Foundation/TextEval-Llama3.1-70B | Generative | 93.5 | 94.1 | 90.1 | 93.2 | 96.4 |
| 6 | meta-metrics/MetaMetrics-RM-v1.0 | Custom Classifier | 93.4 | 98.3 | 86.4 | 90.8 | 98.2 |
| 7 | Skywork/Skywork-Critic-Llama-3.1-70B | Generative | 93.3 | 96.6 | 87.9 | 93.1 | 95.5 |
| 8 | Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 | Seq. Classifier | 93.1 | 94.7 | 88.4 | 92.7 | 96.7 |
| 9 | nicolinho/QRM-Llama3.1-8B | Seq. Classifier | 93.1 | 94.4 | 89.7 | 92.3 | 95.8 |
| 10 | LxzGordon/URM-LLaMa-3.1-8B | Seq. Classifier | 92.9 | 95.5 | 88.2 | 91.1 | 97.0 |
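As the numbers in the table suggest, the overall Score column corresponds to the unweighted mean of the four category scores (Chat, Chat Hard, Safety, Reasoning). For example, for INF-ORM-Llama3.1-70B:

```python
# Overall RewardBench score as the unweighted mean of the four category scores
# (values taken from the INF-ORM-Llama3.1-70B row above).
chat, chat_hard, safety, reasoning = 96.9, 91.0, 93.8, 99.1
overall = (chat + chat_hard + safety + reasoning) / 4
print(round(overall, 1))  # 95.2
```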
|
|
|
## Demo Code |
|
|
|
Below is an example of using INF-ORM-Llama3.1-70B to obtain the reward scores of two conversations.
|
|
|
```python
from typing import List, Optional, Union

import torch
import torch.nn as nn
from transformers import LlamaPreTrainedModel, LlamaModel, PreTrainedTokenizerFast
from transformers.modeling_outputs import SequenceClassifierOutputWithPast


class INFORMForSequenceClassification(LlamaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = LlamaModel(config)
        self.score = nn.Sequential(
            nn.Linear(config.hidden_size, config.hidden_size),
            nn.ReLU(),
            nn.Linear(config.hidden_size, self.num_labels)
        )
        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ):
        transformer_outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
        )
        hidden_states = transformer_outputs[0]
        logits = self.score(hidden_states)

        if input_ids is not None:
            batch_size = input_ids.shape[0]
        else:
            batch_size = inputs_embeds.shape[0]

        if self.config.pad_token_id is None and batch_size != 1:
            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
        if self.config.pad_token_id is None:
            sequence_lengths = -1
        else:
            if input_ids is not None:
                # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
                sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
                sequence_lengths = sequence_lengths % input_ids.shape[-1]
                sequence_lengths = sequence_lengths.to(logits.device)
            else:
                sequence_lengths = -1

        # Use the score of the last non-padding token in each sequence as the reward
        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]

        loss = None
        return SequenceClassifierOutputWithPast(
            loss=loss,
            logits=pooled_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )


# Load model
model_name = "infly/INF-ORM-Llama3.1-70B"
orm = INFORMForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
    num_labels=1,
)

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_name)

# Two responses to the same prompt: the first divides among all 5 people and reasons correctly,
# the second divides among only 4 people and gets the wrong answer.
conv1 = [
    {
        "content": "Tom has 20 oranges. He gave 5 oranges to his friend Lisa, then he bought 3 more oranges. Finally, he divided all the oranges equally among himself and his 4 friends. How many oranges does each person get?",
        "role": "user"
    },
    {
        "content": "Tom has 20 oranges. He gave 5 oranges to Lisa. 20 - 5 = 15. Tom now has 15 oranges. Tom bought 3 more oranges. 15 + 3 = 18. Tom now has 18 oranges. Tom divides the 18 oranges equally among himself and his 4 friends (a total of 5 people). 18 ÷ 5 = 3.6 oranges. Each person gets 3.6 oranges.",
        "role": "assistant"
    }
]
conv2 = [
    {
        "content": "Tom has 20 oranges. He gave 5 oranges to his friend Lisa, then he bought 3 more oranges. Finally, he divided all the oranges equally among himself and his 4 friends. How many oranges does each person get?",
        "role": "user"
    },
    {
        "content": "Tom has 20 oranges. He gave 5 oranges to his friend Lisa. 20 - 5 = 15. Tom now has 15 oranges. Tom bought 3 more oranges. 15 + 3 = 18. Tom now has 18 oranges. Tom divides the 18 oranges equally among his 4 friends (a total of 4 people). 18 ÷ 4 = 4.5 oranges. Each person gets 4.5 oranges.",
        "role": "assistant"
    }
]
conv1_tokenized = tokenizer.apply_chat_template(conv1, tokenize=True, return_tensors="pt").to("cuda")
conv2_tokenized = tokenizer.apply_chat_template(conv2, tokenize=True, return_tensors="pt").to("cuda")

# Inference
with torch.no_grad():
    score1 = orm(conv1_tokenized).logits[0][0].item()
    score2 = orm(conv2_tokenized).logits[0][0].item()
    print(f"Score for response 1: {score1}")
    print(f"Score for response 2: {score2}")

# Output:
# Score for response 1: 4.96875
# Score for response 2: 2.890625
```
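Since the model outputs scalar scores, the two responses can be compared directly: the higher-scoring response is preferred. If a probability-like preference is needed, a common convention (continuing from the demo above, and not an official part of this model) is to apply a sigmoid to the score difference:

```python
import torch

# Bradley-Terry style preference probability from the two scalar scores computed above.
# This post-processing is a common convention, not something specified by the model authors.
prob_response1_preferred = torch.sigmoid(torch.tensor(score1 - score2)).item()
print(f"P(response 1 preferred over response 2): {prob_response1_preferred:.3f}")
```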
|
|
|
## Declaration and License Agreement |
|
|
|
### Declaration |
|
|
|
### License Agreement |
|
|
|
## Contact |
|
If you have any questions, please feel free to reach us at <[email protected]>. |
|
## Citation |
|
|
|
|
|
|