llama-2-7b-reward-oasst1

This model is a fine-tuned version of meta-llama/Llama-2-7b-chat-hf on the first 10000 rows of the tasksource/oasst1_pairwise_rlhf_reward dataset. It achieves the following results on the evaluation set:

Loss: 0.5713
Accuracy: 0.7435

See also vincentmin/llama-2-13b-reward-oasst1 for a 13b version of this model.

Model description

This is a reward model trained with QLoRA in 4-bit precision. The base model is meta-llama/Llama-2-7b-chat-hf, whose license you must accept before you can use it. Once you have been granted access, you can load the reward model as follows:

import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer

peft_model_id = "vincentmin/llama-2-7b-reward-oasst1"
config = PeftConfig.from_pretrained(peft_model_id)

# Load the base model in 4-bit with a single-output head that produces the scalar reward.
model = AutoModelForSequenceClassification.from_pretrained(
    config.base_model_name_or_path,
    num_labels=1,
    load_in_4bit=True,
    torch_dtype=torch.float16,
)
# Attach the trained LoRA adapter to the quantized base model.
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path, use_auth_token=True)

model.eval()
# Move the tokenized inputs to the same device as the model.
inputs = tokenizer("prompter: hello world. assistant: foo bar", return_tensors="pt").to(model.device)
with torch.no_grad():
    reward = model(**inputs).logits  # one scalar reward per input sequence
print(reward)

For best results, one should use the prompt format used during training:

prompt = "prompter: <prompt_1> assistant: <response_1> prompter: <prompt_2> ..."
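As a minimal sketch of building such a string from a multi-turn conversation, the helper below joins the turns with single spaces, following the template above. The name `format_conversation` and the `(role, text)` input shape are hypothetical choices for illustration, not part of the model's code.

```python
def format_conversation(turns):
    """Join (role, text) turns into 'prompter: ... assistant: ...' form.

    Hypothetical helper: assumes turns are separated by a single space,
    as in the template shown in this model card.
    """
    allowed_roles = {"prompter", "assistant"}
    parts = []
    for role, text in turns:
        if role not in allowed_roles:
            raise ValueError(f"unknown role: {role!r}")
        parts.append(f"{role}: {text.strip()}")
    return " ".join(parts)


prompt = format_conversation([
    ("prompter", "What is QLoRA?"),
    ("assistant", "A method for fine-tuning quantized models with LoRA adapters."),
])
print(prompt)
```

The resulting string can be passed to the tokenizer in place of the literal prompt used in the loading example.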

Please use a version of peft in which #755 (https://github.com/huggingface/peft/pull/755) has been merged, so that the model is loaded correctly. To be sure of this, you can install peft from source with pip install git+https://github.com/huggingface/peft.git.

Intended uses & limitations

Since the model was trained on oasst1 data, the reward will reflect any biases present in the oasst1 data.

Training and evaluation data

The model was trained using QLoRA and the trl library's RewardTrainer on the tasksource/oasst1_pairwise_rlhf_reward dataset. Examples with more than 1024 tokens were filtered out, and the training data was restricted to the first 10000 rows of the filtered dataset.
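The filtering step might look roughly like the sketch below. It is not the original preprocessing script. The column names `prompt`, `chosen` and `rejected`, the choice to measure the prompt together with each response, and the helper name `filter_and_truncate` are all assumptions made for illustration. The token counter is passed in as a function, so a real tokenizer can be substituted for the whitespace counter used in the demo.

```python
from itertools import islice


def filter_and_truncate(rows, count_tokens, max_tokens=1024, max_rows=10000):
    """Drop rows that exceed max_tokens, then keep the first max_rows survivors.

    Assumption: a row is kept only if both prompt+chosen and prompt+rejected
    fit within max_tokens.
    """
    def fits(row):
        return all(
            count_tokens(row["prompt"] + row[key]) <= max_tokens
            for key in ("chosen", "rejected")
        )

    return list(islice((row for row in rows if fits(row)), max_rows))


# Demo with a whitespace "tokenizer" standing in for the real one.
demo_rows = [
    {"prompt": "prompter: hi", "chosen": " assistant: hello", "rejected": " assistant: no"},
    {"prompt": "prompter: " + "word " * 50, "chosen": " assistant: a", "rejected": " assistant: b"},
]
kept = filter_and_truncate(demo_rows, lambda text: len(text.split()), max_tokens=10)
print(len(kept))
```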

Training hyperparameters

The following bitsandbytes quantization config was used during training:

load_in_8bit: False
load_in_4bit: True
llm_int8_threshold: 6.0
llm_int8_skip_modules: None
llm_int8_enable_fp32_cpu_offload: False
llm_int8_has_fp16_weight: False
bnb_4bit_quant_type: nf4
bnb_4bit_use_double_quant: False
bnb_4bit_compute_dtype: float16
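For readers who want to reproduce this setup, the listed values could be expressed as a `BitsAndBytesConfig` from transformers, roughly as follows. This is a sketch and not the original training script. Only the 4-bit fields are set explicitly, on the assumption that the `llm_int8_*` values listed above are the library defaults.

```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization without double quantization, computing in float16,
# mirroring the values listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.float16,
)
```

Such a config can be passed as `quantization_config=bnb_config` to `from_pretrained` instead of the bare `load_in_4bit=True` flag.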

The following hyperparameters were used during training:

learning_rate: 2e-05
train_batch_size: 1
eval_batch_size: 1
seed: 42
gradient_accumulation_steps: 4
total_train_batch_size: 4
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 1
max_seq_length: 1024
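The relationship between these values can be checked with a little arithmetic. The sketch below assumes single-device training, that all 10000 rows were used for training, and that the linear schedule decays to zero without warmup. None of these is stated explicitly in this card. Under those assumptions the step count agrees with the final step of 2500 in the training results table.

```python
import math

learning_rate = 2e-05
train_batch_size = 1
gradient_accumulation_steps = 4
num_epochs = 1
num_devices = 1         # assumption: single-device training
num_train_rows = 10000  # assumption: all 10000 rows were used for training

# Effective batch size per optimizer step.
total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices
# Number of optimizer steps over the whole run.
total_steps = math.ceil(num_train_rows / total_train_batch_size) * num_epochs


def linear_lr(step):
    """Learning rate at a given step, assuming linear decay to zero and no warmup."""
    return learning_rate * max(0.0, 1.0 - step / total_steps)


print(total_train_batch_size, total_steps)
print(linear_lr(0), linear_lr(total_steps // 2), linear_lr(total_steps))
```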

Training results

Training Loss	Epoch	Step	Validation Loss	Accuracy
0.8409	0.1	250	0.8243	0.6220
0.6288	0.2	500	0.7539	0.6715
0.5882	0.3	750	0.6792	0.7075
0.7671	0.4	1000	0.6130	0.7334
0.5782	0.5	1250	0.6115	0.7255
0.5691	0.6	1500	0.5795	0.7413
0.6579	0.7	1750	0.5774	0.7469
0.6107	0.8	2000	0.5691	0.7402
0.6255	0.9	2250	0.5710	0.7435
0.7034	1.0	2500	0.5713	0.7435
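To help interpret the loss and accuracy columns, here is a minimal sketch of the pairwise ranking objective commonly used for reward models, which is assumed here to be what RewardTrainer optimized: the loss is -log(sigmoid(r_chosen - r_rejected)), and accuracy is the fraction of pairs where the chosen response scores higher. The function names and the toy reward values are made up for illustration.

```python
import math


def pairwise_loss(chosen_reward, rejected_reward):
    """Return -log(sigmoid(chosen - rejected)), computed in a numerically stable way."""
    diff = chosen_reward - rejected_reward
    # -log(sigmoid(diff)) equals log(1 + exp(-diff)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def evaluate(pairs):
    """Mean pairwise loss and accuracy over (chosen_reward, rejected_reward) pairs."""
    mean_loss = sum(pairwise_loss(c, r) for c, r in pairs) / len(pairs)
    accuracy = sum(c > r for c, r in pairs) / len(pairs)
    return mean_loss, accuracy


# Toy rewards, not outputs of the actual model.
toy_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0), (0.5, 0.2)]
mean_loss, accuracy = evaluate(toy_pairs)
print(mean_loss, accuracy)
```

Under this objective, a model that assigns equal rewards to both responses has a loss of log 2 (about 0.693), so validation losses below that value indicate that the model separates chosen from rejected responses on average.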

Framework versions

PEFT 0.5.0.dev0 (with https://github.com/huggingface/peft/pull/755)
Transformers 4.32.0.dev0
Pytorch 2.0.1+cu118
Datasets 2.14.0
Tokenizers 0.13.3
