metadata
datasets:
- PKU-Alignment/PKU-SafeRLHF
language:
- en
tags:
- reinforcement-learning-from-human-feedback
- reinforcement-learning
- beaver
- safety
- llama
- ai-safety
- deepspeed
- rlhf
- alpaca
library_name: safe-rlhf
🦫 Beaver's Cost Model
Model Details
The Beaver cost model is a preference model trained on the PKU-SafeRLHF dataset. It is used in the Safe RLHF algorithm to steer the Beaver model toward safer, less harmful responses.
- Developed by: the PKU-Alignment Team.
- Model Type: An auto-regressive language model based on the transformer architecture.
- License: Non-commercial license.
- Fine-tuned from model: LLaMA, Alpaca.
Model Sources
- Repository: https://github.com/PKU-Alignment/safe-rlhf
- Beaver: https://huggingface.co/PKU-Alignment/beaver-7b-v2.0
- Dataset: https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF
- Reward Model: https://huggingface.co/PKU-Alignment/beaver-7b-v2.0-reward
- Cost Model: https://huggingface.co/PKU-Alignment/beaver-7b-v2.0-cost
- Dataset Paper: https://arxiv.org/abs/2307.04657
- Paper: https://arxiv.org/abs/2310.12773
How to Use the Cost Model
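import torch
from transformers import AutoTokenizer
from safe_rlhf.models import AutoModelForScore

# Load the cost model and its tokenizer from the Hugging Face Hub.
model = AutoModelForScore.from_pretrained('PKU-Alignment/beaver-7b-v2.0-cost', torch_dtype=torch.bfloat16, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained('PKU-Alignment/beaver-7b-v2.0-cost')

# A single-turn conversation: the user prompt followed by the assistant response to score.
# `text` is used instead of `input` to avoid shadowing the Python built-in.
text = 'BEGINNING OF CONVERSATION: USER: hello ASSISTANT:Hello! How can I help you today?'
inputs = tokenizer(text, return_tensors='pt')  # tokenized inputs as PyTorch tensors
output = model(**inputs)
print(output)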
# ScoreModelOutput(
# scores=tensor([[[ 1.2031],
# [ 2.0469],
# [ 2.1875],
# [ 2.0938],
# [ 2.9219],
# [ 2.2656],
# [ 3.1250],
# [ 2.4219],
# [ 3.6406],
# [ 2.4062],
# [ 0.7383],
# [ 0.6719],
# [-0.4414],
# [-1.2734],
# [-1.6562],
# [ 0.3340],
# [ 0.2432],
# [-0.6914],
# [-1.0938],
# [-1.9453],
# [-3.0469],
# [-2.7812],
# [-2.2188],
# [-1.6250],
# [-1.5000],
# [-1.9922],
# [-2.6562],
# [-9.4375]]], grad_fn=<ToCopyBackward0>),
# end_scores=tensor([[-9.4375]], grad_fn=<ToCopyBackward0>),
# last_hidden_state=tensor([[[ 7.4219e-02, 3.6865e-02, -2.4414e-01, ..., -5.7129e-02,
# 1.1963e-01, 2.7734e-01],
# [-7.0703e-01, 1.0234e+00, 9.8145e-02, ..., 2.6719e+00,
# 8.2422e-01, 4.7119e-02],
# [-1.5332e-01, 1.0938e+00, -5.0000e-01, ..., -1.6699e-01,
# -6.0156e-01, 5.3516e-01],
# ...,
# [-1.0469e+00, 3.5858e-03, -1.1094e+00, ..., -1.1094e+00,
# 9.2578e-01, 1.3750e+00],
# [ 3.1445e-01, -9.7266e-01, -1.8984e+00, ..., -9.4141e-01,
# 2.0703e-01, 9.4531e-01],
# [ 5.5625e+00, -1.8672e+00, -1.3359e+00, ..., 8.0078e-01,
# -1.8906e+00, -1.3516e+00]]], dtype=torch.bfloat16,
# grad_fn=<ToCopyBackward0>),
# end_last_hidden_state=tensor([[ 5.5625, -1.8672, -1.3359, ..., 0.8008, -1.8906, -1.3516]],
# dtype=torch.bfloat16, grad_fn=<ToCopyBackward0>),
# end_index=tensor([27])
# )
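The example above hard-codes the conversation string, which is easy to get wrong when scoring many prompt/response pairs. Below is a minimal sketch of a helper that builds the same string. It assumes the template is exactly the one visible in the example (including the missing space after `ASSISTANT:`). The names `PROMPT_TEMPLATE` and `format_conversation` are hypothetical and are not part of the `safe_rlhf` API.

```python
# Hypothetical helper: the template is copied from the example string above,
# not taken from the safe_rlhf library.
PROMPT_TEMPLATE = 'BEGINNING OF CONVERSATION: USER: {prompt} ASSISTANT:{response}'


def format_conversation(prompt: str, response: str) -> str:
    """Build a single-turn conversation string in the format shown above."""
    return PROMPT_TEMPLATE.format(prompt=prompt, response=response)


# Rebuild the exact string used in the example.
text = format_conversation('hello', 'Hello! How can I help you today?')
print(text)
```

The resulting string can be passed to the tokenizer exactly as in the example above. In the printed output, `end_scores` (-9.4375) matches the last entry of `scores`, at position `end_index` (27), so it is the single value to read as the score for the whole response. Assuming the convention of the linked paper, where a higher cost indicates a more harmful response, a strongly negative value like this one would suggest the response is considered harmless.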