yyqoni
/

meta-llama-3.1-instruct-8b-bandit-rm-700k

Text Classification

text-generation-inference

Inference Endpoints

Model card Files Files and versions Community

This is the bandit reward model introduced in the preprint Segmenting Text and Learning Their Rewards for Improved RLHF in Language Models (https://arxiv.org/abs/2501.02790). For more details, please visit our repository at https://github.com/yinyueqin/DenseRewardRLHF-PPO.

Downloads last month: 11

Safetensors

Model size

7.5B params

Tensor type

BF16

·

Inference Providers NEW

Text Classification

This model is not currently available via any of the supported Inference Providers.

Model tree for yyqoni/meta-llama-3.1-instruct-8b-bandit-rm-700k

Base model

meta-llama/Llama-3.1-8B

Finetuned

meta-llama/Llama-3.1-8B-Instruct

Finetuned

(918)

this model

Dataset used to train yyqoni/meta-llama-3.1-instruct-8b-bandit-rm-700k

Collection including yyqoni/meta-llama-3.1-instruct-8b-bandit-rm-700k

DenseRewardRLHF-PPO

This repository contains the released models for our paper Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model. • 18 items • Updated Jan 11 • 1