Learning to summarize from Human Feedback using trlx
This example shows how to use trlx to train a summarization model from human feedback, following the fine-tuning procedure described in Stiennon et al., "Learning to summarize from human feedback".
Before running anything, install a few extra packages that are not in the trlx dependency list: HuggingFace's evaluate package and Google's re-implementation of ROUGE, rouge-score. To install them, run the following in this example's root directory:
pip install -r requirements.txt
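For orientation on what the ROUGE metric measures, here is a minimal sketch of ROUGE-1 F1 in plain Python. It is a simplified illustration, not the rouge-score implementation: it assumes lowercased whitespace tokenization and no stemming, and the function name `rouge1_f1` and the example sentences are hypothetical.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts only as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference/candidate pair, made up for illustration.
reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
score = rouge1_f1(reference, candidate)
print(f"ROUGE-1 F1: {score:.3f}")
```

Scores from the actual packages can differ from this sketch because of tokenization and stemming choices, so use it only to build intuition.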
Training Process
For an in-depth description of the example, please refer to our blog post. We leave the following for a quick overview of the fine-tuning process and what scripts to run.
Train SFT:
cd sft/ && deepspeed train_gptj_summarize.py
Checkpoint: SFT
Train Reward Model:
cd reward_model/ && deepspeed train_reward_model_gptj.py
Download reward model checkpoint:
mkdir reward_model/rm_checkpoint
wget https://huggingface.co/CarperAI/openai_summarize_tldr_rm_checkpoint/resolve/main/pytorch_model.bin -O reward_model/rm_checkpoint/pytorch_model.bin
PPO training:
accelerate launch --config_file configs/default_accelerate_config.yaml trlx_gptj_text_summarization.py
Checkpoint: PPO
🩹 Warning: This training configuration requires at least 55GB of VRAM and is set up to use two GPUs. Decrease batch_size if you run out of memory.
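As background for the reward model step above, Stiennon et al. train the reward model on pairs of summaries with the objective -log σ(r_chosen - r_rejected), where r is the scalar reward and "chosen" is the summary the human labeler preferred. The sketch below computes that loss in plain Python on made-up reward values. The function name `pairwise_reward_loss` is hypothetical, and the training script in this example may implement the objective differently (for instance on batched tensors).

```python
import math


def pairwise_reward_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over comparison pairs."""
    if len(chosen_rewards) != len(rejected_rewards) or not chosen_rewards:
        raise ValueError("need equally sized, non-empty reward lists")
    total = 0.0
    for r_chosen, r_rejected in zip(chosen_rewards, rejected_rewards):
        diff = r_chosen - r_rejected
        # -log(sigmoid(diff)) equals softplus(-diff); this form avoids overflow.
        total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total / len(chosen_rewards)


# Hypothetical rewards for three comparison pairs (not real model outputs).
chosen = [1.2, 0.3, 2.0]
rejected = [0.4, 0.9, -1.0]
loss = pairwise_reward_loss(chosen, rejected)
print(f"pairwise loss: {loss:.4f}")
```

The loss shrinks as the model scores preferred summaries higher than rejected ones, and equals log 2 when the two rewards are identical.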
Results
The following tables report ROUGE and reward scores for the SFT and PPO models on the test set of the TL;DR dataset.
SFT vs PPO
ROUGE scores

| Model | Rouge-1 | Rouge-2 | Rouge-L | Average |
| --- | --- | --- | --- | --- |
| SFT | 0.334 | 0.125 | 0.261 | 0.240 |
| PPO | 0.323 | 0.109 | 0.238 | 0.223 |

Reward scores

| Model | Average Reward | Reward $\Delta$ |
| --- | --- | --- |
| SFT | 2.729 | -0.181 |
| PPO | 3.291 | +0.411 |

Examples of generated summaries can be found here.
See our blog post for metric logs and other results.
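The Average column in the ROUGE table appears to be the unweighted mean of Rouge-1, Rouge-2, and Rouge-L. That reading is an assumption, but it reproduces the reported values. The short sketch below recomputes it from the numbers above and also shows the trade-off between the two models: PPO gives up a little ROUGE while gaining average reward. The variable names are illustrative only.

```python
# Scores copied from the tables above.
rouge = {
    "SFT": {"rouge1": 0.334, "rouge2": 0.125, "rougeL": 0.261},
    "PPO": {"rouge1": 0.323, "rouge2": 0.109, "rougeL": 0.238},
}
avg_reward = {"SFT": 2.729, "PPO": 3.291}

# Assumed definition of the Average column: unweighted mean of the three scores.
rouge_avg = {model: sum(s.values()) / len(s) for model, s in rouge.items()}
for model, value in rouge_avg.items():
    print(f"{model}: mean ROUGE = {value:.3f}")

# Differences between the two models (PPO minus SFT).
rouge_change = rouge_avg["PPO"] - rouge_avg["SFT"]
reward_change = avg_reward["PPO"] - avg_reward["SFT"]
print(f"ROUGE change: {rouge_change:+.3f}, reward change: {reward_change:+.3f}")
```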
References
- Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano, "Learning to summarize from human feedback", Neural Information Processing Systems, 2020.