## Learning to Summarize from Human Feedback using `trlx`

This example shows how to use `trlx` to train a summarization model with human feedback, following the fine-tuning procedure described in Stiennon et al., "[Learning to Summarize from Human Feedback](https://arxiv.org/abs/2009.01325)".

Before running anything, we need some extra packages that are not in the `trlx` dependency list: Hugging Face's [`evaluate`](https://huggingface.co/docs/evaluate/index) package and Google's re-implementation of ROUGE, [`rouge-score`](https://github.com/google-research/google-research/tree/master/rouge). To install them, use the `requirements.txt` in this example's root directory:

```bash
pip install -r requirements.txt
```

### Training Process

For an in-depth description of the example, please refer to our [blog post](http://wandb.me/summarize-rlhf-trlx). Below is a quick overview of the fine-tuning process and the scripts to run.

1. Train the SFT model:

   ```bash
   cd sft/ && deepspeed train_gptj_summarize.py
   ```

   Checkpoint: [SFT](https://huggingface.co/CarperAI/openai_summarize_tldr_sft)

2. Train the reward model:

   ```bash
   cd reward_model/ && deepspeed train_reward_model_gptj.py
   ```

   Or download the reward model checkpoint:

   ```bash
   mkdir reward_model/rm_checkpoint
   wget https://huggingface.co/CarperAI/openai_summarize_tldr_rm_checkpoint/resolve/main/pytorch_model.bin -O reward_model/rm_checkpoint/pytorch_model.bin
   ```

3. Run PPO training:

   ```bash
   accelerate launch --config_file configs/default_accelerate_config.yaml trlx_gptj_text_summarization.py
   ```

   Checkpoint: [PPO](https://huggingface.co/CarperAI/openai_summarize_tldr_ppo)

   ⚠️ Warning: This particular training configuration requires at least 55 GB of VRAM and is set up to use two GPUs. Decrease `batch_size` if you run out of memory.

### Results

The following tables compare the SFT and PPO models on the test set of the TL;DR dataset, using ROUGE scores and reward-model scores. A hedged sketch for sampling from the released checkpoints and scoring summaries with ROUGE is included in the appendix at the end of this README.

1. SFT vs. PPO

   __ROUGE scores__

   | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | Average |
   | --- | --- | --- | --- | --- |
   | SFT | 0.334 | 0.125 | 0.261 | 0.240 |
   | PPO | 0.323 | 0.109 | 0.238 | 0.223 |

   __Reward scores__

   | Model | Average Reward | Reward $\Delta$ |
   | --- | --- | --- |
   | SFT | 2.729 | -0.181 |
   | PPO | 3.291 | +0.411 |

2. Examples of generated summaries can be found [here](https://wandb.ai/carperai/summarize_RLHF/runs/2uirt89a).

3. Metric logs and other results are available in our [blog post](http://wandb.me/summarize-rlhf-trlx).

## References

1. Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano, "[Learning to Summarize from Human Feedback](https://arxiv.org/abs/2009.01325)", Advances in Neural Information Processing Systems, 2020.
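
## Appendix: Sampling and Scoring Summaries (sketch)

The sketch below loads the released PPO checkpoint linked above and scores one generated summary with ROUGE via the `evaluate` package installed earlier. It is a minimal sketch, not the evaluation pipeline used to produce the numbers in the Results section: the prompt format, generation settings, example post, and reference summary are all illustrative assumptions.

```python
# Minimal sketch: generate one summary with the released PPO checkpoint and
# score it with ROUGE. Prompt format and generation settings are assumptions.
import evaluate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CarperAI/openai_summarize_tldr_ppo"  # checkpoint linked in this README
device = "cuda" if torch.cuda.is_available() else "cpu"
# The checkpoint is GPT-J-based (~6B parameters); fp16 keeps GPU memory manageable.
dtype = torch.float16 if device == "cuda" else torch.float32

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype).to(device)
model.eval()

# Hypothetical Reddit-style post; the trailing "TL;DR:" cue is an assumed prompt format.
post = (
    "I adopted a dog last month and she chews everything in the apartment "
    "whenever I leave for work, even though she has plenty of toys."
)
prompt = f"{post}\nTL;DR:"

inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

# Keep only the newly generated tokens (the summary).
summary = tokenizer.decode(
    output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
).strip()
print("Generated summary:", summary)

# Score against a hypothetical reference summary; `evaluate`'s "rouge" metric
# wraps Google's `rouge-score` package.
rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=[summary],
    references=["My new dog chews everything while I'm at work despite having toys."],
)
print(scores)  # rouge1, rouge2, rougeL, rougeLsum
```

For the numbers reported in the Results section, ROUGE is computed over the full TL;DR test set rather than a single example, and the reward scores come from the trained reward model.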