## Learning to summarize from Human Feedback using `trlx`
This example shows how to use `trlx` to train a summarization model with human feedback, following the fine-tuning procedure described in Stiennon et al., "[Learning to summarize from human feedback](https://arxiv.org/abs/2009.01325)".
Before running anything, we need a few extra packages that are not in the `trlx` dependency list: HuggingFace's [`evaluate`](https://huggingface.co/docs/evaluate/index) package and Google's re-implementation of ROUGE, [`rouge-score`](https://github.com/google-research/google-research/tree/master/rouge). To install them, use the `requirements.txt` in this example's root directory:
```bash
pip install -r requirements.txt
```
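These two packages are used to score generated summaries against the reference TL;DRs. As a quick sanity check that the installation worked, a minimal sketch like the following should run (the strings are made-up placeholders, not dataset samples):
```python
# Quick sanity check that `evaluate` and `rouge-score` are installed.
# The two strings below are made-up placeholders, not dataset samples.
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["the cat sat on the mat"],
    references=["a cat was sitting on the mat"],
)
print(scores)  # rouge1 / rouge2 / rougeL / rougeLsum scores
```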
### Training Process
For an in-depth description of the example, please refer to our [blog post](http://wandb.me/summarize-rlhf-trlx). The following is a quick overview of the fine-tuning process and which scripts to run.
1. Train SFT:
```bash
cd sft/ && deepspeed train_gptj_summarize.py
```
Checkpoint: [SFT](https://huggingface.co/CarperAI/openai_summarize_tldr_sft)
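If you want to inspect the data this stage fine-tunes on before launching training, a minimal sketch along the following lines should do; the dataset id is the one the training script pulls from the HuggingFace Hub (stated here as an assumption), and we print a raw record instead of relying on particular field names:
```python
# Sketch: peek at the TL;DR fine-tuning data.
# The dataset id is assumed to match the one used by train_gptj_summarize.py;
# we print a raw record rather than relying on specific field names.
from datasets import load_dataset

train = load_dataset("CarperAI/openai_summarize_tldr", split="train")
print(len(train), "training examples")
print(train[0])  # raw fields: the Reddit post/prompt and its reference TL;DR
```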
2. Train Reward Model:
```bash
cd reward_model/ && deepspeed train_reward_model_gptj.py
```
Download reward model checkpoint:
```bash
mkdir reward_model/rm_checkpoint
wget https://huggingface.co/CarperAI/openai_summarize_tldr_rm_checkpoint/resolve/main/pytorch_model.bin -O reward_model/rm_checkpoint/pytorch_model.bin
```
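The downloaded file is a plain PyTorch state dict that the PPO and evaluation scripts load into their own reward-model wrapper class. A minimal sketch to verify the download itself, without depending on that wrapper:
```python
# Sketch: confirm the downloaded reward-model weights load as a state dict.
# This only inspects the file; the scripts in this example load it into
# their own reward-model class before computing rewards.
import torch

state_dict = torch.load(
    "reward_model/rm_checkpoint/pytorch_model.bin", map_location="cpu"
)
print(f"{len(state_dict)} tensors in checkpoint")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```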
3. PPO training:
```bash
accelerate launch --config_file configs/default_accelerate_config.yaml trlx_gptj_text_summarization.py
```
Checkpoint: [PPO](https://huggingface.co/CarperAI/openai_summarize_tldr_ppo)
🩹 Warning: this training configuration requires at least 55GB of VRAM and is set up to use two GPUs; decrease `batch_size` if you run out of memory.
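For a quick qualitative look at the released PPO checkpoint without running the full pipeline, a sketch like the one below samples a summary with plain `transformers`. The prompt format is illustrative rather than the exact format used during training, and if the checkpoint does not ship tokenizer files, the GPT-J tokenizer (`EleutherAI/gpt-j-6B`) should work instead:
```python
# Sketch: greedy-decode a summary from the released PPO checkpoint.
# GPT-J has ~6B parameters, so this assumes a GPU with enough memory;
# the prompt format below is illustrative, not the exact training format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "CarperAI/openai_summarize_tldr_ppo"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).to("cuda")

prompt = "SUBREDDIT: r/AskReddit\nTITLE: <post title>\nPOST: <post body>\nTL;DR:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=50, do_sample=False)
summary = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(summary)
```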
### Results
The following tables compare ROUGE and reward scores of the SFT and PPO models on the test set of the TL;DR dataset.
1. SFT vs PPO
__ROUGE scores__
| Model | Rouge-1 | Rouge-2 | Rouge-L | Average |
| --- | --- | --- | --- | --- |
| SFT | 0.334 | 0.125 | 0.261 | 0.240 |
| PPO | 0.323 | 0.109 | 0.238 | 0.223 |
__Reward scores__
| Model | Average Reward | Reward $\Delta$ |
| --- | --- | --- |
| SFT | 2.729 | -0.181 |
| PPO | 3.291 | +0.411 |
2. Examples of generated summaries can be found [here](https://wandb.ai/carperai/summarize_RLHF/runs/2uirt89a).
3. Check our blog post for metric logs and other results [here](http://wandb.me/summarize-rlhf-trlx).
## References
1. Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano, "[Learning to summarize from human feedback](https://arxiv.org/abs/2009.01325)", Advances in Neural Information Processing Systems, 2020.