Learning to summarize from Human Feedback using trlx

This example shows how to use trlx to train a summarization model with human feedback, following the fine-tuning procedure described in Stiennon et al., "Learning to Summarize from Human Feedback".

Before running anything, we need a few extra packages not included in trlx's dependency list: HuggingFace's evaluate package and rouge-score, Google's re-implementation of ROUGE. To install them, use the requirements.txt in this example's root directory:

pip install -r requirements.txt
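As a quick sanity check that the metric stack is installed correctly, the ROUGE metric can be loaded through evaluate, which delegates to rouge-score under the hood (the example strings below are illustrative only):

    import evaluate

    # Loads the ROUGE metric; requires the rouge-score package installed above.
    rouge = evaluate.load("rouge")

    scores = rouge.compute(
        predictions=["a model-generated summary"],
        references=["a human-written reference summary"],
    )
    print(scores)  # rouge1 / rouge2 / rougeL / rougeLsum scores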

Training Process

For an in-depth description of the example, please refer to our blog post. The following is a quick overview of the fine-tuning process and the scripts to run.

  1. Train SFT:

    cd sft/ && deepspeed train_gptj_summarize.py
    

    Checkpoint: SFT

  2. Train Reward Model:

    cd reward_model/ && deepspeed train_reward_model_gptj.py
    

    Download the reward model checkpoint:

    mkdir reward_model/rm_checkpoint
    wget https://huggingface.co/CarperAI/openai_summarize_tldr_rm_checkpoint/resolve/main/pytorch_model.bin -O reward_model/rm_checkpoint/pytorch_model.bin
    
  3. PPO training:

    accelerate launch --config_file configs/default_accelerate_config.yaml trlx_gptj_text_summarization.py
    

    Checkpoint: PPO

    🩹 Warning: This particular training configuration requires at least 55GB of VRAM and is set up to use two GPUs; decrease batch_size if you run out of memory. A minimal sketch of how this stage wires the reward model into trlx follows this list.
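At its core, the PPO stage calls trlx's training entry point with the SFT checkpoint as the starting policy and the reward model as the reward signal. Below is a minimal sketch of that hookup, not the full script (trlx_gptj_text_summarization.py); `score_with_rm`, `train_prompts`, and `val_prompts` are hypothetical stand-ins:

    import trlx

    def reward_fn(samples, **kwargs):
        # One scalar reward per generated "post + summary" string,
        # produced by the GPT-J reward model trained in step 2.
        return score_with_rm(samples)  # hypothetical batch-scoring helper

    trainer = trlx.train(
        "CarperAI/openai_summarize_tldr_sft",  # SFT checkpoint from step 1
        reward_fn=reward_fn,
        prompts=train_prompts,      # TL;DR posts formatted as prompts
        eval_prompts=val_prompts,
    )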

Results

The following tables compare ROUGE and reward scores on the test set of the TL;DR dataset for the SFT and PPO models.

  1. SFT vs PPO

    ROUGE scores (the Average column is the mean of the three ROUGE scores):

    | Model | Rouge-1 | Rouge-2 | Rouge-L | Average |
    | ----- | ------- | ------- | ------- | ------- |
    | SFT   | 0.334   | 0.125   | 0.261   | 0.240   |
    | PPO   | 0.323   | 0.109   | 0.238   | 0.223   |

    Reward scores, as judged by the trained reward model (a sketch of the scoring loop follows this list):

    | Model | Average Reward | Reward $\Delta$ |
    | ----- | -------------- | --------------- |
    | SFT   | 2.729          | -0.181          |
    | PPO   | 3.291          | +0.411          |
  2. Examples of generated summaries can be found here.

  3. Metric logs and other results can be found in our blog post.
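The reward numbers above come from running the trained reward model over the test set. Here is a hedged sketch of that scoring loop, assuming a scalar-output reward model `rm` loaded from the checkpoint downloaded in step 2 (the concrete wrapper is the one used by train_gptj_summarize.py's reward-model counterpart; the interface below is an assumption):

    import torch
    from transformers import AutoTokenizer

    # The reward model is GPT-J based, so it shares GPT-J's tokenizer.
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

    @torch.no_grad()
    def reward_score(rm, post_with_summary):
        # Tokenize the "post + TL;DR" string the same way the reward model
        # was trained, then read off its scalar score for the sequence.
        enc = tokenizer(post_with_summary, return_tensors="pt", truncation=True)
        return rm(enc.input_ids).item()  # rm: assumed to return one scalar

    # Averaging reward_score over the test set yields the "Average Reward" column.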

References

  1. Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano, "Learning to Summarize from Human Feedback", Advances in Neural Information Processing Systems, 2020.