model-index:
  - name: Junrulu/Reproduced-tulu2-dpo-13b
    results: []
datasets:
  - HuggingFaceH4/ultrafeedback_binarized
language:
  - en
base_model: allenai/tulu-2-13b

Model Card for Reproduced Tulu2 DPO 13B

This repository provides a reproduction of Tulu2-DPO-13B, finetuned from Tulu2-13B on UltraFeedback. Accordingly, we follow all licenses stated in the Tulu2 work. See our code, which is built with TRL, for more details: https://github.com/LuJunru/LLM_Finetune/tree/DPO

Performance

| Model | Size | Alignment | MT-Bench (score) | AlpacaEval 2.0 (win rate %) |
|---|---|---|---|---|
| Tulu-v2-13b 🐪 | 13B | SFT | 5.79 | 2.61 |
| Tulu-v2-dpo-13b 🐪 | 13B | DPO | 6.06 | 6.96 |
| Reproduced-tulu2-dpo-13b | 13B | DPO | 6.27 | 6.71 |

Input Format

The model is trained to use the following format (note the newlines):

<|user|>
Your message here!
<|assistant|>

For best results, format all inputs in this manner. Make sure to include a newline after <|assistant|>; omitting it can noticeably affect generation quality.
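The format above can be produced with a small helper. This is a minimal sketch for a single-turn prompt; the function name `format_prompt` is hypothetical and not part of this repository.

```python
def format_prompt(user_message: str) -> str:
    """Wrap a single user message in the chat format described above.

    Hypothetical helper for illustration; it only handles one user turn.
    """
    # The trailing newline after <|assistant|> is intentional: the model
    # was trained with it, and the model's reply starts right after it.
    return f"<|user|>\n{user_message}\n<|assistant|>\n"


prompt = format_prompt("Your message here!")
```

The resulting string can then be tokenized and passed to the model for generation in the usual way.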

Training hyperparameters

The following hyperparameters were used during DPO training:

  • learning_rate: 1e-6 * sqrt(Num of Nodes)
  • total_train_batch_size: 128 * Num of Nodes
  • optimizer: AdamW with beta1 0.9, beta2 0.999 and epsilon 1e-8
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • weight_decay: 0.0
  • num_epochs: 3.0
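
The learning rate and total batch size above depend on the number of training nodes. As a rough sketch of that scaling rule (the function name `scaled_hyperparameters` is hypothetical, and the values not tied to node count are copied unchanged from the list):

```python
import math


def scaled_hyperparameters(num_nodes: int) -> dict:
    """Return the DPO hyperparameters listed above for a given node count.

    Hypothetical helper for illustration only. The learning rate grows with
    the square root of the node count; the batch size grows linearly.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return {
        "learning_rate": 1e-6 * math.sqrt(num_nodes),
        "total_train_batch_size": 128 * num_nodes,
        "lr_scheduler_type": "linear",
        "lr_scheduler_warmup_ratio": 0.1,
        "weight_decay": 0.0,
        "num_epochs": 3.0,
    }


single_node = scaled_hyperparameters(1)
four_nodes = scaled_hyperparameters(4)
```

With one node this gives a learning rate of 1e-6 and a total batch size of 128; with four nodes, the learning rate doubles and the batch size quadruples.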