---
model-index:
- name: Junrulu/Reproduced-tulu2-dpo-13b
  results: []
datasets:
- HuggingFaceH4/ultrafeedback_binarized
language:
- en
base_model: allenai/tulu-2-13b
---
# Model Card for Reproduced Tulu2 DPO 13B

This repository provides a reproduction of Tulu2-DPO-13B, finetuned from Tulu2-13B on the UltraFeedback dataset. Accordingly, we comply with all licenses mentioned in Tulu2's work. Check our code for more details: https://github.com/LuJunru/LLM_Finetune/tree/DPO, which is built with TRL.
## Performance

| Model | Size | Alignment | MT-Bench (score) | AlpacaEval 2.0 (win rate %) |
|---|---|---|---|---|
| Tulu-v2-13b 🐪 | 13B | SFT | 5.79 | 2.61 |
| Tulu-v2-dpo-13b 🐪 | 13B | DPO | 6.06 | 6.96 |
| Reproduced-tulu2-dpo-13b | 13B | DPO | 6.27 | 6.71 |
## Input Format

The model is trained to use the following format (note the newlines):

```
<|user|>
Your message here!
<|assistant|>
```

For best results, format all inputs in this manner. Make sure to include a newline after `<|assistant|>`; this can affect generation quality quite a bit.
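Below is a minimal generation sketch using the standard `transformers` API; the prompt text and sampling settings are illustrative choices, not tuned values.

```python
# Minimal generation sketch with the standard transformers API.
# Sampling settings here are illustrative, not tuned values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Junrulu/Reproduced-tulu2-dpo-13b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Note the trailing newline after <|assistant|>; it matters for quality.
prompt = "<|user|>\nWrite a haiku about reproducibility.\n<|assistant|>\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```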
## Training hyperparameters

The following hyperparameters were used during DPO training (see the sketch after this list):
- learning_rate: 1e-6 * sqrt(Num of Nodes)
- total_train_batch_size: 128 * Num of Nodes
- optimizer: AdamW with beta1 0.9, beta2 0.999 and epsilon 1e-8
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- weight_decay: 0.0
- num_epochs: 3.0
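As a rough illustration of how these settings map onto TRL, here is a minimal single-node sketch assuming a recent TRL release with `DPOConfig`/`DPOTrainer`. Argument names have shifted across TRL versions, and the per-device batch split below is a hypothetical example; see the linked repository for the actual training script.

```python
# Minimal single-node DPO sketch with TRL; illustrative only. See the linked
# repository for the actual multi-node training code.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("allenai/tulu-2-13b")
tokenizer = AutoTokenizer.from_pretrained("allenai/tulu-2-13b")

# Binarized preference pairs (prompt / chosen / rejected).
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(
    output_dir="reproduced-tulu2-dpo-13b",
    learning_rate=1e-6,             # scaled by sqrt(num of nodes) in multi-node runs
    per_device_train_batch_size=4,  # hypothetical split: 8 GPUs x 4 x 4 accumulation = 128
    gradient_accumulation_steps=4,
    num_train_epochs=3.0,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    weight_decay=0.0,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
)
trainer.train()
```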