---
model-index:
- name: Junrulu/Reproduced-tulu2-dpo-13b
  results: []
datasets:
- HuggingFaceH4/ultrafeedback_binarized
language:
- en
base_model: allenai/tulu-2-13b
---

# Model Card for Reproduced Tulu2 DPO 13B

This repository provides a reproduction of Tulu2-DPO-13B, fine-tuned from [Tulu2-13B](https://huggingface.co/allenai/tulu-2-13b) on [Ultrafeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized). Accordingly, we follow all licenses mentioned in Tulu2's work. See our code for more details: https://github.com/LuJunru/LLM_Finetune/tree/DPO, which is built with [TRL](https://github.com/huggingface/trl/tree/main).

## Performance

| Model | Size | Alignment | MT-Bench (score) | AlpacaEval 2.0 (win rate %) |
|-------|------|-----------|------------------|-----------------------------|
| **Tulu-v2-13b** 🐪 | **13B** | **SFT** | **5.79** | **2.61** |
| **Tulu-v2-dpo-13b** 🐪 | **13B** | **DPO** | **6.06** | **6.96** |
| **Reproduced-tulu2-dpo-13b** | **13B** | **DPO** | **6.27** | **6.71** |

## Input Format

The model is trained to use the following format (note the newlines):

```
<|user|>
Your message here!
<|assistant|>
```

For best results, format all inputs in this manner. **Make sure to include a newline after `<|assistant|>`; this can affect generation quality quite a bit.**

## Training hyperparameters

The following hyperparameters were used during DPO training:
- learning_rate: 1e-6 * sqrt(num of nodes)
- total_train_batch_size: 128 * num of nodes
- optimizer: AdamW with beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- weight_decay: 0.0
- num_epochs: 3.0
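
As a minimal sketch of the input format described above, the following helper (not part of the original repository; `format_tulu_prompt` is a hypothetical name introduced here for illustration) renders a user message into the `<|user|>`/`<|assistant|>` template, including the required trailing newline after `<|assistant|>`:

```python
def format_tulu_prompt(user_message: str) -> str:
    """Render a single-turn prompt in the Tulu chat format.

    Illustrative helper only: it writes the <|user|> and <|assistant|>
    markers described in the model card, each followed by a newline,
    since the trailing newline after <|assistant|> affects generation
    quality.
    """
    return f"<|user|>\n{user_message}\n<|assistant|>\n"


prompt = format_tulu_prompt("Your message here!")
print(prompt)
```

The resulting string can be passed directly to a standard text-generation pipeline as the prompt.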