---
model-index:
- name: Reproduced-tulu2-dpo-13b
  results: []
datasets:
- HuggingFaceH4/ultrafeedback_binarized
language:
- en
base_model: allenai/tulu-2-13b
---

# Model Card for Reproduced Tulu2 DPO 13B

- This repository provides a reproduction of Tulu2-DPO-13B, finetuned from [Tulu2-13B](https://huggingface.co/allenai/tulu-2-13b) on [UltraFeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized).
- Accordingly, we follow all licenses referenced in the Tulu2 work.
- See our code for more details: https://github.com/LuJunru/LLM_Finetune/tree/DPO. The code is built with [TRL](https://github.com/huggingface/trl/tree/main).

## Performance

| Model | Size | Alignment | MT-Bench (score) | AlpacaEval (win rate %) |
|-------|------|-----------|------------------|-------------------------|
| **Tulu2-13b** | **13B** | **SFT** | **6.70** | **78.9** |
| **Tulu2-dpo-13b** | **13B** | **DPO** | **7.00** | **89.5** |
| **Reproduced-Tulu2-dpo-13b** | **13B** | **DPO** | **?** | **?** |

## Input Format

The model is trained to use the following format (note the newlines):
```
<|user|>
Your message here!
<|assistant|>
```

For best results, format all inputs in this manner. **Make sure to include a newline after `<|assistant|>`; it can affect generation quality quite a bit.** A minimal generation sketch is included at the end of this card.

## Training hyperparameters

The following hyperparameters were used during DPO training:
- learning_rate: 1e-6 * sqrt(Num of Nodes)
- total_train_batch_size: 128 * Num of Nodes
- optimizer: AdamW with default values
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- weight_decay: 0.05
- num_epochs: 3.0
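
For reference, the sketch below shows how a DPO run with these hyperparameters could be set up with TRL's `DPOTrainer`. It is not the exact training script (see the repository linked above for that); `output_dir`, the per-device batch size, and the gradient-accumulation factor are illustrative assumptions, and argument names may differ across TRL versions.

```python
# A minimal DPO sketch with TRL, assuming a recent trl release.
# This is NOT the exact training script used for this model.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "allenai/tulu-2-13b"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# UltraFeedback binarized preference pairs (chosen/rejected).
train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

args = DPOConfig(
    output_dir="reproduced-tulu2-dpo-13b",  # hypothetical path
    learning_rate=1e-6,             # scaled by sqrt(num nodes) in multi-node runs
    per_device_train_batch_size=4,  # e.g. 4 x 4 accumulation x 8 GPUs = 128 per node
    gradient_accumulation_steps=4,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    weight_decay=0.05,
    num_train_epochs=3.0,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` in older trl versions
)
trainer.train()
```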
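
## Example usage

The sketch below illustrates the input format described above during generation. The model id is a placeholder for this repository's id, and the sampling parameters are illustrative.

```python
# A minimal generation sketch; replace model_id with this repository's id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-namespace/Reproduced-tulu2-dpo-13b"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Note the trailing newline after <|assistant|>; it matters for quality.
prompt = "<|user|>\nYour message here!\n<|assistant|>\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```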