---
model-index:
- name: Junrulu/Reproduced-tulu2-dpo-13b
results: []
datasets:
- HuggingFaceH4/ultrafeedback_binarized
- Junrulu/Reproduced-tulu2-test-sets
language:
- en
base_model: allenai/tulu-2-13b
---
# Model Card for Reproduced Tulu2 DPO 13B
This repository provides a reproduced version of Tulu2-DPO-13B, fine-tuned from [Tulu2-13B](https://huggingface.co/allenai/tulu-2-13b) on [Ultrafeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized). We therefore follow all licenses stated in the Tulu2 work. See our code for more details: https://github.com/LuJunru/LLM_Finetune/tree/DPO, which is built with [TRL](https://github.com/huggingface/trl/tree/main).
## Performance
| Model | Size | Alignment | MT-Bench (score) | AlpacaEval 2.0 (win rate %) |
|-------------|-----|----|---------------|--------------|
| **Tulu-v2-13b** 🐪 | **13B** | **SFT** | **5.79** | **2.61** |
| **Tulu-v2-dpo-13b** 🐪 | **13B** | **DPO** | **6.06** | **6.96** |
| **Reproduced-tulu2-dpo-13b** | **13B** | **DPO** | **6.27** | **6.71** |
## Input Format
The model is trained to use the following format (note the newlines):
```
<|user|>
Your message here!
<|assistant|>
```
For best results, format all inputs in this manner. **Make sure to include a newline after `<|assistant|>`; this can affect generation quality quite a bit.** Note: if you fine-tune with this chat template, evaluate and test with the same template; conversely, if you do not plan to use the template at test time, fine-tune without it as well. Any mismatch between the chat templates used during training and testing can noticeably hurt the final performance.
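As a minimal illustration, the template above can be applied with a small helper (`format_prompt` is a hypothetical name for this sketch, not part of the released code):

```python
def format_prompt(user_message: str) -> str:
    # Tulu-style template; note the required trailing newline after <|assistant|>
    return f"<|user|>\n{user_message}\n<|assistant|>\n"

print(format_prompt("Your message here!"))
```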
## Training hyperparameters
The following hyperparameters were used during DPO training:
- DPO beta: 0.1
- learning_rate: 1e-6 * sqrt(Num of Nodes)
- total_train_batch_size: 128 * Num of Nodes
- optimizer: AdamW with beta1 0.9, beta2 0.999 and epsilon 1e-8
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- Weight Decay: 0.0
- num_epochs: 3.0
- The input format described above is applied to all training samples
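The node-dependent scaling rules above can be sketched as follows (`scaled_hparams` is a hypothetical helper for illustration, assuming the learning rate scales with the square root of the node count and the batch size scales linearly):

```python
import math

def scaled_hparams(num_nodes: int) -> tuple[float, int]:
    # base values from this card: lr 1e-6, total batch size 128 per node group
    learning_rate = 1e-6 * math.sqrt(num_nodes)
    total_train_batch_size = 128 * num_nodes
    return learning_rate, total_train_batch_size
```

For example, training on 4 nodes would use a learning rate of 2e-6 and a total batch size of 512 under these rules.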