---
model-index:
- name: Junrulu/Reproduced-tulu2-dpo-13b
  results: []
datasets:
- HuggingFaceH4/ultrafeedback_binarized
- Junrulu/Reproduced-tulu2-test-sets
language:
- en
base_model: allenai/tulu-2-13b
---

# Model Card for Reproduced Tulu2 DPO 13B

This repository provides a reproduction of Tulu2-DPO-13B, finetuned from [Tulu2-13B](https://huggingface.co/allenai/tulu-2-13b) on [Ultrafeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized). We therefore follow all licenses mentioned in Tulu2's work. See our code for more details: https://github.com/LuJunru/LLM_Finetune/tree/DPO, which is built with [TRL](https://github.com/huggingface/trl/tree/main).
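For reference, here is a minimal sketch of loading the model with Transformers. The repo id is taken from this card; the dtype and device settings are illustrative assumptions, not prescriptions.

```python
# Minimal loading sketch (illustrative; adjust dtype/device to your hardware).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Junrulu/Reproduced-tulu2-dpo-13b"  # repo id from this card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 fits on your GPU(s)
    device_map="auto",
)
```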

## Performance

| Model | Size | Alignment | MT-Bench (score) | AlpacaEval 2.0 (win rate %) |
|-------------|-----|----|---------------|--------------|
| **Tulu-v2-13b** 🐪 | **13B** | **SFT** | **5.79** | **2.61** |
| **Tulu-v2-dpo-13b** 🐪 | **13B** | **DPO** | **6.06** | **6.96** |
| **Reproduced-tulu2-dpo-13b** | **13B** | **DPO** | **6.27** | **6.71** |

## Input Format

The model is trained to use the following format (note the newlines):
```
<|user|>
Your message here!
<|assistant|>
```

For best results, format all inputs in this manner. **Make sure to include a newline after `<|assistant|>`; this can affect generation quality quite a bit.** Note: if you fine-tune with this chat template, evaluate and test with the same template; conversely, if you do not plan to use the template at test time, fine-tune without it. Any mismatch of the chat template between training and testing can noticeably hurt final performance.
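As a concrete illustration, here is one way to build a prompt in that format and generate a reply. This is a minimal sketch: `model` and `tokenizer` are loaded as shown above, and the sampling parameters are illustrative.

```python
# Build the prompt exactly as documented; the newline after <|assistant|> matters.
prompt = "<|user|>\nGive me three facts about llamas.\n<|assistant|>\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,  # illustrative limit
    do_sample=True,      # illustrative sampling settings
    temperature=0.7,
)
# Strip the prompt tokens before decoding so only the reply is printed.
reply = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(reply)
```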

## Training hyperparameters

The following hyperparameters were used during DPO training (a hedged TRL sketch follows the list):
- DPO beta: 0.1
- learning_rate: 1e-6 * sqrt(Num of Nodes)
- total_train_batch_size: 128 * Num of Nodes
- optimizer: AdamW with beta1 0.9, beta2 0.999 and epsilon 1e-8
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- weight_decay: 0.0
- num_epochs: 3.0
- The input format above is additionally applied to all training samples
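For orientation, here is a hedged sketch of how these settings map onto TRL's DPO trainer. It assumes a recent TRL release with `DPOConfig` (argument names vary across versions), takes the model and dataset identifiers from this card, and uses illustrative per-device batch sizes; it is not the exact training script, which lives in the repository linked above.

```python
# Hedged sketch mapping the hyperparameters above onto TRL's DPOTrainer.
# Assumes a recent TRL release with DPOConfig; names differ across versions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("allenai/tulu-2-13b")
tokenizer = AutoTokenizer.from_pretrained("allenai/tulu-2-13b")
train_dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

config = DPOConfig(
    output_dir="reproduced-tulu2-dpo-13b",
    beta=0.1,                       # DPO beta
    learning_rate=1e-6,             # scale by sqrt(num nodes) for multi-node runs
    per_device_train_batch_size=4,  # illustrative: pick so the global batch is 128 * num nodes
    gradient_accumulation_steps=4,  # illustrative
    num_train_epochs=3.0,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    weight_decay=0.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)

trainer = DPOTrainer(
    model=model,            # the reference model is created automatically if omitted
    args=config,
    train_dataset=train_dataset,
    tokenizer=tokenizer,    # newer TRL versions use processing_class= instead
)
trainer.train()
```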