Wenboz committed on
Commit
b62ce55
·
verified ·
1 Parent(s): 98fad4a

Model save

Files changed (4)
  1. README.md +82 -0
  2. all_results.json +9 -0
  3. train_results.json +9 -0
  4. trainer_state.json +1626 -0
README.md ADDED
@@ -0,0 +1,82 @@
+ ---
+ base_model: princeton-nlp/Llama-3-Base-8B-SFT
+ library_name: peft
+ tags:
+ - trl
+ - dpo
+ - generated_from_trainer
+ model-index:
+ - name: llama3-dpo-lora
+   results: []
+ ---
+
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ # llama3-dpo-lora
+
+ This model is a fine-tuned version of [princeton-nlp/Llama-3-Base-8B-SFT](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT) on an unspecified dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.5190
+ - Rewards/chosen: -0.1399
+ - Rewards/rejected: -0.9459
+ - Rewards/accuracies: 0.7260
+ - Rewards/margins: 0.8060
+ - Logps/rejected: -283.9163
+ - Logps/chosen: -291.1600
+ - Logits/rejected: -0.3917
+ - Logits/chosen: -0.4219
+
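The reward figures above are DPO's implicit rewards, computed against the frozen SFT reference model. A minimal sketch of the formulation, assuming TRL's default reward scale `beta = 0.1` (the value used for this run is not recorded in the card; the log-probabilities below are hypothetical, chosen only to land near the reported magnitudes):

```python
import math

def implicit_reward(beta, policy_logp, ref_logp):
    """DPO implicit reward: beta * (response log-prob under the policy
    minus response log-prob under the frozen reference model)."""
    return beta * (policy_logp - ref_logp)

def dpo_pair_stats(beta, pol_chosen, ref_chosen, pol_rejected, ref_rejected):
    """Rewards, margin, and DPO loss for one (chosen, rejected) pair."""
    r_chosen = implicit_reward(beta, pol_chosen, ref_chosen)
    r_rejected = implicit_reward(beta, pol_rejected, ref_rejected)
    margin = r_chosen - r_rejected
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
    return r_chosen, r_rejected, margin, loss

# Hypothetical summed log-probs, for illustration only.
r_c, r_r, margin, loss = dpo_pair_stats(
    beta=0.1,
    pol_chosen=-291.0, ref_chosen=-289.6,
    pol_rejected=-284.0, ref_rejected=-274.5,
)
```

Rewards/margins is the batch mean of `margin`, and Rewards/accuracies is the fraction of pairs with `margin > 0`.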
+ ## Model description
+
+ More information needed
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ## Training and evaluation data
+
+ More information needed
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 5e-06
+ - train_batch_size: 1
+ - eval_batch_size: 4
+ - seed: 42
+ - distributed_type: multi-GPU
+ - num_devices: 4
+ - gradient_accumulation_steps: 16
+ - total_train_batch_size: 64
+ - total_eval_batch_size: 16
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: cosine
+ - lr_scheduler_warmup_ratio: 0.1
+ - num_epochs: 1
+
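The total batch sizes listed are derived rather than set directly; the arithmetic, as a quick sanity check:

```python
# Effective batch sizes implied by the hyperparameters above.
train_batch_size = 1               # per device, per step
eval_batch_size = 4                # per device
num_devices = 4
gradient_accumulation_steps = 16

total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices  # no accumulation at eval time

print(total_train_batch_size, total_eval_batch_size)  # 64 16
```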
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
+ |:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
+ | 0.6297 | 0.1047 | 100 | 0.6140 | 0.1358 | -0.1277 | 0.6960 | 0.2634 | -275.7340 | -288.4034 | -0.5479 | -0.5526 |
+ | 0.5676 | 0.2094 | 200 | 0.5569 | -0.1144 | -0.6599 | 0.7000 | 0.5455 | -281.0560 | -290.9051 | -0.4945 | -0.5116 |
+ | 0.5414 | 0.3141 | 300 | 0.5403 | -0.3808 | -1.0461 | 0.7260 | 0.6652 | -284.9180 | -293.5698 | -0.4540 | -0.4775 |
+ | 0.5124 | 0.4187 | 400 | 0.5341 | -0.2337 | -0.9896 | 0.7040 | 0.7559 | -284.3532 | -292.0986 | -0.4243 | -0.4516 |
+ | 0.5529 | 0.5234 | 500 | 0.5260 | -0.2177 | -1.0037 | 0.7240 | 0.7861 | -284.4948 | -291.9380 | -0.3995 | -0.4290 |
+ | 0.53 | 0.6281 | 600 | 0.5244 | -0.0687 | -0.8583 | 0.7200 | 0.7895 | -283.0403 | -290.4489 | -0.4028 | -0.4317 |
+ | 0.5028 | 0.7328 | 700 | 0.5190 | -0.3357 | -1.1360 | 0.7320 | 0.8003 | -285.8177 | -293.1184 | -0.3874 | -0.4179 |
+ | 0.5347 | 0.8375 | 800 | 0.5191 | -0.1404 | -0.9419 | 0.7320 | 0.8015 | -283.8760 | -291.1650 | -0.3924 | -0.4225 |
+ | 0.4783 | 0.9422 | 900 | 0.5190 | -0.1399 | -0.9459 | 0.7260 | 0.8060 | -283.9163 | -291.1600 | -0.3917 | -0.4219 |
+
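The rows above are a rendering of the `log_history` list saved in the accompanying trainer_state.json. A small sketch for pulling the best evaluation step out of such a file (the path is simply wherever the trainer state was saved):

```python
import json

def best_eval_step(trainer_state):
    """Return (step, eval_loss) for the lowest validation loss in log_history."""
    evals = [e for e in trainer_state["log_history"] if "eval_loss" in e]
    best = min(evals, key=lambda e: e["eval_loss"])
    return best["step"], best["eval_loss"]

# Usage against a saved state file:
# with open("trainer_state.json") as f:
#     print(best_eval_step(json.load(f)))
```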
+
+ ### Framework versions
+
+ - PEFT 0.7.1
+ - Transformers 4.44.2
+ - Pytorch 2.2.1+cu121
+ - Datasets 2.14.6
+ - Tokenizers 0.19.1
all_results.json ADDED
@@ -0,0 +1,9 @@
+ {
+     "epoch": 0.9997382884061764,
+     "total_flos": 0.0,
+     "train_loss": 0.543355998318857,
+     "train_runtime": 19333.2554,
+     "train_samples": 61135,
+     "train_samples_per_second": 3.162,
+     "train_steps_per_second": 0.049
+ }
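The two throughput fields are derived from the runtime, the sample count, and the optimizer step count (955 global steps per trainer_state.json); a quick reconstruction:

```python
train_samples = 61135
train_runtime = 19333.2554   # seconds
global_steps = 955           # optimizer steps, from trainer_state.json

samples_per_second = round(train_samples / train_runtime, 3)
steps_per_second = round(global_steps / train_runtime, 3)

print(samples_per_second, steps_per_second)  # 3.162 0.049
```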
train_results.json ADDED
@@ -0,0 +1,9 @@
+ {
+     "epoch": 0.9997382884061764,
+     "total_flos": 0.0,
+     "train_loss": 0.543355998318857,
+     "train_runtime": 19333.2554,
+     "train_samples": 61135,
+     "train_samples_per_second": 3.162,
+     "train_steps_per_second": 0.049
+ }
trainer_state.json ADDED
@@ -0,0 +1,1626 @@
1
+ {
2
+ "best_metric": null,
3
+ "best_model_checkpoint": null,
4
+ "epoch": 0.9997382884061764,
5
+ "eval_steps": 100,
6
+ "global_step": 955,
7
+ "is_hyper_param_search": false,
8
+ "is_local_process_zero": true,
9
+ "is_world_process_zero": true,
10
+ "log_history": [
11
+ {
12
+ "epoch": 0.0010468463752944255,
13
+ "grad_norm": 5.5,
14
+ "learning_rate": 5.208333333333333e-08,
15
+ "logits/chosen": -0.32969313859939575,
16
+ "logits/rejected": -0.3478139042854309,
17
+ "logps/chosen": -282.7626953125,
18
+ "logps/rejected": -210.6118621826172,
19
+ "loss": 0.6931,
20
+ "rewards/accuracies": 0.0,
21
+ "rewards/chosen": 0.0,
22
+ "rewards/margins": 0.0,
23
+ "rewards/rejected": 0.0,
24
+ "step": 1
25
+ },
26
+ {
27
+ "epoch": 0.010468463752944255,
28
+ "grad_norm": 6.0,
29
+ "learning_rate": 5.208333333333334e-07,
30
+ "logits/chosen": -0.5201655030250549,
31
+ "logits/rejected": -0.5292737483978271,
32
+ "logps/chosen": -312.750244140625,
33
+ "logps/rejected": -275.4716491699219,
34
+ "loss": 0.6919,
35
+ "rewards/accuracies": 0.4652777910232544,
36
+ "rewards/chosen": 0.0034895986318588257,
37
+ "rewards/margins": 0.002287701005116105,
38
+ "rewards/rejected": 0.001201898674480617,
39
+ "step": 10
40
+ },
41
+ {
42
+ "epoch": 0.02093692750588851,
43
+ "grad_norm": 5.5,
44
+ "learning_rate": 1.0416666666666667e-06,
45
+ "logits/chosen": -0.4813454747200012,
46
+ "logits/rejected": -0.5021841526031494,
47
+ "logps/chosen": -304.3456726074219,
48
+ "logps/rejected": -268.962158203125,
49
+ "loss": 0.6934,
50
+ "rewards/accuracies": 0.5375000238418579,
51
+ "rewards/chosen": -0.0017953884089365602,
52
+ "rewards/margins": 0.0013109362917020917,
53
+ "rewards/rejected": -0.003106324467808008,
54
+ "step": 20
55
+ },
56
+ {
57
+ "epoch": 0.031405391258832765,
58
+ "grad_norm": 5.78125,
59
+ "learning_rate": 1.5625e-06,
60
+ "logits/chosen": -0.4880186915397644,
61
+ "logits/rejected": -0.4969080984592438,
62
+ "logps/chosen": -287.50286865234375,
63
+ "logps/rejected": -249.9375457763672,
64
+ "loss": 0.688,
65
+ "rewards/accuracies": 0.5562499761581421,
66
+ "rewards/chosen": 0.010473029688000679,
67
+ "rewards/margins": 0.01431797631084919,
68
+ "rewards/rejected": -0.003844945225864649,
69
+ "step": 30
70
+ },
71
+ {
72
+ "epoch": 0.04187385501177702,
73
+ "grad_norm": 5.375,
74
+ "learning_rate": 2.0833333333333334e-06,
75
+ "logits/chosen": -0.46122869849205017,
76
+ "logits/rejected": -0.4958384037017822,
77
+ "logps/chosen": -303.1500549316406,
78
+ "logps/rejected": -241.998046875,
79
+ "loss": 0.683,
80
+ "rewards/accuracies": 0.6499999761581421,
81
+ "rewards/chosen": 0.03339455649256706,
82
+ "rewards/margins": 0.0322079211473465,
83
+ "rewards/rejected": 0.0011866316199302673,
84
+ "step": 40
85
+ },
86
+ {
87
+ "epoch": 0.05234231876472128,
88
+ "grad_norm": 4.59375,
89
+ "learning_rate": 2.604166666666667e-06,
90
+ "logits/chosen": -0.5217183828353882,
91
+ "logits/rejected": -0.5492846369743347,
92
+ "logps/chosen": -302.4110412597656,
93
+ "logps/rejected": -280.4798278808594,
94
+ "loss": 0.6784,
95
+ "rewards/accuracies": 0.625,
96
+ "rewards/chosen": 0.07218249142169952,
97
+ "rewards/margins": 0.04362189769744873,
98
+ "rewards/rejected": 0.028560593724250793,
99
+ "step": 50
100
+ },
101
+ {
102
+ "epoch": 0.06281078251766553,
103
+ "grad_norm": 5.03125,
104
+ "learning_rate": 3.125e-06,
105
+ "logits/chosen": -0.5444208979606628,
106
+ "logits/rejected": -0.5450385212898254,
107
+ "logps/chosen": -288.3908996582031,
108
+ "logps/rejected": -252.4159393310547,
109
+ "loss": 0.662,
110
+ "rewards/accuracies": 0.625,
111
+ "rewards/chosen": 0.10959690809249878,
112
+ "rewards/margins": 0.04304002225399017,
113
+ "rewards/rejected": 0.0665568858385086,
114
+ "step": 60
115
+ },
116
+ {
117
+ "epoch": 0.07327924627060979,
118
+ "grad_norm": 5.21875,
119
+ "learning_rate": 3.6458333333333333e-06,
120
+ "logits/chosen": -0.5294634699821472,
121
+ "logits/rejected": -0.5446035265922546,
122
+ "logps/chosen": -284.2558288574219,
123
+ "logps/rejected": -261.0161437988281,
124
+ "loss": 0.6465,
125
+ "rewards/accuracies": 0.6875,
126
+ "rewards/chosen": 0.1622026413679123,
127
+ "rewards/margins": 0.1133759468793869,
128
+ "rewards/rejected": 0.04882669448852539,
129
+ "step": 70
130
+ },
131
+ {
132
+ "epoch": 0.08374771002355404,
133
+ "grad_norm": 4.78125,
134
+ "learning_rate": 4.166666666666667e-06,
135
+ "logits/chosen": -0.46931973099708557,
136
+ "logits/rejected": -0.4959704279899597,
137
+ "logps/chosen": -285.55169677734375,
138
+ "logps/rejected": -271.7218017578125,
139
+ "loss": 0.6312,
140
+ "rewards/accuracies": 0.706250011920929,
141
+ "rewards/chosen": 0.2240675389766693,
142
+ "rewards/margins": 0.17874710261821747,
143
+ "rewards/rejected": 0.04532044008374214,
144
+ "step": 80
145
+ },
146
+ {
147
+ "epoch": 0.0942161737764983,
148
+ "grad_norm": 5.90625,
149
+ "learning_rate": 4.6875000000000004e-06,
150
+ "logits/chosen": -0.5021494626998901,
151
+ "logits/rejected": -0.5237849950790405,
152
+ "logps/chosen": -328.7523193359375,
153
+ "logps/rejected": -293.5586853027344,
154
+ "loss": 0.6197,
155
+ "rewards/accuracies": 0.71875,
156
+ "rewards/chosen": 0.19337503612041473,
157
+ "rewards/margins": 0.22977392375469208,
158
+ "rewards/rejected": -0.036398887634277344,
159
+ "step": 90
160
+ },
161
+ {
162
+ "epoch": 0.10468463752944256,
163
+ "grad_norm": 5.65625,
164
+ "learning_rate": 4.9997324926814375e-06,
165
+ "logits/chosen": -0.5387491583824158,
166
+ "logits/rejected": -0.5343273282051086,
167
+ "logps/chosen": -274.4442443847656,
168
+ "logps/rejected": -288.30035400390625,
169
+ "loss": 0.6297,
170
+ "rewards/accuracies": 0.637499988079071,
171
+ "rewards/chosen": 0.221399188041687,
172
+ "rewards/margins": 0.2429426610469818,
173
+ "rewards/rejected": -0.0215434692800045,
174
+ "step": 100
175
+ },
176
+ {
177
+ "epoch": 0.10468463752944256,
178
+ "eval_logits/chosen": -0.5525645613670349,
179
+ "eval_logits/rejected": -0.5478540062904358,
180
+ "eval_logps/chosen": -288.4034118652344,
181
+ "eval_logps/rejected": -275.7340087890625,
182
+ "eval_loss": 0.6140495538711548,
183
+ "eval_rewards/accuracies": 0.6959999799728394,
184
+ "eval_rewards/chosen": 0.13576927781105042,
185
+ "eval_rewards/margins": 0.26342061161994934,
186
+ "eval_rewards/rejected": -0.1276513636112213,
187
+ "eval_runtime": 351.3471,
188
+ "eval_samples_per_second": 5.692,
189
+ "eval_steps_per_second": 0.356,
190
+ "step": 100
191
+ },
192
+ {
193
+ "epoch": 0.11515310128238682,
194
+ "grad_norm": 4.25,
195
+ "learning_rate": 4.996723692767927e-06,
196
+ "logits/chosen": -0.5775566101074219,
197
+ "logits/rejected": -0.603683590888977,
198
+ "logps/chosen": -287.82733154296875,
199
+ "logps/rejected": -275.31671142578125,
200
+ "loss": 0.606,
201
+ "rewards/accuracies": 0.71875,
202
+ "rewards/chosen": 0.15195702016353607,
203
+ "rewards/margins": 0.34537211060523987,
204
+ "rewards/rejected": -0.1934150755405426,
205
+ "step": 110
206
+ },
207
+ {
208
+ "epoch": 0.12562156503533106,
209
+ "grad_norm": 4.96875,
210
+ "learning_rate": 4.9903757462135984e-06,
211
+ "logits/chosen": -0.5295812487602234,
212
+ "logits/rejected": -0.5440601110458374,
213
+ "logps/chosen": -260.30078125,
214
+ "logps/rejected": -251.58401489257812,
215
+ "loss": 0.5924,
216
+ "rewards/accuracies": 0.6625000238418579,
217
+ "rewards/chosen": 0.02338511124253273,
218
+ "rewards/margins": 0.281913697719574,
219
+ "rewards/rejected": -0.25852862000465393,
220
+ "step": 120
221
+ },
222
+ {
223
+ "epoch": 0.1360900287882753,
224
+ "grad_norm": 5.46875,
225
+ "learning_rate": 4.980697142834315e-06,
226
+ "logits/chosen": -0.49980980157852173,
227
+ "logits/rejected": -0.5171049237251282,
228
+ "logps/chosen": -300.2707214355469,
229
+ "logps/rejected": -336.19036865234375,
230
+ "loss": 0.5933,
231
+ "rewards/accuracies": 0.65625,
232
+ "rewards/chosen": 0.02033509686589241,
233
+ "rewards/margins": 0.3162926733493805,
234
+ "rewards/rejected": -0.2959575653076172,
235
+ "step": 130
236
+ },
237
+ {
238
+ "epoch": 0.14655849254121958,
239
+ "grad_norm": 4.75,
240
+ "learning_rate": 4.967700826904229e-06,
241
+ "logits/chosen": -0.5714845061302185,
242
+ "logits/rejected": -0.5790292024612427,
243
+ "logps/chosen": -280.66845703125,
244
+ "logps/rejected": -273.6711120605469,
245
+ "loss": 0.5811,
246
+ "rewards/accuracies": 0.6937500238418579,
247
+ "rewards/chosen": -0.004461909644305706,
248
+ "rewards/margins": 0.407376229763031,
249
+ "rewards/rejected": -0.4118381440639496,
250
+ "step": 140
251
+ },
252
+ {
253
+ "epoch": 0.15702695629416383,
254
+ "grad_norm": 4.84375,
255
+ "learning_rate": 4.951404179843963e-06,
256
+ "logits/chosen": -0.5678334832191467,
257
+ "logits/rejected": -0.5395095944404602,
258
+ "logps/chosen": -306.054443359375,
259
+ "logps/rejected": -276.5906066894531,
260
+ "loss": 0.5777,
261
+ "rewards/accuracies": 0.7124999761581421,
262
+ "rewards/chosen": 0.13024172186851501,
263
+ "rewards/margins": 0.4546026587486267,
264
+ "rewards/rejected": -0.3243609070777893,
265
+ "step": 150
266
+ },
267
+ {
268
+ "epoch": 0.16749542004710807,
269
+ "grad_norm": 4.375,
270
+ "learning_rate": 4.931828996974498e-06,
271
+ "logits/chosen": -0.5105162858963013,
272
+ "logits/rejected": -0.5018196702003479,
273
+ "logps/chosen": -296.04840087890625,
274
+ "logps/rejected": -270.5509033203125,
275
+ "loss": 0.5478,
276
+ "rewards/accuracies": 0.7437499761581421,
277
+ "rewards/chosen": 0.04980681464076042,
278
+ "rewards/margins": 0.5175832509994507,
279
+ "rewards/rejected": -0.4677763879299164,
280
+ "step": 160
281
+ },
282
+ {
283
+ "epoch": 0.17796388380005235,
284
+ "grad_norm": 4.78125,
285
+ "learning_rate": 4.909001458367867e-06,
286
+ "logits/chosen": -0.5939355492591858,
287
+ "logits/rejected": -0.5804780125617981,
288
+ "logps/chosen": -287.94708251953125,
289
+ "logps/rejected": -275.73980712890625,
290
+ "loss": 0.5789,
291
+ "rewards/accuracies": 0.7250000238418579,
292
+ "rewards/chosen": -0.19239839911460876,
293
+ "rewards/margins": 0.4369952082633972,
294
+ "rewards/rejected": -0.6293936967849731,
295
+ "step": 170
296
+ },
297
+ {
298
+ "epoch": 0.1884323475529966,
299
+ "grad_norm": 5.15625,
300
+ "learning_rate": 4.882952093833628e-06,
301
+ "logits/chosen": -0.5993765592575073,
302
+ "logits/rejected": -0.5707160830497742,
303
+ "logps/chosen": -301.9080505371094,
304
+ "logps/rejected": -265.2618713378906,
305
+ "loss": 0.555,
306
+ "rewards/accuracies": 0.706250011920929,
307
+ "rewards/chosen": -0.19447076320648193,
308
+ "rewards/margins": 0.48417288064956665,
309
+ "rewards/rejected": -0.6786437630653381,
310
+ "step": 180
311
+ },
312
+ {
313
+ "epoch": 0.19890081130594087,
314
+ "grad_norm": 5.90625,
315
+ "learning_rate": 4.853715742087947e-06,
316
+ "logits/chosen": -0.5384653806686401,
317
+ "logits/rejected": -0.5204354524612427,
318
+ "logps/chosen": -275.14886474609375,
319
+ "logps/rejected": -281.9067687988281,
320
+ "loss": 0.5706,
321
+ "rewards/accuracies": 0.6937500238418579,
322
+ "rewards/chosen": -0.033171653747558594,
323
+ "rewards/margins": 0.469027042388916,
324
+ "rewards/rejected": -0.5021986961364746,
325
+ "step": 190
326
+ },
327
+ {
328
+ "epoch": 0.2093692750588851,
329
+ "grad_norm": 6.5625,
330
+ "learning_rate": 4.821331504159906e-06,
331
+ "logits/chosen": -0.5440086722373962,
332
+ "logits/rejected": -0.5563145279884338,
333
+ "logps/chosen": -296.43060302734375,
334
+ "logps/rejected": -255.0513916015625,
335
+ "loss": 0.5676,
336
+ "rewards/accuracies": 0.7250000238418579,
337
+ "rewards/chosen": -0.12640251219272614,
338
+ "rewards/margins": 0.5171942710876465,
339
+ "rewards/rejected": -0.6435968279838562,
340
+ "step": 200
341
+ },
342
+ {
343
+ "epoch": 0.2093692750588851,
344
+ "eval_logits/chosen": -0.5116367340087891,
345
+ "eval_logits/rejected": -0.4945087134838104,
346
+ "eval_logps/chosen": -290.90509033203125,
347
+ "eval_logps/rejected": -281.0560302734375,
348
+ "eval_loss": 0.5569251179695129,
349
+ "eval_rewards/accuracies": 0.699999988079071,
350
+ "eval_rewards/chosen": -0.11436504870653152,
351
+ "eval_rewards/margins": 0.5454893708229065,
352
+ "eval_rewards/rejected": -0.6598544716835022,
353
+ "eval_runtime": 349.6273,
354
+ "eval_samples_per_second": 5.72,
355
+ "eval_steps_per_second": 0.358,
356
+ "step": 200
357
+ },
358
+ {
359
+ "epoch": 0.21983773881182936,
360
+ "grad_norm": 6.15625,
361
+ "learning_rate": 4.7858426910973435e-06,
362
+ "logits/chosen": -0.5827375650405884,
363
+ "logits/rejected": -0.5831678509712219,
364
+ "logps/chosen": -278.0469970703125,
365
+ "logps/rejected": -271.9605407714844,
366
+ "loss": 0.5508,
367
+ "rewards/accuracies": 0.737500011920929,
368
+ "rewards/chosen": -0.025282684713602066,
369
+ "rewards/margins": 0.5209798216819763,
370
+ "rewards/rejected": -0.5462625622749329,
371
+ "step": 210
372
+ },
373
+ {
374
+ "epoch": 0.23030620256477363,
375
+ "grad_norm": 5.125,
376
+ "learning_rate": 4.747296766042161e-06,
377
+ "logits/chosen": -0.5575278997421265,
378
+ "logits/rejected": -0.5416348576545715,
379
+ "logps/chosen": -317.4739990234375,
380
+ "logps/rejected": -269.8050231933594,
381
+ "loss": 0.5368,
382
+ "rewards/accuracies": 0.7124999761581421,
383
+ "rewards/chosen": 0.03362050652503967,
384
+ "rewards/margins": 0.569665789604187,
385
+ "rewards/rejected": -0.536045253276825,
386
+ "step": 220
387
+ },
388
+ {
389
+ "epoch": 0.24077466631771788,
390
+ "grad_norm": 5.6875,
391
+ "learning_rate": 4.705745280752586e-06,
392
+ "logits/chosen": -0.589636504650116,
393
+ "logits/rejected": -0.5510541200637817,
394
+ "logps/chosen": -290.55206298828125,
395
+ "logps/rejected": -287.7290954589844,
396
+ "loss": 0.5581,
397
+ "rewards/accuracies": 0.706250011920929,
398
+ "rewards/chosen": -0.08335064351558685,
399
+ "rewards/margins": 0.5679007768630981,
400
+ "rewards/rejected": -0.6512514352798462,
401
+ "step": 230
402
+ },
403
+ {
404
+ "epoch": 0.2512431300706621,
405
+ "grad_norm": 4.96875,
406
+ "learning_rate": 4.661243806657256e-06,
407
+ "logits/chosen": -0.5663197636604309,
408
+ "logits/rejected": -0.525894045829773,
409
+ "logps/chosen": -299.3938903808594,
410
+ "logps/rejected": -263.598876953125,
411
+ "loss": 0.5563,
412
+ "rewards/accuracies": 0.6812499761581421,
413
+ "rewards/chosen": -0.1382952779531479,
414
+ "rewards/margins": 0.5307878255844116,
415
+ "rewards/rejected": -0.6690832376480103,
416
+ "step": 240
417
+ },
418
+ {
419
+ "epoch": 0.26171159382360637,
420
+ "grad_norm": 5.6875,
421
+ "learning_rate": 4.613851860533367e-06,
422
+ "logits/chosen": -0.5768588781356812,
423
+ "logits/rejected": -0.5346202850341797,
424
+ "logps/chosen": -293.30059814453125,
425
+ "logps/rejected": -259.82177734375,
426
+ "loss": 0.5789,
427
+ "rewards/accuracies": 0.7250000238418579,
428
+ "rewards/chosen": -0.08053269237279892,
429
+ "rewards/margins": 0.5023098587989807,
430
+ "rewards/rejected": -0.5828425288200378,
431
+ "step": 250
432
+ },
433
+ {
434
+ "epoch": 0.2721800575765506,
435
+ "grad_norm": 5.6875,
436
+ "learning_rate": 4.563632824908252e-06,
437
+ "logits/chosen": -0.5458053946495056,
438
+ "logits/rejected": -0.5105060935020447,
439
+ "logps/chosen": -292.6302185058594,
440
+ "logps/rejected": -279.1827392578125,
441
+ "loss": 0.5404,
442
+ "rewards/accuracies": 0.78125,
443
+ "rewards/chosen": 0.12764385342597961,
444
+ "rewards/margins": 0.8003277778625488,
445
+ "rewards/rejected": -0.6726840138435364,
446
+ "step": 260
447
+ },
448
+ {
449
+ "epoch": 0.2826485213294949,
450
+ "grad_norm": 5.78125,
451
+ "learning_rate": 4.510653863290871e-06,
452
+ "logits/chosen": -0.545439600944519,
453
+ "logits/rejected": -0.5205134749412537,
454
+ "logps/chosen": -294.71783447265625,
455
+ "logps/rejected": -302.39599609375,
456
+ "loss": 0.5465,
457
+ "rewards/accuracies": 0.699999988079071,
458
+ "rewards/chosen": -0.015889674425125122,
459
+ "rewards/margins": 0.5484660267829895,
460
+ "rewards/rejected": -0.564355731010437,
461
+ "step": 270
462
+ },
463
+ {
464
+ "epoch": 0.29311698508243916,
465
+ "grad_norm": 4.78125,
466
+ "learning_rate": 4.454985830346574e-06,
467
+ "logits/chosen": -0.6070719957351685,
468
+ "logits/rejected": -0.5709980726242065,
469
+ "logps/chosen": -300.3115234375,
470
+ "logps/rejected": -283.37689208984375,
471
+ "loss": 0.5681,
472
+ "rewards/accuracies": 0.6187499761581421,
473
+ "rewards/chosen": -0.04909560829401016,
474
+ "rewards/margins": 0.496112197637558,
475
+ "rewards/rejected": -0.5452078580856323,
476
+ "step": 280
477
+ },
478
+ {
479
+ "epoch": 0.3035854488353834,
480
+ "grad_norm": 4.65625,
481
+ "learning_rate": 4.396703177135262e-06,
482
+ "logits/chosen": -0.5642537474632263,
483
+ "logits/rejected": -0.5349927544593811,
484
+ "logps/chosen": -285.5408935546875,
485
+ "logps/rejected": -256.77691650390625,
486
+ "loss": 0.5249,
487
+ "rewards/accuracies": 0.75,
488
+ "rewards/chosen": 0.10653576999902725,
489
+ "rewards/margins": 0.6360087990760803,
490
+ "rewards/rejected": -0.5294729471206665,
491
+ "step": 290
492
+ },
493
+ {
494
+ "epoch": 0.31405391258832765,
495
+ "grad_norm": 5.75,
496
+ "learning_rate": 4.335883851539693e-06,
497
+ "logits/chosen": -0.5834770798683167,
498
+ "logits/rejected": -0.544857382774353,
499
+ "logps/chosen": -296.65936279296875,
500
+ "logps/rejected": -293.7464599609375,
501
+ "loss": 0.5414,
502
+ "rewards/accuracies": 0.7124999761581421,
503
+ "rewards/chosen": -0.30995655059814453,
504
+ "rewards/margins": 0.6452124714851379,
505
+ "rewards/rejected": -0.9551690220832825,
506
+ "step": 300
507
+ },
508
+ {
509
+ "epoch": 0.31405391258832765,
510
+ "eval_logits/chosen": -0.4775179624557495,
511
+ "eval_logits/rejected": -0.45395201444625854,
512
+ "eval_logps/chosen": -293.56982421875,
513
+ "eval_logps/rejected": -284.91802978515625,
514
+ "eval_loss": 0.5403068661689758,
515
+ "eval_rewards/accuracies": 0.7260000109672546,
516
+ "eval_rewards/chosen": -0.38083866238594055,
517
+ "eval_rewards/margins": 0.6652126312255859,
518
+ "eval_rewards/rejected": -1.0460513830184937,
519
+ "eval_runtime": 349.5769,
520
+ "eval_samples_per_second": 5.721,
521
+ "eval_steps_per_second": 0.358,
522
+ "step": 300
523
+ },
524
+ {
525
+ "epoch": 0.3245223763412719,
526
+ "grad_norm": 6.28125,
527
+ "learning_rate": 4.2726091940171055e-06,
528
+ "logits/chosen": -0.5361314415931702,
529
+ "logits/rejected": -0.5678786635398865,
530
+ "logps/chosen": -296.285400390625,
531
+ "logps/rejected": -342.7236022949219,
532
+ "loss": 0.5056,
533
+ "rewards/accuracies": 0.793749988079071,
534
+ "rewards/chosen": -0.19761790335178375,
535
+ "rewards/margins": 0.8299154043197632,
536
+ "rewards/rejected": -1.0275332927703857,
537
+ "step": 310
538
+ },
539
+ {
540
+ "epoch": 0.33499084009421615,
541
+ "grad_norm": 5.21875,
542
+ "learning_rate": 4.206963828813555e-06,
543
+ "logits/chosen": -0.5878556370735168,
544
+ "logits/rejected": -0.5477277040481567,
545
+ "logps/chosen": -296.2107849121094,
546
+ "logps/rejected": -280.1604919433594,
547
+ "loss": 0.524,
548
+ "rewards/accuracies": 0.737500011920929,
549
+ "rewards/chosen": -0.1949264109134674,
550
+ "rewards/margins": 0.7900466918945312,
551
+ "rewards/rejected": -0.9849729537963867,
552
+ "step": 320
553
+ },
554
+ {
555
+ "epoch": 0.34545930384716045,
556
+ "grad_norm": 5.0625,
557
+ "learning_rate": 4.139035550786495e-06,
558
+ "logits/chosen": -0.6225862503051758,
559
+ "logits/rejected": -0.5590678453445435,
560
+ "logps/chosen": -289.2728576660156,
561
+ "logps/rejected": -260.7643737792969,
562
+ "loss": 0.5217,
563
+ "rewards/accuracies": 0.7124999761581421,
564
+ "rewards/chosen": -0.2666059136390686,
565
+ "rewards/margins": 0.6325812339782715,
566
+ "rewards/rejected": -0.8991872072219849,
567
+ "step": 330
568
+ },
569
+ {
570
+ "epoch": 0.3559277676001047,
571
+ "grad_norm": 5.96875,
572
+ "learning_rate": 4.068915207986931e-06,
573
+ "logits/chosen": -0.5779263377189636,
574
+ "logits/rejected": -0.5189130902290344,
575
+ "logps/chosen": -298.147705078125,
576
+ "logps/rejected": -258.82391357421875,
577
+ "loss": 0.5224,
578
+ "rewards/accuracies": 0.71875,
579
+ "rewards/chosen": -0.4627823233604431,
580
+ "rewards/margins": 0.7694815397262573,
581
+ "rewards/rejected": -1.2322638034820557,
582
+ "step": 340
583
+ },
584
+ {
585
+ "epoch": 0.36639623135304894,
586
+ "grad_norm": 6.21875,
587
+ "learning_rate": 3.996696580158211e-06,
588
+ "logits/chosen": -0.5267971158027649,
589
+ "logits/rejected": -0.5050491690635681,
590
+ "logps/chosen": -335.9749755859375,
591
+ "logps/rejected": -290.8840026855469,
592
+ "loss": 0.5206,
593
+ "rewards/accuracies": 0.737500011920929,
594
+ "rewards/chosen": -0.16786900162696838,
595
+ "rewards/margins": 0.7010205984115601,
596
+ "rewards/rejected": -0.8688896894454956,
597
+ "step": 350
598
+ },
599
+ {
600
+ "epoch": 0.3768646951059932,
601
+ "grad_norm": 5.3125,
602
+ "learning_rate": 3.922476253313921e-06,
603
+ "logits/chosen": -0.5035579800605774,
604
+ "logits/rejected": -0.5138710141181946,
605
+ "logps/chosen": -273.709716796875,
606
+ "logps/rejected": -296.85650634765625,
607
+ "loss": 0.5587,
608
+ "rewards/accuracies": 0.643750011920929,
609
+ "rewards/chosen": -0.28165164589881897,
610
+ "rewards/margins": 0.6714268922805786,
611
+ "rewards/rejected": -0.9530784487724304,
612
+ "step": 360
613
+ },
614
+ {
615
+ "epoch": 0.38733315885893743,
616
+ "grad_norm": 5.875,
617
+ "learning_rate": 3.846353490562664e-06,
618
+ "logits/chosen": -0.5329638123512268,
619
+ "logits/rejected": -0.5389483571052551,
620
+ "logps/chosen": -288.07745361328125,
621
+ "logps/rejected": -262.96124267578125,
622
+ "loss": 0.5038,
623
+ "rewards/accuracies": 0.7562500238418579,
624
+ "rewards/chosen": -0.05612688139081001,
625
+ "rewards/margins": 0.8792274594306946,
626
+ "rewards/rejected": -0.9353543519973755,
627
+ "step": 370
628
+ },
629
+ {
630
+ "epoch": 0.39780162261188173,
631
+ "grad_norm": 5.25,
632
+ "learning_rate": 3.768430099352445e-06,
633
+ "logits/chosen": -0.5536895990371704,
634
+ "logits/rejected": -0.5606611371040344,
635
+ "logps/chosen": -304.82574462890625,
636
+ "logps/rejected": -277.5342712402344,
637
+ "loss": 0.5089,
638
+ "rewards/accuracies": 0.75,
639
+ "rewards/chosen": -0.04281007498502731,
+ "rewards/margins": 0.9195000529289246,
+ "rewards/rejected": -0.9623101353645325,
+ "step": 380
+ },
+ {
+ "epoch": 0.408270086364826,
+ "grad_norm": 6.5,
+ "learning_rate": 3.6888102953122307e-06,
+ "logits/chosen": -0.5975215435028076,
+ "logits/rejected": -0.5678552389144897,
+ "logps/chosen": -261.8155212402344,
+ "logps/rejected": -262.86151123046875,
+ "loss": 0.5554,
+ "rewards/accuracies": 0.6937500238418579,
+ "rewards/chosen": -0.049895767122507095,
+ "rewards/margins": 0.72125244140625,
+ "rewards/rejected": -0.7711483240127563,
+ "step": 390
+ },
+ {
+ "epoch": 0.4187385501177702,
+ "grad_norm": 5.40625,
+ "learning_rate": 3.607600562872785e-06,
+ "logits/chosen": -0.5524119138717651,
+ "logits/rejected": -0.5214421153068542,
+ "logps/chosen": -284.4479675292969,
+ "logps/rejected": -274.91790771484375,
+ "loss": 0.5124,
+ "rewards/accuracies": 0.71875,
+ "rewards/chosen": -0.08701231330633163,
+ "rewards/margins": 0.7274686694145203,
+ "rewards/rejected": -0.8144810795783997,
+ "step": 400
+ },
+ {
+ "epoch": 0.4187385501177702,
+ "eval_logits/chosen": -0.45163384079933167,
+ "eval_logits/rejected": -0.42426639795303345,
+ "eval_logps/chosen": -292.09857177734375,
+ "eval_logps/rejected": -284.3532409667969,
+ "eval_loss": 0.5341187715530396,
+ "eval_rewards/accuracies": 0.7039999961853027,
+ "eval_rewards/chosen": -0.23370328545570374,
+ "eval_rewards/margins": 0.7558707594871521,
+ "eval_rewards/rejected": -0.9895740151405334,
+ "eval_runtime": 349.3488,
+ "eval_samples_per_second": 5.725,
+ "eval_steps_per_second": 0.358,
+ "step": 400
+ },
+ {
+ "epoch": 0.42920701387071447,
+ "grad_norm": 5.1875,
+ "learning_rate": 3.5249095128531863e-06,
+ "logits/chosen": -0.5735105872154236,
+ "logits/rejected": -0.5243286490440369,
+ "logps/chosen": -278.07611083984375,
+ "logps/rejected": -275.5641784667969,
+ "loss": 0.5147,
+ "rewards/accuracies": 0.75,
+ "rewards/chosen": -0.19002243876457214,
+ "rewards/margins": 0.7818821668624878,
+ "rewards/rejected": -0.9719046354293823,
+ "step": 410
+ },
+ {
+ "epoch": 0.4396754776236587,
+ "grad_norm": 5.75,
+ "learning_rate": 3.4408477372034743e-06,
+ "logits/chosen": -0.5684117078781128,
+ "logits/rejected": -0.5453870296478271,
+ "logps/chosen": -308.50128173828125,
+ "logps/rejected": -296.2371826171875,
+ "loss": 0.5608,
+ "rewards/accuracies": 0.768750011920929,
+ "rewards/chosen": -0.22137188911437988,
+ "rewards/margins": 0.63337242603302,
+ "rewards/rejected": -0.8547442555427551,
+ "step": 420
+ },
+ {
+ "epoch": 0.45014394137660296,
+ "grad_norm": 6.125,
+ "learning_rate": 3.355527661097728e-06,
+ "logits/chosen": -0.5637370347976685,
+ "logits/rejected": -0.5657171010971069,
+ "logps/chosen": -280.50177001953125,
+ "logps/rejected": -281.47467041015625,
+ "loss": 0.5229,
+ "rewards/accuracies": 0.71875,
+ "rewards/chosen": -0.39310961961746216,
+ "rewards/margins": 0.6110748648643494,
+ "rewards/rejected": -1.0041844844818115,
+ "step": 430
+ },
+ {
+ "epoch": 0.46061240512954726,
+ "grad_norm": 5.53125,
+ "learning_rate": 3.269063392575352e-06,
+ "logits/chosen": -0.5321542024612427,
+ "logits/rejected": -0.5234124660491943,
+ "logps/chosen": -328.39447021484375,
+ "logps/rejected": -307.01556396484375,
+ "loss": 0.5135,
+ "rewards/accuracies": 0.7124999761581421,
+ "rewards/chosen": -0.2074432671070099,
+ "rewards/margins": 0.7093298435211182,
+ "rewards/rejected": -0.9167732000350952,
+ "step": 440
+ },
+ {
+ "epoch": 0.4710808688824915,
+ "grad_norm": 6.625,
+ "learning_rate": 3.181570569931697e-06,
+ "logits/chosen": -0.5547568202018738,
+ "logits/rejected": -0.5429819822311401,
+ "logps/chosen": -286.77447509765625,
+ "logps/rejected": -282.605224609375,
+ "loss": 0.5,
+ "rewards/accuracies": 0.7562500238418579,
+ "rewards/chosen": -0.41850847005844116,
+ "rewards/margins": 0.6984038352966309,
+ "rewards/rejected": -1.1169124841690063,
+ "step": 450
+ },
+ {
+ "epoch": 0.48154933263543576,
+ "grad_norm": 5.125,
+ "learning_rate": 3.09316620706208e-06,
+ "logits/chosen": -0.5107685327529907,
+ "logits/rejected": -0.5204188227653503,
+ "logps/chosen": -307.8090515136719,
+ "logps/rejected": -288.7105407714844,
+ "loss": 0.4984,
+ "rewards/accuracies": 0.7562500238418579,
+ "rewards/chosen": -0.21866807341575623,
+ "rewards/margins": 0.888043999671936,
+ "rewards/rejected": -1.106711983680725,
+ "step": 460
+ },
+ {
+ "epoch": 0.49201779638838,
+ "grad_norm": 5.3125,
+ "learning_rate": 3.0039685369660785e-06,
+ "logits/chosen": -0.5120314359664917,
+ "logits/rejected": -0.46821707487106323,
+ "logps/chosen": -281.59722900390625,
+ "logps/rejected": -266.38134765625,
+ "loss": 0.5331,
+ "rewards/accuracies": 0.699999988079071,
+ "rewards/chosen": -0.03815505653619766,
+ "rewards/margins": 0.8301553726196289,
+ "rewards/rejected": -0.8683103322982788,
+ "step": 470
+ },
+ {
+ "epoch": 0.5024862601413242,
+ "grad_norm": 5.53125,
+ "learning_rate": 2.91409685362137e-06,
+ "logits/chosen": -0.5207866430282593,
+ "logits/rejected": -0.5130727887153625,
+ "logps/chosen": -278.73504638671875,
+ "logps/rejected": -276.3402404785156,
+ "loss": 0.5098,
+ "rewards/accuracies": 0.71875,
+ "rewards/chosen": -0.15987971425056458,
+ "rewards/margins": 0.769777774810791,
+ "rewards/rejected": -0.9296576380729675,
+ "step": 480
+ },
+ {
+ "epoch": 0.5129547238942685,
+ "grad_norm": 4.9375,
+ "learning_rate": 2.8236713524386085e-06,
+ "logits/chosen": -0.5813151597976685,
+ "logits/rejected": -0.5460097193717957,
+ "logps/chosen": -278.7942810058594,
+ "logps/rejected": -256.22735595703125,
+ "loss": 0.498,
+ "rewards/accuracies": 0.731249988079071,
+ "rewards/chosen": -0.02987014129757881,
+ "rewards/margins": 0.7739948034286499,
+ "rewards/rejected": -0.8038649559020996,
+ "step": 490
+ },
+ {
+ "epoch": 0.5234231876472127,
+ "grad_norm": 5.59375,
+ "learning_rate": 2.7328129695107205e-06,
+ "logits/chosen": -0.4917908310890198,
+ "logits/rejected": -0.5057145357131958,
+ "logps/chosen": -264.3317565917969,
+ "logps/rejected": -273.6480712890625,
+ "loss": 0.5529,
+ "rewards/accuracies": 0.7875000238418579,
+ "rewards/chosen": -0.05067148059606552,
+ "rewards/margins": 0.9685169458389282,
+ "rewards/rejected": -1.0191885232925415,
+ "step": 500
+ },
+ {
+ "epoch": 0.5234231876472127,
+ "eval_logits/chosen": -0.4290330708026886,
+ "eval_logits/rejected": -0.3994762897491455,
+ "eval_logps/chosen": -291.93804931640625,
+ "eval_logps/rejected": -284.49481201171875,
+ "eval_loss": 0.5259639620780945,
+ "eval_rewards/accuracies": 0.7239999771118164,
+ "eval_rewards/chosen": -0.2176663726568222,
+ "eval_rewards/margins": 0.7860648036003113,
+ "eval_rewards/rejected": -1.0037312507629395,
+ "eval_runtime": 349.4608,
+ "eval_samples_per_second": 5.723,
+ "eval_steps_per_second": 0.358,
+ "step": 500
+ },
+ {
+ "epoch": 0.533891651400157,
+ "grad_norm": 4.75,
+ "learning_rate": 2.641643219871597e-06,
+ "logits/chosen": -0.5275259017944336,
+ "logits/rejected": -0.49047979712486267,
+ "logps/chosen": -313.4814147949219,
+ "logps/rejected": -297.6827697753906,
+ "loss": 0.5085,
+ "rewards/accuracies": 0.7250000238418579,
+ "rewards/chosen": -0.3123074471950531,
+ "rewards/margins": 0.7204921841621399,
+ "rewards/rejected": -1.0327996015548706,
+ "step": 510
+ },
+ {
+ "epoch": 0.5443601151531012,
+ "grad_norm": 6.90625,
+ "learning_rate": 2.5502840349805074e-06,
+ "logits/chosen": -0.5200182199478149,
+ "logits/rejected": -0.5098154544830322,
+ "logps/chosen": -310.73944091796875,
+ "logps/rejected": -297.84326171875,
+ "loss": 0.5471,
+ "rewards/accuracies": 0.706250011920929,
+ "rewards/chosen": -0.353024423122406,
+ "rewards/margins": 0.8153821229934692,
+ "rewards/rejected": -1.1684064865112305,
+ "step": 520
+ },
+ {
+ "epoch": 0.5548285789060455,
+ "grad_norm": 5.03125,
+ "learning_rate": 2.4588575996495797e-06,
+ "logits/chosen": -0.47494420409202576,
+ "logits/rejected": -0.48031002283096313,
+ "logps/chosen": -272.15692138671875,
+ "logps/rejected": -262.4747009277344,
+ "loss": 0.5205,
+ "rewards/accuracies": 0.7437499761581421,
+ "rewards/chosen": -0.41294246912002563,
+ "rewards/margins": 0.8126422166824341,
+ "rewards/rejected": -1.225584626197815,
+ "step": 530
+ },
+ {
+ "epoch": 0.5652970426589898,
+ "grad_norm": 5.53125,
+ "learning_rate": 2.367486188632446e-06,
+ "logits/chosen": -0.5167185068130493,
+ "logits/rejected": -0.514365017414093,
+ "logps/chosen": -285.01080322265625,
+ "logps/rejected": -324.41302490234375,
+ "loss": 0.5084,
+ "rewards/accuracies": 0.7124999761581421,
+ "rewards/chosen": -0.18025778234004974,
+ "rewards/margins": 0.8430238962173462,
+ "rewards/rejected": -1.0232816934585571,
+ "step": 540
+ },
+ {
+ "epoch": 0.575765506411934,
+ "grad_norm": 4.1875,
+ "learning_rate": 2.276292003092593e-06,
+ "logits/chosen": -0.5328477621078491,
+ "logits/rejected": -0.514901340007782,
+ "logps/chosen": -257.83953857421875,
+ "logps/rejected": -265.4601135253906,
+ "loss": 0.4871,
+ "rewards/accuracies": 0.768750011920929,
+ "rewards/chosen": -0.24479135870933533,
+ "rewards/margins": 0.8982963562011719,
+ "rewards/rejected": -1.1430877447128296,
+ "step": 550
+ },
+ {
+ "epoch": 0.5862339701648783,
+ "grad_norm": 4.875,
+ "learning_rate": 2.1853970071701415e-06,
+ "logits/chosen": -0.5274088978767395,
+ "logits/rejected": -0.4985182285308838,
+ "logps/chosen": -278.2223815917969,
+ "logps/rejected": -279.3916931152344,
+ "loss": 0.5206,
+ "rewards/accuracies": 0.7749999761581421,
+ "rewards/chosen": -0.12979185581207275,
+ "rewards/margins": 0.8115113973617554,
+ "rewards/rejected": -0.9413032531738281,
+ "step": 560
+ },
+ {
+ "epoch": 0.5967024339178225,
+ "grad_norm": 5.84375,
+ "learning_rate": 2.0949227648656194e-06,
+ "logits/chosen": -0.5541775822639465,
+ "logits/rejected": -0.5307848453521729,
+ "logps/chosen": -294.7757263183594,
+ "logps/rejected": -261.8687438964844,
+ "loss": 0.5405,
+ "rewards/accuracies": 0.731249988079071,
+ "rewards/chosen": -0.2242761105298996,
+ "rewards/margins": 0.8702728152275085,
+ "rewards/rejected": -1.0945489406585693,
+ "step": 570
+ },
+ {
+ "epoch": 0.6071708976707668,
+ "grad_norm": 5.78125,
+ "learning_rate": 2.00499027745888e-06,
+ "logits/chosen": -0.5282704830169678,
+ "logits/rejected": -0.5080757141113281,
+ "logps/chosen": -299.1280212402344,
+ "logps/rejected": -297.7095031738281,
+ "loss": 0.5509,
+ "rewards/accuracies": 0.6812499761581421,
+ "rewards/chosen": -0.1550193578004837,
+ "rewards/margins": 0.7501784563064575,
+ "rewards/rejected": -0.9051979184150696,
+ "step": 580
+ },
+ {
+ "epoch": 0.6176393614237111,
+ "grad_norm": 5.65625,
+ "learning_rate": 1.915719821680624e-06,
+ "logits/chosen": -0.5544322729110718,
+ "logits/rejected": -0.5079981088638306,
+ "logps/chosen": -285.83575439453125,
+ "logps/rejected": -281.488037109375,
+ "loss": 0.5198,
+ "rewards/accuracies": 0.793749988079071,
+ "rewards/chosen": 0.1644931137561798,
+ "rewards/margins": 0.8637059330940247,
+ "rewards/rejected": -0.6992128491401672,
+ "step": 590
+ },
+ {
+ "epoch": 0.6281078251766553,
+ "grad_norm": 5.34375,
+ "learning_rate": 1.8272307888529276e-06,
+ "logits/chosen": -0.47845038771629333,
+ "logits/rejected": -0.44052690267562866,
+ "logps/chosen": -261.1684265136719,
+ "logps/rejected": -279.1067810058594,
+ "loss": 0.53,
+ "rewards/accuracies": 0.7437499761581421,
+ "rewards/chosen": -0.013519972562789917,
+ "rewards/margins": 0.8593659400939941,
+ "rewards/rejected": -0.8728858828544617,
+ "step": 600
+ },
+ {
+ "epoch": 0.6281078251766553,
+ "eval_logits/chosen": -0.4317292273044586,
+ "eval_logits/rejected": -0.4027506709098816,
+ "eval_logps/chosen": -290.4488830566406,
+ "eval_logps/rejected": -283.0403137207031,
+ "eval_loss": 0.5244117379188538,
+ "eval_rewards/accuracies": 0.7200000286102295,
+ "eval_rewards/chosen": -0.06874838471412659,
+ "eval_rewards/margins": 0.7895320653915405,
+ "eval_rewards/rejected": -0.8582804203033447,
+ "eval_runtime": 349.48,
+ "eval_samples_per_second": 5.723,
+ "eval_steps_per_second": 0.358,
+ "step": 600
+ },
+ {
+ "epoch": 0.6385762889295996,
+ "grad_norm": 4.625,
+ "learning_rate": 1.739641525213929e-06,
+ "logits/chosen": -0.510543942451477,
+ "logits/rejected": -0.5007991194725037,
+ "logps/chosen": -264.67572021484375,
+ "logps/rejected": -270.05438232421875,
+ "loss": 0.4957,
+ "rewards/accuracies": 0.731249988079071,
+ "rewards/chosen": -0.04250194877386093,
+ "rewards/margins": 0.8683522343635559,
+ "rewards/rejected": -0.9108540415763855,
+ "step": 610
+ },
+ {
+ "epoch": 0.6490447526825438,
+ "grad_norm": 4.5625,
+ "learning_rate": 1.6530691736402317e-06,
+ "logits/chosen": -0.5225728750228882,
+ "logits/rejected": -0.49780869483947754,
+ "logps/chosen": -291.7324523925781,
+ "logps/rejected": -280.8247985839844,
+ "loss": 0.5072,
+ "rewards/accuracies": 0.7875000238418579,
+ "rewards/chosen": -0.15649111568927765,
+ "rewards/margins": 0.8370261192321777,
+ "rewards/rejected": -0.9935172200202942,
+ "step": 620
+ },
+ {
+ "epoch": 0.6595132164354881,
+ "grad_norm": 5.34375,
+ "learning_rate": 1.5676295169786864e-06,
+ "logits/chosen": -0.5341150164604187,
+ "logits/rejected": -0.4966842532157898,
+ "logps/chosen": -283.0377502441406,
+ "logps/rejected": -271.4356994628906,
+ "loss": 0.5294,
+ "rewards/accuracies": 0.7124999761581421,
+ "rewards/chosen": -0.335578590631485,
+ "rewards/margins": 0.8176633715629578,
+ "rewards/rejected": -1.1532419919967651,
+ "step": 630
+ },
+ {
+ "epoch": 0.6699816801884323,
+ "grad_norm": 5.25,
+ "learning_rate": 1.4834368231970922e-06,
+ "logits/chosen": -0.5592583417892456,
+ "logits/rejected": -0.5109054446220398,
+ "logps/chosen": -287.51068115234375,
+ "logps/rejected": -272.8183288574219,
+ "loss": 0.5,
+ "rewards/accuracies": 0.7250000238418579,
+ "rewards/chosen": -0.3240619897842407,
+ "rewards/margins": 0.7095167636871338,
+ "rewards/rejected": -1.0335787534713745,
+ "step": 640
+ },
+ {
+ "epoch": 0.6804501439413766,
+ "grad_norm": 5.28125,
+ "learning_rate": 1.4006036925609245e-06,
+ "logits/chosen": -0.5300047993659973,
+ "logits/rejected": -0.49794530868530273,
+ "logps/chosen": -300.1281433105469,
+ "logps/rejected": -249.81936645507812,
+ "loss": 0.5413,
+ "rewards/accuracies": 0.7124999761581421,
+ "rewards/chosen": -0.32234618067741394,
+ "rewards/margins": 0.8122032880783081,
+ "rewards/rejected": -1.1345494985580444,
+ "step": 650
+ },
+ {
+ "epoch": 0.6909186076943209,
+ "grad_norm": 5.15625,
+ "learning_rate": 1.3192409070404582e-06,
+ "logits/chosen": -0.550862193107605,
+ "logits/rejected": -0.532629668712616,
+ "logps/chosen": -303.74505615234375,
+ "logps/rejected": -305.6463928222656,
+ "loss": 0.5143,
+ "rewards/accuracies": 0.762499988079071,
+ "rewards/chosen": -0.20528188347816467,
+ "rewards/margins": 0.7565289735794067,
+ "rewards/rejected": -0.9618108868598938,
+ "step": 660
+ },
+ {
+ "epoch": 0.7013870714472651,
+ "grad_norm": 6.59375,
+ "learning_rate": 1.2394572821496953e-06,
+ "logits/chosen": -0.5544015169143677,
+ "logits/rejected": -0.5221393704414368,
+ "logps/chosen": -277.2763977050781,
+ "logps/rejected": -258.49676513671875,
+ "loss": 0.5181,
+ "rewards/accuracies": 0.71875,
+ "rewards/chosen": -0.20379216969013214,
+ "rewards/margins": 0.7780588269233704,
+ "rewards/rejected": -0.9818509817123413,
+ "step": 670
+ },
+ {
+ "epoch": 0.7118555352002094,
+ "grad_norm": 5.09375,
+ "learning_rate": 1.1613595214152713e-06,
+ "logits/chosen": -0.5822625160217285,
+ "logits/rejected": -0.540503203868866,
+ "logps/chosen": -286.454833984375,
+ "logps/rejected": -275.0696105957031,
+ "loss": 0.5043,
+ "rewards/accuracies": 0.7562500238418579,
+ "rewards/chosen": -0.32862305641174316,
+ "rewards/margins": 0.7822999954223633,
+ "rewards/rejected": -1.1109230518341064,
+ "step": 680
+ },
+ {
+ "epoch": 0.7223239989531536,
+ "grad_norm": 4.3125,
+ "learning_rate": 1.0850520736699362e-06,
+ "logits/chosen": -0.5231102705001831,
+ "logits/rejected": -0.49847880005836487,
+ "logps/chosen": -340.5146484375,
+ "logps/rejected": -315.625,
+ "loss": 0.5348,
+ "rewards/accuracies": 0.762499988079071,
+ "rewards/chosen": -0.29148069024086,
+ "rewards/margins": 0.8814651370048523,
+ "rewards/rejected": -1.1729459762573242,
+ "step": 690
+ },
+ {
+ "epoch": 0.7327924627060979,
+ "grad_norm": 5.625,
+ "learning_rate": 1.0106369933615043e-06,
+ "logits/chosen": -0.5652047991752625,
+ "logits/rejected": -0.5295856595039368,
+ "logps/chosen": -315.13458251953125,
+ "logps/rejected": -262.6495361328125,
+ "loss": 0.5028,
+ "rewards/accuracies": 0.731249988079071,
+ "rewards/chosen": -0.3104090690612793,
+ "rewards/margins": 0.7127448320388794,
+ "rewards/rejected": -1.0231537818908691,
+ "step": 700
+ },
+ {
+ "epoch": 0.7327924627060979,
+ "eval_logits/chosen": -0.417924702167511,
+ "eval_logits/rejected": -0.38744810223579407,
+ "eval_logps/chosen": -293.1184387207031,
+ "eval_logps/rejected": -285.8177185058594,
+ "eval_loss": 0.5190439224243164,
+ "eval_rewards/accuracies": 0.7319999933242798,
+ "eval_rewards/chosen": -0.33567702770233154,
+ "eval_rewards/margins": 0.8003441095352173,
+ "eval_rewards/rejected": -1.1360211372375488,
+ "eval_runtime": 349.5304,
+ "eval_samples_per_second": 5.722,
+ "eval_steps_per_second": 0.358,
+ "step": 700
+ },
+ {
+ "epoch": 0.7432609264590422,
+ "grad_norm": 6.1875,
+ "learning_rate": 9.382138040640714e-07,
+ "logits/chosen": -0.5737341046333313,
+ "logits/rejected": -0.5333853363990784,
+ "logps/chosen": -264.9908752441406,
+ "logps/rejected": -278.7557067871094,
+ "loss": 0.5569,
+ "rewards/accuracies": 0.7250000238418579,
+ "rewards/chosen": -0.3085269331932068,
+ "rewards/margins": 0.818738579750061,
+ "rewards/rejected": -1.1272655725479126,
+ "step": 710
+ },
+ {
+ "epoch": 0.7537293902119864,
+ "grad_norm": 5.34375,
+ "learning_rate": 8.678793653740633e-07,
+ "logits/chosen": -0.4978073537349701,
+ "logits/rejected": -0.49744582176208496,
+ "logps/chosen": -263.29010009765625,
+ "logps/rejected": -263.02618408203125,
+ "loss": 0.5204,
+ "rewards/accuracies": 0.762499988079071,
+ "rewards/chosen": -0.2587752938270569,
+ "rewards/margins": 0.8211803436279297,
+ "rewards/rejected": -1.0799555778503418,
+ "step": 720
+ },
+ {
+ "epoch": 0.7641978539649307,
+ "grad_norm": 4.75,
+ "learning_rate": 7.997277433690984e-07,
+ "logits/chosen": -0.5180292725563049,
+ "logits/rejected": -0.47244367003440857,
+ "logps/chosen": -301.2723693847656,
+ "logps/rejected": -287.05584716796875,
+ "loss": 0.5153,
+ "rewards/accuracies": 0.800000011920929,
+ "rewards/chosen": -0.28081485629081726,
+ "rewards/margins": 0.7739724516868591,
+ "rewards/rejected": -1.054787278175354,
+ "step": 730
+ },
+ {
+ "epoch": 0.7746663177178749,
+ "grad_norm": 4.65625,
+ "learning_rate": 7.338500848029603e-07,
+ "logits/chosen": -0.47029241919517517,
+ "logits/rejected": -0.48912668228149414,
+ "logps/chosen": -290.9686584472656,
+ "logps/rejected": -274.376953125,
+ "loss": 0.4939,
+ "rewards/accuracies": 0.768750011920929,
+ "rewards/chosen": -0.0831630602478981,
+ "rewards/margins": 0.7876149415969849,
+ "rewards/rejected": -0.8707780838012695,
+ "step": 740
+ },
+ {
+ "epoch": 0.7851347814708192,
+ "grad_norm": 5.09375,
+ "learning_rate": 6.70334495204884e-07,
+ "logits/chosen": -0.5157723426818848,
+ "logits/rejected": -0.49868693947792053,
+ "logps/chosen": -323.7619323730469,
+ "logps/rejected": -285.82916259765625,
+ "loss": 0.5023,
+ "rewards/accuracies": 0.75,
+ "rewards/chosen": -0.11948978900909424,
+ "rewards/margins": 0.8135945200920105,
+ "rewards/rejected": -0.9330843687057495,
+ "step": 750
+ },
+ {
+ "epoch": 0.7956032452237635,
+ "grad_norm": 5.46875,
+ "learning_rate": 6.092659210462232e-07,
+ "logits/chosen": -0.5377079248428345,
+ "logits/rejected": -0.5193135142326355,
+ "logps/chosen": -268.98455810546875,
+ "logps/rejected": -268.32025146484375,
+ "loss": 0.5455,
+ "rewards/accuracies": 0.706250011920929,
+ "rewards/chosen": -0.2502976357936859,
+ "rewards/margins": 0.698966920375824,
+ "rewards/rejected": -0.949264645576477,
+ "step": 760
+ },
+ {
+ "epoch": 0.8060717089767077,
+ "grad_norm": 5.28125,
+ "learning_rate": 5.507260361320738e-07,
+ "logits/chosen": -0.5280566215515137,
+ "logits/rejected": -0.5302770137786865,
+ "logps/chosen": -284.72637939453125,
+ "logps/rejected": -279.0828552246094,
+ "loss": 0.5093,
+ "rewards/accuracies": 0.706250011920929,
+ "rewards/chosen": -0.12712079286575317,
+ "rewards/margins": 0.6862561106681824,
+ "rewards/rejected": -0.8133770227432251,
+ "step": 770
+ },
+ {
+ "epoch": 0.816540172729652,
+ "grad_norm": 5.21875,
+ "learning_rate": 4.947931323697983e-07,
+ "logits/chosen": -0.5104556083679199,
+ "logits/rejected": -0.48463624715805054,
+ "logps/chosen": -286.16925048828125,
+ "logps/rejected": -279.46295166015625,
+ "loss": 0.5278,
+ "rewards/accuracies": 0.6875,
+ "rewards/chosen": -0.2269764244556427,
+ "rewards/margins": 0.6587584018707275,
+ "rewards/rejected": -0.8857349157333374,
+ "step": 780
+ },
+ {
+ "epoch": 0.8270086364825961,
+ "grad_norm": 5.375,
+ "learning_rate": 4.4154201506053985e-07,
+ "logits/chosen": -0.5380119681358337,
+ "logits/rejected": -0.5271115303039551,
+ "logps/chosen": -300.27313232421875,
+ "logps/rejected": -263.89984130859375,
+ "loss": 0.511,
+ "rewards/accuracies": 0.8187500238418579,
+ "rewards/chosen": -0.15483184158802032,
+ "rewards/margins": 0.8044347763061523,
+ "rewards/rejected": -0.9592665433883667,
+ "step": 790
+ },
+ {
+ "epoch": 0.8374771002355405,
+ "grad_norm": 4.84375,
+ "learning_rate": 3.910439028537638e-07,
+ "logits/chosen": -0.5491563677787781,
+ "logits/rejected": -0.5051991939544678,
+ "logps/chosen": -347.644287109375,
+ "logps/rejected": -302.0996398925781,
+ "loss": 0.5347,
+ "rewards/accuracies": 0.706250011920929,
+ "rewards/chosen": -0.08171938359737396,
+ "rewards/margins": 0.7037029266357422,
+ "rewards/rejected": -0.7854223251342773,
+ "step": 800
+ },
+ {
+ "epoch": 0.8374771002355405,
+ "eval_logits/chosen": -0.42248398065567017,
+ "eval_logits/rejected": -0.39241039752960205,
+ "eval_logps/chosen": -291.1650085449219,
+ "eval_logps/rejected": -283.87603759765625,
+ "eval_loss": 0.5191403031349182,
+ "eval_rewards/accuracies": 0.7319999933242798,
+ "eval_rewards/chosen": -0.14035792648792267,
+ "eval_rewards/margins": 0.8014954924583435,
+ "eval_rewards/rejected": -0.9418535232543945,
+ "eval_runtime": 349.5564,
+ "eval_samples_per_second": 5.722,
+ "eval_steps_per_second": 0.358,
+ "step": 800
+ },
+ {
+ "epoch": 0.8479455639884846,
+ "grad_norm": 4.6875,
+ "learning_rate": 3.4336633249862084e-07,
+ "logits/chosen": -0.5698053240776062,
+ "logits/rejected": -0.5044723153114319,
+ "logps/chosen": -319.55126953125,
+ "logps/rejected": -288.7781066894531,
+ "loss": 0.5021,
+ "rewards/accuracies": 0.7437499761581421,
+ "rewards/chosen": -0.17804962396621704,
+ "rewards/margins": 0.8044875860214233,
+ "rewards/rejected": -0.9825371503829956,
+ "step": 810
+ },
+ {
+ "epoch": 0.8584140277414289,
+ "grad_norm": 5.25,
+ "learning_rate": 2.98573068519539e-07,
+ "logits/chosen": -0.5468612909317017,
+ "logits/rejected": -0.5319810509681702,
+ "logps/chosen": -307.5137939453125,
+ "logps/rejected": -293.56561279296875,
+ "loss": 0.5358,
+ "rewards/accuracies": 0.706250011920929,
+ "rewards/chosen": -0.15299567580223083,
+ "rewards/margins": 0.7571539282798767,
+ "rewards/rejected": -0.9101495742797852,
+ "step": 820
+ },
+ {
+ "epoch": 0.8688824914943732,
+ "grad_norm": 5.375,
+ "learning_rate": 2.5672401793681854e-07,
+ "logits/chosen": -0.565108597278595,
+ "logits/rejected": -0.5492611527442932,
+ "logps/chosen": -274.61260986328125,
+ "logps/rejected": -269.25439453125,
+ "loss": 0.5048,
+ "rewards/accuracies": 0.737500011920929,
+ "rewards/chosen": -0.17146927118301392,
+ "rewards/margins": 0.8092424273490906,
+ "rewards/rejected": -0.9807117581367493,
+ "step": 830
+ },
+ {
+ "epoch": 0.8793509552473174,
+ "grad_norm": 5.1875,
+ "learning_rate": 2.178751501463036e-07,
+ "logits/chosen": -0.533309280872345,
+ "logits/rejected": -0.5187536478042603,
+ "logps/chosen": -313.8269958496094,
+ "logps/rejected": -307.17987060546875,
+ "loss": 0.517,
+ "rewards/accuracies": 0.6812499761581421,
+ "rewards/chosen": -0.20717601478099823,
+ "rewards/margins": 0.7723677158355713,
+ "rewards/rejected": -0.9795438051223755,
+ "step": 840
+ },
+ {
+ "epoch": 0.8898194190002617,
+ "grad_norm": 5.4375,
+ "learning_rate": 1.820784220652766e-07,
+ "logits/chosen": -0.5832973122596741,
+ "logits/rejected": -0.5479104518890381,
+ "logps/chosen": -345.9014587402344,
+ "logps/rejected": -279.53985595703125,
+ "loss": 0.5083,
+ "rewards/accuracies": 0.71875,
+ "rewards/chosen": -0.0073674083687365055,
+ "rewards/margins": 0.875013530254364,
+ "rewards/rejected": -0.8823810815811157,
+ "step": 850
+ },
+ {
+ "epoch": 0.9002878827532059,
+ "grad_norm": 5.375,
+ "learning_rate": 1.4938170864468636e-07,
+ "logits/chosen": -0.5060287117958069,
+ "logits/rejected": -0.4896390438079834,
+ "logps/chosen": -289.37042236328125,
+ "logps/rejected": -268.93487548828125,
+ "loss": 0.4806,
+ "rewards/accuracies": 0.731249988079071,
+ "rewards/chosen": -0.11767210066318512,
+ "rewards/margins": 0.9038281440734863,
+ "rewards/rejected": -1.0215001106262207,
+ "step": 860
+ },
+ {
+ "epoch": 0.9107563465061502,
+ "grad_norm": 5.53125,
+ "learning_rate": 1.1982873884064466e-07,
+ "logits/chosen": -0.4754942059516907,
+ "logits/rejected": -0.4765293598175049,
+ "logps/chosen": -287.2554931640625,
+ "logps/rejected": -276.70794677734375,
+ "loss": 0.5263,
+ "rewards/accuracies": 0.7124999761581421,
+ "rewards/chosen": -0.16617272794246674,
+ "rewards/margins": 0.6353666186332703,
+ "rewards/rejected": -0.801539421081543,
+ "step": 870
+ },
+ {
+ "epoch": 0.9212248102590945,
+ "grad_norm": 5.75,
+ "learning_rate": 9.345903713082305e-08,
+ "logits/chosen": -0.5580819845199585,
+ "logits/rejected": -0.5499614477157593,
+ "logps/chosen": -315.15478515625,
+ "logps/rejected": -280.8697814941406,
+ "loss": 0.5463,
+ "rewards/accuracies": 0.625,
+ "rewards/chosen": -0.1284041702747345,
+ "rewards/margins": 0.6124891042709351,
+ "rewards/rejected": -0.7408932447433472,
+ "step": 880
+ },
+ {
+ "epoch": 0.9316932740120387,
+ "grad_norm": 5.125,
+ "learning_rate": 7.030787065396866e-08,
+ "logits/chosen": -0.5307958722114563,
+ "logits/rejected": -0.49479198455810547,
+ "logps/chosen": -318.7244567871094,
+ "logps/rejected": -293.06219482421875,
+ "loss": 0.5198,
+ "rewards/accuracies": 0.71875,
+ "rewards/chosen": -0.11201741546392441,
+ "rewards/margins": 0.724387526512146,
+ "rewards/rejected": -0.8364049792289734,
+ "step": 890
+ },
+ {
+ "epoch": 0.942161737764983,
+ "grad_norm": 4.96875,
+ "learning_rate": 5.0406202043228604e-08,
+ "logits/chosen": -0.541701078414917,
+ "logits/rejected": -0.5375146865844727,
+ "logps/chosen": -333.6339111328125,
+ "logps/rejected": -274.9059753417969,
+ "loss": 0.4783,
+ "rewards/accuracies": 0.768750011920929,
+ "rewards/chosen": -0.0002611428499221802,
+ "rewards/margins": 0.9774513244628906,
+ "rewards/rejected": -0.9777124524116516,
+ "step": 900
+ },
+ {
+ "epoch": 0.942161737764983,
+ "eval_logits/chosen": -0.4219192862510681,
+ "eval_logits/rejected": -0.391748309135437,
+ "eval_logps/chosen": -291.15997314453125,
+ "eval_logps/rejected": -283.916259765625,
+ "eval_loss": 0.5189567804336548,
+ "eval_rewards/accuracies": 0.7260000109672546,
+ "eval_rewards/chosen": -0.13985459506511688,
+ "eval_rewards/margins": 0.8060198426246643,
+ "eval_rewards/rejected": -0.9458745121955872,
+ "eval_runtime": 349.4658,
+ "eval_samples_per_second": 5.723,
+ "eval_steps_per_second": 0.358,
+ "step": 900
+ },
+ {
+ "epoch": 0.9526302015179272,
+ "grad_norm": 5.15625,
+ "learning_rate": 3.378064801637687e-08,
+ "logits/chosen": -0.5818526148796082,
+ "logits/rejected": -0.5350463390350342,
+ "logps/chosen": -314.6712341308594,
+ "logps/rejected": -313.63116455078125,
+ "loss": 0.5208,
+ "rewards/accuracies": 0.731249988079071,
+ "rewards/chosen": -0.047819893807172775,
+ "rewards/margins": 0.7616057991981506,
+ "rewards/rejected": -0.8094256520271301,
+ "step": 910
+ },
+ {
+ "epoch": 0.9630986652708715,
+ "grad_norm": 4.6875,
+ "learning_rate": 2.0453443778310766e-08,
+ "logits/chosen": -0.5121539831161499,
+ "logits/rejected": -0.48242586851119995,
+ "logps/chosen": -329.2786865234375,
+ "logps/rejected": -304.5957336425781,
+ "loss": 0.5356,
+ "rewards/accuracies": 0.768750011920929,
+ "rewards/chosen": -0.06458105146884918,
+ "rewards/margins": 0.869833767414093,
+ "rewards/rejected": -0.9344147443771362,
+ "step": 920
+ },
+ {
+ "epoch": 0.9735671290238157,
+ "grad_norm": 4.78125,
+ "learning_rate": 1.0442413283435759e-08,
+ "logits/chosen": -0.509577751159668,
+ "logits/rejected": -0.47219425439834595,
+ "logps/chosen": -317.72943115234375,
+ "logps/rejected": -278.4623718261719,
+ "loss": 0.5143,
+ "rewards/accuracies": 0.762499988079071,
+ "rewards/chosen": -0.05142117291688919,
+ "rewards/margins": 1.1160078048706055,
+ "rewards/rejected": -1.167428970336914,
+ "step": 930
+ },
+ {
+ "epoch": 0.98403559277676,
+ "grad_norm": 4.1875,
+ "learning_rate": 3.760945397705828e-09,
+ "logits/chosen": -0.5354495048522949,
+ "logits/rejected": -0.4856652617454529,
+ "logps/chosen": -290.7127380371094,
+ "logps/rejected": -260.788330078125,
+ "loss": 0.487,
+ "rewards/accuracies": 0.8125,
+ "rewards/chosen": 0.03208203241229057,
+ "rewards/margins": 1.0025131702423096,
+ "rewards/rejected": -0.9704310297966003,
+ "step": 940
+ },
+ {
+ "epoch": 0.9945040565297043,
+ "grad_norm": 4.6875,
+ "learning_rate": 4.1797599220405605e-10,
+ "logits/chosen": -0.5512427687644958,
+ "logits/rejected": -0.5348908305168152,
+ "logps/chosen": -295.21319580078125,
+ "logps/rejected": -278.82257080078125,
+ "loss": 0.5053,
+ "rewards/accuracies": 0.7749999761581421,
+ "rewards/chosen": -0.0824163556098938,
+ "rewards/margins": 0.8457223176956177,
+ "rewards/rejected": -0.9281387329101562,
+ "step": 950
+ },
+ {
+ "epoch": 0.9997382884061764,
+ "step": 955,
+ "total_flos": 0.0,
+ "train_loss": 0.543355998318857,
+ "train_runtime": 19333.2554,
+ "train_samples_per_second": 3.162,
+ "train_steps_per_second": 0.049
+ }
+ ],
+ "logging_steps": 10,
+ "max_steps": 955,
+ "num_input_tokens_seen": 0,
+ "num_train_epochs": 1,
+ "save_steps": 100000,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": {
+ "should_epoch_stop": false,
+ "should_evaluate": false,
+ "should_log": false,
+ "should_save": true,
+ "should_training_stop": true
+ },
+ "attributes": {}
+ }
+ },
+ "total_flos": 0.0,
+ "train_batch_size": 1,
+ "trial_name": null,
+ "trial_params": null
+ }