model_hh_usp3_dpo1

This model is a fine-tuned version of meta-llama/Llama-2-7b-chat-hf on an unknown dataset. It achieves the following results on the evaluation set:

Model description

More information needed

More information needed

More information needed

The following hyperparameters were used during training:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.1146	2.67	100	0.6900	-4.8037	-6.4467	0.7300	1.6430	-178.8784	-158.8832	-0.8887	-0.9297
0.0024	5.33	200	1.0780	-9.2061	-12.3558	0.7000	3.1497	-237.9695	-202.9073	-0.8656	-0.8655
0.0001	8.0	300	0.9490	-10.1793	-13.8565	0.7200	3.6772	-252.9766	-212.6395	-0.8313	-0.8261
0.0001	10.67	400	0.9700	-10.4127	-14.1403	0.7300	3.7276	-255.8140	-214.9731	-0.8237	-0.8199
0.0001	13.33	500	0.9721	-10.5124	-14.2790	0.7300	3.7666	-257.2012	-215.9702	-0.8195	-0.8168
0.0001	16.0	600	0.9839	-10.5785	-14.3606	0.7200	3.7820	-258.0171	-216.6317	-0.8172	-0.8146
0.0001	18.67	700	0.9829	-10.6126	-14.4106	0.7300	3.7980	-258.5171	-216.9720	-0.8160	-0.8135
0.0001	21.33	800	0.9837	-10.6177	-14.4091	0.7200	3.7913	-258.5019	-217.0236	-0.8153	-0.8132
0.0001	24.0	900	0.9832	-10.6228	-14.4149	0.7300	3.7921	-258.5607	-217.0746	-0.8156	-0.8134
0.0001	26.67	1000	0.9835	-10.6241	-14.4122	0.7300	3.7881	-258.5334	-217.0869	-0.8154	-0.8130