model_hh_usp1_dpo5

This model is a fine-tuned version of meta-llama/Llama-2-7b-chat-hf on an unknown dataset. It achieves the following results on the evaluation set:

Model description

More information needed

More information needed

More information needed

The following hyperparameters were used during training:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.1062	2.67	100	0.9731	-1.3173	-2.6971	0.6900	1.3798	-119.1878	-115.3486	-0.0430	-0.0352
0.0053	5.33	200	1.4158	-3.1407	-7.2838	0.6700	4.1431	-128.3614	-118.9955	-0.6209	-0.5447
0.0359	8.0	300	1.6639	-5.1494	-9.1017	0.6900	3.9524	-131.9972	-123.0127	-0.7498	-0.6842
0.0	10.67	400	1.6816	-7.9965	-12.3589	0.7000	4.3624	-138.5115	-128.7070	-0.7830	-0.7067
0.0	13.33	500	1.6979	-8.0167	-12.3847	0.7000	4.3680	-138.5631	-128.7474	-0.7829	-0.7068
0.0	16.0	600	1.6968	-8.0295	-12.3907	0.7000	4.3611	-138.5751	-128.7731	-0.7829	-0.7066
0.0	18.67	700	1.6993	-8.0304	-12.3986	0.7000	4.3682	-138.5910	-128.7749	-0.7833	-0.7066
0.0	21.33	800	1.6953	-8.0311	-12.4055	0.7000	4.3743	-138.6046	-128.7763	-0.7835	-0.7067
0.0	24.0	900	1.6919	-8.0283	-12.4122	0.7100	4.3839	-138.6181	-128.7706	-0.7836	-0.7075
0.0	26.67	1000	1.6912	-8.0278	-12.3890	0.7100	4.3613	-138.5718	-128.7695	-0.7832	-0.7072