# model_hh_usp4_dpo9

This model is a fine-tuned version of [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) on an unknown dataset. It achieves the following results on the evaluation set:
- Loss: 4.0767
- Rewards/chosen: -1.1762
- Rewards/rejected: -7.5013
- Rewards/accuracies: 0.6300
- Rewards/margins: 6.3252
- Logps/rejected: -117.1809
- Logps/chosen: -115.1588
- Logits/rejected: -0.1065
- Logits/chosen: -0.0807
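
For context, the reward columns are DPO's implicit rewards: with policy $\pi_\theta$ and reference model $\pi_{\mathrm{ref}}$, `rewards/chosen` is $\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}$ averaged over the evaluation set (analogously for rejected), and `rewards/margins` is their difference. The standard DPO objective, included here as background since the card does not restate it, is

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the chosen and rejected responses and $\beta$ is a temperature that this card does not report.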
## Model description
More information needed
## Intended uses & limitations
More information needed
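
Pending details from the authors, here is a minimal inference sketch. It assumes this repo hosts a PEFT (LoRA) adapter for the base chat model; the Llama-2 chat prompt format is also an assumption, and the snippet is untested against this specific checkpoint.

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Assumption: the repo contains a PEFT adapter, so AutoPeftModelForCausalLM
# loads the meta-llama/Llama-2-7b-chat-hf base weights and applies the adapter.
model = AutoPeftModelForCausalLM.from_pretrained(
    "guoyu-zhang/model_hh_usp4_dpo9",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Llama-2 chat prompt convention (assumed; the card documents no format).
prompt = "[INST] How do I make a good cup of coffee? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```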
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training (a reproduction sketch follows the list):
- learning_rate: 0.0005
- train_batch_size: 4
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- training_steps: 1000
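
The sketch below maps this configuration onto `transformers.TrainingArguments` and trl's `DPOTrainer`. Only the hyperparameters listed above come from the card; the DPO `beta`, the LoRA config, and the dataset are unstated, so the commented placeholders are assumptions.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="model_hh_usp4_dpo9",
    learning_rate=5e-4,              # learning_rate: 0.0005
    per_device_train_batch_size=4,   # train_batch_size: 4
    per_device_eval_batch_size=1,    # eval_batch_size: 1
    gradient_accumulation_steps=4,   # 4 * 4 = total_train_batch_size 16
    lr_scheduler_type="cosine",
    warmup_steps=100,
    max_steps=1000,                  # training_steps: 1000
    seed=42,
)

# With trl (version not stated on the card), training would look roughly like:
# trainer = DPOTrainer(
#     model, ref_model=None, args=training_args,
#     beta=0.1,                                       # assumed; not reported
#     train_dataset=train_ds, eval_dataset=eval_ds,   # dataset unknown
#     tokenizer=tokenizer, peft_config=lora_config,   # LoRA config unknown
# )
# trainer.train()
```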
### Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0472 | 2.67 | 100 | 1.5866 | -2.4737 | -5.2244 | 0.6600 | 2.7507 | -114.6509 | -116.6005 | -0.1123 | -0.1103 |
| 0.061 | 5.33 | 200 | 2.8352 | -8.5414 | -13.8302 | 0.6600 | 5.2888 | -124.2130 | -123.3425 | -0.2214 | -0.1997 |
| 0.0022 | 8.0 | 300 | 3.6078 | -5.7355 | -11.8144 | 0.6600 | 6.0789 | -121.9732 | -120.2247 | -0.2463 | -0.2014 |
| 0.0001 | 10.67 | 400 | 4.1244 | -1.6102 | -7.8752 | 0.6300 | 6.2650 | -117.5963 | -115.6411 | -0.1230 | -0.0965 |
| 0.0 | 13.33 | 500 | 4.0644 | -1.1614 | -7.5191 | 0.6300 | 6.3577 | -117.2006 | -115.1424 | -0.1061 | -0.0806 |
| 0.0 | 16.0 | 600 | 4.0669 | -1.1412 | -7.4965 | 0.6300 | 6.3554 | -117.1756 | -115.1199 | -0.1068 | -0.0813 |
| 0.0 | 18.67 | 700 | 4.0482 | -1.1597 | -7.5269 | 0.6300 | 6.3672 | -117.2094 | -115.1405 | -0.1065 | -0.0810 |
| 0.0 | 21.33 | 800 | 4.0720 | -1.1432 | -7.5025 | 0.6300 | 6.3594 | -117.1822 | -115.1221 | -0.1067 | -0.0811 |
| 0.0 | 24.0 | 900 | 4.0691 | -1.1439 | -7.4980 | 0.6300 | 6.3541 | -117.1772 | -115.1229 | -0.1069 | -0.0810 |
| 0.0 | 26.67 | 1000 | 4.0767 | -1.1762 | -7.5013 | 0.6300 | 6.3252 | -117.1809 | -115.1588 | -0.1065 | -0.0807 |
### Framework versions
- PEFT 0.10.0
- Transformers 4.39.3
- Pytorch 2.2.2+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2