# model_hh_usp3_dpo9

This model is a fine-tuned version of [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) on an unknown dataset. It achieves the following results on the evaluation set:

- Loss: 4.9971
- Rewards/chosen: -22.6484
- Rewards/rejected: -28.5100
- Rewards/accuracies: 0.6200
- Rewards/margins: 5.8617
- Logps/rejected: -144.5019
- Logps/chosen: -138.1667
- Logits/rejected: -0.4573
- Logits/chosen: -0.4284
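
Since the framework versions below list PEFT, this repository most likely hosts a LoRA-style adapter rather than full model weights. A minimal inference sketch under that assumption (the prompt is illustrative, and access to the gated Llama-2 base model is required):

```python
# Minimal inference sketch, assuming this repo is a PEFT adapter for
# meta-llama/Llama-2-7b-chat-hf (the base model named in this card).
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-chat-hf"
ADAPTER = "guoyu-zhang/model_hh_usp3_dpo9"

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.float16, device_map="auto"
)
# Attach the DPO-trained adapter on top of the frozen base weights.
model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()

prompt = "[INST] How do I brew good coffee at home? [/INST]"  # illustrative
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```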
## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0005
- train_batch_size: 4
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- training_steps: 1000
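
These settings map directly onto a standard TRL DPO run. A hedged sketch of how such a run could be configured (the preference dataset, LoRA configuration, DPO beta, and TRL version are not stated in this card; everything marked as an assumption below is illustrative only):

```python
# Hedged sketch of a TRL DPO run mirroring the hyperparameters listed above.
# Items marked "assumption" (dataset, LoRA config, beta) are not in this card.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

BASE = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(BASE)
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token

# Assumption: a prompt/chosen/rejected preference set, the format DPOTrainer
# expects (toy rows shown; the actual training data is not documented here).
train_dataset = Dataset.from_dict({
    "prompt": ["[INST] Say hi. [/INST]"],
    "chosen": [" Hello! How can I help?"],
    "rejected": [" no."],
})

peft_config = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16)  # assumption

args = TrainingArguments(
    output_dir="model_hh_usp3_dpo9",
    learning_rate=5e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,  # 4 x 4 = effective batch size 16
    lr_scheduler_type="cosine",
    warmup_steps=100,
    max_steps=1000,
    seed=42,
    remove_unused_columns=False,
)

trainer = DPOTrainer(
    model,
    ref_model=None,   # with a PEFT adapter, the frozen base acts as the reference
    args=args,
    beta=0.1,         # assumption: the DPO beta is not reported in this card
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_length=512,
    max_prompt_length=128,
)
trainer.train()
```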
### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:--------------:|
| 0.0667 | 2.67 | 100 | 1.2931 | -0.0486 | -2.3792 | 0.6500 | 2.3307 | -115.4677 | -113.0558 | -0.0497 | -0.0386 |
| 0.0265 | 5.33 | 200 | 2.5238 | -3.3105 | -7.5646 | 0.6600 | 4.2541 | -121.2292 | -116.6801 | -0.3923 | -0.3765 |
| 0.139 | 8.0 | 300 | 4.4570 | -13.8321 | -19.1751 | 0.6100 | 5.3430 | -134.1298 | -128.3709 | -0.2657 | -0.2456 |
| 0.0061 | 10.67 | 400 | 4.9964 | -19.0684 | -25.0784 | 0.6300 | 6.0099 | -140.6890 | -134.1890 | -0.4660 | -0.4443 |
| 0.0 | 13.33 | 500 | 5.0051 | -22.7007 | -28.5148 | 0.6100 | 5.8141 | -144.5073 | -138.2248 | -0.4580 | -0.4287 |
| 0.0 | 16.0 | 600 | 4.9951 | -22.7131 | -28.5252 | 0.6000 | 5.8121 | -144.5188 | -138.2386 | -0.4569 | -0.4278 |
| 0.0 | 18.67 | 700 | 4.9801 | -22.6913 | -28.5241 | 0.6200 | 5.8329 | -144.5176 | -138.2144 | -0.4571 | -0.4278 |
| 0.0 | 21.33 | 800 | 4.9915 | -22.6547 | -28.5091 | 0.6000 | 5.8544 | -144.5009 | -138.1738 | -0.4569 | -0.4278 |
| 0.0 | 24.0 | 900 | 4.9990 | -22.6732 | -28.5298 | 0.6200 | 5.8566 | -144.5239 | -138.1943 | -0.4568 | -0.4277 |
| 0.0 | 26.67 | 1000 | 4.9971 | -22.6484 | -28.5100 | 0.6200 | 5.8617 | -144.5019 | -138.1667 | -0.4573 | -0.4284 |
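
For context on the columns: under standard DPO, `Rewards/chosen` and `Rewards/rejected` are the implicit rewards, beta times the difference between the policy's and the reference model's log-probabilities of the response, and `Rewards/margins` is chosen minus rejected. A small sketch of these relationships (beta here is a placeholder; this card does not report it):

```python
# How the "Rewards/*" columns relate under standard DPO.
# beta is a hypothetical value; the card does not report the one used.
def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO implicit reward: beta * (log-prob under policy - log-prob under reference)."""
    return beta * (logp_policy - logp_ref)

def margin(reward_chosen: float, reward_rejected: float) -> float:
    """Rewards/margins is simply the chosen reward minus the rejected reward."""
    return reward_chosen - reward_rejected

# Final eval row: -22.6484 - (-28.5100) = 5.8616, matching the logged
# Rewards/margins of 5.8617 up to rounding of the underlying values.
print(margin(-22.6484, -28.5100))
```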
### Framework versions

- PEFT 0.10.0
- Transformers 4.39.3
- Pytorch 2.2.2+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2
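
To reproduce this environment, the versions above can be pinned at install time. A sketch (plain `torch==2.2.2` is shown; the card's `+cu121` CUDA build would normally come from the PyTorch wheel index):

```python
# Pin the framework versions listed above (run once in a fresh environment).
import subprocess
import sys

subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "peft==0.10.0",
    "transformers==4.39.3",
    "torch==2.2.2",  # card lists 2.2.2+cu121; CUDA builds need the PyTorch index
    "datasets==2.18.0",
    "tokenizers==0.15.2",
])
```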