model_usp2_dpo5

This model is a fine-tuned version of meta-llama/Llama-2-7b-chat-hf on an unknown dataset. Its results on the evaluation set at each checkpoint are listed in the table at the end of this card.

Model description

More information needed

Training hyperparameters: more information needed. The following training and evaluation metrics were logged every 100 steps:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.0339	2.67	100	1.1175	-5.1390	-8.2341	0.6900	3.0951	-126.8717	-120.9441	-0.3625	-0.3504
0.0325	5.33	200	1.6541	-3.3759	-6.0514	0.6000	2.6755	-122.5063	-117.4180	-0.3455	-0.3686
0.0021	8.0	300	2.0449	-11.1907	-16.3902	0.6500	5.1995	-143.1838	-133.0475	-1.0202	-0.9909
0.0	10.67	400	2.0215	-12.0042	-17.6127	0.6600	5.6085	-145.6288	-134.6745	-1.0461	-1.0130
0.0	13.33	500	2.0139	-12.0087	-17.6260	0.6700	5.6174	-145.6555	-134.6835	-1.0475	-1.0146
0.0	16.0	600	2.0087	-12.0288	-17.6593	0.6700	5.6304	-145.7220	-134.7238	-1.0479	-1.0150
0.0	18.67	700	2.0128	-12.0402	-17.6571	0.6700	5.6170	-145.7177	-134.7465	-1.0484	-1.0150
0.0	21.33	800	2.0087	-12.0427	-17.6819	0.6600	5.6391	-145.7672	-134.7516	-1.0483	-1.0151
0.0	24.0	900	2.0120	-12.0409	-17.6812	0.6700	5.6402	-145.7658	-134.7480	-1.0489	-1.0159
0.0	26.67	1000	2.0083	-12.0382	-17.6824	0.6600	5.6442	-145.7682	-134.7426	-1.0481	-1.0149
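
The card does not define the reward columns. Assuming Rewards/margins is the difference between Rewards/chosen and Rewards/rejected (the usual convention for preference-optimization logs, but an assumption here), the rows can be sanity-checked with a short sketch. The helper names below are hypothetical, and the numbers are copied from the first four rows of the table.

```python
# Sketch: check that margins == chosen - rejected, and find the checkpoint
# with the lowest validation loss. Assumes the column meanings described in
# the lead-in; the card itself does not define them.

# (step, validation_loss, rewards_chosen, rewards_rejected, rewards_margins)
ROWS = [
    (100, 1.1175, -5.1390, -8.2341, 3.0951),
    (200, 1.6541, -3.3759, -6.0514, 2.6755),
    (300, 2.0449, -11.1907, -16.3902, 5.1995),
    (400, 2.0215, -12.0042, -17.6127, 5.6085),
]


def max_margin_error(rows):
    """Largest absolute gap between the logged margin and chosen - rejected."""
    return max(abs((chosen - rejected) - margin)
               for _, _, chosen, rejected, margin in rows)


def best_step_by_validation_loss(rows):
    """Step whose validation loss is lowest among the given rows."""
    return min(rows, key=lambda row: row[1])[0]


if __name__ == "__main__":
    print("max margin error:", max_margin_error(ROWS))
    print("lowest validation loss at step:", best_step_by_validation_loss(ROWS))
```

In these rows the validation loss is lowest at the first logged checkpoint while the training loss keeps falling, so checkpoint selection by validation loss and by reward margin would point to different steps.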