model_usp1_dpo9

This model is a fine-tuned version of meta-llama/Llama-2-7b-chat-hf on an unknown dataset. It achieves the following results on the evaluation set:

Model description

More information needed

More information needed

More information needed

The following hyperparameters were used during training:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.1142	2.67	100	1.4766	1.7195	-0.5069	0.6900	2.2263	-115.3510	-108.5270	-0.1562	-0.1401
0.0431	5.33	200	2.4525	-23.9043	-28.0492	0.6600	4.1448	-145.9536	-136.9979	-0.5122	-0.4496
0.0035	8.0	300	2.2089	-1.4351	-6.8781	0.75	5.4430	-122.4302	-112.0322	0.1830	0.2540
0.0	10.67	400	2.4382	-3.7206	-9.8637	0.7700	6.1431	-125.7475	-114.5715	0.0396	0.1191
0.0	13.33	500	2.4513	-3.7668	-9.8629	0.7700	6.0960	-125.7465	-114.6229	0.0387	0.1181
0.0	16.0	600	2.4281	-3.7498	-9.8885	0.7700	6.1387	-125.7750	-114.6039	0.0379	0.1174
0.0	18.67	700	2.4461	-3.7862	-9.9007	0.7700	6.1145	-125.7885	-114.6444	0.0376	0.1165
0.0	21.33	800	2.4172	-3.7467	-9.9054	0.7700	6.1587	-125.7938	-114.6005	0.0372	0.1166
0.0	24.0	900	2.4260	-3.7798	-9.9215	0.7700	6.1417	-125.8117	-114.6373	0.0377	0.1171
0.0	26.67	1000	2.4153	-3.7856	-9.9056	0.7700	6.1200	-125.7940	-114.6437	0.0370	0.1165