# model_usp3_dpo5
This model is a fine-tuned version of meta-llama/Llama-2-7b-chat-hf on an unknown dataset. It achieves the following results on the evaluation set (a sanity check on the reported margin follows the list):
- Loss: 1.9416
- Rewards/chosen: -10.2877
- Rewards/rejected: -15.9492
- Rewards/accuracies: 0.6500
- Rewards/margins: 5.6615
- Logps/rejected: -144.5642
- Logps/chosen: -130.7199
- Logits/rejected: -0.9327
- Logits/chosen: -0.9193
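For reference, the reported margin is simply the gap between the chosen and rejected rewards, which is consistent with the figures above:

$$
\text{rewards/margins} = \text{rewards/chosen} - \text{rewards/rejected} = -10.2877 - (-15.9492) = 5.6615
$$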
## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters
The following hyperparameters were used during training; a hedged training sketch based on them follows the list:
- learning_rate: 0.0005
- train_batch_size: 4
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- training_steps: 1000
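The card does not include the training script, but the hyperparameters above map directly onto a TRL `DPOTrainer` setup with a PEFT LoRA adapter. Below is a minimal sketch under that assumption; the preference dataset, `beta`, and LoRA settings are placeholders not specified by this card.

```python
# Minimal sketch of a DPO run matching the hyperparameters above.
# Assumptions (not in the card): the preference dataset, beta, and LoRA config.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Placeholder: the actual dataset is unknown. DPOTrainer expects
# "prompt"/"chosen"/"rejected" columns.
train_dataset = Dataset.from_dict({
    "prompt": ["Example prompt"],
    "chosen": ["Preferred response"],
    "rejected": ["Dispreferred response"],
})

args = TrainingArguments(
    output_dir="model_usp3_dpo5",
    learning_rate=5e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,   # effective train batch size 16
    lr_scheduler_type="cosine",
    warmup_steps=100,
    max_steps=1000,
    seed=42,
)

peft_config = LoraConfig(  # assumed LoRA settings; the card only lists PEFT 0.10.0
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
)

trainer = DPOTrainer(
    model,
    ref_model=None,   # with a PEFT adapter, TRL uses the frozen base weights as reference
    args=args,
    beta=0.1,         # assumed; not reported in the card
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```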
### Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:--------------:|
| 0.0839 | 2.67 | 100 | 1.2957 | -6.5028 | -8.3281 | 0.6100 | 1.8253 | -129.3219 | -123.1500 | -0.2745 | -0.2356 |
| 0.0042 | 5.33 | 200 | 1.7603 | -5.5711 | -8.4268 | 0.7000 | 2.8558 | -129.5194 | -121.2866 | -0.9020 | -0.8864 |
| 0.0011 | 8.0 | 300 | 1.6254 | -11.3020 | -17.4510 | 0.6900 | 6.1490 | -147.5677 | -132.7485 | -0.7903 | -0.7597 |
| 0.0 | 10.67 | 400 | 1.9460 | -10.1997 | -15.8452 | 0.6500 | 5.6455 | -144.3562 | -130.5438 | -0.9297 | -0.9163 |
| 0.0 | 13.33 | 500 | 1.9345 | -10.2432 | -15.9204 | 0.6500 | 5.6772 | -144.5066 | -130.6309 | -0.9318 | -0.9181 |
| 0.0 | 16.0 | 600 | 1.9390 | -10.2636 | -15.9466 | 0.6500 | 5.6831 | -144.5590 | -130.6716 | -0.9321 | -0.9182 |
| 0.0 | 18.67 | 700 | 1.9438 | -10.2986 | -15.9585 | 0.6500 | 5.6599 | -144.5827 | -130.7415 | -0.9326 | -0.9190 |
| 0.0 | 21.33 | 800 | 1.9351 | -10.2903 | -15.9732 | 0.6500 | 5.6829 | -144.6121 | -130.7250 | -0.9323 | -0.9188 |
| 0.0 | 24.0 | 900 | 1.9341 | -10.3034 | -15.9669 | 0.6500 | 5.6635 | -144.5995 | -130.7512 | -0.9328 | -0.9192 |
| 0.0 | 26.67 | 1000 | 1.9416 | -10.2877 | -15.9492 | 0.6500 | 5.6615 | -144.5642 | -130.7199 | -0.9327 | -0.9193 |
### Framework versions
- PEFT 0.10.0
- Transformers 4.39.3
- Pytorch 2.2.2+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2
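To run inference, the adapter is loaded on top of the base model with PEFT. A minimal sketch, assuming the adapter lives at `guoyu-zhang/model_usp3_dpo5` (the repository this card belongs to):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Attach the DPO-trained LoRA adapter from this repository.
model = PeftModel.from_pretrained(base, "guoyu-zhang/model_usp3_dpo5")

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```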