# model_usp3_dpo1
This model is a fine-tuned version of meta-llama/Llama-2-7b-chat-hf on an unknown dataset. It achieves the following results on the evaluation set:
- Loss: 1.2527
- Rewards/chosen: -10.7037
- Rewards/rejected: -13.2986
- Rewards/accuracies: 0.7000
- Rewards/margins: 2.5949
- Logps/rejected: -242.7872
- Logps/chosen: -215.0880
- Logits/rejected: -0.8066
- Logits/chosen: -0.8008
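These metric names match the logging of TRL's DPOTrainer; the card itself does not define them, so the following is a hedged recap of the standard DPO quantities (with temperature β, policy π_θ, and frozen reference π_ref), assuming the usual setup:

```latex
% Implicit reward for a completion y given prompt x (reported above as
% Rewards/chosen and Rewards/rejected):
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}

% Rewards/margins is the mean gap between chosen (y_w) and rejected (y_l)
% rewards, and the DPO loss is its negative log-sigmoid:
\mathcal{L}_{\text{DPO}} = -\log \sigma\bigl( r(x, y_w) - r(x, y_l) \bigr)
```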
## Model description
More information needed
## Intended uses & limitations
More information needed
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0005
- train_batch_size: 4
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- training_steps: 1000
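For context, here is a hedged sketch of what a run with these hyperparameters could look like using TRL's DPOTrainer (contemporary with the framework versions listed below). The preference dataset, LoRA configuration, and DPO beta are not stated in this card, so those values are placeholders:

```python
# Hedged reconstruction of the training setup; dataset, beta, and LoRA
# config are assumptions, not taken from this card.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

args = TrainingArguments(
    output_dir="model_usp3_dpo1",
    learning_rate=5e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,   # effective train batch size of 16
    lr_scheduler_type="cosine",
    warmup_steps=100,
    max_steps=1000,
    seed=42,
)

# Placeholder: a preference dataset with "prompt"/"chosen"/"rejected" columns.
dataset = load_dataset("your/preference-dataset", split="train")

trainer = DPOTrainer(
    model,
    ref_model=None,          # with a PEFT config, TRL derives the reference from the base weights
    args=args,
    beta=0.1,                # assumption: the card does not state beta
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=LoraConfig(task_type="CAUSAL_LM"),  # placeholder LoRA config
)
trainer.train()
```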
### Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.1152 | 2.67 | 100 | 0.9230 | -6.0994 | -7.5730 | 0.6100 | 1.4737 | -185.5316 | -169.0449 | -0.9595 | -0.9289 |
| 0.0007 | 5.33 | 200 | 1.2625 | -10.4507 | -12.9431 | 0.7100 | 2.4924 | -239.2319 | -212.5582 | -0.8924 | -0.8878 |
| 0.0002 | 8.0 | 300 | 1.2065 | -10.0237 | -12.4963 | 0.7000 | 2.4725 | -234.7639 | -208.2885 | -0.8455 | -0.8351 |
| 0.0001 | 10.67 | 400 | 1.2314 | -10.3811 | -12.9055 | 0.7100 | 2.5245 | -238.8566 | -211.8620 | -0.8259 | -0.8181 |
| 0.0001 | 13.33 | 500 | 1.2449 | -10.5483 | -13.1112 | 0.7100 | 2.5629 | -240.9136 | -213.5344 | -0.8155 | -0.8090 |
| 0.0001 | 16.0 | 600 | 1.2475 | -10.6353 | -13.2168 | 0.7100 | 2.5815 | -241.9690 | -214.4042 | -0.8099 | -0.8041 |
| 0.0001 | 18.67 | 700 | 1.2504 | -10.6796 | -13.2671 | 0.7000 | 2.5875 | -242.4725 | -214.8474 | -0.8075 | -0.8022 |
| 0.0001 | 21.33 | 800 | 1.2562 | -10.7029 | -13.2944 | 0.7000 | 2.5915 | -242.7449 | -215.0800 | -0.8065 | -0.8014 |
| 0.0001 | 24.0 | 900 | 1.2542 | -10.7066 | -13.2945 | 0.7000 | 2.5879 | -242.7467 | -215.1174 | -0.8061 | -0.8009 |
| 0.0001 | 26.67 | 1000 | 1.2527 | -10.7037 | -13.2986 | 0.7000 | 2.5949 | -242.7872 | -215.0880 | -0.8066 | -0.8008 |
### Framework versions
- PEFT 0.10.0
- Transformers 4.39.3
- Pytorch 2.2.2+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2
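Since PEFT 0.10.0 is listed, the repository presumably contains a LoRA adapter rather than full model weights. A minimal usage sketch, assuming the adapter loads directly onto the base chat model:

```python
# Hedged usage sketch: load the adapter on top of the base model with
# AutoPeftModelForCausalLM (supported in PEFT 0.10.0).
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

repo = "guoyu-zhang/model_usp3_dpo1"
model = AutoPeftModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Llama-2-chat prompt format for a single-turn instruction.
prompt = "[INST] What is direct preference optimization? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```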