Llama-2-7b-hf-DPO-LookAhead5_FullEval_TTree1.4_TLoop0.7_TEval0.2_V2.0

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf. The training dataset is not specified. Evaluation-set results logged during training are listed in the table at the end of this card.

Model description

More information needed

The hyperparameters used during training are not listed in this card. Training and evaluation metrics, logged every 78 steps:

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6766	0.2994	78	0.7200	0.0401	0.0730	0.4167	-0.0329	-127.2202	-142.6290	-0.3528	-0.3393
0.6875	0.5988	156	0.6657	-0.4080	-0.4881	0.5833	0.0801	-132.8311	-147.1099	-0.3720	-0.3575
0.7999	0.8983	234	0.6842	-0.3659	-0.4094	0.6667	0.0435	-132.0449	-146.6892	-0.3674	-0.3517
0.4879	1.1977	312	0.6694	-0.2237	-0.2979	0.4167	0.0742	-130.9293	-145.2672	-0.3979	-0.3821
0.6233	1.4971	390	0.6523	-0.9992	-1.1797	0.5000	0.1804	-139.7471	-153.0225	-0.5012	-0.4885
0.4034	1.7965	468	0.7021	-0.9141	-1.0257	0.4167	0.1116	-138.2080	-152.1710	-0.4511	-0.4394
0.1778	2.0960	546	0.7896	-1.2322	-1.2047	0.4167	-0.0275	-139.9971	-155.3521	-0.5752	-0.5642
0.2732	2.3954	624	0.9364	-1.8694	-1.7281	0.4167	-0.1412	-145.2318	-161.7236	-0.7728	-0.7633
0.1812	2.6948	702	0.9683	-2.0710	-1.9135	0.4167	-0.1575	-147.0860	-163.7400	-0.8137	-0.8049
0.1798	2.9942	780	0.9728	-2.0556	-1.8953	0.4167	-0.1602	-146.9038	-163.5856	-0.8198	-0.8110
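
The reward columns above appear to follow the usual DPO logging convention, in which Rewards/margins is Rewards/chosen minus Rewards/rejected. The card does not state this, so it is treated as an assumption here. The sketch below copies a few columns from the table, checks that relationship up to rounding, and finds the step with the lowest validation loss. The names `ROWS`, `margin_mismatches`, and `best_step` are hypothetical helpers for illustration, not part of any library.

```python
# Values copied from the table above:
# (step, validation loss, rewards/chosen, rewards/rejected, rewards/margins)
ROWS = [
    (78, 0.7200, 0.0401, 0.0730, -0.0329),
    (156, 0.6657, -0.4080, -0.4881, 0.0801),
    (234, 0.6842, -0.3659, -0.4094, 0.0435),
    (312, 0.6694, -0.2237, -0.2979, 0.0742),
    (390, 0.6523, -0.9992, -1.1797, 0.1804),
    (468, 0.7021, -0.9141, -1.0257, 0.1116),
    (546, 0.7896, -1.2322, -1.2047, -0.0275),
    (624, 0.9364, -1.8694, -1.7281, -0.1412),
    (702, 0.9683, -2.0710, -1.9135, -0.1575),
    (780, 0.9728, -2.0556, -1.8953, -0.1602),
]


def margin_mismatches(rows, tol=2e-4):
    """Return the steps where the logged margin differs from
    chosen - rejected by more than tol (tol absorbs 4-decimal rounding)."""
    return [
        step
        for step, _, chosen, rejected, margin in rows
        if abs((chosen - rejected) - margin) > tol
    ]


def best_step(rows):
    """Return the step with the lowest validation loss."""
    return min(rows, key=lambda row: row[1])[0]


if __name__ == "__main__":
    print("margin mismatches:", margin_mismatches(ROWS))
    print("lowest validation loss at step:", best_step(ROWS))
```

On the logged values, every margin matches chosen minus rejected within rounding, and the lowest validation loss (0.6523) falls at step 390. From step 546 onward the validation loss rises and the margin turns negative while the training loss keeps falling.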