llama-8b-dpo-full

This model (fenguhao/llama-8b-dpo-full) is a fine-tuned version of princeton-nlp/Llama-3-Base-8B-SFT on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

Loss: 0.6316
Rewards/chosen: 0.6899
Rewards/rejected: 0.3044
Rewards/accuracies: 0.6600
Rewards/margins: 0.3855
Logps/rejected: -2200.0752
Logps/chosen: -2603.7832
Logits/rejected: -1.4288
Logits/chosen: -1.4752
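
The reward metrics above are related by simple arithmetic: Rewards/margins is Rewards/chosen minus Rewards/rejected. The sketch below checks that relationship. It also assumes, as the model name suggests but the card does not state, that training used the standard sigmoid DPO objective, in which the per-pair loss is -log(sigmoid(margin)). The function name `dpo_sigmoid_loss` is illustrative, not part of any library.

```python
import math

# Final evaluation metrics copied from the list above.
rewards_chosen = 0.6899
rewards_rejected = 0.3044
reported_margin = 0.3855
reported_loss = 0.6316


def dpo_sigmoid_loss(margin: float) -> float:
    """Per-pair loss -log(sigmoid(margin)), in a numerically stable form.

    Assumption: the run used the standard sigmoid DPO loss without
    label smoothing; the card itself does not confirm this.
    """
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The margin is the difference between the two mean rewards.
margin = rewards_chosen - rewards_rejected

# Loss evaluated at the mean margin. The reported loss is a mean over
# per-pair losses, and -log(sigmoid(x)) is convex, so by Jensen's
# inequality the reported value should not fall below this one.
loss_at_mean_margin = dpo_sigmoid_loss(margin)

print(f"margin = {margin:.4f} (reported {reported_margin})")
print(f"loss at mean margin = {loss_at_mean_margin:.4f} (reported mean loss {reported_loss})")
```

Under that assumption, a margin of zero gives a loss of log 2 (about 0.693), so a reported loss of 0.6316 indicates only a modest average separation between chosen and rejected responses.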

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-06
train_batch_size: 4
eval_batch_size: 4
seed: 42
distributed_type: multi-GPU
num_devices: 4
gradient_accumulation_steps: 2
total_train_batch_size: 32
total_eval_batch_size: 16
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1
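
The derived batch sizes and the learning-rate schedule follow from the values listed above. As a rough sketch: the total train batch size is the per-device batch size times the number of devices times the gradient accumulation steps, and a cosine schedule with a 0.1 warmup ratio ramps linearly to the peak rate and then decays along a half cosine. The card does not state the total number of optimizer steps, so `ILLUSTRATIVE_TOTAL_STEPS` below is a made-up round number, and `lr_at_step` is a hypothetical helper written for this example rather than a library function.

```python
import math

# Values copied from the hyperparameter list above.
train_batch_size = 4            # per device
eval_batch_size = 4             # per device
num_devices = 4
gradient_accumulation_steps = 2
learning_rate = 1e-06
warmup_ratio = 0.1

# Effective batch sizes.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices

# Assumption: a round number for illustration only, not the real step count.
ILLUSTRATIVE_TOTAL_STEPS = 2000


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = learning_rate,
               ratio: float = warmup_ratio) -> float:
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(total_train_batch_size, total_eval_batch_size)
for step in (0, 100, 200, 1100, 2000):
    print(step, lr_at_step(step, ILLUSTRATIVE_TOTAL_STEPS))
```

With these inputs the computed batch sizes match the listed total_train_batch_size of 32 and total_eval_batch_size of 16.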

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6558	0.05	100	0.6527	0.7712	0.5799	0.5740	0.1913	-2172.5291	-2595.6543	-1.1822	-1.2241
0.6404	0.1	200	0.6911	0.4590	0.2677	0.5860	0.1913	-2203.7483	-2626.8760	-1.2019	-1.2423
0.6725	0.16	300	0.6603	0.8108	0.5231	0.6320	0.2877	-2178.2058	-2591.6921	-1.3149	-1.3646
0.689	0.21	400	0.6529	0.8101	0.4993	0.6280	0.3108	-2180.5830	-2591.7649	-1.4428	-1.5029
0.6682	0.26	500	0.6674	0.9667	0.6125	0.6420	0.3542	-2169.2654	-2576.1008	-1.5148	-1.5665
0.6309	0.31	600	0.6445	0.8348	0.4673	0.6580	0.3675	-2183.7852	-2589.2971	-1.5885	-1.6449
0.6467	0.37	700	0.6482	0.8852	0.5455	0.6240	0.3397	-2175.9651	-2584.2512	-1.6562	-1.7105
0.6215	0.42	800	0.6453	1.0902	0.6825	0.6380	0.4077	-2162.2678	-2563.7546	-1.6541	-1.7085
0.6674	0.47	900	0.6416	0.7802	0.4490	0.6440	0.3312	-2185.6135	-2594.7568	-1.5145	-1.5652
0.644	0.52	1000	0.6500	0.7077	0.3679	0.6400	0.3398	-2193.7285	-2602.0039	-1.4506	-1.5047
0.6539	0.58	1100	0.6389	0.8477	0.4852	0.6500	0.3625	-2181.9937	-2588.0068	-1.4697	-1.5227
0.7267	0.63	1200	0.6421	0.5390	0.2257	0.6620	0.3133	-2207.9438	-2618.8738	-1.6292	-1.6800
0.5746	0.68	1300	0.6301	0.9057	0.4892	0.6660	0.4164	-2181.5920	-2582.2095	-1.4994	-1.5461
0.6053	0.73	1400	0.6342	0.8758	0.4563	0.6660	0.4196	-2184.8909	-2585.1914	-1.4440	-1.4891
0.6232	0.79	1500	0.6324	0.8055	0.3994	0.6580	0.4062	-2190.5796	-2592.2219	-1.4283	-1.4759
0.6326	0.84	1600	0.6392	0.4525	0.1032	0.6560	0.3493	-2220.1997	-2627.5283	-1.4501	-1.4959
0.6469	0.89	1700	0.6306	0.7453	0.3498	0.6660	0.3955	-2195.5359	-2598.2412	-1.4289	-1.4758
0.669	0.94	1800	0.6323	0.6544	0.2748	0.6600	0.3796	-2203.0393	-2607.3367	-1.4308	-1.4769
0.6531	0.99	1900	0.6317	0.6900	0.3040	0.6640	0.3860	-2200.1182	-2603.7776	-1.4289	-1.4754
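
The table can be sanity-checked and summarized with a few lines of code. The sketch below copies the Step, Validation Loss, Rewards/chosen, Rewards/rejected, and Rewards/margins columns, confirms that each margin equals chosen minus rejected up to rounding, and finds the evaluation steps with the lowest validation loss and the largest margin. The variable names are illustrative.

```python
# (step, validation_loss, rewards_chosen, rewards_rejected, rewards_margin)
# copied from the training results table above.
rows = [
    (100, 0.6527, 0.7712, 0.5799, 0.1913),
    (200, 0.6911, 0.4590, 0.2677, 0.1913),
    (300, 0.6603, 0.8108, 0.5231, 0.2877),
    (400, 0.6529, 0.8101, 0.4993, 0.3108),
    (500, 0.6674, 0.9667, 0.6125, 0.3542),
    (600, 0.6445, 0.8348, 0.4673, 0.3675),
    (700, 0.6482, 0.8852, 0.5455, 0.3397),
    (800, 0.6453, 1.0902, 0.6825, 0.4077),
    (900, 0.6416, 0.7802, 0.4490, 0.3312),
    (1000, 0.6500, 0.7077, 0.3679, 0.3398),
    (1100, 0.6389, 0.8477, 0.4852, 0.3625),
    (1200, 0.6421, 0.5390, 0.2257, 0.3133),
    (1300, 0.6301, 0.9057, 0.4892, 0.4164),
    (1400, 0.6342, 0.8758, 0.4563, 0.4196),
    (1500, 0.6324, 0.8055, 0.3994, 0.4062),
    (1600, 0.6392, 0.4525, 0.1032, 0.3493),
    (1700, 0.6306, 0.7453, 0.3498, 0.3955),
    (1800, 0.6323, 0.6544, 0.2748, 0.3796),
    (1900, 0.6317, 0.6900, 0.3040, 0.3860),
]

# Rows where margin differs from chosen - rejected by more than rounding error.
TOLERANCE = 1.5e-4
inconsistent = [
    step for step, _, chosen, rejected, margin in rows
    if abs((chosen - rejected) - margin) > TOLERANCE
]

# Evaluation steps with the lowest validation loss and the largest margin.
best_by_loss = min(rows, key=lambda r: r[1])
best_by_margin = max(rows, key=lambda r: r[4])

print("inconsistent rows:", inconsistent)
print("lowest validation loss:", best_by_loss[1], "at step", best_by_loss[0])
print("largest margin:", best_by_margin[4], "at step", best_by_margin[0])
```

Note that the lowest validation loss and the largest reward margin occur at different steps, and neither is the final evaluation step.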

Framework versions

Transformers 4.36.2
PyTorch 2.1.2
Datasets 2.14.6
Tokenizers 0.15.2
