Llama-3-8b-ultra-dpo-e3

This model is a fine-tuned version of meta-llama/Meta-Llama-3-8B-Instruct on an unknown dataset. It achieves the following results on the evaluation set:

Loss: 0.5561
Rewards/chosen: -1.5336
Rewards/rejected: -2.7616
Rewards/accuracies: 0.7344
Rewards/margins: 1.2280
Logps/rejected: -540.8266
Logps/chosen: -409.9130
Logits/rejected: 0.6689
Logits/chosen: 0.6266

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-07
train_batch_size: 2
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
num_devices: 8
gradient_accumulation_steps: 8
total_train_batch_size: 128
total_eval_batch_size: 64
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 3.0

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.633	0.2060	100	0.6299	-0.3411	-0.5682	0.6719	0.2271	-321.4839	-290.6666	0.2859	0.2237
0.6057	0.4119	200	0.6008	-0.4146	-0.7787	0.6875	0.3642	-342.5381	-298.0109	0.2313	0.1500
0.5805	0.6179	300	0.5810	-0.5541	-1.0549	0.6953	0.5007	-370.1489	-311.9684	0.4273	0.3082
0.5674	0.8239	400	0.5631	-0.5553	-1.1335	0.7031	0.5782	-378.0127	-312.0860	0.4776	0.3485
0.5212	1.0299	500	0.5476	-0.8333	-1.6260	0.7422	0.7927	-427.2674	-339.8888	0.5328	0.3993
0.462	1.2358	600	0.5485	-1.0524	-1.9650	0.6953	0.9126	-461.1649	-361.7939	0.7274	0.6099
0.4705	1.4418	700	0.5406	-0.9470	-1.8724	0.7266	0.9254	-451.9069	-351.2586	0.6854	0.5662
0.4708	1.6478	800	0.5353	-0.9113	-1.7896	0.7266	0.8782	-443.6194	-347.6862	0.7169	0.6033
0.4723	1.8538	900	0.5403	-1.0264	-1.9967	0.7734	0.9703	-464.3328	-359.1928	0.6471	0.5481
0.3965	2.0597	1000	0.5528	-1.4400	-2.6263	0.75	1.1863	-527.2926	-400.5552	0.6392	0.5672
0.3825	2.2657	1100	0.5514	-1.4290	-2.6129	0.7344	1.1839	-525.9548	-399.4589	0.6708	0.6138
0.3819	2.4717	1200	0.5506	-1.4568	-2.6381	0.7266	1.1813	-528.4744	-402.2388	0.6711	0.6090
0.3897	2.6777	1300	0.5536	-1.4476	-2.6317	0.7422	1.1842	-527.8379	-401.3105	0.6740	0.6252
0.3681	2.8836	1400	0.5568	-1.5360	-2.7672	0.7422	1.2312	-541.3793	-410.1517	0.6666	0.6226

Framework versions

Transformers 4.45.1
Pytorch 2.4.1+cu121
Datasets 3.0.0
Tokenizers 0.20.0

tongliuphysics
/

Llama-3-8b-ultra-dpo-e3

Llama-3-8b-ultra-dpo-e3

Model description

Intended uses & limitations

Training and evaluation data

Training procedure

Training hyperparameters

Training results

Framework versions

Model tree for tongliuphysics/Llama-3-8b-ultra-dpo-e3

Evaluation results