# dpo
This model is a fine-tuned version of mistralai/Mistral-Nemo-Instruct-2407 on the heat_transfer_dpo dataset. It achieves the following results on the evaluation set (a usage sketch for loading the model follows the list):
- Loss: 0.1331
- Rewards/chosen: -4.9675
- Rewards/rejected: -13.7312
- Rewards/accuracies: 0.9480
- Rewards/margins: 8.7637
- Logps/chosen: -224.7040
- Logps/rejected: -310.9190
- Logits/chosen: -1.4384
- Logits/rejected: -1.4474
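The card does not include usage instructions. The sketch below shows one way to load the model, assuming the repository ships a PEFT adapter on top of mistralai/Mistral-Nemo-Instruct-2407 (PEFT 0.12.0 is listed under framework versions, and the model tree at the bottom of the card names that checkpoint); the example prompt is illustrative only.

```python
# Minimal loading sketch (not taken from the card). Assumes a PEFT adapter on
# top of the base instruct checkpoint; the repo id comes from the model tree.
# Environment pins from "Framework versions":
#   pip install "peft==0.12.0" "transformers==4.46.0" "torch==2.4.0"
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "mistralai/Mistral-Nemo-Instruct-2407"
adapter_id = "Howard881010/heat_transfer_dpo"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)  # attach the DPO adapter

# Illustrative heat-transfer prompt; not taken from the training data.
messages = [{"role": "user", "content": "Explain convective heat transfer in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))
```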
## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters
The following hyperparameters were used during training (see the configuration sketch after this list):
- learning_rate: 5e-06
- train_batch_size: 5
- eval_batch_size: 5
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- total_train_batch_size: 10
- total_eval_batch_size: 10
- optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 2
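The card does not state which training framework produced these settings. As one possible mapping, the sketch below shows how they could translate to TRL's DPOTrainer; TRL itself, the DPO beta, the precision, the LoRA configuration, and the dataset path are all assumptions, not facts from the card.

```python
# Hypothetical reconstruction of the run with TRL's DPOTrainer. The framework,
# beta, precision, LoRA config, and dataset file are assumptions; only the
# hyperparameter values mirror the list above.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "mistralai/Mistral-Nemo-Instruct-2407"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder path: the card only names the dataset "heat_transfer_dpo".
# DPOTrainer expects "prompt", "chosen", and "rejected" columns.
train_dataset = load_dataset("json", data_files="heat_transfer_dpo.json", split="train")

args = DPOConfig(
    output_dir="dpo",
    learning_rate=5e-6,
    per_device_train_batch_size=5,
    per_device_eval_batch_size=5,
    num_train_epochs=2,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    optim="adamw_torch",
    beta=0.1,   # assumed; the card does not report the DPO beta
    bf16=True,  # assumed precision
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL releases
    peft_config=LoraConfig(task_type="CAUSAL_LM"),  # assumed adapter setup
)
trainer.train()
```

Launched across two processes (num_devices: 2), the per-device batch size of 5 reproduces the total train batch size of 10 reported above.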
### Training results
Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
---|---|---|---|---|---|---|---|---|---|---|---|
0.6939 | 0.0667 | 60 | 0.6921 | -0.0219 | -0.0246 | 0.5190 | 0.0026 | -175.2482 | -173.8529 | -1.4010 | -1.4008 |
0.6871 | 0.1333 | 120 | 0.6830 | -0.0278 | -0.0494 | 0.6080 | 0.0216 | -175.3069 | -174.1010 | -1.4030 | -1.4029 |
0.6159 | 0.2 | 180 | 0.6382 | -0.5399 | -0.7225 | 0.5610 | 0.1826 | -180.4279 | -180.8317 | -1.4021 | -1.4025 |
0.368 | 0.2667 | 240 | 0.3849 | -1.3538 | -2.7449 | 0.8310 | 1.3911 | -188.5674 | -201.0563 | -1.3971 | -1.3996 |
0.3234 | 0.3333 | 300 | 0.3633 | -2.1358 | -4.6104 | 0.8230 | 2.4747 | -196.3865 | -219.7114 | -1.4248 | -1.4282 |
0.2649 | 0.4 | 360 | 0.3037 | -3.3073 | -6.0363 | 0.8800 | 2.7290 | -208.1017 | -233.9699 | -1.4411 | -1.4450 |
0.1784 | 0.4667 | 420 | 0.2159 | -3.8934 | -7.0789 | 0.9100 | 3.1855 | -213.9628 | -244.3959 | -1.4470 | -1.4523 |
0.2608 | 0.5333 | 480 | 0.2073 | -3.8076 | -7.8889 | 0.9100 | 4.0813 | -213.1049 | -252.4960 | -1.4509 | -1.4571 |
0.2459 | 0.6 | 540 | 0.2173 | -4.7738 | -9.6025 | 0.8890 | 4.8287 | -222.7667 | -269.6319 | -1.4478 | -1.4529 |
0.1729 | 0.6667 | 600 | 0.2264 | -3.6641 | -9.1186 | 0.9200 | 5.4546 | -211.6696 | -264.7935 | -1.4379 | -1.4430 |
0.2136 | 0.7333 | 660 | 0.1994 | -3.1520 | -8.0180 | 0.9190 | 4.8660 | -206.5491 | -253.7874 | -1.4456 | -1.4518 |
0.2148 | 0.8 | 720 | 0.2623 | -3.3220 | -8.6375 | 0.9040 | 5.3155 | -208.2492 | -259.9820 | -1.4527 | -1.4588 |
0.151 | 0.8667 | 780 | 0.2628 | -3.7843 | -9.3305 | 0.8830 | 5.5462 | -212.8717 | -266.9124 | -1.4556 | -1.4621 |
0.1759 | 0.9333 | 840 | 0.1736 | -3.7518 | -9.3561 | 0.9270 | 5.6043 | -212.5472 | -267.1683 | -1.4565 | -1.4631 |
0.1455 | 1.0 | 900 | 0.1967 | -3.4547 | -10.0926 | 0.9290 | 6.6379 | -209.5764 | -274.5335 | -1.4551 | -1.4625 |
0.1456 | 1.0667 | 960 | 0.2037 | -3.9507 | -10.4184 | 0.9290 | 6.4677 | -214.5359 | -277.7913 | -1.4538 | -1.4610 |
0.1276 | 1.1333 | 1020 | 0.2090 | -3.7958 | -10.3930 | 0.9240 | 6.5972 | -212.9869 | -277.5373 | -1.4494 | -1.4568 |
0.1768 | 1.2 | 1080 | 0.1744 | -3.7397 | -10.8265 | 0.9350 | 7.0868 | -212.4255 | -281.8718 | -1.4487 | -1.4565 |
0.2379 | 1.2667 | 1140 | 0.1679 | -4.2998 | -11.1092 | 0.9260 | 6.8094 | -218.0269 | -284.6993 | -1.4458 | -1.4532 |
0.0571 | 1.3333 | 1200 | 0.1626 | -4.5185 | -12.4102 | 0.9420 | 7.8917 | -220.2143 | -297.7095 | -1.4335 | -1.4415 |
0.1644 | 1.4 | 1260 | 0.1614 | -4.3048 | -12.2288 | 0.9400 | 7.9240 | -218.0764 | -295.8950 | -1.4410 | -1.4497 |
0.3264 | 1.4667 | 1320 | 0.1427 | -4.5696 | -12.5596 | 0.9470 | 7.9900 | -220.7249 | -299.2028 | -1.4390 | -1.4475 |
0.1088 | 1.5333 | 1380 | 0.1382 | -4.6426 | -12.7848 | 0.9510 | 8.1422 | -221.4554 | -301.4557 | -1.4380 | -1.4465 |
0.1853 | 1.6 | 1440 | 0.1417 | -4.9985 | -13.2069 | 0.9490 | 8.2084 | -225.0136 | -305.6761 | -1.4349 | -1.4433 |
0.1406 | 1.6667 | 1500 | 0.1741 | -5.1167 | -13.8396 | 0.9410 | 8.7229 | -226.1956 | -312.0029 | -1.4283 | -1.4373 |
0.1751 | 1.7333 | 1560 | 0.1433 | -4.9687 | -13.7012 | 0.9480 | 8.7325 | -224.7161 | -310.6195 | -1.4309 | -1.4397 |
0.1648 | 1.8 | 1620 | 0.1368 | -4.9785 | -13.6896 | 0.9500 | 8.7111 | -224.8141 | -310.5035 | -1.4335 | -1.4424 |
0.1109 | 1.8667 | 1680 | 0.1367 | -5.0609 | -13.8370 | 0.9480 | 8.7762 | -225.6376 | -311.9777 | -1.4341 | -1.4430 |
0.1875 | 1.9333 | 1740 | 0.1388 | -5.0304 | -13.7910 | 0.9500 | 8.7607 | -225.3328 | -311.5176 | -1.4356 | -1.4445 |
0.0947 | 2.0 | 1800 | 0.1331 | -4.9675 | -13.7312 | 0.9480 | 8.7637 | -224.7040 | -310.9190 | -1.4384 | -1.4474 |
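As a reading aid for the columns above, and assuming the standard DPO formulation (the card does not restate it or report the β it used), the implicit reward of a completion $y$ for a prompt $x$ is

$$
r_\theta(x, y) = \beta \left( \log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x) \right),
$$

so Rewards/chosen and Rewards/rejected are this quantity averaged over the chosen and rejected completions in the evaluation set, Rewards/margins is their difference, and Rewards/accuracies is the fraction of pairs whose chosen reward exceeds the rejected reward. The reported loss is the DPO objective

$$
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left( \beta \left[ \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right] \right),
$$

with $y_w$ the chosen and $y_l$ the rejected completion. Logps/chosen and Logps/rejected are the policy log-probabilities of the two completions under the fine-tuned model.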
### Framework versions
- PEFT 0.12.0
- Transformers 4.46.0
- Pytorch 2.4.0+cu121
- Datasets 2.21.0
- Tokenizers 0.20.1
## Model tree for Howard881010/heat_transfer_dpo

- Base model: mistralai/Mistral-Nemo-Base-2407
- Fine-tuned checkpoint this adapter builds on: mistralai/Mistral-Nemo-Instruct-2407