Llama-3-8b-ultra-dpo-e2

This model is a fine-tuned version of meta-llama/Meta-Llama-3-8B-Instruct on an unknown dataset. It achieves the following results on the evaluation set:

Loss: 0.5453
Rewards/chosen: -0.8950
Rewards/rejected: -1.7403
Rewards/accuracies: 0.7422
Rewards/margins: 0.8454
Logps/rejected: -438.6973
Logps/chosen: -346.0516
Logits/rejected: 0.6221
Logits/chosen: 0.4858

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-07
train_batch_size: 2
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
num_devices: 8
gradient_accumulation_steps: 8
total_train_batch_size: 128
total_eval_batch_size: 64
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 2.0

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6335	0.2060	100	0.6304	-0.3352	-0.5566	0.6797	0.2214	-320.3228	-290.0782	0.2964	0.2341
0.6079	0.4119	200	0.6033	-0.3981	-0.7457	0.6875	0.3475	-339.2305	-296.3674	0.2534	0.1750
0.5833	0.6179	300	0.5853	-0.5366	-1.0116	0.6641	0.4749	-365.8224	-310.2185	0.4021	0.2900
0.5721	0.8239	400	0.5701	-0.5617	-1.1202	0.7031	0.5585	-376.6856	-312.7222	0.4446	0.3219
0.5326	1.0299	500	0.5544	-0.7451	-1.4427	0.7578	0.6976	-408.9373	-331.0641	0.4961	0.3617
0.4773	1.2358	600	0.5543	-0.9312	-1.7472	0.7031	0.8160	-439.3852	-349.6768	0.6470	0.5120
0.4892	1.4418	700	0.5471	-0.8746	-1.7007	0.7344	0.8261	-434.7292	-344.0101	0.6372	0.5024
0.4895	1.6478	800	0.5452	-0.9033	-1.7335	0.7188	0.8302	-438.0132	-346.8821	0.6595	0.5221
0.4926	1.8538	900	0.5455	-0.9149	-1.7694	0.7266	0.8545	-441.6077	-348.0443	0.6296	0.4935

Framework versions

Transformers 4.45.1
Pytorch 2.4.1+cu121
Datasets 3.0.0
Tokenizers 0.20.0

tongliuphysics
/

Llama-3-8b-ultra-dpo-e2

Llama-3-8b-ultra-dpo-e2

Model description

Intended uses & limitations

Training and evaluation data

Training procedure

Training hyperparameters

Training results

Framework versions

Model tree for tongliuphysics/Llama-3-8b-ultra-dpo-e2

Evaluation results