zephyr-7b-dpo-full-gpt-reward-scale-05

This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-full; the training dataset is not specified. It achieves the following results on the evaluation set:

Loss: 0.5238
Rewards/chosen: -1.1890
Rewards/rejected: -2.1821
Rewards/accuracies: 0.7241
Rewards/margins: 0.9930
Logps/rejected: -463.8542
Logps/chosen: -402.9079
Logits/rejected: 3.3069
Logits/chosen: 1.9855
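
The metric names above follow the convention commonly used by DPO trainers. A minimal sketch of how such numbers are typically derived from per-example sequence log-probabilities is given here. It assumes the standard sigmoid DPO loss and an implicit reward of beta times the policy-minus-reference log-probability. The function names, the value of beta, and the example inputs are hypothetical, since this card reports none of them.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta):
    """Summarize DPO-style metrics from per-example sequence log-probs.

    beta is a hypothetical scaling factor; the card does not report it.
    """
    n = len(policy_chosen)
    # Implicit reward: beta * (policy log-prob - reference log-prob).
    chosen_rewards = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
    rejected_rewards = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    return {
        # -log(sigmoid(margin)) is the same as softplus(-margin).
        "loss": sum(softplus(-m) for m in margins) / n,
        "rewards/chosen": sum(chosen_rewards) / n,
        "rewards/rejected": sum(rejected_rewards) / n,
        # Fraction of pairs where the chosen response gets the higher reward.
        "rewards/accuracies": sum(m > 0 for m in margins) / n,
        "rewards/margins": sum(margins) / n,
        "logps/chosen": sum(policy_chosen) / n,
        "logps/rejected": sum(policy_rejected) / n,
    }


# Made-up inputs for illustration only (not taken from this model).
example = dpo_metrics(
    policy_chosen=[-10.0, -20.0],
    policy_rejected=[-12.0, -18.0],
    ref_chosen=[-11.0, -19.0],
    ref_rejected=[-11.0, -19.0],
    beta=0.5,
)
print(example)
```

Under these assumptions, a policy identical to the reference gives zero margin on every pair, so the loss starts at log 2 (about 0.693) and falls as the margins grow.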

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-07
train_batch_size: 8
eval_batch_size: 8
seed: 55
distributed_type: multi-GPU
num_devices: 8
gradient_accumulation_steps: 2
total_train_batch_size: 128
total_eval_batch_size: 64
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1
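
As a quick check on the values above, the total train batch size is the per-device batch size times the number of devices times the gradient accumulation steps. The sketch below reproduces that arithmetic and also illustrates a cosine schedule with linear warmup. The schedule function is a simplified approximation rather than the trainer's own code, and the total step count used in the example is a hypothetical placeholder.

```python
import math


def total_batch_size(per_device, num_devices, grad_accum_steps=1):
    """Effective number of examples consumed per optimizer step."""
    return per_device * num_devices * grad_accum_steps


def cosine_lr(step, total_steps, peak_lr=5e-07, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then half-cosine decay to zero (a sketch)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


train_total = total_batch_size(8, 8, 2)  # matches total_train_batch_size
eval_total = total_batch_size(8, 8)      # matches total_eval_batch_size
print(train_total, eval_total)

# total_steps=1000 is a hypothetical value chosen only for illustration.
for step in (0, 50, 100, 550, 1000):
    print(step, cosine_lr(step, total_steps=1000))
```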

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6687	0.1147	50	0.6560	-0.0264	-0.1298	0.6724	0.1034	-258.6246	-286.6438	-2.5075	-2.6072
0.581	0.2294	100	0.5764	-0.7311	-1.3172	0.7155	0.5861	-377.3666	-357.1160	0.6340	0.0270
0.558	0.3440	150	0.5510	-1.2031	-1.9696	0.7241	0.7665	-442.6071	-404.3199	3.0036	2.0828
0.5346	0.4587	200	0.5381	-1.1677	-2.0355	0.7112	0.8679	-449.2019	-400.7711	2.7759	1.7577
0.5391	0.5734	250	0.5333	-1.0858	-1.9666	0.7198	0.8807	-442.3041	-392.5903	2.9561	1.8167
0.5479	0.6881	300	0.5265	-1.0463	-1.9706	0.7069	0.9243	-442.7093	-388.6379	3.2239	2.0026
0.5232	0.8028	350	0.5262	-1.3359	-2.3191	0.7241	0.9832	-477.5577	-417.5966	3.6066	2.3484
0.5267	0.9174	400	0.5238	-1.1890	-2.1821	0.7241	0.9930	-463.8542	-402.9079	3.3069	1.9855
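
The table can be sanity-checked with a few lines of code. The sketch below copies a subset of the columns from the rows above, confirms that each reported margin equals chosen minus rejected up to rounding, finds the checkpoint with the lowest validation loss, and derives a rough estimate of optimizer steps per epoch from the step and epoch columns. That estimate is inferred from the table, not a value stated in this card.

```python
# (step, epoch, validation_loss, rewards_chosen, rewards_rejected, rewards_margin)
rows = [
    (50, 0.1147, 0.6560, -0.0264, -0.1298, 0.1034),
    (100, 0.2294, 0.5764, -0.7311, -1.3172, 0.5861),
    (150, 0.3440, 0.5510, -1.2031, -1.9696, 0.7665),
    (200, 0.4587, 0.5381, -1.1677, -2.0355, 0.8679),
    (250, 0.5734, 0.5333, -1.0858, -1.9666, 0.8807),
    (300, 0.6881, 0.5265, -1.0463, -1.9706, 0.9243),
    (350, 0.8028, 0.5262, -1.3359, -2.3191, 0.9832),
    (400, 0.9174, 0.5238, -1.1890, -2.1821, 0.9930),
]

# Largest gap between the reported margin and (chosen - rejected).
max_residual = max(abs((chosen - rejected) - margin)
                   for _, _, _, chosen, rejected, margin in rows)

# Evaluation step with the lowest validation loss.
best_step = min(rows, key=lambda row: row[2])[0]

# Rough estimate of optimizer steps in one epoch (step / epoch fraction).
steps_per_epoch = round(sum(step / epoch for step, epoch, *_ in rows) / len(rows))

print(max_residual, best_step, steps_per_epoch)
```

The small residuals are consistent with each column being rounded to four decimals independently.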

Framework versions

Transformers 4.44.0.dev0
Pytorch 2.1.2
Datasets 2.20.0
Tokenizers 0.19.1
