sablo-pebble-mistral-dpo-lora-HelpSteer_binarized

This model is a fine-tuned version of sablo/sablo-pebble-mistral on the sablo/HelpSteer_binarized dataset. It achieves the following results on the evaluation set (these match the final logged evaluation at step 1000, epoch 0.98, in the training results below):

Loss: 0.5371
Rewards/chosen: -0.9335
Rewards/rejected: -1.6455
Rewards/accuracies: 0.7264
Rewards/margins: 0.7121
Logps/rejected: -298.0735
Logps/chosen: -253.4149
Logits/rejected: -2.4554
Logits/chosen: -2.5093
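
The reward metrics above follow the naming used for Direct Preference Optimization (DPO) logging. As a rough sketch of how such metrics are commonly derived, the snippet below computes them from sequence log-probabilities under the policy and a frozen reference model. The function name `dpo_metrics`, the toy inputs, and `beta=0.1` are assumptions for illustration; the card does not state the beta value or the exact trainer used.

```python
import math

def dpo_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Sketch of DPO-style metrics from per-example sequence log-probabilities.

    beta is an assumed value; the model card does not report it.
    """
    # Implicit reward: beta-scaled log-ratio between policy and reference model.
    rewards_chosen = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
    rewards_rejected = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]
    margins = [c - r for c, r in zip(rewards_chosen, rewards_rejected)]
    # Sigmoid DPO loss: -log(sigmoid(margin)) == log(1 + exp(-margin)).
    losses = [math.log1p(math.exp(-m)) for m in margins]
    n = len(margins)
    return {
        "loss": sum(losses) / n,
        "rewards/chosen": sum(rewards_chosen) / n,
        "rewards/rejected": sum(rewards_rejected) / n,
        # Fraction of pairs where the chosen response gets the higher reward.
        "rewards/accuracies": sum(m > 0 for m in margins) / n,
        "rewards/margins": sum(margins) / n,
    }

# Toy log-probabilities (hypothetical, not taken from this model).
metrics = dpo_metrics(
    policy_chosen=[-10.0, -12.0],
    policy_rejected=[-15.0, -9.0],
    ref_chosen=[-11.0, -12.0],
    ref_rejected=[-13.0, -10.0],
)

# Consistency check on the reported evaluation numbers: the margin should equal
# chosen minus rejected, up to rounding of the logged values.
reported_margin_gap = abs((-0.9335 - (-1.6455)) - 0.7121)
```

Note that the mean loss is an average of per-example losses, so it cannot be recovered from the mean margin alone. The reported margin of 0.7121 does agree with the reported chosen and rejected rewards to within rounding.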

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 4
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
gradient_accumulation_steps: 2
total_train_batch_size: 8
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1
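
To make the batch-size and scheduler settings concrete, here is a minimal sketch of the arithmetic. It assumes the usual product formula for the effective batch size and a linear-warmup cosine decay to zero. `TOTAL_STEPS = 1000` and `NUM_PROCESSES = 1` are illustrative assumptions: the card logs step 1000 at epoch 0.98 but does not give the exact number of optimizer steps or the device count.

```python
import math

BASE_LR = 5e-6            # learning_rate from the card
WARMUP_RATIO = 0.1        # lr_scheduler_warmup_ratio from the card
PER_DEVICE_BATCH = 4      # train_batch_size from the card
GRAD_ACCUM_STEPS = 2      # gradient_accumulation_steps from the card
NUM_PROCESSES = 1         # assumption: the value that makes the product equal 8
TOTAL_STEPS = 1000        # assumption: exact step count is not stated in the card

# Effective batch size per optimizer step under the usual product formula.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_PROCESSES

def lr_at_step(step, total_steps=TOTAL_STEPS, base_lr=BASE_LR, warmup_ratio=WARMUP_RATIO):
    """Learning rate under linear warmup followed by half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

for s in (0, 50, 100, 550, 1000):
    print(s, lr_at_step(s))
```

With these assumptions the learning rate peaks at 5e-06 once warmup ends (step 100 of 1000) and decays to zero by the last step.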

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6874	0.1	100	0.6892	0.0213	0.0133	0.6698	0.0080	-132.1924	-157.9395	-2.4463	-2.4843
0.6592	0.2	200	0.6594	0.0055	-0.0704	0.6698	0.0759	-140.5588	-159.5180	-2.4922	-2.5370
0.5451	0.3	300	0.5867	-0.4490	-0.7587	0.6863	0.3097	-209.3938	-204.9713	-2.5128	-2.5620
0.4933	0.39	400	0.5591	-0.6060	-1.1029	0.7146	0.4968	-243.8062	-220.6713	-2.4868	-2.5386
0.5271	0.49	500	0.5488	-0.6712	-1.2738	0.7193	0.6026	-260.8958	-227.1889	-2.4784	-2.5312
0.4594	0.59	600	0.5418	-0.7977	-1.4672	0.7311	0.6695	-280.2420	-239.8430	-2.4672	-2.5200
0.5444	0.69	700	0.5358	-0.7688	-1.4528	0.7335	0.6840	-278.8014	-236.9531	-2.4594	-2.5127
0.5755	0.79	800	0.5405	-1.0672	-1.7631	0.7311	0.6959	-309.8293	-266.7906	-2.4585	-2.5118
0.5495	0.89	900	0.5371	-0.9321	-1.6450	0.7288	0.7129	-298.0242	-253.2804	-2.4558	-2.5096
0.5948	0.98	1000	0.5371	-0.9335	-1.6455	0.7264	0.7121	-298.0735	-253.4149	-2.4554	-2.5093
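
The table can be checked with a few lines of code. The sketch below copies the step, validation loss, and reward columns from the rows above, finds the evaluation with the lowest validation loss, and confirms that each logged margin equals chosen minus rejected up to rounding. The variable names are illustrative only.

```python
# (step, validation_loss, rewards_chosen, rewards_rejected, rewards_margin)
# Values copied from the training results table.
ROWS = [
    (100, 0.6892, 0.0213, 0.0133, 0.0080),
    (200, 0.6594, 0.0055, -0.0704, 0.0759),
    (300, 0.5867, -0.4490, -0.7587, 0.3097),
    (400, 0.5591, -0.6060, -1.1029, 0.4968),
    (500, 0.5488, -0.6712, -1.2738, 0.6026),
    (600, 0.5418, -0.7977, -1.4672, 0.6695),
    (700, 0.5358, -0.7688, -1.4528, 0.6840),
    (800, 0.5405, -1.0672, -1.7631, 0.6959),
    (900, 0.5371, -0.9321, -1.6450, 0.7129),
    (1000, 0.5371, -0.9335, -1.6455, 0.7121),
]

# Evaluation point with the lowest validation loss.
best_step, best_loss = min(ROWS, key=lambda row: row[1])[:2]

# Largest disagreement between the logged margin and (chosen - rejected).
max_margin_gap = max(abs((c - r) - m) for _, _, c, r, m in ROWS)

print(best_step, best_loss)  # 700 0.5358
```

The lowest validation loss (0.5358) occurs at step 700, slightly below the final value of 0.5371, while the reward margin keeps growing through step 900.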

Framework versions

PEFT 0.7.1
Transformers 4.36.2
Pytorch 2.0.1+cu118
Datasets 2.14.6
Tokenizers 0.15.0
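
When trying to reproduce the run, it can help to compare the local environment against the versions listed above. The following is a small sketch using only the standard library; the helper name `compare_versions` is hypothetical, and the distribution names (for example `torch` for Pytorch) are the usual PyPI names rather than anything stated in the card.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in the card, keyed by the usual PyPI distribution name.
EXPECTED = {
    "peft": "0.7.1",
    "transformers": "4.36.2",
    "torch": "2.0.1+cu118",
    "datasets": "2.14.6",
    "tokenizers": "0.15.0",
}

def compare_versions(expected, lookup=version):
    """Return {name: (wanted, installed_or_None, matches)} for each package."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed, installed == wanted)
    return report

for name, (wanted, installed, ok) in compare_versions(EXPECTED).items():
    status = "ok" if ok else "differs"
    print(f"{name}: wanted {wanted}, found {installed} ({status})")
```

A mismatch does not necessarily mean the adapter will fail to load; the check only reports where the environment differs from the one recorded here.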

dctanner/sablo-pebble-mistral-dpo-lora-HelpSteer_binarized
