qwen_cpo_entropy_0_1

This model is a fine-tuned version of trl-lib/qwen1.5-0.5b-sft on the yakazimir/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

Loss: 0.7405
SFT Loss: 1.6848
Rewards/chosen: -1.7146
Rewards/rejected: -2.3727
Rewards/accuracies: 0.6773
Rewards/margins: 0.6581
Logps/rejected: -2.3727
Logps/chosen: -1.7146
Logits/rejected: 0.3000
Logits/chosen: 0.1875
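
As a quick consistency check on the numbers above, the reported reward margin matches the chosen reward minus the rejected reward. The snippet below is only a sketch of that arithmetic with values copied from this card. The variable names are illustrative and are not taken from the training code.

```python
# Values copied from the evaluation results listed above.
rewards_chosen = -1.7146
rewards_rejected = -2.3727
reported_margin = 0.6581

# Assumption: the margin is the chosen reward minus the rejected reward.
margin = rewards_chosen - rewards_rejected

# Allow for rounding, since the card reports four decimal places.
matches = abs(margin - reported_margin) < 1e-3
print(f"margin={margin:.4f} matches_reported={matches}")
```

The same relation holds, up to rounding in the last digit, for the rows of the training results table.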

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-06
train_batch_size: 2
eval_batch_size: 4
seed: 42
distributed_type: multi-GPU
gradient_accumulation_steps: 16
total_train_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 3.0
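
The batch-size entries above can be related to each other with a little arithmetic. The sketch below assumes the usual convention that the total batch size is the per-device batch size times the gradient accumulation steps times the number of devices. The card does not list a device count, so the value derived here is only what the listed numbers imply under that assumption.

```python
# Values copied from the hyperparameter list above.
train_batch_size = 2               # per-device batch size
gradient_accumulation_steps = 16
total_train_batch_size = 32

# Assumption: total = per-device batch * accumulation steps * device count.
per_device_effective = train_batch_size * gradient_accumulation_steps
implied_device_factor = total_train_batch_size // per_device_effective

print("examples per optimizer step on one device:", per_device_effective)
print("implied device factor:", implied_device_factor)
```

Under this assumption, each optimizer step consumes 32 preference pairs, all of which are accounted for by the per-device batch size and the accumulation steps.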

Training results

Training Loss	Epoch	Step	Validation Loss	SFT Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.8248	0.2141	400	0.8255	1.3905	-1.3850	-1.5360	0.5645	0.1510	-1.5360	-1.3850	0.3069	0.2210
0.7884	0.4282	800	0.7811	1.4857	-1.5199	-1.8625	0.6113	0.3426	-1.8625	-1.5199	0.4914	0.3895
0.8073	0.6422	1200	0.7653	1.5452	-1.5531	-1.9756	0.6298	0.4226	-1.9756	-1.5531	0.5229	0.4111
0.7417	0.8563	1600	0.7599	1.5652	-1.5632	-1.9862	0.6484	0.4230	-1.9862	-1.5632	0.5072	0.3924
0.8212	1.0704	2000	0.7518	1.5561	-1.5506	-2.0302	0.6543	0.4796	-2.0302	-1.5506	0.4351	0.3208
0.7326	1.2845	2400	0.7455	1.6027	-1.6077	-2.1582	0.6632	0.5505	-2.1582	-1.6077	0.4993	0.3799
0.7742	1.4986	2800	0.7444	1.6196	-1.6148	-2.1590	0.6632	0.5442	-2.1590	-1.6148	0.4611	0.3432
0.7597	1.7127	3200	0.7438	1.6039	-1.6049	-2.1441	0.6632	0.5392	-2.1441	-1.6049	0.3926	0.2796
0.7128	1.9267	3600	0.7399	1.6368	-1.6446	-2.2337	0.6780	0.5891	-2.2337	-1.6446	0.3607	0.2486
0.6636	2.1408	4000	0.7399	1.6738	-1.6828	-2.3162	0.6780	0.6334	-2.3162	-1.6828	0.3064	0.1955
0.6929	2.3549	4400	0.7421	1.7043	-1.7385	-2.4029	0.6795	0.6644	-2.4029	-1.7385	0.3030	0.1902
0.6939	2.5690	4800	0.7411	1.6769	-1.7078	-2.3536	0.6758	0.6458	-2.3536	-1.7078	0.1986	0.0944
0.6831	2.7831	5200	0.7409	1.6830	-1.7130	-2.3694	0.6766	0.6564	-2.3694	-1.7130	0.3256	0.2110
0.6951	2.9972	5600	0.7405	1.6848	-1.7146	-2.3727	0.6773	0.6581	-2.3727	-1.7146	0.3000	0.1875
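
To relate the step column to the learning rate, the sketch below reconstructs a cosine schedule with linear warmup from the listed hyperparameters (learning_rate 1e-06, warmup ratio 0.1, 3 epochs). The total step count is an estimate derived from the last table row (5600 steps at epoch 2.9972), and rounding the warmup length up is an assumption. Neither value is logged in this card, and `lr_at` is a hypothetical helper, not a function from the training code.

```python
import math

BASE_LR = 1e-6        # learning_rate from the card
WARMUP_RATIO = 0.1    # lr_scheduler_warmup_ratio from the card
NUM_EPOCHS = 3.0      # num_epochs from the card

# Estimate: the last table row reports 5600 steps at epoch 2.9972.
steps_per_epoch = 5600 / 2.9972
total_steps = round(steps_per_epoch * NUM_EPOCHS)      # estimated, not logged
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)   # assumed rounding


def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then half-cosine decay towards zero."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Print the approximate learning rate at a few of the evaluation steps.
for step in (0, 400, warmup_steps, 2800, 5600):
    print(step, f"{lr_at(step):.3e}")
```

Under these assumptions the learning rate peaks shortly after the first evaluation at step 400 and is close to zero by the final evaluation at step 5600.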

Framework versions

Transformers 4.44.2
PyTorch 2.2.2+cu121
Datasets 2.18.0
Tokenizers 0.19.1


