---
tags:
- trl
- dpo
- generated_from_trainer
model-index:
- name: dpo-selective-buffer-spo-shift
  results: []
---
# dpo-selective-buffer-spo-shift
This model was trained from scratch (no training dataset is specified in the card metadata). It achieves the following results on the evaluation set:
- Loss: 3740.8657
- Rewards/chosen: -0.7639
- Rewards/rejected: -0.8006
- Rewards/accuracies: 0.5440
- Rewards/margins: 0.0367
- Rewards/safe Rewards: -0.7642
- Rewards/unsafe Rewards: -0.7631
- Logps/rejected: -172.4103
- Logps/chosen: -207.2579
- Logits/rejected: -0.8679
- Logits/chosen: -1.2929
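
For context, the reward metrics above follow the standard DPO definitions (this is the general formulation, not anything specific to this run): each reward is the β-scaled log-probability ratio of the policy against the reference model,

$$
r(x, y) = \beta \, \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
\qquad
\text{Rewards/margins} = r(x, y_{\text{chosen}}) - r(x, y_{\text{rejected}}),
$$

and Rewards/accuracies is the fraction of evaluation pairs for which the chosen reward exceeds the rejected reward.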
## Model description
More information needed
## Intended uses & limitations
More information needed
## Training and evaluation data
More information needed
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-07
- train_batch_size: 2
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 8
- total_train_batch_size: 32
- total_eval_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
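
The totals above follow from the per-device sizes and the 2-GPU setup: 2 (train batch per device) × 2 (GPUs) × 8 (gradient accumulation steps) = 32, and 8 (eval batch per device) × 2 (GPUs) = 16. As a rough sketch of how these settings map onto a `transformers.TrainingArguments` (the output directory is a placeholder; this is not the exact training script used for this run):

```python
from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameters listed above; output_dir is a
# placeholder, not the actual path used for this checkpoint.
training_args = TrainingArguments(
    output_dir="dpo-selective-buffer-spo-shift",
    learning_rate=5e-7,
    per_device_train_batch_size=2,   # train_batch_size
    per_device_eval_batch_size=8,    # eval_batch_size
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    optim="adamw_torch",             # Adam with betas=(0.9, 0.999), eps=1e-08 (the defaults)
)

# Launched on 2 GPUs (e.g. via `accelerate launch` or `torchrun`), this yields
# total_train_batch_size = 2 * 2 * 8 = 32 and total_eval_batch_size = 2 * 8 = 16.
```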
### Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Rewards/safe Rewards | Rewards/unsafe Rewards | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10014.0016 | 0.27 | 500 | 3855.2058 | -0.6234 | -0.6551 | 0.5693 | 0.0317 | -0.6247 | -0.6222 | -157.8614 | -193.2104 | -1.4395 | -1.7663 |
| 9326.0547 | 0.54 | 1000 | 3789.4490 | -0.6776 | -0.7269 | 0.5769 | 0.0494 | -0.6763 | -0.6752 | -165.0468 | -198.6211 | -1.0852 | -1.4592 |
| 8544.8563 | 0.81 | 1500 | 3742.0378 | -0.7610 | -0.7967 | 0.5451 | 0.0358 | -0.7614 | -0.7604 | -172.0258 | -206.9610 | -0.7429 | -1.1929 |
### Framework versions
- Transformers 4.36.2
- Pytorch 2.1.2
- Datasets 2.14.6
- Tokenizers 0.15.2