Mixtral-8x7B-Instruct-v0.1-QLoRA-Assessment-Rationale-dpo

This model is the variant trained without private data, from the EMNLP 2024 paper: Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring.

Intended uses & limitations

This model offers a valuable resource for research in explainable AI within educational technology. It was trained on noisy response-level rationales, so its outputs are unsuitable for direct use in high-stakes assessments without additional verification.

Training and evaluation data

We trained and evaluated the model on the Synthetic Rationale data, which was generated from the Rationale MCTS data.

To extract scores from rationales, please use the jiazhengli/deberta-v3-large-Rationale-to-Score model.
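As a rough illustration of that step, the sketch below passes a generated rationale to the scoring model through the Transformers `pipeline` API. It assumes the scorer is a sequence-classification checkpoint whose labels follow the default `LABEL_<n>` naming; neither assumption is confirmed by this card. The helper names `label_to_score` and `score_rationale` are hypothetical.

```python
def label_to_score(label: str) -> int:
    """Parse a label such as 'LABEL_2' into the integer 2.

    Assumption: the scorer uses the default 'LABEL_<n>' label names.
    """
    return int(label.rsplit("_", 1)[-1])


def score_rationale(
    rationale: str,
    model_id: str = "jiazhengli/deberta-v3-large-Rationale-to-Score",
) -> int:
    """Classify one rationale and return the predicted score.

    Assumption: the checkpoint loads as a text-classification model.
    """
    # Imported lazily so label_to_score works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    prediction = classifier(rationale)[0]  # dict with "label" and "score" keys
    return label_to_score(prediction["label"])
```

Calling `score_rationale("...")` downloads the scorer on first use. If the checkpoint defines custom label names, `label_to_score` would need to be adapted accordingly.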

Citation

Please cite the following work if you utilize this model:

BibTeX:

@misc{li2024calibratingllmspreferenceoptimization,
      title={Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring}, 
      author={Jiazheng Li and Hainiu Xu and Zhaoyue Sun and Yuxiang Zhou and David West and Cesare Aloisi and Yulan He},
      year={2024},
      eprint={2406.19949},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.19949}, 
}

Training procedure

Please refer to our paper.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 8
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 0.1
  • num_epochs: 3.0
  • mixed_precision_training: Native AMP
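The relationship between the batch-size settings above can be checked with a few lines of arithmetic. This is a minimal sketch: the single-device count is an assumption inferred from the reported total of 8, and the steps-per-epoch figure is read off the training results (step 600 is logged at epoch 1.0), so the derived number of preference pairs is approximate.

```python
# Values copied from the hyperparameter list.
train_batch_size = 1
gradient_accumulation_steps = 8

# Assumption: one device, which is what makes the reported total come out to 8.
num_devices = 1

# Effective batch size per optimizer step.
total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices

# Read from the training results: step 600 is logged at epoch 1.0.
steps_per_epoch = 600
num_epochs = 3

# Approximate figures derived from the above.
approx_pairs_per_epoch = steps_per_epoch * total_train_batch_size
approx_total_steps = steps_per_epoch * num_epochs

print(total_train_batch_size, approx_pairs_per_epoch, approx_total_steps)
```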

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.1726 | 0.33 | 200 | 2.6079 | 14.8760 | 13.6948 | 0.5929 | 1.1812 | -199.3195 | -162.4827 | -0.7607 | -0.7827 |
| 1.0028 | 0.67 | 400 | 2.6743 | 14.9730 | 13.8342 | 0.5844 | 1.1388 | -197.9255 | -161.5126 | -0.8669 | -0.8743 |
| 0.5127 | 1.0 | 600 | 2.5239 | 15.4063 | 13.9931 | 0.6000 | 1.4132 | -196.3373 | -157.1801 | -0.8501 | -0.8561 |
| 0.3787 | 1.33 | 800 | 2.5951 | 15.2695 | 13.9112 | 0.6142 | 1.3582 | -197.1555 | -158.5480 | -0.8385 | -0.8358 |
| 0.381 | 1.67 | 1000 | 2.5814 | 15.0186 | 13.4813 | 0.6213 | 1.5373 | -201.4548 | -161.0572 | -0.7846 | -0.7808 |
| 0.2993 | 2.0 | 1200 | 2.5816 | 15.0307 | 13.3917 | 0.6383 | 1.6390 | -202.3505 | -160.9355 | -0.7590 | -0.7554 |
| 0.2917 | 2.33 | 1400 | 2.6270 | 14.5370 | 12.8885 | 0.6426 | 1.6485 | -207.3829 | -165.8732 | -0.8337 | -0.8292 |
| 0.2881 | 2.67 | 1600 | 2.6358 | 14.3849 | 12.6973 | 0.6468 | 1.6875 | -209.2946 | -167.3941 | -0.8503 | -0.8468 |
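As a reading aid for the results above, the sketch below checks that each Rewards/margins value equals Rewards/chosen minus Rewards/rejected (up to rounding in the logged figures) and finds the checkpoint step with the lowest validation loss. The numbers are copied from the results; the variable names are illustrative only.

```python
# (step, validation_loss, rewards_chosen, rewards_rejected, rewards_margin)
rows = [
    (200, 2.6079, 14.8760, 13.6948, 1.1812),
    (400, 2.6743, 14.9730, 13.8342, 1.1388),
    (600, 2.5239, 15.4063, 13.9931, 1.4132),
    (800, 2.5951, 15.2695, 13.9112, 1.3582),
    (1000, 2.5814, 15.0186, 13.4813, 1.5373),
    (1200, 2.5816, 15.0307, 13.3917, 1.6390),
    (1400, 2.6270, 14.5370, 12.8885, 1.6485),
    (1600, 2.6358, 14.3849, 12.6973, 1.6875),
]

# Largest gap between the logged margin and chosen - rejected.
max_gap = max(abs((chosen - rejected) - margin)
              for _, _, chosen, rejected, margin in rows)

# Step of the evaluation with the lowest validation loss.
best_step = min(rows, key=lambda row: row[1])[0]

print(f"max margin discrepancy: {max_gap:.4f}")
print(f"lowest validation loss at step {best_step}")
```

The margin grows over training while validation loss is lowest at the end of the first epoch, so which checkpoint counts as best depends on the metric being prioritized.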

Framework versions

  • PEFT 0.10.0
  • Transformers 4.38.2
  • Pytorch 2.2.1+cu121
  • Datasets 2.18.0
  • Tokenizers 0.15.2