Mixtral-8x7B-Instruct-v0.1-QLoRA-Assessment-Rationale-dpo

This model was trained without private data, as described in the EMNLP 2024 paper: Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring.

Paper: Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring (EMNLP 2024 Findings)
GitHub Repository: Thought Tree Assessment Repository

Intended uses & limitations

This model is intended as a resource for research on explainable AI in educational technology. It was trained on noisy response-level rationales, so it should not be applied directly to high-stakes assessments without additional verification.

Training and evaluation data

We trained and evaluated the model on the Synthetic Rationale data, which was generated from the Rationale MCTS data.

To extract scores from generated rationales, please use the jiazhengli/deberta-v3-large-Rationale-to-Score model.
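
The score-extraction step could look roughly like the sketch below. It assumes the scorer is published as a standard sequence-classification checkpoint that the Transformers `text-classification` pipeline can load. This card does not confirm that, and it does not document the label names the scorer returns, so the function simply returns the top label unchanged. `load_scorer` and `extract_score` are hypothetical helper names, not part of the released code.

```python
from typing import Callable, Dict, List

# Model id taken from this card; everything else below is an illustrative assumption.
SCORER_ID = "jiazhengli/deberta-v3-large-Rationale-to-Score"


def load_scorer():
    """Build a text-classification pipeline for the scorer (downloads weights)."""
    # Imported lazily so the rest of the module works without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=SCORER_ID)


def extract_score(
    rationale: str,
    classifier: Callable[[str], List[Dict[str, object]]],
) -> str:
    """Return the highest-probability label the classifier assigns to a rationale."""
    predictions = classifier(rationale)
    best = max(predictions, key=lambda p: p["score"])
    return str(best["label"])
```

With the weights available, usage would be along the lines of `extract_score(rationale_text, load_scorer())`. How the returned label maps to a numeric score depends on the scorer's configuration and should be checked against its own model card.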

Citation

Please cite the following work if you utilize this model:

BibTeX:

@misc{li2024calibratingllmspreferenceoptimization,
      title={Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring}, 
      author={Jiazheng Li and Hainiu Xu and Zhaoyue Sun and Yuxiang Zhou and David West and Cesare Aloisi and Yulan He},
      year={2024},
      eprint={2406.19949},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.19949}, 
}

Training procedure

Please refer to our paper.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 1
eval_batch_size: 1
seed: 42
gradient_accumulation_steps: 8
total_train_batch_size: 8
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 0.1
num_epochs: 3.0
mixed_precision_training: Native AMP
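
As a quick sanity check on the batch settings above, the short sketch below reproduces the effective batch size. The device count is an assumption (a single device, which is what makes the numbers agree). The per-epoch estimate relies on the results table reaching epoch 1.0 at step 600 and is approximate because the logged epoch values are rounded.

```python
# Values copied from the hyperparameter list above.
train_batch_size = 1
gradient_accumulation_steps = 8

# Assumption: training ran on one device; the card does not state this directly.
num_devices = 1

# Effective batch size per optimizer step.
total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices

# The results table reports epoch 1.0 at step 600, so one epoch is about
# 600 optimizer steps, i.e. roughly this many preference pairs.
steps_per_epoch = 600
approx_pairs_per_epoch = steps_per_epoch * total_train_batch_size

print(total_train_batch_size, approx_pairs_per_epoch)
```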

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
1.1726	0.33	200	2.6079	14.8760	13.6948	0.5929	1.1812	-199.3195	-162.4827	-0.7607	-0.7827
1.0028	0.67	400	2.6743	14.9730	13.8342	0.5844	1.1388	-197.9255	-161.5126	-0.8669	-0.8743
0.5127	1.0	600	2.5239	15.4063	13.9931	0.6000	1.4132	-196.3373	-157.1801	-0.8501	-0.8561
0.3787	1.33	800	2.5951	15.2695	13.9112	0.6142	1.3582	-197.1555	-158.5480	-0.8385	-0.8358
0.381	1.67	1000	2.5814	15.0186	13.4813	0.6213	1.5373	-201.4548	-161.0572	-0.7846	-0.7808
0.2993	2.0	1200	2.5816	15.0307	13.3917	0.6383	1.6390	-202.3505	-160.9355	-0.7590	-0.7554
0.2917	2.33	1400	2.6270	14.5370	12.8885	0.6426	1.6485	-207.3829	-165.8732	-0.8337	-0.8292
0.2881	2.67	1600	2.6358	14.3849	12.6973	0.6468	1.6875	-209.2946	-167.3941	-0.8503	-0.8468
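
The sketch below shows one way to read the table. It copies a subset of the columns, checks that each reported reward margin matches chosen minus rejected up to rounding, and picks out the checkpoints with the lowest validation loss and the highest reward accuracy. The tuple layout and variable names are illustrative only.

```python
# (step, validation_loss, rewards_chosen, rewards_rejected, rewards_accuracy, rewards_margin)
rows = [
    (200, 2.6079, 14.8760, 13.6948, 0.5929, 1.1812),
    (400, 2.6743, 14.9730, 13.8342, 0.5844, 1.1388),
    (600, 2.5239, 15.4063, 13.9931, 0.6000, 1.4132),
    (800, 2.5951, 15.2695, 13.9112, 0.6142, 1.3582),
    (1000, 2.5814, 15.0186, 13.4813, 0.6213, 1.5373),
    (1200, 2.5816, 15.0307, 13.3917, 0.6383, 1.6390),
    (1400, 2.6270, 14.5370, 12.8885, 0.6426, 1.6485),
    (1600, 2.6358, 14.3849, 12.6973, 0.6468, 1.6875),
]

# Largest gap between the reported margin and (chosen - rejected);
# small differences are expected because the logged values are rounded.
max_margin_gap = max(
    abs((chosen - rejected) - margin)
    for _, _, chosen, rejected, _, margin in rows
)

# Checkpoints that look best under two different criteria.
best_by_loss = min(rows, key=lambda r: r[1])
best_by_accuracy = max(rows, key=lambda r: r[4])

print("max margin gap:", max_margin_gap)
print("lowest validation loss at step", best_by_loss[0])
print("highest reward accuracy at step", best_by_accuracy[0])
```

The two criteria disagree here: validation loss is lowest at step 600, while reward accuracy and margin keep rising through step 1600.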

Framework versions

PEFT 0.10.0
Transformers 4.38.2
Pytorch 2.2.1+cu121
Datasets 2.18.0
Tokenizers 0.15.2
