Deita banner

Model Card for Deita Quality Scorer

GitHub | Paper

Deita is an open-sourced project designed to facilitate Automatic Data Selection for instruction tuning in Large Language Models (LLMs).

Deita Quality Scorer is a tool for automatically annotating the Instruction Quality of SFT data.

Model description

  • Model type: Model fine tuned to automatically annotate the Instruction-Response Pair Quality
  • Language(s) (NLP): Primarily English
  • Finetuned from model: Llama-1-13b-hf

Model Sources

Performance

Model Align Data Size MT-Bench AlpacaEval(%) OpenLLM (Avg.)
Proprietary Models
GPT-4-Turbo ? -- 9.32 97.70 --
GPT-4 SFT + PPO -- 8.99 95.03 --
Claude-2 SFT + PPO -- 8.06 91.36 --
GPT-3.5-turbo SFT + PPO -- 7.94 89.37 --
Open-sourced Models based on LLaMA-1-13B
LIMA SFT 1K SFT 4.29 41.98 59.82
WizardLM-13B SFT 70K SFT 6.35 75.31 58.96
Vicuna-13B-v1.3 SFT 125K SFT 6.39 82.11 60.01
Random SFT 10K SFT 6.03 71.52 60.14
DEITA-LLaMA1-13B-v1.0-sft SFT 10K SFT 6.60 78.01 64.27
Open-sourced Models based on LLaMA-2-13B
Tulu-2-13B SFT 326K SFT 6.70 78.90 --
Tulu-2-13B+DPO SFT + DPO 326K SFT + 60K DPO 7.00 89.50 --
LLaMA2-13B-Chat SFT + PPO -- 6.65 81.09 --
WizardLM-13B-v1.2 SFT >70K SFT 7.09 89.17 --
Vicuna-13B-v1.5 SFT 125K SFT 6.57 78.80 61.63
Random SFT 10K SFT 5.78 65.19 61.32
DEITA-LLaMA2-13B-v1.0-sft SFT 10K SFT 6.79 81.09 62.71
Open-sourced Models based on Mistral-7B
Mistral-7B-Instruct-v0.1 -- -- 6.84 69.65 60.45
Zephyr-7B-sft SFT 200K SFT 5.32 75.12 60.93
$\text{Zephyr-7B-}\beta$ SFT + DPO 200K SFT + 60K DPO 7.34 90.60 66.36
OpenChat-3.5 C-RLFT >> 70K C-RLFT 7.81 88.51 --
Starling-7B C-RLFT + APA >>70K C-RLFT + 183K APA 8.09 91.99 --
Random SFT 10K SFT 5.89 56.90 61.72
DEITA-7B-v1.0-sft (6K) SFT 6K SFT 7.22 80.78 64.94
DEITA-7B-v1.0-sft (10K) SFT 10K SFT 7.32 81.67 64.00
DEITA-7B-v1.0 SFT + DPO 6K SFT + 10K DPO 7.55 90.06 69.86

Usage

Please use the following format to score the quality of Instruction-Response Pair

from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np
from scipy.special import softmax
model_name = "hkust-nlp/deita-quality-scorer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


def infer_Quality(model, tokenizer, input_text, resp_text):
    quality_template = ("You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. \n #Question#:\n{instruction}\n#Response#:\n{output} \n##Quality: ")
    user_input = quality_template.format(instruction=input_text, output=resp_text)
    input_ids = tokenizer.encode(user_input, return_tensors="pt")
    max_length = 512
    outputs = model.generate(input_ids, max_length=512, num_return_sequences=1, return_dict_in_generate=True, output_scores=True)
    logprobs_list = outputs.scores[0][0]
    score_logits = []
    id2score = {
        29896: "1",
        29906: "2",
        29941: "3",
        29946: "4",
        29945: "5",
        29953: "6"
    }
    score_template = np.array([1,2,3,4,5,6])
    for k in id2score:
        score_logits.append(logprobs_list[k])
    score_logits = np.array(score_logits)
    score_npy = softmax(score_logits, axis=0)
    score_npy = score_npy * score_template

    score_npy = np.sum(score_npy, axis=0)
    return score_npy

input_text = "word to describe UI with helpful tooltips" # Example Input
output_text = "User-friendly or intuitive UI" # Example Output
quality_score = infer_quality(model, tokenizer, input_text)

print(quality_score)

Citation

If you find the content of this project helpful, please cite our paper as follows:

@misc{liu2023what,
      title={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning}, 
      author={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},
      year={2023},
      eprint={2312.15685},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Downloads last month
1,875
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Collection including hkust-nlp/deita-quality-scorer