---
language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - storm
  - mistral
  - openchat
  - RLAIF
  - reward model
base_model: openchat/openchat-3.5-0106
datasets:
  - berkeley-nest/Nectar
model-index:
  - name: Storm-7B
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: HuggingFaceH4/ifeval
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 34.24
            name: strict accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=jieliu/Storm-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: BBH
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 32.33
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=jieliu/Storm-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: hendrycks/competition_math
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 5.97
            name: exact match
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=jieliu/Storm-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 7.72
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=jieliu/Storm-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 14.63
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=jieliu/Storm-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 23.55
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=jieliu/Storm-7B
          name: Open LLM Leaderboard
---

# Storm-7B

Please see our [paper](https://arxiv.org/abs/2406.11817) for more details.

## Introduction

We released Storm-7B, the first open-source language model comparable to the GPT-4 series on the AlpacaEval 2.0 leaderboard.

Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO: improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity.
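To make the idea concrete, here is a minimal sketch of a length-regularized DPO loss: the usual DPO preference margin with an extra term that penalizes the chosen response for being longer than the rejected one. The tensor names and the `beta`/`alpha` values are illustrative placeholders, and this is not the exact training objective; see the paper for the precise formulation.

```python
import torch
import torch.nn.functional as F

def length_regularized_dpo_loss(
    policy_chosen_logps,    # log-prob of chosen responses under the policy, shape (batch,)
    policy_rejected_logps,  # log-prob of rejected responses under the policy
    ref_chosen_logps,       # log-prob of chosen responses under the frozen reference model
    ref_rejected_logps,     # log-prob of rejected responses under the frozen reference model
    chosen_lengths,         # token lengths of chosen responses, shape (batch,)
    rejected_lengths,       # token lengths of rejected responses, shape (batch,)
    beta=0.1,               # DPO temperature (placeholder value)
    alpha=0.01,             # length-penalty coefficient (placeholder value)
):
    # Standard DPO implicit-reward margin between chosen and rejected responses.
    margin = beta * (
        (policy_chosen_logps - ref_chosen_logps)
        - (policy_rejected_logps - ref_rejected_logps)
    )
    # Shrink the margin in proportion to how much longer the chosen response is.
    length_penalty = alpha * (
        torch.as_tensor(chosen_lengths) - torch.as_tensor(rejected_lengths)
    ).float()
    return -F.logsigmoid(margin - length_penalty).mean()
```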

## Performance

Our 7B model achieves a 50.5% length-controlled win rate against GPT-4 Preview on AlpacaEval 2.0.

Our model's LC win rate improves over iterations without significantly changing the response length, indicating better alignment with human values without length bias. The final trained model (iteration 3) achieves a 50.5% LC win rate, making it the first open-source model to surpass the baseline model GPT-4 Preview.

In addition to regular decoding, we also test beam search and best-of-n sampling on top of our trained model. Beam search over our trained model shows a 5% improvement over regular decoding, and best-of-n sampling with Starling-RM-34B achieves a 61.6% LC win rate, outperforming GPT-4 Omni.
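For readers who want to try these two decoding strategies, the sketch below assumes a `model` and `tokenizer` loaded as in the Uses section further down, with `input_ids` holding a tokenized prompt. The beam width, sample count, and `score_with_reward_model` helper are hypothetical placeholders; the actual Starling-RM-34B scoring interface is not shown here.

```python
# Beam search: explore several hypotheses in parallel and keep the best-scoring one.
beam_output = model.generate(
    input_ids,
    max_new_tokens=1024,
    num_beams=4,          # illustrative beam width
    do_sample=False,
)
beam_text = tokenizer.decode(beam_output[0], skip_special_tokens=True)

# Best-of-n sampling: draw n candidates, then keep the one a reward model prefers.
n = 8
sampled = model.generate(
    input_ids,
    max_new_tokens=1024,
    do_sample=True,
    temperature=1.0,
    num_return_sequences=n,
)
candidates = [tokenizer.decode(seq, skip_special_tokens=True) for seq in sampled]

def score_with_reward_model(text):
    # Hypothetical placeholder: score a candidate with a reward model
    # (e.g. Starling-RM-34B) and return a scalar reward.
    raise NotImplementedError

best_response = max(candidates, key=score_with_reward_model)
```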

We observe no significant degradation on the traditional NLP tasks of the Hugging Face Open LLM Leaderboard.

## Uses

Our model uses the same chat template as OpenChat-3.5-0106. A sample code snippet for running inference with our model is provided below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Load the model and tokenizer, then freeze the model for inference.
model = AutoModelForCausalLM.from_pretrained("jieliu/Storm-7B").to(device)
tokenizer = AutoTokenizer.from_pretrained("jieliu/Storm-7B")
model.eval().requires_grad_(False)

def generate_response(prompt):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    outputs = model.generate(
        input_ids,
        max_length=2048,
        do_sample=True,
        temperature=1.0,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    response_ids = outputs[0]
    response_text = tokenizer.decode(response_ids, skip_special_tokens=True)
    return response_text

# Storm-7B expects the OpenChat-3.5-0106 turn format.
prompt = "How does a telescope work?"
input_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:"
response_text = generate_response(input_prompt)
print("Response:", response_text)
```
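If the tokenizer for this checkpoint ships a chat template (an assumption worth verifying), the same prompt can also be built with `apply_chat_template` rather than hand-formatting the OpenChat turn markers:

```python
messages = [{"role": "user", "content": "How does a telescope work?"}]
input_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # append the assistant-turn prefix
)
response_text = generate_response(input_prompt)
print("Response:", response_text)
```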

## Scripts

You can reproduce our results on AlpacaEval 2.0 using the script provided below.

```bash
git clone https://github.com/tatsu-lab/alpaca_eval.git
cd alpaca_eval
pip install -e .
export OPENAI_API_KEY=<your_api_key>
alpaca_eval evaluate_from_model --model_configs 'Storm-7B'
```

## Limitations

Our work has several limitations: (1) We focus on aligning with human preferences but only use GPT-4 as a proxy for human judgment to evaluate language models. (2) We reduce verbosity with a length penalty, though verbosity and length are not necessarily correlated. Future work could train a specific reward model to directly penalize verbosity, replacing the length margin with a verbosity margin, following the standard MODPO pipeline.

## Citation

```bibtex
@article{liu2024iterative,
  title={Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level},
  author={Liu, Jie and Zhou, Zhanhui and Liu, Jiaheng and Bu, Xingyuan and Yang, Chao and Zhong, Han-Sen and Ouyang, Wanli},
  journal={arXiv preprint arXiv:2406.11817},
  year={2024}
}

@article{zhou2023beyond,
  title={Beyond one-preference-for-all: Multi-objective direct preference optimization},
  author={Zhou, Zhanhui and Liu, Jie and Yang, Chao and Shao, Jing and Liu, Yu and Yue, Xiangyu and Ouyang, Wanli and Qiao, Yu},
  journal={arXiv preprint arXiv:2310.03708},
  year={2023}
}
```

## Open LLM Leaderboard Evaluation Results

Detailed results can be found [here](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=jieliu/Storm-7B).

| Metric              | Value |
|---------------------|------:|
| Avg.                | 19.74 |
| IFEval (0-Shot)     | 34.24 |
| BBH (3-Shot)        | 32.33 |
| MATH Lvl 5 (4-Shot) |  5.97 |
| GPQA (0-shot)       |  7.72 |
| MuSR (0-shot)       | 14.63 |
| MMLU-PRO (5-shot)   | 23.55 |