Text Generation
Transformers
Safetensors
Japanese
English
llama
text-generation-inference
Inference Endpoints
leaderboard-pr-bot's picture
Adding Evaluation Results
c3422ab verified
|
raw
history blame
6.35 kB
metadata
language:
  - ja
  - en
license: cc-by-4.0
datasets:
  - cyberagent/chatbot-arena-ja-calm2-7b-chat-experimental
base_model: cyberagent/calm2-7b-chat
model-index:
  - name: calm2-7b-chat-dpo-experimental
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 41.04
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cyberagent/calm2-7b-chat-dpo-experimental
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 68.99
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cyberagent/calm2-7b-chat-dpo-experimental
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 39.82
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cyberagent/calm2-7b-chat-dpo-experimental
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 43.13
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cyberagent/calm2-7b-chat-dpo-experimental
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 65.67
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cyberagent/calm2-7b-chat-dpo-experimental
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 5.53
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cyberagent/calm2-7b-chat-dpo-experimental
          name: Open LLM Leaderboard

Model Card for "calm2-7b-chat-dpo-experimental"

cyberagent/calm2-7b-chatcyberagent/chatbot-arena-ja-calm2-7b-chat-experimentalデータセットを用いてDirect Preference Optimization (DPO)をしたモデルです。 DPOにはLow-Rank Adaptation (LoRA)を用いました。

Requirements, Usage, Chat Template

cyberagent/calm2-7b-chatと同様です。 同様のコード・プロンプトで動かすことができます。

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

assert transformers.__version__ >= "4.34.1"

model = AutoModelForCausalLM.from_pretrained("cyberagent/calm2-7b-chat-dpo-experimental", device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("cyberagent/calm2-7b-chat-dpo-experimental")
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = """USER: AIによって私達の暮らしはどのように変わりますか?
ASSISTANT: """

token_ids = tokenizer.encode(prompt, return_tensors="pt")
output_ids = model.generate(
    input_ids=token_ids.to(model.device),
    max_new_tokens=300,
    do_sample=True,
    temperature=0.8,
    streamer=streamer,
)

実験結果

ELYZA-tasks-100 (GPT-4 eval)

実験結果のランダム性を避けるため、greedy searchで出力しました。

calm2-7b-chat calm2-7b-chat-dpo
2.67 2.85

Japanese MT-Bench

以下の文をシステムプロンプト(system_message)としてcalm2-7b-chat-dpoとcalm2-7b-chatの評価を行いました。

"以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。"

このシステムプロンプトはstabilityai/japanese-stablelm-instruct-alpha-7bを評価するときに使われるものをそのまま使いました。 他のデコーディングパラメータはデフォルトのままです(ランダム性があります)。

calm2-7b-chat calm2-7b-chat-dpo
平均 6.1 6.7
extraction 4.1 5.4
humanities 8.2 8.4
reasoning 3.9 4.3
roleplay 6.4 7.0
stem 6.3 6.2
writing 7.7 9.1

Releases

1.0: v1 release (Jan 24, 2024)

Author

Yuu Jinnai ([email protected]), Standing on the shoulders of giants

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric Value
Avg. 44.03
AI2 Reasoning Challenge (25-Shot) 41.04
HellaSwag (10-Shot) 68.99
MMLU (5-Shot) 39.82
TruthfulQA (0-shot) 43.13
Winogrande (5-shot) 65.67
GSM8k (5-shot) 5.53