---
language:
- ja
- en
license: cc-by-4.0
datasets:
- cyberagent/chatbot-arena-ja-calm2-7b-chat-experimental
base_model: cyberagent/calm2-7b-chat
model-index:
- name: calm2-7b-chat-dpo-experimental
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 41.04
      name: normalized accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cyberagent/calm2-7b-chat-dpo-experimental
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 68.99
      name: normalized accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cyberagent/calm2-7b-chat-dpo-experimental
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 39.82
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cyberagent/calm2-7b-chat-dpo-experimental
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 43.13
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cyberagent/calm2-7b-chat-dpo-experimental
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 65.67
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cyberagent/calm2-7b-chat-dpo-experimental
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 5.53
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cyberagent/calm2-7b-chat-dpo-experimental
      name: Open LLM Leaderboard
---
# Model Card for "calm2-7b-chat-dpo-experimental"
This model was obtained by applying Direct Preference Optimization (DPO) to cyberagent/calm2-7b-chat using the cyberagent/chatbot-arena-ja-calm2-7b-chat-experimental dataset. DPO was performed with Low-Rank Adaptation (LoRA).
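The training code and hyperparameters for this model are not published. As a rough illustration only, a DPO-with-LoRA run might look like the following sketch using TRL's `DPOTrainer` (TRL 0.7.x-era API) together with a PEFT `LoraConfig`; every hyperparameter value, the split name, and the column layout below are assumptions, not the actual recipe:

```python
# Minimal sketch only: hyperparameters, split, column mapping, and LoRA
# settings are illustrative assumptions, not this model's actual recipe.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "cyberagent/calm2-7b-chat"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

# DPOTrainer expects "prompt", "chosen", and "rejected" columns; the
# dataset may need to be mapped into this shape first.
train_dataset = load_dataset(
    "cyberagent/chatbot-arena-ja-calm2-7b-chat-experimental", split="train"
)

peft_config = LoraConfig(  # train low-rank adapters instead of full weights
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a PEFT config, the frozen base acts as reference
    args=TrainingArguments(output_dir="calm2-7b-chat-dpo", num_train_epochs=1),
    beta=0.1,  # DPO temperature (illustrative)
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```

Setting `ref_model=None` with a PEFT config means the frozen base weights serve as the DPO reference policy, so only one copy of the model needs to be held in memory.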
## Requirements, Usage, Chat Template
These are the same as for cyberagent/calm2-7b-chat; the model runs with the same code and prompt format.
```python
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

assert transformers.__version__ >= "4.34.1"

model = AutoModelForCausalLM.from_pretrained(
    "cyberagent/calm2-7b-chat-dpo-experimental", device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained("cyberagent/calm2-7b-chat-dpo-experimental")

# Stream decoded tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = """USER: AIによって私達の暮らしはどのように変わりますか?
ASSISTANT: """

token_ids = tokenizer.encode(prompt, return_tensors="pt")
output_ids = model.generate(
    input_ids=token_ids.to(model.device),
    max_new_tokens=300,
    do_sample=True,
    temperature=0.8,
    streamer=streamer,
)
```
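The example above is single-turn. A multi-turn prompt presumably continues the same `USER:`/`ASSISTANT:` format, as sketched below; this is an assumption, so consult the cyberagent/calm2-7b-chat card for the authoritative chat template:

```python
# Assumed multi-turn format: earlier turns concatenated with the same
# prefixes. See cyberagent/calm2-7b-chat for the authoritative template,
# including any end-of-turn special tokens.
prompt = """USER: おすすめの観光地はありますか?
ASSISTANT: 京都はいかがでしょうか。歴史ある寺社や美しい庭園が楽しめます。
USER: そこでのおすすめの食べ物は何ですか?
ASSISTANT: """
```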
## Experimental Results

### ELYZA-tasks-100 (GPT-4 eval)

To avoid randomness in the results, outputs were generated with greedy search (see the snippet after the table).
| calm2-7b-chat | calm2-7b-chat-dpo |
|---|---|
| 2.67 | 2.85 |
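Greedy search here simply means disabling sampling in `generate`; a minimal sketch, reusing `model`, `tokenizer`, and `token_ids` from the usage example above:

```python
# Greedy decoding: deterministically pick the most likely token each step.
# do_sample=False with the default num_beams=1 is greedy search.
output_ids = model.generate(
    input_ids=token_ids.to(model.device),
    max_new_tokens=300,
    do_sample=False,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```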
### Japanese MT-Bench
We evaluated calm2-7b-chat-dpo and calm2-7b-chat using the following text as the system prompt (`system_message`):

"以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。" (roughly: "Below is a combination of an instruction describing a task and an input providing context. Write a response that appropriately satisfies the request.")

This system prompt is the same one used, unchanged, when evaluating stabilityai/japanese-stablelm-instruct-alpha-7b. All other decoding parameters were left at their defaults, so the outputs contain randomness.
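One plausible way such a system prompt combines with the chat format is to prepend it before the first user turn; this is an assumption for illustration, as the exact template used by the MT-Bench harness may differ:

```python
# Hypothetical prompt assembly (assumption): system_message prepended
# before the first user turn. The MT-Bench harness's actual template
# for calm2 may differ.
system_message = (
    "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。"
    "要求を適切に満たす応答を書きなさい。"
)
question = "日本の四季について説明してください。"  # illustrative placeholder
prompt = f"""{system_message}

USER: {question}
ASSISTANT: """
```

The per-category scores: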
| | calm2-7b-chat | calm2-7b-chat-dpo |
|---|---|---|
| Average | 6.1 | 6.7 |
| extraction | 4.1 | 5.4 |
| humanities | 8.2 | 8.4 |
| reasoning | 3.9 | 4.3 |
| roleplay | 6.4 | 7.0 |
| stem | 6.3 | 6.2 |
| writing | 7.7 | 9.1 |
## Releases

- 1.0: v1 release (Jan 24, 2024)
## Author
Yuu Jinnai ([email protected]), Standing on the shoulders of giants
## Open LLM Leaderboard Evaluation Results

Detailed results can be found [here](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=cyberagent/calm2-7b-chat-dpo-experimental).
| Metric | Value |
|---|---|
| Avg. | 44.03 |
| AI2 Reasoning Challenge (25-Shot) | 41.04 |
| HellaSwag (10-Shot) | 68.99 |
| MMLU (5-Shot) | 39.82 |
| TruthfulQA (0-shot) | 43.13 |
| Winogrande (5-shot) | 65.67 |
| GSM8k (5-shot) | 5.53 |