YC-Chen committed on
Commit 55d170f · verified · 1 Parent(s): 0645363

Update README.md

Files changed (1)
  1. README.md +6 -6
README.md CHANGED
@@ -67,7 +67,7 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
  We use the code revised from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**.
 
 
- | Models | | TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MMLU (ACC) |
+ | Models | |↑ TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MMLU (ACC) |
  |----------------------------------------------|--------|--------------|-------------|-------------|------------|
  | | |TC, Knowledge |TC, Reasoning|TC, Reasoning|EN, Knowledge|
  | | | 5 shot | 3 shot | 5 shot | 5 shot |
@@ -83,7 +83,7 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
 
  **Category ACC of TMMLU+ (5 shot)**
 
- | Models | STEM | Social Science | Humanities | Other | AVG |
+ | Models | STEM | Social Science | Humanities | Other | AVG |
  |----------------------------------|--------------|----------------|------------|------------|-------|
  | Yi-34B | 56.03 | 73.06 | 61.12 | 62.19 | 63.10 |
  | Qwen-14B | 46.51 | 58.20 | 51.12 | 49.38 | 51.30 |
@@ -105,7 +105,7 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
  We use the code revised from [fastchat llm_judge](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) to evaluate **MT-Bench-tw** and **MT-Bench**.
 
 
- | Models | |MT-Bench-tw (Score) ↑| TMMLU+ (ACC) | TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MT-Bench (Score) | MMLU (ACC) | MMLU (ACC) |
+ | Models | |↑ MT-Bench-tw (Score)| TMMLU+ (ACC) | TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MT-Bench (Score) | MMLU (ACC) | MMLU (ACC) |
  |---------------------------------------------------------------------------------------------------------|--------|--------------------|--------------|--------------|-------------|-------------|------------------|-------------|-------------|
  | | |TC, Chat |TC, Knowledge |TC, Knowledge |TC, Reasoning|TC, Reasoning|EN, Chat |EN, Knowledge|EN, Knowledge|
  | | |0 shot | 0 shot | 5 shot | 3 shot | 0 shot |0 shot | 0 shot | 5 shot |
@@ -123,7 +123,7 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
 
  **Category Score of MT-Bench-tw (0 shot)**
 
- | Models | STEM |Extraction|Reasoning| Math | Coding | Roleplay| Writing |Humanities|AVG|
+ | Models | STEM |Extraction|Reasoning| Math | Coding | Roleplay| Writing |Humanities|↑ AVG |
  |-----------------------------------------------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
  | gpt-3.5-turbo | | | | | | | | | |
  | Yi-34B-Chat | | | | | | | | | |
@@ -137,7 +137,7 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
 
  **Category ACC of TMMLU+ (0 shot)**
 
- | Model | STEM | Social Science | Humanities | Other | AVG|
+ | Model | STEM | Social Science | Humanities | Other | AVG |
  |-----------------------------------------------------|--------------|----------------|------------|------------|---------|
  | Yi-34B-Chat | 47.65 | 64.25 | 52.73 | 54.91 | 54.87 |
  | Qwen-14B-Chat | 43.83 | 55.00 | 48.55 | 46.22 | 48.41 |
@@ -155,7 +155,7 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
  In this test, we use the first 700 characters of the [web article](https://health.udn.com/health/story/5976/7699252?from=udn_ch1005_main_index) as the input and ask the model to write the same article again.
  All inferences run on 2 RTX A6000 GPUs (using `vllm`, with a tensor-parallel size of 2).
 
- | Models | Inference Time (sec) ↓|Estimated Max Input Length (Char)|
+ | Models | Inference Time (sec)|Estimated Max Input Length (Char)|
  |--------------------------------------------------------------------|-------------------|--------------------------|
  | Yi-6B | 10.62 | 5.2k |
  | **Breeze-7B-Instruct-v0.1** | 10.74 | 11.1k |