MediaTek-Research/Breeze-7B-Base-v0_1

@@ -67,7 +67,7 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
  We use the code revised from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**.
-| Models                                       |        | TMMLU+ (ACC) ↑ | DRCD (EM)   | Table (ACC) | MMLU (ACC) |
+| Models                                       |        |↑ TMMLU+ (ACC) | DRCD (EM)   | Table (ACC) | MMLU (ACC) |
 |----------------------------------------------|--------|--------------|-------------|-------------|------------|
 |                                              |        |TC, Knowledge |TC, Reasoning|TC, Reasoning|EN, Knowledge|
 |                                              |        | 5 shot       | 3 shot      | 5 shot      | 5 shot     |
@@ -83,7 +83,7 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
 **Category ACC of TMMLU+ (5 shot)**
-| Models                           | STEM         | Social Science | Humanities | Other      | AVG ↑ |
+| Models                           | STEM         | Social Science | Humanities | Other      | ↑ AVG |
 |----------------------------------|--------------|----------------|------------|------------|-------|
 | Yi-34B                           | 56.03        | 73.06          | 61.12      | 62.19      | 63.10 |
 | Qwen-14B                         | 46.51        | 58.20          | 51.12      | 49.38      | 51.30 |
@@ -105,7 +105,7 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
  We use the code revised from [fastchat llm_judge](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) to evaluate **MT-Bench-tw** and **MT-Bench**.
-| Models                                                                                                  |        |MT-Bench-tw (Score) ↑| TMMLU+ (ACC) | TMMLU+ (ACC) | DRCD (EM)   | Table (ACC) | MT-Bench (Score) | MMLU (ACC)  | MMLU (ACC)  |
+| Models                                                                                                  |        |↑ MT-Bench-tw (Score)| TMMLU+ (ACC) | TMMLU+ (ACC) | DRCD (EM)   | Table (ACC) | MT-Bench (Score) | MMLU (ACC)  | MMLU (ACC)  |
 |---------------------------------------------------------------------------------------------------------|--------|--------------------|--------------|--------------|-------------|-------------|------------------|-------------|-------------|
 |                                                                                                         |        |TC, Chat            |TC, Knowledge |TC, Knowledge |TC, Reasoning|TC, Reasoning|EN, Chat          |EN, Knowledge|EN, Knowledge|
 |                                                                                                         |        |0 shot              | 0 shot       | 5 shot       | 3 shot      | 0 shot      |0 shot            |  0 shot     | 5 shot      |
@@ -123,7 +123,7 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
 **Category Score of MT-Bench-tw (0 shot)**
-| Models                                              | STEM    |Extraction|Reasoning| Math   | Coding  | Roleplay| Writing |Humanities|AVG  ↑ |
+| Models                                              | STEM    |Extraction|Reasoning| Math   | Coding  | Roleplay| Writing |Humanities|↑ AVG   |
 |-----------------------------------------------------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
 | gpt-3.5-turbo                                       |         |         |         |         |         |         |         |         |         |
 | Yi-34B-Chat                                         |         |         |         |         |         |         |         |         |         |
@@ -137,7 +137,7 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
 **Category ACC of TMMLU+ (0 shot)**
-| Model                                               | STEM         | Social Science | Humanities | Other      | AVG  ↑  |
+| Model                                               | STEM         | Social Science | Humanities | Other      | ↑ AVG   |
 |-----------------------------------------------------|--------------|----------------|------------|------------|---------|
 | Yi-34B-Chat                                         | 47.65        | 64.25          | 52.73      | 54.91      | 54.87   |
 | Qwen-14B-Chat                                       | 43.83        | 55.00          | 48.55      | 46.22      | 48.41   |
@@ -155,7 +155,7 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
 In this test, we use the first 700 characters of the [web article](https://health.udn.com/health/story/5976/7699252?from=udn_ch1005_main_index) as the input and ask the model to write the same article again.
 All inferences run on 2 RTX A6000 GPUs (using `vllm`, with a tensor-parallel size of 2).
-| Models                                                             | Inference Time (sec) ↓|Estimated Max Input Length (Char)|
+| Models                                                             | ↓ Inference Time (sec)|Estimated Max Input Length (Char)|
 |--------------------------------------------------------------------|-------------------|--------------------------|
 | Yi-6B                                                              |   10.62  |   5.2k                |
 | **Breeze-7B-Instruct-v0.1**                                        |  10.74  |    11.1k                 |
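
The inference-time setup described in the README text (first 700 characters of the article as input, `vllm` with a tensor-parallel size of 2) could be reproduced roughly as follows. This is a minimal sketch, not the authors' benchmark script: the helper names `build_prompt`, `time_generation`, and `make_vllm_generate` are hypothetical, and the sampling settings (`temperature=0.0`, `max_tokens=1024`) are assumptions, since the README does not state them.

```python
import time
from typing import Callable, Tuple

PROMPT_CHARS = 700  # from the README: the first 700 characters of the article


def build_prompt(article: str, n_chars: int = PROMPT_CHARS) -> str:
    """Keep only the first n_chars characters of the article as model input."""
    return article[:n_chars]


def time_generation(generate: Callable[[str], str], prompt: str) -> Tuple[str, float]:
    """Call generate(prompt) once and return (output, elapsed seconds)."""
    start = time.perf_counter()
    output = generate(prompt)
    elapsed = time.perf_counter() - start
    return output, elapsed


def make_vllm_generate(model_name: str, max_tokens: int = 1024) -> Callable[[str], str]:
    """Build a generate() function backed by vllm, sharded across 2 GPUs.

    max_tokens and greedy decoding are assumptions, not values from the README.
    """
    # Imported lazily so the timing helpers work without vllm installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_name, tensor_parallel_size=2)
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)

    def generate(prompt: str) -> str:
        # vllm returns one RequestOutput per prompt; take its first completion.
        return llm.generate([prompt], params)[0].outputs[0].text

    return generate
```

With a model identifier of your choice, `time_generation(make_vllm_generate(model_id), build_prompt(article_text))` returns the generated text and the wall-clock time in seconds. Model loading happens inside `make_vllm_generate`, so it is excluded from the measured time.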