YC-Chen committed
Commit 0ab63dd · verified · 1 Parent(s): 28c1f04

Update README.md

Files changed (1): README.md (+84 -17)
README.md CHANGED
@@ -1,27 +1,63 @@
---
pipeline_tag: text-generation
---

# Model Card for Breeze-7B-Base-v0.1

- Breeze-7B-Base-v0.1 is a 7-billion-parameter language model built from Mistral-7B and tailored for Traditional Chinese (zh-tw).
- This model expands the Traditional Chinese vocabulary by adding an extra 30k Traditional Chinese tokens to the original Mistral-7B. With this, the model adapts better to Traditional Chinese and is 2x as efficient in the encoding and decoding of Traditional Chinese compared to Mistral-7B.
- To the best of our knowledge, this is the first work on vocabulary expansion in Traditional Chinese.
- This model is trained on 250GB of high-quality Traditional Chinese data using continual pre-training.
- Breeze-7B-Base-v0.1 performs well on both EN and TC benchmarks, outperforming Taiwan-LLM-7B-v2.1-base, Taiwan-LLM-13B-v2.0-base, and Yi-6B-Base on all TC benchmarks
- and is comparable with Mistral-7B-v0.1 on MMLU and MT-Bench in English.

  *A project by the members (in alphabetical order): Chan-Jan Hsu 許湛然, Chang-Le Liu 劉昶樂, Feng-Ting Liao 廖峰挺, Po-Chun Hsu 許博竣, Yi-Chang Chen 陳宜昌, and the supervisor Da-Shan Shiu 許大山.*

## Features

- - Expanding the vocabulary dictionary from 32k to 62k vocabulary size to better support Traditional Chinese
- - 8k context length

## Model Details
- **Finetuned from:** [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- **Model type:** Causal decoder-only transformer language model
- **Language:** English and Traditional Chinese (zh-tw)

## Base Model Performance

@@ -49,6 +85,12 @@ and is comparable with Mistral-7B-v0.1 on MMLU and MT-Bench in English.
| Mistral-7B-v0.1 | 33.01 | 42.23 | 35.86 | 37.63 |


## Chat Model Performance

| Models | | TMMLU+ (ACC) | TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MT-Bench-tw (Score) | MMLU (ACC) | MMLU (ACC) | MT-Bench (Score) |
@@ -80,12 +122,19 @@ and is comparable with Mistral-7B-v0.1 on MMLU and MT-Bench in English.
| Taiwan-LLM-13B-v2.0-chat | 27.74 | 33.69 | 27.03 | 29.43 |
| Taiwan-LLM-7B-v2.1-chat | 25.58 | 31.76 | 27.36 | 27.61 |


## Inference Performance
In this test, we use the first 700 characters of the [web article](https://health.udn.com/health/story/5976/7699252?from=udn_ch1005_main_index) as the input and ask the model to write the same article again.
- All models were inferenced with `vllm` on 2 A6000 (TP=2).

- | Models | Inference Time (sec)|Estimated Max Input Length (TC Char)|
|--------------------------------------------------------------------|-------------------|--------------------------|
| Yi-6B | 10.62 | 5.2k |
| **Breeze-7B-Instruct-v0.1** | 10.74 | 11.1k |
@@ -97,12 +146,19 @@ All models were inferenced with `vllm` on 2 A6000 (TP=2).
| Taiwan-LLM-13B-v2.0-base | 36.80 | 2.2k |
| Yi-34B | 43.71 | 4.5k |


## Use in Transformers

- First, install direct dependencies:
```
- pip install transformers==4.36.1 torch accelerate
```
If you want faster inference using flash-attention2, you need to install these dependencies:
```bash
@@ -115,9 +171,20 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
- model="MediaTek-Research/Breeze-7B-Base-v0.1",
device_map="auto",
torch_dtype=torch.bfloat16,
- attn_implementation="flash_attention_2" # optional
)
```

---
pipeline_tag: text-generation
+ license: apache-2.0
+ language:
+ - zh
---

# Model Card for Breeze-7B-Base-v0.1

+
+ Breeze-7B is a language model built on top of [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and specifically enhanced for Traditional Chinese.
+
+ [Breeze-7B-Base-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Base-v0.1) introduces an expanded vocabulary with an additional 30,000 Traditional Chinese tokens and
+ is pre-trained on 250GB of Traditional Chinese content.
+ With the expanded vocabulary, the base model operates at twice the inference speed for Traditional Chinese characters compared to Mistral-7B. [See [Inference Performance](#inference-performance).]
+ To the best of our knowledge, this is the first vocabulary expansion carried out for a model tailored to Traditional Chinese.
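
To make the efficiency claim concrete, the two tokenizers can be compared directly. The snippet below is a minimal sketch, not part of the original card; it only assumes both tokenizers are downloadable from the Hugging Face Hub.

```python
# Minimal sketch (not from the original card): count how many tokens the
# Mistral-7B and Breeze-7B tokenizers need for the same Traditional Chinese text.
from transformers import AutoTokenizer

text = "台灣的夜市文化豐富多元，從傳統小吃到創意料理應有盡有。"

for name in ["mistralai/Mistral-7B-v0.1", "MediaTek-Research/Breeze-7B-Base-v0.1"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tokenizer(text)["input_ids"])
    print(f"{name}: {n_tokens} tokens for {len(text)} characters")
```

A lower token count for the same text translates directly into faster encoding and decoding, which is what the inference table below measures end to end.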
+
+ [Breeze-7B-Instruct-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v0.1) derives from the base model Breeze-7B-Base-v0.1
+ and has undergone supervised fine-tuning on more than 1 million instances to
+ sharpen its capabilities. The fine-tuned model performs strongly on benchmarks in both English and Traditional Chinese, surpassing
+ Taiwan-LLM-7B-v2.1-chat, Taiwan-LLM-13B-v2.0-chat, and Qwen-7B-chat in Traditional Chinese assessments, and outperforming Yi-6B-Chat on some benchmarks.
+ In English evaluations, Breeze-7B-Instruct-v0.1 shows results comparable to Mistral-7B-Instruct-v0.1 on the MMLU and MT-Bench benchmarks. [See [Chat Model Performance](#chat-model-performance).]
+
+
+ [Breeze-7B-Instruct-64k-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-64k-v0.1) extends Breeze-7B-Instruct-v0.1 to a 64k-token
+ context length, equivalent to roughly 88k Traditional Chinese characters. With only a minimal drop in performance on the regular benchmarks,
+ Breeze-7B-Instruct-64k-v0.1 can handle tasks such as question answering and summarization over document-level inputs. [See [Long-context Performance](#long-context-performance).]
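
Since the 64k limit is stated in tokens while documents are usually measured in characters, it can help to check a document's token count before sending it to the 64k model. The snippet below is an illustrative sketch only (the file name is a placeholder); the 64k figure comes from the paragraph above.

```python
# Illustrative sketch: check whether a long document fits in the 64k-token
# window of Breeze-7B-Instruct-64k-v0.1. "long_report.txt" is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MediaTek-Research/Breeze-7B-Instruct-64k-v0.1")

with open("long_report.txt", encoding="utf-8") as f:
    document = f.read()

n_tokens = len(tokenizer(document)["input_ids"])
status = "fits within" if n_tokens <= 64 * 1024 else "exceeds"
print(f"{len(document)} characters -> {n_tokens} tokens ({status} the 64k window)")
```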
+

*A project by the members (in alphabetical order): Chan-Jan Hsu 許湛然, Chang-Le Liu 劉昶樂, Feng-Ting Liao 廖峰挺, Po-Chun Hsu 許博竣, Yi-Chang Chen 陳宜昌, and the supervisor Da-Shan Shiu 許大山.*

## Features

+ - Breeze-7B-Base-v0.1
+   - Expanding the vocabulary size from 32k to 62k to better support Traditional Chinese
+   - 8k-token context length
+ - Breeze-7B-Instruct-v0.1
+   - Expanding the vocabulary size from 32k to 62k to better support Traditional Chinese
+   - 8k-token context length
+   - Multi-turn dialogue (without special handling for harmfulness)
+ - Breeze-7B-Instruct-64k-v0.1
+   - Expanding the vocabulary size from 32k to 62k to better support Traditional Chinese
+   - 64k-token context length
+   - Multi-turn dialogue (without special handling for harmfulness)

## Model Details
+
+ - Breeze-7B-Base-v0.1
+   - Finetuned from: [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
+   - Model type: Causal decoder-only transformer language model
+   - Language: English and Traditional Chinese (zh-tw)
+ - Breeze-7B-Instruct-v0.1
+   - Finetuned from: [MediaTek-Research/Breeze-7B-Base-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Base-v0.1)
+   - Model type: Causal decoder-only transformer language model
+   - Language: English and Traditional Chinese (zh-tw)
+ - Breeze-7B-Instruct-64k-v0.1
+   - Finetuned from: [MediaTek-Research/Breeze-7B-Instruct-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v0.1)
+   - Model type: Causal decoder-only transformer language model
+   - Language: English and Traditional Chinese (zh-tw)

  ## Base Model Performance


| Mistral-7B-v0.1 | 33.01 | 42.23 | 35.86 | 37.63 |


+ **TMMLU+**, **DRCD**, and **Table** are sourced from [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2).
+ [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2) derives from [TCEval-v1](https://github.com/mtkresearch/MR-Models/tree/main/TC-Eval)
+ and [ikala/tmmluplus](https://huggingface.co/datasets/ikala/tmmluplus). **MMLU** is sourced from [hails/mmlu_no_train](https://huggingface.co/datasets/hails/mmlu_no_train).
+ We evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU** with code adapted from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
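
For reference, recent versions of the stock harness expose a `simple_evaluate` helper that can be driven as sketched below. This is only an illustration: the authors used a revised version of the harness, and the Traditional Chinese tasks (TMMLU+, DRCD, Table) are not stock task names, so the snippet shows only the standard `mmlu` task; the few-shot setting is an assumption, not something stated on this card.

```python
# Hedged sketch using the stock EleutherAI lm-evaluation-harness (v0.4+ Python API).
# The authors' revised harness and their TC task configs are not reproduced here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=MediaTek-Research/Breeze-7B-Base-v0.1,dtype=bfloat16",
    tasks=["mmlu"],   # stock task; TMMLU+/DRCD/Table need the revised code
    num_fewshot=5,    # assumed setting, not confirmed by the card
    batch_size=8,
)
print(results["results"])
```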
+
+

## Chat Model Performance

| Models | | TMMLU+ (ACC) | TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MT-Bench-tw (Score) | MMLU (ACC) | MMLU (ACC) | MT-Bench (Score) |

| Taiwan-LLM-13B-v2.0-chat | 27.74 | 33.69 | 27.03 | 29.43 |
| Taiwan-LLM-7B-v2.1-chat | 25.58 | 31.76 | 27.36 | 27.61 |

+ **TMMLU+**, **DRCD**, **Table**, and **MT-Bench-tw** are sourced from [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2).
+ [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2) derives from [TCEval-v1](https://github.com/mtkresearch/MR-Models/tree/main/TC-Eval)
+ and [ikala/tmmluplus](https://huggingface.co/datasets/ikala/tmmluplus). **MMLU** is sourced from [hails/mmlu_no_train](https://huggingface.co/datasets/hails/mmlu_no_train).
+ **MT-Bench** is sourced from [lmsys/mt_bench_human_judgments](https://huggingface.co/datasets/lmsys/mt_bench_human_judgments).
+ We evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU** with code adapted from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness),
+ and **MT-Bench-tw** and **MT-Bench** with code adapted from [fastchat llm_judge](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge).
+

## Inference Performance
In this test, we use the first 700 characters of the [web article](https://health.udn.com/health/story/5976/7699252?from=udn_ch1005_main_index) as the input and ask the model to write the same article again.
+ All inference runs on 2 RTX A6000 GPUs with `vllm` (tensor-parallel size 2). A sketch of such a timing run follows the table below.

+ | Models | Inference Time (sec) | Estimated Max Input Length (Char) |
|--------------------------------------------------------------------|-------------------|--------------------------|
| Yi-6B | 10.62 | 5.2k |
| **Breeze-7B-Instruct-v0.1** | 10.74 | 11.1k |

| Taiwan-LLM-13B-v2.0-base | 36.80 | 2.2k |
| Yi-34B | 43.71 | 4.5k |
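
The timing run described above could be reproduced roughly as follows. This is a sketch under assumptions (greedy decoding, a truncated placeholder prompt, default `vllm` settings); it is not the authors' exact benchmark script.

```python
# Rough sketch of the timing setup described above; assumptions: greedy decoding,
# placeholder prompt, default vllm settings. Not the authors' exact script.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="MediaTek-Research/Breeze-7B-Instruct-v0.1", tensor_parallel_size=2)
params = SamplingParams(temperature=0.0, max_tokens=1024)

# The first 700 characters of the article would go here.
prompt = "請將以下文章原封不動地再寫一次：..."

start = time.time()
outputs = llm.generate([prompt], params)
print(f"inference time: {time.time() - start:.2f} s")
print(outputs[0].outputs[0].text[:200])
```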

+ ## Long-context Performance
+
+ TBD
+
+ ## Examples
+
+ TBD

## Use in Transformers

+ First, install the direct dependencies:
```
+ pip install transformers torch accelerate
```
If you want faster inference using flash-attention2, you need to install these dependencies:
```bash

import torch

model = AutoModelForCausalLM.from_pretrained(
+ "MediaTek-Research/Breeze-7B-Instruct-v0.1",
device_map="auto",
torch_dtype=torch.bfloat16,
+ attn_implementation="flash_attention_2" # optional
)
```

+ The structure of the query template follows that of Mistral-7B-Instruct, as shown below.
+ ```txt
+ <s> SYS_PROMPT [INST] QUERY1 [/INST] RESPONSE1 [INST] QUERY2 [/INST]
+ ```
+ where `SYS_PROMPT`, `QUERY1`, `RESPONSE1`, and `QUERY2` can be provided by the user.
+
+ The suggested default `SYS_PROMPT` is
+ ```txt
+ You are a helpful AI assistant built by MediaTek Research. The user you are helping speaks Traditional Chinese and comes from Taiwan.
+ ```
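
Putting the loading snippet and the template together, the sketch below builds a single-turn prompt and generates a reply. It is illustrative only: the sample question and generation settings are placeholders, not recommendations from the original card; the tokenizer adds the leading `<s>` automatically.

```python
# Illustrative end-to-end sketch combining the loading snippet and the query
# template above. The question and generation settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MediaTek-Research/Breeze-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

sys_prompt = ("You are a helpful AI assistant built by MediaTek Research. "
              "The user you are helping speaks Traditional Chinese and comes from Taiwan.")
query = "請簡單介紹台灣的夜市文化。"

# Single-turn instance of the template; the tokenizer prepends <s> on its own.
prompt = f"{sys_prompt} [INST] {query} [/INST]"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```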