SUSTech/SUS-Chat-34B

 pipeline_tag: text-generation
 ---
+## Introduction
+**SUS-Chat** is powered by SUSTech x IDEA-CCNL and is based on `01-ai/Yi-34B`.
+## News
+<details open>
+<summary>🎯 <b>2023/11/23</b>: The chat models are open to the public.</summary>
+This release contains two chat models based on the previously released base models, two 8-bit models quantized with GPTQ, and two 4-bit models quantized with AWQ.
+- `Yi-34B-Chat`
+- `Yi-34B-Chat-4bits`
+- `Yi-34B-Chat-8bits`
+- `Yi-6B-Chat`
+- `Yi-6B-Chat-4bits`
+- `Yi-6B-Chat-8bits`
+You can try some of them interactively at:
+- [HuggingFace](https://huggingface.co/spaces/01-ai/Yi-34B-Chat)
+- [Replicate](https://replicate.com/01-ai)
+</details>
+<details open>
+<summary>🔔 <b>2023/11/23</b>: The Yi Series Models Community License Agreement is updated to v2.1.</summary>
+</details>
+<details>
+<summary>🔥 <b>2023/11/08</b>: Invited test of Yi-34B chat model.</summary>
+Application form:
+- [English](https://cn.mikecrm.com/l91ODJf)
+- [Chinese](https://cn.mikecrm.com/gnEZjiQ)
+</details>
+<details>
+<summary>🎯 <b>2023/11/05</b>: The base models <code>Yi-6B-200K</code> and <code>Yi-34B-200K</code> are released.</summary>
+This release contains two base models with the same parameter sizes as the
+previous release, except that the context window is extended to 200K.
+</details>
+<details>
+<summary>🎯 <b>2023/11/02</b>: The base models <code>Yi-6B</code> and <code>Yi-34B</code> are released.</summary>
+The first public release contains two bilingual (English/Chinese) base models
+with parameter sizes of 6B and 34B. Both are trained with a 4K sequence length,
+which can be extended to 32K at inference time.
+</details>
+## Model Performance
+### Base Model Performance
+| Model         |   MMLU   |  CMMLU   |  C-Eval  |  GAOKAO  |   BBH    | Common-sense Reasoning | Reading Comprehension | Math & Code |
+| :------------ | :------: | :------: | :------: | :------: | :------: | :--------------------: | :-------------------: | :---------: |
+|               |  5-shot  |  5-shot  |  5-shot  |  0-shot  | 3-shot@1 |           -            |           -           |      -      |
+| LLaMA2-34B    |   62.6   |    -     |    -     |    -     |   44.1   |          69.9          |         68.0          |    26.0     |
+| LLaMA2-70B    |   68.9   |   53.3   |    -     |   49.8   |   51.2   |          71.9          |         69.4          |    36.8     |
+| Baichuan2-13B |   59.2   |   62.0   |   58.1   |   54.3   |   48.8   |          64.3          |         62.4          |    23.0     |
+| Qwen-14B      |   66.3   |   71.0   |   72.1   |   62.5   |   53.4   |          73.3          |         72.5          |  **39.8**   |
+| Skywork-13B   |   62.1   |   61.8   |   60.6   |   68.1   |   41.7   |          72.4          |         61.4          |    24.9     |
+| InternLM-20B  |   62.1   |   59.0   |   58.8   |   45.5   |   52.5   |          78.3          |           -           |    30.4     |
+| Aquila-34B    |   67.8   |   71.4   |   63.1   |    -     |    -     |           -            |           -           |      -      |
+| Falcon-180B   |   70.4   |   58.0   |   57.8   |   59.0   |   54.0   |          77.3          |         68.8          |    34.0     |
+| Yi-6B         |   63.2   |   75.5   |   72.0   |   72.2   |   42.8   |          72.3          |         68.7          |    19.8     |
+| Yi-6B-200K    |   64.0   |   75.3   |   73.5   |   73.9   |   42.0   |          72.0          |         69.1          |    19.0     |
+| **Yi-34B**    | **76.3** | **83.7** |   81.4   |   82.8   | **54.3** |        **80.1**        |         76.4          |    37.1     |
+| Yi-34B-200K   |   76.1   |   83.6   | **81.9** | **83.4** |   52.7   |          79.7          |       **76.6**        |    36.3     |
+While benchmarking open-source models, we observed a disparity between the
+results generated by our pipeline and those reported in public sources (e.g.
+OpenCompass). On closer investigation, we found that different models may use
+different prompts, post-processing strategies, and sampling techniques, which
+can lead to significant variations in the outcomes. Our prompt and
+post-processing strategy remains consistent with the original benchmark, and
+greedy decoding is employed during evaluation without any post-processing of
+the generated content. For scores that were not reported by the original
+authors (including scores reported with different settings), we try to obtain
+results with our pipeline.
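
As a rough illustration of the greedy decoding mentioned above, the following minimal sketch picks the highest-scoring token at every step, with no sampling and no post-processing of the output. It is not the evaluation pipeline itself: `greedy_decode`, `toy_scores`, the 5-token vocabulary, and the `eos_id` convention are all hypothetical names and assumptions introduced only for this example.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_token_scores: Callable[[List[int]], Sequence[float]],
    prompt_ids: List[int],
    eos_id: int,
    max_new_tokens: int = 16,
) -> List[int]:
    """Return newly generated token ids, taking the top-scoring token each step."""
    ids = list(prompt_ids)
    generated: List[int] = []
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)
        # Greedy choice: argmax over the vocabulary, no sampling or temperature.
        token = max(range(len(scores)), key=lambda i: scores[i])
        if token == eos_id:
            break
        ids.append(token)
        generated.append(token)
    return generated


def toy_scores(ids: List[int]) -> List[float]:
    """Hypothetical stand-in for a model over a 5-token vocabulary.

    It always favors the token (last_id + 1) mod 5.
    """
    target = (ids[-1] + 1) % 5
    return [1.0 if i == target else 0.0 for i in range(5)]


if __name__ == "__main__":
    # Starting from token 0 and treating token 4 as end-of-sequence.
    print(greedy_decode(toy_scores, [0], eos_id=4))
```

Because the choice at each step is deterministic, repeated runs on the same prompt give the same output, which is one reason greedy decoding is convenient for comparing models.
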
+To evaluate the model's capability extensively, we adopted the methodology
+outlined in Llama2. Specifically, we included PIQA, SIQA, HellaSwag, WinoGrande,
+ARC, OBQA, and CSQA to assess common-sense reasoning. SQuAD, QuAC, and BoolQ
+were incorporated to evaluate reading comprehension. CSQA was exclusively tested
+using a 7-shot setup, while all other tests were conducted with a 0-shot
+configuration. Additionally, we introduced GSM8K (8-shot@1), MATH (4-shot@1),
+HumanEval (0-shot@1), and MBPP (3-shot@1) under the category "Math & Code". Due
+to technical constraints, we did not test Falcon-180B on QuAC and OBQA; the
+score is derived by averaging the scores on the remaining tasks. Since the
+scores for these two tasks are generally lower than the average, we believe that
+Falcon-180B's performance was not underestimated.
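
The averaging over remaining tasks described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `category_average` and the convention of marking an untested task with `None` are assumptions, and the numbers in the usage example are made up for demonstration only.

```python
from typing import Dict, Optional


def category_average(scores: Dict[str, Optional[float]]) -> float:
    """Average the scores of the tasks that were actually run.

    Tasks that were not tested are marked with None and skipped.
    """
    available = [s for s in scores.values() if s is not None]
    if not available:
        raise ValueError("no task scores available")
    return sum(available) / len(available)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    example = {"task_a": 80.0, "task_b": 60.0, "task_c": None}
    print(category_average(example))
```

If a skipped task would have scored below the average of the others, leaving it out raises the category score rather than lowering it, which matches the reasoning in the paragraph above.
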