---
license: other
license_name: yi-license
license_link: LICENSE
widget:
  - example_title: Yi-34B-Chat
    text: hi
    output:
      text: ' Hello! How can I assist you today?'
  - example_title: Yi-34B
    text: >-
      There's a place where time stands still. A place of breath taking wonder,
      but also
    output:
      text: >2-
        an eerie sense that something is just not right…
        Between the two worlds lies The Forgotten Kingdom - home to creatures
        long since thought extinct and ancient magic so strong it defies belief!
        Only here can you find what has been lost for centuries: An Elixir Of
        Life which will restore youth and vitality if only those who seek its
        power are brave enough to face up against all manner of dangers lurking
        in this mysterious land! But beware; some say there may even exist
        powerful entities beyond our comprehension whose intentions towards
        humanity remain unclear at best ---- they might want nothing more than
        destruction itself rather then anything else from their quest after
        immortality (and maybe someone should tell them about modern medicine)?
        In any event though – one thing remains true regardless : whether or not
        success comes easy depends entirely upon how much effort we put into
        conquering whatever challenges lie ahead along with having faith deep
        down inside ourselves too ;) So let’s get started now shall We?
pipeline_tag: text-generation
---

## Introduction

**SUS-Chat** is powered by SUSTech x IDEA-CCNL and is based on `01-ai/Yi-34B`.

## News

<details open>
<summary>🎯 <b>2023/11/23</b>: The chat models are open to the public.</summary>

This release contains two chat models based on the previously released base models, two 8-bit models quantized with GPTQ, and two 4-bit models quantized with AWQ.

- `Yi-34B-Chat`
- `Yi-34B-Chat-4bits`
- `Yi-34B-Chat-8bits`
- `Yi-6B-Chat`
- `Yi-6B-Chat-4bits`
- `Yi-6B-Chat-8bits`

You can try some of them interactively at:

- [HuggingFace](https://huggingface.co/spaces/01-ai/Yi-34B-Chat)
- [Replicate](https://replicate.com/01-ai)

</details>

<details open>
<summary>🔔 <b>2023/11/23</b>: The Yi Series Models Community License Agreement is updated to v2.1.</summary>
</details>

<details>
<summary>🔥 <b>2023/11/08</b>: Invited test of the Yi-34B chat model.</summary>

Application form:

- [English](https://cn.mikecrm.com/l91ODJf)
- [Chinese](https://cn.mikecrm.com/gnEZjiQ)

</details>

<details>
<summary>🎯 <b>2023/11/05</b>: The base models <code>Yi-6B-200K</code> and <code>Yi-34B-200K</code>.</summary>

This release contains two base models with the same parameter sizes as the
previous release, except that the context window is extended to 200K.

</details>

<details>
<summary>🎯 <b>2023/11/02</b>: The base models <code>Yi-6B</code> and <code>Yi-34B</code>.</summary>

The first public release contains two bilingual (English/Chinese) base models
with parameter sizes of 6B and 34B. Both are trained with a 4K sequence length,
which can be extended to 32K at inference time.

</details>

## Model Performance

### Base Model Performance

| Model         |   MMLU   |  CMMLU   |  C-Eval  |  GAOKAO  |   BBH    | Common-sense Reasoning | Reading Comprehension | Math & Code |
| :------------ | :------: | :------: | :------: | :------: | :------: | :--------------------: | :-------------------: | :---------: |
|               |  5-shot  |  5-shot  |  5-shot  |  0-shot  | 3-shot@1 |           -            |           -           |      -      |
| LLaMA2-34B    |   62.6   |    -     |    -     |    -     |   44.1   |          69.9          |         68.0          |    26.0     |
| LLaMA2-70B    |   68.9   |   53.3   |    -     |   49.8   |   51.2   |          71.9          |         69.4          |    36.8     |
| Baichuan2-13B |   59.2   |   62.0   |   58.1   |   54.3   |   48.8   |          64.3          |         62.4          |    23.0     |
| Qwen-14B      |   66.3   |   71.0   |   72.1   |   62.5   |   53.4   |          73.3          |         72.5          |  **39.8**   |
| Skywork-13B   |   62.1   |   61.8   |   60.6   |   68.1   |   41.7   |          72.4          |         61.4          |    24.9     |
| InternLM-20B  |   62.1   |   59.0   |   58.8   |   45.5   |   52.5   |          78.3          |           -           |    30.4     |
| Aquila-34B    |   67.8   |   71.4   |   63.1   |    -     |    -     |           -            |           -           |      -      |
| Falcon-180B   |   70.4   |   58.0   |   57.8   |   59.0   |   54.0   |          77.3          |         68.8          |    34.0     |
| Yi-6B         |   63.2   |   75.5   |   72.0   |   72.2   |   42.8   |          72.3          |         68.7          |    19.8     |
| Yi-6B-200K    |   64.0   |   75.3   |   73.5   |   73.9   |   42.0   |          72.0          |         69.1          |    19.0     |
| **Yi-34B**    | **76.3** | **83.7** |   81.4   |   82.8   | **54.3** |        **80.1**        |         76.4          |    37.1     |
| Yi-34B-200K   |   76.1   |   83.6   | **81.9** | **83.4** |   52.7   |          79.7          |       **76.6**        |    36.3     |

While benchmarking open-source models, we observed a disparity between the
results generated by our pipeline and those reported in public sources (e.g.
OpenCompass). On closer investigation, we found that different models may use
different prompts, post-processing strategies, and sampling techniques, which
can lead to significant variations in the results. Our prompt and
post-processing strategy remains consistent with the original benchmark, and
greedy decoding is used during evaluation without any post-processing of the
generated content. For scores that were not reported by the original authors
(including scores reported with different settings), we try to obtain results
with our pipeline.
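
The greedy decoding mentioned above can be illustrated with a minimal sketch. This is not the evaluation pipeline itself: `greedy_decode`, `next_token_scores`, and `toy_model` are hypothetical names introduced only for illustration, and the toy scoring table stands in for a real language model. The point is that the highest-scoring token is chosen at every step, so the same prompt always yields the same output.

```python
from typing import Callable, Dict, List


def greedy_decode(
    next_token_scores: Callable[[List[str]], Dict[str, float]],
    prompt: List[str],
    eos: str = "<eos>",
    max_new_tokens: int = 16,
) -> List[str]:
    """Generate tokens by always picking the highest-scoring next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)  # argmax, no sampling
        if best == eos:
            break
        tokens.append(best)
    # Return only the newly generated tokens, without the prompt.
    return tokens[len(prompt):]


def toy_model(tokens: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a model: scores depend on the last token only."""
    table = {
        "2+2=": {"4": 0.9, "5": 0.1},
        "4": {"<eos>": 0.8, "4": 0.2},
    }
    return table.get(tokens[-1], {"<eos>": 1.0})


print(greedy_decode(toy_model, ["2+2="]))
```

Because no sampling is involved, repeated runs on the same prompt give identical generations, which is what makes scores comparable across runs.
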

To evaluate the model's capability extensively, we adopted the methodology
outlined in Llama2. Specifically, we included PIQA, SIQA, HellaSwag, WinoGrande,
ARC, OBQA, and CSQA to assess common-sense reasoning. SQuAD, QuAC, and BoolQ
were incorporated to evaluate reading comprehension. CSQA was tested exclusively
with a 7-shot setup, while all other tests were conducted with a 0-shot
configuration. Additionally, we introduced GSM8K (8-shot@1), MATH (4-shot@1),
HumanEval (0-shot@1), and MBPP (3-shot@1) under the category "Math & Code". Due
to technical constraints, we did not test Falcon-180B on QuAC and OBQA; its
category scores are derived by averaging the scores on the remaining tasks.
Since scores on these two tasks are generally lower than the average, we believe
that Falcon-180B's performance was not underestimated.
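
The averaging over remaining tasks described above can be sketched as follows. The helper `category_average` is a hypothetical name, and the numbers are made-up placeholders, not benchmark results; the sketch only shows why dropping a below-average task cannot lower the category mean.

```python
from typing import Dict, Optional


def category_average(task_scores: Dict[str, Optional[float]]) -> Optional[float]:
    """Average a category over the tasks that were actually run.

    A score of None marks a task that could not be evaluated.
    """
    available = [s for s in task_scores.values() if s is not None]
    if not available:
        return None
    return sum(available) / len(available)


# Placeholder numbers for illustration only (not real results).
all_tasks = {"PIQA": 80.0, "SIQA": 60.0, "OBQA": 40.0}
obqa_missing = {"PIQA": 80.0, "SIQA": 60.0, "OBQA": None}

full_mean = category_average(all_tasks)
partial_mean = category_average(obqa_missing)

# The missing task scores below the full mean, so leaving it out
# raises the average rather than lowering it.
print(full_mean, partial_mean)
```
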