Update table scores for BHASA and format table
README.md (CHANGED)

@@ -11,7 +11,7 @@ license: llama3

SEA-LION is a collection of Large Language Models (LLMs) which have been pretrained and instruct-tuned for the Southeast Asia (SEA) region.

LLaMA3 8B CPT SEA-LIONv2 Instruct is a multilingual model which has been fine-tuned with around **100,000 English instruction-completion pairs** alongside a smaller pool of around **50,000 instruction-completion pairs** from other ASEAN languages, such as Indonesian, Thai and Vietnamese.
These instructions have been carefully curated and rewritten to ensure the model was trained on truly open, commercially permissive and high-quality datasets.

SEA-LION stands for _Southeast Asian Languages In One Network_.

@@ -36,48 +36,48 @@ These tasks include Question Answering (QA), Sentiment Analysis (Sentiment), Tox

The evaluation was done zero-shot with native prompts, and only a sample of 100-1000 instances for each dataset was used, as per the setting described in the paper.

#### General Language Capabilities (BHASA)

| **Language** | **Model** | **Sentiment (f1)** | **QA (f1)** | **Toxicity (macro-f1)** | **Eng>Lang (chrf_score)** | **Lang>Eng (chrf_score)** | **Summary (f1)** | **Causal (accuracy)** | **NLI (accuracy)** | **LINDSEA (accuracy)** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ID | llama3-8b-cpt-sealionv2-instruct | 84.7 | 72.2 | 54.6 | 66.7 | 65.3 | 18.7 | 87.4 | 68.9 | 39.9 |
| ID | gemma-2-9b-it | 78.8 | 54.8 | 53.4 | 66.6 | 65.1 | 18.2 | 94.2 | 72.0 | 72.1 |
| ID | aya-23-8B | 82.6 | 64.5 | 45.4 | 64.6 | 63.9 | 22.2 | 89.0 | 44.4 | 50.4 |
| ID | SeaLLM3-7B-Chat | 74.6 | 45.4 | 50.4 | 64.0 | 63.4 | 17.4 | 92.0 | 58.2 | 65.2 |
| ID | Qwen2-7B-Instruct | 82.0 | 45.8 | 42.9 | 58.8 | 62.8 | 13.7 | 90.8 | 63.7 | 65.3 |
| ID | Meta-Llama-3.1-8B-Instruct | 61.3 | 64.0 | 37.1 | 63.9 | 65.4 | 19.4 | 83.2 | 29.4 | 57.1 |
| ID | Sailor-7B-Chat | 85.2 | 36.9 | 42.7 | 66.6 | 63.3 | 14.2 | 85.2 | 59.5 | 54.1 |
| ID | Meta-Llama-3-8B-Instruct | 72.3 | 55.5 | 44.7 | 56.5 | 55.6 | 15.4 | 82.4 | 71.8 | 59.2 |
| ID | Mistral-7B-Instruct-v0.3 | 78.8 | 40.7 | 40.3 | 49.9 | 57.9 | 15.7 | 71.8 | 59.6 | 34.5 |
| | | | | | | | | | | |
| VI | gemma-2-9b-it | 64.2 | 48.1 | 50.1 | 57.2 | 59.2 | 17.2 | 92.6 | 52.4 | \- |
| VI | llama3-8b-cpt-sealionv2-instruct | 54.1 | 57.1 | 22.0 | 58.6 | 59.0 | 18.3 | 87.8 | 52.4 | \- |
| VI | SeaLLM3-7B-Chat | 51.4 | 48.7 | 27.6 | 55.1 | 57.6 | 16.4 | 89.4 | 54.5 | \- |
| VI | Qwen2-7B-Instruct | 61.9 | 43.2 | 38.4 | 52.0 | 57.0 | 13.1 | 88.6 | 60.0 | \- |
| VI | aya-23-8B | 42.1 | 73.7 | 21.2 | 56.7 | 57.0 | 22.4 | 86.8 | 50.8 | \- |
| VI | Meta-Llama-3.1-8B-Instruct | 61.4 | 63.5 | 7.0 | 55.9 | 60.1 | 18.8 | 78.4 | 33.2 | \- |
| VI | Sailor-7B-Chat | 13.1 | 31.0 | 30.7 | 58.9 | 59.0 | 11.8 | 85.8 | 49.2 | \- |
| VI | Meta-Llama-3-8B-Instruct | 70.4 | 35.4 | 20.9 | 48.4 | 52.9 | 9.6 | 83.0 | 41.1 | \- |
| VI | Mistral-7B-Instruct-v0.3 | 51.0 | 36.1 | 41.3 | 36.9 | 49.1 | 13.2 | 69.6 | 34.7 | \- |
| | | | | | | | | | | |
| TH | gemma-2-9b-it | 49.0 | 76.3 | 65.5 | 43.5 | 56.5 | 25.8 | 90.4 | 38.9 | \- |
| TH | llama3-8b-cpt-sealionv2-instruct | 52.5 | 72.4 | 38.3 | 44.8 | 56.0 | 18.7 | 85.8 | 48.8 | \- |
| TH | Qwen2-7B-Instruct | 50.9 | 39.5 | 65.9 | 37.0 | 52.6 | 21.3 | 88.0 | 47.4 | \- |
| TH | SeaLLM3-7B-Chat | 40.2 | 45.0 | 55.5 | 41.8 | 54.6 | 23.3 | 90.2 | 36.4 | \- |
| TH | Sailor-7B-Chat | 48.1 | 31.4 | 33.1 | 44.3 | 56.0 | 15.2 | 85.6 | 45.3 | \- |
| TH | Meta-Llama-3.1-8B-Instruct | 32.5 | 82.2 | 25.5 | 39.7 | 55.5 | 24.9 | 73.4 | 6.2 | \- |
| TH | Meta-Llama-3-8B-Instruct | 38.8 | 68.6 | 48.6 | 35.0 | 47.7 | 14.2 | 78.2 | 54.3 | \- |
| TH | Mistral-7B-Instruct-v0.3 | 45.9 | 29.8 | 55.6 | 22.9 | 41.8 | 18.7 | 59.2 | 41.7 | \- |
| TH | aya-23-8B | 28.8 | 43.3 | 27.6 | 19.1 | 40.3 | 19.5 | 50.6 | 33.6 | \- |
| | | | | | | | | | | |
| TA | gemma-2-9b-it | 97.7 | 39.0 | 74.0 | 42.1 | 53.8 | 13.4 | 89.2 | 38.3 | \- |
| TA | llama3-8b-cpt-sealionv2-instruct | 97.2 | 29.4 | 58.0 | 45.6 | 53.2 | 6.4 | 76.8 | 34.5 | \- |
| TA | Meta-Llama-3.1-8B-Instruct | 88.5 | 51.9 | 70.0 | 37.8 | 51.5 | 8.8 | 56.6 | 30.8 | \- |
| TA | SeaLLM3-7B-Chat | 91.7 | 31.8 | 65.0 | 32.9 | 42.2 | 9.4 | 51.8 | 34.6 | \- |
| TA | Qwen2-7B-Instruct | 86.4 | 25.1 | 58.0 | 22.4 | 37.0 | 8.9 | 57.6 | 37.2 | \- |
| TA | Meta-Llama-3-8B-Instruct | 67.4 | 20.9 | 66.0 | 33.9 | 40.2 | 1.0 | 58.6 | 41.3 | \- |
| TA | aya-23-8B | 41.7 | 41.9 | 51.0 | 18.1 | 37.2 | 7.2 | 43.4 | 40.6 | \- |
| TA | Sailor-7B-Chat | 32.7 | 17.5 | 54.0 | 19.5 | 31.3 | 7.9 | 11.0 | 0.0 | \- |
| TA | Mistral-7B-Instruct-v0.3 | 0.0 | 13.8 | 53.0 | 15.6 | 22.5 | 7.4 | 14.2 | 0.8 | \- |

#### Instruction-following Capabilities

@@ -89,34 +89,41 @@ As these two datasets were originally in English, the linguists and native speak

IFEval evaluates a model's ability to adhere to constraints provided in the prompt, for example beginning a response with a specific word/phrase or answering with a certain number of sections. The metric used is accuracy normalized by language (if the model performs the task correctly but responds in the wrong language, it is judged to have failed the task).

#### Instruction-following Capabilities (IFEval)

| **Model** | **Indonesian** | **Vietnamese** | **English** |
| --- | :---: | :---: | :---: |
| gemma-2-9b-it | 0.88 | 0.77 | 0.85 |
| Meta-Llama-3.1-8B-Instruct | 0.68 | 0.68 | 0.85 |
| Qwen2-7B-Instruct | 0.63 | 0.65 | 0.70 |
| llama3-8b-cpt-sealionv2-instruct | 0.61 | 0.66 | 0.70 |
| aya-23-8B | 0.58 | 0.56 | 0.67 |
| SeaLLMs-v3-7B-Chat | 0.55 | 0.52 | 0.67 |
| Mistral-7B-Instruct-v0.3 | 0.43 | 0.39 | 0.70 |
| Meta-Llama-3-8B-Instruct | 0.27 | 0.21 | 0.80 |
| Sailor-7B-Chat | 0.26 | 0.25 | 0.42 |

Note: Scores are language-normalized accuracies, i.e. models are penalised when they respond in the incorrect language even if they otherwise follow the instructions correctly.
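
As a rough illustration of this scoring rule, the sketch below marks a response as correct only when it both satisfies the prompt constraints and is written in the expected language; the `Judgement` structure and helper function are illustrative assumptions, not the actual IFEval harness.

```python
# Illustrative sketch only: a response counts as correct just when it satisfies
# the prompt constraints AND is written in the expected language.
from dataclasses import dataclass


@dataclass
class Judgement:
    follows_instructions: bool  # did the response satisfy every constraint?
    response_language: str      # detected language of the response, e.g. "id"


def language_normalized_accuracy(judgements: list[Judgement], expected_language: str) -> float:
    """Fraction of responses that follow the instructions in the right language."""
    correct = sum(
        1 for j in judgements
        if j.follows_instructions and j.response_language == expected_language
    )
    return correct / len(judgements)


# Three of four responses follow the instructions, but one of them answers in
# English instead of Indonesian, so the normalized score drops to 0.5.
sample = [
    Judgement(True, "id"),
    Judgement(True, "id"),
    Judgement(True, "en"),
    Judgement(False, "id"),
]
print(language_normalized_accuracy(sample, "id"))  # 0.5
```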

**MT-Bench**

MT-Bench evaluates a model's ability to engage in multi-turn (2 turns) conversations and respond in ways that align with human needs. We use `gpt-4-1106-preview` as the judge model and compare against `gpt-3.5-turbo-0125` as the baseline model. The metric used is the win rate against the baseline model. A tie is given a score of 0.5.

#### Multi-turn Capabilities (MT-Bench)

| **Model** | **Indonesian** | **Vietnamese** | **English** |
| --- | :---: | :---: | :---: |
| gemma-2-9b-it | 0.684 | 0.674 | 0.638 |
| SeaLLMs-v3-7B-Chat | 0.583 | 0.656 | 0.429 |
| Qwen2-7B-Instruct | 0.498 | 0.556 | 0.597 |
| llama3-8b-cpt-sealionv2-instruct | 0.531 | 0.517 | 0.510 |
| Meta-Llama-3.1-8B-Instruct | 0.411 | 0.477 | 0.618 |
| aya-23-8B | 0.499 | 0.546 | 0.416 |
| Meta-Llama-3-8B-Instruct | 0.403 | 0.437 | 0.564 |
| Mistral-7B-Instruct-v0.3 | 0.347 | 0.202 | 0.524 |
| Sailor-7B-Chat | 0.290 | 0.314 | 0.190 |

Note: Scores are the weighted win rate across the reasoning, STEM, math, humanities, extraction, writing and roleplay categories.
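
As a rough illustration of this metric, the sketch below scores each comparison against the baseline as 1 for a win, 0.5 for a tie and 0 for a loss, then averages over categories; the sample outcomes and the uniform category weighting are assumptions for illustration, not the reported evaluation pipeline.

```python
# Illustrative sketch only: win rate against the baseline model, with ties worth 0.5.
WIN, TIE, LOSS = 1.0, 0.5, 0.0


def win_rate(outcomes: list[float]) -> float:
    """Mean outcome over one MT-Bench category."""
    return sum(outcomes) / len(outcomes)


# Hypothetical per-category outcomes against the baseline model.
per_category = {
    "reasoning": win_rate([WIN, TIE, LOSS, WIN]),
    "writing": win_rate([TIE, WIN, LOSS, TIE]),
    "roleplay": win_rate([WIN, WIN, TIE, LOSS]),
}

# A uniform average over categories is assumed here; the reported scores use a
# weighted aggregate across all MT-Bench categories.
overall = sum(per_category.values()) / len(per_category)
print(round(overall, 3))  # 0.583 for the sample outcomes above
```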

### Usage

SEA-LION can be run using the 🤗 Transformers library.
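
A minimal usage sketch with the 🤗 Transformers API is shown below; the repository id is assumed from the model name used in the tables above (replace it with the actual Hugging Face repo id), and the generation settings are illustrative defaults.

```python
# Minimal sketch, assuming the repo id matches the model name used above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/llama3-8b-cpt-sea-lionv2-instruct"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Example prompt in Indonesian; any SEA language supported by the model works.
messages = [
    {"role": "user", "content": "Apa sentimen dari kalimat berikut: 'Saya suka makanan ini!'"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```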