waiyiaisg committed (verified)
Commit 72c52c2 · 1 Parent(s): 26436c6

Update table scores for BHASA and format table

Files changed (1)
  1. README.md +72 -65
README.md CHANGED
@@ -11,7 +11,7 @@ license: llama3
11
 
12
  SEA-LION is a collection of Large Language Models (LLMs) which has been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
13
 
14
- LLaMA3 8B CPT SEA-LIONv2 Instruct is a multilingual model which has been fine-tuned with **thousands of English and Indonesian instruction-completion pairs** alongside a smaller pool of instruction-completion pairs from other ASEAN languages.
15
  These instructions have been carefully curated and rewritten to ensure the model was trained on truly open, commercially permissive and high quality datasets.
16
 
17
  SEA-LION stands for _Southeast Asian Languages In One Network_.
@@ -36,48 +36,48 @@ These tasks include Question Answering (QA), Sentiment Analysis (Sentiment), Tox
36
 
37
  The evaluation was done zero-shot with native prompts and only a sample of 100-1000 instances for each dataset was used as per the setting described in the paper.
38
 
39
- | | | | **QA** | **Sentiment** | **Toxicity** | **Eng>Lang** | **Lang>Eng** | **Summ** | **NLI** | **Causal** | **LINDSEA** |
40
- |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
41
- | **Language** | **Model** | **Win-rate** | **F1** | **F1** | **Macro-F1** | **ChrF++** | **ChrF++** | **F1** | **Accuracy** | **Accuracy** | **Accuracy** |
42
- | ID | llama3-8b-cpt-sealionv2-instruct | 76.39% | 72.23 | 84.72 | 54.64 | 66.71 | 65.29 | 18.70 | 68.90 | 87.40 | 39.91 |
43
- | ID | gemma-2-9b-it | 76.39% | 54.77 | 78.83 | 53.37 | 66.56 | 65.15 | 18.20 | 72.00 | 94.20 | 72.14 |
44
- | ID | aya-23-8B | 61.11% | 64.51 | 82.61 | 45.40 | 64.60 | 63.91 | 22.15 | 44.40 | 89.00 | 50.45 |
45
- | ID | SeaLLM3-7B-Chat | 51.39% | 45.42 | 74.58 | 50.42 | 64.03 | 63.44 | 17.44 | 58.20 | 92.00 | 65.22 |
46
- | ID | Qwen2-7B-Instruct | 45.83% | 45.77 | 81.97 | 42.92 | 58.83 | 62.79 | 13.66 | 63.70 | 90.80 | 65.32 |
47
- | ID | Meta-Llama-3.1-8B-Instruct | 41.67% | 63.98 | 61.34 | 37.10 | 63.90 | **65.35** | 19.44 | 29.40 | 83.20 | 57.12 |
48
- | ID | Sailor-7B-Chat | 41.67% | 36.93 | **85.17** | 42.67 | 66.61 | 63.34 | 14.16 | 59.50 | 85.20 | 54.10 |
49
- | ID | Meta-Llama-3-8B-Instruct | 36.11% | 55.49 | 72.27 | 44.68 | 56.54 | 55.63 | 15.35 | 71.80 | 82.40 | 59.25 |
50
- | ID | Mistral-7B-Instruct-v0.3 | 19.44% | 40.69 | 78.84 | 40.33 | 49.88 | 57.89 | 15.74 | 59.60 | 71.80 | 34.48 |
51
- | | | | | | | | | | | | |
52
- | VI | gemma-2-9b-it | 78.91% | 48.11 | 64.23 | **50.08** | 57.21 | 59.20 | 17.18 | 52.40 | **92.60** | - |
53
- | VI | llama3-8b-cpt-sealionv2-instruct | 64.84% | 57.05 | 54.09 | 21.99 | 58.60 | 58.97 | 18.28 | 52.40 | 87.80 | - |
54
- | VI | SeaLLM3-7B-Chat | 57.81% | 48.71 | 51.36 | 27.60 | 55.05 | 57.64 | 16.40 | 54.50 | 89.40 | - |
55
- | VI | Qwen2-7B-Instruct | 54.69% | 43.21 | 61.94 | 38.44 | 52.02 | 56.99 | 13.10 | **60.00** | 88.60 | - |
56
- | VI | aya-23-8B | 54.69% | **73.69** | 42.14 | 21.17 | 56.70 | 57.02 | **22.40** | 50.80 | 86.80 | - |
57
- | VI | Meta-Llama-3.1-8B-Instruct | 50.00% | 63.49 | 61.43 | 7.02 | 55.91 | **60.07** | 18.78 | 33.20 | 78.40 | - |
58
- | VI | Sailor-7B-Chat | 40.62% | 31.00 | 13.13 | 30.66 | **58.85** | 59.02 | 11.85 | 49.20 | 85.80 | - |
59
- | VI | Meta-Llama-3-8B-Instruct | 25.00% | 35.42 | **70.44** | 20.91 | 48.42 | 52.90 | 9.65 | 41.10 | 83.00 | - |
60
- | VI | Mistral-7B-Instruct-v0.3 | 23.44% | 36.13 | 51.01 | 41.30 | 36.89 | 49.06 | 13.22 | 34.70 | 69.60 | - |
61
- | | | | | | | | | | | | |
62
- | TH | gemma-2-9b-it | 82.81% | 76.33 | 49.01 | 65.49 | 43.49 | **56.48** | **25.79** | 38.90 | **90.40** | - |
63
- | TH | llama3-8b-cpt-sealionv2-instruct | 73.44% | 72.41 | **52.51** | 38.25 | **44.84** | 56.05 | 18.73 | 48.80 | 85.80 | - |
64
- | TH | Qwen2-7B-Instruct | 62.50% | 39.47 | 50.85 | **65.89** | 36.99 | 52.58 | 21.32 | 47.40 | 88.00 | - |
65
- | TH | SeaLLM3-7B-Chat | 56.25% | 45.01 | 40.24 | 55.48 | 41.80 | 54.58 | 23.33 | 36.40 | 90.20 | - |
66
- | TH | Sailor-7B-Chat | 48.44% | 31.44 | 48.11 | 33.10 | 44.26 | 56.03 | 15.24 | 45.30 | 85.60 | - |
67
- | TH | Meta-Llama-3.1-8B-Instruct | 42.19% | **82.16** | 32.46 | 25.48 | 39.65 | 55.47 | 24.92 | 6.20 | 73.40 | - |
68
- | TH | Meta-Llama-3-8B-Instruct | 40.62% | 68.57 | 38.80 | 48.63 | 35.03 | 47.74 | 14.21 | **54.30** | 78.20 | - |
69
- | TH | Mistral-7B-Instruct-v0.3 | 29.69% | 29.78 | 45.91 | 55.58 | 22.90 | 41.85 | 18.65 | 41.70 | 59.20 | - |
70
- | TH | aya-23-8B | 14.06% | 43.29 | 28.84 | 27.64 | 19.10 | 40.29 | 19.53 | 33.60 | 50.60 | - |
71
- | | | | | | | | | | | | |
72
- | TA | gemma-2-9b-it | 81.84% | 39.04 | **97.70** | 0.85 | 0.86 | 11.98 | 89.20 | - | 38.30 | - |
73
- | TA | llama3-8b-cpt-sealionv2-instruct | 70.51% | 29.35 | 97.19 | 0.87 | 0.86 | 6.80 | 76.80 | - | 34.50 | - |
74
- | TA | SeaLLM3-7B-Chat | 56.25% | 31.79 | 91.69 | 0.69 | 0.78 | 11.88 | 51.80 | - | 34.60 | - |
75
- | TA | Qwen2-7B-Instruct | 53.12% | 25.13 | 86.39 | 0.47 | 0.71 | 7.49 | 57.60 | - | 37.20 | - |
76
- | TA | Meta-Llama-3.1-8B-Instruct | 48.83% | **51.86** | 88.51 | 0.81 | 0.85 | 9.34 | 56.60 | - | 30.80 | - |
77
- | TA | aya-23-8B | 43.75% | 41.89 | 41.71 | 0.47 | 0.74 | 6.47 | 43.40 | - | 40.60 | - |
78
- | TA | Sailor-7B-Chat | 37.50% | 17.46 | 32.65 | 0.46 | 0.70 | 5.60 | 11.00 | - | 0.00 | - |
79
- | TA | Meta-Llama-3-8B-Instruct | 37.50% | 20.88 | 67.40 | 0.71 | 0.70 | 0.74 | 58.60 | - | 41.30 | - |
80
- | TA | Mistral-7B-Instruct-v0.3 | 20.70% | 13.85 | 0.00 | 0.37 | 0.52 | 5.31 | 14.20 | - | 0.80 | - |
81
 
82
 
83
  #### Instruction-following Capabilities
@@ -89,34 +89,41 @@ As these two datasets were originally in English, the linguists and native speak
89
 
90
  IFEval evaluates a model's ability to adhere to constraints provided in the prompt, for example beginning a response with a specific word/phrase or answering with a certain number of sections. The metric used is accuracy normalized by language (if the model performs the task correctly but responds in the wrong language, it is judged to have failed the task).
91
 
92
- | Model | IFEval (Indonesian) | IFEval (Vietnamese) | IFEval (English) |
93
- |-------------------------------|---------------------|---------------------|------------------|
94
- | gemma-2-9b-it | 0.88 | 0.77 | 0.85 |
95
- | Meta-Llama-3.1-8B-Instruct | 0.68 | 0.68 | 0.85 |
96
- | Qwen2-7B-Instruct | 0.63 | 0.65 | 0.70 |
97
- | llama3-8b-cpt-sealionv2-instruct | 0.61 | 0.66 | 0.70 |
98
- | aya-23-8B | 0.58 | 0.56 | 0.67 |
99
- | SeaLLMs-v3-7B-Chat | 0.55 | 0.52 | 0.67 |
100
- | Mistral-7B-Instruct-v0.3 | 0.43 | 0.39 | 0.70 |
101
- | Meta-Llama-3-8B-Instruct | 0.27 | 0.21 | 0.80 |
102
- | Sailor-7B-Chat | 0.26 | 0.25 | 0.42 |
103
 
104
 
105
  **MT-Bench**
106
 
107
  MT-Bench evaluates a model's ability to engage in multi-turn (2 turns) conversations and respond in ways that align with human needs. We use `gpt-4-1106-preview` as the judge model and compare against `gpt-3.5-turbo-0125` as the baseline model. The metric used is the win rate against the baseline model. A tie is given a score of 0.5.
108
 
109
- | Model | MT-Bench (Indonesian) | MT-Bench (Vietnamese) | MT-Bench (English) |
110
- |-------------------------------|-----------------------|-----------------------|--------------------|
111
- | gemma-2-9b-it | 0.684 | 0.674 | 0.638 |
112
- | SeaLLMs-v3-7B-Chat | 0.583 | 0.656 | 0.429 |
113
- | Qwen2-7B-Instruct | 0.498 | 0.556 | 0.597 |
114
- | llama3-8b-cpt-sealionv2-instruct | 0.531 | 0.517 | 0.510 |
115
- | Meta-Llama-3.1-8B-Instruct | 0.411 | 0.477 | 0.618 |
116
- | aya-23-8B | 0.499 | 0.546 | 0.416 |
117
- | Meta-Llama-3-8B-Instruct | 0.403 | 0.437 | 0.564 |
118
- | Mistral-7B-Instruct-v0.3 | 0.347 | 0.202 | 0.524 |
119
- | Sailor-7B-Chat | 0.290 | 0.314 | 0.190 |
120
 
121
  ### Usage
122
  SEA-LION can be run using the 🤗 Transformers library
 
11
 
12
  SEA-LION is a collection of Large Language Models (LLMs) which has been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
13
 
14
+ LLaMA3 8B CPT SEA-LIONv2 Instruct is a multilingual model which has been fine-tuned with around **100,000 English instruction-completion pairs** alongside a smaller pool of around **50,000 instruction-completion pairs** from other ASEAN languages, such as Indonesian, Thai and Vietnamese.
15
  These instructions have been carefully curated and rewritten to ensure the model was trained on truly open, commercially permissive and high quality datasets.
16
 
17
  SEA-LION stands for _Southeast Asian Languages In One Network_.
 
36
 
37
  The evaluation was done zero-shot with native prompts and only a sample of 100-1000 instances for each dataset was used as per the setting described in the paper.
38
 
39
+ #### General Language Capabilities (BHASA)
40
+ | **Language** | **Model** | **Sentiment (F1)** | **QA (F1)** | **Toxicity (Macro-F1)** | **Eng>Lang (ChrF++)** | **Lang>Eng (ChrF++)** | **Summary (F1)** | **Causal (Accuracy)** | **NLI (Accuracy)** | **LINDSEA (Accuracy)** |
41
+ | ---------- | -------------------------------- | ------------------ | ----------- | ----------------------- | ------------------------- | ------------------------- | ---------------- | --------------------- | ------------------ | ---------------------- |
42
+ | ID | llama3-8b-cpt-sealionv2-instruct | 84.7 | 72.2 | 54.6 | 66.7 | 65.3 | 18.7 | 87.4 | 68.9 | 39.9 |
43
+ | ID | gemma-2-9b-it | 78.8 | 54.8 | 53.4 | 66.6 | 65.1 | 18.2 | 94.2 | 72.0 | 72.1 |
44
+ | ID | aya-23-8B | 82.6 | 64.5 | 45.4 | 64.6 | 63.9 | 22.2 | 89.0 | 44.4 | 50.4 |
45
+ | ID | SeaLLM3-7B-Chat | 74.6 | 45.4 | 50.4 | 64.0 | 63.4 | 17.4 | 92.0 | 58.2 | 65.2 |
46
+ | ID | Qwen2-7B-Instruct | 82.0 | 45.8 | 42.9 | 58.8 | 62.8 | 13.7 | 90.8 | 63.7 | 65.3 |
47
+ | ID | Meta-Llama-3.1-8B-Instruct | 61.3 | 64.0 | 37.1 | 63.9 | 65.4 | 19.4 | 83.2 | 29.4 | 57.1 |
48
+ | ID | Sailor-7B-Chat | 85.2 | 36.9 | 42.7 | 66.6 | 63.3 | 14.2 | 85.2 | 59.5 | 54.1 |
49
+ | ID | Meta-Llama-3-8B-Instruct | 72.3 | 55.5 | 44.7 | 56.5 | 55.6 | 15.4 | 82.4 | 71.8 | 59.2 |
50
+ | ID | Mistral-7B-Instruct-v0.3 | 78.8 | 40.7 | 40.3 | 49.9 | 57.9 | 15.7 | 71.8 | 59.6 | 34.5 |
51
+ | | | | | | | | | | | |
52
+ | VI | gemma-2-9b-it | 64.2 | 48.1 | 50.1 | 57.2 | 59.2 | 17.2 | 92.6 | 52.4 | \- |
53
+ | VI | llama3-8b-cpt-sealionv2-instruct | 54.1 | 57.1 | 22.0 | 58.6 | 59.0 | 18.3 | 87.8 | 52.4 | \- |
54
+ | VI | SeaLLM3-7B-Chat | 51.4 | 48.7 | 27.6 | 55.1 | 57.6 | 16.4 | 89.4 | 54.5 | \- |
55
+ | VI | Qwen2-7B-Instruct | 61.9 | 43.2 | 38.4 | 52.0 | 57.0 | 13.1 | 88.6 | 60.0 | \- |
56
+ | VI | aya-23-8B | 42.1 | 73.7 | 21.2 | 56.7 | 57.0 | 22.4 | 86.8 | 50.8 | \- |
57
+ | VI | Meta-Llama-3.1-8B-Instruct | 61.4 | 63.5 | 7.0 | 55.9 | 60.1 | 18.8 | 78.4 | 33.2 | \- |
58
+ | VI | Sailor-7B-Chat | 13.1 | 31.0 | 30.7 | 58.9 | 59.0 | 11.8 | 85.8 | 49.2 | \- |
59
+ | VI | Meta-Llama-3-8B-Instruct | 70.4 | 35.4 | 20.9 | 48.4 | 52.9 | 9.6 | 83.0 | 41.1 | \- |
60
+ | VI | Mistral-7B-Instruct-v0.3 | 51.0 | 36.1 | 41.3 | 36.9 | 49.1 | 13.2 | 69.6 | 34.7 | \- |
61
+ | | | | | | | | | | | |
62
+ | TH | gemma-2-9b-it | 49.0 | 76.3 | 65.5 | 43.5 | 56.5 | 25.8 | 90.4 | 38.9 | \- |
63
+ | TH | llama3-8b-cpt-sealionv2-instruct | 52.5 | 72.4 | 38.3 | 44.8 | 56.0 | 18.7 | 85.8 | 48.8 | \- |
64
+ | TH | Qwen2-7B-Instruct | 50.9 | 39.5 | 65.9 | 37.0 | 52.6 | 21.3 | 88.0 | 47.4 | \- |
65
+ | TH | SeaLLM3-7B-Chat | 40.2 | 45.0 | 55.5 | 41.8 | 54.6 | 23.3 | 90.2 | 36.4 | \- |
66
+ | TH | Sailor-7B-Chat | 48.1 | 31.4 | 33.1 | 44.3 | 56.0 | 15.2 | 85.6 | 45.3 | \- |
67
+ | TH | Meta-Llama-3.1-8B-Instruct | 32.5 | 82.2 | 25.5 | 39.7 | 55.5 | 24.9 | 73.4 | 6.2 | \- |
68
+ | TH | Meta-Llama-3-8B-Instruct | 38.8 | 68.6 | 48.6 | 35.0 | 47.7 | 14.2 | 78.2 | 54.3 | \- |
69
+ | TH | Mistral-7B-Instruct-v0.3 | 45.9 | 29.8 | 55.6 | 22.9 | 41.8 | 18.7 | 59.2 | 41.7 | \- |
70
+ | TH | aya-23-8B | 28.8 | 43.3 | 27.6 | 19.1 | 40.3 | 19.5 | 50.6 | 33.6 | \- |
71
+ | | | | | | | | | | | |
72
+ | TA | gemma-2-9b-it | 97.7 | 39.0 | 74.0 | 42.1 | 53.8 | 13.4 | 89.2 | 38.3 | \- |
73
+ | TA | llama3-8b-cpt-sealionv2-instruct | 97.2 | 29.4 | 58.0 | 45.6 | 53.2 | 6.4 | 76.8 | 34.5 | \- |
74
+ | TA | Meta-Llama-3.1-8B-Instruct | 88.5 | 51.9 | 70.0 | 37.8 | 51.5 | 8.8 | 56.6 | 30.8 | \- |
75
+ | TA | SeaLLM3-7B-Chat | 91.7 | 31.8 | 65.0 | 32.9 | 42.2 | 9.4 | 51.8 | 34.6 | \- |
76
+ | TA | Qwen2-7B-Instruct | 86.4 | 25.1 | 58.0 | 22.4 | 37.0 | 8.9 | 57.6 | 37.2 | \- |
77
+ | TA | Meta-Llama-3-8B-Instruct | 67.4 | 20.9 | 66.0 | 33.9 | 40.2 | 1.0 | 58.6 | 41.3 | \- |
78
+ | TA | aya-23-8B | 41.7 | 41.9 | 51.0 | 18.1 | 37.2 | 7.2 | 43.4 | 40.6 | \- |
79
+ | TA | Sailor-7B-Chat | 32.7 | 17.5 | 54.0 | 19.5 | 31.3 | 7.9 | 11.0 | 0.0 | \- |
80
+ | TA | Mistral-7B-Instruct-v0.3 | 0.0 | 13.8 | 53.0 | 15.6 | 22.5 | 7.4 | 14.2 | 0.8 | \- |
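For context, the Eng>Lang and Lang>Eng columns above report ChrF++; a minimal sketch of how such scores can be computed with `sacrebleu` (the hypothesis and reference strings below are made up for illustration) is:

```python
# Minimal sketch: ChrF++ is chrF with word bigrams, i.e. sacrebleu's CHRF
# metric with word_order=2. Strings here are illustrative only.
from sacrebleu.metrics import CHRF

hypotheses = ["Cuaca hari ini sangat cerah.", "Saya suka nasi goreng."]      # model outputs (made up)
references = [["Cuaca hari ini cerah sekali.", "Saya senang nasi goreng."]]  # one reference stream

chrf_pp = CHRF(word_order=2)
print(chrf_pp.corpus_score(hypotheses, references).score)  # corpus-level ChrF++ (0-100)
```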
81
 
82
 
83
  #### Instruction-following Capabilities
 
89
 
90
  IFEval evaluates a model's ability to adhere to constraints provided in the prompt, for example beginning a response with a specific word/phrase or answering with a certain number of sections. The metric used is accuracy normalized by language (if the model performs the task correctly but responds in the wrong language, it is judged to have failed the task).
91
 
92
+ #### Instruction-following Capabilities (IFEval)
93
+ | **Model** | **Indonesian** | **Vietnamese** | **English** |
94
+ |----------------------------------|:------------------------------------:|:------------------------------------:|:---------------------------------:|
95
+ | gemma-2-9b-it | 0.88 | 0.77 | 0.85 |
96
+ | Meta-Llama-3.1-8B-Instruct | 0.68 | 0.68 | 0.85 |
97
+ | Qwen2-7B-Instruct | 0.63 | 0.65 | 0.70 |
98
+ | llama3-8b-cpt-sealionv2-instruct | 0.61 | 0.66 | 0.70 |
99
+ | aya-23-8B | 0.58 | 0.56 | 0.67 |
100
+ | SeaLLMs-v3-7B-Chat | 0.55 | 0.52 | 0.67 |
101
+ | Mistral-7B-Instruct-v0.3 | 0.43 | 0.39 | 0.70 |
102
+ | Meta-Llama-3-8B-Instruct | 0.27 | 0.21 | 0.80 |
103
+ | Sailor-7B-Chat | 0.26 | 0.25 | 0.42 |
104
+
105
+ Note: Scores are language-normalized accuracies, i.e. models are penalised when they respond in the incorrect language even if they follow the instructions correctly.
106
+
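As a concrete reading of the note above, a minimal sketch of language-normalized accuracy (the field names are hypothetical, not BHASA/IFEval internals) is:

```python
# Minimal sketch: an example only counts as correct when the instruction is
# followed AND the response is in the expected language. Field names are hypothetical.
def language_normalized_accuracy(examples):
    correct = sum(
        1
        for ex in examples
        if ex["followed_instruction"] and ex["response_language"] == ex["expected_language"]
    )
    return correct / len(examples)

examples = [
    {"followed_instruction": True,  "response_language": "id", "expected_language": "id"},  # counts
    {"followed_instruction": True,  "response_language": "en", "expected_language": "id"},  # wrong language -> fails
    {"followed_instruction": False, "response_language": "id", "expected_language": "id"},  # not followed -> fails
]
print(language_normalized_accuracy(examples))  # 0.333...
```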
107
 
108
 
109
  **MT-Bench**
110
 
111
  MT-Bench evaluates a model's ability to engage in multi-turn (2 turns) conversations and respond in ways that align with human needs. We use `gpt-4-1106-preview` as the judge model and compare against `gpt-3.5-turbo-0125` as the baseline model. The metric used is the win rate against the baseline model. A tie is given a score of 0.5.
112
 
113
+ #### Multi-turn Capabilities (MT-Bench)
114
+ | **Model** | **Indonesian** | **Vietnamese** | **English** |
115
+ |---------------------------------|:---------------------:|:---------------------:|:----------------------:|
116
+ | gemma-2-9b-it | 0.684 | 0.674 | 0.638 |
117
+ | SeaLLMs-v3-7B-Chat | 0.583 | 0.656 | 0.429 |
118
+ | Qwen2-7B-Instruct | 0.498 | 0.556 | 0.597 |
119
+ | llama3-8b-cpt-sealionv2-instruct| 0.531 | 0.517 | 0.510 |
120
+ | Meta-Llama-3.1-8B-Instruct | 0.411 | 0.477 | 0.618 |
121
+ | aya-23-8B | 0.499 | 0.546 | 0.416 |
122
+ | Meta-Llama-3-8B-Instruct | 0.403 | 0.437 | 0.564 |
123
+ | Mistral-7B-Instruct-v0.3 | 0.347 | 0.202 | 0.524 |
124
+ | Sailor-7B-Chat | 0.290 | 0.314 | 0.190 |
125
+
126
+ Note: Scores are the weighted win rate across the reasoning, STEM, math, humanities, extraction, writing, and roleplay categories.
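A minimal sketch of the win-rate metric itself (the judge verdict labels are hypothetical) is:

```python
# Minimal sketch: win rate against the baseline model, with a tie scored as 0.5.
def win_rate(verdicts):
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(points[v] for v in verdicts) / len(verdicts)

print(win_rate(["win", "tie", "loss", "win"]))  # (1 + 0.5 + 0 + 1) / 4 = 0.625
```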
127
 
128
  ### Usage
129
  SEA-LION can be run using the 🤗 Transformers library
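For example, a minimal sketch using the `pipeline` API (the repository ID and prompt below are assumptions for illustration, not necessarily the card's canonical snippet):

```python
# Minimal sketch: chat-style generation with 🤗 Transformers.
# The model ID below is an assumption; check the model card for the exact repository name.
import torch
import transformers

model_id = "aisingapore/llama3-8b-cpt-sea-lionv2-instruct"  # assumed repo ID

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# The text-generation pipeline applies the model's chat template to message lists.
messages = [
    # Indonesian: "What is the sentiment of the following sentence: 'I like this product!'"
    {"role": "user", "content": "Apa sentimen dari kalimat berikut ini: 'Saya suka produk ini!'"},
]
outputs = pipeline(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1])  # last message is the assistant's reply
```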