Update README.md
README.md (CHANGED)

SEA-LION stands for _Southeast Asian Languages In One Network_.

## Model Details

### Base model
We performed instruction tuning in English and also in ASEAN languages such as Indonesian, Thai and Vietnamese on our [continued pre-trained Llama3 CPT 8B SEA-LIONv2](https://huggingface.co/aisingapore/llama3-8b-cpt-sea-lionv2-base), a decoder model using the Llama3 architecture, to create Llama3 8B SEA-LIONv2 Instruct.

### Benchmark Performance
We evaluated Llama3 8B SEA-LIONv2 Instruct on both general language capabilities and instruction-following capabilities.

**IFEval**

IFEval evaluates a model's ability to adhere to constraints provided in the prompt, for example beginning a response with a specific word/phrase or answering with a certain number of sections. The metric used is accuracy normalized by language (if the model performs the task correctly but responds in the wrong language, it is judged to have failed the task).

| **Model**                         | **Indonesian (%)** | **Vietnamese (%)** | **English (%)** |
|:---------------------------------:|:------------------:|:------------------:|:---------------:|
| gemma-2-9b-it                     | 87.62              | 77.14              | 84.76           |
| Meta-Llama-3.1-8B-Instruct        | 67.62              | 67.62              | 84.76           |
| Qwen2-7B-Instruct                 | 62.86              | 64.76              | 70.48           |
| llama3-8b-cpt-sea-lionv2-instruct | 60.95              | 65.71              | 69.52           |
| aya-23-8B                         | 58.10              | 56.19              | 66.67           |
| SeaLLMs-v3-7B-Chat                | 55.24              | 52.38              | 66.67           |
| Mistral-7B-Instruct-v0.3          | 42.86              | 39.05              | 69.52           |
| Meta-Llama-3-8B-Instruct          | 26.67              | 20.95              | 80.00           |
| Sailor-7B-Chat                    | 25.71              | 24.76              | 41.90           |
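
To make the scoring rule concrete, the sketch below shows one way the language-normalized accuracy described above could be computed. It is only an illustration: the record fields and function name are hypothetical, and this is not the official IFEval harness.

```python
# Minimal sketch of language-normalized accuracy: a response only scores if it
# satisfies the prompt's constraints AND is written in the expected language.
# The field names here are illustrative stand-ins, not the real IFEval schema.
def language_normalized_accuracy(records, expected_lang):
    passed = sum(
        1 for r in records
        if r["follows_constraints"] and r["detected_lang"] == expected_lang
    )
    return 100.0 * passed / len(records)

# Toy data: two responses satisfy the constraints, but one is in the wrong
# language, so only one of the three counts.
records = [
    {"follows_constraints": True,  "detected_lang": "id"},
    {"follows_constraints": True,  "detected_lang": "en"},
    {"follows_constraints": False, "detected_lang": "id"},
]
print(round(language_normalized_accuracy(records, "id"), 2))  # 33.33
```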

**MT-Bench**

MT-Bench evaluates a model's ability to engage in multi-turn (2 turns) conversations and respond in ways that align with human needs. We use `gpt-4-1106-preview` as the judge model and compare against `gpt-3.5-turbo-0125` as the baseline model. The metric used is the weighted win rate against the baseline model, i.e. the win rate averaged across the categories Math, Reasoning, STEM, Humanities, Roleplay, Writing and Extraction. A tie is given a score of 0.5.

| **Model**                         | **Indonesian (%)** | **Vietnamese (%)** | **English (%)** |
|:---------------------------------:|:------------------:|:------------------:|:---------------:|
| gemma-2-9b-it                     | 68.35              | 67.36              | 63.84           |
| SeaLLMs-v3-7B-Chat                | 58.33              | 65.56              | 42.94           |
| Qwen2-7B-Instruct                 | 49.78              | 55.65              | 59.68           |
| llama3-8b-cpt-sea-lionv2-instruct | 53.13              | 51.68              | 51.00           |
| Meta-Llama-3.1-8B-Instruct        | 41.09              | 47.69              | 61.79           |
| aya-23-8B                         | 49.90              | 54.61              | 41.63           |
| Meta-Llama-3-8B-Instruct          | 40.29              | 43.69              | 56.38           |
| Mistral-7B-Instruct-v0.3          | 34.74              | 20.24              | 52.40           |
| Sailor-7B-Chat                    | 29.05              | 31.39              | 18.98           |
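
As a concrete illustration of this weighted win rate, here is a minimal sketch under the assumptions stated above; the function, the category subset and the outcome data are invented for the example and do not come from the official MT-Bench tooling.

```python
# Minimal sketch of the weighted win rate: the win rate against the baseline is
# computed per category (a tie counts as 0.5 of a win) and then averaged across
# categories. The outcomes below are made-up data for three of the seven categories.
def weighted_win_rate(outcomes_by_category):
    per_category = []
    for outcomes in outcomes_by_category.values():
        score = sum({"win": 1.0, "tie": 0.5, "loss": 0.0}[o] for o in outcomes)
        per_category.append(score / len(outcomes))
    return 100.0 * sum(per_category) / len(per_category)

outcomes = {
    "Math":      ["win", "loss"],
    "Reasoning": ["tie", "win"],
    "STEM":      ["win", "win"],
}
print(weighted_win_rate(outcomes))  # category rates of 50%, 75%, 100% average to 75.0
```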

### Usage
SEA-LION can be run using the 🤗 Transformers library:

```python
import transformers
import torch

model_id = "aisingapore/llama3-8b-cpt-sea-lionv2-instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    # assumed, typical settings below; adjust the dtype and device to your hardware
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
```
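
For a quick check that the pipeline works, a chat-style message list can be passed to it directly. This assumes a reasonably recent 🤗 Transformers release that accepts chat messages in the text-generation pipeline; the prompt and `max_new_tokens` value are illustrative only.

```python
# Illustrative only: an arbitrary Indonesian prompt ("What is SEA-LION? Answer
# in Indonesian.") sent through the pipeline defined above.
messages = [
    {"role": "user", "content": "Apa itu SEA-LION? Jawab dalam bahasa Indonesia."},
]

outputs = pipeline(messages, max_new_tokens=256)
# The returned conversation ends with the model's reply.
print(outputs[0]["generated_text"][-1])
```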