xianbin committed
Commit 4bb76ef (verified) · 1 Parent(s): be1f2ae

Update README.md

Files changed (1)
  1. README.md +19 -10
README.md CHANGED
@@ -1,15 +1,25 @@
1
  ---
2
  language:
3
  - en
 
 
4
  - id
5
- - ta
6
  - th
7
- - vi
 
 
 
 
 
 
 
8
  license: gemma
 
 
9
  ---
10
  # Gemma2 9B CPT SEA-LIONv3 Instruct
11
 
12
- SEA-LION is a collection of Large Language Models (LLMs) which has been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
13
 
14
  Gemma2 9B CPT SEA-LIONv3 Instruct is a multilingual model which has been fine-tuned with around **500,000 English instruction-completion pairs** alongside a larger pool of around **1,000,000 instruction-completion pairs** from other ASEAN languages, such as Indonesian, Thai and Vietnamese.
15
 
@@ -18,7 +28,7 @@ SEA-LION stands for _Southeast Asian Languages In One Network_.
18
  - **Developed by:** Products Pillar, AI Singapore
19
  - **Funded by:** Singapore NRF
20
  - **Model type:** Decoder
21
- - **Languages:** English, Indonesian, Thai, Vietnamese, Tamil
22
  - **License:** [Gemma Community License](https://ai.google.dev/gemma/terms)
23
 
24
  ## Model Details
@@ -26,7 +36,7 @@ SEA-LION stands for _Southeast Asian Languages In One Network_.
26
  ### Model Description
27
  We performed instruction tuning in English and also in ASEAN languages such as Indonesian, Thai and Vietnamese on our [continued pre-trained Gemma2 9B CPT SEA-LIONv3](https://huggingface.co/aisingapore/gemma2-9b-cpt-sea-lionv3-base), a decoder model using the Gemma2 architecture, to create Gemma2 9B CPT SEA-LIONv3 Instruct.
28
 
29
- The model has a context length of 8192.
30
 
31
  ### Benchmark Performance
32
  We evaluated Gemma2 9B CPT SEA-LIONv3 Instruct on both general language capabilities and instruction-following capabilities.
@@ -35,10 +45,9 @@ We evaluated Gemma2 9B CPT SEA-LIONv3 Instruct on both general language capabili
35
  For the evaluation of general language capabilities, we employed the [SEA HELM (also known as BHASA) evaluation benchmark](https://arxiv.org/abs/2309.06085v2) across a variety of tasks.
36
  These tasks include Question Answering (QA), Sentiment Analysis (Sentiment), Toxicity Detection (Toxicity), Translation in both directions (Eng>Lang & Lang>Eng), Abstractive Summarization (Summ), Causal Reasoning (Causal) and Natural Language Inference (NLI).
37
 
38
- Note: SEA HELM is implemented using prompts which expect answers in a strict format. For all tasks, the model is expected to provide an answer tag from which the answer would be extracted. For tasks where options are provided, the answer should only include one of the pre-defined options. The weighted accuracy of the answers is calculated and normalisation is performed to account for baseline performance due to random chance.
39
-
40
- The evaluation was done zero-shot with native prompts and only a sample of 100-1000 instances for each dataset was used as per the setting described in the paper.
41
 
 
42
 
43
  #### Instruction-following Capabilities
44
  Since Gemma2 9B CPT SEA-LIONv3 Instruct is an instruction-following model, we also evaluated it on instruction-following capabilities with two datasets, [IFEval](https://arxiv.org/abs/2311.07911) and [MT-Bench](https://arxiv.org/abs/2306.05685).
@@ -47,12 +56,12 @@ As these two datasets were originally in English, the linguists and native speak
47
 
48
  **IFEval**
49
 
50
- IFEval evaluates a model's ability to adhere to constraints provided in the prompt, for example beginning a response with a specific word/phrase or answering with a certain number of sections. The metric used is accuracy normalized by language (if the model performs the task correctly but responds in the wrong language, it is judged to have failed the task).
51
 
52
 
53
  **MT-Bench**
54
 
55
- MT-Bench evaluates a model's ability to engage in multi-turn (2 turns) conversations and respond in ways that align with human needs. We use `gpt-4-1106-preview` as the judge model and compare against `gpt-3.5-turbo-0125` as the baseline model. The metric used is the weighted win rate against the baseline model (i.e. average win rate across each category (Math, Reasoning, STEM, Humanities, Roleplay, Writing, Extraction)). A tie is given a score of 0.5.
56
 
57
 
58
  For more details on Gemma2 9B CPT SEA-LIONv3 Instruct benchmark performance, please refer to the SEA HELM leaderboard, https://leaderboard.sea-lion.ai/
 
1
  ---
2
  language:
3
  - en
4
+ - zh
5
+ - vi
6
  - id
 
7
  - th
8
+ - fil
9
+ - ta
10
+ - ms
11
+ - km
12
+ - lo
13
+ - my
14
+ - jv
15
+ - su
16
  license: gemma
17
+ base_model:
18
+ - aisingapore/gemma2-9b-cpt-sea-lionv3-base
19
  ---
20
  # Gemma2 9B CPT SEA-LIONv3 Instruct
21
 
22
+ SEA-LION is a collection of Large Language Models (LLMs) which have been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
23
 
24
  Gemma2 9B CPT SEA-LIONv3 Instruct is a multilingual model which has been fine-tuned with around **500,000 English instruction-completion pairs** alongside a larger pool of around **1,000,000 instruction-completion pairs** from other ASEAN languages, such as Indonesian, Thai and Vietnamese.
25
 
 
28
  - **Developed by:** Products Pillar, AI Singapore
29
  - **Funded by:** Singapore NRF
30
  - **Model type:** Decoder
31
+ - **Languages:** English, Chinese, Vietnamese, Indonesian, Thai, Filipino, Tamil, Malay, Khmer, Lao, Burmese, Javanese, Sundanese
32
  - **License:** [Gemma Community License](https://ai.google.dev/gemma/terms)
33
 
34
  ## Model Details
 
36
  ### Model Description
37
  We performed instruction tuning in English and also in ASEAN languages such as Indonesian, Thai and Vietnamese on our [continued pre-trained Gemma2 9B CPT SEA-LIONv3](https://huggingface.co/aisingapore/gemma2-9b-cpt-sea-lionv3-base), a decoder model using the Gemma2 architecture, to create Gemma2 9B CPT SEA-LIONv3 Instruct.
38
 
39
+ For tokenisation, the model employs the default tokenizer used in Gemma-2-9B. The model has a context length of 8192.
40
 
41
  ### Benchmark Performance
42
  We evaluated Gemma2 9B CPT SEA-LIONv3 Instruct on both general language capabilities and instruction-following capabilities.
 
45
  For the evaluation of general language capabilities, we employed the [SEA HELM (also known as BHASA) evaluation benchmark](https://arxiv.org/abs/2309.06085v2) across a variety of tasks.
46
  These tasks include Question Answering (QA), Sentiment Analysis (Sentiment), Toxicity Detection (Toxicity), Translation in both directions (Eng>Lang & Lang>Eng), Abstractive Summarization (Summ), Causal Reasoning (Causal) and Natural Language Inference (NLI).
47
 
48
+ Note: SEA HELM is implemented using prompts to elicit answers in a strict format. For all tasks, the model is expected to provide an answer tag from which the answer is automatically extracted. For tasks where options are provided, the answer should comprise one of the pre-defined options. The scores for each task are normalised to account for baseline performance due to random chance.
 
 
49
 
50
+ The evaluation was done **zero-shot** with native prompts on a sample of 100-1000 instances for each dataset.
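
The note above says that task scores are normalised against random-chance performance, but it does not give the formula. The following is a minimal sketch of one common way to do this for a multiple-choice task. The function name `normalise_score`, the uniform-guessing baseline of `1 / num_options`, the clipping at zero and the 0-100 scale are all assumptions made for illustration, not the SEA HELM implementation.

```python
def normalise_score(raw_accuracy: float, num_options: int) -> float:
    """Rescale accuracy so that random guessing maps to 0 and a perfect score to 100.

    Assumption: the random-chance baseline is uniform guessing over
    `num_options` pre-defined options. The benchmark may define it differently.
    """
    if num_options < 2:
        raise ValueError("num_options must be at least 2")
    if not 0.0 <= raw_accuracy <= 1.0:
        raise ValueError("raw_accuracy must lie in [0, 1]")

    chance = 1.0 / num_options
    rescaled = (raw_accuracy - chance) / (1.0 - chance)
    # Scores below chance are clipped to 0 (an illustrative choice).
    return max(0.0, rescaled) * 100.0


if __name__ == "__main__":
    # A binary task at 75% raw accuracy sits halfway between chance and perfect.
    print(normalise_score(0.75, num_options=2))
```

Under this scheme a model that only matches chance on a binary task scores 0 rather than 50, which keeps tasks with different numbers of options comparable.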
51
 
52
  #### Instruction-following Capabilities
53
  Since Gemma2 9B CPT SEA-LIONv3 Instruct is an instruction-following model, we also evaluated it on instruction-following capabilities with two datasets, [IFEval](https://arxiv.org/abs/2311.07911) and [MT-Bench](https://arxiv.org/abs/2306.05685).
 
56
 
57
  **IFEval**
58
 
59
+ IFEval evaluates a model's ability to adhere to constraints provided in the prompt, for example beginning a response with a specific word/phrase or answering with a certain number of sections. Additionally, accuracy is normalized by the proportion of responses in the correct language (if the model performs the task correctly but responds in the wrong language, it is judged to have failed the task).
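
The rule described above, where a response in the wrong language counts as a failure even if the constraint was met, can be sketched as follows. This is an illustration only: the `IFEvalResult` record and the `language_normalised_accuracy` function are hypothetical names, and how the constraint check and language detection are actually performed is not specified here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class IFEvalResult:
    """Hypothetical per-response record used only for this sketch."""
    followed_instruction: bool  # the prompt's constraint was satisfied
    correct_language: bool      # the response is in the expected language


def language_normalised_accuracy(results: Sequence[IFEvalResult]) -> float:
    """Fraction of responses that satisfy the constraint AND use the right language."""
    if not results:
        raise ValueError("results must not be empty")
    passed = sum(
        1 for r in results if r.followed_instruction and r.correct_language
    )
    return passed / len(results)


if __name__ == "__main__":
    sample = [
        IFEvalResult(True, True),
        IFEvalResult(True, False),   # correct behaviour, wrong language: fails
        IFEvalResult(False, True),
        IFEvalResult(True, True),
    ]
    print(language_normalised_accuracy(sample))
```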
60
 
61
 
62
  **MT-Bench**
63
 
64
+ MT-Bench evaluates a model's ability to engage in multi-turn (2 turns) conversations and respond in ways that align with human needs. We use `gpt-4-1106-preview` as the judge model and compare against `gpt-3.5-turbo-0125` as the baseline model. The metric used is the weighted win rate against the baseline model (i.e. average win rate across each category: Math, Reasoning, STEM, Humanities, Roleplay, Writing, Extraction). A tie is given a score of 0.5.
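
As a rough illustration of the metric described above, the sketch below computes a per-category win rate (win = 1, tie = 0.5, loss = 0) and then averages across categories, so that each category carries equal weight regardless of how many prompts it contains. The function name `weighted_win_rate` and the `(category, outcome)` input format are assumptions for this example, not the evaluation harness itself.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable, Tuple

# A tie is scored 0.5, as stated in the description above.
OUTCOME_SCORES = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def weighted_win_rate(judgements: Iterable[Tuple[str, str]]) -> float:
    """Average the per-category win rates against the baseline model.

    `judgements` is an iterable of (category, outcome) pairs, where outcome
    is "win", "tie" or "loss" from the judge model's point of view.
    """
    per_category = defaultdict(list)
    for category, outcome in judgements:
        per_category[category].append(OUTCOME_SCORES[outcome])
    if not per_category:
        raise ValueError("judgements must not be empty")
    # Mean within each category first, then mean across categories.
    return mean(mean(scores) for scores in per_category.values())


if __name__ == "__main__":
    judgements = [
        ("Math", "win"), ("Math", "win"), ("Math", "win"),
        ("Writing", "loss"),
    ]
    # The category average differs from the pooled win rate of 0.75.
    print(weighted_win_rate(judgements))
```

Averaging per category first means a category with many prompts cannot dominate the final score.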
65
 
66
 
67
  For more details on Gemma2 9B CPT SEA-LIONv3 Instruct benchmark performance, please refer to the SEA HELM leaderboard, https://leaderboard.sea-lion.ai/