RaymondAISG committed
Commit 70a8b9f · verified · 1 Parent(s): a3714d3

Update README.md

Files changed (1)
  1. README.md +3 -23
README.md CHANGED
@@ -40,9 +40,6 @@ Note: BHASA is implemented following a strict answer format, and only spaces and
 
 The evaluation was done zero-shot with native prompts and only a sample of 100-1000 instances for each dataset was used as per the setting described in the paper.
 
-**BHASA**
-
-To be released.
 
 #### Instruction-following Capabilities
 Since LLaMa3 8B SEA-LIONv2 is an instruction-following model, we also evaluated it on instruction-following capabilities with two datasets, [IFEval](https://arxiv.org/abs/2311.07911) and [MT-Bench](https://arxiv.org/abs/2306.05685).
@@ -53,31 +50,14 @@ As these two datasets were originally in English, the linguists and native speak
 
 IFEval evaluates a model's ability to adhere to constraints provided in the prompt, for example beginning a response with a specific word/phrase or answering with a certain number of sections. The metric used is accuracy normalized by language (if the model performs the task correctly but responds in the wrong language, it is judged to have failed the task).
 
-| **Model** | **Indonesian(%)** | **Vietnamese(%)** | **English(%)** |
-|:---------------------------------:|:------------------:|:------------------:|:---------------:|
-| Meta-Llama-3.1-8B-Instruct | 67.62 | 67.62 | 84.76 |
-| Qwen2-7B-Instruct | 62.86 | 64.76 | 70.48 |
-| llama3-8b-cpt-sea-lionv2-instruct | 60.95 | 65.71 | 69.52 |
-| aya-23-8B | 58.10 | 56.19 | 66.67 |
-| SeaLLMs-v3-7B-Chat | 55.24 | 52.38 | 66.67 |
-| Mistral-7B-Instruct-v0.3 | 42.86 | 39.05 | 69.52 |
-| Meta-Llama-3-8B-Instruct | 26.67 | 20.95 | 80.00 |
-| Sailor-7B-Chat | 25.71 | 24.76 | 41.90 |
 
 **MT-Bench**
 
 MT-Bench evaluates a model's ability to engage in multi-turn (2 turns) conversations and respond in ways that align with human needs. We use `gpt-4-1106-preview` as the judge model and compare against `gpt-3.5-turbo-0125` as the baseline model. The metric used is the weighted win rate against the baseline model (i.e. average win rate across each category (Math, Reasoning, STEM, Humanities, Roleplay, Writing, Extraction)). A tie is given a score of 0.5.
 
-| **Model** | **Indonesian(%)** | **Vietnamese(%)** | **English(%)** |
-|:---------------------------------:|:-----------------:|:-----------------:|:--------------:|
-| SeaLLMs-v3-7B-Chat | 58.33 | 65.56 | 42.94 |
-| Qwen2-7B-Instruct | 49.78 | 55.65 | 59.68 |
-| llama3-8b-cpt-sea-lionv2-instruct | 53.13 | 51.68 | 51.00 |
-| Meta-Llama-3.1-8B-Instruct | 41.09 | 47.69 | 61.79 |
-| aya-23-8B | 49.90 | 54.61 | 41.63 |
-| Meta-Llama-3-8B-Instruct | 40.29 | 43.69 | 56.38 |
-| Mistral-7B-Instruct-v0.3 | 34.74 | 20.24 | 52.40 |
-| Sailor-7B-Chat | 29.05 | 31.39 | 18.98 |
+
+For more details on Llama3 8B CPT SEA-LIONv2 Instruct benchmark performance, please refer to the [SEA HELM leaderboard](https://leaderboard.sea-lion.ai/),
+https://leaderboard.sea-lion.ai/
 
 
 ### Usage
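As a side note on the scoring conventions quoted in the diff context above: the IFEval figure is accuracy normalized by language, so a response that satisfies the instruction but is written in the wrong language is scored as a failure. Below is a minimal sketch of that aggregation; the record layout and field names are illustrative assumptions, not part of the actual evaluation harness.

```python
# Minimal sketch (assumed record layout) of language-normalized accuracy:
# a response only counts as correct if it both follows the instruction
# and is written in the same language as the prompt.

def ifeval_accuracy(records: list[dict]) -> float:
    """Return accuracy (%) where wrong-language responses are scored as failures."""
    if not records:
        return 0.0
    passed = sum(
        1 for r in records if r["followed_instruction"] and r["correct_language"]
    )
    return 100.0 * passed / len(records)


# Example: two responses follow the instruction, but one of them is in the wrong language.
records = [
    {"followed_instruction": True, "correct_language": True},
    {"followed_instruction": True, "correct_language": False},
    {"followed_instruction": False, "correct_language": True},
]
print(f"{ifeval_accuracy(records):.2f}%")  # 33.33%
```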
 
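Similarly, the MT-Bench figure quoted above is a weighted win rate against the baseline: a win rate is computed per category (with a tie counted as 0.5) and the category rates are then averaged. The sketch below illustrates that calculation under the same kind of assumptions; the judgement format and sample data are hypothetical, not the judge model's actual output.

```python
from collections import defaultdict

# Rough sketch (assumed judgement format) of the weighted win rate:
# per-category win rates against the baseline, with ties worth 0.5,
# averaged across categories.

SCORES = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def weighted_win_rate(judgements: list[tuple[str, str]]) -> float:
    """Average the per-category win rates; a tie counts as half a win."""
    per_category = defaultdict(list)
    for category, outcome in judgements:
        per_category[category].append(SCORES[outcome])
    category_rates = [sum(scores) / len(scores) for scores in per_category.values()]
    return 100.0 * sum(category_rates) / len(category_rates)


# Hypothetical judgements across three of the seven MT-Bench categories.
judgements = [
    ("Math", "win"), ("Math", "loss"),
    ("Writing", "tie"), ("Writing", "win"),
    ("Reasoning", "loss"), ("Reasoning", "loss"),
]
print(f"{weighted_win_rate(judgements):.2f}%")  # (0.50 + 0.75 + 0.00) / 3 -> 41.67%
```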