RaymondAISG committed
Commit 70a8b9f · verified · 1 Parent(s): a3714d3

Update README.md

Files changed (1)
  1. README.md +3 -23
README.md CHANGED
@@ -40,9 +40,6 @@ Note: BHASA is implemented following a strict answer format, and only spaces and
 
 The evaluation was done zero-shot with native prompts and only a sample of 100-1000 instances for each dataset was used as per the setting described in the paper.
 
-**BHASA**
-
-To be released.
 
 #### Instruction-following Capabilities
 Since LLaMa3 8B SEA-LIONv2 is an instruction-following model, we also evaluated it on instruction-following capabilities with two datasets, [IFEval](https://arxiv.org/abs/2311.07911) and [MT-Bench](https://arxiv.org/abs/2306.05685).
@@ -53,31 +50,14 @@ As these two datasets were originally in English, the linguists and native speak
 
 IFEval evaluates a model's ability to adhere to constraints provided in the prompt, for example beginning a response with a specific word/phrase or answering with a certain number of sections. The metric used is accuracy normalized by language (if the model performs the task correctly but responds in the wrong language, it is judged to have failed the task).
 
-| **Model** | **Indonesian(%)** | **Vietnamese(%)** | **English(%)** |
-|:---------------------------------:|:------------------:|:------------------:|:---------------:|
-| Meta-Llama-3.1-8B-Instruct | 67.62 | 67.62 | 84.76 |
-| Qwen2-7B-Instruct | 62.86 | 64.76 | 70.48 |
-| llama3-8b-cpt-sea-lionv2-instruct | 60.95 | 65.71 | 69.52 |
-| aya-23-8B | 58.10 | 56.19 | 66.67 |
-| SeaLLMs-v3-7B-Chat | 55.24 | 52.38 | 66.67 |
-| Mistral-7B-Instruct-v0.3 | 42.86 | 39.05 | 69.52 |
-| Meta-Llama-3-8B-Instruct | 26.67 | 20.95 | 80.00 |
-| Sailor-7B-Chat | 25.71 | 24.76 | 41.90 |
 
 **MT-Bench**
 
 MT-Bench evaluates a model's ability to engage in multi-turn (2 turns) conversations and respond in ways that align with human needs. We use `gpt-4-1106-preview` as the judge model and compare against `gpt-3.5-turbo-0125` as the baseline model. The metric used is the weighted win rate against the baseline model (i.e. average win rate across each category (Math, Reasoning, STEM, Humanities, Roleplay, Writing, Extraction)). A tie is given a score of 0.5.
 
-| **Model** | **Indonesian(%)** | **Vietnamese(%)** | **English(%)** |
-|:---------------------------------:|:-----------------:|:-----------------:|:--------------:|
-| SeaLLMs-v3-7B-Chat | 58.33 | 65.56 | 42.94 |
-| Qwen2-7B-Instruct | 49.78 | 55.65 | 59.68 |
-| llama3-8b-cpt-sea-lionv2-instruct | 53.13 | 51.68 | 51.00 |
-| Meta-Llama-3.1-8B-Instruct | 41.09 | 47.69 | 61.79 |
-| aya-23-8B | 49.90 | 54.61 | 41.63 |
-| Meta-Llama-3-8B-Instruct | 40.29 | 43.69 | 56.38 |
-| Mistral-7B-Instruct-v0.3 | 34.74 | 20.24 | 52.40 |
-| Sailor-7B-Chat | 29.05 | 31.39 | 18.98 |
+
+For more details on Llama3 8B CPT SEA-LIONv2 Instruct benchmark performance, please refer to the [SEA HELM leaderboard](https://leaderboard.sea-lion.ai/),
+https://leaderboard.sea-lion.ai/
 
 
 ### Usage
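As a side note on the scoring conventions quoted in the diff context above: the IFEval figure is accuracy normalized by language, so a response that satisfies the instruction but is written in the wrong language is scored as a failure. Below is a minimal sketch of that aggregation; the record layout and field names are illustrative assumptions, not part of the actual evaluation harness.

```python
# Minimal sketch (assumed record layout) of language-normalized accuracy:
# a response only counts as correct if it both follows the instruction
# and is written in the same language as the prompt.

def ifeval_accuracy(records: list[dict]) -> float:
    """Return accuracy (%) where wrong-language responses are scored as failures."""
    if not records:
        return 0.0
    passed = sum(
        1 for r in records if r["followed_instruction"] and r["correct_language"]
    )
    return 100.0 * passed / len(records)


# Example: two responses follow the instruction, but one of them is in the wrong language.
records = [
    {"followed_instruction": True, "correct_language": True},
    {"followed_instruction": True, "correct_language": False},
    {"followed_instruction": False, "correct_language": True},
]
print(f"{ifeval_accuracy(records):.2f}%")  # 33.33%
```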
 
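Similarly, the MT-Bench figure quoted above is a weighted win rate against the baseline: a win rate is computed per category (with a tie counted as 0.5) and the category rates are then averaged. The sketch below illustrates that calculation under the same kind of assumptions; the judgement format and sample data are hypothetical, not the judge model's actual output.

```python
from collections import defaultdict

# Rough sketch (assumed judgement format) of the weighted win rate:
# per-category win rates against the baseline, with ties worth 0.5,
# averaged across categories.

SCORES = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def weighted_win_rate(judgements: list[tuple[str, str]]) -> float:
    """Average the per-category win rates; a tie counts as half a win."""
    per_category = defaultdict(list)
    for category, outcome in judgements:
        per_category[category].append(SCORES[outcome])
    category_rates = [sum(scores) / len(scores) for scores in per_category.values()]
    return 100.0 * sum(category_rates) / len(category_rates)


# Hypothetical judgements across three of the seven MT-Bench categories.
judgements = [
    ("Math", "win"), ("Math", "loss"),
    ("Writing", "tie"), ("Writing", "win"),
    ("Reasoning", "loss"), ("Reasoning", "loss"),
]
print(f"{weighted_win_rate(judgements):.2f}%")  # (0.50 + 0.75 + 0.00) / 3 -> 41.67%
```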