lukecq committed
Commit 8c8300c · 1 Parent(s): 60867e4

update about page

Files changed (1):
  1. src/display/about.py +8 -3
src/display/about.py CHANGED
@@ -28,7 +28,6 @@ This leaderboard is specifically designed to evaluate large language models (LLM
 For additional details such as datasets, evaluation criteria, and reproducibility, please refer to the "📝 About" tab.
 
 Stay tuned for the *SeaBench leaderboard* - focusing on evaluating the model's ability to respond to general human instructions in real-world multi-turn settings.
-
 """
 
 # Which evaluations are you running? how can people reproduce what you have?
@@ -42,7 +41,7 @@ The benchmark data can be found in the [SeaExam dataset](https://huggingface.co/
 - [**MMLU**](https://arxiv.org/abs/2009.03300): a test to measure a text model's multitask accuracy in English. The test covers 57 tasks. We sample 50 questions from each task and translate the data into the other 4 languages with Google Translate.
 
 ## Evaluation Criteria
-We evaluate the models with an accuracy score. The leaderboard is sorted by the average score across the SEA languages (id, th, and vi).
+We evaluate the models with an accuracy score.
 
 We have the following settings for evaluation:
 - **few-shot**: the default setting is few-shot (3-shot). All open-source models are evaluated with 3-shot.
@@ -50,7 +49,11 @@ We have the following settings for evaluation:
 
 
 ## Results
-You can find the detailed numerical results in the results Hugging Face dataset: https://huggingface.co/datasets/SeaLLMs/SeaExam-results
+How to interpret the leaderboard?
+* Each numerical value represents the accuracy (%).
+* The "M3Exam" and "MMLU" pages show the performance of each model on that dataset.
+* The "🏅 Overall" page shows the average results of "M3Exam" and "MMLU".
+* The leaderboard is sorted by avg_sea, the average score across the SEA languages (id, th, and vi).
 
 ## Reproducibility
 To reproduce our results, use the script in [this repo](https://github.com/DAMO-NLP-SG/SeaExam/tree/main). The script will download the model and tokenizer, and evaluate the model on the benchmark data.
@@ -60,6 +63,8 @@ python scripts/main.py --model $model_name_or_path
 
 """
 
+# You can find the detailed numerical results in the results Hugging Face dataset: https://huggingface.co/datasets/SeaLLMs/SeaExam-results
+
 EVALUATION_QUEUE_TEXT = """
 ## Some good practices before submitting a model
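For readers skimming the updated "Results" wording, the sketch below shows how the leaderboard values described there can be derived from per-language accuracies. It is a minimal illustration, not code from the SeaExam repository: the function names and the example numbers are made up; only the averaging rules (avg_sea over id/th/vi, "🏅 Overall" as the average of M3Exam and MMLU) come from the text above.

```python
# Minimal sketch (not from the SeaExam repo): derive the leaderboard columns
# described in the "Results" section from per-language accuracy scores.
# All numbers below are made-up placeholders for illustration only.
from statistics import mean

# Hypothetical per-language accuracy (%) for one model on each benchmark.
m3exam = {"en": 72.1, "zh": 65.4, "id": 58.3, "th": 52.7, "vi": 55.9}
mmlu   = {"en": 68.0, "zh": 60.2, "id": 54.1, "th": 49.8, "vi": 51.5}

SEA_LANGS = ("id", "th", "vi")  # languages the leaderboard averages over


def avg_sea(scores: dict[str, float]) -> float:
    """Average accuracy across the SEA languages (id, th, vi)."""
    return mean(scores[lang] for lang in SEA_LANGS)


# "🏅 Overall" is described as the average of the M3Exam and MMLU results;
# the leaderboard is sorted by this SEA-language average.
overall_sea = mean([avg_sea(m3exam), avg_sea(mmlu)])

print(f"M3Exam avg_sea: {avg_sea(m3exam):.1f}")
print(f"MMLU   avg_sea: {avg_sea(mmlu):.1f}")
print(f"Overall avg_sea (sort key): {overall_sea:.1f}")
```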