update about page

src/display/about.py (CHANGED, +8 -3)
@@ -28,7 +28,6 @@ This leaderboard is specifically designed to evaluate large language models (LLM
 For additional details such as datasets, evaluation criteria, and reproducibility, please refer to the "About" tab.
 
 Stay tuned for the *SeaBench leaderboard* - focusing on evaluating the model's ability to respond to general human instructions in real-world multi-turn settings.
-
 """
 
 # Which evaluations are you running? how can people reproduce what you have?
@@ -42,7 +41,7 @@ The benchmark data can be found in the [SeaExam dataset](https://huggingface.co/
 - [**MMLU**](https://arxiv.org/abs/2009.03300): a test to measure a text model's multitask accuracy in English. The test covers 57 tasks. We sample 50 questions from each task and translate the data into the other 4 languages with Google Translate.
 
 ## Evaluation Criteria
-We evaluate the models with accuracy scores.
+We evaluate the models with accuracy scores.
 
 We have the following settings for evaluation:
 - **few-shot**: the default setting is few-shot (3-shot). All open-source models are evaluated with 3-shot.
@@ -50,7 +49,11 @@ We have the following settings for evaluation:
 
 
 ## Results
-
+How to interpret the leaderboard?
+* Each numerical value represents the accuracy (%).
+* The "M3Exam" and "MMLU" pages show the performance of each model for that dataset.
+* The "Overall" page shows the average results of "M3Exam" and "MMLU".
+* The leaderboard is sorted by avg_sea, the average score across the SEA languages (id, th, and vi).
 
 ## Reproducibility
 To reproduce our results, use the script in [this repo](https://github.com/DAMO-NLP-SG/SeaExam/tree/main). The script will download the model and tokenizer, and evaluate the model on the benchmark data.
@@ -60,6 +63,8 @@ python scripts/main.py --model $model_name_or_path
 
 """
 
+# You can find the detailed numerical results in the results Hugging Face dataset: https://huggingface.co/datasets/SeaLLMs/SeaExam-results
+
 EVALUATION_QUEUE_TEXT = """
 ## Some good practices before submitting a model
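The "Evaluation Criteria" text above scores models by accuracy under a 3-shot prompt. Below is a minimal sketch of that kind of scoring loop, assuming a hypothetical `generate` callable and a toy multiple-choice item layout; none of these names come from the SeaExam code.

```python
from typing import Callable

# Hypothetical item layout: a question, labeled options, and a gold answer letter.
EXEMPLARS = [
    {"question": "2 + 2 = ?", "options": {"A": "3", "B": "4", "C": "5", "D": "6"}, "answer": "B"},
    {"question": "The capital of Thailand is?", "options": {"A": "Hanoi", "B": "Jakarta", "C": "Bangkok", "D": "Manila"}, "answer": "C"},
    {"question": "Water freezes at __ degrees Celsius.", "options": {"A": "0", "B": "10", "C": "50", "D": "100"}, "answer": "A"},
]


def format_item(item: dict, with_answer: bool) -> str:
    """Render one multiple-choice question as prompt text."""
    lines = [item["question"]]
    lines += [f"{key}. {value}" for key, value in item["options"].items()]
    lines.append(f"Answer: {item['answer']}" if with_answer else "Answer:")
    return "\n".join(lines)


def accuracy_3shot(items: list[dict], generate: Callable[[str], str]) -> float:
    """Score a model callable with a 3-shot prompt; returns accuracy in [0, 1]."""
    shots = "\n\n".join(format_item(ex, with_answer=True) for ex in EXEMPLARS)
    correct = 0
    for item in items:
        prompt = shots + "\n\n" + format_item(item, with_answer=False)
        prediction = generate(prompt).strip()[:1].upper()  # take the first character as the answer letter
        correct += prediction == item["answer"]
    return correct / len(items)


if __name__ == "__main__":
    # Stub "model" that always answers A, just to keep the sketch runnable.
    test_items = [{"question": "1 + 1 = ?", "options": {"A": "2", "B": "3", "C": "4", "D": "5"}, "answer": "A"}]
    print(accuracy_3shot(test_items, generate=lambda prompt: "A"))
```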
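The new "Results" bullets say the board is sorted by avg_sea, the mean accuracy over id, th, and vi. A small pandas sketch of that aggregation follows; the column names and numbers are illustrative, not taken from the leaderboard code.

```python
import pandas as pd

# Illustrative per-language accuracy (%) table; the columns are assumptions.
df = pd.DataFrame(
    {
        "model": ["model-a", "model-b"],
        "en": [68.2, 71.5],
        "zh": [61.0, 66.3],
        "id": [55.4, 60.1],
        "th": [48.9, 52.7],
        "vi": [53.2, 58.8],
    }
)

# avg_sea = mean accuracy over the three SEA languages, then sort descending.
df["avg_sea"] = df[["id", "th", "vi"]].mean(axis=1).round(2)
leaderboard = df.sort_values("avg_sea", ascending=False).reset_index(drop=True)
print(leaderboard)
```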
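The reproducibility hunk keeps the command `python scripts/main.py --model $model_name_or_path` from the SeaExam repo. A thin wrapper to run that command over several checkpoints could look like the following; the script path comes from the diff, while the model list and the sequential loop are assumptions.

```python
import subprocess

# Candidate checkpoints to evaluate; the names are purely illustrative.
MODELS = [
    "meta-llama/Llama-2-7b-hf",
    "SeaLLMs/SeaLLM-7B-v2",
]

for model in MODELS:
    # Mirrors the command from the SeaExam repo: python scripts/main.py --model <name_or_path>
    subprocess.run(
        ["python", "scripts/main.py", "--model", model],
        check=True,
    )
```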
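The added comment points readers to the SeaLLMs/SeaExam-results dataset for the detailed numbers. One way to pull it locally is with the `datasets` library; the default config is assumed here, so check the dataset card if loading fails.

```python
from datasets import load_dataset

# Download the published results and inspect which splits and columns are available.
results = load_dataset("SeaLLMs/SeaExam-results")
print(results)
```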