Commit 701776a · Parent: 346cd10
src/about.py · +5 -12 · CHANGED
@@ -30,31 +30,24 @@ TITLE = """<h1 align="center" id="space-title">HunEval leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
- The HunEval leaderboard
- primary components: (1) linguistic comprehension tasks, which aim to gauge a model's ability to interpret and process Hungarian text; and (2) knowledge-based tasks that examine a model's familiarity
- with Hungarian cultural and linguistic phenomena. The benchmark is comprised of multiple sub-tasks, each targeting a distinct aspect of the model's performance.
+ The HunEval leaderboard aims to evaluate models based on their proficiency in understanding and processing the Hungarian language. This benchmark focuses on two key areas: (1) linguistic comprehension, which measures a model's ability to interpret Hungarian text, and (2) knowledge-based tasks, which assess a model's familiarity with Hungarian history and cultural aspects. The benchmark includes multiple sub-tasks, each targeting a different facet of language understanding and knowledge.

-
- demanding for models without prior training on Hungarian data. As such, we anticipate that models trained on Hungarian datasets will perform well on the benchmark, whereas those lacking this experience
- may encounter difficulties. Notwithstanding, a model's strong performance on the benchmark does not imply expertise in a specific task; rather, it indicates a proficiency in understanding Hungarian
- language and its structures.
+ My goal was to create tasks that are straightforward for native Hungarian speakers or individuals with deep familiarity with the language, but potentially challenging for models without specific training on Hungarian data. I expect models trained on Hungarian datasets to perform well, while those without such training may struggle. A strong performance on this benchmark indicates proficiency in Hungarian language structures and knowledge, but not necessarily expertise in specific tasks.

- **
+ **Please note that this benchmark is a Proof of Concept and not a comprehensive evaluation of a model's capabilities.** I invite participants to engage with the benchmark and provide feedback for future improvements.
"""

# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = """
## How it works
The benchmark is divided into several tasks, including history and logic (testing the models' knowledge) as well as grammar, sayings, spelling, and vocabulary (testing the models' language-understanding capabilities). Each task contains an instruction or question and a set of four possible answers. The model is given a system
- prompt, which aims to add CoT reasoning before providing an answer. This makes the improves the results for most of the models, while also making the benchmark more consistent. An answer is considered correct if it matches the correct answer in the set of possible answers. The task is given to the model three times. If it answers correctly at least once, it is considered correct. The final score is the number of correct answers divided by the number of tasks.
-
- To run the evaluation, we gave the model 2048 tokens to generate the answer and 0.0 was used as the temperature.
+ prompt that elicits CoT reasoning before the final answer is given. This improves the results for most of the models, while also making the benchmark more consistent. An answer is considered correct if it matches the correct answer in the set of possible answers. Each task is given to the model three times; if it answers correctly at least once, the task is counted as correct. The final score is the number of correct tasks divided by the total number of tasks. To run the evaluation, I gave the model 2048 tokens to generate its answer and used a temperature of 0.0.

## Reproducing the results
TODO

## Evaluation
- In the current version of the benchmark, some models, (ones that were most likely trained on Hungarian data) perform very well, while others, (ones that were not trained on Hungarian data) perform poorly. This may indicate that in the future, more challenging tasks should be added to the benchmark to make it more
+ In the current version of the benchmark, some models (ones that were most likely trained on Hungarian data) perform very well (maybe a bit too well?), while others (ones that were not trained on Hungarian data) perform poorly. This may indicate that in the future, more challenging tasks should be added to the benchmark to make it a more accurate representation of the models' capabilities in Hungarian language understanding and knowledge.
"""

EVALUATION_QUEUE_TEXT = """
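The scoring described in the "How it works" text above can be summarized in a short Python sketch. This is an illustration under stated assumptions, not code from this Space: the generate and extract_choice helpers, the task dictionary layout, and the constant names are made up, while the three attempts per task, the correct-at-least-once rule, the correct-tasks-over-total-tasks score, the 2048-token budget, and the 0.0 temperature come from the text.

# A minimal sketch of the scoring scheme described in "How it works" above.
# The helpers, the task fields, and the constant names are illustrative
# assumptions, not code from this Space.

MAX_NEW_TOKENS = 2048   # generation budget per attempt (from the text)
TEMPERATURE = 0.0       # greedy decoding (from the text)
ATTEMPTS = 3            # each task is posed three times (from the text)


def generate(system_prompt: str, question: str,
             max_new_tokens: int, temperature: float) -> str:
    """Hypothetical model call; replace with the actual inference backend."""
    raise NotImplementedError


def extract_choice(completion: str) -> str:
    """Hypothetical parser that pulls the chosen option (e.g. 'A'-'D') out of the CoT output."""
    raise NotImplementedError


def task_is_solved(system_prompt: str, task: dict) -> bool:
    """A task counts as correct if at least one of the three attempts matches the answer."""
    for _ in range(ATTEMPTS):
        completion = generate(system_prompt, task["question"],
                              MAX_NEW_TOKENS, TEMPERATURE)
        if extract_choice(completion) == task["answer"]:
            return True
    return False


def benchmark_score(system_prompt: str, tasks: list[dict]) -> float:
    """Final score: the number of correct tasks divided by the number of tasks."""
    solved = sum(task_is_solved(system_prompt, t) for t in tasks)
    return solved / len(tasks)


# Hypothetical usage with a made-up task record (the real dataset format may differ).
# The question asks: "In which year was the Battle of Mohács?"
example_tasks = [{
    "question": "Melyik évben volt a mohácsi csata? A) 1456  B) 1526  C) 1541  D) 1686",
    "answer": "B",
}]

With the temperature fixed at 0.0, the three attempts mainly buffer against backend nondeterminism and answer-formatting variation rather than sampling diversity.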