persian_llm_leaderboard

Behnamm committed on Aug 28, 2024

Commit ba78a38 (verified)

1 Parent(s): 04eb2c0

Update src/about.py

Files changed (1)

src/about.py

@@ -72,8 +72,8 @@ We use the given *test* subset (for those benchmarks that also have *train* and
 These benchmarks are picked for now, but several other benchmarks are going to be added later to help us perform a more thorough examination of models.
 The last two benchmarks, ParsiNLU NLI and ParsiNLU QQP are evaluated in different few-shot settings and then the maximum score is returned as the final evaluation.
-We argue that this is indeed a fair evaluation scheme since many light-weight models (around ~7B and less) can have a poor in-context learning and thus perform better
-in small shots (or have a small knowledge capacity and perform poorly in zero-shot). We wish to not hold this against the model by trying to measure performances in different settings and take the maximum score achieved .
 ## REPRODUCIBILITY
 The parameters used for evaluation along with instructions and prompts will be available once the framework is released. (TO BE COMPLETED)

 These benchmarks are picked for now, but several other benchmarks are going to be added later to help us perform a more thorough examination of models.
 The last two benchmarks, ParsiNLU NLI and ParsiNLU QQP are evaluated in different few-shot settings and then the maximum score is returned as the final evaluation.
+We argue that this is indeed a fair evaluation scheme since many light-weight models (around ~7B and less) can have a poor in-context learning in long-context prompts and thus perform better
+in smaller shots (or have a small knowledge capacity and perform poorly in zero-shot). We wish to not hold this against the model by trying to measure performances in different settings and take the maximum score achieved .
 ## REPRODUCIBILITY
 The parameters used for evaluation along with instructions and prompts will be available once the framework is released. (TO BE COMPLETED)
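
The max-over-settings scoring described in the changed paragraph can be illustrated with a minimal sketch. This is not the leaderboard's actual code, which has not been released yet. The function name `max_over_shots`, the `evaluate` callback, and the shot counts and scores used below are all hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, Iterable, Tuple


def max_over_shots(
    evaluate: Callable[[int], float],
    shot_counts: Iterable[int],
) -> Tuple[int, float]:
    """Score a model at several few-shot settings and keep the best one.

    `evaluate` is a hypothetical callback that takes a number of in-context
    examples and returns a benchmark score (higher is better).
    """
    # Run the benchmark once per few-shot setting.
    scores: Dict[int, float] = {k: evaluate(k) for k in shot_counts}
    if not scores:
        raise ValueError("shot_counts must contain at least one setting")

    # The final score is the maximum over all settings, so a model is not
    # penalized for weak zero-shot or weak many-shot performance alone.
    best_k = max(scores, key=scores.get)
    return best_k, scores[best_k]


# Made-up scores for a small model: weak at zero-shot, and degrading again
# as the prompt grows longer with more in-context examples.
toy_scores = {0: 0.41, 1: 0.55, 3: 0.52, 5: 0.47}
best_k, best_score = max_over_shots(toy_scores.__getitem__, [0, 1, 3, 5])
print(best_k, best_score)
```

In this toy case the reported score would be the 1-shot result, which matches the stated intent: a small model is judged by the setting that suits it best rather than by a single fixed number of shots.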