Update src/about.py

src/about.py  CHANGED  (+14 -31)
@@ -51,43 +51,26 @@ For now, the only competitive open language models capable of properly speaking
 There are only a few capable multilingual LLMs in Persian that derive their main knowledge from English. A Persian LLM is almost an imagination right now as there doesn't exist
 that many models being expert in Persian in the first place.

-Our
+Our goal is to provide a benchmark over diverse domains and tasks that offers insight into how large the gap between the current SOTA models is on different grounds.
+This benchmark can also be used by multilingual researchers to measure how well their models perform in a language like Persian.

 We use our own framework to evaluate the models on the following benchmarks (TO BE RELEASED SOON)
 ### Tasks
--
--
-- <a href="https://arxiv.org/abs/
-- <a href="https://arxiv.org/abs/
-- <a href="https://arxiv.org/abs/
-- <a href="https://arxiv.org/abs/
+- PeKA: Persian Knowledge Assessment (0-shot) - a set of multiple-choice questions that tests the level of native knowledge of the Persian language across more than 15 domains and categories: from art to history and geography, cinema, TV, sports, law, and medicine, and much more.
+- PersBETS: Persian Bias, Ethics, Toxicity, and Skills (0-shot) - a test of the model's linguistic skills, such as grammar and paraphrasing, along with questions examining the model's bias, ethics, and toxicity.
+- <a href="https://arxiv.org/abs/2404.06644" target="_blank"> Khayyam Challenge (Persian MMLU) </a> (0-shot) - comprising 20,192 four-choice questions sourced from 38 diverse tasks extracted from Persian examinations, spanning a wide spectrum of subjects, complexities, and ages.
+- <a href="https://arxiv.org/abs/2012.06154" target="_blank"> ParsiNLU MCQA </a> (0-shot) - a series of multiple-choice questions in the domains of *literature*, *math & logic*, and *common knowledge*.
+- <a href="https://arxiv.org/abs/2012.06154" target="_blank"> ParsiNLU NLI </a> (max[0,3,5,10]-shot) - a 3-way classification task to determine whether a hypothesis sentence entails, contradicts, or is neutral with respect to a given premise sentence.
+- <a href="https://arxiv.org/abs/2012.06154" target="_blank"> ParsiNLU QQP </a> (max[0,2,5,10]-shot) - the task of deciding whether two given questions are paraphrases of each other or not.
+
 For all these evaluations, a higher score is a better score.
-We chose these benchmarks
+We chose these benchmarks for now, but several other benchmarks will be added later to help us perform a more thorough examination of models.
+The last two benchmarks, ParsiNLU NLI and ParsiNLU QQP, are evaluated in different few-shot settings, and the maximum score is then returned as the final evaluation.
+We argue that this is indeed a fair evaluation method, since many lightweight models (around ~7B and smaller) can have poor in-context learning and thus perform better
+with fewer shots. We do not wish to hold this against a model, so we measure performance in the different settings and take the maximum score achieved.

 ## REPRODUCIBILITY
-
-`python main.py --model=hf-causal-experimental --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>"`
-` --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=1 --output_path=<output_path>`
-```
-python main.py --model=hf-causal-experimental \
-    --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>" \
-    --tasks=<task_list> \
-    --num_fewshot=<n_few_shot> \
-    --batch_size=1 \
-    --output_path=<output_path>
-```
-**Note:** We evaluate all models on a single node of 8 H100s, so the global batch size is 8 for each evaluation. If you don't use parallelism, adapt your batch size to fit.
-*You can expect results to vary slightly for different batch sizes because of padding.*
-The tasks and few-shot parameters are:
-- ARC: 25-shot, *arc-challenge* (`acc_norm`)
-- HellaSwag: 10-shot, *hellaswag* (`acc_norm`)
-- TruthfulQA: 0-shot, *truthfulqa-mc* (`mc2`)
-- MMLU: 5-shot, *hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions* (average of all the results `acc`)
-- Winogrande: 5-shot, *winogrande* (`acc`)
-- GSM8k: 5-shot, *gsm8k* (`acc`)
-Side note on the baseline scores:
-- for log-likelihood evaluation, we select the random baseline
-- for GSM8K, we select the score obtained in the paper after finetuning a 6B model on the full GSM8K training set for 50 epochs
+(TO BE COMPLETED)
 """

 EVALUATION_QUEUE_TEXT = """
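The added text says that ParsiNLU NLI and ParsiNLU QQP are each run at several few-shot settings and only the best score is kept. Below is a minimal sketch of that aggregation rule, assuming a hypothetical `evaluate_task(task, num_fewshot)` helper that returns a single accuracy; the leaderboard's framework is not released yet, so the task identifiers here are placeholders as well.

```
from typing import Callable, Dict, Tuple

# Few-shot settings quoted in the task list above (hypothetical task identifiers).
FEWSHOT_SETTINGS: Dict[str, Tuple[int, ...]] = {
    "parsinlu_nli": (0, 3, 5, 10),
    "parsinlu_qqp": (0, 2, 5, 10),
}

def max_over_fewshot(task: str, evaluate_task: Callable[[str, int], float]) -> float:
    """Evaluate `task` at every listed few-shot setting and keep the best score."""
    scores = {n: evaluate_task(task, n) for n in FEWSHOT_SETTINGS[task]}
    best_n = max(scores, key=scores.get)  # setting with the highest accuracy
    return scores[best_n]
```

For a small model that does better at 0-shot than at 10-shot, this simply reports the 0-shot number, which is the behaviour the added paragraph argues for.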
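The removed REPRODUCIBILITY block invokes `main.py` once per task with a fixed number of shots. A sketch of scripting those runs, reusing the flags from the removed command, might look like the following; the task identifiers and the `results/` output directory are illustrative placeholders, and the `<your_model>` fields must be filled in exactly as in the original instructions.

```
import subprocess

# (task, num_fewshot) pairs as listed in the removed text; MMLU's long
# hendrycksTest-* subtask list is omitted here for brevity.
RUNS = [
    ("arc-challenge", 25),
    ("hellaswag", 10),
    ("truthfulqa-mc", 0),
    ("winogrande", 5),
    ("gsm8k", 5),
]

MODEL_ARGS = "pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>"

for task, n_shot in RUNS:
    subprocess.run(
        [
            "python", "main.py",
            "--model=hf-causal-experimental",
            f"--model_args={MODEL_ARGS}",
            f"--tasks={task}",
            f"--num_fewshot={n_shot}",
            "--batch_size=1",
            f"--output_path=results/{task}.json",  # placeholder output location
        ],
        check=True,
    )
```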