tabedini committed (verified) · Commit 1cac001 · Parent: 7552e97

Update utils.py

Files changed (1):
  1. utils.py +6 -6
utils.py CHANGED
@@ -112,7 +112,7 @@ body, .gradio-container, .gr-button, .gr-input, .gr-slider, .gr-dropdown, .gr-ma
 LLM_BENCHMARKS_ABOUT_TEXT = f"""
 # Persian LLM Evaluation Leaderboard (v1)
 
-> The Persian LLM Evaluation Leaderboard, developed by **Part DP AI** in collaboration with **AUT (Amirkabir University of Technology) NLP Lab**, provides a comprehensive benchmarking system specifically designed for Persian language models. This leaderboard, based on the open-source [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness), offers a unique platform for evaluating the performance of large language models (LLMs) on tasks that demand linguistic proficiency and technical skill in Persian.
+> The Persian LLM Evaluation Leaderboard, developed by **Part DP AI** in collaboration with **AUT (Amirkabir University of Technology) NLP Lab**, provides a comprehensive benchmarking system specifically designed for Persian LLMs. This leaderboard, based on the open-source [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness), offers a unique platform for evaluating the performance of large language models (LLMs) on tasks that demand linguistic proficiency and technical skill in Persian.
 
 ## 1. Key Features
 
@@ -123,7 +123,7 @@ LLM_BENCHMARKS_ABOUT_TEXT = f"""
 > Six specialized tasks have been curated for this leaderboard, each tailored to challenge different aspects of a model’s capabilities. These tasks include:
 > - **Part Multiple Choice**
 > - **ARC Easy**
-> - **ARC Challenging**
+> - **ARC Challenge**
 > - **MMLU Pro**
 > - **GSM8k Persian**
 > - **Multiple Choice Persian**
@@ -134,22 +134,22 @@ LLM_BENCHMARKS_ABOUT_TEXT = f"""
 > A sample of the evaluation dataset is hosted on [Hugging Face Datasets](https://huggingface.co/datasets/PartAI/llm-leaderboard-datasets-sample), offering the AI community a glimpse of the benchmark content and format. This sample allows developers to pre-assess their models against representative data before a full leaderboard evaluation.
 >
 > 4. **Collaborative Development**
-> This leaderboard represents a significant collaboration between Part AI and Professor Saeedeh Momtazi of Amirkabir University of Technology, leveraging academic research and industrial expertise to create a high-quality, open benchmarking tool. The partnership underscores a shared commitment to advancing Persian-language AI technologies.
+> This leaderboard represents a significant collaboration between Part AI and Professor Saeedeh Momtazi of Amirkabir University of Technology, leveraging industrial expertise and academic research to create a high-quality, open benchmarking tool. The partnership underscores a shared commitment to advancing Persian LLMs.
 >
 > 5. **Comprehensive Evaluation Pipeline**
 > By integrating a standardized evaluation pipeline, models are assessed across a variety of data types, including text, mathematical formulas, and numerical data. This multi-faceted approach enhances the evaluation’s reliability and allows for precise, nuanced assessment of model performance across multiple dimensions.
 
 ## 2. Background and Goals
 
-> Recent months have seen a notable increase in the development of Persian language models by research centers and AI companies in Iran. However, the lack of reliable, standardized benchmarks for Persian models has made it challenging to evaluate model quality comprehensively. Global benchmarks typically do not support Persian, resulting in skewed or unreliable results for Persian-based AI.
+> Recent months have seen a notable increase in the development of Persian LLMs by research centers and AI companies in Iran. However, the lack of reliable, standardized benchmarks for Persian LLMs has made it challenging to evaluate model quality comprehensively. Global benchmarks typically do not support Persian, resulting in skewed or unreliable results for Persian LLMs.
 >
-> This leaderboard addresses this gap by providing a locally-focused, transparent system that enables consistent, fair comparisons of Persian models. It is expected to be a valuable tool for Persian-speaking businesses and developers, allowing them to select models best suited to their needs. Researchers and model developers also benefit from the competitive environment, with opportunities to showcase and improve their models based on benchmark rankings.
+> This leaderboard addresses this gap by providing a locally-focused, transparent system that enables consistent, fair comparisons of Persian LLMs. It is expected to be a valuable tool for Persian-speaking businesses and developers, allowing them to select models best suited to their needs. Researchers and model developers also benefit from the competitive environment, with opportunities to showcase and improve their models based on benchmark rankings.
 
 ## 3. Data Privacy and Integrity
 
 > To maintain evaluation integrity and prevent overfitting or data leakage, only part of the benchmark dataset is openly available. This limited access approach upholds model evaluation reliability, ensuring that results are genuinely representative of each model’s capabilities across unseen data.
 >
-> The leaderboard represents a significant milestone in Persian language AI and is positioned to become the leading standard for LLM evaluation in the Persian-speaking world.
+> The leaderboard represents a significant milestone in Persian LLMs and is positioned to become the leading standard for LLM evaluation in the Persian-speaking world.
 
 """
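
Likewise, the public sample dataset linked in the about-text can be inspected before requesting a full evaluation. A minimal sketch with the datasets library: the repo id comes from the link in the text, while config and split names are not documented there, so the configs are discovered at runtime and the "train" split is an assumption.

```python
# Peek at the public sample of the leaderboard's evaluation data.
# The repo id is taken from the Hugging Face link in the about-text;
# the dataset's internal layout is not documented, so configs are
# listed first and the "train" split is an assumption.
from datasets import get_dataset_config_names, load_dataset

repo_id = "PartAI/llm-leaderboard-datasets-sample"
configs = get_dataset_config_names(repo_id)
print(configs)

sample = load_dataset(repo_id, configs[0], split="train")
print(sample[0])
```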