Bram Vanroy committed
Commit b268b1d · 1 parent: 3ad3dea

add disclaimer

Files changed (2)
  1. app.py +2 -0
  2. content.py +10 -8
app.py CHANGED
@@ -261,6 +261,8 @@ with gr.Blocks() as demo:
  gr.Markdown("## LaTeX")
  gr.Code(results.latex_df.to_latex(convert_css=True))
 
+
+ gr.Markdown(DISCLAIMER, elem_classes="markdown-text")
  gr.Markdown(CREDIT, elem_classes="markdown-text")
  gr.Markdown(CITATION, elem_classes="markdown-text")
 
content.py CHANGED
@@ -1,27 +1,30 @@
  TITLE = '<h1 align="center" id="space-title">Open Dutch LLM Evaluation Leaderboard</h1>'
 
- INTRO_TEXT = f"""
- ## About
+ INTRO_TEXT = f"""## About
 
  This is a leaderboard for Dutch benchmarks for large language models.
 
  This is a fork of the [Open Multilingual LLM Evaluation Leaderboard](https://huggingface.co/spaces/uonlp/open_multilingual_llm_leaderboard), but restricted to only Dutch models and augmented with additional model results.
  We test the models on the following benchmarks **for the Dutch version only!!**, which have been translated into Dutch automatically by the original authors of the Open Multilingual LLM Evaluation Leaderboard with `gpt-35-turbo`.
+ I did not verify their translations and I do not maintain the datasets, I only run the benchmarks and add the results to this space. For questions regarding the test sets or running them yourself, see [the original Github repository](https://github.com/laiviet/lm-evaluation-harness).
 
  - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> AI2 Reasoning Challenge </a> (25-shot)
  - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot)
  - <a href="https://arxiv.org/abs/2009.03300" target="_blank"> MMLU </a> (5-shot)
  - <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA </a> (0-shot)
 
- I do not maintain those datasets, I only run benchmarks and add the results to this space. For questions regarding the test sets or running them yourself, see [the original Github repository](https://github.com/laiviet/lm-evaluation-harness).
+ """
+
+ DISCLAIMER = """## Disclaimer
+
+ I did not verify the (translation) quality of the benchmarks. If you encounter issues with the benchmark contents, please contact the original authors.
 
- **Disclaimer**: I am aware that benchmarking models on *translated* data is not ideal. However, for Dutch there are no other options for generative models at the moment. Because the benchmarks were automatically translated, some translationese effects may occur: the translations may not be fluent Dutch or still contain artifacts of the source text (like word order, literal translation, certain vocabulary items). Because of that, an unfair advantage may be given to the non-Dutch models: Dutch is closely related to English, so if the benchmarks are in automatically translated Dutch that still has English properties, those English models may not have too many issues with that. If the benchmarks were to have been manually translated or, even better, created from scratch in Dutch, those non-Dutch models may have a harder time. Maybe not. We cannot know for sure until we have high-quality, manually crafted benchmarks for Dutch.
+ I am aware that benchmarking models on *translated* data is not ideal. However, for Dutch there are no other options for generative models at the moment. Because the benchmarks were automatically translated, some translationese effects may occur: the translations may not be fluent Dutch or still contain artifacts of the source text (like word order, literal translation, certain vocabulary items). Because of that, an unfair advantage may be given to the non-Dutch models: Dutch is closely related to English, so if the benchmarks are in automatically translated Dutch that still has English properties, those English models may not have too many issues with that. If the benchmarks were to have been manually translated or, even better, created from scratch in Dutch, those non-Dutch models may have a harder time. Maybe not. We cannot know for sure until we have high-quality, manually crafted benchmarks for Dutch.
 
  If you have any suggestions for other Dutch benchmarks, please let me know so I can add them!
  """
 
- CREDIT = f"""
- ## Credit
+ CREDIT = f"""## Credit
 
  This leaderboard has borrowed heavily from the following sources:
 
@@ -33,8 +36,7 @@ This leaderboard has borrowed heavily from the following sources:
  """
 
 
- CITATION = f"""
- ## Citation
+ CITATION = f"""## Citation
 
 
  If you use or cite the Dutch benchmark results or this specific leaderboard page, please cite the following paper: