Spaces: per/benchbench

Yotam Perlitz committed on Jul 18, 2024

Commit 1138892

1 Parent(s): acd921a

update images location

Files changed (2)

.gitignore CHANGED

@@ -1,6 +1,4 @@
 .vscode/launch.json
 .vscode/settings.json
 .DS_Store
-# assets/ablations.png
-# assets/motivation.png
 images/*

app.py CHANGED

@@ -280,7 +280,7 @@ st.markdown(
 )
 st.image(
-    "assets/motivation.png",
+    "images/motivation.png",
     caption="Conclusions depend on the models considered. Kendall-tau correlations between the LMSys Arena benchmark and three other benchmarks: BBH, MMLU, and Alpaca v2. Each group of bars represents the correlation for different sets of top models, specifically the top 5, top 10, and top 15 (overlapping) models (according to the Arena). The results indicate that the degree of agreement between benchmarks varies with the number of top models considered, highlighting that different selections of models can lead to varying conclusions about benchmark agreement.",
     use_column_width=True,
 )
@@ -297,7 +297,7 @@ st.markdown(
 )
 st.image(
-    "assets/pointplot_granularity_matters.png",
+    "images/pointplot_granularity_matters.png",
     caption="Correlations increase with number of models. Mean correlation (y) between each benchmark (lines) and the rest, given different numbers of models. The Blue and Orange lines are the average of all benchmark pair correlations with models sampled randomly (orange) or in contiguous sets (blue). The shaded lines represents adjacent sampling for the different benchmarks.",
     use_column_width=True,
 )
@@ -316,7 +316,7 @@ st.markdown(
 st.image(
-    "assets/ablations.png",
+    "images/ablations.png",
     caption="Our recommendations substantially reduce the variance of BAT. Ablation analysis for each BAT recommendation separately and their combinations.",
     use_column_width=True,
 )
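
Review note: the three hunks above change the same directory prefix in three separate string literals. A minimal sketch of one way to keep that prefix in a single place, so a later move touches one line. The names `IMAGES_DIR`, `image_path`, and `FIGURES` are hypothetical and do not appear in `app.py`; the file names are the ones in this diff.

```python
from pathlib import Path

# Hypothetical constant: the directory this commit moves the figures into.
IMAGES_DIR = Path("images")

# File names taken from the three st.image(...) calls in the diff.
FIGURES = [
    "motivation.png",
    "pointplot_granularity_matters.png",
    "ablations.png",
]


def image_path(filename: str, base: Path = IMAGES_DIR) -> str:
    """Return the path of a figure as a forward-slash string."""
    return (base / filename).as_posix()


# Paths as they would be passed as the first argument of st.image(...).
paths = [image_path(name) for name in FIGURES]
```

With such a helper, each call site would pass `image_path("motivation.png")` (and so on) as the first argument to `st.image`, leaving `caption` and `use_column_width` unchanged.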