Signed-off-by: Jonathan Bnayahu <[email protected]>
- src/about.py +6 -0
- src/leaderboard/read_evals.py +4 -1

src/about.py:

```diff
@@ -46,6 +46,12 @@ As a dynamic and evolving benchmark, BlueBench currently encompasses diverse dom
 LLM_BENCHMARKS_TEXT = """
 ## How it works
 
+BlueBench was designed with four goals in mind: representativeness, validity, robustness, and efficiency.
+* **Representative**: the task distribution reflects the skills required in an enterprise setting
+* **Valid**: tasks measure what they aim to measure
+* **Robust**: goes beyond single-prompt evaluation to account for model brittleness
+* **Efficient**: evaluation is fast (cheap)
+
 BlueBench is comprised of the following subtasks:
 
 <style>
```
src/leaderboard/read_evals.py:

```diff
@@ -69,9 +69,12 @@ def get_raw_eval_results(results_path: str) -> list[EvalResult]:
         if len(files) == 0 or any([not f.endswith(".json") for f in files]):
             continue
 
+        # skip anything not results
+        files = [f for f in files if (f.endswith("_evaluation_results.json"))]
+
         # Sort the files by date
         try:
-            files.sort(key=lambda x: x.removesuffix(".json"))
+            files.sort(key=lambda x: x.removesuffix("_evaluation_results.json"))
         except dateutil.parser._parser.ParserError:
             files = [files[-1]]
 
```
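
For context, here is a minimal sketch of what the patched filter-and-sort does, assuming result files carry an ISO-like timestamp prefix followed by `_evaluation_results.json` (the filenames below are hypothetical examples, not taken from the repository):

```python
# Hypothetical directory listing; only the naming convention is assumed.
files = [
    "2024-05-01T10-00-00_evaluation_results.json",
    "run_metadata.json",  # not a results file: dropped by the new filter
    "2024-06-02T09-30-00_evaluation_results.json",
]

# The new filter: keep only evaluation-result files.
files = [f for f in files if f.endswith("_evaluation_results.json")]

# The new sort key strips the full suffix, leaving the timestamp prefix,
# so lexicographic order matches chronological order.
files.sort(key=lambda x: x.removesuffix("_evaluation_results.json"))

print(files[-1])  # -> "2024-06-02T09-30-00_evaluation_results.json", the latest run
```

Stripping the full `_evaluation_results.json` suffix (rather than just `.json`, as before) keeps the sort key aligned with the new filename convention, and filtering first means unrelated JSON files no longer disturb the date-based ordering.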