NimaBoscarino committed on
Commit 5601a63 · 1 Parent(s): ce824ba

Adjust description for TruthfulQA

  1. content.py +6 -2
content.py CHANGED
@@ -1,4 +1,7 @@
  CHANGELOG_TEXT = f"""
+ ## [2023-06-13]
+ - Adjust description for TruthfulQA
+
  ## [2023-06-12]
  - Add Human & GPT-4 Evaluations
 
@@ -34,7 +37,8 @@ CHANGELOG_TEXT = f"""
  - Display different queues for jobs that are RUNNING, PENDING, FINISHED status
 
  ## [2023-05-15]
- - Fix a typo: from "TruthQA" to "TruthfulQA"
+ - Fix a typo: from "TruthQA" to "
+ QA"
 
  ## [2023-05-10]
  - Fix a bug that prevented auto-refresh
@@ -58,7 +62,7 @@ Evaluation is performed against 4 popular benchmarks:
  - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> AI2 Reasoning Challenge </a> (25-shot) - a set of grade-school science questions.
  - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
  - <a href="https://arxiv.org/abs/2009.03300" target="_blank"> MMLU </a> (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
- - <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA </a> (0-shot) - a benchmark to measure whether a language model is truthful in generating answers to questions.
+ - <a href="https://arxiv.org/abs/2109.07958" target="_blank"> TruthfulQA </a> (0-shot) - a test to measure a model’s propensity to reproduce falsehoods commonly found online.
 
  We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.
  """