kawine committed
Commit f81638c · 1 Parent(s): 0abcd57

Update README.md

Files changed (1)
  1. README.md +11 -6
README.md CHANGED
@@ -21,8 +21,8 @@ tags:
 
<!-- Provide a quick summary of what the model is/does. -->
 
- SteamSHP-Large is a preference model trained to predict human preferences, given some context and two possible responses.
- It can be used for NLG evaluation or to train a smaller reward model for RLHF.
+ SteamSHP-Large is a preference model trained to predict -- given some context and two possible responses -- which response humans will find more helpful.
+ It can be used for NLG evaluation, question-answering evaluation, or to train a smaller reward model for RLHF.
 
It is a FLAN-T5-large model (780M parameters) finetuned on:
1. The [Stanford Human Preferences Dataset (SHP)](https://huggingface.co/datasets/stanfordnlp/SHP), which contains collective human preferences sourced from 18 different communities on Reddit (e.g., `askculinary`, `legaladvice`, etc.).
@@ -106,13 +106,18 @@ SteamSHP-Large gets an average 72.0% accuracy across all domains:
 
 
 
- ### Biases and Limitations
+ ## Biases and Limitations
 
- Biases in the datasets used to train SteamSHP-Large may be propagated downstream to the model predictions.
+ SteamSHP is trained to predict which of two responses humans will find *more helpful*, not which response is *less harmful*.
+ It should not be used to detect toxicity, make ethical judgments, or for a similar purpose.
+
+ Biases and misinformation in the datasets used to train SteamSHP may also be propagated downstream to the model predictions.
Although SHP filtered out posts with NSFW (over 18) content, chose subreddits that were well-moderated and had policies against harassment and bigotry, some of the data may contain discriminatory or harmful language.
- Reddit users on the subreddits covered by SHP are also not representative of the broader population. They are disproportionately from developed, Western, and English-speaking countries.
+ The responses that humans collectively found more helpful are also not guaranteed to be more factual.
 
- It is also worth noting that the more preferred response in SHP or HH-RLHF is not necessarily the more correct one -- the data just reflects the collective preference of Reddit users (in SHP's case) and individuals' preferences (in HH-RLHF's case).
+ The people whose preferences are captured in SHP and HH-RLHF are not representative of the broader population.
+ Although specific demographic information is not available, overall, the Reddit users whose preferences are captured in SHP are disproportionately male and from developed, Western, and English-speaking countries (Pew Research).
+
[Past work](https://www.anthropic.com/model-written-evals.pdf) by Anthropic has found that models optimized for human preference can be obsequious, at the expense of the truth.
 
 
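For readers who want to try the model the updated card describes, below is a minimal sketch of comparing two responses with the 🤗 Transformers seq2seq API. The Hub repo id and the prompt template are assumptions for illustration only, not taken from this diff; consult the full model card for the exact input format the model was trained on.

```python
# Minimal sketch, assuming a FLAN-T5-based preference model hosted on the Hub.
# The repo id and the POST/RESPONSE prompt template below are illustrative assumptions.
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_id = "stanfordnlp/SteamSHP-flan-t5-large"  # assumed repo id
tokenizer = T5Tokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Hypothetical input: a context plus two candidate responses to compare.
prompt = (
    "POST: How do I keep my cast iron pan from rusting?\n\n"
    "RESPONSE A: Dry it right after washing and rub in a thin layer of oil.\n\n"
    "RESPONSE B: Leave it soaking in the sink overnight.\n\n"
    "Which response is better? RESPONSE"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=1)
# If the template matches what the model expects, it should emit "A" or "B".
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```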