kawine committed b00ca45 (1 parent: 13f9788)

Update README.md

Files changed (1): README.md (+133, -0)

---
license: apache-2.0
datasets:
- stanfordnlp/SHP
language:
- en
metrics:
- accuracy
tags:
- human feedback
- rlhf
- preferences
- reddit
- preference model
- RL
- NLG
- evaluation
---

# 💨🚢 SteamSHP-Large

<!-- Provide a quick summary of what the model is/does. -->

SteamSHP-Large is a preference model trained to predict human preferences, given some context and two possible responses.
It can be used for NLG evaluation or to train a smaller reward model for RLHF.

It is a FLAN-T5-large model (780M parameters) finetuned on:
1. The [Stanford Human Preferences Dataset (SHP)](https://huggingface.co/datasets/stanfordnlp/SHP), which contains aggregate human preferences sourced from 18 different communities on Reddit (e.g., `askculinary`, `legaladvice`, etc.).
2. The helpfulness data in [Anthropic's HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset.

There is a larger variant, [SteamSHP-XL](https://huggingface.co/stanfordnlp/SteamSHP-flan-t5-xl), made by finetuning FLAN-T5-xl (3B parameters); it is 0.75 percentage points more accurate on the test data.


## Usage

The input text should be of the format:

```
POST: { the context, such as the 'history' column in SHP }

RESPONSE A: { first possible continuation }

RESPONSE B: { second possible continuation }

Which response is better? RESPONSE
```

The output generated by SteamSHP-Large will either be `A` or `B`.
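
As a convenience, here is a minimal sketch (not from the original card; the helper name `make_input` is ours) of assembling an input string in this format:

```python
def make_input(post: str, response_a: str, response_b: str) -> str:
    # Assemble the prompt in the format SteamSHP expects (sketch; helper name is ours).
    return (
        f"POST: {post}\n\n"
        f" RESPONSE A: {response_a}\n\n"
        f" RESPONSE B: {response_b}\n\n"
        " Which response is better? RESPONSE"
    )
```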

Here's how to use the model:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = 'cuda'

tokenizer = T5Tokenizer.from_pretrained('stanfordnlp/SteamSHP-flan-t5-large')
model = T5ForConditionalGeneration.from_pretrained('stanfordnlp/SteamSHP-flan-t5-large').to(device)

input_text = "POST: Instacart gave me 50 pounds of limes instead of 5 pounds... what the hell do I do with 50 pounds of limes? I've already donated a bunch and gave a bunch away. I'm planning on making a bunch of lime-themed cocktails, but... jeez. Ceviche? \n\n RESPONSE A: Lime juice, and zest, then freeze in small quantities.\n\n RESPONSE B: Lime marmalade lol\n\n Which response is better? RESPONSE"

x = tokenizer([input_text], return_tensors='pt').input_ids.to(device)
y = model.generate(x, max_new_tokens=1)   # the model emits a single token: 'A' or 'B'
print(tokenizer.batch_decode(y, skip_special_tokens=True))  # ['A']
```

If the input exceeds the 512-token limit, you can use [pySBD](https://github.com/nipunsadvilkar/pySBD) to break the input up into sentences and include only what fits into 512 tokens.
When trying to cram an example into 512 tokens, we recommend truncating the context as much as possible and leaving the responses as untouched as possible.
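
As a rough illustration (our own sketch, not part of the original card), sentence-level truncation of the context with pySBD could look like the following; the helper name and the exact token budget are assumptions:

```python
import pysbd  # pip install pysbd

seg = pysbd.Segmenter(language="en", clean=False)

def truncate_context(post: str, rest_of_prompt: str, tokenizer, max_tokens: int = 512) -> str:
    # Drop sentences from the end of the post until the full prompt fits (sketch);
    # the responses in rest_of_prompt are left untouched.
    sentences = seg.segment(post)
    while sentences:
        candidate = " ".join(sentences)
        if len(tokenizer(candidate + rest_of_prompt).input_ids) <= max_tokens:
            return candidate
        sentences = sentences[:-1]
    return ""  # even an empty context does not fit; the responses alone exceed the budget
```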


## Training and Evaluation

SteamSHP-Large was only finetuned on 125K of the 392K training examples that were available, since we found that:
1. When the total input length exceeded the limit (512 tokens), the loss would not converge.
   When possible, we crammed an example to fit under 500 tokens by truncating the context as much as possible, though some examples would still not fit despite this.
   We used 500 as the limit instead of 512 to allow for slight modifications to the structure of the input without any examples exceeding the actual 512 limit.
2. Training on fewer preferences with a stronger signal led to better performance than training on all the preferences.
   From the SHP dataset, we only used preferences where the more preferred comment was twice as preferred as the other (i.e., `score_ratio` >= 2) and used no more than 5 preferences from each context (i.e., 5 examples per unique `post_id`) to prevent overfitting; a sketch of this filtering follows the list.
   We did no such subsampling for the HH-RLHF training data.
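
For illustration only (this is our sketch, not the authors' released training code), the SHP subsampling described above could be implemented roughly as follows; the column names `score_ratio` and `post_id` are from the SHP dataset, and everything else is assumed:

```python
from collections import Counter

from datasets import load_dataset

shp = load_dataset("stanfordnlp/SHP", split="train")

# Keep only strongly preferred comparisons: the preferred comment is at least
# twice as preferred as the other one.
shp = shp.filter(lambda ex: ex["score_ratio"] >= 2)

# Use at most 5 preferences per unique post.
seen = Counter()

def under_cap(ex):
    seen[ex["post_id"]] += 1
    return seen[ex["post_id"]] <= 5

shp = shp.filter(under_cap)
```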

We evaluated the model on the SHP and HH-RLHF test data using accuracy, but only on the data that could be truncated to fit within 500 tokens (a total of 18621 out of 20753 available test examples).
SteamSHP-Large gets an average 72.0% accuracy across all domains:

| Domain | Accuracy |
| ------ | -------- |
| askculinary | 0.7199 |
| askhr | 0.7507 |
| askdocs | 0.6920 |
| askanthropology | 0.7925 |
| asksciencefiction | 0.7266 |
| askacademia | 0.7442 |
| askengineers | 0.7146 |
| legaladvice | 0.7958 |
| explainlikeimfive | 0.7312 |
| askbaking | 0.6656 |
| askphysics | 0.7888 |
| askscience | 0.6926 |
| askphilosophy | 0.6837 |
| askvet | 0.7696 |
| changemyview | 0.6984 |
| askcarguys | 0.7297 |
| askhistorians | 0.7476 |
| asksocialscience | 0.8231 |
| anthropic (helpfulness) | 0.7310 |
| ALL | 0.7203 |
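
For reference, here is a minimal sketch (ours, not the authors' evaluation script) of how this accuracy could be computed on the SHP test split. It reuses the `tokenizer`, `model`, and `device` from the usage example above; the column names (`history`, `human_ref_A`, `human_ref_B`, `labels`) come from the SHP dataset card, and the 500-token truncation step is omitted for brevity:

```python
from datasets import load_dataset

test = load_dataset("stanfordnlp/SHP", split="test")

correct = 0
for ex in test:
    prompt = (
        f"POST: {ex['history']}\n\n"
        f" RESPONSE A: {ex['human_ref_A']}\n\n"
        f" RESPONSE B: {ex['human_ref_B']}\n\n"
        " Which response is better? RESPONSE"
    )
    x = tokenizer([prompt], return_tensors='pt').input_ids.to(device)
    y = model.generate(x, max_new_tokens=1)
    pred = tokenizer.batch_decode(y, skip_special_tokens=True)[0]
    correct += int((pred == 'A') == (ex['labels'] == 1))  # labels == 1 means A was preferred

print(f"accuracy: {correct / len(test):.4f}")
```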


### Biases and Limitations

Biases in the datasets used to train SteamSHP-Large may be propagated downstream to the model predictions.
Although SHP filtered out posts with NSFW (over 18) content and chose subreddits that were well-moderated and had policies against harassment and bigotry, some of the data may contain discriminatory or harmful language.
Reddit users on the subreddits covered by SHP are also not representative of the broader population; they are disproportionately from developed, Western, and English-speaking countries.

It is also worth noting that the more preferred response in SHP or HH-RLHF is not necessarily the more correct one -- the data just reflects the aggregate preference of Reddit users (in SHP's case) and individuals' preferences (in HH-RLHF's case).
[Past work](https://www.anthropic.com/model-written-evals.pdf) by Anthropic has found that models optimized for human preference can be obsequious, at the expense of the truth.


## Contact

Please contact [email protected] if you have any questions about the model.
This model was created by Kawin Ethayarajh, Heidi (Chenyu) Zhang, Yizhong Wang, and Dan Jurafsky.


## Citation

We will have a paper out soon, but until then, please cite:

```
@online{SHP,
  author = {Ethayarajh, Kawin and Zhang, Heidi and Wang, Yizhong and Jurafsky, Dan},
  title = {Stanford Human Preferences Dataset},
  year = 2023,
  url = {https://huggingface.co/datasets/stanfordnlp/SHP},
}
```