Initial commit
janraasch committed
Commit 5b51887 • 0 Parent(s)
- .gitattributes +35 -0
- .gitignore +1 -0
- README.md +55 -0
- app.py +335 -0
- requirements.txt +4 -0
.gitattributes
ADDED
@@ -0,0 +1,35 @@
1 |
+
*.7z filter=lfs diff=lfs merge=lfs -text
|
2 |
+
*.arrow filter=lfs diff=lfs merge=lfs -text
|
3 |
+
*.bin filter=lfs diff=lfs merge=lfs -text
|
4 |
+
*.bz2 filter=lfs diff=lfs merge=lfs -text
|
5 |
+
*.ckpt filter=lfs diff=lfs merge=lfs -text
|
6 |
+
*.ftz filter=lfs diff=lfs merge=lfs -text
|
7 |
+
*.gz filter=lfs diff=lfs merge=lfs -text
|
8 |
+
*.h5 filter=lfs diff=lfs merge=lfs -text
|
9 |
+
*.joblib filter=lfs diff=lfs merge=lfs -text
|
10 |
+
*.lfs.* filter=lfs diff=lfs merge=lfs -text
|
11 |
+
*.mlmodel filter=lfs diff=lfs merge=lfs -text
|
12 |
+
*.model filter=lfs diff=lfs merge=lfs -text
|
13 |
+
*.msgpack filter=lfs diff=lfs merge=lfs -text
|
14 |
+
*.npy filter=lfs diff=lfs merge=lfs -text
|
15 |
+
*.npz filter=lfs diff=lfs merge=lfs -text
|
16 |
+
*.onnx filter=lfs diff=lfs merge=lfs -text
|
17 |
+
*.ot filter=lfs diff=lfs merge=lfs -text
|
18 |
+
*.parquet filter=lfs diff=lfs merge=lfs -text
|
19 |
+
*.pb filter=lfs diff=lfs merge=lfs -text
|
20 |
+
*.pickle filter=lfs diff=lfs merge=lfs -text
|
21 |
+
*.pkl filter=lfs diff=lfs merge=lfs -text
|
22 |
+
*.pt filter=lfs diff=lfs merge=lfs -text
|
23 |
+
*.pth filter=lfs diff=lfs merge=lfs -text
|
24 |
+
*.rar filter=lfs diff=lfs merge=lfs -text
|
25 |
+
*.safetensors filter=lfs diff=lfs merge=lfs -text
|
26 |
+
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
27 |
+
*.tar.* filter=lfs diff=lfs merge=lfs -text
|
28 |
+
*.tar filter=lfs diff=lfs merge=lfs -text
|
29 |
+
*.tflite filter=lfs diff=lfs merge=lfs -text
|
30 |
+
*.tgz filter=lfs diff=lfs merge=lfs -text
|
31 |
+
*.wasm filter=lfs diff=lfs merge=lfs -text
|
32 |
+
*.xz filter=lfs diff=lfs merge=lfs -text
|
33 |
+
*.zip filter=lfs diff=lfs merge=lfs -text
|
34 |
+
*.zst filter=lfs diff=lfs merge=lfs -text
|
35 |
+
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
.gitignore
ADDED
@@ -0,0 +1 @@
1 |
+
gradio-env
|
README.md
ADDED
@@ -0,0 +1,55 @@
1 |
+
---
|
2 |
+
title: Activate Love
|
3 |
+
emoji: ❤️
|
4 |
+
colorFrom: purple
|
5 |
+
colorTo: red
|
6 |
+
sdk: gradio
|
7 |
+
sdk_version: 4.31.5
|
8 |
+
app_file: app.py
|
9 |
+
pinned: true
|
10 |
+
license: mit
|
11 |
+
short_description: Steering AI Text Generation
|
12 |
+
---
|
13 |
+
|
14 |
+
# Activate Love ❤️
|
15 |
+
|
16 |
+
A [Gradio App][gradio-url] replicating results of the paper [»Activation Addition: Steering Language Models Without Optimization«][paper-url] on a [Hugging Face Space][hugging-face-spaces-url].
|
17 |
+
|
18 |
+
## Demo
|
19 |
+
|
20 |
+
Check it out https://huggingface.co/spaces/janraasch/activate-love 🎯.
|
21 |
+
|
22 |
+
## Raison d'être
|
23 |
+
|
24 |
+
This is my final project for the [AI Safety Fundamentals][ai-safety-fundamentals-url] course on [AI Alignment][ai-safety-fundamentals-alignment-url].
|
25 |
+
|
26 |
+
When we covered *Mechanistic Interpretability* in session six, my cohort's instructor mentioned [the paper on activation addition][paper-url], published in late 2023. I found it an enjoyable and interesting way to play around with the inner workings of a model without any training or optimization.
|
27 |
+
|
28 |
+
The authors kindly provide [a notebook on Google Colab][notebook-url] for everyone to replicate their results. Still, I felt it would be useful to offer an even more user-friendly, non-technical interface that lowers the barrier to interacting with these low-level workings of the model.
|
29 |
+
|
30 |
+
Hence this app, https://huggingface.co/spaces/janraasch/activate-love, exists so that *everyone* may steer and play with [GPT-2 XL][gpt2-xl-url].
|
31 |
+
|
32 |
+
## Development
|
33 |
+
|
34 |
+
```bash
|
35 |
+
# Create virtual environment
|
36 |
+
python3 -m venv gradio-env
|
37 |
+
source gradio-env/bin/activate
|
38 |
+
|
39 |
+
# Install dependencies
|
40 |
+
pip install -r requirements.txt
|
41 |
+
|
42 |
+
# Run app locally
|
43 |
+
gradio app.py
|
44 |
+
```
|
45 |
+
|
46 |
+
## License
|
47 |
+
[MIT License](https://en.wikipedia.org/wiki/MIT_License) © [Jan Raasch](https://www.janraasch.com)
|
48 |
+
|
49 |
+
[ai-safety-fundamentals-alignment-url]: https://aisafetyfundamentals.com/alignment
|
50 |
+
[ai-safety-fundamentals-url]: https://aisafetyfundamentals.com
|
51 |
+
[gpt2-xl-url]: https://huggingface.co/openai-community/gpt2-xl
|
52 |
+
[gradio-url]: https://www.gradio.app
|
53 |
+
[hugging-face-spaces-url]: https://huggingface.co/spaces/launch
|
54 |
+
[paper-url]: https://arxiv.org/abs/2308.10248
|
55 |
+
[notebook-url]: http://tinyurl.com/actadd
|
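Review note: the steering idea this README refers to can be illustrated without loading a model. The following is a minimal sketch, assuming toy activations stored as nested lists; the function name `steer` and all numbers are hypothetical and not part of this commit. It mirrors the arithmetic used in `app.py`: the difference between the TO and FROM activations, scaled by a multiplier, is added to the first positions of the prompt's activations.

```python
from typing import List

# [position][hidden_dim]: a toy stand-in for one sequence of residual-stream activations
Matrix = List[List[float]]


def steer(resid: Matrix, act_add: Matrix, act_sub: Matrix, coeff: float) -> Matrix:
    """Return a copy of resid with coeff * (act_add - act_sub) added to its first positions."""
    if len(act_add) != len(act_sub):
        raise ValueError("TO and FROM activations must cover the same number of positions")
    if len(act_add) > len(resid):
        raise ValueError("steering vector is longer than the prompt")
    steered = [row[:] for row in resid]  # copy so the input stays untouched
    for pos, (add_row, sub_row) in enumerate(zip(act_add, act_sub)):
        for dim, (a, s) in enumerate(zip(add_row, sub_row)):
            steered[pos][dim] += coeff * (a - s)
    return steered


# Hypothetical toy data: three prompt positions, two hidden dimensions.
resid = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
act_add = [[1.0, 0.0]]  # stands in for the "TO" activations
act_sub = [[0.0, 1.0]]  # stands in for the "FROM" activations
steered = steer(resid, act_add, act_sub, coeff=2.0)
print(steered)
```

Only the first position changes here because the toy steering vector covers a single position; later positions are left as they were.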
app.py
ADDED
@@ -0,0 +1,335 @@
1 |
+
import spaces
|
2 |
+
import gradio as gr
|
3 |
+
|
4 |
+
import time
|
5 |
+
import torch
|
6 |
+
from transformer_lens import HookedTransformer
|
7 |
+
from typing import List
|
8 |
+
|
9 |
+
# Save memory
|
10 |
+
torch.set_grad_enabled(False)
|
11 |
+
|
12 |
+
# Mock model for faster UI testing & feedback
|
13 |
+
UI_DEVELOPMENT = False
|
14 |
+
|
15 |
+
if not UI_DEVELOPMENT:
|
16 |
+
model = HookedTransformer.from_pretrained("gpt2-xl")
|
17 |
+
model.eval()
|
18 |
+
if torch.cuda.is_available():
|
19 |
+
model.to("cuda")
|
20 |
+
else:
|
21 |
+
model = "toy" # :)
|
22 |
+
|
23 |
+
SEED = 0
|
24 |
+
sampling_kwargs = dict(temperature=1.0, top_p=0.3, freq_penalty=1.0)
|
25 |
+
example_count = 4
|
26 |
+
|
27 |
+
|
28 |
+
def get_token_length(prompt):
|
29 |
+
return model.to_tokens(prompt).shape[1]
|
30 |
+
|
31 |
+
|
32 |
+
def add_padding_right(prompt, length):
|
33 |
+
return prompt + " " * (length - get_token_length(prompt))
|
34 |
+
|
35 |
+
|
36 |
+
def add_padding(prompt_add, prompt_sub):
|
37 |
+
padding_size = max(get_token_length(prompt_add), get_token_length(prompt_sub))
|
38 |
+
return add_padding_right(prompt_add, padding_size), add_padding_right(
|
39 |
+
prompt_sub, padding_size
|
40 |
+
)
|
41 |
+
|
42 |
+
|
43 |
+
def get_resid_pre(prompt: str, layer: int):
|
44 |
+
name = f"blocks.{layer}.hook_resid_pre"
|
45 |
+
cache, caching_hooks, _ = model.get_caching_hooks(lambda n: n == name)
|
46 |
+
with model.hooks(fwd_hooks=caching_hooks):
|
47 |
+
_ = model(prompt)
|
48 |
+
return cache[name]
|
49 |
+
|
50 |
+
|
51 |
+
def get_activations(prompt_add: str, prompt_sub: str, layer: int):
|
52 |
+
act_add = get_resid_pre(prompt_add, layer)
|
53 |
+
act_sub = get_resid_pre(prompt_sub, layer)
|
54 |
+
act_diff = act_add - act_sub
|
55 |
+
|
56 |
+
print("Activation Difference:")
|
57 |
+
print(act_diff.shape)
|
58 |
+
|
59 |
+
return act_diff
|
60 |
+
|
61 |
+
|
62 |
+
def create_hook(act_diff: torch.Tensor, coeff: int):
|
63 |
+
def ave_hook(resid_pre, hook):
|
64 |
+
if resid_pre.shape[1] == 1:
|
65 |
+
return # caching in model.generate for new tokens
|
66 |
+
|
67 |
+
# We only add to the prompt (first call), not the generated tokens.
|
68 |
+
ppos, apos = resid_pre.shape[1], act_diff.shape[1]
|
69 |
+
|
70 |
+
if apos > ppos:
|
71 |
+
raise gr.Error(
|
72 |
+
f"More mod tokens ({apos}) than PROMPT tokens ({ppos}). Try a **longer** PROMPT."
|
73 |
+
)
|
74 |
+
|
75 |
+
# add to the beginning (position-wise) of the activations
|
76 |
+
resid_pre[:, :apos, :] += coeff * act_diff
|
77 |
+
|
78 |
+
return ave_hook
|
79 |
+
|
80 |
+
|
81 |
+
def hooked_generate(prompt_batch: List[str], fwd_hooks=[], seed=None, **kwargs):
|
82 |
+
if seed is not None:
|
83 |
+
torch.manual_seed(seed)
|
84 |
+
|
85 |
+
with model.hooks(fwd_hooks=fwd_hooks):
|
86 |
+
tokenized = model.to_tokens(prompt_batch)
|
87 |
+
r = model.generate(input=tokenized, max_new_tokens=50, do_sample=True, **kwargs)
|
88 |
+
return r
|
89 |
+
|
90 |
+
|
91 |
+
def config_to_str(prompt, prompt_sub, prompt_add, coeff, act_name, no_steering_input):
|
92 |
+
if no_steering_input:
|
93 |
+
return "NO STEERING: TRUE"
|
94 |
+
return f"""PROMPT: {prompt}
|
95 |
+
FROM: {prompt_sub}
|
96 |
+
TO: {prompt_add}
|
97 |
+
MULTIPLIER: {coeff}
|
98 |
+
LAYER: {act_name}"""
|
99 |
+
|
100 |
+
|
101 |
+
def config_header_str():
|
102 |
+
return f"{'='*8} CONFIGURATION {'='*8}"
|
103 |
+
|
104 |
+
|
105 |
+
def sample_header_str(i: int):
|
106 |
+
return f"{'='*11} SAMPLE {i+1} {'='*11}"
|
107 |
+
|
108 |
+
|
109 |
+
def results_to_ui_output(
|
110 |
+
results, prompt, prompt_sub, prompt_add, coeff, act_name, no_steering_input
|
111 |
+
):
|
112 |
+
config_str = config_to_str(
|
113 |
+
prompt, prompt_sub, prompt_add, coeff, act_name, no_steering_input
|
114 |
+
)
|
115 |
+
header_str = f"{config_header_str()}\n\n{config_str}"
|
116 |
+
body_str = "\n\n".join(
|
117 |
+
[f"{sample_header_str(i)}\n\n{r}" for i, r in enumerate(results)]
|
118 |
+
)
|
119 |
+
return f"{header_str}\n\n{body_str}"
|
120 |
+
|
121 |
+
|
122 |
+
@spaces.GPU
|
123 |
+
def predict(
|
124 |
+
prompt: str,
|
125 |
+
prompt_sub: str = "",
|
126 |
+
prompt_add: str = "",
|
127 |
+
coeff: int = 12,
|
128 |
+
act_name: int = 6,
|
129 |
+
no_steering_input: bool = False,
|
130 |
+
):
|
131 |
+
if prompt_sub == "":
|
132 |
+
raise gr.Error(
|
133 |
+
"Please input a FROM option. Could be a single space character, a word or a phrase"
|
134 |
+
)
|
135 |
+
if prompt_add == "":
|
136 |
+
raise gr.Error(
|
137 |
+
"Please input a TO option. Could be a single space character, a word or a phrase"
|
138 |
+
)
|
139 |
+
|
140 |
+
print("Text generation begin:")
|
141 |
+
time_stamp = time.time()
|
142 |
+
print("Parameters:")
|
143 |
+
print("prompt:", prompt)
|
144 |
+
print("prompt_sub:", prompt_sub)
|
145 |
+
print("prompt_add:", prompt_add)
|
146 |
+
print("coeff:", coeff)
|
147 |
+
print("act_name:", act_name)
|
148 |
+
print("no_steering_input:", no_steering_input)
|
149 |
+
|
150 |
+
if not UI_DEVELOPMENT and not no_steering_input:
|
151 |
+
padded_prompt_add, padded_prompt_sub = add_padding(prompt_add, prompt_sub)
|
152 |
+
act_diff = get_activations(padded_prompt_add, padded_prompt_sub, act_name)
|
153 |
+
ave_hook = create_hook(act_diff, coeff)
|
154 |
+
editing_hooks = [(f"blocks.{act_name}.hook_resid_pre", ave_hook)]
|
155 |
+
res = hooked_generate(
|
156 |
+
[prompt] * example_count, editing_hooks, seed=SEED, **sampling_kwargs
|
157 |
+
)
|
158 |
+
|
159 |
+
# Remove beginning of sequence token
|
160 |
+
res_str = model.to_string(res[:, 1:])
|
161 |
+
else:
|
162 |
+
if not UI_DEVELOPMENT and no_steering_input:
|
163 |
+
res_str = hooked_generate(
|
164 |
+
[prompt] * example_count, [], seed=SEED, **sampling_kwargs
|
165 |
+
)
|
166 |
+
|
167 |
+
# Remove beginning of sequence token
|
168 |
+
res_str = model.to_string(res_str[:, 1:])
|
169 |
+
else:
|
170 |
+
res_str = [
|
171 |
+
"To visit the Berlin wall people have to go to the wall.",
|
172 |
+
"To visit the Berlin wall people have to go to a museum.",
|
173 |
+
]
|
174 |
+
|
175 |
+
ui_result = results_to_ui_output(
|
176 |
+
res_str, prompt, prompt_sub, prompt_add, coeff, act_name, no_steering_input
|
177 |
+
)
|
178 |
+
|
179 |
+
print(f"Text generation end after {time.time() - time_stamp:.2f} seconds:")
|
180 |
+
print(ui_result)
|
181 |
+
|
182 |
+
return ui_result
|
183 |
+
|
184 |
+
|
185 |
+
options_accordion = gr.Accordion(label="Steering Options", open=True)
|
186 |
+
|
187 |
+
prompt_sub_input = gr.Textbox(
|
188 |
+
lines=1,
|
189 |
+
label="FROM",
|
190 |
+
info='Enter a prompt that you want to steer the AI output away from. \
|
191 |
+
This can be a single word or a whole phrase. E.g. \
|
192 |
+
"The Berlin Wall is in Berlin" or "Hate".',
|
193 |
+
value="Hate",
|
194 |
+
)
|
195 |
+
|
196 |
+
prompt_add_input = gr.Textbox(
|
197 |
+
lines=1,
|
198 |
+
label="TO",
|
199 |
+
info='Enter a prompt that you want to steer the AI output towards. \
|
200 |
+
This can be a single word or a whole phrase. E.g. \
|
201 |
+
"The Berlin Wall is in Hamburg" or "Love".',
|
202 |
+
value="Love",
|
203 |
+
)
|
204 |
+
|
205 |
+
coeff_input = gr.Slider(
|
206 |
+
minimum=0,
|
207 |
+
maximum=100,
|
208 |
+
step=1,
|
209 |
+
label="MULTIPLIER",
|
210 |
+
info="The strength of the steering. Higher values will steer the AI output more towards the TO prompt. Be careful not to oversteer and break the AI's semantic capabilities!",
|
211 |
+
value=12,
|
212 |
+
)
|
213 |
+
|
214 |
+
act_name_input = gr.Slider(
|
215 |
+
minimum=0,
|
216 |
+
maximum=47,
|
217 |
+
step=1,
|
218 |
+
label="LAYER",
|
219 |
+
info="The layer of the model to steer. Higher layers are more abstract. However, steering at lower layers can lead to more coherent output. Experiment to find the best layer for your use case.",
|
220 |
+
value=6,
|
221 |
+
)
|
222 |
+
|
223 |
+
no_steering_input = gr.Checkbox(
|
224 |
+
label="No Steering",
|
225 |
+
info="Check this box to generate text without steering.",
|
226 |
+
value=False,
|
227 |
+
)
|
228 |
+
|
229 |
+
message_input = gr.Textbox(
|
230 |
+
lines=1,
|
231 |
+
label="PROMPT",
|
232 |
+
info='Enter a message to be completed by the AI. E.g. "I hate you because".',
|
233 |
+
placeholder="Enter a message to generate text.",
|
234 |
+
value="I hate you because",
|
235 |
+
)
|
236 |
+
|
237 |
+
text_output = gr.Textbox(
|
238 |
+
label="AI Text Generator",
|
239 |
+
lines=24,
|
240 |
+
max_lines=24,
|
241 |
+
placeholder="Hi, I am an AI Text Generator. \n\nPlease don't steer me the wrong way! 🤖",
|
242 |
+
show_copy_button=True,
|
243 |
+
)
|
244 |
+
|
245 |
+
CSS = """\
|
246 |
+
.prose {
|
247 |
+
color: var(--block-title-text-color);
|
248 |
+
}
|
249 |
+
.block:has(.prose) {
|
250 |
+
border: solid var(--panel-border-width) var(--panel-border-color);
|
251 |
+
border-radius: var(--container-radius);
|
252 |
+
background: var(--panel-background-fill);
|
253 |
+
padding: var(--spacing-lg);
|
254 |
+
}
|
255 |
+
"""
|
256 |
+
|
257 |
+
DESCRIPTION = """\
|
258 |
+
AI Text Generation can seem magical and inscrutable, but [recent research](https://arxiv.org/abs/2308.10248) has shown that it is possible to steer the output of a model by modifying its activations. Even better, it is quite intuitive and fun!
|
259 |
+
|
260 |
+
This demo allows you to input a message and two prompts, and then steer the model's output towards one prompt and away from another. You can also control the strength of the steering and the layer of the model to steer. Try it out and see what you can create!
|
261 |
+
|
262 |
+
If you end up with something you like, feel free to share it with us [on the community tab](https://huggingface.co/spaces/janraasch/activate-love/discussions). We would love to see what you come up with!
|
263 |
+
|
264 |
+
You can use the »copy« button in the upper right corner of the generated text box to copy your results to your clipboard. Have fun exploring the interface! 🚀
|
265 |
+
|
266 |
+
Learn more about the research behind this below. 📚
|
267 |
+
|
268 |
+
CONTENT WARNING: This interface allows you to manipulate and steer the outputs of [a large language model (GPT2-XL)](https://huggingface.co/openai-community/gpt2-xl) trained on a broad corpus of online data. The model's outputs may contain biased, offensive, explicit, or otherwise harmful content. Use this interface cautiously and at your own risk. We recommend parental guidance for minors.
|
269 |
+
"""
|
270 |
+
|
271 |
+
ARTICLE = """\
|
272 |
+
# Activation Addition: Steering GPT2 Without Optimization
|
273 |
+
|
274 |
+
This Space replicates results from the paper [Activation Addition: Steering Language Models Without Optimization](https://arxiv.org/abs/2308.10248) and provides a user-friendly interface for anybody to gain intuition about how activation steering works.
|
275 |
+
|
276 |
+
🔎 For more details about the research behind this take a look at [this post on the AI Alignment Forum](https://www.alignmentforum.org/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector) or check out [the original paper](https://arxiv.org/abs/2308.10248).
|
277 |
+
|
278 |
+
## Model Details
|
279 |
+
|
280 |
+
We use a [pre-trained GPT2-XL model](https://huggingface.co/openai-community/gpt2-xl) from the Hugging Face model hub. The model is loaded with the [`transformer_lens` library](https://transformerlensorg.github.io/TransformerLens/), which allows us to access the activations of the model at different layers.
|
281 |
+
|
282 |
+
## Limitations
|
283 |
+
|
284 |
+
*So how is this not the solution to the [Alignment Problem](https://en.wikipedia.org/wiki/AI_alignment)?* you might ask.
|
285 |
+
|
286 |
+
Well, this is early research, and there are some limitations to keep in mind 😇:
|
287 |
+
|
288 |
+
* [GPT2-XL](https://huggingface.co/openai-community/gpt2-xl) is quite small compared to models currently being trained (e.g. [LLAMA3](https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6)).
|
289 |
+
* Activation Steering is not perfect and can lead to unintended side effects. For example, steering the model toward a prompt might lead to the model generating text that is not semantically coherent.
|
290 |
+
* Activation Steering is also not guaranteed to work for all prompts and all layers.
|
291 |
+
* It is still an open question how to best steer models in a safe and reliable way.
|
292 |
+
|
293 |
+
## Future Work
|
294 |
+
|
295 |
+
There is an even more recent paper that builds on this research: [Steering Llama 2 via Contrastive Activation Addition](https://arxiv.org/abs/2312.06681). This paper steers the [LLAMA-2 model](https://huggingface.co/collections/meta-llama/llama-2-family-661da1f90a9d678b6f55773b) with contrastive activation additions and shows that it is possible to steer a larger chat model with this technique.
|
296 |
+
|
297 |
+
Hence, we would like to replicate these results on a Hugging Face Space, providing a chat interface that can be steered to be more helpful or more harmful.
|
298 |
+
"""
|
299 |
+
|
300 |
+
EXAMPLES = [
|
301 |
+
["I hate you because", "Hate", "Love", 12, 6, False],
|
302 |
+
[
|
303 |
+
"To see the Berlin Wall, people flock to",
|
304 |
+
"The Berlin Wall is in Berlin",
|
305 |
+
"The Berlin Wall is in Hamburg",
|
306 |
+
10,
|
307 |
+
20,
|
308 |
+
False,
|
309 |
+
],
|
310 |
+
["I went up to my friend and said", " ", " wedding", 4, 6, False],
|
311 |
+
]
|
312 |
+
|
313 |
+
demo = gr.Interface(
|
314 |
+
theme="gradio/[email protected]",
|
315 |
+
fn=predict,
|
316 |
+
inputs=[
|
317 |
+
message_input,
|
318 |
+
prompt_sub_input,
|
319 |
+
prompt_add_input,
|
320 |
+
coeff_input,
|
321 |
+
act_name_input,
|
322 |
+
no_steering_input,
|
323 |
+
],
|
324 |
+
outputs=text_output,
|
325 |
+
title="ACTIVATE LOVE",
|
326 |
+
description=DESCRIPTION,
|
327 |
+
allow_duplication=True,
|
328 |
+
article=ARTICLE,
|
329 |
+
allow_flagging="never",
|
330 |
+
examples=EXAMPLES,
|
331 |
+
cache_examples=False,
|
332 |
+
css=CSS,
|
333 |
+
)
|
334 |
+
print("Starting demo!")
|
335 |
+
demo.launch()
|
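Review note: `add_padding` in `app.py` right-pads the shorter of the FROM/TO prompts with spaces until both tokenize to the same length, so that their activations can be subtracted position by position. The following is a minimal sketch of that idea, assuming a hypothetical toy tokenizer (`toy_tokens`) in which a word with its leading space is one token and a stray space is one token. It does not call the real GPT-2 tokenizer, whose handling of spaces may differ, and the helper names `pad_right` and `pad_pair` are illustrative only.

```python
import re
from typing import List, Tuple


def toy_tokens(text: str) -> List[str]:
    """Hypothetical tokenizer: ' word' or 'word' is one token, a lone space is one token."""
    return re.findall(r" ?\S+| ", text)


def pad_right(text: str, length: int) -> str:
    """Append one space per missing token (a negative difference appends nothing)."""
    return text + " " * (length - len(toy_tokens(text)))


def pad_pair(prompt_add: str, prompt_sub: str) -> Tuple[str, str]:
    """Pad both prompts to the token length of the longer one."""
    target = max(len(toy_tokens(prompt_add)), len(toy_tokens(prompt_sub)))
    return pad_right(prompt_add, target), pad_right(prompt_sub, target)


padded_add, padded_sub = pad_pair("Love", "The Berlin Wall")
print(repr(padded_add), repr(padded_sub))
```

The approach relies on each appended space adding exactly one token; if a tokenizer merged consecutive spaces, the padded lengths would no longer match and the subtraction of activations would fail on a shape mismatch.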
requirements.txt
ADDED
@@ -0,0 +1,4 @@
1 |
+
gradio==4.31.5
|
2 |
+
pytest==8.2.1
|
3 |
+
spaces==0.28.3
|
4 |
+
transformer-lens==1.15.0
|