Commit 25f03c6
Parent(s): cba3d76
update README and add app
- README.md +12 -1
- app.py +7 -0
- requirements.txt +2 -1
README.md
CHANGED
@@ -1,4 +1,15 @@
-
+---
+title: {{title}}
+emoji: {{emoji}}
+colorFrom: {{colorFrom}}
+colorTo: {{colorTo}}
+sdk: {{sdk}}
+sdk_version: "{{sdkVersion}}"
+app_file: app.py
+pinned: false
+---
+
+## Dead Man's Switch for LLMs
 
 In cases where we don't want to risk relying on RLHF to teach the model to refuse, we could leverage the model's own understanding of risky behaviours (through SAE extracted features) and selectively steer the model towards refusal (by injecting activation vectors) under certain circumstances.
 
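Note on the README text: the idea it describes (watch an SAE feature, and inject a refusal vector only when that feature fires) can be sketched in a few lines. This is a minimal illustration and not the code in this repository. The names `feature_activation`, `maybe_steer`, `enc_direction`, `refusal_vector`, `threshold`, and `coeff` are all hypothetical, and plain Python lists stand in for residual-stream tensors. The sae-lens and activation_additions APIs are deliberately not used here.

```python
from typing import List, Sequence


def feature_activation(
    resid: Sequence[float],
    enc_direction: Sequence[float],
    enc_bias: float = 0.0,
) -> float:
    """ReLU of one (hypothetical) SAE encoder row applied to a residual vector."""
    pre_activation = sum(r * w for r, w in zip(resid, enc_direction)) + enc_bias
    return max(0.0, pre_activation)


def maybe_steer(
    resid: Sequence[float],
    enc_direction: Sequence[float],
    refusal_vector: Sequence[float],
    threshold: float = 1.0,
    coeff: float = 4.0,
    enc_bias: float = 0.0,
) -> List[float]:
    """Add coeff * refusal_vector to resid only when the risk feature exceeds threshold."""
    activation = feature_activation(resid, enc_direction, enc_bias)
    if activation <= threshold:
        # Feature did not fire strongly enough: leave the activations untouched.
        return list(resid)
    # Feature fired: push the residual stream along the assumed refusal direction.
    return [r + coeff * v for r, v in zip(resid, refusal_vector)]


# Toy 2-d example: dimension 0 is the "risky" feature, dimension 1 is "refusal".
risky = maybe_steer([2.0, 0.0], enc_direction=[1.0, 0.0], refusal_vector=[0.0, 1.0])
benign = maybe_steer([0.5, 3.0], enc_direction=[1.0, 0.0], refusal_vector=[0.0, 1.0])
```

In a real model the same check would presumably run inside a forward hook on the chosen layer, with the threshold and coefficient tuned empirically. Those details are assumptions and are not stated in this commit.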
app.py
ADDED
@@ -0,0 +1,7 @@
+import gradio as gr
+
+def greet(name):
+    return "Hello " + name + "!!"
+
+demo = gr.Interface(fn=greet, inputs="text", outputs="text")
+demo.launch()
requirements.txt
CHANGED
@@ -5,4 +5,5 @@ transformers
 sae-lens==3.18.2
 git+https://github.com/cyber-chris/activation_additions.git
 pandas
-bitsandbytes
+bitsandbytes
+gradio