cyber-chris committed on
Commit
25f03c6
·
1 Parent(s): cba3d76

update README and add app

Files changed (3)
  1. README.md +12 -1
  2. app.py +7 -0
  3. requirements.txt +2 -1
README.md CHANGED
@@ -1,4 +1,15 @@
- # Dead Man's Switch for LLMs
+ ---
+ title: {{title}}
+ emoji: {{emoji}}
+ colorFrom: {{colorFrom}}
+ colorTo: {{colorTo}}
+ sdk: {{sdk}}
+ sdk_version: "{{sdkVersion}}"
+ app_file: app.py
+ pinned: false
+ ---
+
+ ## Dead Man's Switch for LLMs
 
  In cases where we don't want to risk relying on RLHF to teach the model to refuse, we could leverage the model's own understanding of risky behaviours (through SAE extracted features) and selectively steer the model towards refusal (by injecting activation vectors) under certain circumstances.
 
app.py ADDED
@@ -0,0 +1,7 @@
+ import gradio as gr
+
+ def greet(name):
+     return "Hello " + name + "!!"
+
+ demo = gr.Interface(fn=greet, inputs="text", outputs="text")
+ demo.launch()
requirements.txt CHANGED
@@ -5,4 +5,5 @@ transformers
  sae-lens==3.18.2
  git+https://github.com/cyber-chris/activation_additions.git
  pandas
- bitsandbytes
+ bitsandbytes
+ gradio
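
Note on the README description: the idea of steering toward refusal only "under certain circumstances" can be illustrated with a small, dependency-free sketch. This is not the repository's implementation. Every name below (`feature_activation`, `maybe_steer`, `enc_dir`, `refusal_vec`, `threshold`, `coeff`) is hypothetical, and plain Python lists stand in for the model's residual-stream activations and for an SAE feature direction. It assumes a single SAE feature whose activation is a ReLU over a dot product, and that steering means adding a scaled vector to the activations.

```python
from typing import List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def feature_activation(resid: List[float], enc_dir: List[float], bias: float = 0.0) -> float:
    """Hypothetical SAE feature readout: ReLU(resid . enc_dir + bias)."""
    return max(0.0, dot(resid, enc_dir) + bias)


def maybe_steer(
    resid: List[float],
    enc_dir: List[float],
    refusal_vec: List[float],
    threshold: float = 1.0,
    coeff: float = 4.0,
) -> Tuple[List[float], bool]:
    """Add coeff * refusal_vec to resid only if the 'risky' feature fires.

    Returns the (possibly modified) activations and whether steering was applied.
    """
    if feature_activation(resid, enc_dir) <= threshold:
        # Feature is quiet: leave the activations untouched.
        return list(resid), False
    steered = [r + coeff * v for r, v in zip(resid, refusal_vec)]
    return steered, True


# Toy 3-dimensional example (all values are made up for illustration).
enc_dir = [1.0, 0.0, 0.0]      # assumed direction of the "risky" feature
refusal_vec = [0.0, 1.0, 0.0]  # assumed refusal steering vector

benign, benign_flag = maybe_steer([0.1, 0.0, 0.2], enc_dir, refusal_vec)
risky, risky_flag = maybe_steer([2.0, 0.0, 0.2], enc_dir, refusal_vec)
```

In this toy setup the benign input passes through unchanged, while the risky input is shifted along `refusal_vec`. In a real model the same check would presumably sit in a forward hook at the layer the SAE was trained on, with the threshold and coefficient tuned empirically.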