Spaces: cyber-chris/dead-mans-switch
cyber-chris committed on Sep 16, 2024

Commit 25f03c6 (parent: cba3d76)

update README and add app

Files changed (3)

README.md CHANGED

@@ -1,4 +1,15 @@
+---
+title: {{title}}
+emoji: {{emoji}}
+colorFrom: {{colorFrom}}
+colorTo: {{colorTo}}
+sdk: {{sdk}}
+sdk_version: "{{sdkVersion}}"
+app_file: app.py
+pinned: false
+---
-# Dead Man's Switch for LLMs
+## Dead Man's Switch for LLMs
 In cases where we don't want to risk relying on RLHF to teach the model to refuse, we could leverage the model's own understanding of risky behaviours (through SAE extracted features) and selectively steer the model towards refusal (by injecting activation vectors) under certain circumstances.
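The README paragraph describes conditional steering: watch an SAE feature and inject a refusal vector only when that feature fires. A minimal sketch of that idea on toy data follows. It is not the repository's implementation: every name here (`feature_activation`, `conditional_steer`, `threshold`, `coeff`) and every number is a hypothetical chosen for illustration, and the "residual stream" is a plain list of floats rather than real model activations.

```python
# Hypothetical sketch of conditional activation steering on toy vectors.
# None of these names or values come from the dead-mans-switch repository.
from typing import List, Tuple


def feature_activation(resid: List[float], encoder_dir: List[float],
                       bias: float = 0.0) -> float:
    """Activation of one SAE feature: ReLU(resid . encoder_dir + bias)."""
    pre = sum(r * w for r, w in zip(resid, encoder_dir)) + bias
    return max(0.0, pre)


def conditional_steer(resid: List[float], encoder_dir: List[float],
                      refusal_vec: List[float], threshold: float = 1.0,
                      coeff: float = 4.0) -> Tuple[List[float], bool]:
    """Add coeff * refusal_vec to resid only if the feature exceeds threshold.

    Returns the (possibly steered) vector and whether the switch triggered.
    """
    if feature_activation(resid, encoder_dir) <= threshold:
        return list(resid), False
    steered = [r + coeff * v for r, v in zip(resid, refusal_vec)]
    return steered, True


# Toy data: the "risky" feature reads dimension 0, refusal pushes dimension 1.
encoder_dir = [1.0, 0.0, 0.0]
refusal_vec = [0.0, 1.0, 0.0]
benign = [0.2, 0.5, -0.3]
risky = [2.5, 0.5, -0.3]

benign_out, benign_hit = conditional_steer(benign, encoder_dir, refusal_vec)
risky_out, risky_hit = conditional_steer(risky, encoder_dir, refusal_vec)
```

With these toy values the benign vector passes through unchanged, while the risky vector crosses the threshold and gets the refusal direction added. In a real setup the feature direction would come from a trained SAE and the addition would happen inside a forward hook on the model.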

app.py ADDED

+import gradio as gr
+def greet(name):
+    return "Hello " + name + "!!"
+demo = gr.Interface(fn=greet, inputs="text", outputs="text")
+demo.launch()

requirements.txt CHANGED

@@ -5,4 +5,5 @@ transformers
 sae-lens==3.18.2
 git+https://github.com/cyber-chris/activation_additions.git
 pandas
-bitsandbytes
+bitsandbytes
+gradio