cyber-chris committed on
Commit b0b4625 · 2 Parent(s): 36a864b 014ffc6

Merge branch 'main' of github-cyber-chris:cyber-chris/llm-dead-man-switch into main

Files changed (1)
  1. README.md +2 -0
README.md CHANGED
@@ -5,6 +5,8 @@ app_file: app.py
  pinned: false
  ---
 
+ ![dalle-llm-dead-mans-switch](https://github.com/user-attachments/assets/29245c41-8796-4b59-9842-157cb78f9142)
+
  ## Dead Man's Switch for LLMs
 
  In cases where we don't want to risk relying on RLHF to teach the model to refuse, we could leverage the model's own understanding of risky behaviours (through SAE extracted features) and selectively steer the model towards refusal (by injecting activation vectors) under certain circumstances.
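
The README paragraph in this diff describes conditional steering: read an SAE feature that tracks risky behaviour, and inject a refusal vector only when that feature fires. The sketch below is a toy illustration of that control flow in plain Python, not the repository's implementation. All names (`feature_activation`, `maybe_steer`, `enc_dir`, `refusal_vec`, `threshold`, `scale`) and the two-dimensional vectors are assumptions made for illustration; a real setup would hook a model's residual stream and use a trained SAE.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def feature_activation(resid: Vector, enc_dir: Vector, enc_bias: float) -> float:
    """Toy single-feature SAE encoder: ReLU(resid . enc_dir + enc_bias)."""
    return max(0.0, dot(resid, enc_dir) + enc_bias)


def maybe_steer(
    resid: Vector,
    enc_dir: Vector,
    enc_bias: float,
    refusal_vec: Vector,
    threshold: float,
    scale: float,
) -> Vector:
    """Add scale * refusal_vec to resid only if the feature exceeds threshold."""
    act = feature_activation(resid, enc_dir, enc_bias)
    if act <= threshold:
        # Feature is quiet: leave the activation untouched.
        return list(resid)
    # Feature fired: push the activation along the (hypothetical) refusal direction.
    return [r + scale * v for r, v in zip(resid, refusal_vec)]


# Hypothetical values, chosen only to exercise both branches.
enc_dir = [1.0, 0.0]
refusal_vec = [0.0, 1.0]
steered = maybe_steer([2.0, 0.0], enc_dir, -0.5, refusal_vec, threshold=1.0, scale=4.0)
untouched = maybe_steer([0.2, 0.3], enc_dir, -0.5, refusal_vec, threshold=1.0, scale=4.0)
```

Gating on a threshold keeps benign inputs unmodified, which is the "selectively steer" part of the description; the threshold and scale would need tuning against real activations.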