stecas committed
Commit 9b2cf4b · verified · 1 Parent(s): 8550dc9

Update README.md

Files changed (1): README.md (+86 -1)

README.md (updated):

colorFrom: indigo
colorTo: indigo
sdk: static
pinned: false
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/6466a046326128fd2c6c59c2/rlGxR2jD815pERRdNHxGM.png
---

# Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities

Zora Che*, Stephen Casper*, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell

Paper: COMING SOON

BibTeX:
```
COMING SOON
```

## Paper Abstract

Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks.
Currently, most risk evaluations are conducted by searching for inputs that elicit harmful behaviors from the system.
However, a fundamental limitation of this approach is that the harmfulness of the behaviors identified during any particular evaluation can only lower bound the model's worst-possible-case behavior.
As a complementary method for eliciting harmful behaviors, we propose evaluating LLMs with model tampering attacks, which allow modifications to latent activations or weights.
We pit state-of-the-art techniques for removing harmful LLM capabilities against a suite of 5 input-space and 6 model tampering attacks.
In addition to benchmarking these methods against each other, we show that (1) model resilience to capability elicitation attacks lies on a low-dimensional robustness subspace; (2) the attack success rate of model tampering attacks can empirically predict and offer conservative estimates for the success of held-out input-space attacks; and (3) state-of-the-art unlearning methods can easily be undone within 16 steps of fine-tuning.
Together, these results highlight the difficulty of removing harmful LLM capabilities and show that model tampering attacks enable substantially more rigorous evaluations of vulnerabilities than input-space attacks alone.
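
As a concrete illustration of finding (3), the simplest model tampering attack in this family is a short supervised fine-tuning run on text from the unlearned domain. The sketch below is a minimal, hypothetical version of such a "relearning" attack; the checkpoint id, data, and hyperparameters are placeholders, not the paper's exact setup.

```python
# Minimal sketch of a few-step fine-tuning ("relearning") attack on an
# unlearned model. All names, data, and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-org/unlearned-llama-3-8b"  # hypothetical checkpoint id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
texts = ["A held-out passage of dual-use biology text ..."] * 4  # placeholder data

for step in range(16):  # the paper finds unlearning can be undone within 16 steps
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # ignore padding in the loss
    loss = model(**batch, labels=labels).loss  # standard next-token loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```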

## Info

This space contains 64 models. All are versions of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) that have been fine-tuned using various machine unlearning methods to unlearn dual-use biology knowledge, as measured by the [WMDP-Bio](https://www.wmdp.ai/) benchmark.
The goal of unlearning WMDP-Bio knowledge from these models is to (1) make them incapable of correctly answering questions related to bioweapons creation and (2) preserve their capabilities on all other tasks.
See the paper for details.
We used 8 unlearning methods:
* **Gradient Difference (GradDiff)**, [(Liu et al., 2022)](https://arxiv.org/abs/2203.12817)
* **Representation Misdirection for Unlearning (RMU)**, [(Li et al., 2024)](https://arxiv.org/abs/2403.03218)
* **RMU with Latent Adversarial Training (RMU+LAT)**, [(Sheshadri et al., 2024)](https://arxiv.org/abs/2407.15549)
* **Representation Noising (RepNoise)**, [(Rosati et al., 2024)](https://arxiv.org/abs/2405.14577)
* **Erasure of Language Memory (ELM)**, [(Gandikota et al., 2024)](https://arxiv.org/abs/2410.02760)
* **Representation Rerouting (RR)**, [(Zou et al., 2024)](https://arxiv.org/html/2406.04313v1)
* **Tamper Attack Resistance (TAR)**, [(Tamirisa et al., 2024)](https://arxiv.org/abs/2408.00761)
* **PullBack & proJect (PB&J)**, (Anonymous, 2025)

We saved 8 evenly spaced checkpoints from each of the 8 methods, for a total of 64 models.

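Each checkpoint loads like any other `transformers` causal language model. The repository id below is a hypothetical placeholder; substitute one of the model ids listed in this organization.

```python
# Minimal loading sketch. The repo id is a placeholder, not a real model id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-org/elm-wmdp-bio-checkpoint-8"  # hypothetical
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```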

## Evaluation

Good unlearning must balance removal of harmful capabilities with preservation of general capabilities, so we evaluated models on multiple benchmarks (an example evaluation run is sketched after the list):
* **WMDP-Bio** (Bio capabilities)
* **MMLU** (General capabilities)
* **AGIEval** (General capabilities)
* **MT-Bench** (General capabilities)

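The multiple-choice benchmarks (WMDP-Bio, MMLU, AGIEval) can be scored with EleutherAI's lm-evaluation-harness; MT-Bench instead uses an LLM judge and a separate pipeline. The paper does not specify its exact evaluation stack, so treat the snippet below as an illustrative sketch; the model id is a placeholder.

```python
# Illustrative scoring of the multiple-choice benchmarks with
# lm-evaluation-harness (pip install lm-eval). Task names assume the
# harness's built-in task registry; the model id is hypothetical.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/elm-wmdp-bio-checkpoint-8,dtype=bfloat16",
    tasks=["wmdp_bio", "mmlu", "agieval"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```
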
We then calculated an unlearning score, which gives a normalized measure of how disproportionately WMDP-Bio performance drops compared to general capabilities.

$$
S_{\text{unlearn}}(M') =
\frac{
\underbrace{\left[S_{\text{WMDP}}(M) - S_{\text{WMDP}}(M')\right]}_{\Delta \text{Unlearn efficacy}}
-
\underbrace{\left[S_{\text{utility}}(M) - S_{\text{utility}}(M')\right]}_{\Delta \text{Utility degradation}}
}{
\underbrace{\left[S_{\text{WMDP}}(M) - S_{\text{WMDP}}(\text{rand})\right]}_{\Delta \text{Random chance (for normalization)}}
}
$$

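Here $M$ is the original model, $M'$ is the unlearned model, and $S_{\text{WMDP}}(\text{rand})$ is the random-guess WMDP-Bio score (0.25). As a sanity check, here is a direct translation of the formula into code, under our illustrative assumption that $S_{\text{utility}}$ averages MMLU, MT-Bench/10, and AGIEval (see the paper for the exact aggregation):

```python
# Direct translation of the unlearning-score formula. Averaging MMLU,
# MT-Bench/10, and AGIEval into S_utility is an assumption for illustration.
def unlearning_score(wmdp_base, wmdp_new, utility_base, utility_new, wmdp_rand=0.25):
    unlearn_efficacy = wmdp_base - wmdp_new           # drop in WMDP-Bio score
    utility_degradation = utility_base - utility_new  # drop in general utility
    return (unlearn_efficacy - utility_degradation) / (wmdp_base - wmdp_rand)

# Rough check against ELM's row in the table below (reported score: 0.95).
base_utility = (0.64 + 0.78 + 0.41) / 3  # Llama3 8B Instruct
elm_utility = (0.59 + 0.76 + 0.37) / 3
print(round(unlearning_score(0.70, 0.24, base_utility, elm_utility), 2))  # ~0.94
```
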
See complete details in the paper, where we also present results from evaluating these methods under 11 attacks.

We report results for the checkpoint from each method with the highest unlearning score.
We report original WMDP-Bio performance, worst-case WMDP-Bio performance after attack, and three measures of general utility: MMLU, MT-Bench, and AGIEval.
For all benchmarks, the random-guess baseline is 0.25, except for MT-Bench/10, where it is 0.1.
Representation Rerouting (RR) has the best unlearning score.
No model has a WMDP-Bio performance below 0.36 after the most effective attack.
We note that the GradDiff and TAR models performed very poorly, often struggling with basic fluency.

| **Method** | **WMDP ↓** | **WMDP, Best Input Attack ↓** | **WMDP, Best Tamp. Attack ↓** | **MMLU ↑** | **MT-Bench/10 ↑** | **AGIEval ↑** | **Unlearning Score ↑** |
|------------|------------|-------------------------------|-------------------------------|------------|-------------------|---------------|------------------------|
| Llama3 8B Instruct | 0.70 | 0.75 | 0.71 | 0.64 | 0.78 | 0.41 | 0.00 |
| **GradDiff** | 0.25 | 0.27 | 0.67 | 0.52 | 0.13 | 0.32 | 0.17 |
| **RMU** | 0.26 | 0.34 | 0.57 | 0.59 | 0.68 | 0.42 | 0.84 |
| **RMU + LAT** | 0.32 | 0.39 | 0.64 | 0.60 | 0.71 | 0.39 | 0.73 |
| **RepNoise** | 0.29 | 0.30 | 0.65 | 0.59 | 0.71 | 0.37 | 0.78 |
| **ELM** | 0.24 | 0.38 | 0.71 | 0.59 | 0.76 | 0.37 | 0.95 |
| **RR** | 0.26 | 0.28 | 0.66 | 0.61 | 0.76 | 0.44 | **0.96** |
| **TAR** | 0.28 | 0.29 | 0.36 | 0.54 | 0.12 | 0.31 | 0.09 |
| **PB&J** | 0.31 | 0.32 | 0.64 | 0.63 | 0.78 | 0.40 | 0.85 |

### Full Results