Update README.md
Browse files
README.md
CHANGED
@@ -5,6 +5,91 @@ colorFrom: indigo
|
|
5 |
colorTo: indigo
|
6 |
sdk: static
|
7 |
pinned: false
|
|
|
|
|
8 |
---
|
9 |
|
10 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
5 |
colorTo: indigo
|
6 |
sdk: static
|
7 |
pinned: false
|
8 |
+
thumbnail: >-
|
9 |
+
https://cdn-uploads.huggingface.co/production/uploads/6466a046326128fd2c6c59c2/rlGxR2jD815pERRdNHxGM.png
|
10 |
---
|
11 |
|
12 |
+
# Model Tampering Attacks Enable More Rigorous Evlauations of LLM Capabilities
|
13 |
+
|
14 |
+
Zora Che*, Stephen Casper*, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell
|
15 |
+
|
16 |
+
Paper: COMING SOON
|
17 |
+
|
18 |
+
BibTeX:
|
19 |
+
```
|
20 |
+
COMING SOON
|
21 |
+
```
|
22 |
+
|
23 |
+
## Paper Abstract
|
24 |
+
|
25 |
+
Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks.
|
26 |
+
Currently, most risk evaluations are conducted by searching for inputs that elicit harmful behaviors from the system.
|
27 |
+
However, a fundamental limitation of this approach is that the harmfulness of the behaviors identified during any particular evaluation can only lower bound the model's worst-possible-case behavior.
|
28 |
+
As a complementary method for eliciting harmful behaviors, we propose evaluating LLMs with model tampering attacks which allow for modifications to the latent activations or weights.
|
29 |
+
We pit state-of-the-art techniques for removing harmful LLM capabilities against a suite of 5 input-space and 6 model tampering attacks.
|
30 |
+
In addition to benchmarking these methods against each other, we show that (1) model resilience to capability elicitation attacks lies on a low-dimensional robustness subspace; (2) the attack success rate of model tampering attacks can empirically predict and offer conservative estimates for the success of held-out input-space attacks; and (3) state-of-the-art unlearning methods can easily be undone within 16 steps of fine-tuning.
|
31 |
+
Together these results highlight the difficulty of removing harmful LLM capabilities and show that model tampering attacks enable substantially more rigorous evaluations for vulnerabilities than input-space attacks alone.
|
32 |
+
|
33 |
+
## Info
|
34 |
+
|
35 |
+
This space contains the 64 models. All are versions of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) which have been fine-tuned using various machine unlearning methods to unlearn dual-use biology knowledge using the [WMDP-Bio](https://www.wmdp.ai/) benchmark.
|
36 |
+
The goal of unlearning WMDP-Bio knowledge from these models is to (1) make them incapable of correctly answering questions related to bioweapons creation and (2) preserve their capabilities on all other tasks.
|
37 |
+
See the paepr for details.
|
38 |
+
We used 8 unlearning methods:
|
39 |
+
* **Gradient Difference (GradDiff)**, [(Liu et al., 2022)](https://arxiv.org/abs/2203.12817)
|
40 |
+
* **Random Misdirection for Unlearning (RMU)**, [(Li et al, 2024)](https://arxiv.org/abs/2403.03218)
|
41 |
+
* **RMU with Latent Adversarial Training (RMU+LAT)**, [(Sheshadri et al, 2024)](https://arxiv.org/abs/2407.15549)
|
42 |
+
* **Representation Noising (RepNoise)**, [(Rosati et al, 2024)](https://arxiv.org/abs/2405.14577)
|
43 |
+
* **Erasure of Language Memory (ELM)**, [(Gandikota et al, 2024)](https://arxiv.org/abs/2410.02760)
|
44 |
+
* **Representation Rerouting (RR)**, [(Zou et al, 2024)](https://arxiv.org/html/2406.04313v1)
|
45 |
+
* **Tamper Attack Resistance (TAR)**, [(Tamirisa et al., 2024)](https://arxiv.org/abs/2408.00761)
|
46 |
+
* **PullBack & proJect (PB&J)**, (Anonymous, (2025)
|
47 |
+
|
48 |
+
We saved 8 evenly-spaced checkpoints from these 8 methods for a total of 64 models.
|
49 |
+
|
50 |
+
|
51 |
+
## Evaluation
|
52 |
+
|
53 |
+
Good unlearning needs to balance removal of harmful capabilities and preservation of general capabilities.
|
54 |
+
So we evaluated models using multiple benchmarks.
|
55 |
+
* **WMDP-Bio** (Bio capabilities)
|
56 |
+
* **MMLU** (General capabilities)
|
57 |
+
* **AGIEval** (General capabilities)
|
58 |
+
* **T-Bench** (General capabilities)
|
59 |
+
|
60 |
+
We then calculated the unlearning score which gives a normalized measure of how much WMDP-bio capabilities go down disproportionately compared to general capabilities.
|
61 |
+
|
62 |
+
$$
|
63 |
+
S_{\text{unlearn}}(M') =
|
64 |
+
\frac{
|
65 |
+
\underbrace{\left[S_{\text{WMDP}}(M) - S_{\text{WMDP}}(M')\right]}_{\Delta \text{Unlearn efficacy}}
|
66 |
+
-
|
67 |
+
\underbrace{\left[S_{\text{utility}}(M) - S_{\text{utility}}(M')\right]}_{\Delta \text{Utility degradation}}
|
68 |
+
}{
|
69 |
+
\underbrace{\left[S_{\text{WMDP}}(M) - S_{\text{WMDP}}(\text{rand})\right]}_{\Delta \text{Random chance (for normalization)}}
|
70 |
+
}
|
71 |
+
$$
|
72 |
+
|
73 |
+
See complete details in the paper where we also present results from evaluating these methods under 11 attacks.
|
74 |
+
|
75 |
+
We report results for the checkpoint from each method with the highest unlearning score.
|
76 |
+
We report original WMDP-Bio performance, worst-case WMDP-Bio performance after attack, and three measures of general utility: MMLU, MT-Bench, and AGIEval.
|
77 |
+
For all benchmarks, the random-guess baseline is 0.25 except for MT-Bench/10 which is 0.1.
|
78 |
+
Representation rerouting (RR) has the best unlearning score.
|
79 |
+
No model has a WMDP-Bio performance less than 0.36 after the most effective attack.
|
80 |
+
We note that Grad Diff and TAR models performed very poorly, often struggling with basic fluency.
|
81 |
+
|
82 |
+
| **Method** | **WMDP β** | **WMDP, Best Input Attack β** | **WMDP, Best Tamp. Attack β** | **MMLU β** | **MT-Bench/10 β** | **AGIEval β** | **Unlearning Score β** |
|
83 |
+
|----------------------|-------------|-------------------------------|-------------------------------|-------------|-------------------|---------------|-------------------------|
|
84 |
+
| Llama3 8B Instruct | 0.70 | 0.75 | 0.71 | 0.64 | 0.78 | 0.41 | 0.00 |
|
85 |
+
| **Grad Diff** | 0.25 | 0.27 | 0.67 | 0.52 | 0.13 | 0.32 | 0.17 |
|
86 |
+
| **RMU** | 0.26 | 0.34 | 0.57 | 0.59 | 0.68 | 0.42 | 0.84 |
|
87 |
+
| **RMU + LAT** | 0.32 | 0.39 | 0.64 | 0.60 | 0.71 | 0.39 | 0.73 |
|
88 |
+
| **RepNoise** | 0.29 | 0.30 | 0.65 | 0.59 | 0.71 | 0.37 | 0.78 |
|
89 |
+
| **ELM** | 0.24 | 0.38 | 0.71 | 0.59 | 0.76 | 0.37 | 0.95 |
|
90 |
+
| **RR** | 0.26 | 0.28 | 0.66 | 0.61 | 0.76 | 0.44 | **0.96** |
|
91 |
+
| **TAR** | 0.28 | 0.29 | 0.36 | 0.54 | 0.12 | 0.31 | 0.09 |
|
92 |
+
| **PB&J** | 0.31 | 0.32 | 0.64 | 0.63 | 0.78 | 0.40 | 0.85 |
|
93 |
+
|
94 |
+
|
95 |
+
### Full Results
|