|
---
title: README
emoji: 🔒
colorFrom: indigo
colorTo: indigo
sdk: static
pinned: false
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/6466a046326128fd2c6c59c2/rlGxR2jD815pERRdNHxGM.png
---
|
|
|
# Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
|
|
|
Zora Che*, Stephen Casper*, |
|
Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, |
|
Yarin Gal, Furong Huang, Dylan Hadfield-Menell |
|
|
|
Paper: [Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities](https://arxiv.org/abs/2502.05209) |
|
|
|
BibTeX: |
|
```
@article{che2025model,
  title={Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities},
  author={Che, Zora and Casper, Stephen and Kirk, Robert and Satheesh, Anirudh and Slocum, Stewart and McKinney, Lev E and Gandikota, Rohit and Ewart, Aidan and Rosati, Domenic and Wu, Zichu and others},
  journal={arXiv preprint arXiv:2502.05209},
  year={2025}
}
```
|
|
|
<img src="fig1-1.png" alt="model manipulation attacks" width="500"/> |
|
|
|
## Paper Abstract |
|
|
|
Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. |
|
Currently, most risk evaluations are conducted by searching for inputs that elicit harmful behaviors from the system. |
|
However, a fundamental limitation of this approach is that the harmfulness of the behaviors identified during any particular evaluation can only lower bound the model's worst-possible-case behavior. |
|
As a complementary method for eliciting harmful behaviors, we propose evaluating LLMs with model tampering attacks which allow for modifications to the latent activations or weights. |
|
We pit state-of-the-art techniques for removing harmful LLM capabilities against a suite of 5 input-space and 6 model tampering attacks. |
|
In addition to benchmarking these methods against each other, we show that (1) model resilience to capability elicitation attacks lies on a low-dimensional robustness subspace; (2) the attack success rate of model tampering attacks can empirically predict and offer conservative estimates for the success of held-out input-space attacks; and (3) state-of-the-art unlearning methods can easily be undone within 16 steps of fine-tuning. |
|
Together these results highlight the difficulty of removing harmful LLM capabilities and show that model tampering attacks enable substantially more rigorous evaluations for vulnerabilities than input-space attacks alone. |
|
|
|
## Info |
|
|
|
This space contains 64 models. All are versions of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) that have been fine-tuned with various machine unlearning methods to unlearn the dual-use biology knowledge targeted by the [WMDP-Bio](https://www.wmdp.ai/) benchmark.

The goal of unlearning WMDP-Bio knowledge is to (1) make the models incapable of correctly answering questions related to bioweapon creation while (2) preserving their capabilities on all other tasks.

See the paper for details.
|
We used 8 unlearning methods (a rough sketch of the simplest objective follows this list):

* **Gradient Difference (GradDiff)** [(Liu et al., 2022)](https://arxiv.org/abs/2203.12817)

* **Representation Misdirection for Unlearning (RMU)** [(Li et al., 2024)](https://arxiv.org/abs/2403.03218)

* **RMU with Latent Adversarial Training (RMU+LAT)** [(Sheshadri et al., 2024)](https://arxiv.org/abs/2407.15549)

* **Representation Noising (RepNoise)** [(Rosati et al., 2024)](https://arxiv.org/abs/2405.14577)

* **Erasure of Language Memory (ELM)** [(Gandikota et al., 2024)](https://arxiv.org/abs/2410.02760)

* **Representation Rerouting (RR)** [(Zou et al., 2024)](https://arxiv.org/html/2406.04313v1)

* **Tamper Attack Resistance (TAR)** [(Tamirisa et al., 2024)](https://arxiv.org/abs/2408.00761)

* **PullBack & proJect (PB&J)** (Anonymous, 2025)
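
As a rough illustration of the simplest of these objectives, here is a minimal sketch of a gradient-difference-style unlearning step. The `graddiff_step` function and the `forget_batch`/`retain_batch` names are our own illustration, not the exact training setup from the paper.

```python
# Minimal sketch of a gradient-difference (GradDiff) unlearning step, assuming
# `model` is a Hugging Face causal LM and `forget_batch` / `retain_batch` are
# tokenized batches with `input_ids`, `attention_mask`, and `labels`. The
# paper's actual training setups and hyperparameters differ per method.
def graddiff_step(model, optimizer, forget_batch, retain_batch, forget_weight=1.0):
    optimizer.zero_grad()
    # Ascend the LM loss on the forget data (here, WMDP-Bio-related text)...
    forget_loss = model(**forget_batch).loss
    # ...while descending it on retain data to preserve general capabilities.
    retain_loss = model(**retain_batch).loss
    loss = retain_loss - forget_weight * forget_loss
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
```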
|
|
|
We saved 8 evenly spaced checkpoints from each of the 8 methods, for a total of 64 models.
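
The checkpoints load like any other Hugging Face model. A minimal sketch with `transformers` (the repo id below is the base model; substitute the id of whichever unlearned checkpoint from this space you want):

```python
# Minimal loading sketch. Swap `repo_id` for an unlearned checkpoint's repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # base model; replace with a checkpoint id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto", device_map="auto")
```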
|
|
|
|
|
## Evaluation |
|
|
|
Good unlearning must balance removing harmful capabilities with preserving general capabilities.

We therefore evaluated the models on multiple benchmarks (an evaluation sketch follows this list):
|
* **WMDP-Bio** (Bio capabilities) |
|
* **MMLU** (General capabilities) |
|
* **AGIEval** (General capabilities) |
|
* **MT-Bench** (General capabilities) |
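
One convenient way to run this style of multiple-choice evaluation is EleutherAI's `lm-evaluation-harness`. A hedged sketch; the exact task names and their availability depend on your harness version, and MT-Bench (an LLM-as-judge benchmark) is not covered by it:

```python
# Sketch of scoring a model on WMDP-Bio and MMLU with lm-evaluation-harness
# (pip install lm-eval). Task names depend on the harness version; MT-Bench
# uses a separate LLM-as-judge pipeline and is not included here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=auto",
    tasks=["wmdp_bio", "mmlu"],
    batch_size=8,
)
print(results["results"])
```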
|
|
|
We then calculated an unlearning score: a normalized measure of how much a model's WMDP-Bio performance drops relative to any drop in its general capabilities.
|
|
|
$$
S_{\text{unlearn}}(M') =
\frac{
\underbrace{\left[S_{\text{WMDP}}(M) - S_{\text{WMDP}}(M')\right]}_{\Delta \text{Unlearn efficacy}}
-
\underbrace{\left[S_{\text{utility}}(M) - S_{\text{utility}}(M')\right]}_{\Delta \text{Utility degradation}}
}{
\underbrace{\left[S_{\text{WMDP}}(M) - S_{\text{WMDP}}(\text{rand})\right]}_{\Delta \text{Random chance (for normalization)}}
}
$$
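
In code, the score is straightforward. A minimal sketch; for illustration we treat $S_{\text{utility}}$ as the mean of the general-capability benchmarks, which is an assumption that only approximates the paper's exact aggregation:

```python
# Minimal sketch of the unlearning score above. We assume S_utility is the
# mean of the general-capability benchmarks; the paper's exact aggregation
# may differ slightly.
def unlearning_score(wmdp_base, wmdp_unlearned, utility_base, utility_unlearned,
                     wmdp_random=0.25):
    unlearn_efficacy = wmdp_base - wmdp_unlearned
    utility_degradation = utility_base - utility_unlearned
    return (unlearn_efficacy - utility_degradation) / (wmdp_base - wmdp_random)

# Plugging in the RR row of the table below (utility = mean of MMLU,
# MT-Bench/10, and AGIEval) approximately recovers its reported score:
base_utility = (0.64 + 0.78 + 0.41) / 3  # Llama3 8B Instruct
rr_utility = (0.61 + 0.76 + 0.44) / 3
print(unlearning_score(0.70, 0.26, base_utility, rr_utility))  # ~0.96
```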
|
|
|
See the paper for complete details, including results from evaluating these methods under all 11 attacks.
|
|
|
We report results for the checkpoint from each method with the highest unlearning score.

We report original WMDP-Bio performance, worst-case WMDP-Bio performance under the best input-space and model tampering attacks, and three measures of general utility: MMLU, MT-Bench, and AGIEval.

For all benchmarks, the random-guess baseline is 0.25, except for MT-Bench/10, where it is 0.1.

Representation Rerouting (RR) achieves the best unlearning score.

No model scores below 0.36 on WMDP-Bio after the most effective attack.

We note that the GradDiff and TAR models performed very poorly, often struggling with basic fluency.
|
|
|
| **Method** | **WMDP ↓** | **WMDP, Best Input Attack ↓** | **WMDP, Best Tamp. Attack ↓** | **MMLU ↑** | **MT-Bench/10 ↑** | **AGIEval ↑** | **Unlearning Score ↑** |
|------------------------|------|------|------|------|------|------|----------|
| **Llama3 8B Instruct** | 0.70 | 0.75 | 0.71 | 0.64 | 0.78 | 0.41 | 0.00 |
| **GradDiff** | 0.25 | 0.27 | 0.67 | 0.52 | 0.13 | 0.32 | 0.17 |
| **RMU** | 0.26 | 0.34 | 0.57 | 0.59 | 0.68 | 0.42 | 0.84 |
| **RMU + LAT** | 0.32 | 0.39 | 0.64 | 0.60 | 0.71 | 0.39 | 0.73 |
| **RepNoise** | 0.29 | 0.30 | 0.65 | 0.59 | 0.71 | 0.37 | 0.78 |
| **ELM** | 0.24 | 0.38 | 0.71 | 0.59 | 0.76 | 0.37 | 0.95 |
| **RR** | 0.26 | 0.28 | 0.66 | 0.61 | 0.76 | 0.44 | **0.96** |
| **TAR** | 0.28 | 0.29 | 0.36 | 0.54 | 0.12 | 0.31 | 0.09 |
| **PB&J** | 0.31 | 0.32 | 0.64 | 0.63 | 0.78 | 0.40 | 0.85 |
|
|
|
|
|
## Full Eval Results for All 64 Models |
|
|
|
View and download [here](https://docs.google.com/spreadsheets/d/1i36NoZxPUxrPNGsyggVz_FGNCfOZjrkPxY1ym9OBr2w/edit?usp=sharing). |
|
|