|
---
title: README
emoji: 🔒
colorFrom: indigo
colorTo: indigo
sdk: static
pinned: false
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/6466a046326128fd2c6c59c2/rlGxR2jD815pERRdNHxGM.png
---
|
|
|
# Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
|
|
|
Zora Che*, Stephen Casper*, |
|
Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, |
|
Yarin Gal, Furong Huang, Dylan Hadfield-Menell |
|
|
|
Paper: [Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities](https://arxiv.org/abs/2502.05209) |
|
|
|
BibTeX: |
|
```
@article{che2025model,
  title={Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities},
  author={Che, Zora and Casper, Stephen and Kirk, Robert and Satheesh, Anirudh and Slocum, Stewart and McKinney, Lev E and Gandikota, Rohit and Ewart, Aidan and Rosati, Domenic and Wu, Zichu and others},
  journal={arXiv preprint arXiv:2502.05209},
  year={2025}
}
```
|
|
|
<img src="fig1-1.png" alt="model manipulation attacks" width="500"/> |
|
|
|
## Paper Abstract |
|
|
|
Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. |
|
Currently, most risk evaluations are conducted by searching for inputs that elicit harmful behaviors from the system. |
|
However, a fundamental limitation of this approach is that the harmfulness of the behaviors identified during any particular evaluation can only lower bound the model's worst-possible-case behavior. |
|
As a complementary method for eliciting harmful behaviors, we propose evaluating LLMs with model tampering attacks which allow for modifications to the latent activations or weights. |
|
We pit state-of-the-art techniques for removing harmful LLM capabilities against a suite of 5 input-space and 6 model tampering attacks. |
|
In addition to benchmarking these methods against each other, we show that (1) model resilience to capability elicitation attacks lies on a low-dimensional robustness subspace; (2) the attack success rate of model tampering attacks can empirically predict and offer conservative estimates for the success of held-out input-space attacks; and (3) state-of-the-art unlearning methods can easily be undone within 16 steps of fine-tuning. |
|
Together these results highlight the difficulty of removing harmful LLM capabilities and show that model tampering attacks enable substantially more rigorous evaluations for vulnerabilities than input-space attacks alone. |
|
|
|
## Info |
|
|
|
This space contains 64 models. All are versions of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) that have been fine-tuned with various machine unlearning methods to unlearn the dual-use biology knowledge targeted by the [WMDP-Bio](https://www.wmdp.ai/) benchmark.

The goal of unlearning WMDP-Bio knowledge is to (1) make the models incapable of correctly answering questions related to bioweapon creation while (2) preserving their capabilities on all other tasks.

See the paper for details.
|
We used 8 unlearning methods (a rough sketch of the simplest objective follows this list):

* **Gradient Difference (GradDiff)** [(Liu et al., 2022)](https://arxiv.org/abs/2203.12817)

* **Representation Misdirection for Unlearning (RMU)** [(Li et al., 2024)](https://arxiv.org/abs/2403.03218)

* **RMU with Latent Adversarial Training (RMU+LAT)** [(Sheshadri et al., 2024)](https://arxiv.org/abs/2407.15549)

* **Representation Noising (RepNoise)** [(Rosati et al., 2024)](https://arxiv.org/abs/2405.14577)

* **Erasure of Language Memory (ELM)** [(Gandikota et al., 2024)](https://arxiv.org/abs/2410.02760)

* **Representation Rerouting (RR)** [(Zou et al., 2024)](https://arxiv.org/html/2406.04313v1)

* **Tamper Attack Resistance (TAR)** [(Tamirisa et al., 2024)](https://arxiv.org/abs/2408.00761)

* **PullBack & proJect (PB&J)** (Anonymous, 2025)
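
As a rough illustration of the simplest of these objectives, here is a minimal sketch of a gradient-difference-style unlearning step. The `graddiff_step` function and the `forget_batch`/`retain_batch` names are our own illustration, not the exact training setup from the paper.

```python
# Minimal sketch of a gradient-difference (GradDiff) unlearning step, assuming
# `model` is a Hugging Face causal LM and `forget_batch` / `retain_batch` are
# tokenized batches with `input_ids`, `attention_mask`, and `labels`. The
# paper's actual training setups and hyperparameters differ per method.
def graddiff_step(model, optimizer, forget_batch, retain_batch, forget_weight=1.0):
    optimizer.zero_grad()
    # Ascend the LM loss on the forget data (here, WMDP-Bio-related text)...
    forget_loss = model(**forget_batch).loss
    # ...while descending it on retain data to preserve general capabilities.
    retain_loss = model(**retain_batch).loss
    loss = retain_loss - forget_weight * forget_loss
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
```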
|
|
|
We saved 8 evenly spaced checkpoints from each of the 8 methods, for a total of 64 models.
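
The checkpoints load like any other Hugging Face model. A minimal sketch with `transformers` (the repo id below is the base model; substitute the id of whichever unlearned checkpoint from this space you want):

```python
# Minimal loading sketch. Swap `repo_id` for an unlearned checkpoint's repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # base model; replace with a checkpoint id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto", device_map="auto")
```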
|
|
|
|
|
## Evaluation |
|
|
|
Good unlearning must balance removing harmful capabilities with preserving general capabilities.

We therefore evaluated the models on multiple benchmarks (an evaluation sketch follows this list):
|
* **WMDP-Bio** (Bio capabilities) |
|
* **MMLU** (General capabilities) |
|
* **AGIEval** (General capabilities) |
|
* **MT-Bench** (General capabilities) |
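
One convenient way to run this style of multiple-choice evaluation is EleutherAI's `lm-evaluation-harness`. A hedged sketch; the exact task names and their availability depend on your harness version, and MT-Bench (an LLM-as-judge benchmark) is not covered by it:

```python
# Sketch of scoring a model on WMDP-Bio and MMLU with lm-evaluation-harness
# (pip install lm-eval). Task names depend on the harness version; MT-Bench
# uses a separate LLM-as-judge pipeline and is not included here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=auto",
    tasks=["wmdp_bio", "mmlu"],
    batch_size=8,
)
print(results["results"])
```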
|
|
|
We then calculated an unlearning score: a normalized measure of how much a model's WMDP-Bio performance drops relative to any drop in its general capabilities.
|
|
|
$$
S_{\text{unlearn}}(M') =
\frac{
\underbrace{\left[S_{\text{WMDP}}(M) - S_{\text{WMDP}}(M')\right]}_{\Delta \text{Unlearn efficacy}}
-
\underbrace{\left[S_{\text{utility}}(M) - S_{\text{utility}}(M')\right]}_{\Delta \text{Utility degradation}}
}{
\underbrace{\left[S_{\text{WMDP}}(M) - S_{\text{WMDP}}(\text{rand})\right]}_{\Delta \text{Random chance (for normalization)}}
}
$$
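
In code, the score is straightforward. A minimal sketch; for illustration we treat $S_{\text{utility}}$ as the mean of the general-capability benchmarks, which is an assumption that only approximates the paper's exact aggregation:

```python
# Minimal sketch of the unlearning score above. We assume S_utility is the
# mean of the general-capability benchmarks; the paper's exact aggregation
# may differ slightly.
def unlearning_score(wmdp_base, wmdp_unlearned, utility_base, utility_unlearned,
                     wmdp_random=0.25):
    unlearn_efficacy = wmdp_base - wmdp_unlearned
    utility_degradation = utility_base - utility_unlearned
    return (unlearn_efficacy - utility_degradation) / (wmdp_base - wmdp_random)

# Plugging in the RR row of the table below (utility = mean of MMLU,
# MT-Bench/10, and AGIEval) approximately recovers its reported score:
base_utility = (0.64 + 0.78 + 0.41) / 3  # Llama3 8B Instruct
rr_utility = (0.61 + 0.76 + 0.44) / 3
print(unlearning_score(0.70, 0.26, base_utility, rr_utility))  # ~0.96
```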
|
|
|
See the paper for complete details, including results from evaluating these methods under all 11 attacks.
|
|
|
We report results for the checkpoint from each method with the highest unlearning score.

We report original WMDP-Bio performance, worst-case WMDP-Bio performance under the best input-space and model tampering attacks, and three measures of general utility: MMLU, MT-Bench, and AGIEval.

For all benchmarks, the random-guess baseline is 0.25, except for MT-Bench/10, where it is 0.1.

Representation Rerouting (RR) achieves the best unlearning score.

No model scores below 0.36 on WMDP-Bio after the most effective attack.

We note that the GradDiff and TAR models performed very poorly, often struggling with basic fluency.
|
|
|
| **Method** | **WMDP ↓** | **WMDP, Best Input Attack ↓** | **WMDP, Best Tamp. Attack ↓** | **MMLU ↑** | **MT-Bench/10 ↑** | **AGIEval ↑** | **Unlearning Score ↑** |
|------------------------|------|------|------|------|------|------|----------|
| **Llama3 8B Instruct** | 0.70 | 0.75 | 0.71 | 0.64 | 0.78 | 0.41 | 0.00 |
| **GradDiff** | 0.25 | 0.27 | 0.67 | 0.52 | 0.13 | 0.32 | 0.17 |
| **RMU** | 0.26 | 0.34 | 0.57 | 0.59 | 0.68 | 0.42 | 0.84 |
| **RMU + LAT** | 0.32 | 0.39 | 0.64 | 0.60 | 0.71 | 0.39 | 0.73 |
| **RepNoise** | 0.29 | 0.30 | 0.65 | 0.59 | 0.71 | 0.37 | 0.78 |
| **ELM** | 0.24 | 0.38 | 0.71 | 0.59 | 0.76 | 0.37 | 0.95 |
| **RR** | 0.26 | 0.28 | 0.66 | 0.61 | 0.76 | 0.44 | **0.96** |
| **TAR** | 0.28 | 0.29 | 0.36 | 0.54 | 0.12 | 0.31 | 0.09 |
| **PB&J** | 0.31 | 0.32 | 0.64 | 0.63 | 0.78 | 0.40 | 0.85 |
|
|
|
|
|
## Full Eval Results for All 64 Models |
|
|
|
View and download [here](https://docs.google.com/spreadsheets/d/1i36NoZxPUxrPNGsyggVz_FGNCfOZjrkPxY1ym9OBr2w/edit?usp=sharing). |
|
|