Papers
arXiv:2501.17433

Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation

Published on Jan 29
· Submitted by TianshengHuang on Jan 30

Abstract

Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks -- models lose their safety alignment after fine-tuning on just a few harmful samples. For risk mitigation, a guardrail is typically used to filter out harmful samples before fine-tuning. By designing a new red-teaming method, we show in this paper that relying purely on the moderation guardrail for data filtration is not reliable. Our proposed attack method, dubbed Virus, easily bypasses guardrail moderation by slightly modifying the harmful data. Experimental results show that the harmful data optimized by Virus is not detectable by the guardrail, with up to a 100% leakage ratio, while simultaneously achieving superior attack performance. Finally, the key message we want to convey through this paper is this: it is reckless to treat guardrail moderation as a straw to clutch at against harmful fine-tuning attacks, as it cannot solve the inherent safety issue of pre-trained LLMs. Our code is available at https://github.com/git-disl/Virus

Community

Paper author Paper submitter

Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation

Last year, Qi et al. (2023) showed by red-teaming that ordinary users can upload as few as 10 harmful samples through OpenAI's fine-tuning API to break down the safety alignment of GPT-4 and elicit harmful behaviors. However, that attack no longer succeeds today. OpenAI fixed the issue quickly: they put a guardrail model in place to moderate the data uploaded by users and remove harmful samples before streaming them to the actual fine-tuning API.
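
For illustration, the sketch below shows roughly what such a server-side moderation step looks like. The `is_flagged_harmful` check is a hypothetical placeholder for a real guardrail model (e.g., a Llama Guard-style classifier); it is not OpenAI's actual implementation, which is not public.

```python
# Minimal sketch of guardrail moderation before fine-tuning.
# `is_flagged_harmful` is a toy placeholder for a real moderation model.

HARMFUL_MARKERS = ["how to build a bomb", "steal credit card"]  # toy stand-in

def is_flagged_harmful(sample: dict) -> bool:
    """Placeholder guardrail: flag a sample if it contains a toy marker."""
    text = (sample["instruction"] + " " + sample["response"]).lower()
    return any(marker in text for marker in HARMFUL_MARKERS)

def moderate_uploaded_dataset(samples: list[dict]) -> list[dict]:
    """Drop samples the guardrail flags; only the rest reach fine-tuning."""
    return [s for s in samples if not is_flagged_harmful(s)]

uploaded = [
    {"instruction": "Summarize this article.", "response": "..."},
    {"instruction": "Explain how to build a bomb.", "response": "..."},
]
clean = moderate_uploaded_dataset(uploaded)
# Only `clean` would be streamed to the actual fine-tuning job.
```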

Is the fine-tuning service safe by now?

The answer is no! Our paper, submitted to the daily collection, shows that it is still possible to bypass the guardrail and break down the safety alignment of the victim LLM. The proposed attack method, named Virus, constructs attack samples that the guardrail fails to detect, with up to a 100% leakage ratio, while simultaneously achieving superior attack performance against the victim LLM.
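
To make the metric concrete, the leakage ratio can be read as the fraction of attack samples that pass guardrail moderation undetected. A minimal sketch follows, where `guardrail_flags` is a hypothetical stand-in for a real moderation model.

```python
# Hypothetical sketch of the leakage-ratio metric: the fraction of
# attack samples that leak through guardrail moderation undetected.

def guardrail_flags(sample: str) -> bool:
    """Placeholder: return True if the guardrail would reject `sample`."""
    return "attack-marker" in sample  # toy heuristic, not a real guardrail

def leakage_ratio(attack_samples: list[str]) -> float:
    """Share of attack samples that are NOT flagged by the guardrail."""
    leaked = sum(not guardrail_flags(s) for s in attack_samples)
    return leaked / len(attack_samples)

print(leakage_ratio(["benign-looking sample A", "benign-looking sample B"]))  # 1.0
# A 100% leakage ratio means every optimized attack sample evades moderation.
```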

Recently we have witnessed an increasing number of guardrail products appearing on the market, e.g., LlamaGuard, IBM Granite Guardian, etc. However, we remain skeptical that relying on these one-shot filtering tools can solve the LLM's inherent safety issue.

The short caveat we want to convey with this paper is this: it is reckless to treat guardrail moderation as a straw to clutch at for LLM safety, and it never should be.

Qi, Xiangyu, et al. "Fine-tuning aligned language models compromises safety, even when users do not intend to!" arXiv preprint arXiv:2310.03693 (2023).

