Papers
arXiv:2501.17433

Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation

Published on Jan 29
· Submitted by TianshengHuang on Jan 30

Abstract

Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks -- models lose their safety alignment after fine-tuning on just a few harmful samples. For risk mitigation, a guardrail is typically used to filter out harmful samples before fine-tuning. By designing a new red-teaming method, we show in this paper that relying purely on the moderation guardrail for data filtration is not reliable. Our proposed attack method, dubbed Virus, easily bypasses guardrail moderation by slightly modifying the harmful data. Experimental results show that the harmful data optimized by Virus is not detectable by the guardrail, with up to a 100% leakage ratio, while simultaneously achieving superior attack performance. Finally, the key message we want to convey through this paper is this: it is reckless to treat guardrail moderation as a straw to clutch at against harmful fine-tuning attacks, as it cannot solve the inherent safety issue of pre-trained LLMs. Our code is available at https://github.com/git-disl/Virus

Community

Paper author Paper submitter

Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation

Last year, Qi et al. (2023) showed by red-teaming that ordinary users can upload as few as 10 harmful samples through OpenAI's fine-tuning API to break down the safety alignment of GPT-4 and elicit harmful behaviors. However, that attack no longer succeeds today. OpenAI fixed the issue quickly: they put a guardrail model in place to moderate the data uploaded by users and remove harmful samples before streaming them to the actual fine-tuning API.
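
For illustration, the sketch below shows roughly what such a server-side moderation step looks like. The `is_flagged_harmful` check is a hypothetical placeholder for a real guardrail model (e.g., a Llama Guard-style classifier); it is not OpenAI's actual implementation, which is not public.

```python
# Minimal sketch of guardrail moderation before fine-tuning.
# `is_flagged_harmful` is a toy placeholder for a real moderation model.

HARMFUL_MARKERS = ["how to build a bomb", "steal credit card"]  # toy stand-in

def is_flagged_harmful(sample: dict) -> bool:
    """Placeholder guardrail: flag a sample if it contains a toy marker."""
    text = (sample["instruction"] + " " + sample["response"]).lower()
    return any(marker in text for marker in HARMFUL_MARKERS)

def moderate_uploaded_dataset(samples: list[dict]) -> list[dict]:
    """Drop samples the guardrail flags; only the rest reach fine-tuning."""
    return [s for s in samples if not is_flagged_harmful(s)]

uploaded = [
    {"instruction": "Summarize this article.", "response": "..."},
    {"instruction": "Explain how to build a bomb.", "response": "..."},
]
clean = moderate_uploaded_dataset(uploaded)
# Only `clean` would be streamed to the actual fine-tuning job.
```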

Is the fine-tuning service safe by now?

The answer is no! Our paper, submitted to the daily collection, shows that it is still possible to bypass the guardrail and break down the safety alignment of the victim LLM. The proposed attack method, named Virus, constructs attack samples that the guardrail fails to detect, with up to a 100% leakage ratio, while simultaneously achieving superior attack performance against the victim LLM.
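
To make the metric concrete, the leakage ratio can be read as the fraction of attack samples that pass guardrail moderation undetected. A minimal sketch follows, where `guardrail_flags` is a hypothetical stand-in for a real moderation model.

```python
# Hypothetical sketch of the leakage-ratio metric: the fraction of
# attack samples that leak through guardrail moderation undetected.

def guardrail_flags(sample: str) -> bool:
    """Placeholder: return True if the guardrail would reject `sample`."""
    return "attack-marker" in sample  # toy heuristic, not a real guardrail

def leakage_ratio(attack_samples: list[str]) -> float:
    """Share of attack samples that are NOT flagged by the guardrail."""
    leaked = sum(not guardrail_flags(s) for s in attack_samples)
    return leaked / len(attack_samples)

print(leakage_ratio(["benign-looking sample A", "benign-looking sample B"]))  # 1.0
# A 100% leakage ratio means every optimized attack sample evades moderation.
```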

Recently we have witnessed an increasing number of guardrail products appearing on the market, e.g., LlamaGuard, IBM Granite Guardian, etc. However, we remain skeptical that relying on these one-shot filtering tools can solve the LLM's inherent safety issue.

The short caveat we want to convey with this paper is this: it is reckless to treat guardrail moderation as a straw to clutch at for LLM safety, and it never should be.

Qi, Xiangyu, et al. "Fine-tuning aligned language models compromises safety, even when users do not intend to!" arXiv preprint arXiv:2310.03693 (2023).

