	<!DOCTYPE html>
	<html lang="en-US">
	<head>
	<meta charset="UTF-8">

	<!-- Begin Jekyll SEO tag v2.8.0 -->
	<title>Gradient Cuff | Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes</title>
	<meta property="og:title" content="Gradient Cuff" />
	<meta property="og:locale" content="en_US" />
	<meta name="description" content="Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes" />
	<meta property="og:description" content="Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes" />
	<script type="application/ld+json">
	{"@context":"https://schema.org","@type":"WebSite","description":"Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes","headline":"Gradient Cuff","name":"Gradient Cuff","url":"https://huggingface.co/spaces/gregH/Gradient Cuff"}</script>
	<!-- End Jekyll SEO tag -->

	<link rel="preconnect" href="https://fonts.gstatic.com">
	<link rel="preload" href="https://fonts.googleapis.com/css?family=Open+Sans:400,700&display=swap" as="style" type="text/css" crossorigin>
	<meta name="viewport" content="width=device-width, initial-scale=1">
	<meta name="theme-color" content="#157878">
	<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">

	<link rel="stylesheet" href="assets/css/bootstrap/bootstrap.min.css?v=90447f115a006bc45b738d9592069468b20e2551">
	<link rel="stylesheet" href="assets/css/style.css?v=90447f115a006bc45b738d9592069468b20e2551">
	<!-- start custom head snippets, customize with your own _includes/head-custom.html file -->
	<link rel="stylesheet" href="assets/css/custom_style.css?v=90447f115a006bc45b738d9592069468b20e2551">
	<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
	<link rel="stylesheet" href="https://ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/themes/smoothness/jquery-ui.css">
	<script src="https://ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/jquery-ui.min.js"></script>
	<script src="https://cdnjs.cloudflare.com/ajax/libs/Chart.js/2.9.4/Chart.js"></script>
	<script src="assets/js/calibration.js?v=90447f115a006bc45b738d9592069468b20e2551"></script>
	<link rel="stylesheet" href="//code.jquery.com/ui/1.13.2/themes/base/jquery-ui.css">
	<link rel="stylesheet" href="/resources/demos/style.css">
	<script src="https://code.jquery.com/jquery-3.6.0.js"></script>
	<script src="https://code.jquery.com/ui/1.13.2/jquery-ui.js"></script>
	<script>
	$( function() {
	$( "#tabs" ).tabs();
	} );
	</script>
	<script>
	$( function() {
	$( "#accordion-defenses" ).accordion({
	heightStyle: "content"
	});
	} );
	</script>
	<script>
	$( function() {
	$( "#accordion-attacks" ).accordion({
	heightStyle: "content"
	});
	} );
	</script>




	<!-- for mathjax support -->
	<script src="https://cdnjs.cloudflare.com/polyfill/v3/polyfill.min.js?features=es6"></script>
	<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>


	<!-- end custom head snippets -->

	</head>
	<body>
	<a id="skip-to-content" href="#content">Skip to the content.</a>

	<header class="page-header" role="banner">
	<h1 class="project-name">Token Highlighter</h1>
	<h2 class="project-tagline">Inspecting and Mitigating Jailbreak Prompts for Large Language Models</h2>
	<h2 class="project-tagline"><a href="https://arxiv.org/abs/2412.18171" style="color: white;" target="_blank" rel="noopener noreferrer">https://arxiv.org/abs/2412.18171</a></h2>
	<h2 class="project-tagline"><a href="https://huggingface.co/spaces/gregH/token_highlighter" style="color: white;" target="_blank" rel="noopener noreferrer">Live Demo</a></h2>
	<div style="text-align: center">
	<div>
	<a href="https://gregxmhu.github.io/" style="color: white;" target="_blank" rel="noopener noreferrer">Xiaomeng Hu, CUHK CSE</a>
	</div>
	<div>
	<a href="https://sites.google.com/site/pinyuchenpage/home" style="color: white;" target="_blank" rel="noopener noreferrer">Pin-Yu Chen, IBM Research</a>
	</div>
	<div>
	<a href="https://www.cse.cuhk.edu.hk/people/faculty/tsung-yi-ho/" style="color: white;" target="_blank" rel="noopener noreferrer">Tsung-Yi Ho, CUHK CSE</a>
	</div>
	</div>
	</header>

	<main id="content" class="main-content" role="main">
	<h2 id="introduction">Introduction</h2>

	<p>Large Language Models (LLMs) are increasingly being integrated into services such as ChatGPT to respond to user queries.
	To mitigate potential harm and prevent misuse, there have been concerted efforts to align LLMs with human values and legal requirements
	by incorporating techniques such as Reinforcement Learning from Human Feedback (RLHF)
	into their training. However, recent research has shown that even aligned
	LLMs are susceptible to adversarial manipulations known as Jailbreak Attacks. To address this challenge, this paper proposes a method called
	<strong>Token Highlighter</strong> to inspect and mitigate potential jailbreak threats in the user query.
	Token Highlighter introduces a concept called Affirmation Loss to measure the LLM's willingness to answer the user query.
	It then uses the gradient of the Affirmation Loss with respect to each token in the user query to locate the jailbreak-critical tokens. Further,
	Token Highlighter applies our proposed Soft Removal technique to mitigate the jailbreak effects of the critical tokens by shrinking their
	token embeddings. Experimental results on two aligned LLMs (LLaMA-2 and Vicuna-V1.5) demonstrate that the proposed method can effectively
	defend against a variety of Jailbreak Attacks while maintaining competent performance on benign questions from the AlpacaEval benchmark. In
	addition, Token Highlighter is a cost-effective and interpretable defense: it needs to query the protected LLM only once to compute
	the Affirmation Loss, and it can highlight the critical tokens upon refusal.
	</p>
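	<p>As a rough illustration of the Soft Removal idea described above, the sketch below shrinks the embeddings of the tokens whose Affirmation Loss gradients have the largest norms. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name <code>soft_removal</code>, the parameters <code>alpha</code> (fraction of tokens treated as critical) and <code>beta</code> (shrink factor), and the choice of the L2 norm as the ranking score are all hypothetical, and the gradients are assumed to have been computed elsewhere.</p>

```python
import math


def soft_removal(embeddings, grads, alpha=0.25, beta=0.5):
    """Shrink the embeddings of the tokens with the largest gradient norms.

    embeddings: list of per-token embedding vectors (lists of floats).
    grads:      list of per-token gradients of the Affirmation Loss
                (assumed to be computed elsewhere, same shape as embeddings).
    alpha:      hypothetical fraction of tokens to treat as critical.
    beta:       hypothetical shrink factor between 0 and 1.
    """
    # Influence of each token: L2 norm of its gradient row.
    influence = [math.sqrt(sum(g * g for g in row)) for row in grads]
    # Number of critical tokens (at least one).
    k = max(1, round(alpha * len(influence)))
    # Indices of the k tokens with the largest influence.
    ranked = sorted(range(len(influence)), key=lambda i: influence[i], reverse=True)
    critical = sorted(ranked[:k])
    # Shrink only the critical rows; copy the others unchanged.
    softened = [
        [beta * v for v in row] if i in critical else list(row)
        for i, row in enumerate(embeddings)
    ]
    return softened, critical
```

	<p>The returned index list can double as the set of tokens to highlight when the query is refused.</p>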

	<h2 id="what-is-jailbreak">What is Jailbreak?</h2>
	<p>Jailbreak attacks maliciously insert or replace tokens in the user instruction, or rewrite it entirely, to bypass
	the safety guardrails of aligned LLMs. For example, a jailbroken LLM can be tricked into
	generating hate speech targeting certain groups of people, as demonstrated below.</p>

	<div class="container">
	<div id="jailbreak-intro" class="row align-items-center jailbreak-intro-sec">
	<img id="jailbreak-intro-img" src="./jailbreak.jpg" />
	</div>
	</div>

	<h2 id="refusal-loss">Refusal Loss</h2>
	<p>Current transformer-based LLMs can return different responses to the same query because
	autoregressive generation is sampling-based. Because of this randomness,
	a malicious user query is sometimes rejected by the target LLM and
	sometimes bypasses the safety guardrail. Based on this observation, we propose a new concept called <strong>Refusal Loss</strong> to
	represent the probability that the LLM won't reject the input user query. Using 1 to denote a successful jailbreak and 0 to denote
	the opposite, we compute the empirical Refusal Loss as the sample mean of the jailbreak results returned by the target LLM.
	<!--Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
	mean of the Jailbroken results (1 indicates successful jailbreak, 0 indicates the opposite) to approximate the function value. -->
	We visualize the 2-D landscape of the empirical Refusal Loss on Vicuna 7B and Llama-2 7B below:
	</p>

	<div class="container jailbreak-intro-sec">
	<div><img id="jailbreak-intro-img" src="./loss_landscape.png" /></div>
	</div>

	<p>
	We show the loss landscape for both benign and malicious queries in the plot above. The benign queries are non-harmful user instructions collected
	from the LM-SYS Chatbot Arena leaderboard, a crowd-sourced open platform for LLM evaluation. The malicious queries are harmful-behavior
	user instructions with a GCG jailbreak prompt. The plot shows that the loss landscape is more precipitous for malicious queries than for benign queries,
	which implies that the Refusal Loss tends to have a large gradient norm when the input is a malicious query. This observation motivates
	using the gradient norm of the Refusal Loss to detect jailbreak attempts that pass the initial filter, which rejects the input query when the function value
	is under 0.5 (a naive detector, since the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
	Below we present the definition of the Refusal Loss, the computation of its empirical value, and the approximation of its gradient. More
	details about them and the landscape drawing techniques are in our paper.
	</p>

	<div id="refusal-loss-formula" class="container">
	<div id="refusal-loss-formula-list" class="row align-items-center formula-list">
	<a href="#Refusal-Loss" class="selected">Refusal Loss Definition</a>
	<a href="#Refusal-Loss-Approximation">Refusal Loss Computation</a>
	<a href="#Gradient-Estimation">Gradient Estimation</a>
	<div style="clear: both"></div>
	</div>
	<div id="refusal-loss-formula-content" class="row align-items-center">
	<span id="Refusal-Loss" class="formula" style="">
	$$
	\displaystyle
	\begin{aligned}
	\phi_\theta(x)&=1-\mathbb{E}_{y \sim T_\theta(x)} JB(y)\\
	JB (y) &= \begin{cases}
	1 \text{, if $y$ contains any jailbreak keyword;} \\
	0 \text{, otherwise.}
	\end{cases}
	\end{aligned}
	$$
	</span>
	<span id="Refusal-Loss-Approximation" class="formula" style="display: none;">
	$$
	\displaystyle
	\begin{aligned}
	f_\theta(x) &=1-\frac{1}{N}\sum_{i=1}^N JB(y_i)\\
	JB (y_i) &= \begin{cases}
	1 \text{, if $y_i$ contains any jailbreak keyword;} \\
	0 \text{, otherwise.}
	\end{cases}
	\end{aligned}
	$$
	</span>
	<span id="Gradient-Estimation" class="formula" style="display: none;">$$\displaystyle g_\theta(x)=\sum_{i=1}^P \frac{f_\theta(x\oplus \mu u_i)-f_\theta(x)}{\mu} u_i $$</span>
	</div>
	</div>
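	<p>The Refusal Loss Computation and Gradient Estimation formulas above can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's code: the keyword list is hypothetical (the page does not enumerate the jailbreak keywords, and the sketch assumes they are refusal phrases), the query <code>x</code> is treated as a flattened embedding vector, the operator &oplus; is treated as element-wise addition, and the directions <code>u_i</code> are assumed to be standard Gaussian vectors. The callable <code>f</code> stands in for whatever routine samples the LLM and returns the empirical Refusal Loss.</p>

```python
import random

# Hypothetical keyword list; the page does not enumerate the actual keywords.
JAILBREAK_KEYWORDS = ("I'm sorry", "I cannot", "I apologize")


def jb(response, keywords=JAILBREAK_KEYWORDS):
    """JB(y): 1 if the response contains any keyword, 0 otherwise."""
    return int(any(k in response for k in keywords))


def empirical_refusal_loss(responses, keywords=JAILBREAK_KEYWORDS):
    """f(x) = 1 - (1/N) * sum_i JB(y_i) over N sampled responses."""
    return 1.0 - sum(jb(y, keywords) for y in responses) / len(responses)


def estimate_gradient(f, x, mu=0.02, num_directions=10, seed=0):
    """g(x) = sum_i (f(x + mu * u_i) - f(x)) / mu * u_i over P directions.

    f:              callable mapping a vector to an empirical Refusal Loss.
    x:              query representation as a flat list of floats (assumption).
    mu:             perturbation radius (hypothetical default).
    num_directions: P, the number of random directions (hypothetical default).
    """
    rng = random.Random(seed)
    fx = f(x)
    grad = [0.0] * len(x)
    for _ in range(num_directions):
        # Random direction u_i, assumed standard Gaussian.
        u = [rng.gauss(0.0, 1.0) for _ in x]
        perturbed = [xi + mu * ui for xi, ui in zip(x, u)]
        # Finite-difference slope along u_i.
        slope = (f(perturbed) - fx) / mu
        grad = [gi + slope * ui for gi, ui in zip(grad, u)]
    return grad
```

	<p>Each call to <code>f</code> costs N LLM samples, so one gradient estimate costs N &times; (P + 1) samples in this sketch.</p>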

	<h2 id="proposed-approach-gradient-cuff">Proposed Approach: Gradient Cuff</h2>
	<p>Based on our exploration of the Refusal Loss landscape, we propose Gradient Cuff,
	a two-step jailbreak detection method that checks the Refusal Loss and its gradient norm. The detection procedure is shown below:
	</p>

	<div class="container"><img id="gradient-cuff-header" src="./gradient_cuff.png" /></div>

	<p>
	Gradient Cuff consists of two phases:
	</p>
	<p>
	<strong>(Phase 1) Sampling-based Rejection:</strong> In the first step, we check whether the Refusal Loss value is below 0.5. If it is, the user query is rejected; otherwise, the user query is passed to Phase 2.
	</p>
	<p>
	<strong>(Phase 2) Gradient Norm Rejection:</strong> In the second step, we regard the user query as a jailbreak attempt if the norm of the estimated gradient is larger than a configurable threshold t.
	</p>

	<p>
	We provide more details about the workflow of Gradient Cuff in the paper.
	</p>
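	<p>The two phases above amount to a short decision rule. The sketch below is illustrative only: the function name <code>gradient_cuff_decision</code> and its return convention are hypothetical, the L2 norm is an assumed choice of norm, and the Refusal Loss and gradient estimate are assumed to be computed beforehand.</p>

```python
import math


def gradient_cuff_decision(refusal_loss, grad, threshold):
    """Return (rejected, phase) for one user query.

    refusal_loss: empirical Refusal Loss of the query.
    grad:         estimated gradient of the Refusal Loss (flat list of floats).
    threshold:    the configurable gradient-norm threshold t.
    """
    # Phase 1: sampling-based rejection when the loss is below 0.5.
    if 0.5 > refusal_loss:
        return True, 1
    # Phase 2: reject when the gradient norm exceeds the threshold.
    grad_norm = math.sqrt(sum(g * g for g in grad))
    if grad_norm > threshold:
        return True, 2
    # The query passed both checks.
    return False, None
```

	<p>Because Phase 1 returns early, the gradient only matters for queries that the sampling check lets through.</p>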

	<h2 id="demonstration">Demonstration</h2>
	<p>We evaluated Gradient Cuff and 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder)
	against 6 different jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and
	Vicuna-7B-V1.5). Below we report the refusal rate averaged across the 6 malicious user query datasets as the Average Malicious Refusal
	Rate, and the refusal rate on benign user queries as the Benign Refusal Rate. The defense performance against each jailbreak type is
	shown in the bar chart.
	</p>


	<div id="jailbreak-demo" class="container">
	<div class="row align-items-center">
	<div class="row" style="margin: 10px 0 0">
	<div class="models-list">
	<span style="margin-right: 1em;">Models</span>
	<span class="radio-group"><input type="radio" id="LLaMA2" class="options" name="models" value="llama2_7b_chat" checked="" /><label for="LLaMA2" class="option-label">LLaMA-2-7B-Chat</label></span>
	<span class="radio-group"><input type="radio" id="Vicuna" class="options" name="models" value="vicuna_7b_v1.5" /><label for="Vicuna" class="option-label">Vicuna-7B-V1.5</label></span>
	</div>
	</div>
	</div>
	<div class="row align-items-center">
	<div class="col-4">
	<div id="defense-methods">
	<div class="row align-items-center"><input type="radio" id="defense_ppl" class="options" name="defense" value="ppl" /><label for="defense_ppl" class="defense">Perplexity Filter</label></div>
	<div class="row align-items-center"><input type="radio" id="defense_smoothllm" class="options" name="defense" value="smoothllm" /><label for="defense_smoothllm" class="defense">SmoothLLM</label></div>
	<div class="row align-items-center"><input type="radio" id="defense_erase_check" class="options" name="defense" value="erase_check" /><label for="defense_erase_check" class="defense">Erase-Check</label></div>
	<div class="row align-items-center"><input type="radio" id="defense_self_reminder" class="options" name="defense" value="self_reminder" /><label for="defense_self_reminder" class="defense">Self-Reminder</label></div>
	<div class="row align-items-center"><input type="radio" id="defense_gradient_cuff" class="options" name="defense" value="gradient_cuff" checked="" /><label for="defense_gradient_cuff" class="defense"><span style="font-weight: bold;">Gradient Cuff</span></label></div>
	</div>
	<div class="row align-items-center">
	<div class="attack-success-rate"><span class="jailbreak-metric">Average Malicious Refusal Rate</span><span class="attack-success-rate-value" id="asr-value">0.959</span></div>
	</div>
	<div class="row align-items-center">
	<div class="benign-refusal-rate"><span class="jailbreak-metric">Benign Refusal Rate</span><span class="benign-refusal-rate-value" id="brr-value">0.050</span></div>
	</div>
	</div>
	<div class="col-8">
	<figure class="figure">
	<img id="reliability-diagram" src="demo_results/gradient_cuff_llama2_7b_chat_threshold_100.png" alt="Refusal rates of the selected defense against each jailbreak attack" />
	<div class="slider-container">
	<div class="slider-label"><span>Perplexity Threshold</span></div>
	<div class="slider-content" id="ppl-slider"><div id="ppl-threshold" class="ui-slider-handle"></div></div>
	</div>
	<div class="slider-container">
	<div class="slider-label"><span>Gradient Threshold</span></div>
	<div class="slider-content" id="gradient-norm-slider"><div id="gradient-norm-threshold" class="slider-value ui-slider-handle"></div></div>
	</div>
	<figcaption class="figure-caption">
	</figcaption>
	</figure>
	</div>
	</div>
	</div>

	<p>
	A higher malicious refusal rate and a lower benign refusal rate indicate a better defense.
	Overall, Gradient Cuff outperforms these baselines. We also evaluated Gradient Cuff against adaptive attacks
	in the paper.
	</p>

	<h2 id="inquiries"> Inquiries on LLM with Gradient Cuff defense</h2>
	<p> Please contact <a href="Mailto:[email protected]">Xiaomeng Hu</a>
	and <a href="Mailto:[email protected]">Pin-Yu Chen</a>
	</p>
	<h2 id="citations">Citations</h2>
	<p>If you find Token Highlighter useful for your research, please cite our paper as follows:</p>

	<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{DBLP:journals/corr/abs-2412-18171,
	author = {Xiaomeng Hu and
	Pin{-}Yu Chen and
	Tsung{-}Yi Ho},
	title = {Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for
	Large Language Models},
	journal = {CoRR},
	volume = {abs/2412.18171},
	year = {2024},
	url = {https://doi.org/10.48550/arXiv.2412.18171},
	doi = {10.48550/ARXIV.2412.18171},
	eprinttype = {arXiv},
	eprint = {2412.18171},
	timestamp = {Sat, 25 Jan 2025 12:51:16 +0100},
	biburl = {https://dblp.org/rec/journals/corr/abs-2412-18171.bib},
	bibsource = {dblp computer science bibliography, https://dblp.org}
	}
	</code></pre></div></div>


	<footer class="site-footer">

	<span class="site-footer-owner">Token-Highlighter is maintained by <a href="https://gregxmhu.github.io/">Xiaomeng Hu</a>.</span>

	</footer>
	</main>
	</body>
	</html>