<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<!-- Begin Jekyll SEO tag v2.8.0 -->
<title>Gradient Cuff | Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by
Exploring Refusal Loss Landscapes </title>
<meta property="og:title" content="Gradient Cuff" />
<meta property="og:locale" content="en_US" />
<meta name="description" content="Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes" />
<meta property="og:description" content="Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes" />
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"WebSite","description":"Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes","headline":"Gradient Cuff","name":"Gradient Cuff","url":"https://huggingface.co/spaces/gregH/Gradient Cuff"}</script>
<!-- End Jekyll SEO tag -->
<link rel="preconnect" href="https://fonts.gstatic.com">
<link rel="preload" href="https://fonts.googleapis.com/css?family=Open+Sans:400,700&display=swap" as="style" type="text/css" crossorigin>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="theme-color" content="#157878">
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
<link rel="stylesheet" href="assets/css/bootstrap/bootstrap.min.css?v=90447f115a006bc45b738d9592069468b20e2551">
<link rel="stylesheet" href="assets/css/style.css?v=90447f115a006bc45b738d9592069468b20e2551">
<!-- start custom head snippets, customize with your own _includes/head-custom.html file -->
<link rel="stylesheet" href="assets/css/custom_style.css?v=90447f115a006bc45b738d9592069468b20e2551">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<link rel="stylesheet" href="https://ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/themes/smoothness/jquery-ui.css">
<script src="https://ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/jquery-ui.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/Chart.js/2.9.4/Chart.js"></script>
<script src="assets/js/calibration.js?v=90447f115a006bc45b738d9592069468b20e2551"></script>
<link rel="stylesheet" href="//code.jquery.com/ui/1.13.2/themes/base/jquery-ui.css">
<link rel="stylesheet" href="/resources/demos/style.css">
<script src="https://code.jquery.com/jquery-3.6.0.js"></script>
<script src="https://code.jquery.com/ui/1.13.2/jquery-ui.js"></script>
<script>
$( function() {
$( "#tabs" ).tabs();
} );
</script>
<script>
$( function() {
$( "#accordion-defenses" ).accordion({
heightStyle: "content"
});
} );
</script>
<script>
$( function() {
$( "#accordion-attacks" ).accordion({
heightStyle: "content"
});
} );
</script>
<!-- for mathjax support -->
<script src="https://cdnjs.cloudflare.com/polyfill/v3/polyfill.min.js?features=es6"></script>
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
<!-- end custom head snippets -->
</head>
<body>
<a id="skip-to-content" href="#content">Skip to the content.</a>
<header class="page-header" role="banner">
<h1 class="project-name">Token Highlighter</h1>
<h2 class="project-tagline">Inspecting and Mitigating Jailbreak Prompts for Large Language Models</h2>
<h2 class="project-tagline"><a href="https://arxiv.org/abs/2412.18171" style="color: white;" target="_blank" rel="noopener noreferrer">https://arxiv.org/abs/2412.18171</a></h2>
<h2 class="project-tagline"><a href="https://huggingface.co/spaces/gregH/token_highlighter" style="color: white;" target="_blank" rel="noopener noreferrer">Live Demo</a></h2>
<div style="text-align: center">
<div>
<a href="https://gregxmhu.github.io/" style="color: white;" target="_blank" rel="noopener noreferrer">Xiaomeng Hu, CUHK CSE</a>
</div>
<div>
<a href="https://sites.google.com/site/pinyuchenpage/home" style="color: white;" target="_blank" rel="noopener noreferrer">Pin-Yu Chen, IBM Research</a>
</div>
<div>
<a href="https://www.cse.cuhk.edu.hk/people/faculty/tsung-yi-ho/" style="color: white;" target="_blank" rel="noopener noreferrer">Tsung-Yi Ho, CUHK CSE</a>
</div>
</div>
</header>
<main id="content" class="main-content" role="main">
<h2 id="introduction">Introduction</h2>
<p>Large Language Models (LLMs) are increasingly being integrated into services such as ChatGPT to provide responses to user queries.
To mitigate potential harm and prevent misuse, there have been concerted efforts to align LLMs with human values and legal requirements
by incorporating techniques such as Reinforcement Learning from Human Feedback (RLHF)
into their training. However, recent research has shown that even aligned
LLMs are susceptible to adversarial manipulations known as Jailbreak Attacks. To address this challenge, this paper proposes a method called
<strong>Token Highlighter</strong> to inspect and mitigate potential jailbreak threats in the user query.
Token Highlighter introduces a concept called Affirmation Loss to measure the LLM's willingness to answer the user query.
It then uses the gradient of the Affirmation Loss with respect to each token in the user query to locate the jailbreak-critical tokens.
Token Highlighter then applies our proposed Soft Removal technique, which mitigates the jailbreak effect of the critical tokens by shrinking their
token embeddings. Experimental results on two aligned LLMs (LLaMA-2 and Vicuna-V1.5) demonstrate that the proposed method can effectively
defend against a variety of Jailbreak Attacks while maintaining competent performance on benign questions from the AlpacaEval benchmark. In
addition, Token Highlighter is a cost-effective and interpretable defense because it only needs to query the protected LLM once to compute
the Affirmation Loss and can highlight the critical tokens upon refusal.
</p>
<h2 id="what-is-jailbreak">What is Jailbreak?</h2>
<p>
Aligned Large Language Models (LLMs) have been shown to be vulnerable to jailbreak attacks, which exploit token-level
or prompt-level manipulations to circumvent the safety guardrails embedded within these models. For example,
a jailbroken LLM can be tricked into giving tutorials on how to cause harm to others. Jailbreak techniques often employ
sophisticated strategies, including but not limited to role-playing, instruction disguising, leading language, and the normalization
of illicit actions, as illustrated in the examples below.
</p>
<div class="container">
<div id="jailbreak-intro" class="row align-items-center jailbreak-intro-sec">
<img id="jailbreak-intro-img" src="./jailbreak.jpg" />
</div>
</div>
<h2 id="refusal-loss">Token Highlighter: Principle and Interpretability</h2>
<p>At a high level, successful jailbreaks share a common principle: they try to make the LLM willing to affirm a user request that it would initially reject. Drawing on this observation, our proposed defense aims to find the tokens that are most critical in forcing the LLM to generate such affirmative responses,
decrease their importance in the generation, and thereby resolve the potential jailbreak risks brought by these tokens. To identify
these tokens, we propose a new concept called the <strong>Affirmation Loss</strong>. We then use the loss's gradient norm
with respect to each token in the user input prompt to find the jailbreak-critical tokens. We select the tokens with the largest
gradient norms and apply Soft Removal to them to mitigate the potential jailbreak risks. Below we introduce how we define these concepts mathematically.
</p>
<div id="refusal-loss-formula" class="container">
<div id="refusal-loss-formula-list" class="row align-items-center formula-list">
<a href="#Refusal-Loss" class="selected">Affirmation Loss Computation</a>
<a href="#Refusal-Loss-Approximation">Critical Tokens Selection</a>
<a href="#Gradient-Estimation">Soft Removal Operation</a>
<div style="clear: both"></div>
</div>
<div id="refusal-loss-formula-content" class="row align-items-center">
<span id="Refusal-Loss" class="formula" style="">
$$
\displaystyle
\begin{aligned}
x_{1:n}& =\mathtt{embed}_\theta(q_{1:n})\\
\mathtt{Affirmation~Loss}&(x_{1:n},\theta)=-\log P_\theta(y|x_{1:n})
\end{aligned}
$$
</span>
<span id="Refusal-Loss-Approximation" class="formula" style="display: none;">
$$
\displaystyle
\begin{aligned}
\label{eq:influence}
&\mathtt{Influence} (x_i) = \Vert \nabla_{x_i} \log P_\theta(y|x_{1:n}) \Vert_2 \\
&\mathcal{X} = \mathtt{argtop}\text{-}n\alpha(\{\mathtt{Influence}(x_i), \forall x_i \in x_{1:n}\}) \\
&\mathcal{Q} = \{q_i, \forall x_i \in \mathcal{X}\}
\end{aligned}
$$
</span>
<span id="Gradient-Estimation" class="formula" style="display: none;">
$$
\displaystyle
\begin{aligned}
&x^\prime_i=\begin{cases}
\beta \times \mathtt{embed}(q_i), \text{ if $q_i$ in $\mathcal{Q}$}\\
\mathtt{embed}(q_i), \text{ otherwise}
\end{cases} \\
&r_\theta(q_{1:n})\sim P_\theta(\cdot|x^\prime_{1:n})
\end{aligned}
$$
</span>
</div>
</div>
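<p>
As a small worked illustration of the three steps above (a sketch only: the prompt length \(n=8\), the ratio \(\alpha=0.25\), the scale \(\beta=0.5\),
and the influence scores below are hypothetical values chosen to keep the arithmetic simple, not results from the paper), note first that negating the
log-likelihood does not change the gradient norm. Ranking tokens by \(\mathtt{Influence}\) is therefore the same as ranking them by the gradient norm of the Affirmation Loss:
</p>
<p>
$$
\displaystyle
\Vert \nabla_{x_i} \mathtt{Affirmation~Loss}(x_{1:n},\theta) \Vert_2 = \Vert -\nabla_{x_i} \log P_\theta(y|x_{1:n}) \Vert_2 = \mathtt{Influence}(x_i)
$$
</p>
<p>
Assuming \(\mathtt{argtop}\text{-}n\alpha\) keeps the \(n\alpha = 8 \times 0.25 = 2\) embeddings with the highest scores, the hypothetical influence scores
\((0.1,\ 0.3,\ 2.4,\ 0.2,\ 0.5,\ 1.9,\ 0.4,\ 0.2)\) for \(x_1,\dots,x_8\) would give:
</p>
<p>
$$
\displaystyle
\begin{aligned}
&\mathcal{X}=\{x_3, x_6\}, \quad \mathcal{Q}=\{q_3, q_6\} \\
&x^\prime_i=\begin{cases}
0.5 \times \mathtt{embed}(q_i), \text{ if } i \in \{3, 6\}\\
\mathtt{embed}(q_i), \text{ otherwise}
\end{cases}
\end{aligned}
$$
</p>
<p>
Under these assumptions only the third and sixth token embeddings are halved, while the other six are passed to the LLM unchanged. The prompt keeps its full length and no token is deleted outright.
</p>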
<div class="container jailbreak-intro-sec">
<div><img id="jailbreak-intro-img" src="./loss_landscape.png" /></div>
</div>
<h2 id="proposed-approach-gradient-cuff">Performance Evaluation Against Practical Jailbreaks</h2>
<p> By exploring the Refusal Loss landscape, we propose Gradient Cuff,
a two-step jailbreak detection method that checks the Refusal Loss and its gradient norm. Our detection procedure is shown below:
</p>
<div class="container"><img id="gradient-cuff-header" src="./gradient_cuff.png" /></div>
<p>
Gradient Cuff can be summarized into two phases:
</p>
<p>
<strong>(Phase 1) Sampling-based Rejection:</strong> In the first step, we check whether the Refusal Loss value is below 0.5. If it is, the user query is rejected; otherwise, the user query is passed to Phase 2.
</p>
<p>
<strong>(Phase 2) Gradient Norm Rejection:</strong> In the second step, we regard the user query as a jailbreak attempt if the norm of the estimated gradient is larger than a configurable threshold t.
</p>
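<p>
The two phases can be read as a single decision rule. The notation here is our own shorthand for illustration and may differ from the paper:
\(\phi(q)\) stands for the Refusal Loss value of a query \(q\), \(g(q)\) for its estimated gradient, and \(t\) for the configurable threshold.
</p>
<p>
$$
\displaystyle
\mathtt{Reject}(q)=\begin{cases}
1, \text{ if } \phi(q) \lt 0.5 \quad \text{(Phase 1)}\\
1, \text{ if } \phi(q) \ge 0.5 \text{ and } \Vert g(q) \Vert \gt t \quad \text{(Phase 2)}\\
0, \text{ otherwise}
\end{cases}
$$
</p>
<p>
For instance, with purely hypothetical numbers, a query with \(\phi(q)=0.8\) passes Phase 1, and if \(\Vert g(q) \Vert = 120\) while \(t = 100\), it is rejected in Phase 2.
</p>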
<p>
We provide more details about the running flow of Gradient Cuff in the paper.
</p>
<h2 id="demonstration">Demonstration</h2>
<p>We evaluated Gradient Cuff and 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder)
against 6 different jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and
Vicuna-7B-V1.5). Below we report the refusal rate averaged across these 6 malicious user query datasets as the Average Malicious Refusal
Rate, and the refusal rate on benign user queries as the Benign Refusal Rate. The defense performance against each jailbreak type is
shown in the bar chart.
</p>
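<p>
Written out, and assuming each attack dataset is weighted equally, the two reported metrics are as follows. The symbols are our own shorthand:
\(\mathtt{RR}(a)\) is the fraction of queries in the dataset for attack \(a\) that are refused, and \(\mathcal{A}\) is the set of the 6 attacks listed above.
</p>
<p>
$$
\displaystyle
\begin{aligned}
&\mathcal{A} = \{\text{GCG}, \text{AutoDAN}, \text{PAIR}, \text{TAP}, \text{Base64}, \text{LRL}\} \\
&\text{Average Malicious Refusal Rate} = \frac{1}{6} \sum_{a \in \mathcal{A}} \mathtt{RR}(a) \\
&\text{Benign Refusal Rate} = \frac{\text{number of benign queries refused}}{\text{number of benign queries}}
\end{aligned}
$$
</p>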
<div id="jailbreak-demo" class="container">
<div class="row align-items-center">
<div class="row" style="margin: 10px 0 0">
<div class="models-list">
<span style="margin-right: 1em;">Models</span>
<span class="radio-group"><input type="radio" id="LLaMA2" class="options" name="models" value="llama2_7b_chat" checked="" /><label for="LLaMA2" class="option-label">LLaMA-2-7B-Chat</label></span>
<span class="radio-group"><input type="radio" id="Vicuna" class="options" name="models" value="vicuna_7b_v1.5" /><label for="Vicuna" class="option-label">Vicuna-7B-V1.5</label></span>
</div>
</div>
</div>
<div class="row align-items-center">
<div class="col-4">
<div id="defense-methods">
<div class="row align-items-center"><input type="radio" id="defense_ppl" class="options" name="defense" value="ppl" /><label for="defense_ppl" class="defense">Perplexity Filter</label></div>
<div class="row align-items-center"><input type="radio" id="defense_smoothllm" class="options" name="defense" value="smoothllm" /><label for="defense_smoothllm" class="defense">SmoothLLM</label></div>
<div class="row align-items-center"><input type="radio" id="defense_erase_check" class="options" name="defense" value="erase_check" /><label for="defense_erase_check" class="defense">Erase-and-Check</label></div>
<div class="row align-items-center"><input type="radio" id="defense_self_reminder" class="options" name="defense" value="self_reminder" /><label for="defense_self_reminder" class="defense">Self-Reminder</label></div>
<div class="row align-items-center"><input type="radio" id="defense_gradient_cuff" class="options" name="defense" value="gradient_cuff" checked="" /><label for="defense_gradient_cuff" class="defense"><span style="font-weight: bold;">Gradient Cuff</span></label></div>
</div>
<div class="row align-items-center">
<div class="attack-success-rate"><span class="jailbreak-metric">Average Malicious Refusal Rate</span><span class="attack-success-rate-value" id="asr-value">0.959</span></div>
</div>
<div class="row align-items-center">
<div class="benign-refusal-rate"><span class="jailbreak-metric">Benign Refusal Rate</span><span class="benign-refusal-rate-value" id="brr-value">0.050</span></div>
</div>
</div>
<div class="col-8">
<figure class="figure">
<img id="reliability-diagram" src="demo_results/gradient_cuff_llama2_7b_chat_threshold_100.png" alt="Bar chart of the selected defense's refusal rate against each jailbreak attack" />
<div class="slider-container">
<div class="slider-label"><span>Perplexity Threshold</span></div>
<div class="slider-content" id="ppl-slider"><div id="ppl-threshold" class="ui-slider-handle"></div></div>
</div>
<div class="slider-container">
<div class="slider-label"><span>Gradient Threshold</span></div>
<div class="slider-content" id="gradient-norm-slider"><div id="gradient-norm-threshold" class="slider-value ui-slider-handle"></div></div>
</div>
<figcaption class="figure-caption">
</figcaption>
</figure>
</div>
</div>
</div>
<p>
A higher malicious refusal rate and a lower benign refusal rate indicate a better defense.
Overall, Gradient Cuff outperforms these baselines. We also evaluated Gradient Cuff against adaptive attacks
in the paper.
</p>
<h2 id="inquiries"> Inquiries on LLM with Gradient Cuff defense</h2>
<p> Please contact <a href="Mailto:[email protected]">Xiaomeng Hu</a>
and <a href="Mailto:[email protected]">Pin-Yu Chen</a>
</p>
<h2 id="citations">Citations</h2>
<p>If you find Token Highlighter useful for your research, please cite our main paper as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{DBLP:journals/corr/abs-2412-18171,
author = {Xiaomeng Hu and
Pin{-}Yu Chen and
Tsung{-}Yi Ho},
title = {Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for
Large Language Models},
journal = {CoRR},
volume = {abs/2412.18171},
year = {2024},
url = {https://doi.org/10.48550/arXiv.2412.18171},
doi = {10.48550/ARXIV.2412.18171},
eprinttype = {arXiv},
eprint = {2412.18171},
timestamp = {Sat, 25 Jan 2025 12:51:16 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2412-18171.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
</code></pre></div></div>
<footer class="site-footer">
<span class="site-footer-owner">Token-Highlighter is maintained by <a href="https://gregxmhu.github.io/">Xiaomeng Hu</a>.</span>
</footer>
</main>
</body>
</html>