Update index.html
index.html CHANGED (+14 -25)
@@ -119,35 +119,19 @@ Exploring Refusal Loss Landscapes </title>
|
|
119 |
</div>
|
120 |
|
121 |
<h2 id="refusal-loss">Token Highlighter: Principle and Interpretability</h2>
|
122 |
-
<p>
|
123 |
-
|
124 |
-
|
125 |
-
|
126 |
-
|
127 |
-
these tokens, we propose a new concept called the
|
128 |
-
</p>
|
129 |
-
|
130 |
-
<div class="container jailbreak-intro-sec">
|
131 |
-
<div><img id="jailbreak-intro-img" src="./loss_landscape.png" /></div>
|
132 |
-
</div>
|
133 |
-
|
134 |
-
<p>
|
135 |
-
We show the loss landscape for both Benign and Malicious queries in the above plot. The benign queries are non-harmful user instructions collected
|
136 |
-
from the LM-SYS Chatbot Arena leaderboard, which is a crowd-sourced open platform for LLM evaluation. The tested malicious queries are harmful
|
137 |
-
behavior user instructions with GCG jailbreak prompt. From this plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries,
|
138 |
-
which implies that the Refusal Loss tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
|
139 |
-
the gradient norm of Refusal Loss to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value
|
140 |
-
is under 0.5 (this is a naive detector because the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
|
141 |
-
Below we present the definition of the Refusal Loss, the computation of its empirical values, and the approximation of its gradient, see more
|
142 |
-
details about them and the landscape drawing techniques in our paper.
|
143 |
</p>
|
144 |
|
145 |
<div id="refusal-loss-formula" class="container">
|
146 |
<div id="refusal-loss-formula-list" class="row align-items-center formula-list">
|
147 |
-
<a href="#Refusal-Loss" class="selected">
|
148 |
-
<a href="#Refusal-Loss-Approximation">
|
149 |
-
<a href="#Gradient-Estimation">
|
150 |
-
|
151 |
</div>
|
152 |
<div id="refusal-loss-formula-content" class="row align-items-center">
|
153 |
<span id="Refusal-Loss" class="formula" style="">
|
@@ -178,6 +162,11 @@ these tokens, we propose a new concept called the
|
|
178 |
</div>
|
179 |
</div>
|
180 |
|
181 |
<h2 id="proposed-approach-gradient-cuff">Performance evaluation against practical Jailbreaks</h2>
|
182 |
<p> With the exploration of the Refusal Loss landscape, we propose Gradient Cuff,
|
183 |
a two-step jailbreak detection method based on checking the refusal loss and its gradient norm. Our detection procedure is shown below:
|
|
|
119 |
</div>
|
120 |
|
121 |
<h2 id="refusal-loss">Token Highlighter: Principle and Interpretability</h2>
|
122 |
+
<p>At a high level, successful jailbreaks share a common principle: they try to make the LLM affirm a user request that it would otherwise reject. Drawing on this observation, our proposed defense aims to find the tokens that are most critical in forcing the LLM to generate such affirmative responses,
|
123 |
+
decrease their importance in the generation, and thereby resolve the potential jailbreak risks brought by these tokens. To identify
|
124 |
+
these tokens, we propose a new concept called the <strong>Affirmation Loss</strong>. We then use the loss's gradient norm
|
125 |
+
with respect to each token in the user input prompt to find the jailbreak-critical tokens. We select the tokens with the largest
|
126 |
+
gradient norm and then apply soft removal to them to mitigate the potential jailbreak risks. Below we define these concepts mathematically.
|
127 |
</p>
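The selection and soft-removal steps described in the paragraph above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the page's implementation: it assumes the gradient of the Affirmation Loss with respect to each token embedding has already been computed elsewhere (for example by an autodiff framework) and is passed in as one row per token. The function names, the selection ratio `alpha`, and the scaling factor `beta` are all hypothetical, and soft removal is assumed here to mean shrinking the embeddings of the selected tokens.

```python
import math


def token_grad_norms(grads):
    """Return the L2 norm of each token's gradient vector (one row per token)."""
    return [math.sqrt(sum(g * g for g in row)) for row in grads]


def select_critical_tokens(grads, alpha=0.25):
    """Return sorted indices of the top `alpha` fraction of tokens by gradient norm.

    `alpha` is a hypothetical parameter; at least one token is always selected.
    """
    norms = token_grad_norms(grads)
    k = max(1, math.ceil(alpha * len(norms)))
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:k])


def soft_remove(embeddings, critical, beta=0.5):
    """Scale the embeddings of the critical tokens by `beta`.

    Assumption: soft removal shrinks a token's embedding instead of deleting it.
    Other tokens are copied unchanged.
    """
    critical = set(critical)
    return [
        [beta * v for v in row] if i in critical else list(row)
        for i, row in enumerate(embeddings)
    ]


# Toy data: 4 tokens with 2-dimensional embeddings and made-up gradients.
grads = [[0.1, 0.0], [3.0, 4.0], [0.0, 0.2], [1.0, 0.0]]
embeddings = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]

critical = select_critical_tokens(grads, alpha=0.5)
shrunk = soft_remove(embeddings, critical, beta=0.5)
print(critical)
print(shrunk)
```

With these toy values, the two tokens with the largest gradient norms are selected and only their embeddings are scaled; the remaining tokens pass through untouched.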
|
128 |
|
129 |
<div id="refusal-loss-formula" class="container">
|
130 |
<div id="refusal-loss-formula-list" class="row align-items-center formula-list">
|
131 |
+
<a href="#Refusal-Loss" class="selected">Affirmation Loss Computation</a>
|
132 |
+
<a href="#Refusal-Loss-Approximation">Critical Tokens Selection</a>
|
133 |
+
<a href="#Gradient-Estimation">Soft Removal Operation</a>
|
134 |
+
<div style="clear: both"></div>
|
135 |
</div>
|
136 |
<div id="refusal-loss-formula-content" class="row align-items-center">
|
137 |
<span id="Refusal-Loss" class="formula" style="">
|
|
|
162 |
</div>
|
163 |
</div>
|
164 |
|
165 |
+
|
166 |
+
<div class="container jailbreak-intro-sec">
|
167 |
+
<div><img id="jailbreak-intro-img" src="./loss_landscape.png" /></div>
|
168 |
+
</div>
|
169 |
+
|
170 |
<h2 id="proposed-approach-gradient-cuff">Performance evaluation against practical Jailbreaks</h2>
|
171 |
<p> With the exploration of the Refusal Loss landscape, we propose Gradient Cuff,
|
172 |
a two-step jailbreak detection method based on checking the refusal loss and its gradient norm. Our detection procedure is shown below:
|