gregH committed on
Commit 2f98671 · verified · 1 Parent(s): 34b33b0

Update index.html

Files changed (1)
  1. index.html +14 -25
index.html CHANGED
@@ -119,35 +119,19 @@ Exploring Refusal Loss Landscapes </title>
119
  </div>
120
 
121
  <h2 id="refusal-loss">Token Highlighter: Principle and Interpretability</h2>
122
- <p>Studies found that many successful jailbreak attempts share a common property that
123
- they all trick the LLM into generating affirmations like starting with "Sure, here is" at the beginning
124
- of their responses. Drawing upon this inspiration, our proposed defense aims to find the tokens that
125
- are most critical in forcing the LLM to generate such affirmative responses, decrease their importance
126
- in the generation, and thereby resolve the potential jailbreak risks brought by these tokens. To identify
127
- these tokens, we propose a new concept called the
128
- </p>
129
-
130
- <div class="container jailbreak-intro-sec">
131
- <div><img id="jailbreak-intro-img" src="./loss_landscape.png" /></div>
132
- </div>
133
-
134
- <p>
135
- We show the loss landscape for both Benign and Malicious queries in the above plot. The benign queries are non-harmful user instructions collected
136
- from the LM-SYS Chatbot Arena leaderboard, which is a crowd-sourced open platform for LLM evaluation. The tested malicious queries are harmful
137
- behavior user instructions with GCG jailbreak prompt. From this plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries,
138
- which implies that the Refusal Loss tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
139
- the gradient norm of Refusal Loss to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value
140
- is under 0.5 (this is a naive detector because the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
141
- Below we present the definition of the Refusal Loss, the computation of its empirical values, and the approximation of its gradient, see more
142
- details about them and the landscape drawing techniques in our paper.
143
  </p>
144
 
145
  <div id="refusal-loss-formula" class="container">
146
  <div id="refusal-loss-formula-list" class="row align-items-center formula-list">
147
- <a href="#Refusal-Loss" class="selected">Refusal Loss Definition</a>
148
- <a href="#Refusal-Loss-Approximation">Refusal Loss Computation</a>
149
- <a href="#Gradient-Estimation">Gradient Estimation</a>
150
- <div style="clear: both"></div>
151
  </div>
152
  <div id="refusal-loss-formula-content" class="row align-items-center">
153
  <span id="Refusal-Loss" class="formula" style="">
@@ -178,6 +162,11 @@ these tokens, we propose a new concept called the
178
  </div>
179
  </div>
180
 
181
  <h2 id="proposed-approach-gradient-cuff">Performance evaluation against practical Jailbreaks</h2>
182
  <p> With the exploration of the Refusal Loss landscape, we propose Gradient Cuff,
183
  a two-step jailbreak detection method based on checking the refusal loss and its gradient norm. Our detection procedure is shown below:
 
119
  </div>
120
 
121
  <h2 id="refusal-loss">Token Highlighter: Principle and Interpretability</h2>
122
+ <p>At a high level, successful jailbreaks share a common principle: they try to make the LLM willing to affirm a user request that it would otherwise reject from the outset. Drawing upon this inspiration, our proposed defense aims to find the tokens that are most critical in forcing the LLM to generate such affirmative responses,
123
+ decrease their importance in the generation, and thereby resolve the potential jailbreak risks brought by these tokens. To identify
124
+ these tokens, we propose a new concept called the <strong>Affirmation Loss</strong>. We then use the loss's gradient norm
125
+ with respect to each token in the user input prompt to find the jailbreak-critical tokens. We select the tokens with the largest
126
+ gradient norms and apply soft removal to them to mitigate the potential jailbreak risks. Below we introduce how we define these concepts mathematically.
127
  </p>
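The selection and soft-removal steps described in the paragraph above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the per-token gradients of the Affirmation Loss have already been computed by an autodiff framework, and the function names, the `alpha` (fraction of tokens to select) and `beta` (scaling factor) parameters, and the choice to realize soft removal by scaling token embeddings are all assumptions made for this example.

```python
import math


def token_influence(grads):
    """L2 norm of the loss gradient for each token (one gradient vector per token)."""
    return [math.sqrt(sum(g * g for g in vec)) for vec in grads]


def select_critical_tokens(grads, alpha):
    """Indices of the top alpha-fraction of tokens by gradient norm.

    `alpha` is a hypothetical parameter; at least one token is always selected.
    """
    norms = token_influence(grads)
    k = max(1, int(len(norms) * alpha))
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:k])


def soft_remove(embeddings, critical, beta):
    """Assumed form of soft removal: scale the embeddings of critical tokens
    by `beta` (between 0 and 1) and leave every other token unchanged."""
    critical = set(critical)
    return [
        [beta * v for v in emb] if i in critical else list(emb)
        for i, emb in enumerate(embeddings)
    ]


# Toy data: four tokens with 2-dimensional embeddings and made-up gradients.
embeddings = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [2.0, 2.0]]
grads = [[0.1, 0.0], [3.0, 4.0], [0.0, 0.2], [1.0, 0.0]]

critical = select_critical_tokens(grads, alpha=0.5)
softened = soft_remove(embeddings, critical, beta=0.5)
print(critical)
print(softened)
```

In this toy run the two tokens with the largest gradient norms are selected and their embeddings are halved, while the remaining tokens pass through untouched. Setting `beta` to 0 would amount to hard removal of the selected tokens.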
128
 
129
  <div id="refusal-loss-formula" class="container">
130
  <div id="refusal-loss-formula-list" class="row align-items-center formula-list">
131
+ <a href="#Refusal-Loss" class="selected">Affirmation Loss Computation</a>
132
+ <a href="#Refusal-Loss-Approximation">Critical Tokens Selection</a>
133
+ <a href="#Gradient-Estimation">Soft Removal Operation</a>
134
+ <div style="clear: both"></div>
135
  </div>
136
  <div id="refusal-loss-formula-content" class="row align-items-center">
137
  <span id="Refusal-Loss" class="formula" style="">
 
162
  </div>
163
  </div>
164
 
165
+
166
+ <div class="container jailbreak-intro-sec">
167
+ <div><img id="jailbreak-intro-img" src="./loss_landscape.png" /></div>
168
+ </div>
169
+
170
  <h2 id="proposed-approach-gradient-cuff">Performance evaluation against practical Jailbreaks</h2>
171
  <p> With the exploration of the Refusal Loss landscape, we propose Gradient Cuff,
172
  a two-step jailbreak detection method based on checking the refusal loss and its gradient norm. Our detection procedure is shown below: