gregH committed on
Commit
b92dddc
·
verified ·
1 Parent(s): aa27052

Update index.html

Files changed (1)
  1. index.html +16 -0
index.html CHANGED
@@ -134,9 +134,25 @@ Exploring Refusal Loss Landscapes </title>
  </div>
 
  <h2 id="proposed-approach-gradient-cuff">Proposed Approach: Gradient Cuff</h2>
+ <p>Building on our exploration of the refusal loss landscape, we propose Gradient Cuff,
+ a two-step jailbreak detection method that checks the refusal loss and its gradient norm. The detection procedure is shown below:
+ </p>
 
  <div class="container"><img id="gradient-cuff-header" src="./gradient_cuff.png" /></div>
 
+ <p>
+ Gradient Cuff can be summarized in two phases:
+ </p>
+ <ul>
+ <li>
+ <strong>(Phase 1) Sampling-based Rejection:</strong> We first check whether the refusal loss satisfies \( f_\theta(x) &lt; 0.5 \). If so, the user query \(x\) is rejected; otherwise, \(x\) is passed to Phase 2.
+ </li>
+ <li>
+ <strong>(Phase 2) Gradient Norm Rejection:</strong> We treat \(x\) as a jailbreak attempt and reject it if the norm of the estimated gradient \( g_\theta(x) \) exceeds a configurable threshold \(t\), i.e., \( \|g_\theta(x)\| &gt; t \).
+ </li>
+ </ul>
+
+
  <h2 id="demonstration">Demonstration</h2>
  <p>We evaluated Gradient Cuff as well as 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder) against 6
  different jailbreak attacks~(GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5).
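
The two phases described in the added section can be sketched in Python. This is only an illustrative sketch, not the authors' implementation: `refusal_loss` stands in for \( f_\theta \) and is assumed to be a plain callable on a numeric vector, and `estimate_gradient` is a hypothetical random-direction finite-difference estimator chosen for illustration (the page only says the gradient is "estimated"). The names `gradient_cuff`, `num_directions`, `mu`, and `seed` are likewise assumptions.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

LossFn = Callable[[Sequence[float]], float]


def estimate_gradient(f: LossFn, x: Sequence[float],
                      num_directions: int = 8, mu: float = 1e-2,
                      seed: int = 0) -> List[float]:
    """Hypothetical zeroth-order estimator (an assumption, not the paper's code).

    Averages (f(x + mu*u) - f(x)) / mu * u over random Gaussian directions u.
    """
    rng = random.Random(seed)
    fx = f(x)
    grad = [0.0] * len(x)
    for _ in range(num_directions):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        shifted = [xi + mu * ui for xi, ui in zip(x, u)]
        slope = (f(shifted) - fx) / mu
        for i, ui in enumerate(u):
            grad[i] += slope * ui / num_directions
    return grad


def gradient_cuff(x: Sequence[float], refusal_loss: LossFn, threshold: float,
                  num_directions: int = 8, mu: float = 1e-2,
                  seed: int = 0) -> Tuple[bool, str]:
    """Return (rejected, reason) following the two phases in the text."""
    # Phase 1: reject when the refusal loss is below 0.5.
    if refusal_loss(x) < 0.5:
        return True, "phase 1"
    # Phase 2: reject when the estimated gradient norm exceeds the threshold t.
    grad = estimate_gradient(refusal_loss, x, num_directions, mu, seed)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        return True, "phase 2"
    return False, "accepted"


if __name__ == "__main__":
    # Toy stand-ins for f_theta; a real refusal loss would query the LLM.
    print(gradient_cuff([0.0, 0.0], lambda v: 0.2, threshold=1.0))
    print(gradient_cuff([0.0, 0.0], lambda v: 0.9, threshold=1.0))
```

In this sketch the threshold `t` is passed in by the caller, matching the "configurable threshold" in Phase 2; how it should be chosen is not specified on this page.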