Update index.html
index.html CHANGED +16 -0
@@ -134,9 +134,25 @@ Exploring Refusal Loss Landscapes </title>
|
|
134 |
</div>
|
135 |
|
136 |
<h2 id="proposed-approach-gradient-cuff">Proposed Approach: Gradient Cuff</h2>
|
137 |
|
138 |
<div class="container"><img id="gradient-cuff-header" src="./gradient_cuff.png" /></div>
|
139 |
|
140 |
<h2 id="demonstration">Demonstration</h2>
|
141 |
<p>We evaluated Gradient Cuff as well as 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder) against 6
|
142 |
different jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5).
|
|
|
134 |
</div>
|
135 |
|
136 |
<h2 id="proposed-approach-gradient-cuff">Proposed Approach: Gradient Cuff</h2>
|
137 |
+
<p>Building on our exploration of the refusal loss landscape, we propose Gradient Cuff,
|
138 |
+
a two-step jailbreak detection method that checks the refusal loss and the norm of its gradient. The detection procedure is shown below:
|
139 |
+
</p>
|
140 |
|
141 |
<div class="container"><img id="gradient-cuff-header" src="./gradient_cuff.png" /></div>
|
142 |
|
143 |
+
<p>
|
144 |
+
Gradient Cuff proceeds in two phases:
|
145 |
+
</p>
|
146 |
+
<ul>
|
147 |
+
<li>
|
148 |
+
<strong>(Phase 1) Sampling-based Rejection:</strong> We first check whether \(f_\theta(x) &lt; 0.5\). If so, the user query \(x\) is rejected; otherwise, \(x\) is passed to Phase 2.
|
149 |
+
</li>
|
150 |
+
<li>
|
151 |
+
<strong>(Phase 2) Gradient Norm Rejection:</strong> We flag \(x\) as a jailbreak attempt if the norm of the estimated gradient \(g_\theta(x)\) exceeds a configurable threshold \(t\), i.e., \(\|g_\theta(x)\| &gt; t\).
|
152 |
+
</li>
|
153 |
+
</ul>
|
154 |
+
|
155 |
+
|
156 |
<h2 id="demonstration">Demonstration</h2>
|
157 |
<p>We evaluated Gradient Cuff as well as 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder) against 6
|
158 |
different jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5).
|
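The two-phase check described in the added text can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `refusal_loss` is a hypothetical callable standing in for $f_\theta$, the query `x` is represented as a plain list of floats, and `estimate_gradient` uses a generic finite-difference (zeroth-order) estimator, which the page does not specify. Only the `0.5` cutoff and the threshold `t` come from the text.

```python
import math
import random
from typing import Callable, List, Optional

Vector = List[float]
LossFn = Callable[[Vector], float]


def estimate_gradient(f: LossFn, x: Vector, num_directions: int = 16,
                      mu: float = 1e-2, seed: int = 0) -> Vector:
    """Finite-difference gradient estimate along random Gaussian directions.

    Assumption: the page only says the gradient is "estimated"; this
    estimator is a generic stand-in chosen for illustration.
    """
    rng = random.Random(seed)
    base = f(x)
    grad = [0.0] * len(x)
    for _ in range(num_directions):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        shifted = [xi + mu * ui for xi, ui in zip(x, u)]
        scale = (f(shifted) - base) / mu  # directional slope along u
        for i, ui in enumerate(u):
            grad[i] += scale * ui / num_directions
    return grad


def gradient_cuff(x: Vector, refusal_loss: LossFn, t: float,
                  grad_fn: Optional[Callable[[Vector], Vector]] = None) -> str:
    """Return 'reject (phase 1)', 'reject (phase 2)', or 'accept'."""
    # Phase 1: reject when the refusal loss f(x) is below 0.5.
    if refusal_loss(x) < 0.5:
        return "reject (phase 1)"
    # Phase 2: reject when the gradient norm exceeds the threshold t.
    g = grad_fn(x) if grad_fn else estimate_gradient(refusal_loss, x)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm > t:
        return "reject (phase 2)"
    return "accept"
```

For example, a query with a flat, high refusal loss passes both phases, while one whose loss changes sharply around `x` is caught in Phase 2 even though it survived Phase 1. In practice the threshold `t` would need to be tuned; the value to use is not given in this commit.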