Token-Highlighter

Running

gregH committed on Feb 28, 2024

Commit

b92dddc

verified ·

1 Parent(s): aa27052

Update index.html

Files changed (1)

index.html CHANGED

@@ -134,9 +134,25 @@ Exploring Refusal Loss Landscapes </title>
 </div>
 <h2 id="proposed-approach-gradient-cuff">Proposed Approach: Gradient Cuff</h2>
 <div class="container"><img id="gradient-cuff-header" src="./gradient_cuff.png" /></div>
 <h2 id="demonstration">Demonstration</h2>
 <p>We evaluated Gradient Cuff as well as 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder) against 6
   different jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5).

 </div>
 <h2 id="proposed-approach-gradient-cuff">Proposed Approach: Gradient Cuff</h2>
+<p> Building on our exploration of the refusal loss landscape, we propose Gradient Cuff,
+  a two-step jailbreak detection method that checks the refusal loss and the norm of its gradient. The detection procedure is shown below:
+</p>
 <div class="container"><img id="gradient-cuff-header" src="./gradient_cuff.png" /></div>
+<p>
+  Gradient Cuff can be summarized in two phases:
+</p>
+<ul>
+  <li><strong>(Phase 1) Sampling-based Rejection:</strong> We first check whether $f_\theta(x) &lt; 0.5$. If so, the user query $x$ is rejected; otherwise, $x$ is passed on to Phase 2.</li>
+  <li><strong>(Phase 2) Gradient Norm Rejection:</strong> We flag $x$ as a jailbreak attempt if the norm of the estimated gradient $g_\theta(x)$ exceeds a configurable threshold $t$, i.e., $\|g_\theta(x)\| &gt; t$.</li>
+</ul>
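The two phases added in this hunk can be read as a short decision procedure. The sketch below is only an illustration of that control flow, not the authors' implementation: `f_theta` and `g_theta` are hypothetical callables standing in for whatever computes $f_\theta(x)$ and the estimated gradient $g_\theta(x)$, and the `Decision` type is likewise an assumption made for this example. Only the two comparisons ($f_\theta(x) < 0.5$ and $\|g_\theta(x)\| > t$) come from the text above; the Euclidean norm is assumed.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Decision:
    """Outcome of the two-phase check (hypothetical helper type)."""
    rejected: bool
    phase: int  # 1 or 2 = phase that rejected the query, 0 = query accepted


def gradient_cuff_detect(
    x: str,
    f_theta: Callable[[str], float],            # assumed: returns f_theta(x)
    g_theta: Callable[[str], Sequence[float]],  # assumed: returns estimated gradient
    t: float,                                   # configurable threshold
) -> Decision:
    # Phase 1: reject immediately when f_theta(x) < 0.5.
    if f_theta(x) < 0.5:
        return Decision(rejected=True, phase=1)

    # Phase 2: reject when the gradient norm exceeds the threshold t.
    # The Euclidean norm is an assumption of this sketch.
    grad = g_theta(x)
    norm = math.sqrt(sum(v * v for v in grad))
    if norm > t:
        return Decision(rejected=True, phase=2)

    # Neither phase fired, so the query is passed through.
    return Decision(rejected=False, phase=0)
```

Passing the two estimators in as callables keeps the sketch independent of any particular model or sampling scheme; in practice the gradient estimate is only computed for queries that survive Phase 1.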
 <h2 id="demonstration">Demonstration</h2>
 <p>We evaluated Gradient Cuff as well as 4 baselines (Perplexity Filter, SmoothLLM, Erase-and-Check, and Self-Reminder) against 6
   different jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) and benign user queries on 2 LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5).