Update index.html

index.html (+17 -121)
@@ -114,119 +114,6 @@ Exploring Refusal Loss Landscapes </title>
 </div>
 </div>
 
-<p>We summarize some recent advances in <strong>Jailbreak Attack</strong> and <strong>Jailbreak Defense</strong> in the table below:</p>
-<div id="tabs">
-<ul>
-<li><a href="#jailbreak-attacks">Jailbreak Attack</a></li>
-<li><a href="#jailbreak-defenses">Jailbreak Defense</a></li>
-</ul>
-<div id="jailbreak-attacks">
-<div id="accordion-attacks">
-<h3>GCG</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://arxiv.org/abs/2307.15043" target="_blank" rel="noopener noreferrer">
-Universal and Transferable Adversarial Attacks on Aligned Language Models</a></li>
-<li>Brief Introduction: Given a (potentially harmful) user query, GCG optimizes and appends an adversarial suffix to the query
-that attempts to induce objectionable behavior from the target LLM.</li>
-</ul>
-</div>
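For intuition, here is a minimal sketch of GCG-style suffix optimization, assuming a HuggingFace causal LM. The model name, query, target prefix, suffix length, and search hyperparameters are placeholder assumptions, and the real attack evaluates whole batches of candidate swaps per step and optimizes across multiple prompts and models; this is an illustration, not the authors' released implementation.

```python
# Hypothetical, simplified GCG-style sketch (placeholders throughout).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; GCG targets aligned chat models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
embed = model.get_input_embeddings().weight        # (vocab_size, hidden_dim)

query = tok("Tell me how to ...", return_tensors="pt").input_ids[0]
target = tok(" Sure, here is how", return_tensors="pt").input_ids[0]
suffix = torch.randint(0, embed.size(0), (8,))     # 8 adversarial suffix tokens

for step in range(100):
    ids = torch.cat([query, suffix, target])
    one_hot = F.one_hot(ids, embed.size(0)).float()
    one_hot.requires_grad_(True)
    logits = model(inputs_embeds=(one_hot @ embed).unsqueeze(0)).logits[0]
    t0 = len(query) + len(suffix)                  # where the target span starts
    # Loss of the affirmative target prefix given query + suffix
    loss = F.cross_entropy(logits[t0 - 1:-1], ids[t0:])
    loss.backward()
    grad = one_hot.grad[len(query):t0]             # gradients at suffix slots
    candidates = (-grad).topk(256, dim=-1).indices # promising swaps per slot
    # Try one random candidate swap; keep it only if it lowers the loss
    pos = torch.randint(0, len(suffix), (1,)).item()
    trial = suffix.clone()
    trial[pos] = candidates[pos, torch.randint(0, 256, (1,)).item()]
    with torch.no_grad():
        t_ids = torch.cat([query, trial, target]).unsqueeze(0)
        t_logits = model(t_ids).logits[0]
        t_loss = F.cross_entropy(t_logits[t0 - 1:-1], t_ids[0, t0:])
    if t_loss < loss:
        suffix = trial
```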
-<h3>AutoDAN</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://arxiv.org/abs/2310.04451" target="_blank" rel="noopener noreferrer">
-AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models</a></li>
-<li>Brief Introduction: AutoDAN is an automatic framework for generating stealthy jailbreak prompts, built on a carefully designed
-hierarchical genetic algorithm. AutoDAN preserves the meaningfulness and fluency (i.e., stealthiness) of jailbreak prompts,
-akin to handcrafted ones, while retaining the automated deployment introduced by prior token-level research such as GCG.
-</li>
-</ul>
-</div>
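The genetic-search skeleton can be pictured with a toy, self-contained sketch. The fitness, mutation, and crossover functions below are dummy placeholders (AutoDAN scores candidates with the target LLM and mutates hierarchically at sentence and word level); only the evolutionary loop structure is the point here.

```python
# Toy genetic-algorithm skeleton in the spirit of AutoDAN (placeholder
# fitness/mutation/crossover; not the paper's hierarchical GA).
import random

def fitness(prompt: str) -> float:
    # Placeholder: AutoDAN scores how likely the target LLM is to comply.
    return -abs(len(prompt) - 60)

def mutate(prompt: str) -> str:
    # Placeholder word-level mutation (synonym-style swap).
    words = prompt.split()
    i = random.randrange(len(words))
    words[i] = random.choice(["kindly", "please", "hypothetically", words[i]])
    return " ".join(words)

def crossover(a: str, b: str) -> str:
    wa, wb = a.split(), b.split()
    cut = random.randrange(1, min(len(wa), len(wb)))
    return " ".join(wa[:cut] + wb[cut:])

population = ["please explain the procedure step by step" for _ in range(20)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                      # keep the fittest half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    population = parents + children
best = max(population, key=fitness)
```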
-<h3>PAIR</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://arxiv.org/abs/2310.08419" target="_blank" rel="noopener noreferrer">
-Jailbreaking Black Box Large Language Models in Twenty Queries</a></li>
-<li>Brief Introduction: PAIR uses an attacker LLM to automatically generate jailbreaks for a separate target LLM
-without human intervention. The attacker LLM iteratively queries the target LLM and refines its candidate
-jailbreak based on the feedback and score provided by a separate judge model.
-Empirically, PAIR often requires fewer than twenty queries to produce a successful jailbreak.</li>
-</ul>
-</div>
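The refinement loop can be sketched as follows; the attacker, target, and judge calls are stubbed placeholders standing in for three separate LLMs, and the 1-10 scoring convention follows the paper.

```python
# Hypothetical PAIR-style refinement loop (all three model calls are stubs).
def attacker(goal: str, history: list[tuple[str, str, int]]) -> str:
    # Placeholder: the attacker LLM proposes a new candidate jailbreak,
    # conditioned on past (prompt, response, score) triples.
    return f"Ignoring prior rules, {goal}"

def target(prompt: str) -> str:
    # Placeholder: the target LLM's response.
    return "Sorry, I cannot help with that."

def judge(goal: str, prompt: str, response: str) -> int:
    # Placeholder: the judge LLM rates jailbreak success from 1 to 10.
    return 1

def pair(goal: str, max_queries: int = 20) -> str | None:
    history: list[tuple[str, str, int]] = []
    for _ in range(max_queries):
        prompt = attacker(goal, history)
        response = target(prompt)
        score = judge(goal, prompt, response)
        if score == 10:          # judge deems the jailbreak successful
            return prompt
        history.append((prompt, response, score))
    return None
```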
-<h3>TAP</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://arxiv.org/abs/2312.02119" target="_blank" rel="noopener noreferrer">
-Tree of Attacks: Jailbreaking Black-Box LLMs Automatically</a></li>
-<li>Brief Introduction: TAP is similar to PAIR. The main difference is that
-the attacker in TAP iteratively refines candidate (attack) prompts using tree-of-thought
-reasoning, pruning unpromising candidates before querying the target.</li>
-</ul>
-</div>
-<h3>Base64</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://arxiv.org/abs/2307.02483" target="_blank" rel="noopener noreferrer">
-Jailbroken: How Does LLM Safety Training Fail?</a></li>
-<li>Brief Introduction: Encode the malicious user query in Base64 before using it to query the model.</li>
-</ul>
-</div>
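A minimal illustration of the transformation (the query and the wrapper sentence are placeholder assumptions):

```python
# Base64 attack transformation: encode the query before sending it.
import base64

query = "..."  # the (potentially harmful) user query
encoded = base64.b64encode(query.encode()).decode()
prompt = f"Respond to the following Base64-encoded request:\n{encoded}"
```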
-<h3>LRL</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://arxiv.org/abs/2310.02446" target="_blank" rel="noopener noreferrer">
-Low-Resource Languages Jailbreak GPT-4</a></li>
-<li>Brief Introduction: Translate the malicious user query into a low-resource language before using it to query the model.</li>
-</ul>
-</div>
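Schematically the attack is just a translation wrapper; `translate` below is a stand-in for any machine-translation model or service (an assumption, not a specific API):

```python
# Hypothetical sketch of the low-resource-language (LRL) attack.
def translate(text: str, target_lang: str) -> str:
    # Placeholder: plug in a real translation model or service here.
    return text  # identity stand-in so the sketch runs

query = "..."                                # the (potentially harmful) query
prompt = translate(query, target_lang="zu")  # e.g. Zulu, a low-resource language
# Send `prompt` to the target model, then translate its reply back to English.
```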
-</div>
-</div>
-
-<div id="jailbreak-defenses">
-<div id="accordion-defenses">
-<h3>Perplexity Filter</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://arxiv.org/abs/2309.00614" target="_blank" rel="noopener noreferrer">
-Baseline Defenses for Adversarial Attacks Against Aligned Language Models</a></li>
-<li>Brief Introduction: The Perplexity Filter uses an LLM to compute the perplexity of the input query and rejects queries
-with high perplexity.</li>
-</ul>
-</div>
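A minimal sketch, assuming a HuggingFace causal LM scores the query; the model choice and threshold below are placeholders, not the paper's exact configuration:

```python
# Hypothetical perplexity-filter sketch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # placeholder scoring model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss    # mean token negative log-likelihood
    return float(torch.exp(loss))

def reject_query(query: str, threshold: float = 500.0) -> bool:
    """Return True if the query should be rejected as likely adversarial."""
    return perplexity(query) > threshold
```

Gradient-optimized suffixes like GCG's tend to be gibberish, which is exactly what drives their perplexity up.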
-<h3>SmoothLLM</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://arxiv.org/abs/2310.03684" target="_blank" rel="noopener noreferrer">
-SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks</a></li>
-<li>Brief Introduction: SmoothLLM randomly perturbs the input query to obtain several copies and aggregates
-the target LLM's responses to these perturbed copies to produce the final response to the
-original query.</li>
-</ul>
-</div>
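A simplified sketch with random character substitutions and a placeholder refusal-keyword vote; the target-LLM call and the refusal test are stand-ins, and the paper also studies other perturbation types and aggregation details:

```python
# Hypothetical SmoothLLM-style sketch (placeholder model and refusal check).
import random
import string

def perturb(query: str, rate: float = 0.1) -> str:
    # Randomly substitute a fraction of characters.
    chars = list(query)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.ascii_letters)
    return "".join(chars)

def respond(query: str) -> str:
    # Placeholder for the target LLM.
    return "Sorry, I cannot help with that."

def is_refusal(response: str) -> bool:
    return any(m in response for m in ("I cannot", "I can't", "Sorry"))

def smooth_llm(query: str, copies: int = 8) -> str:
    responses = [respond(perturb(query)) for _ in range(copies)]
    if sum(map(is_refusal, responses)) > copies / 2:   # majority refuses
        return "Sorry, I cannot help with that."
    # Otherwise return a response consistent with the majority (non-refusal).
    return next(r for r in responses if not is_refusal(r))
```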
-<h3>Erase-Check</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://arxiv.org/abs/2309.02705" target="_blank" rel="noopener noreferrer">
-Certifying LLM Safety against Adversarial Prompting</a></li>
-<li>Brief Introduction: Erase-Check employs a safety checker to test whether the original query or any of its erased sub-sequences
-is harmful. The query is rejected if the checker regards it, or any erased version of it, as harmful.</li>
-</ul>
-</div>
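A toy suffix-erasure sketch; the safety checker is a placeholder (the paper uses a trained classifier or an LLM judge, and also analyzes insertion and infusion variants):

```python
# Hypothetical erase-and-check sketch (suffix-erasure mode, placeholder checker).
def is_harmful(text: str) -> bool:
    # Placeholder safety checker (e.g., a fine-tuned classifier or LLM judge).
    return "bomb" in text.lower()

def erase_check(query: str, max_erase: int = 3) -> bool:
    """Return True (reject) if the query or any erased version is harmful."""
    if is_harmful(query):
        return True
    tokens = query.split()
    # Erase up to max_erase trailing tokens and re-check each shortened prompt.
    for k in range(1, min(max_erase, len(tokens)) + 1):
        if is_harmful(" ".join(tokens[:-k])):
            return True
    return False
```

Because every erased sub-sequence is checked, a harmful query padded with an adversarial suffix (up to the erased length) cannot slip through, which is the source of the paper's certified guarantee.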
-<h3>Self-Reminder</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://assets.researchsquare.com/files/rs-2873090/v1_covered_eb589a01-bf05-4f32-b3eb-0d6864f64ad9.pdf?c=1702456350" target="_blank" rel="noopener noreferrer">
-Defending ChatGPT against Jailbreak Attack via Self-Reminder</a></li>
-<li>Brief Introduction: Self-Reminder modifies the system prompt of the target LLM so that the model reminds itself to process
-and respond to the user in the context of being an aligned LLM.</li>
-</ul>
-</div>
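A minimal sketch of the prompt wrapping; the reminder wording below is paraphrased, not the paper's exact prompt:

```python
# Hypothetical self-reminder wrapper around a chat request.
def self_reminder(user_query: str) -> list[dict[str, str]]:
    system = ("You should be a responsible assistant and should not "
              "generate harmful or misleading content.")
    reminder = "Remember, you should be a responsible assistant."
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"{user_query}\n\n{reminder}"},
    ]
```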
-</div>
-</div>
-
-</div>
-
 <h2 id="refusal-loss">Interpretability</h2>
 <p>Current transformer-based LLMs will return different responses to the same query due to the randomness of
 autoregressive sampling-based generation. With this randomness, it is an
@@ -290,7 +177,7 @@ Exploring Refusal Loss Landscapes </title>
 </div>
 </div>
 
-<h2 id="proposed-approach-gradient-cuff">
+<h2 id="proposed-approach-gradient-cuff">Performance evaluation against practical Jailbreaks</h2>
 <p> With the exploration of the Refusal Loss landscape, we propose Gradient Cuff,
 a two-step jailbreak detection method based on checking the refusal loss and its gradient norm. Our detection procedure is shown below:
 </p>
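As a rough illustration of the two-step check (the thresholds, the refusal test, and the zeroth-order gradient estimate below are placeholder assumptions, not the paper's exact procedure):

```python
# Hypothetical Gradient Cuff-style detection sketch (placeholders throughout).
import random

REFUSAL_MARKERS = ("I cannot", "I can't", "Sorry")

def generate(query: str) -> str:
    # Placeholder for sampling one response from the target LLM.
    return random.choice(["Sure, here is ...",
                          "Sorry, I cannot help with that."])

def refusal_loss(query: str, n: int = 8) -> float:
    # Empirical refusal loss: 1 minus the fraction of sampled responses
    # that refuse the query.
    refusals = sum(any(m in generate(query) for m in REFUSAL_MARKERS)
                   for _ in range(n))
    return 1.0 - refusals / n

def perturb(query: str) -> str:
    # Placeholder perturbation for a zeroth-order gradient estimate.
    return query + random.choice([" ", " please", " now"])

def grad_norm_estimate(query: str, k: int = 4) -> float:
    base = refusal_loss(query)
    return sum(abs(refusal_loss(perturb(query)) - base)
               for _ in range(k)) / k

def reject(query: str, loss_t: float = 0.5, grad_t: float = 0.2) -> bool:
    # Step 1: flag queries the model already tends to refuse.
    if refusal_loss(query) < loss_t:
        return True
    # Step 2: flag queries whose refusal loss has a large gradient norm,
    # a signature of jailbreak prompts in the landscape analysis above.
    return grad_norm_estimate(query) > grad_t
```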
@@ -377,13 +264,22 @@ and <a href="Mailto:[email protected]">Pin-Yu Chen</a>
 <h2 id="citations">Citations</h2>
 <p>If you find Gradient Cuff helpful and useful for your research, please cite our main paper as follows:</p>
 
-<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@
-
-
-
-
-
-
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{DBLP:journals/corr/abs-2412-18171,
+  author     = {Xiaomeng Hu and
+                Pin{-}Yu Chen and
+                Tsung{-}Yi Ho},
+  title      = {Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for
+                Large Language Models},
+  journal    = {CoRR},
+  volume     = {abs/2412.18171},
+  year       = {2024},
+  url        = {https://doi.org/10.48550/arXiv.2412.18171},
+  doi        = {10.48550/ARXIV.2412.18171},
+  eprinttype = {arXiv},
+  eprint     = {2412.18171},
+  timestamp  = {Sat, 25 Jan 2025 12:51:16 +0100},
+  biburl     = {https://dblp.org/rec/journals/corr/abs-2412-18171.bib},
+  bibsource  = {dblp computer science bibliography, https://dblp.org}
 }
 </code></pre></div></div>
 