Update index.html

index.html (+17 -121)
@@ -114,119 +114,6 @@ Exploring Refusal Loss Landscapes </title>
 </div>
 </div>
 
-<p>We summarize some recent advances in <strong>Jailbreak Attack</strong> and <strong>Jailbreak Defense</strong> in the table below:</p>
-<div id="tabs">
-<ul>
-<li><a href="#jailbreak-attacks">Jailbreak Attack</a></li>
-<li><a href="#jailbreak-defenses">Jailbreak Defense</a></li>
-</ul>
-<div id="jailbreak-attacks">
-<div id="accordion-attacks">
-<h3>GCG</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://arxiv.org/abs/2307.15043" target="_blank" rel="noopener noreferrer">
-Universal and Transferable Adversarial Attacks on Aligned Language Models</a></li>
-<li>Brief Introduction: Given a (potentially harmful) user query, GCG optimizes and appends an adversarial suffix to the query
-that attempts to induce objectionable behavior from the target LLM.</li>
-</ul>
-</div>
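For intuition, here is a minimal sketch of GCG-style suffix optimization, assuming a HuggingFace causal LM. The model name, query, target prefix, suffix length, and search hyperparameters are placeholder assumptions, and the real attack evaluates whole batches of candidate swaps per step and optimizes across multiple prompts and models; this is an illustration, not the authors' released implementation.

```python
# Hypothetical, simplified GCG-style sketch (placeholders throughout).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; GCG targets aligned chat models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
embed = model.get_input_embeddings().weight        # (vocab_size, hidden_dim)

query = tok("Tell me how to ...", return_tensors="pt").input_ids[0]
target = tok(" Sure, here is how", return_tensors="pt").input_ids[0]
suffix = torch.randint(0, embed.size(0), (8,))     # 8 adversarial suffix tokens

for step in range(100):
    ids = torch.cat([query, suffix, target])
    one_hot = F.one_hot(ids, embed.size(0)).float()
    one_hot.requires_grad_(True)
    logits = model(inputs_embeds=(one_hot @ embed).unsqueeze(0)).logits[0]
    t0 = len(query) + len(suffix)                  # where the target span starts
    # Loss of the affirmative target prefix given query + suffix
    loss = F.cross_entropy(logits[t0 - 1:-1], ids[t0:])
    loss.backward()
    grad = one_hot.grad[len(query):t0]             # gradients at suffix slots
    candidates = (-grad).topk(256, dim=-1).indices # promising swaps per slot
    # Try one random candidate swap; keep it only if it lowers the loss
    pos = torch.randint(0, len(suffix), (1,)).item()
    trial = suffix.clone()
    trial[pos] = candidates[pos, torch.randint(0, 256, (1,)).item()]
    with torch.no_grad():
        t_ids = torch.cat([query, trial, target]).unsqueeze(0)
        t_logits = model(t_ids).logits[0]
        t_loss = F.cross_entropy(t_logits[t0 - 1:-1], t_ids[0, t0:])
    if t_loss < loss:
        suffix = trial
```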
-<h3>AutoDAN</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://arxiv.org/abs/2310.04451" target="_blank" rel="noopener noreferrer">
-AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models</a></li>
-<li>Brief Introduction: AutoDAN is an automatic framework for generating stealthy jailbreak prompts, built on a carefully designed
-hierarchical genetic algorithm. AutoDAN preserves the meaningfulness and fluency (i.e., stealthiness) of jailbreak prompts,
-akin to handcrafted ones, while retaining the automated deployment introduced by prior token-level research such as GCG.
-</li>
-</ul>
-</div>
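The genetic-search skeleton can be pictured with a toy, self-contained sketch. The fitness, mutation, and crossover functions below are dummy placeholders (AutoDAN scores candidates with the target LLM and mutates hierarchically at sentence and word level); only the evolutionary loop structure is the point here.

```python
# Toy genetic-algorithm skeleton in the spirit of AutoDAN (placeholder
# fitness/mutation/crossover; not the paper's hierarchical GA).
import random

def fitness(prompt: str) -> float:
    # Placeholder: AutoDAN scores how likely the target LLM is to comply.
    return -abs(len(prompt) - 60)

def mutate(prompt: str) -> str:
    # Placeholder word-level mutation (synonym-style swap).
    words = prompt.split()
    i = random.randrange(len(words))
    words[i] = random.choice(["kindly", "please", "hypothetically", words[i]])
    return " ".join(words)

def crossover(a: str, b: str) -> str:
    wa, wb = a.split(), b.split()
    cut = random.randrange(1, min(len(wa), len(wb)))
    return " ".join(wa[:cut] + wb[cut:])

population = ["please explain the procedure step by step" for _ in range(20)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                      # keep the fittest half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    population = parents + children
best = max(population, key=fitness)
```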
-<h3>PAIR</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://arxiv.org/abs/2310.08419" target="_blank" rel="noopener noreferrer">
-Jailbreaking Black Box Large Language Models in Twenty Queries</a></li>
-<li>Brief Introduction: PAIR uses an attacker LLM to automatically generate jailbreaks for a separate target LLM
-without human intervention. The attacker LLM iteratively queries the target LLM and refines its candidate
-jailbreak based on the feedback and score provided by a separate judge model.
-Empirically, PAIR often requires fewer than twenty queries to produce a successful jailbreak.</li>
-</ul>
-</div>
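The refinement loop can be sketched as follows; the attacker, target, and judge calls are stubbed placeholders standing in for three separate LLMs, and the 1-10 scoring convention follows the paper.

```python
# Hypothetical PAIR-style refinement loop (all three model calls are stubs).
def attacker(goal: str, history: list[tuple[str, str, int]]) -> str:
    # Placeholder: the attacker LLM proposes a new candidate jailbreak,
    # conditioned on past (prompt, response, score) triples.
    return f"Ignoring prior rules, {goal}"

def target(prompt: str) -> str:
    # Placeholder: the target LLM's response.
    return "Sorry, I cannot help with that."

def judge(goal: str, prompt: str, response: str) -> int:
    # Placeholder: the judge LLM rates jailbreak success from 1 to 10.
    return 1

def pair(goal: str, max_queries: int = 20) -> str | None:
    history: list[tuple[str, str, int]] = []
    for _ in range(max_queries):
        prompt = attacker(goal, history)
        response = target(prompt)
        score = judge(goal, prompt, response)
        if score == 10:          # judge deems the jailbreak successful
            return prompt
        history.append((prompt, response, score))
    return None
```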
-<h3>TAP</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://arxiv.org/abs/2312.02119" target="_blank" rel="noopener noreferrer">
-Tree of Attacks: Jailbreaking Black-Box LLMs Automatically</a></li>
-<li>Brief Introduction: TAP is similar to PAIR. The main difference is that
-the attacker in TAP iteratively refines candidate (attack) prompts using tree-of-thought
-reasoning, pruning unpromising candidates before querying the target.</li>
-</ul>
-</div>
-<h3>Base64</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://arxiv.org/abs/2307.02483" target="_blank" rel="noopener noreferrer">
-Jailbroken: How Does LLM Safety Training Fail?</a></li>
-<li>Brief Introduction: Encode the malicious user query in Base64 before using it to query the model.</li>
-</ul>
-</div>
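A minimal illustration of the transformation (the query and the wrapper sentence are placeholder assumptions):

```python
# Base64 attack transformation: encode the query before sending it.
import base64

query = "..."  # the (potentially harmful) user query
encoded = base64.b64encode(query.encode()).decode()
prompt = f"Respond to the following Base64-encoded request:\n{encoded}"
```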
-<h3>LRL</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://arxiv.org/abs/2310.02446" target="_blank" rel="noopener noreferrer">
-Low-Resource Languages Jailbreak GPT-4</a></li>
-<li>Brief Introduction: Translate the malicious user query into a low-resource language before using it to query the model.</li>
-</ul>
-</div>
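Schematically the attack is just a translation wrapper; `translate` below is a stand-in for any machine-translation model or service (an assumption, not a specific API):

```python
# Hypothetical sketch of the low-resource-language (LRL) attack.
def translate(text: str, target_lang: str) -> str:
    # Placeholder: plug in a real translation model or service here.
    return text  # identity stand-in so the sketch runs

query = "..."                                # the (potentially harmful) query
prompt = translate(query, target_lang="zu")  # e.g. Zulu, a low-resource language
# Send `prompt` to the target model, then translate its reply back to English.
```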
-</div>
-</div>
-
-<div id="jailbreak-defenses">
-<div id="accordion-defenses">
-<h3>Perplexity Filter</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://arxiv.org/abs/2309.00614" target="_blank" rel="noopener noreferrer">
-Baseline Defenses for Adversarial Attacks Against Aligned Language Models</a></li>
-<li>Brief Introduction: The Perplexity Filter uses an LLM to compute the perplexity of the input query and rejects queries
-with high perplexity.</li>
-</ul>
-</div>
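A minimal sketch, assuming a HuggingFace causal LM scores the query; the model choice and threshold below are placeholders, not the paper's exact configuration:

```python
# Hypothetical perplexity-filter sketch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # placeholder scoring model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss    # mean token negative log-likelihood
    return float(torch.exp(loss))

def reject_query(query: str, threshold: float = 500.0) -> bool:
    """Return True if the query should be rejected as likely adversarial."""
    return perplexity(query) > threshold
```

Gradient-optimized suffixes like GCG's tend to be gibberish, which is exactly what drives their perplexity up.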
-<h3>SmoothLLM</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://arxiv.org/abs/2310.03684" target="_blank" rel="noopener noreferrer">
-SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks</a></li>
-<li>Brief Introduction: SmoothLLM randomly perturbs the input query to obtain several copies and aggregates
-the target LLM's responses to these perturbed copies to produce the final response to the
-original query.</li>
-</ul>
-</div>
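A simplified sketch with random character substitutions and a placeholder refusal-keyword vote; the target-LLM call and the refusal test are stand-ins, and the paper also studies other perturbation types and aggregation details:

```python
# Hypothetical SmoothLLM-style sketch (placeholder model and refusal check).
import random
import string

def perturb(query: str, rate: float = 0.1) -> str:
    # Randomly substitute a fraction of characters.
    chars = list(query)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.ascii_letters)
    return "".join(chars)

def respond(query: str) -> str:
    # Placeholder for the target LLM.
    return "Sorry, I cannot help with that."

def is_refusal(response: str) -> bool:
    return any(m in response for m in ("I cannot", "I can't", "Sorry"))

def smooth_llm(query: str, copies: int = 8) -> str:
    responses = [respond(perturb(query)) for _ in range(copies)]
    if sum(map(is_refusal, responses)) > copies / 2:   # majority refuses
        return "Sorry, I cannot help with that."
    # Otherwise return a response consistent with the majority (non-refusal).
    return next(r for r in responses if not is_refusal(r))
```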
-<h3>Erase-Check</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://arxiv.org/abs/2309.02705" target="_blank" rel="noopener noreferrer">
-Certifying LLM Safety against Adversarial Prompting</a></li>
-<li>Brief Introduction: Erase-Check employs a safety checker to test whether the original query or any of its erased sub-sequences
-is harmful. The query is rejected if the checker regards it, or any erased version of it, as harmful.</li>
-</ul>
-</div>
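A toy suffix-erasure sketch; the safety checker is a placeholder (the paper uses a trained classifier or an LLM judge, and also analyzes insertion and infusion variants):

```python
# Hypothetical erase-and-check sketch (suffix-erasure mode, placeholder checker).
def is_harmful(text: str) -> bool:
    # Placeholder safety checker (e.g., a fine-tuned classifier or LLM judge).
    return "bomb" in text.lower()

def erase_check(query: str, max_erase: int = 3) -> bool:
    """Return True (reject) if the query or any erased version is harmful."""
    if is_harmful(query):
        return True
    tokens = query.split()
    # Erase up to max_erase trailing tokens and re-check each shortened prompt.
    for k in range(1, min(max_erase, len(tokens)) + 1):
        if is_harmful(" ".join(tokens[:-k])):
            return True
    return False
```

Because every erased sub-sequence is checked, a harmful query padded with an adversarial suffix (up to the erased length) cannot slip through, which is the source of the paper's certified guarantee.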
-<h3>Self-Reminder</h3>
-<div>
-<ul>
-<li>Paper: <a href="https://assets.researchsquare.com/files/rs-2873090/v1_covered_eb589a01-bf05-4f32-b3eb-0d6864f64ad9.pdf?c=1702456350" target="_blank" rel="noopener noreferrer">
-Defending ChatGPT against Jailbreak Attack via Self-Reminder</a></li>
-<li>Brief Introduction: Self-Reminder modifies the system prompt of the target LLM so that the model reminds itself to process
-and respond to the user in the context of being an aligned LLM.</li>
-</ul>
-</div>
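A minimal sketch of the prompt wrapping; the reminder wording below is paraphrased, not the paper's exact prompt:

```python
# Hypothetical self-reminder wrapper around a chat request.
def self_reminder(user_query: str) -> list[dict[str, str]]:
    system = ("You should be a responsible assistant and should not "
              "generate harmful or misleading content.")
    reminder = "Remember, you should be a responsible assistant."
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"{user_query}\n\n{reminder}"},
    ]
```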
-</div>
-</div>
-
-</div>
-
 <h2 id="refusal-loss">Interpretability</h2>
 <p>Current transformer-based LLMs will return different responses to the same query due to the randomness of
 autoregressive sampling-based generation. With this randomness, it is an
@@ -290,7 +177,7 @@ Exploring Refusal Loss Landscapes </title>
 </div>
 </div>
 
-<h2 id="proposed-approach-gradient-cuff">
+<h2 id="proposed-approach-gradient-cuff">Performance evaluation against practical Jailbreaks</h2>
 <p> With the exploration of the Refusal Loss landscape, we propose Gradient Cuff,
 a two-step jailbreak detection method based on checking the refusal loss and its gradient norm. Our detection procedure is shown below:
 </p>
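As a rough illustration of the two-step check (the thresholds, the refusal test, and the zeroth-order gradient estimate below are placeholder assumptions, not the paper's exact procedure):

```python
# Hypothetical Gradient Cuff-style detection sketch (placeholders throughout).
import random

REFUSAL_MARKERS = ("I cannot", "I can't", "Sorry")

def generate(query: str) -> str:
    # Placeholder for sampling one response from the target LLM.
    return random.choice(["Sure, here is ...",
                          "Sorry, I cannot help with that."])

def refusal_loss(query: str, n: int = 8) -> float:
    # Empirical refusal loss: 1 minus the fraction of sampled responses
    # that refuse the query.
    refusals = sum(any(m in generate(query) for m in REFUSAL_MARKERS)
                   for _ in range(n))
    return 1.0 - refusals / n

def perturb(query: str) -> str:
    # Placeholder perturbation for a zeroth-order gradient estimate.
    return query + random.choice([" ", " please", " now"])

def grad_norm_estimate(query: str, k: int = 4) -> float:
    base = refusal_loss(query)
    return sum(abs(refusal_loss(perturb(query)) - base)
               for _ in range(k)) / k

def reject(query: str, loss_t: float = 0.5, grad_t: float = 0.2) -> bool:
    # Step 1: flag queries the model already tends to refuse.
    if refusal_loss(query) < loss_t:
        return True
    # Step 2: flag queries whose refusal loss has a large gradient norm,
    # a signature of jailbreak prompts in the landscape analysis above.
    return grad_norm_estimate(query) > grad_t
```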
@@ -377,13 +264,22 @@ and <a href="Mailto:[email protected]">Pin-Yu Chen</a>
 <h2 id="citations">Citations</h2>
 <p>If you find Gradient Cuff helpful and useful for your research, please cite our main paper as follows:</p>
 
-<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@
-
-
-
-
-
-
+<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{DBLP:journals/corr/abs-2412-18171,
+  author     = {Xiaomeng Hu and
+                Pin{-}Yu Chen and
+                Tsung{-}Yi Ho},
+  title      = {Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for
+                Large Language Models},
+  journal    = {CoRR},
+  volume     = {abs/2412.18171},
+  year       = {2024},
+  url        = {https://doi.org/10.48550/arXiv.2412.18171},
+  doi        = {10.48550/ARXIV.2412.18171},
+  eprinttype = {arXiv},
+  eprint     = {2412.18171},
+  timestamp  = {Sat, 25 Jan 2025 12:51:16 +0100},
+  biburl     = {https://dblp.org/rec/journals/corr/abs-2412-18171.bib},
+  bibsource  = {dblp computer science bibliography, https://dblp.org}
 }
 </code></pre></div></div>
 