gregH committed on
Commit
f6d6ea2
·
verified ·
1 Parent(s): 86ff643

Update index.html

Files changed (1)
  1. index.html +17 -121
index.html CHANGED
@@ -114,119 +114,6 @@ Exploring Refusal Loss Landscapes </title>
114
  </div>
115
  </div>
116
 
117
 - <p>We summarize some recent advances in <strong>Jailbreak Attack</strong> and <strong>Jailbreak Defense</strong> in the table below: </p>
118
- <div id="tabs">
119
- <ul>
120
- <li><a href="#jailbreak-attacks">Jailbreak Attack</a></li>
121
- <li><a href="#jailbreak-defenses">Jailbreak Defense</a></li>
122
- </ul>
123
- <div id="jailbreak-attacks">
124
- <div id="accordion-attacks">
125
- <h3>GCG</h3>
126
- <div>
127
- <ul>
128
- <li>Paper: <a href="https://arxiv.org/abs/2307.15043" target="_blank" rel="noopener noreferrer">
129
- Universal and Transferable Adversarial Attacks on Aligned Language Models</a></li>
130
 - <li>Brief Introduction: Given a (potentially harmful) user query, GCG optimizes an adversarial suffix that is appended to the query
131
 - and attempts to elicit the requested harmful behavior from the target LLM. </li>
132
- </ul>
133
- </div>
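To make the suffix-optimization idea concrete, here is a minimal sketch (not the authors' code) of the objective that GCG minimizes: the negative log-likelihood of an affirmative target prefix given the query plus the current suffix. The real attack then runs a gradient-guided greedy search over suffix tokens; the model name and target string below are illustrative stand-ins.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is only a stand-in for an aligned chat model such as Llama-2-chat.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def suffix_loss(query, suffix, target="Sure, here is"):
    prompt_ids = tok(query + " " + suffix, return_tensors="pt").input_ids
    target_ids = tok(target, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100        # score only the target tokens
    return model(input_ids, labels=labels).loss    # GCG updates the suffix to push this loss down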
134
- <h3>AutoDAN</h3>
135
- <div>
136
- <ul>
137
- <li>Paper: <a href="https://arxiv.org/abs/2310.04451" target="_blank" rel="noopener noreferrer">
138
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models</a></li>
139
 - <li>Brief Introduction: AutoDAN is an automatic framework for generating stealthy jailbreak prompts based on a carefully designed
140
 - hierarchical genetic algorithm. AutoDAN preserves the meaningfulness and fluency (i.e., stealthiness) of jailbreak prompts,
141
 - akin to handcrafted ones, while retaining the automated deployment introduced by prior token-level attacks such as GCG.
142
- </li>
143
- </ul>
144
- </div>
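A very loose genetic-algorithm skeleton in the spirit of AutoDAN (not the paper's hierarchical implementation): attack_score and paraphrase are hypothetical helpers standing in for the LLM-based fitness function and the fluency-preserving mutation, and the crossover shown is a crude string cut.

import random

def autodan_like_search(seed_prompts, attack_score, paraphrase, generations=20, pop_size=16):
    population = list(seed_prompts)            # assumes at least two seed prompts
    for _ in range(generations):
        ranked = sorted(population, key=attack_score, reverse=True)
        elites = ranked[: pop_size // 2]       # keep the best-scoring half
        children = []
        while len(elites) + len(children) < pop_size:
            p1, p2 = random.sample(elites, 2)
            cut = random.randint(1, min(len(p1), len(p2)) - 1)
            children.append(paraphrase(p1[:cut] + p2[cut:]))   # crossover + fluency-preserving mutation
        population = elites + children
    return max(population, key=attack_score)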
145
- <h3>PAIR</h3>
146
- <div>
147
- <ul>
148
- <li>Paper: <a href="https://arxiv.org/abs/2310.08419" target="_blank" rel="noopener noreferrer">
149
- Jailbreaking Black Box Large Language Models in Twenty Queries</a></li>
150
 - <li>Brief Introduction: PAIR uses an attacker LLM to automatically generate jailbreaks for a separate target LLM
151
 - without human intervention. The attacker LLM iteratively queries the target LLM and refines a candidate
152
 - jailbreak based on the feedback and score provided by another judge model.
153
- Empirically, PAIR often requires fewer than twenty queries to produce a successful jailbreak.</li>
154
- </ul>
155
- </div>
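The PAIR loop can be pictured roughly as below; attacker, target, and judge are hypothetical wrappers around three separate LLM calls, and the 1-10 scoring convention is only an assumption.

def pair_attack(goal, attacker, target, judge, max_queries=20, success_score=10):
    history, prompt = [], goal
    for _ in range(max_queries):
        response = target(prompt)                  # query the target LLM
        score = judge(goal, prompt, response)      # e.g. a 1-10 jailbreak rating
        if score >= success_score:
            return prompt                          # successful jailbreak found
        history.append((prompt, response, score))
        prompt = attacker(goal, history)           # attacker refines the candidate prompt
    return None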
156
- <h3>TAP</h3>
157
- <div>
158
- <ul>
159
- <li>Paper: <a href="https://arxiv.org/abs/2312.02119" target="_blank" rel="noopener noreferrer">
160
- Tree of Attacks: Jailbreaking Black-Box LLMs Automatically</a></li>
161
- <li>Brief Introduction: TAP is similar to PAIR. The main difference is that
162
- the attacker in TAP iteratively refines candidate (attack) prompts using tree-of-thought
163
- reasoning.</li>
164
- </ul>
165
- </div>
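A toy rendering of TAP's tree search, reusing the hypothetical attacker/target/judge helpers from the PAIR sketch above plus an on_topic pruning check; the branching factor, frontier width, and score cutoff are placeholders.

def tap_attack(goal, attacker, target, judge, on_topic, depth=5, branch=4, width=8):
    frontier = [goal]
    for _ in range(depth):
        candidates = [attacker(goal, p) for p in frontier for _ in range(branch)]
        candidates = [p for p in candidates if on_topic(goal, p)]        # prune off-topic branches
        scored = [(judge(goal, p, target(p)), p) for p in candidates]
        for score, prompt in scored:
            if score >= 10:
                return prompt                                            # successful jailbreak found
        frontier = [p for _, p in sorted(scored, reverse=True)[:width]]  # keep only the best prompts
    return None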
166
- <h3>Base64</h3>
167
- <div>
168
- <ul>
169
- <li>Paper: <a href="https://arxiv.org/abs/2307.02483" target="_blank" rel="noopener noreferrer">
170
- Jailbroken: How Does LLM Safety Training Fail?</a></li>
171
- <li>Brief Introduction: Encode the malicious user query into base64 format before using it to query the model.</li>
172
- </ul>
173
- </div>
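A minimal sketch of the Base64 attack; the query and the prompt template are illustrative.

import base64

query = "PLACEHOLDER: a harmful request"
encoded = base64.b64encode(query.encode("utf-8")).decode("ascii")
prompt = f"Respond to the following base64-encoded request:\n{encoded}"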
174
- <h3>LRL</h3>
175
- <div>
176
- <ul>
177
- <li>Paper: <a href="https://arxiv.org/abs/2310.02446" target="_blank" rel="noopener noreferrer">
178
- Low-Resource Languages Jailbreak GPT-4</a></li>
179
 - <li>Brief Introduction: Translate the malicious user query into a low-resource language before using it to query the model.</li>
180
- </ul>
181
- </div>
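A sketch of the low-resource-language attack; translate is a hypothetical machine-translation helper, and Zulu is one example of a low-resource language.

def lrl_attack(query, translate, target_llm, lang="zu"):   # "zu" = Zulu
    translated = translate(query, target_lang=lang)        # translate the harmful query
    response = target_llm(translated)                      # query the model in the low-resource language
    return translate(response, target_lang="en")           # translate the answer back to English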
182
- </div>
183
- </div>
184
-
185
- <div id="jailbreak-defenses">
186
- <div id="accordion-defenses">
187
 - <h3>Perplexity Filter</h3>
188
- <div>
189
- <ul>
190
- <li>Paper: <a href="https://arxiv.org/abs/2309.00614" target="_blank" rel="noopener noreferrer">
191
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models</a></li>
192
- <li>Brief Introduction: Perplexity Filter uses an LLM to compute the perplexity of the input query and rejects those
193
- with high perplexity.</li>
194
- </ul>
195
- </div>
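A minimal sketch of such a filter using GPT-2 as the scoring LLM; the threshold is illustrative and would be calibrated on benign queries.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def perplexity(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss     # mean negative log-likelihood per token
    return torch.exp(loss).item()

def passes_filter(query, threshold=500.0):
    return perplexity(query) < threshold       # reject high-perplexity (e.g. gibberish-suffix) queries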
196
- <h3>SmoothLLM</h3>
197
- <div>
198
- <ul>
199
- <li>Paper: <a href="https://arxiv.org/abs/2310.03684" target="_blank" rel="noopener noreferrer">
200
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks</a></li>
201
 - <li>Brief Introduction: SmoothLLM randomly perturbs the original input query to obtain several copies and aggregates
202
 - the target LLM's responses to these perturbed queries to produce the final response to the
203
 - original query.
204
- </li>
205
- </ul>
206
- </div>
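A rough sketch of the SmoothLLM idea with simple character-level perturbations and a plain majority vote; target_llm and is_refusal are hypothetical helpers.

import random
import string

def perturb(text, rate=0.1):
    chars = list(text)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.ascii_letters)   # randomly swap ~10% of characters
    return "".join(chars)

def smoothllm(query, target_llm, is_refusal, n_copies=6):
    responses = [target_llm(perturb(query)) for _ in range(n_copies)]
    refusals = sum(is_refusal(r) for r in responses)
    if refusals > n_copies / 2:                              # majority of perturbed copies were refused
        return "I cannot help with that."
    return random.choice([r for r in responses if not is_refusal(r)])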
207
- <h3>Erase-Check</h3>
208
- <div>
209
- <ul>
210
- <li>Paper: <a href="https://arxiv.org/abs/2309.02705" target="_blank" rel="noopener noreferrer">
211
- Certifying LLM Safety against Adversarial Prompting</a></li>
212
 - <li>Brief Introduction: Erase-Check employs a safety checker to examine whether the original query or any of its erased sub-sentences
213
 - is harmful. The query is rejected if the query itself or one of its erased sub-sentences is regarded as harmful by the safety checker.</li>
214
- </ul>
215
- </div>
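A simplified sketch of Erase-Check that only erases trailing tokens; is_harmful stands in for the safety-checker call.

def erase_check(query, is_harmful, max_erase=20):
    tokens = query.split()
    candidates = [query] + [" ".join(tokens[:-k]) for k in range(1, min(max_erase, len(tokens)))]
    return any(is_harmful(c) for c in candidates)   # True means the query should be rejected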
216
- <h3>Self-Reminder</h3>
217
- <div>
218
- <ul>
219
- <li>Paper: <a href="https://assets.researchsquare.com/files/rs-2873090/v1_covered_eb589a01-bf05-4f32-b3eb-0d6864f64ad9.pdf?c=1702456350" target="_blank" rel="noopener noreferrer">
220
- Defending ChatGPT against Jailbreak Attack via Self-Reminder</a></li>
221
 - <li>Brief Introduction: Self-Reminder modifies the system prompt of the target LLM so that the model reminds itself to process
222
 - and respond to the user in the context of being an aligned LLM.</li>
223
- </ul>
224
- </div>
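A minimal sketch of the Self-Reminder wrapping; the reminder wording below is illustrative, not the paper's exact prompt.

def self_reminder_prompt(user_query):
    reminder = ("You should be a responsible AI assistant and should not "
                "generate harmful or misleading content.")
    return f"{reminder}\n\n{user_query}\n\nRemember, {reminder}"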
225
- </div>
226
- </div>
227
-
228
- </div>
229
-
230
  <h2 id="refusal-loss">Interpretability</h2>
231
  <p>Current transformer-based LLMs will return different responses to the same query due to the randomness of
232
  autoregressive sampling-based generation. With this randomness, it is an
@@ -290,7 +177,7 @@ Exploring Refusal Loss Landscapes </title>
290
  </div>
291
  </div>
292
 
293
- <h2 id="proposed-approach-gradient-cuff">Experimental results on benchmarks</h2>
294
  <p> With the exploration of the Refusal Loss landscape, we propose Gradient Cuff,
295
  a two-step jailbreak detection method based on checking the refusal loss and its gradient norm. Our detection procedure is shown below:
296
  </p>
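For readers who prefer code, here is a rough, unofficial sketch of a two-step check on the refusal loss and its gradient norm; the sample counts, thresholds, and perturbation scheme are placeholders rather than the paper's settings, and sample_responses, perturb, and is_refusal are hypothetical helpers.

import numpy as np

def refusal_loss(query, sample_responses, is_refusal, n=8):
    responses = sample_responses(query, n)              # n stochastic generations for the same query
    return 1.0 - sum(map(is_refusal, responses)) / n    # low value: the model tends to refuse

def two_step_check(query, perturb, sample_responses, is_refusal,
                   loss_threshold=0.5, grad_threshold=1.0, k=4):
    phi = refusal_loss(query, sample_responses, is_refusal)
    if phi < loss_threshold:                            # step 1: the model already refuses the query
        return "reject"
    # step 2: zeroth-order estimate of how sharply the refusal loss changes
    # around the query (up to a scale set by the perturbation size)
    diffs = [refusal_loss(perturb(query), sample_responses, is_refusal) - phi for _ in range(k)]
    if np.linalg.norm(diffs) > grad_threshold:
        return "reject"
    return "accept"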
@@ -377,13 +264,22 @@ and <a href="Mailto:[email protected]">Pin-Yu Chen</a>
377
  <h2 id="citations">Citations</h2>
378
  <p>If you find Gradient Cuff helpful and useful for your research, please cite our main paper as follows:</p>
379
 
380
- <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@misc{hu2024gradient,
381
- title={Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes},
382
- author={Xiaomeng Hu and Pin-Yu Chen and Tsung-Yi Ho},
383
- year={2024},
384
- eprint={2403.00867},
385
- archivePrefix={arXiv},
386
- primaryClass={cs.CR}
387
  }
388
  </code></pre></div></div>
389
 
 
114
  </div>
115
  </div>
116
 
117
  <h2 id="refusal-loss">Interpretability</h2>
118
  <p>Current transformer-based LLMs will return different responses to the same query due to the randomness of
119
  autoregressive sampling-based generation. With this randomness, it is an
 
177
  </div>
178
  </div>
179
 
180
+ <h2 id="proposed-approach-gradient-cuff">Performance evaluation against practical Jailbreaks</h2>
181
  <p> With the exploration of the Refusal Loss landscape, we propose Gradient Cuff,
182
  a two-step jailbreak detection method based on checking the refusal loss and its gradient norm. Our detection procedure is shown below:
183
  </p>
 
264
  <h2 id="citations">Citations</h2>
265
  <p>If you find Gradient Cuff helpful and useful for your research, please cite our main paper as follows:</p>
266
 
267
+ <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{DBLP:journals/corr/abs-2412-18171,
268
+ author = {Xiaomeng Hu and
269
+ Pin{-}Yu Chen and
270
+ Tsung{-}Yi Ho},
271
+ title = {Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for
272
+ Large Language Models},
273
+ journal = {CoRR},
274
+ volume = {abs/2412.18171},
275
+ year = {2024},
276
+ url = {https://doi.org/10.48550/arXiv.2412.18171},
277
+ doi = {10.48550/ARXIV.2412.18171},
278
+ eprinttype = {arXiv},
279
+ eprint = {2412.18171},
280
+ timestamp = {Sat, 25 Jan 2025 12:51:16 +0100},
281
+ biburl = {https://dblp.org/rec/journals/corr/abs-2412-18171.bib},
282
+ bibsource = {dblp computer science bibliography, https://dblp.org}
283
  }
284
  </code></pre></div></div>
285