gregH committed on
Commit ebb441a · verified · 1 Parent(s): ba95930

Update index.html

Files changed (1)
  1. index.html +2 -2
index.html CHANGED
@@ -223,7 +223,7 @@ Exploring Refusal Loss Landscapes </title>
 
   </div>
 
- <h2 id="refusal-loss">Refusal Loss Landscape Exploration</h2>
+ <h2 id="refusal-loss">Interpretability</h2>
  <p>Current transformer-based LLMs will return different responses to the same query due to the randomness of
  autoregressive sampling-based generation. With this randomness, it is an
  interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
@@ -286,7 +286,7 @@ Exploring Refusal Loss Landscapes </title>
  </div>
  </div>
 
- <h2 id="proposed-approach-gradient-cuff">Proposed Approach: Gradient Cuff</h2>
+ <h2 id="proposed-approach-gradient-cuff">Experimental results on benchmarks</h2>
  <p> With the exploration of the Refusal Loss landscape, we propose Gradient Cuff,
  a two-step jailbreak detection method based on checking the refusal loss and its gradient norm. Our detection procedure is shown below:
  </p>