gregH committed on
Commit 34b33b0 · verified · 1 Parent(s): 826654d

Update index.html

Files changed (1)
  1. index.html +6 -9
index.html CHANGED
@@ -119,15 +119,12 @@ Exploring Refusal Loss Landscapes </title>
  </div>
 
  <h2 id="refusal-loss">Token Highlighter: Principle and Interpretability</h2>
- <p>Current transformer-based LLMs will return different responses to the same query due to the randomness of
- autoregressive sampling-based generation. With this randomness, it is an
- interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
- sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called <strong>Refusal Loss</strong> to
- represent the probability with which the LLM won't reject the input user query. By using 1 to denote successful jailbroken and 0 to denote
- the opposite, we compute the empirical Refusal Loss as the sample mean of the jailbroken results returned from the target LLM.
- <!--Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
- mean of the Jailbroken results (1 indicates successful jailbreak, 0 indicates the opposite) to approximate the function value. -->
- We visualize the 2-D landscape of the empirical Refusal Loss on Vicuna 7B and Llama-2 7B as below:
  </p>
 
  <div class="container jailbreak-intro-sec">
 
  </div>
 
  <h2 id="refusal-loss">Token Highlighter: Principle and Interpretability</h2>
+ <p>Studies found that many successful jailbreak attempts share a common property that
+ they all trick the LLM into generating affirmations like starting with "Sure, here is" at the beginning
+ of their responses. Drawing upon this inspiration, our proposed defense aims to find the tokens that
+ are most critical in forcing the LLM to generate such affirmative responses, decrease their importance
+ in the generation, and thereby resolve the potential jailbreak risks brought by these tokens. To identify
+ these tokens, we propose a new concept called the
  </p>
 
  <div class="container jailbreak-intro-sec">