Token-Highlighter

gregH committed on Feb 13

Commit

34b33b0

verified ·

1 Parent(s): 826654d

Update index.html

Files changed (1)

index.html CHANGED

@@ -119,15 +119,12 @@ Exploring Refusal Loss Landscapes </title>
 </div>
 <h2 id="refusal-loss">Token Highlighter: Principle and Interpretability</h2>
-<p>Current transformer-based LLMs will return different responses to the same query due to the randomness of
-  autoregressive sampling-based generation. With this randomness, it is an
-  interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
-  sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called <strong>Refusal Loss</strong> to
-  represent the probability with which the LLM won't reject the input user query. By using 1 to denote a successful jailbreak and 0 to denote
-  the opposite, we compute the empirical Refusal Loss as the sample mean of the jailbreak results returned from the target LLM.
-  <!--Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
-  mean of the Jailbroken results (1 indicates successful jailbreak, 0 indicates the opposite) to approximate the function value. -->
-  We visualize the 2-D landscape of the empirical Refusal Loss on Vicuna 7B and Llama-2 7B as below:
 </p>
 <div class="container jailbreak-intro-sec">

 </div>
 <h2 id="refusal-loss">Token Highlighter: Principle and Interpretability</h2>
+<p>Studies have found that many successful jailbreak attempts share a common property:
+they trick the LLM into beginning its response with an affirmation such as "Sure, here is".
+Drawing on this observation, our proposed defense aims to find the tokens that
+are most critical in forcing the LLM to generate such affirmative responses, decrease their importance
+in the generation, and thereby resolve the potential jailbreak risks brought by these tokens. To identify
+these tokens, we propose a new concept called the
 </p>
 <div class="container jailbreak-intro-sec">