Update index.html
index.html CHANGED (+6 -9)
@@ -119,15 +119,12 @@ Exploring Refusal Loss Landscapes </title>
 </div>
 
 <h2 id="refusal-loss">Token Highlighter: Principle and Interpretability</h2>
-<p>
-
-
-
-
-
-<!--Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
-mean of the Jailbroken results (1 indicates successful jailbreak, 0 indicates the opposite) to approximate the function value. -->
-We visualize the 2-D landscape of the empirical Refusal Loss on Vicuna 7B and Llama-2 7B as below:
+<p>Studies found that many successful jailbreak attempts share a common property that
+they all trick the LLM into generating affirmations like starting with "Sure, here is" at the beginning
+of their responses. Drawing upon this inspiration, our proposed defense aims to find the tokens that
+are most critical in forcing the LLM to generate such affirmative responses, decrease their importance
+in the generation, and thereby resolve the potential jailbreak risks brought by these tokens. To identify
+these tokens, we propose a new concept called the
 </p>
 
 <div class="container jailbreak-intro-sec">