Update index.html
index.html +13 -9
index.html
CHANGED
@@ -88,15 +88,19 @@ Exploring Refusal Loss Landscapes </title>
|
|
88 |
<main id="content" class="main-content" role="main">
|
89 |
<h2 id="introduction">Introduction</h2>
|
90 |
|
91 |
-
<p>Large Language Models (LLMs) are
|
92 |
-
|
93 |
-
|
94 |
-
|
95 |
-
|
96 |
-
|
97 |
-
|
98 |
-
|
99 |
-
|
|
|
|
|
|
|
|
|
100 |
</p>
|
101 |
|
102 |
<h2 id="what-is-jailbreak">What is Jailbreak?</h2>
|
|
|
88 |
<main id="content" class="main-content" role="main">
|
89 |
<h2 id="introduction">Introduction</h2>
|
90 |
|
91 |
+
<p>Large Language Models (LLMs) are increasingly being integrated into services such as ChatGPT to provide responses to user queries.
|
92 |
+
To mitigate potential harm and prevent misuse, there have been concerted efforts to align LLMs with human values and legal requirements
|
93 |
+
by incorporating various techniques, such as Reinforcement Learning from Human Feedback (RLHF),
|
94 |
+
into the training of the LLMs. However, recent research has shown that even aligned
|
95 |
+
LLMs are susceptible to adversarial manipulations known as Jailbreak Attacks. To address this challenge, this paper proposes a method called
|
96 |
+
<strong>Token Highlighter</strong> to inspect and mitigate the potential jailbreak threats in the user query.
|
97 |
+
Token Highlighter introduces a concept called Affirmation Loss to measure the LLM's willingness to answer the user query.
|
98 |
+
It then uses the gradient of Affirmation Loss for each token in the user query to locate the jailbreak-critical tokens. Further,
|
99 |
+
Token Highlighter applies our proposed Soft Removal technique to mitigate the jailbreak effects of critical tokens by shrinking their
|
100 |
+
token embeddings. Experimental results on two aligned LLMs (LLaMA-2 and Vicuna-V1.5) demonstrate that the proposed method can effectively
|
101 |
+
defend against a variety of Jailbreak Attacks while maintaining competent performance on benign questions of the AlpacaEval benchmark. In
|
102 |
+
addition, Token Highlighter is a cost-effective and interpretable defense because it only needs to query the protected LLM once to compute
|
103 |
+
the Affirmation Loss and can highlight the critical tokens upon refusal.
|
104 |
</p>
|
105 |
|
106 |
<h2 id="what-is-jailbreak">What is Jailbreak?</h2>
|
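Review note: the added introduction describes two steps, locating critical tokens from the per-token gradient of the Affirmation Loss and then shrinking their embeddings (Soft Removal). The following is a minimal sketch of those two steps only, not the authors' implementation. It assumes the gradients have already been computed by an autodiff framework, that a token's influence is the L2 norm of its gradient, and that the tokens in the top fraction by influence are treated as critical. The function name `soft_removal` and the parameters `top_fraction` and `shrink` are hypothetical; the text above does not specify them.

```python
import numpy as np


def soft_removal(embeddings, grads, top_fraction=0.25, shrink=0.5):
    """Shrink the embeddings of the most influential tokens.

    embeddings: (n_tokens, dim) token embeddings of the user query.
    grads: (n_tokens, dim) gradient of the Affirmation Loss with respect
        to each token embedding (assumed to be computed elsewhere).
    top_fraction, shrink: hypothetical hyperparameters for illustration.
    Returns the softened embeddings and the indices of the critical tokens.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    grads = np.asarray(grads, dtype=float)
    if embeddings.shape != grads.shape:
        raise ValueError("embeddings and grads must have the same shape")

    # Assumption: influence of a token is the L2 norm of its gradient.
    influence = np.linalg.norm(grads, axis=1)

    # Select the top fraction of tokens by influence (at least one token).
    k = max(1, int(np.ceil(top_fraction * len(influence))))
    critical = np.sort(np.argsort(-influence, kind="stable")[:k])

    # Soft Removal: scale the critical embeddings instead of deleting tokens.
    softened = embeddings.copy()
    softened[critical] *= shrink
    return softened, critical
```

In a real pipeline, `grads` would come from backpropagating the Affirmation Loss through the protected LLM to the input embeddings, and the softened embeddings would be passed to the model in place of the originals. The returned indices are what makes the defense interpretable: they can be mapped back to the query tokens and highlighted when the model refuses.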