gregH committed · verified
Commit aebf979 · 1 Parent(s): c0f5964

Update index.html

Files changed (1)
  1. index.html +13 -9
index.html CHANGED
@@ -88,15 +88,19 @@ Exploring Refusal Loss Landscapes </title>
   <main id="content" class="main-content" role="main">
   <h2 id="introduction">Introduction</h2>
 
-  <p>Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a
-  query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align
-  these LLMs to human values using advanced training techniques such as Reinforcement Learning from
-  Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
-  jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge,
-  we define and investigate the <strong>Refusal Loss</strong> of LLMs and then propose a method called <strong>Gradient Cuff</strong> to
-  detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak" and summarize people's efforts in Jailbreak
-  attack and Jailbreak defense. Then we present the 2-D Refusal Loss Landscape and propose Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
-  methods and show the defense performance against several Jailbreak attack methods.
+  <p>Large Language Models (LLMs) are increasingly being integrated into services such as ChatGPT to provide responses to user queries.
+  To mitigate potential harm and prevent misuse, there have been concerted efforts to align the LLMs with human values and legal compliance
+  by incorporating various techniques, such as Reinforcement Learning from Human Feedback (RLHF),
+  into the training of the LLMs. However, recent research has exposed that even aligned
+  LLMs are susceptible to adversarial manipulations known as Jailbreak Attacks. To address this challenge, this paper proposes a method called
+  <strong>Token Highlighter</strong> to inspect and mitigate the potential jailbreak threats in the user query.
+  Token Highlighter introduced a concept called Affirmation Loss to measure the LLM's willingness to answer the user query.
+  It then uses the gradient of Affirmation Loss for each token in the user query to locate the jailbreak-critical tokens. Further,
+  Token Highlighter exploits our proposed Soft Removal technique to mitigate the jailbreak effects of critical tokens via shrinking their
+  token embeddings. Experimental results on two aligned LLMs (LLaMA-2 and Vicuna-V1.5) demonstrate that the proposed method can effectively
+  defend against a variety of Jailbreak Attacks while maintaining competent performance on benign questions of the AlpacaEval benchmark. In
+  addition, Token Highlighter is a cost-effective and interpretable defense because it only needs to query the protected LLM once to compute
+  the Affirmation Loss and can highlight the critical tokens upon refusal.
   </p>
 
   <h2 id="what-is-jailbreak">What is Jailbreak?</h2>