gregH committed · verified
Commit aebf979 · 1 Parent(s): c0f5964

Update index.html

Files changed (1)
  1. index.html +13 -9
index.html CHANGED
@@ -88,15 +88,19 @@ Exploring Refusal Loss Landscapes </title>
   <main id="content" class="main-content" role="main">
   <h2 id="introduction">Introduction</h2>
 
-  <p>Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a
-  query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align
-  these LLMs to human values using advanced training techniques such as Reinforcement Learning from
-  Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
-  jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge,
-  we define and investigate the <strong>Refusal Loss</strong> of LLMs and then propose a method called <strong>Gradient Cuff</strong> to
-  detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak" and summarize people's efforts in Jailbreak
-  attack and Jailbreak defense. Then we present the 2-D Refusal Loss Landscape and propose Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
-  methods and show the defense performance against several Jailbreak attack methods.
+  <p>Large Language Models (LLMs) are increasingly being integrated into services such as ChatGPT to provide responses to user queries.
+  To mitigate potential harm and prevent misuse, there have been concerted efforts to align the LLMs with human values and legal compliance
+  by incorporating various techniques, such as Reinforcement Learning from Human Feedback (RLHF),
+  into the training of the LLMs. However, recent research has exposed that even aligned
+  LLMs are susceptible to adversarial manipulations known as Jailbreak Attacks. To address this challenge, this paper proposes a method called
+  <strong>Token Highlighter</strong> to inspect and mitigate the potential jailbreak threats in the user query.
+  Token Highlighter introduced a concept called Affirmation Loss to measure the LLM's willingness to answer the user query.
+  It then uses the gradient of Affirmation Loss for each token in the user query to locate the jailbreak-critical tokens. Further,
+  Token Highlighter exploits our proposed Soft Removal technique to mitigate the jailbreak effects of critical tokens via shrinking their
+  token embeddings. Experimental results on two aligned LLMs (LLaMA-2 and Vicuna-V1.5) demonstrate that the proposed method can effectively
+  defend against a variety of Jailbreak Attacks while maintaining competent performance on benign questions of the AlpacaEval benchmark. In
+  addition, Token Highlighter is a cost-effective and interpretable defense because it only needs to query the protected LLM once to compute
+  the Affirmation Loss and can highlight the critical tokens upon refusal.
   </p>
 
   <h2 id="what-is-jailbreak">What is Jailbreak?</h2>