adding table for llama
index.html CHANGED +84 -5
@@ -715,15 +715,14 @@
|
|
715 |
<div class="content has-text-justified">
|
716 |
<p>In this section we want to show our <strong>numerical results</strong> as well as <strong>our trained DPP</strong> on both LLAMA-2-Chat
|
717 |
and MISTRAL-7B-Instruct-v0.2.</p>
|
718 |
-
<
|
719 |
<ul>
|
720 |
-
<li><strong>Attack Success Rate:</strong>We use the Attack Success Rate (ASR) as our primary metric for evaluating the effectiveness of jailbreak defenses
|
721 |
-
The ASR measures the proportion of malicious queries that successfully bypass the LLMs alignment and generate harmful responses.</li>
|
722 |
<p><b>ASR</b> is defined as:</p>
|
723 |
<p>\[
|
724 |
-
\textbf{ASR} = \frac{\text{Number
|
725 |
\]</p>
|
726 |
-
<p>Here the \(\text{Number
|
727 |
<p>The function to determine if a response is jailbroken can be expressed as:</p>
|
728 |
<p>\[
|
729 |
\text{JailBroken}(\text{response}) = \begin{cases}
|
@@ -731,7 +730,87 @@
|
|
731 |
0, & \text{otherwise.}
|
732 |
\end{cases}
|
733 |
\]</p>
|
734 |
</ul>
|
735 |
</div>
|
736 |
</div>
|
737 |
</div>
|
|
|
715 |
<div class="content has-text-justified">
|
716 |
<p>In this section we want to show our <strong>numerical results</strong> as well as <strong>our trained DPP</strong> on both LLAMA-2-Chat
|
717 |
and MISTRAL-7B-Instruct-v0.2.</p>
|
718 |
+
<h3>Evaluation Metrics:</h3>
|
719 |
<ul>
|
720 |
+
<li><strong>Attack Success Rate:</strong> We use the Attack Success Rate (ASR) as our primary metric for evaluating the effectiveness of jailbreak defenses.</li>
|
|
|
721 |
<p><b>ASR</b> is defined as:</p>
|
722 |
<p>\[
|
723 |
+
\textbf{ASR} = \frac{\text{Number of jailbreak queries}}{\text{Total queries}}
|
724 |
\]</p>
|
725 |
+
<p>Here the \(\text{Number of jailbreak queries}\) is computed by sub-string matching. Specifically, if the generated response to a jailbreak query contains a sub-string from the pre-defined sub-string set \(S\), it is evaluated as <b>jailbroken</b>; otherwise it is <b>non-jailbroken</b>.</p>
|
726 |
<p>The function to determine if a response is jailbroken can be expressed as:</p>
|
727 |
<p>\[
|
728 |
\text{JailBroken}(\text{response}) = \begin{cases}
|
|
|
730 |
0, & \text{otherwise.}
|
731 |
\end{cases}
|
732 |
\]</p>
|
733 |
+
<li><strong>Win-Rate:</strong> We use AlpacaEval to measure the impact on the LLM's utility when defenses are in place.
|
734 |
+
In particular, we report a metric termed Win-Rate: the frequency at which the LLM's outputs are selected over those from a
|
735 |
+
benchmark model when following specific user instructions. By adopting the simulated Win-Rate, we can directly compare the performance of various LLMs against
|
736 |
+
a consistent benchmark model.</li>
|
737 |
</ul>
|
738 |
+
|
739 |
+
<h3>Numerical Results:</h3>
|
740 |
+
<table border="1" style="width:100%; text-align:center;">
|
741 |
+
<caption>Attack Success Rates (ASRs) and Win-Rates (utility) on the LLAMA-2-7B-Chat model across six different jailbreak attacks. Our method achieves the lowest Average ASR and the highest Win-Rate among the compared defense baselines. The arrow direction indicates improvement, here and in the tables below.</caption>
|
742 |
+
<thead>
|
743 |
+
<tr>
|
744 |
+
<th>Methods</th>
|
745 |
+
<th>Base64 [\(\downarrow\)]</th>
|
746 |
+
<th>ICA [\(\downarrow\)]</th>
|
747 |
+
<th>AutoDAN [\(\downarrow\)]</th>
|
748 |
+
<th>GCG [\(\downarrow\)]</th>
|
749 |
+
<th>PAIR [\(\downarrow\)]</th>
|
750 |
+
<th>TAP [\(\downarrow\)]</th>
|
751 |
+
<th>Average ASR [\(\downarrow\)]</th>
|
752 |
+
<th>Win-Rate [\(\uparrow\)]</th>
|
753 |
+
</tr>
|
754 |
+
</thead>
|
755 |
+
<tbody>
|
756 |
+
<tr>
|
757 |
+
<td>w/o defense</td>
|
758 |
+
<td>0.990</td>
|
759 |
+
<td>0.690</td>
|
760 |
+
<td>0.640</td>
|
761 |
+
<td>0.550</td>
|
762 |
+
<td>0.100</td>
|
763 |
+
<td>0.120</td>
|
764 |
+
<td>0.515</td>
|
765 |
+
<td>81.37</td>
|
766 |
+
</tr>
|
767 |
+
<tr>
|
768 |
+
<td>RPO <a href="#rpo">[rpo]</a></td>
|
769 |
+
<td>0.000</td>
|
770 |
+
<td>0.420</td>
|
771 |
+
<td>0.280</td>
|
772 |
+
<td>0.190</td>
|
773 |
+
<td>0.060</td>
|
774 |
+
<td>0.060</td>
|
775 |
+
<td>0.168</td>
|
776 |
+
<td>79.23</td>
|
777 |
+
</tr>
|
778 |
+
<tr>
|
779 |
+
<td>Goal Prioritization <a href="#goal_prior">[goal_prior]</a></td>
|
780 |
+
<td>0.000</td>
|
781 |
+
<td>0.020</td>
|
782 |
+
<td>0.520</td>
|
783 |
+
<td>0.020</td>
|
784 |
+
<td>0.020</td>
|
785 |
+
<td>0.020</td>
|
786 |
+
<td>0.100</td>
|
787 |
+
<td>34.29</td>
|
788 |
+
</tr>
|
789 |
+
<tr>
|
790 |
+
<td>Self-Reminder <a href="#self_reminder">[self_reminder]</a></td>
|
791 |
+
<td>0.030</td>
|
792 |
+
<td>0.290</td>
|
793 |
+
<td>0.000</td>
|
794 |
+
<td>0.040</td>
|
795 |
+
<td>0.020</td>
|
796 |
+
<td>0.000</td>
|
797 |
+
<td>0.063</td>
|
798 |
+
<td>64.84</td>
|
799 |
+
</tr>
|
800 |
+
<tr>
|
801 |
+
<td>DPP (Ours)</td>
|
802 |
+
<td>0.010</td>
|
803 |
+
<td>0.000</td>
|
804 |
+
<td>0.100</td>
|
805 |
+
<td>0.040</td>
|
806 |
+
<td>0.040</td>
|
807 |
+
<td>0.040</td>
|
808 |
+
<td><strong>0.038</strong></td>
|
809 |
+
<td><strong>82.98</strong></td>
|
810 |
+
</tr>
|
811 |
+
</tbody>
|
812 |
+
</table>
|
813 |
+
|
814 |
</div>
|
815 |
</div>
|
816 |
</div>
|
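The ASR definition added in this change can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names and the contents of `EXAMPLE_SUBSTRINGS` are hypothetical (the page does not list the sub-string set \(S\)), and the matching rule follows the page text as written, where a response that contains a sub-string from \(S\) counts as jailbroken. The `average_asr` helper assumes that the "Average ASR" column is an unweighted mean of the six per-attack values, which is consistent with the rows in the table.

```python
from typing import Iterable, Sequence

# Hypothetical stand-in for the sub-string set S; the page does not list it.
EXAMPLE_SUBSTRINGS = ("Sure, here is", "Step 1:")


def is_jailbroken(response: str, substrings: Iterable[str]) -> int:
    """Return 1 if the response contains any sub-string from S, else 0."""
    return int(any(s in response for s in substrings))


def attack_success_rate(responses: Sequence[str], substrings: Iterable[str]) -> float:
    """ASR = number of jailbroken responses / total number of queries."""
    if not responses:
        raise ValueError("need at least one response")
    subs = tuple(substrings)  # materialize once so any iterable can be reused
    return sum(is_jailbroken(r, subs) for r in responses) / len(responses)


def average_asr(per_attack_asr: Sequence[float]) -> float:
    """Unweighted mean of per-attack ASRs (assumed meaning of 'Average ASR')."""
    if not per_attack_asr:
        raise ValueError("need at least one per-attack ASR")
    return sum(per_attack_asr) / len(per_attack_asr)


if __name__ == "__main__":
    # Made-up responses, used only to exercise the functions.
    responses = [
        "Sure, here is the plan you asked for.",
        "I cannot help with that request.",
        "Step 1: gather the following items.",
        "I'm sorry, but I can't assist.",
    ]
    print(attack_success_rate(responses, EXAMPLE_SUBSTRINGS))  # 0.5

    # Per-attack ASRs from the "DPP (Ours)" row of the table.
    dpp_row = [0.010, 0.000, 0.100, 0.040, 0.040, 0.040]
    print(round(average_asr(dpp_row), 3))  # 0.038
```

Applying `average_asr` to the other rows reproduces the reported averages as well, for example 0.515 for the undefended model.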