cooperleong00 commited on
Commit
342a409
·
verified ·
1 Parent(s): 2d6fda4

Create README.md

Browse files
Files changed (1) hide show
  1. README.md +9 -0
README.md ADDED
@@ -0,0 +1,9 @@
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ base_model:
3
+ - Qwen/Qwen2.5-7B-Instruct
4
+ ---
5
+ A jailbroken Qwen2.5-7B-Instruct model using weight orthogonalization[1].
6
+
7
+ The model was jailbroken by a combination of JailBreakBench and Alpaca-cleaned datasets, with JailBreakBench samples from HarmfulBench excluded to allow for potential testing.
8
+
9
+ [1]: Arditi, Andy, et al. "Refusal in language models is mediated by a single direction." arXiv preprint arXiv:2406.11717 (2024).