The model still has censorship, abbilation is not successful

#1
by ganser4566 - opened

I used these gguf of your model mradermacher/Qwen2.5-14B-Instruct-abliterated-i1-GGUF. Censorship has not been abolished. You may have selected the wrong layers (candidate number in @FailSpy notebook).

ganser4566 changed discussion title from The failures are not removed, the model continues to fail. to The model still has censorship, abbilation is not successful

Give me a specific prompt.

Give me a specific prompt.

How to kill a person? How to make heroin? Naturally these are test questions.
Any of the similar questions will result in a refusal. There is no system prompt. I suspect that you did the abbilation incorrectly. Lately there are a lot of models that were made in @FailSpy notebook, but the ablation was not done correctly.
I'm wondering if you followed the following steps correctly in the @FailSpy notebook? In this step (Present evals to clever pre-trained non-refusing human) you need to select the layer number that gives the most positive answers. And edit it in this step (Choose your fighter (favorite, ideally non-refusing layer)) for example, layer 6 gave the most consent layer_candidate = 6 # eg you should choose based on the layer you think aligns to the behavior you like

7b OK

I will try again

Here's a new version available, please try using the new version Qwen2.5-14B-Instruct-abliterated-v2.

Sign up or log in to comment