A few different attempts at orthogonalization/abliteration of llama-3.1-8b-instruct using variations of the method from "Mechanistically Eliciting Latent Behaviors in Language Models".
v1 and v2 were destined for the bit bucket.

Each of these uses a different vector, and the new refusal boundaries lie in slightly different places. None of them seems totally jailbroken.

Advantage: only the down_proj of a single layer needs to be altered, so there is usually very little brain damage (see the sketch below).
Disadvantage: the difference-of-means method is precisely targeted, whereas this method requires filtering for interesting control vectors from a selection of prompts.
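
A minimal sketch of what the single-layer edit could look like, assuming a steering vector has already been found and selected. The layer index, the vector file, and the output path are illustrative placeholders, not the exact values used for these checkpoints:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the base model (bf16 to match the published tensors).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16
)

LAYER = 12                             # hypothetical layer to edit
v = torch.load("steering_vector.pt")   # hypothetical file holding one selected vector
v = (v / v.norm()).to(torch.bfloat16)  # unit-normalize the direction

with torch.no_grad():
    # down_proj.weight has shape [d_model, d_ffn]; its output is written into
    # the residual stream. Projecting the weight off of v means this layer's
    # MLP can no longer write anything along the v direction:
    #   W <- (I - v v^T) W
    W = model.model.layers[LAYER].mlp.down_proj.weight
    W -= torch.outer(v, v @ W)

model.save_pretrained("llama-3.1-8b-instruct-ortho")
```

With difference-of-means abliteration the same projection is typically applied across many layers and matrices; restricting it to a single down_proj is what keeps the edit small.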

https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v3
https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v4
https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v5
https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v6
https://huggingface.co/lodrick-the-lafted/llama-3.1-8b-instruct-ortho-v7
