@grimjim on Hugging Face: "Below we experiment with negative merger weighting (-1.0!) using task…"


grimjim

posted an update Jul 4, 2024

Below we experiment with negative merger weighting (-1.0!) using task arithmetic. The merge formula is on the model card and in the repo itself.

This model is steered to behave opposite to what MopeyMule demonstrated.
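As a rough illustration of what a negative weight does under task arithmetic, here is a toy sketch. It assumes the usual formulation, merged = base + sum of weight_i * (finetuned_i - base); the parameter name `layer.weight` and all values are made up, and the actual merge formula is the one on the model card.

```python
# Toy sketch of task arithmetic with a negative weight.
# Real checkpoints hold tensors; here each "model" is a dict of short float lists.
# The parameter name and the numbers are hypothetical.

def task_vector(finetuned, base):
    """Per-parameter difference: what fine-tuning changed relative to the base."""
    return {name: [f - b for f, b in zip(finetuned[name], base[name])]
            for name in base}

def task_arithmetic(base, finetuned_models, weights):
    """merged = base + sum_i weight_i * (finetuned_i - base)."""
    merged = {name: list(values) for name, values in base.items()}
    for model, weight in zip(finetuned_models, weights):
        delta = task_vector(model, base)
        for name in merged:
            merged[name] = [m + weight * d
                            for m, d in zip(merged[name], delta[name])]
    return merged

base = {"layer.weight": [1.0, 2.0, 3.0]}
steered = {"layer.weight": [1.5, 1.0, 3.0]}  # stand-in for a steered fine-tune

# A weight of -1.0 moves the base away from the steered model by the full delta,
# which works out to 2 * base - steered.
opposite = task_arithmetic(base, [steered], [-1.0])
```

With a single model at weight -1.0 the result is simply the base reflected through itself relative to the steered model, which is why the merged behavior leans opposite to the steering.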

Based on the implications of the merge technique, we also propose Orthogonalized Vector Adaptation (OVA). We also extract a LoRA of the counter-refusal abliteration steering vector.

The resulting merger is not a perfect model, but it's a behaviorally interesting model. The model name was inspired by a Philip K. Dick story.
grimjim/Llama-3-Perky-Pat-Instruct-8B

Refusal vector weights ready for use:
grimjim/Llama-3-Instruct-abliteration-OVA-8B
grimjim/Llama-3-Instruct-abliteration-LoRA-8B

anakin87

Jul 4, 2024

Nice!

Have a look at my rap model (built with the same approach as MopeyMule): https://huggingface.co/anakin87/yo-Llama-3-8B-Instruct

grimjim

Jul 6, 2024

edited Jul 6, 2024

Something odd is happening when merging with the OVA model. Merging it at medium weight (0.5-0.6) reduces refusals, but at full weight (1.0) against Instruct 8B the result is incoherent. The LoRA should work, though!

It should also be possible to modify abliteration scripts to instead directly produce a LoRA as output.
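A minimal sketch of why that seems feasible, assuming abliteration orthogonalizes a weight matrix against a unit refusal direction r on its output side, i.e. W' = W - r (rᵀ W). Under that assumption the change is already rank 1, so it can be written as LoRA factors B = -r and A = rᵀ W with no SVD step. The function names and the toy matrix are hypothetical, and real abliteration scripts may differ in which matrices they touch and how.

```python
# Toy sketch: express a rank-1 orthogonalization as LoRA factors (B, A).
# Assumption: abliteration computes W' = W - r (r^T W) for a unit vector r.
# Plain Python lists stand in for tensors; names here are hypothetical.

def abliteration_as_lora(W, r):
    """Return (B, A) such that W + B A == W - r (r^T W).

    W: d_out x d_in matrix as a list of rows.
    r: unit vector of length d_out (the direction to remove).
    """
    d_out, d_in = len(W), len(W[0])
    # A = r^T W: one row of length d_in.
    A = [sum(r[i] * W[i][j] for i in range(d_out)) for j in range(d_in)]
    # B = -r: one column of length d_out.
    B = [-ri for ri in r]
    return B, A

def apply_lora(W, B, A, scale=1.0):
    """Add the rank-1 update scale * B A to W."""
    return [[W[i][j] + scale * B[i] * A[j] for j in range(len(A))]
            for i in range(len(W))]

W = [[1.0, 2.0], [3.0, 4.0]]
r = [1.0, 0.0]  # toy unit direction in output space

B, A = abliteration_as_lora(W, r)
W_abl = apply_lora(W, B, A)              # full-strength update
W_half = apply_lora(W, B, A, scale=0.5)  # partial-strength update
```

The `scale` argument plays the role of the merge weight: 1.0 removes the component along r entirely, while 0.5 removes half of it.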

shigureui

Jul 7, 2024

https://github.com/cmnfriend/O-LoRA

This brings to mind O-LoRA, last year's EMNLP paper.
