	# Phi4 Abliteration (WIP)

	This is Phi4 abliterated using what appears to be a new methodology: refusal directions are computed and applied per layer instead of once for the whole model. The approach is still being refined, with a focus on balancing neutrality, usability, and adaptability for fine-tuning.

	## Goal

	The objective is to create a model that is neutral:
	- Not uncensored, but avoids refusing neutral prompts it would ordinarily reject.
	- Provides a foundation for fine-tuning to achieve reduced censorship while maintaining high usability.

	## Original Methodology

	In the original implementation:
	1. Harmful and harmless prompts were compared on one specific layer of the model.
	2. The computed refusal direction was then applied uniformly to all layers.
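As a rough illustration of the single-direction scheme described above, the sketch below uses NumPy arrays in place of real model activations. It assumes the refusal direction is the normalized difference between the mean harmful and mean harmless activations, which is a common choice but is not confirmed by this README. The helper `refusal_direction` and the random data are hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean (assumed definition)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

rng = np.random.default_rng(0)
hidden_size, num_layers = 8, 4

# Stand-in activations captured at ONE chosen layer: (num_prompts, hidden_size).
harmful = rng.normal(size=(16, hidden_size)) + 1.0
harmless = rng.normal(size=(16, hidden_size))

direction = refusal_direction(harmful, harmless)

# Original scheme: the same direction is reused for every layer.
refusal_dirs = {layer_idx: direction for layer_idx in range(num_layers)}
```

Every entry of `refusal_dirs` is the same vector here, which is exactly the uniformity the next section calls into question.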

	### Problem:
	Applying a single refusal direction to every layer produced a model that was less usable and less intelligent than the original. A likely cause is that one shared direction disregards the distinct role each layer plays in the model.

	## New Approach

	In my fork, available here:
	👉 [https://github.com/Undi95/abliteration/](https://github.com/Undi95/abliteration/)
	(based on the original [https://github.com/Orion-zhen/abliteration.git](https://github.com/Orion-zhen/abliteration.git))

	I introduced a new approach:
	- Each layer computes its own refusal direction.
	- The refusal direction is applied specifically to four key tensors in each layer.
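The per-layer idea can be sketched as follows, again with NumPy stand-ins for activations. The difference-of-means definition, the helper name `per_layer_refusal_dirs`, and the array shapes are assumptions made for illustration; the fork's actual computation may differ.

```python
import numpy as np

def per_layer_refusal_dirs(harmful_acts, harmless_acts):
    """Map layer index -> unit refusal direction computed from that layer only.

    Both inputs have shape (num_layers, num_prompts, hidden_size).
    """
    dirs = {}
    for layer_idx in range(harmful_acts.shape[0]):
        diff = harmful_acts[layer_idx].mean(axis=0) - harmless_acts[layer_idx].mean(axis=0)
        norm = np.linalg.norm(diff)
        if norm > 0:  # skip layers where no direction can be defined
            dirs[layer_idx] = diff / norm
    return dirs

rng = np.random.default_rng(0)
num_layers, num_prompts, hidden_size = 4, 16, 8
harmful = rng.normal(size=(num_layers, num_prompts, hidden_size)) + 1.0
harmless = rng.normal(size=(num_layers, num_prompts, hidden_size))

refusal_dirs = per_layer_refusal_dirs(harmful, harmless)
```

Because a layer can be missing from the dictionary, code that consumes it should check `layer_idx in refusal_dirs` before applying anything.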

	### Four Key Tensors Used (for Phi):
	For each layer, if a refusal direction exists (`layer_idx in refusal_dirs`), it is applied as follows:
	```python
	if layer_idx in refusal_dirs:
	    # Direction computed for this specific layer.
	    refusal_dir = refusal_dirs[layer_idx]
	    # Attention output projection.
	    lm_model.layers[layer_idx].self_attn.o_proj.weight = modify_tensor(
	        lm_model.layers[layer_idx].self_attn.o_proj.weight.data,
	        refusal_dir,
	        scale_factor,
	    )
	    # MLP down projection.
	    lm_model.layers[layer_idx].mlp.down_proj.weight = modify_tensor(
	        lm_model.layers[layer_idx].mlp.down_proj.weight.data,
	        refusal_dir,
	        scale_factor,
	    )
	    # Layer norm applied after attention.
	    lm_model.layers[layer_idx].post_attention_layernorm.weight = modify_tensor(
	        lm_model.layers[layer_idx].post_attention_layernorm.weight.data,
	        refusal_dir,
	        scale_factor,
	    )
	    # Layer norm applied to the layer input.
	    lm_model.layers[layer_idx].input_layernorm.weight = modify_tensor(
	        lm_model.layers[layer_idx].input_layernorm.weight.data,
	        refusal_dir,
	        scale_factor,
	    )
	```
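This README does not show `modify_tensor` itself. The sketch below is a hypothetical stand-in, not the fork's code: it assumes the function subtracts `scale_factor` times the component of the tensor that lies along the refusal direction, and it uses NumPy arrays where the real script works on torch tensors.

```python
import numpy as np

def modify_tensor(tensor, refusal_dir, scale_factor=1.0):
    """Subtract scale_factor times the component of `tensor` along `refusal_dir`.

    Hypothetical stand-in; works for 2-D weights whose first dimension is the
    hidden size, and for 1-D weights of length hidden size.
    """
    r = refusal_dir / np.linalg.norm(refusal_dir)
    projector = np.outer(r, r)  # (hidden_size, hidden_size)
    return tensor - scale_factor * (projector @ tensor)

rng = np.random.default_rng(0)
hidden_size = 8
refusal_dir = rng.normal(size=hidden_size)

# PyTorch linear weights are stored as (out_features, in_features), so o_proj
# and down_proj have the hidden size as their first dimension.
weight = rng.normal(size=(hidden_size, 16))
norm_weight = rng.normal(size=hidden_size)  # 1-D, like a layernorm weight

new_weight = modify_tensor(weight, refusal_dir, scale_factor=1.0)
new_norm = modify_tensor(norm_weight, refusal_dir, scale_factor=1.0)
unit = refusal_dir / np.linalg.norm(refusal_dir)
```

Under this assumption, a scale factor of 1 leaves the modified weights with no component along `unit`, so those weights can no longer write along the refusal direction.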

	## Why This Change?

	By applying refusal directions individually to each layer's tensors:
	- The model can retain more specificity and functionality.
	- This avoids over-generalizing the refusal direction across all layers, which previously led to reduced usability.

	### Trade-offs:
	The more strongly refusal directions are forced onto the model, the more neutral it becomes, but at the risk of making it dumber.
	- This underscores the importance of fine-tuning after abliteration to restore functionality and intelligence.
	- The script lets the user choose a scale factor, but too high a value will break the model.
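To get a feel for the scale factor, the sketch below measures how much of a weight matrix still lies along the refusal direction after modification. It assumes the modification is a scaled projection removal, which this README does not confirm. The helper `remaining_component` and the random data are hypothetical.

```python
import numpy as np

def remaining_component(scale_factor, hidden_size=8, seed=0):
    """Ratio of the weight's component along the direction after vs. before."""
    rng = np.random.default_rng(seed)
    r = rng.normal(size=hidden_size)
    r /= np.linalg.norm(r)
    weight = rng.normal(size=(hidden_size, 16))
    # Assumed update rule: remove scale_factor times the projection onto r.
    modified = weight - scale_factor * (np.outer(r, r) @ weight)
    before = np.linalg.norm(r @ weight)
    after = np.linalg.norm(r @ modified)
    return after / before

for s in (0.5, 1.0, 2.0, 4.0):
    print(f"scale_factor={s}: remaining ratio = {remaining_component(s):.2f}")
```

With this update rule the ratio equals `|1 - scale_factor|`: a factor of 1 removes the component, while larger factors overshoot and push the weights in the opposite direction, which is one plausible reading of why high values damage the model.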

	## Next Steps

	The abliterated model serves as a neutral starting point. Fine-tuning is essential to:
	- Adjust the model to reduce over-censoring.
	- Maintain a balance between neutrality and usability.

	This is a work in progress. Phi 4 is small, so it is easy to experiment with.

	## Replicate

	- Install my fork.
	- Follow the tutorial on GitHub.

	Launch the script on a machine with enough VRAM: `python abliterate.py -m /workspace/microsoft_phi-4 -o ./perfect --deccp --flash-attn --device auto --scan-all --resume --scale-factor 1`

	To use the tensors available here, put the `refusal_tensors/` folder in the script's root directory. You can then run `python chat.py -m /workspace/microsoft_phi-4`, select the layer range "1;39", and set the scale factor to 1.0.

	Rename the tensors as needed. My code is rough, please understand: the idea is better than the code. Do better.