# Phi4 Abliteration (WIP)

This is **Phi4 abliterated** using a new methodology (surprisingly?). The approach is still being refined, with a focus on balancing neutrality, usability, and adaptability for fine-tuning.

## Goal

The objective is to create a model that is **neutral**:

- **Not uncensored**, but it no longer refuses neutral prompts that the original model would ordinarily reject.
- A foundation for fine-tuning toward reduced censorship while keeping usability high.

## Original Methodology

In the original implementation:

1. Activations for harmful and harmless prompts were compared on **one specific layer** of the model.
2. The refusal direction computed from that layer was then applied **uniformly to all layers**.

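To make the two steps above concrete, here is a minimal sketch in plain Python. It assumes the refusal direction is the normalized difference between the mean harmful and mean harmless activations, which is a common choice but not confirmed here as what the original script does. The helper names and the toy numbers are made up for illustration.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean (assumed definition)."""
    harmful_mean = mean_vector(harmful_acts)
    harmless_mean = mean_vector(harmless_acts)
    diff = [h - b for h, b in zip(harmful_mean, harmless_mean)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

# Toy activations (hidden size 3) captured at ONE chosen layer; made-up numbers.
harmful_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

single_dir = refusal_direction(harmful_acts, harmless_acts)

# Original scheme: every layer reuses this one direction.
num_layers = 4  # toy value, not Phi-4's real layer count
refusal_dirs = {layer_idx: single_dir for layer_idx in range(num_layers)}
```

Every entry of `refusal_dirs` is the same vector here, which is exactly the uniformity the next section identifies as a problem.
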
### Problem:

This resulted in a model that became **less usable** and **less intelligent** than the original. This may be because applying a single refusal direction uniformly across all layers disregards the unique role of each layer in the model.

## New Approach

In my fork, available here:

[https://github.com/Undi95/abliteration/](https://github.com/Undi95/abliteration/)

(based on the original [https://github.com/Orion-zhen/abliteration.git](https://github.com/Orion-zhen/abliteration.git))

I introduced a new approach:

- **A refusal direction is computed separately for each layer.**
- Each layer's refusal direction is applied specifically to **four key tensors** in that layer.

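The per-layer idea can be sketched as follows, again in plain Python with made-up activations. This assumes a difference-of-means direction per layer and assumes that a layer with no measurable difference is simply skipped. Both are illustrative choices, and the function and variable names are hypothetical rather than taken from the fork.

```python
import math

def unit_mean_diff(harmful_acts, harmless_acts):
    """Normalized (harmful mean - harmless mean), or None if the means coincide."""
    n_harmful, n_harmless = len(harmful_acts), len(harmless_acts)
    harmful_mean = [sum(col) / n_harmful for col in zip(*harmful_acts)]
    harmless_mean = [sum(col) / n_harmless for col in zip(*harmless_acts)]
    diff = [h - b for h, b in zip(harmful_mean, harmless_mean)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        return None
    return [x / norm for x in diff]

# Toy activations keyed by layer index (hidden size 2); made-up numbers.
harmful_by_layer = {
    0: [[1.0, 0.0], [3.0, 0.0]],
    1: [[0.0, 2.0], [0.0, 4.0]],
    2: [[1.0, 1.0], [1.0, 1.0]],
}
harmless_by_layer = {
    0: [[0.0, 0.0], [0.0, 0.0]],
    1: [[0.0, 0.0], [0.0, 0.0]],
    2: [[1.0, 1.0], [1.0, 1.0]],  # identical to harmful: no direction to extract
}

# Each layer gets its own direction; layers without one are left out.
refusal_dirs = {}
for layer_idx in harmful_by_layer:
    direction = unit_mean_diff(harmful_by_layer[layer_idx], harmless_by_layer[layer_idx])
    if direction is not None:
        refusal_dirs[layer_idx] = direction
```

In this toy case layers 0 and 1 end up with different directions and layer 2 has none, so a later `layer_idx in refusal_dirs` check would skip it.
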
### Four Key Tensors Used (for Phi):

For each layer, if a refusal direction exists (`layer_idx in refusal_dirs`), it is applied as follows:

```python
# Runs once per layer; lm_model, refusal_dirs, scale_factor and
# modify_tensor come from the surrounding script.
if layer_idx in refusal_dirs:
    # Direction computed for this specific layer
    refusal_dir = refusal_dirs[layer_idx]

    # 1. Attention output projection
    lm_model.layers[layer_idx].self_attn.o_proj.weight = modify_tensor(
        lm_model.layers[layer_idx].self_attn.o_proj.weight.data,
        refusal_dir,
        scale_factor,
    )

    # 2. MLP down projection
    lm_model.layers[layer_idx].mlp.down_proj.weight = modify_tensor(
        lm_model.layers[layer_idx].mlp.down_proj.weight.data,
        refusal_dir,
        scale_factor,
    )

    # 3. Post-attention layer norm
    lm_model.layers[layer_idx].post_attention_layernorm.weight = modify_tensor(
        lm_model.layers[layer_idx].post_attention_layernorm.weight.data,
        refusal_dir,
        scale_factor,
    )

    # 4. Input layer norm
    lm_model.layers[layer_idx].input_layernorm.weight = modify_tensor(
        lm_model.layers[layer_idx].input_layernorm.weight.data,
        refusal_dir,
        scale_factor,
    )
```

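The body of `modify_tensor` is not shown above, so the following is only a guess at the kind of operation involved, not the script's actual code. It assumes the common directional-ablation form for a 2-D weight, `W - scale_factor * d dᵀ W`, where `d` is a unit vector in the layer's output space. The function name `ablate_direction` is hypothetical, and the 1-D layer norm weights are not covered by this sketch.

```python
def ablate_direction(weight, direction, scale_factor=1.0):
    """Return weight - scale_factor * d d^T weight for a unit vector d.

    weight:    list of rows, shape (out_features, in_features)
    direction: unit vector of length out_features
    """
    out_features = len(weight)
    in_features = len(weight[0])
    # d^T W: how strongly each input column writes along the direction
    proj = [
        sum(direction[i] * weight[i][j] for i in range(out_features))
        for j in range(in_features)
    ]
    # Subtract the scaled component along d from every column
    return [
        [weight[i][j] - scale_factor * direction[i] * proj[j] for j in range(in_features)]
        for i in range(out_features)
    ]

# Toy 2x2 weight and a unit direction along the first output dimension.
toy_weight = [[1.0, 2.0], [3.0, 4.0]]
toy_dir = [1.0, 0.0]

ablated = ablate_direction(toy_weight, toy_dir, scale_factor=1.0)
```

With `scale_factor=1.0` the first output row, the part of the weight that writes along `toy_dir`, is zeroed, while the second row is untouched.
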
## Why This Change?

By applying refusal directions individually to each layer's tensors:

- The model can retain more **specificity and functionality**.
- This avoids over-generalizing a single refusal direction across all layers, which previously led to reduced usability.

### Trade-offs:

The more we force refusal directions onto the model, the more **neutral** it becomes, but at the risk of becoming **dumber**.

- This underscores the importance of **fine-tuning** after abliterating, to restore functionality and intelligence.
- So even though the script lets the user choose a **scale factor**, too high a value will break the model.

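One plausible way to picture the scale factor, assuming the ablation subtracts `scale_factor` times the component along a unit refusal direction (an assumption, since the script's exact formula is not shown here), is to look at how much of that component is left afterwards. The helper names and numbers below are illustrative only.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def ablate_vector(vec, direction, scale_factor):
    """Subtract scale_factor times the component of vec along a unit direction."""
    coeff = dot(vec, direction)
    return [v - scale_factor * coeff * d for v, d in zip(vec, direction)]

direction = [1.0, 0.0]   # unit refusal direction (toy)
activation = [2.0, 1.0]  # toy vector with a component of 2.0 along the direction

# Component along the refusal direction that remains for each scale factor
remaining = {
    s: dot(ablate_vector(activation, direction, s), direction)
    for s in (0.0, 0.5, 1.0, 2.0)
}
```

Under this assumption a factor of 1.0 removes the component exactly, smaller values remove part of it, and larger values overshoot and push the vector in the opposite direction, which is one way a too-high setting could damage the model.
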
## Next Steps

The abliterated model serves as a **neutral starting point**. Fine-tuning is essential to:

- Adjust the model to reduce over-censoring.
- Maintain a balance between neutrality and usability.

This is a **work in progress**. Phi 4 is smol, so I can toy with it.

## Replicate

- Install my fork.
- Follow the tutorial on GitHub.

Launch with enough VRAM: `python abliterate.py -m /workspace/microsoft_phi-4 -o ./perfect --deccp --flash-attn --device auto --scan-all --resume --scale-factor 1`

If you want to use the tensors available here, put the `refusal_tensors/` folder at the root of the script. You will then be able to run `python chat.py -m /workspace/microsoft_phi-4`, then select layer range "1;39" and set the scale factor to 1.0.

Rename the tensors as needed. My code is shit, please understand: the idea is better than the code. Do better. kek.