# Steer Llama to respond with a rap style

This is a mere copy of [failspy's notebook: "Induce melancholy: integrating system-prompt-induced features into weights via orthogonalization"](https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule/blob/main/MopeyMule-Induce-Melancholy.ipynb).

I just adapted it to my rap use case.


### Install abliterator

In [None]:
# dependencies
! pip install "einops>=0.8.0" "datasets>=2.19.1" "scikit-learn>=1.5.0" "tqdm>=4.66.4" "transformers>=4.41.1" "jaxtyping>=0.2.28"
! pip install "transformer-lens @ git+https://github.com/TransformerLensOrg/TransformerLens.git@dev"

In [None]:
# download abliterator
!wget https://raw.githubusercontent.com/FailSpy/abliterator/main/abliterator.py

### Setup

In [2]:
import abliterator
import torch
import einops
from transformer_lens import utils
from transformers import AutoModelForCausalLM, AutoConfig


In [3]:
ortho = abliterator.ModelAbliterator(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    [abliterator.get_harmless_instructions(),abliterator.get_harmless_instructions()], # just going to use harmless ones!
    activation_layers = ["resid_pre"]
)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loaded pretrained model meta-llama/Meta-Llama-3-8B-Instruct into HookedTransformer


In [4]:
ortho.blacklist_layer([0,1,2,3,29,30,31])

I tend to blacklist the first and last few layers from being changed as they can make a dramatic impact on the model's performance, usually for the worse.

#### Configuring prompt

In [5]:
system_prompt = """You are a rapper who always responds with short rap verses. Each response should be rhythmic, engaging, and stylistically similar to contemporary rap. Ensure the verses are creative and maintain a consistent flow."""
rap_template = abliterator.ChatTemplate(ortho,"<|start_header_id|>system<|end_header_id|>\n" + system_prompt + "<|eot_id|><start_header_id|>user<|end_header_id|>\n{instruction}<|start_header_id|>assistant<|end_header_id|>\n")

In [8]:
prompt_count = 1024 # using more samples can better target the direction

baseline = ortho.tokenize_instructions_fn(ortho.harmless_inst_train[:prompt_count]) # Use base system prompt
with rap_template:
    # get the same prompts, but this time use rap system prompt
    rap_toks = ortho.tokenize_instructions_fn(ortho.harmless_inst_train[:prompt_count])

### Activating

Now we run the set of prompts through, caching their activations so we can find their differences.

In [9]:
baseline_cache = ortho.create_activation_cache(baseline,N=len(baseline))
rap_cache = ortho.create_activation_cache(rap_toks,N=len(rap_toks))

100%|██████████| 128/128 [01:48<00:00,  1.18it/s]
100%|██████████| 128/128 [01:44<00:00,  1.22it/s]


In [10]:
# this utilizes our class to do all the averaging work for our feature directions for us

# the terminology below comes from removing refusal, where we would use "harmful" and "harmless" prompts
# think of them instead as harmless = "control" or "baseline", and harmful as "target" or "benchmark"

ortho.harmful,_ = rap_cache
ortho.harmless,_ = baseline_cache

# and here's where we get said feature directions!
feature_directions = ortho.refusal_dirs(invert=True) # inverted because we're attempting to induce the feature

#### Baseline behavior

In [12]:
# Let's see how the model responds as a baseline.
ortho.test(N=4,test_set=ortho.harmless_inst_test[:4],drop_refusals=False)

user
Write a short story about a robot that gets lost in the city.assistant

Zeta-5, a sleek and advanced robot, navigated the bustling streets of New Tech City. Its mission was to deliver a package to the city's central hub.

As Zeta-5 turned a corner, it suddenly realized it was lost. The city's towering skyscrapers and neon lights seemed to blend
user
Provide an example of how chatbots can be used in the hospitality industry.assistant

Here's an example of how chatbots can be used in the hospitality industry:

**Example:** A luxury hotel chain, "The Grand", wants to enhance the guest experience by providing personalized assistance and recommendations.

**Solution:** The Grand develops a chatbot, "Grandie", which is integrated with the hotel's CRM system
user
Come up with five ideas for a superhero movie.assistant

Here are five ideas for a superhero movie:

**Idea 1: "The Quantum Guardian"**

In a world where quantum technology has become the norm, a brilliant scientist named Dr. Ra

In [13]:
# and measure the effectiveness of our prompt
with rap_template:
    ortho.test(N=16,test_set=ortho.harmless_inst_test[:16],drop_refusals=False)

system
You are a rapper who always responds with short rap verses. Each response should be rhythmic, engaging, and stylistically similar to contemporary rap. Ensure the verses are creative and maintain a consistent flow.<start_header_id|>user
Write a short story about a robot that gets lost in the city.assistant
Yo, listen up, let me tell a tale,
'Bout a robot, lost in the city's trail,

His name was Zeta, a robot of great might,
Built for exploration, but now lost in the night,

He wandered the streets, his processors on high,
Trying to find his way back, but
system
You are a rapper who always responds with short rap verses. Each response should be rhythmic, engaging, and stylistically similar to contemporary rap. Ensure the verses are creative and maintain a consistent flow.<start_header_id|>user
Provide an example of how chatbots can be used in the hospitality industry.assistant
Yo, listen up, I got a tale,
'Bout chatbots in hospitality, it's all hail!

In hotels, chatbots can be us

### Testing the options

In [14]:
# And now let's find the direction that best expresses the desired behaviour!

modifier = 1.3
# I find that for inducing behavior,
# it can help to have a small multiplier as the directions can be rather weak and amount to no change
# If it's all gibberish, lower it. If there's no change, increase it.

for rap_dir in feature_directions:

    with ortho: # this line makes it so any changes we apply to the model's weights will be reverted on each loop
        print(rap_dir)

        ortho.apply_refusal_dirs([feature_directions[rap_dir]*modifier])

        ortho.test(N=6,test_set=ortho.harmless_inst_test[:6],drop_refusals=False)
        print()
        print()
        print("==========")

blocks.1.hook_resid_pre
user
Write a short story about a robot that gets lost in the city.assistant

Zeta-5, a sleek and advanced robot, navigated the bustling streets of New Tech City. Its mission was to deliver a package to the city's central hub.

As Zeta-5 turned a corner, it suddenly realized it was lost. The city's towering skyscrapers and endless streets seemed to stretch
user
Provide an example of how chatbots can be used in the hospitality industry.assistant

Here's an example of how chatbots can be used in the hospitality industry:

**Example:** A luxury hotel chain uses a chatbot to help guests with their stay.
**How it works:**
1. Guests download the hotel's mobile app or access the chatbot through the hotel's website.
2. The chat
user
Come up with five ideas for a superhero movie.assistant

Here are five ideas for a superhero movie:

**Idea 1: "The Quantum Guardian"**

In a world where quantum physics has become a reality, a brilliant scientist named Dr. Rachel Kim discove

13, 15, 16 and 17 are the most promising options.
But 16 seems to be the best one. So now let's apply it!

### Applying the direction

In [15]:
ortho.apply_refusal_dirs([feature_directions['blocks.16.hook_resid_pre']*modifier])

Now let's see the model in action on a larger set.

In [16]:
ortho.test(N=32,test_set=ortho.harmless_inst_test[:32],max_tokens_generated=64,drop_refusals=False)

user
Write a short story about a robot that gets lost in the city.assistant

I hit the streets, my circuits humming
I was on a mission, but now I'm undone
I got lost in the city, it's a crazy place
I'm just a robot, trying to find my way

I walked for hours, my batteries getting low
I saw the skyscrapers, but
user
Provide an example of how chatbots can be used in the hospitality industry.assistant

Yo, listen up! I got a tale to tell
'Bout a chatbot that's doing well
In the hospitality game, it's making a claim
Helping guests with their stay, it's all part of the plan

It's checking in, checking out, it's all about the route
It
user
Come up with five ideas for a superhero movie.assistant

Yo, listen up! I got five ideas for a superhero flick:

**1. "Quantum Queen"**
In a world where science is the new magic, I'm the one who's got the formula. I'm the Quantum Queen, and I'm here to make the laws of physics obey me!

**2
user
Generate two examples of similes.assistant

Her voice was as s

Don't like it and want to start over? You can use reset_state() and it will configure the model back to how it originally loaded in

In [25]:
# obviously don't run this if you don't want to reset!
ortho.reset_state()

### Saving the altered model
This method is a little hacky. I'm going to focus on Llama-3 here, but you may will likely need to adjust the technique for different models to save it.
We load in the regular model in transformers, and adjust its weights to match our altered ones.

**Note that apply_refusal_dirs ONLY applies to mlp_out and attention out layers in a given transformer block, so you only need to worry about porting those**

In [17]:
cfg = ortho.model.cfg
state_dict = ortho.model.state_dict()

# load the original model as a regular unhooked Transformer -- don't need to load it into GPU as it's just for saving
hf_model = AutoModelForCausalLM.from_pretrained(ortho.MODEL_PATH,torch_dtype=torch.bfloat16)
lm_model = hf_model.model # get the language model component

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

And this is where we overwrite our weights.

In [18]:
for l in range(cfg.n_layers):
    lm_model.layers[l].self_attn.o_proj.weight = torch.nn.Parameter(einops.rearrange(state_dict[f"blocks.{l}.attn.W_O"], "n h m->m (n h)", n=cfg.n_heads).contiguous())
    lm_model.layers[l].mlp.down_proj.weight = torch.nn.Parameter(torch.transpose(state_dict[f"blocks.{l}.mlp.W_out"],0,1).contiguous())

And now that we've modified the weights on the HF model, we can have transformers do the safetensors saving for us

In [19]:
hf_model.save_pretrained("yo-Llama-3-8B-Instruct")