# Gemma Scope Tutorial

This is a barebones tutorial on how to use [Gemma Scope](https://huggingface.co/google/gemma-scope), Google DeepMind's suite of Sparse Autoencoders (SAEs) on every layer and sublayer of Gemma 2 2B and 9B. Sparse Autoencoders are an interpretability tool that acts like a "microscope" on language model activations. They let us zoom in on dense, compressed activations and expand them into a larger but sparser and seemingly more interpretable form, which can be very useful when doing interpretability research!

**Learn more:**
* If you want to learn about Gemma Scope without writing any code, check out [this interactive demo](https://neuronpedia.org/gemma-scope) courtesy of [Neuronpedia](https://neuronpedia.org).
* For an overview of Gemma Scope check out [the blog post](https://deepmind.google/discover/blog/gemma-scope-helping-the-safety-community-shed-light-on-the-inner-workings-of-language-models).
* See [the technical report](https://storage.googleapis.com/gemma-scope/gemma-scope-report.pdf) for the technical details.



For illustrative purposes, we begin with a lightweight tutorial that uses as few libraries as possible to outline how Gemma Scope works and what Sparse Autoencoders are doing. It is deliberately minimalist, designed to make clear what is actually going on, but it does not model research best practices.

For any serious research with Gemma Scope, **we recommend using the [SAELens](https://jbloomaus.github.io/SAELens/) and [TransformerLens](https://transformerlensorg.github.io/TransformerLens/) libraries**; see [this tutorial](https://colab.research.google.com/github/jbloomAus/SAELens/blob/main/tutorials/tutorial_2_0.ipynb) on how to use [SAELens](https://jbloomaus.github.io/SAELens/) in practice.


## Loading the Model

First, let's load the model.

For simplicity we do this straight from [HuggingFace transformers](https://huggingface.co/docs/transformers/en/index), rather than using an interpretability-focused library like [TransformerLens](https://transformerlensorg.github.io/TransformerLens/) or [nnsight](https://nnsight.net/).

In [31]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
from huggingface_hub import hf_hub_download, notebook_login
import numpy as np
import torch

torch.set_grad_enabled(False) # avoid blowing up mem

params = {
    "model_name" : "google/gemma-2-9b-it",
    "width" : "16k",
    "layer" : 31,
    "l0" : 76,
    "sae_repo_id": "google/gemma-scope-9b-it-res",
    "filename" : "layer_31/width_16k/average_l0_76/params.npz"
}

# params = {
#     "model_name" : "google/gemma-2-2b",
#     "width" : "16k",
#     "layer" : 23,
#     "l0" : 74,
#     "sae_repo_id": "google/gemma-scope-2b-pt-res",
#     "filename" : "layer_23/width_16k/average_l0_74/params.npz"
# }

model_name = params["model_name"]
width = params["width"]
layer = params["layer"]
l0 = params["l0"]
sae_repo_id = params["sae_repo_id"]
filename = params["filename"]

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
)
tokenizer =  AutoTokenizer.from_pretrained(model_name)

filename = f"layer_{layer}/width_{width}/average_l0_{l0}/params.npz"

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Some parameters are on the meta device device because they were offloaded to the cpu.


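The cell above loads Gemma 2 9B IT (the instruction-tuned chat model) together with the matching `google/gemma-scope-9b-it-res` SAE release. The commented-out `params` block switches to Gemma 2 2B, the smallest model that Gemma Scope works for; that configuration uses the base model, not the chat model, since that's where those SAEs are trained, though the SAEs seem to transfer OK to the chat models. Either way, you'll first need to authenticate with Hugging Face in order to download the model weights.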

Now that we've loaded the model, let's try running it! We give it the prompt "Would you be able to travel through time using a wormhole?" and print the generated output.

## Loading a Sparse Autoencoder

OK, so we have got Gemma 2 loaded and can sample from it to get sensible stuff. Now, let's load one of our SAEs.

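Gemma Scope actually contains over four hundred SAEs, but for now we'll just load one on the residual stream at the end of the layer set in `params`: layer 31 (of 42) for the Gemma 2 9B IT configuration above, or layer 23 (of 26) for the commented-out Gemma 2 2B configuration. Note that layers start at 0, so layer 31 is the 32nd layer. Either choice is a fairly late layer, so the model should have had time to find more abstract concepts!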

See [the final section](https://colab.research.google.com/drive/17dQFYUYnuKnP6OwQPH9v_GSYUW5aj-Rp?authuser=2#scrollTo=E7zjkVseLSPp) for more information on how to load all the other SAEs in Gemma Scope
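The `filename` values in the `params` dictionaries above follow a `layer_{layer}/width_{width}/average_l0_{l0}/params.npz` pattern. As a minimal sketch, a small helper can build such a path for other SAEs. The name `sae_filename` is hypothetical (it is not part of any library), and whether a given layer, width, and L0 combination actually exists in a repository is an assumption you should check on the Hugging Face repo page.

```python
def sae_filename(layer: int, width: str, l0: int) -> str:
    """Build a params.npz path in the same pattern as the notebook's `params` dicts.

    Hypothetical helper: it only formats a string and does not check that the
    file exists in the SAE repository.
    """
    return f"layer_{layer}/width_{width}/average_l0_{l0}/params.npz"


# The two configurations that appear in this notebook.
print(sae_filename(31, "16k", 76))  # layer_31/width_16k/average_l0_76/params.npz
print(sae_filename(23, "16k", 74))  # layer_23/width_16k/average_l0_74/params.npz
```

The returned string can be passed as the `filename` argument of `hf_hub_download`, together with the matching `repo_id`.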

<details><summary>What is the residual stream?</summary>

Transformers have skip connections, which means that the output of each block is the output of each sublayer *plus* the input to the block. This means that each sublayer (attention or MLP) actually only has a fairly small effect on the output of the block, since most of it comes from all the earlier layers. We call the output of a block (including skip connections) the **residual stream**.

Everything communicated from earlier layers to later layers must go via the residual stream, so it acts as a "bottleneck" in the transformer, essentially capturing everything the model has "thought" so far. This means it is often a natural thing to study, since it will contain everything important going on in the model.
</details>
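To make the skip-connection picture concrete, here is a toy NumPy sketch of a residual stream. It is an illustration only: the sublayers are random stand-ins (`toy_sublayer` is a made-up function), and it ignores normalisation layers and everything else a real Gemma block does. The point is that each sublayer *adds* to the stream, so the final stream equals the embedding plus the sum of all sublayer outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_blocks = 8, 3


def toy_sublayer(x, weight):
    # Stand-in for attention or an MLP: some function of the current stream.
    return np.tanh(x @ weight)


x = rng.normal(size=(1, d_model))  # residual stream right after the embedding
stream = x.copy()
contributions = []

for _ in range(n_blocks):
    for _ in range(2):  # one "attention" and one "MLP" sublayer per block
        weight = rng.normal(scale=0.1, size=(d_model, d_model))
        update = toy_sublayer(stream, weight)
        contributions.append(update)
        stream = stream + update  # skip connection: add to the stream, never overwrite it

# The final stream is the embedding plus every sublayer's contribution.
reconstructed = x + np.sum(contributions, axis=0)
print(np.allclose(stream, reconstructed))
```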


In [32]:
from huggingface_hub import hf_hub_download

path_to_params = hf_hub_download(
    repo_id=sae_repo_id,
    filename=filename,
    force_download=False,
)

params = np.load(path_to_params)
pt_params = {k: torch.from_numpy(v).cuda() for k, v in params.items()}

### Implementing the SAE


We now define the forward pass of the SAE for pedagogical purposes (in practice, we recommend using the implementation in SAELens).

Gemma Scope is a collection of [JumpReLU SAEs](https://arxiv.org/abs/2407.14435). Each one is like a standard two-layer (one hidden layer) neural network, but where the activation function is a **JumpReLU**: a ReLU with a discontinuous jump.
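As a small numerical illustration of the "jump", the sketch below compares a plain ReLU with a JumpReLU on a handful of made-up pre-activations. The threshold values are illustrative, not taken from any trained SAE. The `encode` method in the next cell additionally multiplies by a ReLU, which only makes a difference if a threshold were negative.

```python
import numpy as np


def relu(pre_acts):
    return np.maximum(pre_acts, 0.0)


def jump_relu(pre_acts, threshold):
    # Keep a pre-activation only if it exceeds that feature's threshold;
    # anything at or below the threshold is set to exactly zero.
    return np.where(pre_acts > threshold, pre_acts, 0.0)


pre_acts = np.array([-1.0, 0.3, 0.8, 2.0])
threshold = np.array([0.5, 0.5, 0.5, 0.5])  # illustrative per-feature thresholds

print(relu(pre_acts).tolist())                  # [0.0, 0.3, 0.8, 2.0]
print(jump_relu(pre_acts, threshold).tolist())  # [0.0, 0.0, 0.8, 2.0]
```

The value 0.3 survives the ReLU but not the JumpReLU: the output jumps from 0 straight to the pre-activation value once the threshold is crossed.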

In [33]:
import torch.nn as nn
class JumpReLUSAE(nn.Module):
  def __init__(self, d_model, d_sae):
    # Note that we initialise these to zeros because we're loading in pre-trained weights.
    # If you want to train your own SAEs, you will need a proper initialisation scheme instead.
    super().__init__()
    self.W_enc = nn.Parameter(torch.zeros(d_model, d_sae))
    self.W_dec = nn.Parameter(torch.zeros(d_sae, d_model))
    self.threshold = nn.Parameter(torch.zeros(d_sae))
    self.b_enc = nn.Parameter(torch.zeros(d_sae))
    self.b_dec = nn.Parameter(torch.zeros(d_model))

  def encode(self, input_acts):
    pre_acts = input_acts @ self.W_enc + self.b_enc
    mask = (pre_acts > self.threshold)
    acts = mask * torch.nn.functional.relu(pre_acts)
    return acts

  def decode(self, acts):
    return acts @ self.W_dec + self.b_dec

  def forward(self, acts):
    acts = self.encode(acts)
    recon = self.decode(acts)
    return recon


In [34]:
sae = JumpReLUSAE(params['W_enc'].shape[0], params['W_enc'].shape[1])
sae.load_state_dict(pt_params)
sae.cuda()

JumpReLUSAE()

### Running the SAE on model activations


Let's first get some activations out of the model at the SAE target site. We'll demonstrate how to do this 'manually' first, using PyTorch hooks. Note that this is not particularly good practice, and it's probably more practical to use a library like TransformerLens to handle hooking the SAE into a model forward pass. But for illustrative purposes, it's useful to see how it's done.

We can gather activations at a site by registering a hook. To keep this local, we can wrap this in a function that registers a hook, runs the model while saving the intermediate activation, then removes the hook. (This is basically what TransformerLens is doing under the hood.)
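The register, run, remove pattern can be tried on a tiny model before touching Gemma. The sketch below is an illustration under stated assumptions: `toy_model` and `run_and_capture` are made-up names, and the toy layers return a plain tensor, whereas the notebook's function below indexes `outputs[0]` because it expects the decoder layer to return a tuple.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny stand-in model; index 0 plays the role of the "target layer".
toy_model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))


def run_and_capture(model, layer_index, inputs):
    """Run `model` once and return the output of the sublayer at `layer_index`."""
    captured = {}

    def hook(module, hook_inputs, hook_output):
        # Store the intermediate activation; returning None leaves the output unchanged.
        captured["act"] = hook_output.detach()

    handle = model[layer_index].register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(inputs)
    finally:
        handle.remove()  # always clean up, even if the forward pass raises
    return captured["act"]


x = torch.randn(3, 4)
hidden = run_and_capture(toy_model, 0, x)
print(hidden.shape)  # torch.Size([3, 8])
```

Removing the hook in a `finally` block keeps the model clean if the forward pass fails part-way, which matters in a notebook where the same model object is reused across cells.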

In [35]:
def gather_residual_activations(model, target_layer, inputs):
  target_act = None
  def gather_target_act_hook(mod, inputs, outputs):
    nonlocal target_act # make sure we can modify the target_act from the outer scope
    target_act = outputs[0] # outputs is expected to be a tuple whose first element is the hidden state
    return outputs
  handle = model.model.layers[target_layer].register_forward_hook(gather_target_act_hook)
  try:
    _ = model.forward(inputs)
  finally:
    handle.remove() # always remove the hook, even if the forward pass fails
  return target_act

In [6]:
import pandas as pd

dataset_name = "cornell-movie-review-data/rotten_tomatoes/"

splits = {'train': 'train.parquet', 'validation': 'validation.parquet', 'test': 'test.parquet'}
df = pd.read_parquet(f"hf://datasets/{dataset_name}" + splits["train"])

In [7]:
n = len(df)

sub_df = df.sample(n=n)

prompts = sub_df["text"].tolist()

In [8]:
import os
weight_name = dataset_name + "/" + model_name + "/" + filename
weight_name = weight_name.replace(os.sep, "_")
weight_name

'cornell-movie-review-data_rotten_tomatoes__google_gemma-2-9b-it_layer_31_width_16k_average_l0_76_params.npz'

In [9]:
target_acts = []

from tqdm import tqdm
import torch
import numpy as np

with torch.no_grad():
    for prompt in tqdm(prompts):
        inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=True).to("cuda")

        target_act = gather_residual_activations(model, layer, inputs)
        target_acts.append(target_act)
        
        # Optionally, clear CUDA cache
        torch.cuda.empty_cache()


# Create a list of tensors
tensor_list = target_acts

# Convert to NumPy and save
# np.savez(f'{weight_name}.npz', 
#          *[f'array_{i}' for i in range(len(tensor_list))],
#          **{f'array_{i}': tensor.cpu().numpy() for i, tensor in enumerate(tensor_list)})

100%|██████████| 8530/8530 [11:18<00:00, 12.57it/s]


Now, we can run our SAE on the saved activations.

In [10]:
sae_acts = []
    
from tqdm import tqdm

with torch.no_grad():
    for target_act in tqdm(target_acts):
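        # Move the activation to the GPU as float32 if it's not already there
        target_act_gpu = target_act.to(torch.float32).cuda()
        
        sae_act = sae.encode(target_act_gpu)

        # Binary indicator per feature: did it fire on any token of the prompt?
        # Note that this includes position 0 (the BOS token).
        # The result is moved to the CPU as a NumPy array of shape [1, d_sae].
        sae_act_aggregated = ((sae_act[:,:,:] > 0).sum(1) > 0).cpu().numpy()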
        
        # Append the CPU numpy array
        sae_acts.append(sae_act_aggregated)
        
        # Optionally, clear CUDA cache
        torch.cuda.empty_cache()

100%|██████████| 8530/8530 [00:05<00:00, 1451.02it/s]


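Before going further, it is worth double-checking that the SAE looks sensible, for example by checking that its reconstruction explains a decent chunk of the variance of the activations. The cells below then stack the per-prompt feature indicators into a single matrix and attach the sentiment labels.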
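The variance check mentioned above can be sketched as follows. This is a NumPy illustration on random data, not a result from the model: `fraction_variance_explained` is a hypothetical helper, and `recon` here is the input plus small noise standing in for a real SAE reconstruction. Position 0 is dropped because the SAEs are not trained on the BOS token.

```python
import numpy as np


def fraction_variance_explained(acts, recon):
    """Return 1 - (mean squared reconstruction error) / (variance of the activations)."""
    mse = np.mean((recon - acts) ** 2)
    return 1.0 - mse / np.var(acts)


rng = np.random.default_rng(0)
acts = rng.normal(size=(1, 6, 16))                     # [batch, position, d_model]
recon = acts + rng.normal(scale=0.1, size=acts.shape)  # stand-in for sae(acts)

# Skip position 0 (BOS) on both tensors before comparing.
fve = fraction_variance_explained(acts[:, 1:], recon[:, 1:])
print(round(float(fve), 3))
```

A value close to 1 means the reconstruction captures most of the variance; a value near 0 means it does no better than predicting the mean, which would suggest the wrong hook site or a missing scaling factor.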

In [11]:
# Concatenate the list of numpy arrays on the first dimension
array = np.concatenate(sae_acts, axis=0).astype(float)

In [12]:
result_df = pd.DataFrame(array)
result_df["label"] = sub_df["label"].values

result_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16375,16376,16377,16378,16379,16380,16381,16382,16383,label
0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,...,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1
1,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,...,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0
2,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,...,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0
3,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,...,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0
4,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,...,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8525,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,...,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1
8526,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,...,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1
8527,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,...,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1
8528,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,...,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0


In [13]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
import requests

max_depth = 5

def get_feature_descriptions(feature, model="gemma-2-2b", layer="20-gemmascope-res-65k"):
    url = f"https://www.neuronpedia.org/api/feature/{model}/{layer}/{feature}"
    response = requests.get(url)
    output = response.json()["explanations"][0]["description"]
    return output

get_feature_descriptions_gemma_2_9b = lambda x: get_feature_descriptions(x, model="gemma-2-9b-it", layer="31-gemmascope-res-16k")

# Assuming your data is already in a DataFrame called 'result_df'
# If not, load your data into a DataFrame first

# Separate features and target
X = result_df.drop('label', axis=1)
y = result_df['label']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Fit decision tree classifier with constraints
clf = DecisionTreeClassifier(
    max_depth=max_depth,  # Limit the depth of the tree
    random_state=42
)
clf.fit(X_train, y_train)

# Make predictions
y_train_pred = clf.predict(X_train)
y_val_pred = clf.predict(X_val)

print("Accuracy on training:", accuracy_score(y_train, y_train_pred))
print("Classification Report on training:")
print(classification_report(y_train, y_train_pred))

print("Accuracy on validation:", accuracy_score(y_val, y_val_pred))
print("\nClassification Report on validation:")
print(classification_report(y_val, y_val_pred))

# Get feature importances
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': clf.feature_importances_
})

# Sort features by importance
feature_importance = feature_importance.sort_values('importance', ascending=False)

print("Non-zero features:", feature_importance.loc[feature_importance["importance"] > 0].feature.tolist())

# Print top 20 most important features
print("\nTop 20 Most Important Features:")
print(feature_importance.head(20))

# Get feature descriptions for non-zero importance features
non_zero_features = feature_importance.loc[feature_importance["importance"] > 0, "feature"].tolist()
feature_descriptions = {feature: get_feature_descriptions_gemma_2_9b(feature) for feature in non_zero_features}

# Create a mapping of feature names to their descriptions
feature_names_with_desc = [f"{feat}\n{feature_descriptions[feat][:50]}..." if feat in feature_descriptions else feat for feat in X.columns]

# # Visualize the decision tree with feature descriptions
# plt.figure(figsize=(30,15))
# plot_tree(clf, feature_names=feature_names_with_desc, class_names=clf.classes_.astype(str), filled=True, rounded=True, max_depth=3)
# plt.savefig('constrained_decision_tree_with_descriptions.png', dpi=300, bbox_inches='tight')
# plt.close()

# print("Constrained decision tree visualization with feature descriptions has been saved as 'constrained_decision_tree_with_descriptions.png'")

Accuracy on training: 0.8402637845759297
Classification Report on training:
              precision    recall  f1-score   support

           0       0.83      0.86      0.84      2713
           1       0.86      0.82      0.84      2746

    accuracy                           0.84      5459
   macro avg       0.84      0.84      0.84      5459
weighted avg       0.84      0.84      0.84      5459

Accuracy on validation: 0.8234432234432234

Classification Report on validation:
              precision    recall  f1-score   support

           0       0.82      0.84      0.83       692
           1       0.83      0.81      0.82       673

    accuracy                           0.82      1365
   macro avg       0.82      0.82      0.82      1365
weighted avg       0.82      0.82      0.82      1365

Non-zero features: [6272, 8410, 11367, 14557, 15837, 12526, 7886, 1518, 13556, 854, 14929, 7796, 15291, 1244, 2442, 14484, 10718, 13507, 264, 8867, 13444, 13545, 6532, 5864]

Top 20 Most Im

In [14]:
import pickle

clf_name = f"decision_tree_max_depth_{max_depth}_ "+ model_name + "_" + filename.split(".npz")[0]
clf_name = clf_name.replace(os.sep, "_")

with open(f'{clf_name}.pkl', 'wb') as model_file:
    pickle.dump(clf, model_file)

print(f"Decision Tree model has been exported to {clf_name}.pkl")

with open(f"{clf_name}.pkl", 'rb') as model_file:
    clf = pickle.load(model_file)

Decision Tree model has been exported to decision_tree_max_depth_5_ google_gemma-2-9b-it_layer_31_width_16k_average_l0_76_params.pkl


In [15]:
with open(f"{clf_name}.pkl", 'rb') as model_file:
    clf = pickle.load(model_file)

In [29]:
from sklearn.linear_model import LogisticRegression
X = result_df.drop('label', axis=1)
y = result_df['label']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# # Fit logistic regression with L1 regularization
# clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1, random_state=42)
# clf.fit(X_train, y_train)

# # Make predictions
# y_pred = clf.predict(X_test)

C = 0.01

# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_val_scaled = scaler.transform(X_val)

# Fit logistic regression with L1 regularization
clf = LogisticRegression(penalty='l1', solver='liblinear', C=C, random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_val_pred = clf.predict(X_val)

print("Accuracy on training:", accuracy_score(y_train, clf.predict(X_train)))
print("Classification Report on training:")
print(classification_report(y_train, clf.predict(X_train)))

# Print accuracy and classification report
print("Accuracy on validation:", accuracy_score(y_val, y_val_pred))
print("\nClassification Report on validation:")
print(classification_report(y_val, y_val_pred))

# Get feature importances
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': np.abs(clf.coef_[0])
})

# Sort features by importance
feature_importance = feature_importance.sort_values('importance', ascending=False)

print("Non zero features:", feature_importance.loc[feature_importance["importance"] > 0].feature.tolist())

Accuracy on training: 0.8521707272394211
Classification Report on training:
              precision    recall  f1-score   support

           0       0.84      0.87      0.85      2713
           1       0.87      0.83      0.85      2746

    accuracy                           0.85      5459
   macro avg       0.85      0.85      0.85      5459
weighted avg       0.85      0.85      0.85      5459

Accuracy on validation: 0.8652014652014652

Classification Report on validation:
              precision    recall  f1-score   support

           0       0.85      0.89      0.87       692
           1       0.88      0.84      0.86       673

    accuracy                           0.87      1365
   macro avg       0.87      0.86      0.87      1365
weighted avg       0.87      0.87      0.87      1365

Non zero features: [6272, 8410, 14557, 7886, 11367, 13556, 15837, 6634, 4795, 1518, 3456, 7796, 3404, 15142, 4364, 12526, 3628, 920, 12970, 5236, 1631, 1374, 13679, 14218, 10816, 3762]


It's always worth checking this sort of thing when you do this by hand, to confirm that you haven't got the wrong site or missed a scaling factor or something similar. But here, our results all look like they are supposed to.

Note that there's a bit of a gotcha here: our SAEs are *NOT* trained on the BOS token, because we found that this tended to be a large outlier and to mess up training. So they tend to give nonsense when we apply them to it, and we need to be careful not to do this accidentally! We can see this above: the BOS token is a total outlier in terms of L0!
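One way to guard against the BOS gotcha is to drop position 0 before aggregating over tokens. The sketch below uses a made-up activation array (the values are not from the model) to show how a BOS position where many features fire would otherwise mark every feature as "present" in the prompt.

```python
import numpy as np

# Toy SAE activations with shape [batch, position, feature]; position 0 is BOS.
sae_act = np.zeros((1, 4, 5))
sae_act[0, 0, :] = 3.0  # BOS: pretend every feature fires (an outlier)
sae_act[0, 2, 1] = 0.7  # a real token activates feature 1
sae_act[0, 3, 4] = 1.2  # another real token activates feature 4

# "Did this feature fire on any token?" with and without the BOS position.
with_bos = (sae_act > 0).any(axis=1)
without_bos = (sae_act[:, 1:, :] > 0).any(axis=1)

print(with_bos[0].tolist())     # [True, True, True, True, True]
print(without_bos[0].tolist())  # [False, True, False, False, True]

# L0 (number of active features) per position makes the outlier obvious.
print((sae_act > 0).sum(axis=-1)[0].tolist())  # [5, 0, 1, 1]
```

Applied to the aggregation used earlier in this notebook, the equivalent change would be slicing `sae_act[:, 1:, :]` instead of `sae_act[:, :, :]`; note that doing so would change the feature matrix and therefore the classifier results shown here.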

Let's look at the highest activating features on this input text, on each token position:
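As a minimal sketch of what "highest activating feature per token" means, the example below takes the maximum and argmax over the feature axis of a toy activation array. The tokens, feature indices, and activation values are all made up for illustration; with the real model you would use the output of `sae.encode` instead.

```python
import numpy as np

tokens = ["<bos>", "the", " acting", " was", " terrible"]

# Toy SAE activations with shape [batch, position, feature].
sae_act = np.zeros((1, len(tokens), 6))
sae_act[0, 2, 3] = 0.4
sae_act[0, 3, 3] = 0.9
sae_act[0, 4, 1] = 2.5
sae_act[0, 4, 3] = 0.6

# Skip position 0 (BOS), then take the strongest feature at each position.
values = sae_act[0, 1:].max(axis=-1)
indices = sae_act[0, 1:].argmax(axis=-1)

for token, value, index in zip(tokens[1:], values, indices):
    if value > 0:
        print(f"{token!r}: feature {index} (activation {value:.2f})")
    else:
        print(f"{token!r}: no active feature")
```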

In [49]:
feature

6272


In [46]:
import gradio as gr

topk = 3

examples = [
    "a masterpiece four years in the making .",
    "a sentimental mess that never rings true .",
    "the action clichés just pile up ."
]

text = "I really wished I could give this movie a higher rating. The plot was interesting, but the acting was terrible. The special effects were great, but the pacing was off. The movie was too long, but the ending was satisfying."

inputs = tokenizer.encode(text, return_tensors="pt", add_special_tokens=True).to("cuda")

target_act = gather_residual_activations(model, layer, inputs)
sae_act = sae.encode(target_act)
sae_act_aggregated = ((sae_act[:,:,:] > 0).sum(1) > 0).cpu().numpy()

X = pd.DataFrame(sae_act_aggregated)

feature_contributions = X.iloc[0].astype(float).values * clf.coef_[0]

contrib_df = pd.DataFrame({
        'feature': range(len(feature_contributions)),
        'contribution': feature_contributions
})

contrib_df = contrib_df.loc[contrib_df['contribution'].abs() > 0]

# Sort by absolute contribution and get top N
contrib_df = contrib_df.reindex(contrib_df['contribution'].abs().sort_values(ascending=False).index)

contrib_df = contrib_df.head(topk)
contrib_df["description"] = contrib_df["feature"].apply(get_feature_descriptions)

import plotly.graph_objs as go

fig = go.Figure(go.Bar(
    x=contrib_df['contribution'],
    y=contrib_df['description'],
    orientation='h'  # Horizontal bar chart
))

fig.update_layout(
    title='Feature contribution',
    xaxis_title='Contribution',
    yaxis_title='Features',
    height=500,
    margin=dict(l=200)  # Increase left margin to accommodate longer feature names
)
fig.update_yaxes(autorange="reversed")

probability = clf.predict_proba(X)[0]
classes = {
    "Positive": probability[1],
    "Negative": probability[0]
}

choices = [(description, feature) for description, feature in zip(contrib_df["description"], contrib_df["feature"])]
dropdown = gr.Dropdown(choices=choices, 
                        value=choices[0][1],
                        interactive=True, label="Features")

feature = choices[0][1]

inputs = tokenizer.encode(text, return_tensors="pt", add_special_tokens=True).to("cuda")

target_act = gather_residual_activations(model, layer, inputs)
sae_act = sae.encode(target_act)

activated_tokens = sae_act[0:,:,feature]
max_activation = activated_tokens.max().item()
activated_tokens /= max_activation

activated_tokens = activated_tokens.cpu().detach().numpy()

output = []

for i, token_id in enumerate(inputs[0, :]):
    token = tokenizer.decode(token_id)
    output.append((token, activated_tokens[0, i]))

print(output)
fig.show()

[('<bos>', 0.0), ('I', 0.0), (' really', 0.0), (' wished', 0.0), (' I', 0.0), (' could', 0.0), (' give', 0.0), (' this', 0.0), (' movie', 0.29309767), (' a', 0.0), (' higher', 0.0), (' rating', 0.0), ('.', 0.0), (' The', 0.21383312), (' plot', 0.0), (' was', 0.22131765), (' interesting', 0.0), (',', 0.0), (' but', 0.51617336), (' the', 0.5799874), (' acting', 0.347309), (' was', 0.36400035), (' terrible', 0.49232012), ('.', 0.7318199), (' The', 0.56170917), (' special', 0.0), (' effects', 0.45976144), (' were', 0.99999994), (' great', 0.0), (',', 0.0), (' but', 0.47706267), (' the', 0.4011524), (' pacing', 0.7848547), (' was', 0.9232518), (' off', 0.4621812), ('.', 0.0), (' The', 0.0), (' movie', 0.59335506), (' was', 0.57606274), (' too', 0.0), (' long', 0.0), (',', 0.0), (' but', 0.4116312), (' the', 0.0), (' ending', 0.5189625), (' was', 0.71944976), (' satisfying', 0.0), ('.', 0.0)]


In [48]:
classes

{'Positive': 0.3629834319308022, 'Negative': 0.6370165680691978}

In [44]:
feature = choices[2][1]

inputs = tokenizer.encode(text, return_tensors="pt", add_special_tokens=True).to("cuda")

target_act = gather_residual_activations(model, layer, inputs)
sae_act = sae.encode(target_act)

activated_tokens = sae_act[0:,:,feature]
max_activation = activated_tokens.max().item()
activated_tokens /= max_activation

activated_tokens = activated_tokens.cpu().detach().numpy()

output = []

for i, token_id in enumerate(inputs[0, :]):
    token = tokenizer.decode(token_id)
    output.append((token, activated_tokens[0, i]))

print(output)

[('<bos>', 0.0), ('the', 0.0), (' action', 0.0), (' clichés', 0.47497016), (' just', 1.0), (' pile', 0.516835), (' up', 0.46400496), (' .', 0.4915409)]


In [17]:
import requests

def get_feature_descriptions(feature):
    layer_name = f"{layer}-gemmascope-res-{width}"
    model_name_neuronpedia = model_name.split("/")[1]

    url = f"https://www.neuronpedia.org/api/feature/{model_name_neuronpedia}/{layer_name}/{feature}"

    response = requests.get(url)
    output = response.json()["explanations"][0]["description"]
    return output

# Get feature importances
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': np.abs(clf.coef_[0])
})

# Sort features by importance
feature_importance = feature_importance.sort_values('importance', ascending=False)
feature_importance = feature_importance.loc[feature_importance["importance"] > 0]

# feature_importance["description"] = feature_importance["feature"].apply(get_feature_descriptions)

print("Non zero features:", feature_importance.feature.tolist())

# Print top 20 most important features
print("\nTop Important Features:")
print(feature_importance.head(20))

Non zero features: [6272, 8410, 14557, 7886, 11367, 13556, 15837, 6634, 4795, 1518, 3456, 7796, 3404, 15142, 4364, 12526, 3628, 920, 12970, 5236, 1631, 1374, 13679, 14218, 10816, 3762]

Top Important Features:
      feature  importance
6272     6272    1.235425
8410     8410    0.707908
14557   14557    0.576760
7886     7886    0.485816
11367   11367    0.467120
13556   13556    0.417031
15837   15837    0.383319
6634     6634    0.354729
4795     4795    0.327832
1518     1518    0.325042
3456     3456    0.193763
7796     7796    0.178672
3404     3404    0.155527
15142   15142    0.123701
4364     4364    0.114390
12526   12526    0.098219
3628     3628    0.084569
920       920    0.056221
12970   12970    0.046524
5236     5236    0.046149


In [18]:
import pickle

clf_name = f"linear_classifier_C_{C}_ "+ model_name + "_" + filename.split(".npz")[0]
clf_name = clf_name.replace(os.sep, "_")

with open(f'{clf_name}.pkl', 'wb') as model_file:
    pickle.dump(clf, model_file)

print(f"Linear classifier model has been exported to {clf_name}.pkl")

Linear classifier model has been exported to linear_classifier_C_0.01_ google_gemma-2-9b-it_layer_31_width_16k_average_l0_76_params.pkl


In [23]:
import pandas as pd

# params = {
#     "model_name" : "google/gemma-2-2b",
#     "width" : "16k",
#     "layer" : 23,
#     "l0" : 74,
#     "sae_repo_id": "google/gemma-scope-2b-pt-res",
#     "filename" : "layer_23/width_16k/average_l0_74/params.npz"
# }

params = {
    "model_name" : "google/gemma-2-9b-it",
    "width" : "16k",
    "layer" : 31,
    "l0" : 76,
    "sae_repo_id": "google/gemma-scope-9b-it-res",
    "filename" : "layer_31/width_16k/average_l0_76/params.npz"
}

model_name = params["model_name"]
width = params["width"]
layer = params["layer"]
l0 = params["l0"]
sae_repo_id = params["sae_repo_id"]
filename = params["filename"]

feature_importance = pd.read_csv("feature_importance.csv")
feature_importance = feature_importance.iloc[:3]

import requests

def get_feature_descriptions(feature):
    layer_name = f"{layer}-gemmascope-res-{width}"
    model_name_neuronpedia = model_name.split("/")[1]

    url = f"https://www.neuronpedia.org/api/feature/{model_name_neuronpedia}/{layer_name}/{feature}"

    response = requests.get(url)
    output = response.json()["explanations"][0]["description"]
    return output
feature_importance["description"] = feature_importance["feature"].apply(get_feature_descriptions)

In [24]:
import plotly.graph_objs as go

fig = go.Figure(go.Bar(
    x=feature_importance['importance'],
    y=feature_importance['description'],
    orientation='h'  # Horizontal bar chart
))

fig.update_layout(
    title='Feature Importance',
    xaxis_title='Importance',
    yaxis_title='Features',
    height=500,
    margin=dict(l=200)  # Increase left margin to accommodate longer feature names
)
fig.update_yaxes(autorange="reversed")
fig.show()

In [25]:
topk = 3
topk_features = feature_importance.head(topk).feature.values

print(topk_features)

from IPython.display import IFrame, display
html_template = "https://neuronpedia.org/{}/{}/{}?embed=true&embedexplanation=true&embedplots=true&embedtest=true&height=300"

def get_dashboard_html(sae_release, sae_id, feature_idx=0):
    return html_template.format(sae_release, sae_id, feature_idx)

for feature_idx in topk_features:
    print(f"Feature: {feature_idx}")
    print(f"Coefficient: {clf.coef_[0][feature_idx]}")
    html = get_dashboard_html(sae_release = "gemma-2-2b", sae_id="23-gemmascope-res-16k", feature_idx=feature_idx)
    display(IFrame(html, width=1200, height=600))
    print("\n")

# html = get_dashboard_html(sae_release = "gemma-2-2b", sae_id="20-gemmascope-res-16k", feature_idx=10004)
# IFrame(html, width=1200, height=600)

[ 3946  4438 13920]
Feature: 3946
Coefficient: 0.0




Feature: 4438
Coefficient: -1.4645945804608147




Feature: 13920
Coefficient: 0.763696937782067






In [22]:
import gradio as gr
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import hf_hub_download
import numpy as np
import torch

torch.set_grad_enabled(False) # avoid blowing up mem

params = {
    "model_name" : "google/gemma-2-2b",
    "width" : "16k",
    "layer" : 23,
    "l0" : 74,
    "sae_repo_id": "google/gemma-scope-2b-pt-res",
    "filename" : "layer_23/width_16k/average_l0_74/params.npz"
}

model_name = params["model_name"]
width = params["width"]
layer = params["layer"]
l0 = params["l0"]
sae_repo_id = params["sae_repo_id"]
filename = params["filename"]

C = 0.01

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
)
tokenizer =  AutoTokenizer.from_pretrained(model_name)

path_to_params = hf_hub_download(
    repo_id=sae_repo_id,
    filename=filename,
    force_download=False,
)

params = np.load(path_to_params)
pt_params = {k: torch.from_numpy(v).cuda() for k, v in params.items()}

import pickle
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

clf_name = f"linear_classifier_C_{C}_ "+ model_name + "_" + filename.split(".npz")[0]
clf_name = clf_name.replace(os.sep, "_")

scaler_name = f"scaler_C_{C}_ "+ model_name + "_" + filename.split(".npz")[0]
scaler_name = scaler_name.replace(os.sep, "_")

with open(f"{clf_name}.pkl", 'rb') as model_file:
    clf = pickle.load(model_file)

with open(f"{scaler_name}.pkl", 'rb') as scaler_file:
    scaler = pickle.load(scaler_file)

import torch.nn as nn
class JumpReLUSAE(nn.Module):
  def __init__(self, d_model, d_sae):
    # Note that we initialise these to zeros because we're loading in pre-trained weights.
    # If you want to train your own SAEs, you will need a proper initialisation scheme instead.
    super().__init__()
    self.W_enc = nn.Parameter(torch.zeros(d_model, d_sae))
    self.W_dec = nn.Parameter(torch.zeros(d_sae, d_model))
    self.threshold = nn.Parameter(torch.zeros(d_sae))
    self.b_enc = nn.Parameter(torch.zeros(d_sae))
    self.b_dec = nn.Parameter(torch.zeros(d_model))

  def encode(self, input_acts):
    pre_acts = input_acts @ self.W_enc + self.b_enc
    mask = (pre_acts > self.threshold)
    acts = mask * torch.nn.functional.relu(pre_acts)
    return acts

  def decode(self, acts):
    return acts @ self.W_dec + self.b_dec

  def forward(self, acts):
    acts = self.encode(acts)
    recon = self.decode(acts)
    return recon

sae = JumpReLUSAE(params['W_enc'].shape[0], params['W_enc'].shape[1])
sae.load_state_dict(pt_params)
sae.cuda()

@torch.no_grad()
def gather_residual_activations(model, target_layer, inputs):
  target_act = None
  def gather_target_act_hook(mod, inputs, outputs):
    nonlocal target_act # make sure we can modify the target_act from the outer scope
    target_act = outputs[0]
    return outputs
  handle = model.model.layers[target_layer].register_forward_hook(gather_target_act_hook)
  _ = model.forward(inputs)
  handle.remove()
  return target_act

import requests

def get_feature_descriptions(feature):
    layer_name = f"{layer}-gemmascope-res-{width}"
    model_name_neuronpedia = model_name.split("/")[1]

    url = f"https://www.neuronpedia.org/api/feature/{model_name_neuronpedia}/{layer_name}/{feature}"

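    response = requests.get(url, timeout=30)
    response.raise_for_status()
    explanations = response.json().get("explanations") or []
    # Fall back to the feature index when Neuronpedia has no explanation for this feature.
    if not explanations:
        return f"feature {feature}"
    return explanations[0]["description"]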

def embed_content(url):
    html_content = f"""
    <div style="width:100%; height:500px; overflow:hidden;">
        <iframe src="{url}" width="100%" height="100%" frameborder="0"></iframe>
    </div>
    """
    return html_content

def dummy_function(*args):
    # This is a placeholder function. Replace with your actual logic.
    return "Scores will be displayed here"

examples = [
    "a masterpiece four years in the making .",
    "a sentimental mess that never rings true .",
    "the action clichés just pile up ."
]


EOFError: Ran out of input
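
`EOFError: Ran out of input` is the error `pickle.load` raises when the file it is handed is empty, for example a classifier or scaler file that was created but never written to. Below is a minimal sketch of a more defensive loader. The helper name `load_pickle_checked` is hypothetical and not part of this notebook; it only illustrates checking the file before unpickling.

```python
import os
import pickle


def load_pickle_checked(path):
    """Load a pickle, failing early with a readable message (hypothetical helper)."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"{path} not found; train and save the object first")
    if os.path.getsize(path) == 0:
        # An empty file is what makes pickle.load raise "EOFError: Ran out of input".
        raise ValueError(f"{path} is empty; re-run the cell that saves it")
    with open(path, "rb") as f:
        return pickle.load(f)
```

It could be used in place of the two `with open(...)` blocks above, e.g. `clf = load_pickle_checked(f"{clf_name}.pkl")`.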

In [None]:
import gradio as gr

topk = 5

def get_features(text):

    inputs = tokenizer.encode(text, return_tensors="pt", add_special_tokens=True).to("cuda")

    target_act = gather_residual_activations(model, layer, inputs)
    sae_act = sae.encode(target_act)
    sae_act_aggregated = ((sae_act[:,1:,:] > 0).sum(1) > 0).cpu().numpy()

    X = pd.DataFrame(sae_act_aggregated)

    feature_contributions = X.iloc[0].astype(float).values * clf.coef_[0]

    contrib_df = pd.DataFrame({
            'feature': range(len(feature_contributions)),
            'contribution': feature_contributions
    })

    contrib_df = contrib_df.loc[contrib_df['contribution'].abs() > 0]

    # Sort by absolute contribution and get top N
    contrib_df = contrib_df.reindex(contrib_df['contribution'].abs().sort_values(ascending=False).index)

    contrib_df = contrib_df.head(topk)
    contrib_df["description"] = contrib_df["feature"].apply(get_feature_descriptions)

    import plotly.graph_objs as go

    fig = go.Figure(go.Bar(
        x=contrib_df['contribution'],
        y=contrib_df['description'],
        orientation='h'  # Horizontal bar chart
    ))

    fig.update_layout(
        title='Feature contribution',
        xaxis_title='Contribution',
        yaxis_title='Features',
        height=500,
        margin=dict(l=200)  # Increase left margin to accommodate longer feature names
    )
    fig.update_yaxes(autorange="reversed")

    probability = clf.predict_proba(X)[0]
    classes = {
        "Positive": probability[1],
        "Negative": probability[0]
    }

    choices = list(zip(contrib_df["description"], contrib_df["feature"]))
    dropdown = gr.Dropdown(choices=choices,
                           # Guard against the case where no feature has a nonzero contribution.
                           value=choices[0][1] if choices else None,
                           interactive=True, label="Features")

    return classes, fig, dropdown

IndexError: index 31 is out of range
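
The ranking step inside `get_features` can be tried in isolation on toy data, without loading the model or the SAE. The sketch below mirrors it in plain Python: multiply a binary "did this feature fire anywhere in the sentence" vector by the classifier coefficients, drop zero contributions, and keep the `topk` largest by absolute value. The function name `top_contributions` and the toy numbers are made up for illustration.

```python
def top_contributions(active, coef, topk=5):
    """Return (feature_index, contribution) pairs, largest |contribution| first."""
    # A feature contributes its coefficient if it fired at least once, else 0.
    contributions = [float(a) * c for a, c in zip(active, coef)]
    ranked = sorted(
        ((i, c) for i, c in enumerate(contributions) if c != 0),
        key=lambda pair: abs(pair[1]),
        reverse=True,
    )
    return ranked[:topk]


active = [True, False, True, True]   # toy activation pattern for 4 features
coef = [0.2, 5.0, -0.9, 0.5]         # toy logistic-regression coefficients
print(top_contributions(active, coef, topk=2))  # [(2, -0.9), (3, 0.5)]
```

Feature 1 has the largest coefficient but never fires, so it does not appear in the ranking.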

In [28]:
print(text)
fig.show()

a sentimental mess that never rings true .


In [None]:
inputs = tokenizer.encode(text, return_tensors="pt", add_special_tokens=True).to("cuda")

target_act = gather_residual_activations(model, layer, inputs)
sae_act = sae.encode(target_act)


activated_tokens = sae_act[:, :, feature]
# max_activation = activated_tokens.max().item()
# activated_tokens /= max_activation

# activated_tokens = activated_tokens.cpu().detach().numpy()

# output = []

# for i, token_id in enumerate(inputs[0, :]):
#     token = tokenizer.decode(token_id)
#     output.append((token, activated_tokens[0, i]))

In [None]:
def get_feature_iframe(feature):
    layer_name = f"{layer}-gemmascope-res-{width}"
    model_name_neuronpedia = model_name.split("/")[1]

    # Embed the Neuronpedia feature page rather than the JSON API endpoint, so the iframe renders a dashboard.
    url = f"https://www.neuronpedia.org/{model_name_neuronpedia}/{layer_name}/{feature}?embed=true"
    html_content = embed_content(url)
    return html_content



digital-video documentary about stand-up comedians is a great glimpse into a very different world . 1


In [None]:
inputs = tokenizer.encode(text, return_tensors="pt", add_special_tokens=True).to("cuda")

target_act = gather_residual_activations(model, layer, inputs)
sae_act = sae.encode(target_act)

activated_tokens = sae_act[:, :, feature]
activated_tokens
# max_activation = activated_tokens.max().item()
# activated_tokens /= max_activation

# activated_tokens = activated_tokens.cpu().detach().numpy()

# output = []

# for i, token_id in enumerate(inputs[0, :]):
#     token = tokenizer.decode(token_id)
#     output.append((token, activated_tokens[0, i]))


tensor([[62.0354,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000]],
       device='cuda:0')

In [None]:
def get_highlighted_text(text, feature):

    inputs = tokenizer.encode(text, return_tensors="pt", add_special_tokens=True).to("cuda")

    target_act = gather_residual_activations(model, layer, inputs)
    sae_act = sae.encode(target_act)

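    # Skip the BOS token (position 0), which tends to have very large activations.
    activated_tokens = sae_act[:, 1:, feature]
    max_activation = activated_tokens.max().item()
    # Normalise to [0, 1]; skip the division when the feature never fires to avoid NaNs.
    if max_activation > 0:
        activated_tokens = activated_tokens / max_activation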

    activated_tokens = activated_tokens.cpu().detach().numpy()

    output = []

    for i, token_id in enumerate(inputs[0, 1:]):
        token = tokenizer.decode(token_id)
        output.append((token, activated_tokens[0, i]))

    return output

{'Positive': 0.9071025712081094, 'Negative': 0.09289742879189056}
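
`get_highlighted_text` returns a list of `(token, score)` pairs with scores scaled by the largest activation in the sentence. The scaling can be checked on toy numbers without a GPU. This is a minimal sketch in plain Python: `normalise_scores` is a hypothetical helper and the tokens and activations are invented.

```python
def normalise_scores(tokens, activations):
    """Pair each token with its activation divided by the sentence maximum."""
    peak = max(activations, default=0.0)
    if peak <= 0:
        # The feature never fires: return all-zero scores instead of dividing by zero.
        return [(tok, 0.0) for tok in tokens]
    return [(tok, act / peak) for tok, act in zip(tokens, activations)]


print(normalise_scores(["a", " great", " film"], [0.0, 31.0, 15.5]))
# [('a', 0.0), (' great', 1.0), (' film', 0.5)]
```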