This primer can serve as a comprehensive introduction to recent advances in interpretability for Transformer-based LMs for a technical audience, employing a unified notation to introduce network modules and present state-of-the-art interpretability methods.
Interpretability methods are presented with detailed formulations and categorized as either localizing the inputs or model components responsible for a particular prediction, or decoding information stored in learned representations. Various insights on the role of specific model components are then summarized, alongside recent work using model internals to guide model editing and mitigate hallucinations.
Finally, the paper provides a detailed picture of the open-source interpretability tools landscape, supporting the need for open-access models to advance interpretability research.
Today's pick in Interpretability & Analysis of LMs: by @aadityasingh, T. Moskovitz, F. Hill, S. C. Y. Chan, A. M. Saxe (@gatsbyunit)
This work proposes a new methodology inspired by optogenetics (dubbed "clamping") to perform targeted ablations during training to estimate the causal effect of specific interventions on mechanism formation.
The authors use this approach to study the formation of induction heads by training a 2-layer attention-only transformer to label examples using in-context information.
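To make the clamping idea concrete, here is a minimal PyTorch-style sketch (not the authors' implementation) of ablating a single attention head during training via a forward pre-hook on the attention output projection; the module path `model.layers[1].attn.out_proj`, the head layout, and the hyperparameters are assumptions.

```python
import torch.nn as nn

def clamp_head(out_proj: nn.Linear, head_idx: int, n_heads: int):
    """Zero one head's slice of the concatenated head outputs before the
    attention output projection, emulating a targeted 'clamping' ablation.
    Assumes heads are laid out contiguously along the last dimension."""
    d_head = out_proj.in_features // n_heads

    def pre_hook(module, args):
        (x,) = args  # (batch, seq_len, n_heads * d_head)
        x = x.clone()
        x[..., head_idx * d_head:(head_idx + 1) * d_head] = 0.0
        return (x,)

    return out_proj.register_forward_pre_hook(pre_hook)

# Hypothetical usage: ablate head 3 in layer 1 for part of training.
# handle = clamp_head(model.layers[1].attn.out_proj, head_idx=3, n_heads=8)
# ... training steps ...
# handle.remove()  # un-clamp the head
```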
Notable findings:
- The effects of induction heads are additive and redundant: when a strong induction head is ablated, weaker heads compensate well for it.
- Competition between induction heads might emerge as a product of optimization pressure to converge faster, but it is not strictly necessary, since all heads eventually learn to solve the task.
- Previous-token heads (PTHs) influence induction heads in a many-to-many fashion, with any PTH eliciting above-chance predictions from a subsequent induction head.
- Three subcircuits for induction are identified, respectively mixing token-label information (1 + 2), matching the previous occurrence of the current class in the context (3qk + 4), and copying the label of the matched class (3v + 5).
- The formation of induction heads is slowed down by a larger number of classes and labels, with more classes slowing down the formation of the matching mechanism and more labels slowing down the copying mechanism. This may have implications when selecting a vocabulary size for LLMs: larger vocabularies lead to an increased compression ratio and longer contexts, but they might make copying more challenging by delaying the formation of induction heads.
Today's pick in Interpretability & Analysis of LMs: Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms by @mwhanna, @sandropezzelle, @belinkov
Edge attribution patching (EAP) is a circuit discovery technique that uses gradients to approximate the effect of causally intervening on each model edge. In the literature, its effectiveness is validated by measuring the overlap between its resulting circuits and those found via causal interventions, which are much more expensive.
This work:
1. Proposes a new method for faithful and efficient circuit discovery named edge attribution patching with integrated gradients (EAP-IG).
2. Evaluates the faithfulness of EAP, EAP-IG, and activation patching circuits, i.e. whether the behavior of the model remains consistent after all non-circuit edges are ablated (see the sketch below).
3. Highlights that, while no overlap and full overlap between EAP-like circuits and activation patching circuits are generally good indicators of unfaithful and faithful circuit identification, respectively, circuits with moderate overlap cannot generally be assumed to be faithful to model behavior.
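A minimal sketch of one common way to quantify faithfulness, comparing the full model's next-token distribution with the distribution produced after all non-circuit edges are ablated; the use of KL divergence here is an illustrative choice, not necessarily the paper's exact metric.

```python
import torch
import torch.nn.functional as F

def circuit_faithfulness(model_logits: torch.Tensor,
                         circuit_logits: torch.Tensor) -> torch.Tensor:
    """KL divergence between the full model's output distribution and the
    distribution obtained after ablating all non-circuit edges (e.g. by
    replacing them with corrupted activations). A small value indicates the
    circuit reproduces the model's behavior, i.e. it is faithful."""
    p = F.log_softmax(model_logits, dim=-1)    # full model, (batch, vocab)
    q = F.log_softmax(circuit_logits, dim=-1)  # ablated circuit, (batch, vocab)
    return F.kl_div(q, p, log_target=True, reduction="batchmean")
```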
An advantage of EAP-IG is that it enables the use of KL divergence as a target for gradient propagation, which is not possible with raw gradient-based EAP.
The runtime of EAP-IG is comparable to that of EAP, since only a small number of steps is needed to approximate the gradient integral.
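Below is a minimal sketch of the EAP-IG scoring rule under simplifying assumptions: the activation difference on an edge is dotted with the metric's gradient averaged over a few interpolation steps between clean and corrupted activations. The helper `grad_fn`, the step count, and the final reduction are hypothetical, not the paper's code.

```python
import torch

def eap_ig_score(clean_act: torch.Tensor,
                 corrupt_act: torch.Tensor,
                 grad_fn,
                 steps: int = 5) -> torch.Tensor:
    """Approximate an edge's importance as (corrupted - clean) activation
    difference dotted with the gradient of the metric (e.g. KL divergence),
    averaging gradients over `steps` points on the straight line between
    clean and corrupted activations (integrated gradients).

    `grad_fn(act)` is assumed to return dL/d(act) from a forward/backward
    pass with the activation set to `act`."""
    diff = corrupt_act - clean_act
    avg_grad = torch.zeros_like(clean_act)
    for k in range(1, steps + 1):
        interp = clean_act + (k / steps) * diff
        avg_grad += grad_fn(interp)
    avg_grad /= steps
    # Sum over feature dimensions to obtain a scalar score for the edge.
    return (diff * avg_grad).sum()
```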
Importantly, circuit faithfulness does not imply completeness, i.e. whether all components contributing to a specific task have been accounted for. This aspect is identified as an interesting direction for future work.
Today's pick in Interpretability & Analysis of LMs: Information Flow Routes: Automatically Interpreting Language Models at Scale by @javifer, @lena-voita
This work presents a novel method for identifying salient components in Transformer-based language models by decomposing the contributions of model components to the residual stream.
The method is more efficient and scalable than previous techniques such as activation patching, since it only requires a single forward pass through the model to identify critical information flow paths. Moreover, it can be applied without a contrastive template; for activation patching, results are observed to depend on the selected contrastive example.
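The following is a minimal sketch of the underlying idea, assuming per-head contributions to the residual stream at a given position are available from a single forward pass; the norm-ratio importance measure and the threshold value are illustrative stand-ins for the paper's proportionality measure.

```python
import torch

def important_heads(head_outputs: torch.Tensor,
                    residual_update: torch.Tensor,
                    threshold: float = 0.05) -> torch.Tensor:
    """Given per-head contributions to the residual stream at one position
    (head_outputs: (n_heads, d_model)) and the total update they sum into
    (residual_update: (d_model,)), keep heads whose relative contribution
    exceeds `threshold`. This mirrors ranking edges by how much each
    component writes into the residual stream in a single forward pass."""
    contributions = head_outputs.norm(dim=-1) / residual_update.norm().clamp_min(1e-8)
    return contributions > threshold
```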
Information flow routes are applied to Llama 2, showing that:
1. Models show "typical" information flow routes for non-content words, while content words don't exhibit such patterns.
2. Feed-forward networks are more active in the bottom layers of the network (where e.g. subject enrichment is performed) and in the very last layer.
3. Positional and subword-merging attention heads are among the most active and important heads throughout the network.
4. Periods can be treated by the model like BOS tokens, with their residual representations left mostly untouched during the forward pass.
Finally, the paper also demonstrates that some model components specialize in specific domains, such as code or multilingual text, suggesting a high degree of modularity in the network. Projecting the right singular vectors of domain-specific heads' OV circuits onto the unembedding matrix shows highly interpretable concepts being handled by individual model components.
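A minimal sketch of this kind of analysis is given below: take the SVD of a head's OV circuit and read the top right singular directions through the unembedding matrix. The matrix shapes, the tokenizer interface, and the function name are assumptions of this sketch, not the paper's implementation.

```python
import torch

def ov_top_tokens(W_V: torch.Tensor, W_O: torch.Tensor, W_U: torch.Tensor,
                  tokenizer, n_directions: int = 5, n_tokens: int = 10):
    """Interpret an attention head by decomposing its OV circuit
    (W_V @ W_O, shape (d_model, d_model)) with an SVD and projecting the
    top right singular vectors through the unembedding matrix W_U
    (d_model, vocab) to read off the tokens each direction promotes."""
    W_OV = W_V @ W_O                          # (d_model, d_model) OV circuit
    _, _, Vh = torch.linalg.svd(W_OV)         # rows of Vh are right singular vectors
    logits = Vh[:n_directions] @ W_U          # (n_directions, vocab)
    top_ids = logits.topk(n_tokens, dim=-1).indices
    return [[tokenizer.decode([i]) for i in row] for row in top_ids.tolist()]
```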