Transformer Circuits Thread

Circuits Updates - March 2024



We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research on which we expect to publish more in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.

We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper.

Updates







Using Features For Easy Circuit Identification

Joshua Batson, Brian Chen, Andy Jones

Finding circuits in language models can be a labor-intensive process. You specify a behavior of interest, such as indirect-object identification, inequality identification, or factual association, then assemble a dataset with hundreds or thousands of examples of that behavior. To find model components which are important for a prediction, you ablate or patch them in turn on each example. This produces a list of components, and further investigations can reveal their interactions on the assembled dataset.

Features produced by sparse autoencoders (SAEs) offer an alternative to behavior-specific dataset curation. Key aspects of the full data distribution, as processed by the model, are encapsulated in the dictionary. This allows us to generate rich hypotheses from a single example text, and to study the more general function of the revealed components in terms of other features they interact with.

We illustrate this approach by investigating the attribute-extraction process following Nanda et al. (see also Geva et al., Meng et al.). The authors studied completions of the sentence "Fact: [ATHLETE NAME] plays the sport of" for a collection of ~1500 [ATHLETE NAME]s playing basketball, baseball, or tennis. They found that processing proceeds in three distinct phases, including an early subject-enrichment phase on the athlete's name tokens and a later attribute-extraction phase carried out by attention heads.

We investigate an 18-layer model, using a sparse autoencoder with an expansion factor of ~256 trained on the layer 9 residual stream (as in Cunningham et al.).

Detecting Important Features with Attribution

The first step of our investigation was to determine which features drive the behavior in our task. We automatically identify basketball, baseball, and tennis features, learned by our SAE on the residual stream, which seem to play a key role.

We begin with the example text "Fact: Michael Jordan plays the sport of", and compute an attribution score for each autoencoder feature in the layer 9 residual stream at the position [ Jordan] with respect to the logit difference between completions of [ basketball] and [ golf]:

\text{attr}_i := a_i \, \mathbf{d}_i \cdot \nabla_{x} \mathcal{L},

where a_i is the activation of feature i, \mathbf{d}_i is the dictionary (decoder) vector for feature i in the SAE, and \nabla_{x} \mathcal{L} is the gradient of the logit difference between [ basketball] and [ golf] on the prediction of the final token with respect to the residual stream on the token [ Jordan] in layer 9 where the SAE is defined. (We use the logit difference between two sports to identify the features that produce basketball specifically, and avoid those which might produce all sports in general.)
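In code, this attribution is a single rescaled dot product per feature. A minimal sketch (ours, with hypothetical tensor names for the cached residual stream, SAE weights, and logit-difference gradient):

```python
import torch

def feature_attributions(resid, enc_w, enc_b, dec_w, grad):
    """Per-feature attribution scores attr_i = a_i * (d_i . grad).

    resid:  (d_model,) residual stream at the [ Jordan] position, layer 9
    enc_w:  (n_features, d_model) SAE encoder weights E
    enc_b:  (n_features,) SAE encoder biases b
    dec_w:  (n_features, d_model) SAE decoder (dictionary) vectors d_i
    grad:   (d_model,) gradient of the [ basketball] - [ golf] logit
            difference with respect to this residual stream position
    """
    acts = torch.relu(enc_w @ resid + enc_b)  # feature activations a_i
    return acts * (dec_w @ grad)              # attr_i = a_i * (d_i . grad)
```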

There is a clear top feature by attribution. The top-activating dataset examples (from a feature visualization precomputed on the training mix) include basketball-related words such as "layup", "half-court", and "point guard". Since each feature's activation is a thresholded affine function of the residual stream, \text{relu}(E_i x + b_i) (where E_i is that feature’s vector in the encoder and b_i is that feature’s bias), we have, from a single example, isolated a basketball-related probe (E_i x + b_i > 0). This also indicates that the subject_enrichment phase has progressed by layer 9, such that information about the sport is present on the [ Jordan] token.
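For concreteness, the resulting single-feature probe is just that feature's encoder row and bias applied to the residual stream; a sketch with hypothetical variable names:

```python
def sport_probe(resid_batch, enc_row, enc_bias):
    """Single-feature probe E_i x + b_i > 0 from one SAE encoder row.

    resid_batch: (n_examples, d_model) layer-9 residual stream vectors
    enc_row:     (d_model,) encoder vector E_i of the top feature
    enc_bias:    scalar bias b_i of that feature
    """
    return (resid_batch @ enc_row + enc_bias) > 0
```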

We repeat this analysis with "Babe Ruth" for baseball and "Serena Williams" for tennis, finding in each case that the top feature by attribution fired on baseball-related texts ("Dodgers", "Orioles", "yankees") and tennis-related texts ("WTA"), respectively.

We note a number of features that respond to the ends of names or surnames ("Judas Priest" in one, "John Cena" in another, "Carl Jung" in a third) and that have significant attribution for "Jordan" and "Ruth" but not "Williams". We hypothesized that this might be because the model has learned to associate "tennis" with the first name "Serena", but with the full names "Michael Jordan" and "Babe Ruth". We took a handful of athlete names and asked the model for completions of "Fact: [ATHLETE FIRSTNAME] Smith plays the sport of". We found that the model's predictions matched those for [ATHLETE NAME] exactly in the cases where the "Judas Priest" feature had low attribution on the original text, and that the predictions were damaged in the cases where it had high attribution. The presence and effects of this feature provide additional evidence that the model “detokenizes” its input in early layers.

Detecting Attention Heads that Act on Features

We then looked for attention heads that produced a large direct logit effect from the basketball feature (i.e., for which the difference between the [ basketball] and [ golf] entries of U\operatorname{norm}(OV\mathbf{d}_i) is large). We found many such heads, across many layers, with the top five heads belonging to layers 11, 13, and 14. The top logit effects of each head are basketball-related ("basket", "NBA"). Moreover, the same set of heads produces baseball-related words on the baseball feature and tennis-related words on the top tennis feature. This parallels the discovery in Nanda et al. of a special head which acted as a probe for sport, as well as the more recent identification of "Subject Heads" in Chughtai et al., who found that many heads are involved in extracting attributes from the subject in factual recall, producing the correct output via constructive interference.
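A sketch of that direct-logit-effect computation for a single head, assuming row-vector weight conventions and hypothetical names for the head's value/output matrices, the unembedding, and the pre-unembedding normalization:

```python
def head_direct_logit_effect(d_i, W_V, W_O, W_U, normalize, tok_a, tok_b):
    """Direct effect of one head's OV circuit applied to a dictionary vector.

    d_i:          (d_model,) SAE decoder vector for the sport feature
    W_V, W_O:     (d_model, d_head) and (d_head, d_model) value / output weights
    W_U:          (d_model, d_vocab) unembedding
    normalize:    stand-in for the pre-unembedding normalization, e.g. x / ||x||
    tok_a, tok_b: vocabulary indices of [ basketball] and [ golf]
    """
    ov_out = (d_i @ W_V) @ W_O        # what the head writes if it attends to d_i
    logits = normalize(ov_out) @ W_U  # project through the unembedding
    return logits[tok_a] - logits[tok_b]
```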

We hope this exercise demonstrates the utility of using SAEs to kickstart circuit analysis.







Update on Dictionary Learning Improvements

Adly Templeton, Tom Conerly, Jonathan Marcus, and Tom Henighan

In previous updates, we’ve mentioned a number of interventions which we found to be meaningful improvements to SAE optimization at the time. Since then, we have made a number of other changes, including bug fixes, which have improved optimization and decreased our training losses. With these changes in place, we no longer see meaningful improvements from those earlier interventions.

While these were improvements to the training process we used at the time, other changes have superseded them. We intend to publish these changes in a future update once we’ve vetted them further.







Research By Other Groups

Joshua Batson, Nicholas L Turner, Brian Chen, and Adam Pearce

Finally, we'd like to highlight a selection of recent work by a number of researchers at other groups which we believe will be of interest to you if you find our papers interesting.

Othello and Automata

In "Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT", He, Ge, Tang, Sun, Cheng, and Qiu use sparse autoencoders as a tool to investigate circuits in a model trained to play legal games of Othello (see Li et al.). They train separate SAEs on word embeddings, MLP outputs, and attention outputs, and find features corresponding to specific states for cells (current move, current state, legal to move in) as one moves through the layers. They identify these features using a visualization toolkit which may be useful for other people working on Othello. They then ask how, on specific dataset examples, the features in later layers are computed as functions of features in earlier layers by multiplying through attention OV and MLP weights (with a fixed gating mask). For example, an MLP output feature in layer 1 that generally corresponds to cell c-3 being "mine" is computed on a specific example where it gets flipped as an AND of three attentional features – d-2 "mine", c-2 "theirs", and b-4 "current" – which are in turn OV outputs of embedding features corresponding to each original move. They also decompose attention scores in terms of features, and find mechanisms supporting heads that exclusively attend to "mine" and "theirs" moves.

As the authors note, the L0-norm of feature activations is quite large (means of ~60–100 for the different SAEs with d_model of 128), with near-zero reconstruction error. In our experience, such dictionaries tend to be polysemantic in lower activations. We would also be curious to see the feature interpretations investigated using the activation specificity/sensitivity analysis in Bricken et al. It would also be interesting to see followup work with sparser dictionaries, which might make the circuit analyses crisper and more explanatory with respect to ablations.

Othello may also prove to be a useful laboratory for investigating the more general class of highly parallel algorithms implemented by transformers. The naive algorithm for finding legal moves in Othello takes N steps of sequential computation to determine the board state after N moves: simply play the game, flipping stones after each move. However, an L-layer transformer only allows for ~2L sequential calculations. For games with N > 2L, well-performing transformers must have learned a different algorithm. A similar phenomenon was explored in detail last year in Transformers Learn Shortcuts to Automata by Liu, Ash, Goel, Krishnamurthy, and Zhang, who found that "a low-depth Transformer can represent the computations of any finite-state automaton (thus, any bounded-memory algorithm), by hierarchically reparameterizing its recurrent dynamics." They construct transformers with width polynomial in the number of automaton states T and depth O(log(T)) in general, find that solutions with depth O(1) often exist, and show that these shortcut solutions can be learned in practice. In recent related work, Transformers, parallel computation, and logarithmic depth, Sanford, Hsu, and Telgarsky showed that k-hop induction tasks, which naively require k sequential iterations of the induction mechanism, can be implemented by transformers of depth log(k). It would be quite interesting to see a feature-based analysis like that of He et al. applied to automata, and, conversely, to identify the shortcut strategies being learned by Othello transformers.
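To make the sequential-versus-shortcut contrast concrete, here is a toy sketch (ours, not from either paper): composing automaton transition maps pairwise in a tree reproduces the result of N sequential steps using only ~log N rounds of parallel composition, the kind of computation a log-depth transformer can express.

```python
def compose(f, g):
    """Compose two transition maps (tuples: state -> next state): apply f, then g."""
    return tuple(g[f[s]] for s in range(len(f)))

def run_sequential(transition_maps, start_state=0):
    """Naive algorithm: N sequential steps, one per move."""
    state = start_state
    for t in transition_maps:
        state = t[state]
    return state

def run_parallel(transition_maps, start_state=0):
    """Shortcut: pairwise-compose maps in a tree, ~log N rounds of parallel work."""
    maps = list(transition_maps)
    while len(maps) > 1:
        paired = [compose(maps[i], maps[i + 1]) for i in range(0, len(maps) - 1, 2)]
        if len(maps) % 2:        # odd map left over, carry it to the next round
            paired.append(maps[-1])
        maps = paired
    return maps[0][start_state]

# Example: a 3-state cyclic automaton, alternating "+1" and "+2" moves.
moves = [(1, 2, 0), (2, 0, 1)] * 4
assert run_sequential(moves) == run_parallel(moves)
```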

Computation in Superposition

"Towards a Mathematical Framework of Computation in Superposition" describes a set of asymptotically optimal constructions of transformer layers that implement many-to-many boolean functions of sparse inputs. In their setting, each input variable x_i is assigned an embedding vector v_i in the residual stream, the full boolean state is represented as a sum of the vectors corresponding to True variables: \sum_i x_i v_i, and the output of the transformer is transformed back into a boolean vector by multiplication by an unembedding U f(x) and then rounding to 0/1.

Their MLP constructions provide an explanation for the empirical result that arbitrary XORs of input features can be read off from residual stream representations. They also suggest an interpretation of attention circuits as counting “skip feature bigrams”, which motivates one of their constructions. Some notable results are highlighted below. The engines of these proofs are (1) the fact that random unit vectors in high dimensions have inner products of order 1/\sqrt{d} and (2) distortion bounds à la Johnson-Lindenstrauss.

In line with the setting above, the authors say a property is “linearly represented” in the residual stream at a certain point if it is linearly accessible up to an additive error. That is, some boolean function f(x) is represented in some vector z(x) if there exists a covector v_f^T such that v_f^T z(x) = f(x) \pm \epsilon on all (k-sparse) inputs. It is possible for two vectors z and z^\prime to share many properties without being close in an L^2 or MSE sense, and it is this gap that allows for many of the above constructions. Roughly speaking, L^2-closeness constrains expressivity to the order of the number of neurons or the dimension of the model, while this looser notion constrains expressivity to the order of the number of parameters divided by the logarithm of the number of properties. This is the information-theoretic limit up to a multiplicative constant: the number of bits in a model is proportional to its parameter count, and a table storing properties for entities requires on the order of log(number of properties) bits per property per entity.
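A quick numerical illustration of this kind of linear readout (ours; the sizes are arbitrary): pack a k-sparse boolean vector into d dimensions as a sum of random unit embeddings, then read each variable back off with its own embedding and round. The interference on each readout is of order \sqrt{k/d}, so the recovery succeeds up to a small additive error.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vars, d_model, k = 10_000, 1024, 10   # many variables, far fewer dimensions, sparse input

# Random unit embeddings v_i; any two have inner product of order 1/sqrt(d_model).
V = rng.normal(size=(n_vars, d_model))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# A k-sparse boolean input x, represented as z(x) = sum_i x_i v_i.
true_idx = rng.choice(n_vars, size=k, replace=False)
z = V[true_idx].sum(axis=0)

# Linear readout with each variable's own embedding, then rounding:
# v_i . z = x_i + interference of size ~ sqrt(k / d_model).
recovered = (V @ z) > 0.5
x = np.zeros(n_vars, dtype=bool)
x[true_idx] = True
print((recovered == x).all())   # almost always True at these sizes
```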

Because these constructions are information theoretically tight, they are the kind of thing transformers might actually learn. As the authors note, dictionary learning approaches to untangling such solutions on the residual stream might benefit from using a thresholding activation function (rounding interference to 0), which is a stronger constraint than a ReLU activation function (which allows for significant negative interference).

Improvements to Attribution Patching

Activation patching (see e.g. Zhang & Nanda, 2023) is a common method for finding the components of an LLM that contribute to a behavior of interest. Briefly, one runs a forward pass of the LLM on some prompt with an expected answer (e.g. “The Eiffel Tower is in the city of” should produce “Paris”), modifies the activation of some component, and then reruns the remainder of that forward pass. Typically, the modified activation is copied from a forward pass on a different prompt. If the new forward pass no longer gives the expected answer, then the component is partially responsible for the original behavior. Although this method is conceptually simple and persuasive, it is expensive; if we want to find one out of 1,000 neurons contributing to some behavior, we would need to run 1,000 forward passes.
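As a concrete sketch, patching a single component can be written as a forward hook that overwrites the clean activation with its corrupted counterpart. This assumes a TransformerLens-style interface (run_with_cache / run_with_hooks); treat the exact names as an assumption rather than part of any paper's code.

```python
def patched_logit_diff(model, clean_toks, corrupt_toks, hook_name, ans_tok, foil_tok):
    """Activation patching for one component.

    Runs the clean prompt, but overwrites the activation at `hook_name` with
    its value from a corrupted-prompt run, then returns the answer-vs-foil
    logit difference at the final position.
    """
    _, corrupt_cache = model.run_with_cache(corrupt_toks)

    def overwrite(activation, hook):
        return corrupt_cache[hook.name]   # swap in the corrupted activation

    logits = model.run_with_hooks(clean_toks, fwd_hooks=[(hook_name, overwrite)])
    final = logits[0, -1]
    return (final[ans_tok] - final[foil_tok]).item()
```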

Attribution patching (see also Syed et al., 2023) is an efficient approximation to activation patching using gradients. It can be thought of as activation patching where the remainder of the forward pass is replaced with a linear approximation from a backwards pass. With attribution patching, we can examine the approximate influence of any component after just one forward and one backward pass. Unfortunately, the approximation isn’t always accurate.
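The corresponding attribution-patching estimate needs only the cached activations from the two forward passes and the gradient from a single clean backward pass; a minimal sketch with hypothetical tensor names:

```python
def attribution_patch_estimate(clean_act, corrupt_act, clean_grad):
    """First-order estimate of the effect of patching one activation.

    (corrupted - clean) activation, dotted with the gradient of the metric
    (e.g. the answer-vs-foil logit difference) taken on the clean run.
    All three tensors share the activation's shape, and the same two forward
    passes and one backward pass serve every component at once.
    """
    return ((corrupt_act - clean_act) * clean_grad).sum()
```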

In “AtP*: An efficient and scalable method for localizing LLM behaviour to components”, Kramár, Lieberum, Shah, and Nanda investigate failure modes of attribution patching, and develop two fixes that improve its accuracy without losing too much efficiency.

They validate their improvements by comparing their approximations to nodes' influences with the "ground truth" of activation patching, on Pythia-12B. They also demonstrate that attribution patching is more efficient than some other clever ways of approximating activation patching.

This paper – and the attribution patching strategy in general – is relevant to anybody who wants to do circuit finding at scale.

Self-Repair and LayerNorm

Besides computational cost, measuring the effect of a component through ablation is tricky: a later component may "self-repair" the impact of an ablation and leave the model output mostly unchanged. In Explorations of Self-Repair in Language Models, Rushing and Nanda quantify self-repair by measuring, on pretraining data, the difference between the total effect of ablating an attention head on the logit of the next token and the direct effect of what the head writes to the residual stream.

Surprisingly, they find that the non-linearity in LayerNorm explains ~30% of the self-repair that occurs in the 2% of tokens with the highest direct effect. Multiple components of the model push to predict the same token, so ablating just one of them reduces the norm of the residual stream while largely preserving its direction. This reduces how much LayerNorm scales the residual stream down, boosting the logits from the other components and counteracting some of the impact of the ablation.
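A toy numerical illustration of this mechanism (ours, with made-up magnitudes): several components write the same direction into the residual stream on top of unrelated background, so ablating one of them shrinks the residual norm, the remaining contributions are scaled up by LayerNorm, and the logit falls by much less than the ablated component's direct effect.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 256

def ln_scale(x):
    return x / x.std()   # LayerNorm's scaling step (mean and learned params omitted)

# A "correct token" direction plus unrelated residual-stream background.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
background = rng.normal(size=d_model)
background *= 6.0 / np.linalg.norm(background)

# Three heads each write ~3 units along the same direction.
head_writes = [3.0 * direction for _ in range(3)]
resid_full = background + sum(head_writes)
resid_ablated = background + sum(head_writes[1:])   # ablate one head's write

logit_full = ln_scale(resid_full) @ direction
logit_ablated = ln_scale(resid_ablated) @ direction

# The ablated head's direct effect, measured at the full run's LayerNorm scale:
direct_effect = (head_writes[0] / resid_full.std()) @ direction

print(f"direct effect removed: {direct_effect:.2f}")
print(f"actual logit drop:     {logit_full - logit_ablated:.2f}")   # noticeably smaller
```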

McGrath et al. speculated that self-repair might make the model more robust during training, while McDougall et al. identified "copy suppression heads" that reduce model overconfidence in a way that shows up as self-repair when ablating earlier copying heads. These results suggest that at least some of the self-repair phenomenon is a byproduct of LayerNorm, which may be useful for other reasons, rather than a behavior that received direct optimization pressure.