Transformer Circuits Thread

Circuits Updates — May 2023



We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research on which we expect to publish more in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.

We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper.








Attacking Superposition with Dictionary Learning

Trenton Bricken, Joshua Batson, Adly Templeton, Adam Jermyn, Tristan Hume, Tom Henighan, Chris Olah

We are often asked about where our work on superposition is going. Over the last few months, we've run several more ad-hoc experiments on superposition in real models (which sometimes produced interesting but inconclusive results), as well as exploring a variety of questions related to the theory of superposition, such as in our recent memorization paper.

Our ad-hoc experiments have persuaded us that solutions to superposition won't be extremely low-hanging fruit, and that more systematic, focused efforts will be necessary. To that end, we're presently focusing on what we've described as "Approach 2" in the Toy Models paper: extracting features from superposition by using dictionary learning on the activations of a trained, dense model. (This approach has also been investigated by Sharkey et al., who provided us with helpful comments.) We're building up infrastructure to do a systematic, large-scale investigation, with the hope of either finding superposition in real models or making a significant update against this decoding approach.

Informally, we've found cases where sparse factorizations of neural network activations seem to produce components which suggest simple hypotheses on inspection. But we don't yet have anything that persuades us that these are the "fundamental truth" of the models that we're studying, rather than a convenient lens which might reveal some features while obscuring others.

In the meantime, we do have a few more conceptual contributions, which can be found in the comments Features as The Simplest Factorization and Dictionary Learning Worries.







Features as The Simplest Factorization

Trenton Bricken, Joshua Batson, Adly Templeton, Adam Jermyn, Tom Henighan, Chris Olah

As investigation of superposition has progressed, it's become clear that we don't really know what a "feature" is, despite them being central to our research agenda. Several previous definitions have been considered, but all seem unsatisfying. This isn't necessarily bad – sometimes uncertainty about definitions is a very fruitful avenue for science! – but it points at a major open question for us.

In parallel with this, attempts to use dictionary learning or sparse coding methods to automatically discover features in superposition have run into a major challenge. These methods require one to pick a number of features to attempt to decompose the activations into. But how can one know whether one has picked the right number, large enough to "get all the features" but not so large as to split a true feature into many parts? A recent report by Sharkey et al. proposes some heuristics, but the answer seems non-obvious.

We wonder if it might be possible to answer both of these questions at once by defining features as "the simplest factorization of the activations".

More formally, given a sparse factorization of the activations A=SD (S is the sparse code, D is the dictionary), we can ask how much information it takes to represent S and D. We define this “total information” by fitting a probability distribution to the entries of the matrices and computing its entropy. Larger dictionaries tend to require more information to represent, but sparser codes require less information to represent, which may counterbalance.

(The initial experiments below measure information by modeling the distribution of entries in each matrix with a 100 bin histogram, with the largest bin determined by the maximum matrix entry over all experiments. We then take the surprisal of each entry under this distribution, and sum to get the total information. However, we expect our exact formulation of this to change as our investigation continues.)
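To make this measurement concrete, here is a minimal numpy sketch of a histogram-based information estimate of this kind. The handling of bin ranges and empty bins is an assumption for illustration, and (as noted above) our actual formulation differs in details and is likely to change.

```python
import numpy as np

def matrix_information(M, n_bins=100, bin_max=None):
    """Total information (in bits) of a matrix's entries under a histogram model.

    Fit an n_bins-bin histogram to the entries, treat the normalized counts as a
    probability distribution, and sum the surprisal of every entry under it.
    """
    entries = M.ravel()
    if bin_max is None:
        bin_max = entries.max()  # stand-in for "maximum entry over all experiments"
    counts, edges = np.histogram(entries, bins=n_bins, range=(entries.min(), bin_max))
    probs = counts / max(counts.sum(), 1)
    bin_idx = np.clip(np.digitize(entries, edges[1:-1]), 0, n_bins - 1)
    return float(np.sum(-np.log2(probs[bin_idx] + 1e-12)))

def total_information(S, D, **kwargs):
    """Information needed to represent a factorization A ~= S @ D."""
    return matrix_information(S, **kwargs) + matrix_information(D, **kwargs)
```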

It turns out that measuring information this way seems to be an effective tool for determining the "correct number of features", at least for synthetic data. We consider a variety of synthetic dictionary learning tasks in which we take sparse vectors (of dimension n_sparse, with n_active non-zero entries) and randomly project them into a 32-dimensional space. The goal is to recover the original sparse structure. We perform dictionary learning (essentially MOD with a LASSO inner loop), varying both the L1 coefficient and the dictionary size. We then make a parametric plot of Mean Squared Error versus this notion of "Total Information".
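A minimal sketch of this kind of synthetic experiment is below. It uses scikit-learn's DictionaryLearning as a stand-in for the MOD-with-LASSO procedure, and the total_information sketch above; the values of n_sparse, n_active, and the sweep grid are placeholders rather than our actual settings.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
n_sparse, n_active, n_dense, n_samples = 256, 4, 32, 2_000

# Ground truth: sparse vectors randomly projected into a 32-dimensional space.
true_dict = rng.normal(size=(n_sparse, n_dense))
true_dict /= np.linalg.norm(true_dict, axis=1, keepdims=True)
codes = np.zeros((n_samples, n_sparse))
for row in codes:
    row[rng.choice(n_sparse, size=n_active, replace=False)] = rng.uniform(size=n_active)
activations = codes @ true_dict          # the observed activations to be factorized

# Sweep dictionary size and L1 coefficient, recording MSE and "total information"
# for a parametric (rate-distortion style) plot.
results = []
for dict_size in (64, 128, 256, 512):
    for l1_coef in (0.03, 0.1, 0.3):
        learner = DictionaryLearning(n_components=dict_size, alpha=l1_coef,
                                     transform_algorithm="lasso_lars", max_iter=500)
        S = learner.fit_transform(activations)   # learned sparse code
        D = learner.components_                  # learned dictionary
        mse = np.mean((activations - S @ D) ** 2)
        results.append((dict_size, l1_coef, mse, total_information(S, D)))
```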

We observe that dictionary learning solutions "bounce" when the dictionary size matches the true number of factors. Put another way, the Pareto frontier of this rate-distortion plot is occupied by the solutions with the correct number of learned factors. (This is also where we obtain the best MMCS score, a metric introduced by Sharkey et al. for evaluating factorizations when the correct answer is known.)
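For reference, here is a minimal sketch of how an MMCS-style score can be computed; this is one reading of Sharkey et al.'s mean max cosine similarity metric, and their exact formulation may differ.

```python
def mmcs(learned_dict, true_dict):
    """Mean max cosine similarity: for each ground-truth dictionary element,
    take the best cosine similarity with any learned element, then average."""
    L = learned_dict / np.linalg.norm(learned_dict, axis=1, keepdims=True)
    T = true_dict / np.linalg.norm(true_dict, axis=1, keepdims=True)
    return float((T @ L.T).max(axis=1).mean())
```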

If such bounces could be found in real data, it would seem like significant evidence that there are "real features" to be found.







Dictionary Learning Worries

Tom Henighan, Chris Olah

Dictionary learning is presently our top contender for how to extract features out of superposition (following "Approach 2" to solving superposition). If we believe that our activations are described by the factorization A=SD where S is a sparse matrix (the "true sparse features") and D is the "dictionary" of unit vectors projecting them to the observed activations, then dictionary learning is a well-established set of tools for solving this problem.

Unfortunately, there are at least two major ways in which we might wish to solve a subtly different problem:

All these concerns point towards using the kind of sparse autoencoder setup explored by Sharkey et al. over a full-blown dictionary learning setup. However, we've found that sparse autoencoders are more fragile and sensitive to hyperparameters, which is a significant countervailing consideration in using them. We are interested in finding an approach with the advantages of a sparse autoencoder (in terms of only finding true features) and the consistent trainability of the dictionary learning schemes.
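For concreteness, here is a minimal PyTorch sketch of the kind of sparse autoencoder setup explored by Sharkey et al.; the untied decoder, the ReLU encoder, and the penalty coefficient are illustrative assumptions rather than a description of their exact architecture (or of ours).

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstruct activations through a wide, sparsely-activating bottleneck."""
    def __init__(self, n_dense, n_features):
        super().__init__()
        self.encoder = nn.Linear(n_dense, n_features)
        self.decoder = nn.Linear(n_features, n_dense, bias=False)

    def forward(self, acts):
        code = torch.relu(self.encoder(acts))   # sparse, non-negative feature activations
        recon = self.decoder(code)              # decoder weights play the role of the dictionary
        return recon, code

def sae_loss(recon, code, acts, l1_coef=1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparse codes.
    return ((recon - acts) ** 2).mean() + l1_coef * code.abs().mean()
```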

We also have other worries – such as correlated features which may be more difficult to pull apart – which could be of significant concern to these efforts, but aren't specific to the dictionary learning setup.







Fractional Dimensionality and "Pressure"

Tom Henighan, Chris Olah

In Toy Models of Superposition, perhaps the most surprising result was that toy model features often arranged themselves into uniform polyhedra in superposition, with the specific polyhedra varying by sparsity. However, in a recent comment, we found this is more sensitive to the amount of "feature pressure" (the ratio of the number of features the model would ideally represent to the number of dimensions it has to represent them), and to how long the model is trained. In particular, there are regimes where having more features competing for representation, or training the model for longer, produces clean geometry that would otherwise not exist.
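For readers who want to poke at this themselves, here is a minimal sketch of the ReLU-output toy model with an explicit "feature pressure" knob; the hyperparameters (feature count, sparsity, importance decay, training length) are placeholders rather than the settings used in our experiments.

```python
import torch

def train_toy_model(n_features=400, n_dims=30, sparsity=0.999,
                    steps=20_000, batch_size=1024):
    """ReLU-output toy model: x' = ReLU(x W W^T + b), with importance-weighted MSE.

    "Feature pressure" is roughly n_features / n_dims; raising n_features (even when
    most of the extra features end up unrepresented) can change the learned geometry,
    as can training for more steps.
    """
    W = torch.nn.Parameter(0.1 * torch.randn(n_features, n_dims))
    b = torch.nn.Parameter(torch.zeros(n_features))
    opt = torch.optim.Adam([W, b], lr=1e-3)
    importance = 0.9 ** torch.arange(n_features, dtype=torch.float32)

    for _ in range(steps):
        active = (torch.rand(batch_size, n_features) > sparsity).float()
        x = torch.rand(batch_size, n_features) * active   # sparse feature activations
        recon = torch.relu(x @ W @ W.T + b)
        loss = (importance * (x - recon) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return W.detach(), b.detach()
```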

We're confused why having more features – which the model ultimately chooses to not represent – affects the geometry of the solution so much. One hypothesis is that even as tiny features, they inject noise. Another hypothesis is that the model is actually effectively using "epsilon features" in superposition somehow.







The "Two Circle" Phenomenon in Memorization

Tom Henighan, Chris Olah

In a recent comment on Superposition, Memorization, and Double Descent, we observed that problems with m=2 hidden dimensions occasionally have data points that arrange themselves on two circles of different radii. While we believe the specific phenomenon is likely a quirk of optimization in 2D, it's an interesting case study in the geometry of superposition and memorization.







Weight Superposition

Chris Olah

We typically think about superposition as a phenomenon where features are put in superposition. For example, we might have features X^* which are put into superposition X by a map U.

But this picture doesn't really help us reason about what kinds of computation a neural network can do while in superposition. We know that some kinds are possible – but what?

To answer this, it's helpful to reason about how weights are put in superposition. If we have two layers X and Y (both in superposition according to matrices U^X and U^Y), we can imagine there being "ideal weights" between these features for computing the second layer from the first. When the two layers are put in superposition, the ideal weights must also be mapped into superposition.

What are the properties of this map? Observe that we want e_i^T W^* e_j = {U^Y_i}^T W U^X_j. Therefore, if we decompose the ideal weights into their entries, we get the following transformation:

W^* ~=~ \sum_{i,j} W^*_{i,j} e_i \otimes e_j ~~\to~~ \sum_{i,j} W^*_{i,j} U^X_i \otimes U^Y_j ~\simeq~ W

(Modulo the possibility of interference we'll discuss below.)

Equivalently, one can define the map as a tensor product, U^W = U^X \otimes U^Y.

In feature superposition, the interference between two features X^*_i and X^*_j is governed by \langle U_i, U_j\rangle. Weight superposition has something analogous. Two weights W^*_{i,j} and W^*_{k,l} have interference governed by \langle U^W_{i,j}, ~U^W_{k,l} \rangle_F ~=~ \langle U^X_{i}\!\otimes U^Y_{j}\!, ~U^X_{k}\!\otimes U^Y_{l} \rangle_F. Interestingly, weights appear able to have "constructive interference" which is helpful, in contrast to feature superposition, where interference always seems harmful.
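A small numerical sketch may make this concrete. Here we read the tensor product as an outer product mapping the X representation to the Y representation (conventions may be transposed relative to the formula above), and the random near-orthogonal embeddings and sparse ideal weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, n_dims = 50, 200

# Columns U^X_i, U^Y_j are unit vectors embedding features in the two layers.
UX = rng.normal(size=(n_dims, n_feat)); UX /= np.linalg.norm(UX, axis=0)
UY = rng.normal(size=(n_dims, n_feat)); UY /= np.linalg.norm(UY, axis=0)

# Sparse "ideal weights" between features (sparsity keeps interference small here).
W_star = rng.normal(size=(n_feat, n_feat)) * (rng.random((n_feat, n_feat)) < 0.05)

# Superposed weights: W = sum_{i,j} W*_{ij} (U^Y_i outer U^X_j).
W = sum(W_star[i, j] * np.outer(UY[:, i], UX[:, j])
        for i in range(n_feat) for j in range(n_feat) if W_star[i, j] != 0)

# Read an ideal weight back out (up to interference): W*_{ij} ~= (U^Y_i)^T W U^X_j.
i, j = np.argwhere(W_star != 0)[0]
print(W_star[i, j], UY[:, i] @ W @ UX[:, j])

# Interference between two ideal weights W*_{ij} and W*_{kl} is governed by
# <U^X_i (x) U^Y_j, U^X_k (x) U^Y_l>_F = <U^X_i, U^X_k> * <U^Y_j, U^Y_l>.
```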

All of this is just preliminary thinking on this question, but it seems to give us a tool for reasoning about what weight matrices are possible to represent in superposition, and thus what kinds of computation it's possible to do while in superposition.







Attention Head Superposition

Adam Jermyn, Chris Olah, Tom Henighan

In Toy Models of Superposition, we saw that when features are sparse, simple neural networks can represent more features than they have neurons through the phenomenon of superposition. We think something analogous can happen with attention heads, with "attention circuits" and "attentional features" being stored in superposition over attention heads. We use the term "attentional feature" to describe relationships between pairs of tokens, which correspond to linear combinations of attention heads attending between a pair. By "attentional circuit", we refer to the overall computation implemented by an attention head, which in the case of a one-layer model implements skip-trigrams.

For now, we’ll talk about skip-trigrams ([A]…[B] → [C]) as our basic attentional circuits. This is a restricted definition, as we think there are more general kinds of attentional circuits, but they seem sufficient to demonstrate attention superposition.

We trained toy models — small one-layer transformers with trivial embeddings and unembeddings — to investigate how and under what circumstances attention heads place circuits in superposition. The training data were sequences of tokens which were chosen uniformly except for over-representing certain skip-trigrams.

We focus in particular on skip-trigrams which are "OV-incoherent," meaning that they attend from multiple different tokens back to a single token, and the output depends on the token attended from. A single attention head cannot implement multiple such skip-trigrams without introducing errors in its output, because the OV circuit does not know which token is being attended from.
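To make the setup concrete, here is a minimal sketch of how training data like this can be generated; the vocabulary size, sequence length, and the particular OV-incoherent skip-trigrams (three patterns sharing the attended-to token 7) are illustrative assumptions, not our exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len, n_seqs = 20, 12, 50_000

# OV-incoherent skip-trigrams [A]...[B] -> [C]: all three share the attended-to
# token B = 7, but the correct output C depends on the attended-from token A,
# which a single head's OV circuit cannot see.
skip_trigrams = {(3, 7): 11, (4, 7): 12, (5, 7): 13}

sequences = rng.integers(0, vocab_size, size=(n_seqs, seq_len))
for seq in sequences:
    # Over-represent the skip-trigrams by planting one in each otherwise-uniform sequence.
    (a, b), c = list(skip_trigrams.items())[rng.integers(len(skip_trigrams))]
    i, j = sorted(rng.choice(seq_len - 1, size=2, replace=False))
    seq[i], seq[j], seq[j + 1] = a, b, c   # ...[A]...[B][C]...
```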

Attention head wiring diagram: the first column is the token attended from, the second is the token attended to, and the thickness of the lines connecting them indicates the strength of attention. The final three columns show the changes to the output logits caused by attending to the token in the second column; red indicates an increase in the logit and blue indicates a decrease.

What we see is that when the ground truth contains more incompatible trigrams than there are attention heads, models resort to placing them in superposition across heads. Above, each trigram is encoded in superposition between at least two attention heads. These results suggest caution in studying the role of a single attention head in isolation, as circuits implemented in superposition can appear misleading when only partially interpreted.

These wiring diagrams are simple for models trained on small numbers of skip-trigrams, but rapidly become too complex to read as the training data become more complex. Despite this, we see tantalizing evidence of beautiful geometry underlying even cases with many skip-trigrams, which we are excited to investigate further.







Feature Manifold Toy Model

Chris Olah, Josh Batson

In our toy model investigations of superposition, we assume the existence of discrete features and see how an autoencoder represents them. But in real life, features may lie on a manifold, where nearby features respond to similar data. What should we expect neural networks to do in such cases? In many empirical cases, neural networks model the manifold with families of equivariant neurons, representing the manifold in terms of discrete units rather than representing the manifold directly. Should we expect this to always happen? Why does it happen? 

For example, a vision model might want to represent curves in different orientations; the set of possible orientations naturally defines a 1D manifold, a circle. One could imagine the network having a single neuron whose activation represents the angle of the curve, or two neurons whose activations represent the sine and cosine of the angle. Instead, in Curve Detectors, Cammarata et al. find many (~10) neurons which each respond to curves in a specific range of orientations. This difference in representational strategy seems somewhat analogous to the distinction between "value coding" and "variable coding" in neuroscience (see Thorpe 1989).

We present some extremely preliminary results investigating this question by considering a toy problem with a "feature manifold" rather than discrete features. We then study what happens as we change the length scale (\ell) within which the model cares about resolving positions on the feature manifold.

Our basic setup will be the ReLU-output problem from the Toy Models paper. Instead of having the data be independent features, we imagine having a large number of features arranged around a circle, with equal angular spacing. We first fix a length scale (\ell) for the problem. To generate a data point, we pick a random angle (\theta) and an activation magnitude (m). The feature x_\phi at angle \phi around the circle activates on that datapoint if \phi is close to \theta, where “close” is determined by the length scale:

x_\phi ~=~ \begin{cases} ~m\cos(\frac{\phi-\theta}{\ell}) & ~\text{if}~~\frac{|\phi-\theta|}{\ell} \leq \frac{\pi}{2}\\ ~0 & ~\text{otherwise} \end{cases}
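Here is a minimal numpy sketch of this data-generating process; the number of features on the circle and the uniform magnitude distribution are illustrative assumptions.

```python
import numpy as np

def sample_manifold_batch(n_features=256, length_scale=0.5, batch_size=1024, seed=0):
    """Activations of n_features features spaced uniformly around a circle.

    Each data point picks a random angle theta and magnitude m; the feature at
    angle phi fires as m * cos((phi - theta) / ell) when |phi - theta| / ell <= pi / 2.
    """
    rng = np.random.default_rng(seed)
    phi = np.linspace(0, 2 * np.pi, n_features, endpoint=False)   # feature angles
    theta = rng.uniform(0, 2 * np.pi, size=(batch_size, 1))       # data-point angle
    m = rng.uniform(0, 1, size=(batch_size, 1))                   # activation magnitude

    delta = (phi - theta + np.pi) % (2 * np.pi) - np.pi           # wrapped angular distance
    scaled = delta / length_scale
    return np.where(np.abs(scaled) <= np.pi / 2, m * np.cos(scaled), 0.0)
```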

We can now study how the features are embedded as we vary the length scale:

This kind of emergent discretization (which we're increasingly seeing hints of across a variety of problems) seems like it might be a very important phenomenon. It may be that "emergent discretization" is the thing we mean when we talk about features.

One caveat to this work is that we've seen some hints that the smallest length scale discretization may be an optimization failure. Additional research is needed to understand this phenomenon.







New Comments Digest

Transformer Circuits periodically publishes comments on our papers, both from external parties and by the authors. Some of these comments were submitted before publication, from reviewers of early draft manuscripts. But others are submitted significantly after the fact, and might not be seen. To that end, we've included a digest of recently added comments:

A Mathematical Framework for Transformer Circuits

In-context Learning and Induction Heads

Toy Models of Superposition

Superposition, Memorization, and Double Descent







Our Recent Publications

Over the last few months, we've also published a few smaller papers which you might not have seen (including one "perspective" article – Interpretability Dreams – being released along with this post).







Research By Other Groups

Finally, we'd like to highlight recent work by a number of researchers at other groups which we believe will be of interest to you if you find our papers interesting.

On the Nature of Representations…

Linear Representations. One of the most fundamental assumptions we personally tend to make in studying neural networks is the linear representation hypothesis: neural network features are represented by directions. While this is a common hypothesis, it isn't known to be true.

A recent back and forth between Li et al. and Nanda (in the context of language models trained on Othello games) is perhaps the strongest evidence yet from a Popperian perspective: the linear representation hypothesis made a prediction which was contrary to evidence at that point, and was then validated (see Neel Nanda's comment here). It should be mentioned that there are many other reasons to be excited about this work – we discuss it more below – but we wanted to particularly highlight this as an example of excellent scientific discourse and the evidence it seems to provide for a question of very broad significance to the field.

More generally, a wide range of other papers have continued to provide more empirical examples of seemingly linearly represented features. Perhaps the most striking is Turner et al. (who do vector arithmetic to control language models), but see also Gurnee et al. and more generally all the papers mentioned in the following section on what features can be found in language models.

What Features Exist Inside Language Models? Ultimately, our goal is to understand language models. While it's often tempting to emphasize methods or theories, the bread and butter of mechanistic interpretability research must be something similar to the study of anatomy in biology: characterizing features and circuits that exist in language models. On this note, Yun et al., Gurnee et al., and Bills et al. – while all also notable for other contributions – deserve attention for their qualitative results on what features exist inside language models.

Superposition. In the last few months, significant progress has been made by our colleagues at other groups on superposition. Sharkey et al. attempted to decode superposition in real models, using sparse autoencoders. Yun et al. apply dictionary learning to transformer residual streams and recover many interpretable features. Gurnee et al. apply sparse linear probes to transformers and find, among other things, evidence of low-level linguistic features being represented in superposition over small sets of neurons. Lindner et al. created a tool to compile programs into transformers using superposition. Jermyn et al. explore approaches to encouraging monosemantic neurons. Scherlis et al. examine superposition from the perspective of constrained optimization. Hobbhahn published two posts extending our investigation of superposition and memorization.

One detail from Hobbhahn's posts which we wanted to highlight is that some models seem to have a kind of "shifted superposition" where the model shifts data points to avoid ReLU. This is in contrast to the intuition one might naively have that ReLU would in fact anchor the superposition at 0 due to its special behavior there.

Othello & World Models. In the context of language models, there's been an ongoing debate about whether they're "just doing statistical pattern matching" or they "understand". This conversation has often been polarized and disconnected from specific mechanistic hypotheses of what's going on. However, a recent paper by Li et al. – and follow up work by Nanda – used probes to provide evidence that language models trained to play Othello have an internal representation of the state of the board. This is both a nice example of progress in mechanistic understanding, and is also perhaps an example of how mechanistic interpretability can help us have more productive dialogues about neural networks.

Larger-Scale Structure

How is factual knowledge retrieved? A recent paper by Geva et al. continues the very fruitful line of investigation on activation patching methods (see Meng et al.), which allows for larger-scale understanding of how transformers process information. This new paper looks into how somewhat more complex queries about knowledge are processed by language models. In particular, where prior work showed that attention heads were important for moving information from a subject token, this work suggests that the OV circuit of attention heads can also transform that information, for example reading in a country and writing out its capital.

Methods

Activation Patching Continues. As mentioned above, we're continuing to see exciting work based on the activation patching approach (see Meng et al.), most recently by Geva et al.

Automated Interpretability. One of the most common (and very reasonable) critiques of mechanistic interpretability is that it can't scale to large models. A recent paper by Bills et al., "Language models can explain neurons in language models", provides a proof of concept for automating parts of mechanistic interpretability. This approach would still require a solution to superposition, but it's potentially an exciting way to address the scalability concern. At the same time, we also have some reservations about this kind of automation, especially when the goal is safety. Do we really want our auditing of AI models to depend on trusting an AI model to help us with auditing? A critical question is whether alternative approaches to addressing the scalability problem can be found. Either way, this kind of method seems helpful in the meantime – and the qualitative results are also very interesting.

Attribution patching. In a recent post, Neel Nanda describes a method called "attribution patching" which he developed in collaboration with several of us a while back. It's exciting to see this written up! Using gradient activation products to perform quick attributions to various intermediate computations was quite useful for investigations in the vision context (see Building Blocks of Interpretability), and seems helpful as a way to investigate larger models. However, be sure to pay attention to Neel's cautionary notes on when this works, especially the section on LayerNorm. (We can file this as reason #78 for why interpretability researchers hate LayerNorm.)
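Roughly (this is a sketch of the idea rather than Neel's implementation), the method approximates the effect of patching each activation from a clean run into a corrupted run with a first-order Taylor expansion:

```python
import torch

def attribution_patch(metric, corrupted_acts, clean_acts):
    """First-order estimate of the effect of patching clean activations into a corrupted run.

    `metric` is a scalar computed from the corrupted forward pass, and each tensor in
    `corrupted_acts` must require grad. The estimated effect of patching is
    (clean - corrupted) dotted with d(metric)/d(corrupted activation).
    """
    grads = torch.autograd.grad(metric, corrupted_acts, retain_graph=True)
    return [((clean - corr) * grad).sum(dim=-1)   # one score per position / head / etc.
            for corr, clean, grad in zip(corrupted_acts, clean_acts, grads)]
```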

Mechanistic Interpretations of Learning Dynamics and Scaling

Can we explain learning dynamics and scaling laws in terms of circuits? We know that induction heads cause a loss bump in training when they form, and likely cause a bump in scaling laws. This suggests the tantalizing possibility of a deep bridge between the microscopic world of mechanistic interpretability and the more macroscopic topics of learning dynamics and scaling laws. Several recent papers have made us more hopeful that such a bridge can be found.

Quanta. Michaud et al. propose a theory of scaling in terms of "quanta" – discrete behavior patterns which reduce loss – along with an algorithm for automatically discovering these quanta based on gradients. A natural hypothesis is that these behavioral quanta mechanistically correspond to circuits, just as the "induction bump" in-context learning behavior corresponds to induction head circuits. If this could be demonstrated, it would create a much wider bridge from the microscopic world of circuits to the macroscopic world of losses, behaviors, scaling, and learning dynamics.

Mode Connectivity. Lubana et al. and Juneja et al. find an empirical relationship between generalization strategies – and likely the underlying mechanisms – and linear mode connectivity in the loss landscape. In particular, models with different generalization properties appear to have a loss barrier separating them if one linearly interpolates in parameter space.
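A minimal sketch of the kind of check involved (not the authors' code; model_a, model_b, loss_fn, and data are placeholders): interpolate linearly between two trained parameter vectors and look for a bump in the loss.

```python
import copy
import torch

def interpolation_losses(model_a, model_b, loss_fn, data, n_points=21):
    """Loss along the straight line in parameter space between two trained models."""
    losses = []
    for t in torch.linspace(0, 1, n_points):
        interp = copy.deepcopy(model_a)
        with torch.no_grad():
            for p, pa, pb in zip(interp.parameters(), model_a.parameters(),
                                 model_b.parameters()):
                p.copy_((1 - t) * pa + t * pb)
        losses.append(loss_fn(interp, data))   # a loss barrier shows up as a bump in this curve
    return losses
```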

Grokking

Reverse Engineering Grokking, Fourier Transforms, & Universality. In Progress Measures for Grokking via Mechanistic Interpretability, Nanda et al. reverse engineered a neural network doing modular arithmetic which turned out to be using Fourier transforms, and linked this to grokking behavior. Following up on this, Chughtai et al. found that if one trains neural networks to perform more general group operations, they learn to use group representations (a generalization to noncommutative groups of the Fourier transform for cyclic groups found in the first model). This is interesting both as a compelling example of reverse engineering simple models, and also as evidence for the universality hypothesis, as the authors find that each trained network utilizes a random subset of the group representations that exist.

Why does grokking occur? A recent paper by Liu et al. finds a systematic relationship between weight decay and the length of time it takes for grokking to occur. Roughly, they find that the model first finds a memorizing solution whose weight matrix has very large norm, and that grokking occurs when the model's weights shrink to the size of the generalizing solution. Qualitatively, the relationship they find between memorization and weight norm matches some of our observations on how memorization occurs mechanistically in toy models, as well as classic work by Bartlett showing that feed-forward networks with small weight norm generalize well.

Other Results

Neuroscience Parallels. Over the last few years, there have been a number of cases where mechanistic interpretability research discovered results which parallel findings in neuroscience, including curve detector neurons and person-detecting multimodal neurons. Recently, we've begun to see parallels which go in the other direction, with discoveries in artificial neural networks foreshadowing results in biological neuroscience.

The growing body of parallels, and the fact that they're going in both directions, seems suggestive of a genuine, deep connection. It also seems like evidence for a very strong version of the universality hypothesis.

Behavioral Control of GPT with Activation Addition. In a recent post, Turner et al. demonstrate that they can control language models by adding vectors to activations, defined simply by doing arithmetic on activation vectors. This extends earlier work on RL agents navigating a maze. It's interesting to speculate what the mechanism is – are they controlling low-level features related to a topic, high-level topic/theme features, "motor neurons" that directly implement behavior, or something else? More generally, it's another piece of evidence for the linear representation hypothesis.
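As a rough sketch of the arithmetic involved (the contrast prompts, layer, and scale are illustrative; see Turner et al. for their actual method): cache residual-stream activations for two contrasting prompts at one layer, and add a scaled difference during generation.

```python
import torch

def add_steering_vector(resid, acts_plus, acts_minus, alpha=5.0):
    """Add alpha * (activations for prompt A - activations for prompt B) to the residual stream.

    `acts_plus` / `acts_minus` are activations cached at one layer for two contrasting
    prompts (e.g. a "Love"-themed vs. a "Hate"-themed prompt, padded to the same length);
    `resid` is the residual stream at that layer during generation.
    """
    steering = alpha * (acts_plus - acts_minus)
    return resid + steering.to(dtype=resid.dtype)
```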

Hierarchical Skip-Trigrams. A recent post by Buck Shlegeris constructed an example of a phenomenon we'll call "hierarchical skip-trigrams" (following Neel Nanda's naming). Hierarchical skip-trigrams allow one-layer models to use skip-trigrams to express interesting computation one might not have naively expected.

Decision Transformer Interpretability. Two recent articles (part 1, part 2) by Bloom and Colognese take a mechanistic approach to investigating decision transformers in a grid world setting.

Sparsity and Modularity. A recent paper by Liu et al. explores encouraging sparsity and modularity with a weight sparsity penalty that penalizes weights between neurons that are far apart (similar to wire length minimization in neuroscience). They find striking sparse graphs for a variety of tasks including arithmetic, group multiplication, and in-context learning.

Mechanistic Interpretability Challenges. Back in February, Stephen Casper posed several challenges to mechanistic interpretability practitioners, somewhat similar to the "auditing game" tests conducted at OpenAI in 2019. Recently, Stefan Heimersheim and Marius Hobbhahn took up this challenge and solved the first one.

Learning Materials. Neel Nanda has been producing a wide range of resources on getting started in mechanistic interpretability, including videos walking through different papers, videos explaining transformers, and a list of open problems. Separately, TheMcDouglas produced a nice illustration of induction heads.

About

This is an informal collection of updates published on May 24th, 2023. Different sections have different authors. The overall update was edited by Chris Olah, Josh Batson, and Shan Carter.

Informal notes allow us to publish ideas that aren't as refined and developed as a full paper. Mistakes are very possible! We'd ask you to treat this note as more like some ideas a colleague is informally sharing with you, or a typical internet blog post, rather than a paper.