A Toy Model of Interference Weights

An informal note on interference weights by Chris Olah, Nicholas L Turner, and Tom Conerly. Published July 29th, 2025.

This note explores the phenomenon of "interference weights" and "weight superposition", an idea that we've discussed briefly in previous papers and updates. We've come to believe they are a central issue if one wants to move from attribution graphs which describe why the model behaves the way it does for a specific example, to global circuit analysis where one can reason about the model more broadly. (In fact, avoiding interference weights was our primary motivation for studying attribution graphs.)

We study interference weights in the context of toy models and preliminarily find:

Takeaway 1: Interference weights can be demonstrated in toy models. We just need to slightly modify our interpretation of the setup from the original Toy Models paper. The resulting interference weights exhibit distinctive phenomenology we saw in Towards Monosemanticity, suggesting they also occur in real models. For many purposes we probably want to filter interference weights out in our analysis.

Takeaway 2: There are a number of plausible definitions of interference weights. Some are principled, while others are less principled heuristics which are more practical. We can compare these in toy models. Using some clever tricks, it should also be tractable to apply the expensive, principled definitions to small numbers of weights in real models, which might be useful for baseline comparisons of heuristics.

Takeaway 3: A lot more toy model work could be done to better understand interference weights.

This note is very preliminary. We're sharing because it might be of interest to researchers working actively in this space, and we believe there's value in sharing thoughts earlier. We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper. It is not intended for a broad audience. All claims should be taken with significant caveats and low confidence throughout.

Introduction

When we think of a model in superposition, we tend to think of how the features are arranged in superposition. However, there's a second aspect to superposition which is easy to neglect: weight superposition.

If two layers have features in superposition, the weights between the features are forced into superposition as well:

This is a big problem for circuit analysis because it causes "interference weights". Even if we can uncover the correct features, when we "lift" the model weights to connect those features, many of those weights will correspond to feature interference.

Should we care about interference weights?

These interference weights are "real" in the sense that they do genuinely describe connections between features in the model we observe. In fact, there's a hypothesis that they're one of the causes of adversarial examples!

However, they're essentially noise and as a result don't make sense. The model doesn't "want to have them". They make the loss worse, or at least don't help it. If we finetune the lifted model (impractical) they go away.This is actually much more subtle than it sounds, and is a bit simplified as written. In practice, there are at least three issues that prevent this statement from being strictly true. (1) Features are often not perfectly monosemantic, and so the model loss has some tiny preference for interference weights, requiring a small penalty. (2) If one naively optimizes more than one matrix in the "upstairs feature model", gradient descent may try to change those features – for example, it might introduce new superposition and polysemanticity, taking advantage of the added capacity. (3) Even setting aside the new polysemanticity issues of the previous point, the model may just learn new weights that it didn't try to represent when in superposition, or undo shrinkage (see scatter plot of virtual weights vs ideal weights later); this is expected under Defn (1) but counter intuitive. See Appendix 4 for related discussion. As a result of all of these issues, in practice one might do something like learn a mask on the virtual weights with a small penalty, along the lines of Drori, 2025.

A natural question, then, is whether we should care about them for safety. It seems like this may depend on our goals:

Robustness – They Matter! We can think of robustness as being the set of safety concerns arising from an adversarial / worst case environment (including user inputs). In this case, the interference weights exist and are an object the adversary can exploit. (Again, see the idea that they might literally cause adversarial examples!)
Alignment - They Likely Don't Matter. We can think of alignment as being the safety concerns arising from an adversarial optimization or learning process. If we define interference weights as being those which harm, or at least don't help, the optimization objective, they in some sense definitionally don't matter for alignment concerns.

This split suggests that we want to find a way to factor apart the "interference weights" and "real weights", so that we can only consider the interference weights when relevant. Since alignment is interpretability's priority, we care most about the "real weight" analysis, although of course we also want to keep the interference weights in mind.

Note that this claim – that we can ignore interference weights for alignment – warrants serious scrutiny. We're not fully convinced of it at this point! But it does seem quite likely to us, or at least likely that it's directionally true (perhaps interference weights matter much less than real ones for alignment).

It seems possible that whether interference weights can be ignored may be the pivotal question in whether global mechanistic analysis is possible.

Revisiting Toy Models

Toy Models of Superposition introduced a simple toy model for studying feature superposition:

h=Wx

x' = \text{ReLU}(W^Th+b)

Or:

x'=\text{ReLU}(W^TWh+b)

This toy model actually has two different interpretations. We can think of it as studying the geometry of how features are encoded in superposition (this is how Toy Models frames it), but we can also see it as an extremely simple case of one-layer identity circuits being put in superposition:

By revisiting the toy model with this second interpretation, we can do some very basic empirical exploration of weight superposition.

When we train these toy models, we'll be interested in U=W^TW – the observed, "downstairs", virtual weights. Even though in actuality, the toy model has a smaller set of weights which project down to a lower dimensional space and back up, these are the effective weights between the features.

Reproducing Phenomenology from Towards Monosemanticity

In Towards Monosemanticity, we observed a number of phenomena that seemed suggestive of interference weights. They were in fact the origin of much of our thinking on this topic! The goal of this section will be to find a simple toy model that matches the originally motivating phenomena, and compare them side-by-side.

But before we dive in, it's worth understanding why Towards Monosemanticity's experiments would have interference weights in the first place! At its core, the paper studied superposition in the MLP output of a one-layer transformer language model, using a sparse autoencoder to extract features. The key idea to understand is the notion of "logit weights", which connected the discovered features to the logits. Our claim will be that these logit weights should have interference weights.

Note that in such a setup, the features have a linear effect on the logits (modulo a rescaling by the final layer norm). We can get the virtual weights connecting the features and logits by tracing the path from the features, to the MLP output, to the residual stream, and then back up to the logits.

However, along this path, the features are forced into superposition, first in the MLP output, then more densely in the residual steam. As a result, when we expand these weights, we expect to see interference weights.

And indeed, Towards Monosemanticity did see many indications of such interference weights! We'll explore this very shortly. But before we do, we'll briefly introduce a toy model, and give a formal definition for interference weights. This will allow us to examine the real world phenomena from Towards Monosemanticity side-by-side with examples from toy models, where we can definitively say the analogous behavior comes from interference weights.

Toy Model Setup

Our goal is to reproduce the same basic phenomenology from Towards Monosemanticity, concretely demonstrating how weight superposition could explain those observations. We'll try to start with the simplest model that can produce roughly correct results to illustrate these ideas. In a later section, we'll produce a more complex model that matches things better.

We consider a standard toy model with n_features=100, n_residual=20, and feature_density=0.02. It will also be undertrained (this reproduces the phenomenology of Towards Monosemanticity better; we'll explore fully converged examples later).

Interference Weights and Decomposition

Because of superposition, the actual weights we observe are "noisy". Informally, this noise is the interference weights.

One way we could try to formalize this with the following definitions:

Ideal Weights. The ideal weights, U^*, are the weights we would get if we finetuned the feature-feature weight matrix in the lifted space with respect to the training loss, with a small weight regularization term to get rid of weights where the loss is indifferent.Note that this notion of ideal weights is pretty awkward to operationalize outside of toy models, and that the corresponding notion of interference weights may not be what we want. We discuss the various definitions one could have, and some of their pros and cons, in Appendix 4.
Interference Weights (defn 1). The interference weights are U-U^*, the difference between the observed virtual weights and ideal weights.

We can then think of the observed weights as decomposing into ideal weight and interference weights:

Another definition is to think about the loss contribution of each weight, \Delta L(U_{ij}). That is, the difference between the toy model loss, and its loss if we ablate a given weight.

Interference Weights (defn 2). The interference weights are the subset of the virtual weights U which don't improve the loss. That is, those which have \Delta L(U_{ij}) < \epsilon.

This offers an alternative decomposition:

For now, we'll focus our investigation on the loss contribution of each weight, \Delta L(U_{ij}). Later, we'll return to these two definitions, along with a few others which are easier to operationalize.

Weight Histogram

We can now return to our earlier goal of showing how our toy model can recover some of the interesting phenomenology of Towards Monosemanticity.

Firstly, let's consider a histogram of our weights, U. We'll color the histogram by \Delta L(U_{ij}) to allow us to distinguish interference weights from real weights.

This plot might seem familiar – it's qualitatively similar to the logit weight plots from Towards Monosemanticity, such as this one for the Hebrew feature:Note that there's a difference in the color scheme: the coloring above estimates a weight's average effect on the loss while the coloring below represents a token's connection with our interpretation of the feature. Another difference is the toy model histogram shows weights from all features while the graph from Towards Monosemanticity shows weights from a single feature to the logits.

Comparing Two Models with a Scatter Plot

We can also train a second toy model (with a different random seed) and make a scatter plot of the weights against each other.The random seed specifies the weight initialization as well as how data is sampled from the generating distribution. The interference weights are independent, but the real weights are all significantly positive. Again, this is quite reminiscent of what we saw in Towards Monosemanticity!

It's also interesting to look at a non-undertrained model here. We can see that in this configuration the interference weights become more structured, but we can also see that the real weights become perfectly agreed, while the interference weights don't.

Getting More Realistic Phenomenology

While the phenomenology of the above toy model is similar to Towards Monosemanticity, it is different in important ways. Consider the example of the base64 feature from Towards Monosemanticity:

The base64 feature has overlap between its "real weights" and "interference weights". This is a critical part of why interference weights are such a hard problem for us. If we could just ignore small weights, things would be much easier! So we'd like an example that demonstrates why the problem is hard.

A More Sophisticated Toy Model

We'd like a more realistic toy model, but how can we get one that exhibits the relevant phenomenology? In this section, we'll generalize our toy model to one that does.

To start, note that we can think of our previous toy model as mimicking an identity circuit y = ReLU(Id ~x). In the toy model, we imagine the x is compressed into superposition (h=Wx) and then taken out of superposition (y' = ReLU(W^Th+b)).

We're now going to consider a different toy model where the circuit we're approximating in superposition is more complex:

y = ReLU(A x + v)

For some random matrix A, instead of an identity. Since A isn't necessarily symmetric, we need to "untie" our toy model, using different weights to project into and out of superposition.

h = W_{down} x

y' = ReLU(W_{up} h+b)

This toy model introduces new degrees of freedom – we need to specify how to generate A and v.

Different choices here can yield very different phenomenology, and it turns out to be somewhat tricky to find a regime in which (1) real weights and interference weights strongly overlap; (2) training two models doesn't always lead to the same "ideal superposition configuration", collapsing the scatter plot; (3) training two models doesn't collapse into two superposition solutions which make very different binary choices for weights, also making the scatter plot uninteresting; and (4) this continues to hold as one trains to convergence. One can achieve (1-3), but not (4), by having a block diagonal matrix, where each block is itself sparse (probability 0.5), and otherwise uniformly sampled between [0,1]. The block diagonal structure seems to really help with (2). We have v be a constant vector of -0.1.

If we put 128 features in superposition in 16 dimensions, with 8 blocks, 0.1 weight density within those blocks, and input feature density of 0.3, and train two models we get the following weight scatter plot:

And if we focus on one model, we get the following weight histogram:

It's also interesting to compare the learned weights to the ideal weights:

What Should We Do about Interference Weights?

Ideally, we'd like to be able to separate the real weights and the interference weights, so we at least have the option to do circuit analysis only on the real weights.

Previously, we introduced two different definitions of interference weights:

Interference Weights (defn 1). The interference weights are U-U^*, the difference between the observed weights and ideal weights.
Interference Weights (defn 2). The interference weights are the subset of the virtual weights U which don't improve the loss. That is, those which have \Delta L(U_{ij}) < \epsilon.

(Note that these definitions are genuinely different, and are not the only possible definitions. See discussion in Appendix 4.)

In theory, these definitions could allow us to separate real weights from interference weights in any model. Unfortunately, both of these definitions are expensive to compute, and naively intractable to compute for all weights in a large model.For the first approach, and variants related to it, we naively need to materialize n_features^2 matrices. Today we train CLTs with tens of millions of features, but we worry we will ultimately need billions. And then we need to optimize them over large amounts of data. Conversely, for the second approach, we need to test the loss when we ablate each of the n_features^2 virtual weights, which avoids memory issues, but is perhaps worse in terms of compute… For smaller models, with smaller numbers of features, it might be possible to brute force things however.

Instead, we'll consider cheap heuristics which can be more computable proxies for these principled definitions. For now, let's consider four heuristic metrics: the original weights (big weights are likely to be real), expected attribution (weights with big average effects are likely to be real), target weighted attribution (weights with big effects on big things are more likely to be real), frequency (weights that often do something are more likely to be real), and an ideal baseline of the actual loss weight effect.

We can then try to do binary classification of real weights based on this, with the ground truth real weights defined as those with loss effect \Delta L(U_{ij}) > \epsilon = 0.0001. We can then look at precision-recall curves:

But we probably don't actually care about recall per se. Among real weights, there's a lot of variance in how important they are, and it's much worse to lose some important weights than others. Similarly, among interference weights, some are worse than others. So perhaps we should instead think about "loss gain" vs precision.Here "loss gain" is just the sum of \Delta L(U_{ij}) for each individual weight, rather than evaluated on each ablated model. This is much more promising!

Can we beat this? One tempting approach might be to take inspiration from compressed sensing – afterall, we're imagining the weights as actually living in a higher-dimensional space, and being compressed down via superposition. However, this requires that the map be linear, which may not be the case (see Appendix 1).

Conclusion

Interference weights may be the fundamental bottleneck preventing us from global circuit analysis of models. (Our recent work on attribution graphs was significantly designed to avoid interference weights!) The most ambitious vision for mechanistic interpretability requires global analysis, so addressing this seems quite important.

The naive approaches to dealing with interference weights are not scalable to large models with large numbers of features, but alternative heuristics may be. We can test these heuristics on toy models, and also test them by computing a ground truth for smaller numbers of weights in large models.

Author Contributions

Chris Olah did the majority of experiments, illustration and writing for this note. Nick Turner explored the heuristics for detecting interference weights in other settings, motivating their use here. Tom Conerly contributed significantly to exposition, performing a significant rewrite of some sections to improve communication.

Acknowledgments

Our thinking on interference weights traces back over many years. It's been a topic of concern since before the publication of Toy Models. Early thinking was influenced by conversations with Tristan Hume, Nelson Elhage, and Catherine Olsson in 2022, including the high-level idea that it would be important to separate interference weights from real weights.

All the work of Towards Monosemanticity throughout 2023 significantly contributed to our thinking on this problem, but Josh Batson discovered the scatter plots of logit weights from two different models, and highlighted how they might suggest interference weights. Conversations with him over the intervening years also continued to be influential.

Another significant update in the evolution of our thinking was unpublished internal work by Hoagy Cunningham in early 2024, following up on hints of interference weights in one-layer models from Towards Monosemanticity. In many ways, Hoagy tried to ask the questions explored in a toy model here in a more realistic and difficult setting. We hope to return to similar problems in future work.

More generally, thinking about interference weights has been influenced by too many people to list. To list just a few people, Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Adam Jermyn, and Isaac Kauvar were all significant recent influences.

We're grateful for the comments of Lee Sharkey, Tom McGrath, Liv Gorton, Martin Wattenberg, and Micah Adler on an early draft of this note.

Appendix 1: Non-Linear Weight Sharing

When we think about a larger model being compressed into a smaller model, it's natural to imagine that the weights of the smaller model should be a sum of weights corresponding to parts of the larger model. We make this assumption of parameter linearity in our original update about weight superposition. Other work makes similar assumptions, such as Braun et al.'s Attribution-based Parameter Decomposition.

However, as we've thought more, we've come to believe that it may often not be linear, and in fact it may be more like a pointwise maximum in some cases (modulo issues around signs).

Let's consider a simple example, of two features which must each be multiplied by two. Let's now consider a toy model in which the features must be compressed down to a single shared dimension. The model will put them in antipodal superposition, at which point they can both be multiplied by 2 using the same weight!

Now, in this case – where the weight is perfectly shared! – it might be tempting to try to think of this single "downstairs" weight as the meaningful unit. For example, in the language of parameter decomposition, you might claim it's a single parameter component.

But we can construct other examples where this seems less natural. If we put the features in something other than antipodal superposition, weights will only be partly shared, and have different values. The more sophisticated toy model introduced in the main section of this paper is in fact such an example. Unfortunately, these examples necessarily involve higher dimensions, and become significantly more difficult to describe than the above example.

Appendix 2: More Fully Converged Histograms and Isotropic Superposition Geometry

The models we looked at weren't fully trained to convergence, since this produced more similar phenomenology to real models. But it's interesting to look at fully converged toy models. In particular, we'll look at toy models similar to those from the "Uniform Superposition" section of Toy Models. These have the following properties:

"Isotropic" – all features are equally important, same density, etc.
"Superposition-limited" – there are more features than the toy model can learn

If we pick a density of 0.25, we observe the classical "pentagonal superposition" in our histograms, where features have angles of ⅖π and ⅘π:

If we pick a very low density of 0.001, we observe some phenomenology I wasn't aware of – although there isn't clean geometry, there's some "upper bound" interference weight that things don't go above!

Appendix 3: Correlated Feature Toy Models

The models we studied in the main portion of this post deal with toy models in which features aren't correlated or anti-correlated. In practice, we do expect features to be correlated, and this makes things more complex and rich.

We consider a toy model following a similar setup from Toy Models. There are two sets of correlated features, each consisting of 100 features. Either features from set 1 are active, or features from set 2; they never co-occur. Whichever is active, we then select two random features to be active, with activations uniformly sampled between 0 and 1.

The inter-set and intra-set weights have very different distributions, as seen below. Note that the inter-set weights have greater variance (including very large negative values)

One thing that immediately stands out is that there's some kind of gross, low-rank structure. On average, the features in a given set excite each other, and inhibit the other set. We can pull this out as its own weight component. Ablating this low-rank component harms the loss.

We think that, ultimately the right decomposition is probably the following, splitting the residual component into parts which contribute to reading the loss, and those which don't.

This example is interesting in a few ways. Firstly, it is probably best understood as having more than n_features^2 weights. The low-rank component seems importantly different from any individual feature-feature weight. Secondly, it starts to reveal that "is a weight positive for the loss" is quite complex: the low-rank component helps on the margin, but only because it reduces the effect of the interference weights! Finally, it gives an example of how the loss contributions of weights can be non-linear, leading to tension between our different definitions of interference weights.

Appendix 4: Different Definitions of Interference Weights

It's worth noting that our different definitions of interference weights really are quite different, in ways that could potentially be important.

The first difference is whether we should think of interference weights as being defined by their effect on the margin, or in combination with others. Definition (2) considers each ablation in isolation, inheriting all the problems of credit assignment / attribution in general. In principle, you could have cases where there are major non-linear interactions of ablating interference weights.

One particularly natural way in which this could happen is essentially what we observed in Appendix 3 above. It may be the case that each weight contains both a lot of noise (the interference) and also on average some real signal. If you ablate a single weight, the benefits of getting rid of noise may dominate. But if you ablate many such weights, and the response to losing the underlying signal is quadratic, you might eventually see a crossover in the loss.

The second is that Definition (1) defines interference weights relative to some ideal set of weights, it implicitly makes the "real weights" the "ideal weights". For example, suppose the model chose to simply not represent some weight to avoid the cost of its interference. Definition (1) would say that the real weight is the value it would have ideally taken, masked by large negative interference, while definition (2) would say that its real value is 0, with no interference.

Beyond these two differences, it's worth noting that these two definitions are far from the only plausible definitions one could have. (In fact, the definitions in this post were selected more for pedagogy and simplicity in this post, rather than being anything we ultimately believe to be the ideal definitions.) One example of an alternative angle is a recent post by Drori suggesting learning masks on the virtual weights to sparsify them. Masking like this is a very natural variant on our Definition (1), and gets rid of issues around the model learning things which weren't present in the original downstairs model. It's quite likely the path we would take if we were going to pursue this direction.

We regard having a variety of plausible definitions as perfectly normal at this stage of research! What definition ultimately seems best will likely get clearer as we understand the empirical situation better.