Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet.
We find a diversity of highly abstract features. They both respond to and behaviorally cause abstract behaviors. Examples of features we find include features for famous people, features for countries and cities, and features tracking type signatures in code. Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities).
Some of the features we find are of particular interest because they may be safety-relevant – that is, they are plausibly connected to a range of ways in which modern AI systems may cause harm. In particular, we find features related to security vulnerabilities and backdoors in code; bias (including both overt slurs, and more subtle biases); lying, deception, and power-seeking (including treacherous turns); sycophancy; and dangerous / criminal content (e.g., producing bioweapons). However, we caution not to read too much into the mere existence of such features: there's a difference (for example) between knowing about lies, being capable of lying, and actually lying in the real world. This research is also very preliminary. Further work will be needed to understand the implications of these potentially safety-relevant features.
Our general approach to understanding Claude 3 Sonnet is based on the linear representation hypothesis and the superposition hypothesis.
If one believes these hypotheses, the natural approach is to use a standard method called dictionary learning.
To date, these efforts have been on relatively small language models by the standards of modern foundation models. Our previous paper, for example, studied a small one-layer transformer.
This context motivates our project of scaling sparse autoencoders to Claude 3 Sonnet, Anthropic's medium-scale production model. The rest of this section will review our general sparse autoencoder setup, the specifics of the three sparse autoencoders we'll analyze in this paper, and how we used scaling laws to make informed decisions about the design of our sparse autoencoders. From there, we'll dive into analyzing the features our sparse autoencoders learn – and the interesting properties of Claude 3 Sonnet they reveal.
Our high-level goal in this work is to decompose the activations of a model (Claude 3 Sonnet) into more interpretable pieces. We do so by training a sparse autoencoder (SAE) on the model activations, as in our prior work.
Our SAE consists of two layers. The first layer (“encoder”) maps the activity to a higher-dimensional layer via a learned linear transformation followed by a ReLU nonlinearity. We refer to the units of this high-dimensional layer as “features.” The second layer (“decoder”) attempts to reconstruct the model activations via a linear transformation of the feature activations. The model is trained to minimize a combination of (1) reconstruction error and (2) an L1 regularization penalty on the feature activations, which incentivizes sparsity.
Once the SAE is trained, it provides us with an approximate decomposition of the model’s activations into a linear combination of “feature directions” (SAE decoder weights) with coefficients equal to the feature activations. The sparsity penalty ensures that, for many given inputs to the model, a very small fraction of features will have nonzero activations. Thus, for any given token in any given context, the model activations are “explained” by a small set of active features (out of a large pool of possible features). For more motivation and explanation of SAEs, see the Problem Setup section of Towards Monosemanticity.
Here’s a brief overview of our methodology, which we described in greater detail in Update on how we train SAEs from our April 2024 Update.
As a preprocessing step we apply a scalar normalization to the model activations so their average squared L2 norm is the residual stream dimension, $D$. We denote the normalized activations as $\mathbf{x} \in \mathbb{R}^{D}$, and attempt to decompose this vector using $F$ features as follows:

$$\hat{\mathbf{x}} = \mathbf{b}^{dec} + \sum_{i=1}^{F} f_i(\mathbf{x})\, \mathbf{W}^{dec}_{\cdot,i}$$

where $\mathbf{W}^{dec} \in \mathbb{R}^{D \times F}$ are the learned decoder weights, $\mathbf{b}^{dec} \in \mathbb{R}^{D}$ are learned decoder biases, and $f_i(\mathbf{x})$ is the activation of the $i$-th feature, given by the encoder:

$$f_i(\mathbf{x}) = \mathrm{ReLU}\left(\mathbf{W}^{enc}_{i,\cdot} \cdot \mathbf{x} + b^{enc}_i\right)$$

where $\mathbf{W}^{enc} \in \mathbb{R}^{F \times D}$ and $\mathbf{b}^{enc} \in \mathbb{R}^{F}$ are the learned encoder weights and biases.

The loss function $\mathcal{L}$ combines an L2 penalty on the reconstruction error with an L1 penalty on the feature activations, weighted by the L2 norms of the corresponding decoder columns:

$$\mathcal{L} = \mathbb{E}_{\mathbf{x}} \left[ \left\| \mathbf{x} - \hat{\mathbf{x}} \right\|_2^2 + \lambda \sum_i f_i(\mathbf{x}) \cdot \left\| \mathbf{W}^{dec}_{\cdot,i} \right\|_2 \right]$$

Including the factor of $\left\| \mathbf{W}^{dec}_{\cdot,i} \right\|_2$ in the sparsity penalty allows us to interpret $f_i(\mathbf{x}) \cdot \| \mathbf{W}^{dec}_{\cdot,i} \|_2$ as the overall contribution of feature $i$ to the reconstruction, and makes the penalty invariant to the rescaling symmetry between feature activations and decoder weights.
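For concreteness, here is a minimal PyTorch sketch of this setup. The initialization, dimensions, and names are illustrative assumptions, not our production training configuration:

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Minimal SAE: linear encoder with ReLU, linear decoder, and a loss that
    combines L2 reconstruction error with a decoder-norm-weighted L1 penalty."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # f_i(x) = ReLU(W_enc x + b_enc)
        return torch.relu(x @ self.W_enc.T + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        # x_hat = b_dec + sum_i f_i(x) W_dec[:, i]
        return f @ self.W_dec.T + self.b_dec

    def loss(self, x: torch.Tensor, l1_coeff: float = 5.0) -> torch.Tensor:
        f = self.encode(x)
        x_hat = self.decode(f)
        reconstruction = (x - x_hat).pow(2).sum(dim=-1)   # ||x - x_hat||^2
        decoder_norms = self.W_dec.norm(dim=0)            # ||W_dec[:, i]||_2
        sparsity = (f * decoder_norms).sum(dim=-1)        # weighted L1 penalty
        return (reconstruction + l1_coeff * sparsity).mean()


def normalize_activations(x: torch.Tensor) -> torch.Tensor:
    """Scale activations so their average squared L2 norm equals d_model."""
    d_model = x.shape[-1]
    scale = (d_model / x.pow(2).sum(dim=-1).mean()).sqrt()
    return x * scale
```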
Claude 3 Sonnet is a proprietary model for both safety and competitive reasons. Some of the decisions in this publication reflect this, such as not reporting the size of the model, leaving units off certain plots, and using a simplified tokenizer. For more information on how Anthropic thinks about safety considerations in publishing research results, we refer readers to our Core Views on AI Safety.
In this work, we focused on applying SAEs to residual stream activations halfway through the model (i.e. at the “middle layer”). We made this choice for several reasons. First, the residual stream is smaller than the MLP layer, making SAE training and inference computationally cheaper. Second, focusing on the residual stream in theory helps us mitigate an issue we call “cross-layer superposition” (see Limitations for more discussion). We chose to focus on the middle layer of the model because we reasoned that it is likely to contain interesting, abstract features.
We trained three SAEs of varying sizes: 1,048,576 (~1M), 4,194,304 (~4M), and 33,554,432 (~34M) features. The number of training steps for the 34M feature run was selected using a scaling laws analysis to minimize the training loss given a fixed compute budget (see below). We used an L1 coefficient of 5.
For all three SAEs, the average number of features active (i.e. with nonzero activations) on a given token was fewer than 300, and the SAE reconstruction explained at least 65% of the variance of the model activations. At the end of training, we defined “dead” features as those which were not active over a large sample of tokens.
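Given an SAE with encode/decode methods like the sketch above, these summary statistics can be computed as follows; this is an illustrative sketch rather than our evaluation code:

```python
import torch


@torch.no_grad()
def sae_summary_stats(sae, activations: torch.Tensor):
    """L0, fraction of variance explained, and dead-feature mask over a
    sample of (already normalized) model activations."""
    f = sae.encode(activations)                 # (n_tokens, n_features)
    x_hat = sae.decode(f)

    l0 = (f > 0).float().sum(dim=-1).mean()     # avg. active features per token
    residual = (activations - x_hat).pow(2).sum()
    total = (activations - activations.mean(dim=0)).pow(2).sum()
    frac_variance_explained = 1 - residual / total

    dead = (f > 0).sum(dim=0) == 0              # features never active in this sample
    return l0.item(), frac_variance_explained.item(), dead
```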
Training SAEs on larger models is computationally intensive. It is important to understand (1) the extent to which additional compute improves dictionary learning results, and (2) how that compute should be allocated to obtain the highest-quality dictionary possible for a given computational budget.
Though we lack a gold-standard method of assessing the quality of a dictionary learning run, we have found that the loss function we use during training – a weighted combination of reconstruction mean-squared error (MSE) and an L1 penalty on feature activations – is a useful proxy, conditioned on a reasonable choice of the L1 coefficient. That is, we have found that dictionaries with low loss values (using an L1 coefficient of 5) tend to produce interpretable features and to improve other metrics of interest (the L0 norm, and the number of dead or otherwise degenerate features). Of course, this is an imperfect metric, and we have little confidence that it is optimal. It may well be the case that other L1 coefficients (or other objective functions altogether) would be better proxies to optimize.
With this proxy, we can treat dictionary learning as a standard machine learning problem, to which we can apply the “scaling laws” framework for hyperparameter optimization. For an SAE, the compute used during training is determined primarily by two hyperparameters: the number of features being learned and the number of steps used to train the autoencoder.
We conducted a thorough sweep over these parameters, fixing the values of other hyperparameters (learning rate, batch size, optimization protocol, etc.). We were also interested in tracking the compute-optimal values of the loss function and parameters of interest; that is, the lowest loss that can be achieved using a given compute budget, and the number of training steps and features that achieve this minimum.
We make the following observations:
Over the ranges we tested, given the compute-optimal choice of training steps and number of features, loss decreases approximately according to a power law with respect to compute.
As the compute budget increases, the optimal allocations of FLOPS to training steps and number of features both scale approximately as power laws. In general, the optimal number of features appears to scale somewhat more quickly than the optimal number of training steps at the compute budgets we tested, though this trend may change at higher compute budgets.
These analyses used a fixed learning rate. For different compute budgets, we subsequently swept over learning rates at different optimal parameter settings according to the plots above. The inferred optimal learning rates decreased approximately as a power law as a function of compute budget, and we extrapolated this trend to choose learning rates for the larger runs.
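The kind of fit involved can be sketched as a linear regression in log-log space; the functional form is the standard power-law ansatz, and the numerical values below are placeholders rather than our measurements:

```python
import numpy as np


def fit_power_law(compute: np.ndarray, values: np.ndarray):
    """Fit values ~ a * compute**b by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(values), deg=1)
    return np.exp(intercept), slope  # (a, b)


# Placeholder inputs: compute-optimal losses at a few compute budgets (FLOPs).
# These numbers are illustrative only, not measured results.
compute_budgets = np.array([1e18, 1e19, 1e20])
optimal_losses = np.array([0.30, 0.21, 0.15])

a, b = fit_power_law(compute_budgets, optimal_losses)
extrapolated_loss = a * (1e21) ** b  # project the trend to a larger budget
```

The same procedure applies to extrapolating the compute-optimal number of features, number of training steps, and learning rate.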
In the previous section, we described how we trained sparse autoencoders on Claude 3 Sonnet. And as predicted by scaling laws, we achieved lower losses by training large SAEs. But the loss is only a proxy for what we actually care about: interpretable features that explain model behavior.
The goal of this section is to investigate whether these features are actually interpretable and explain model behavior. We'll first look at a handful of relatively straightforward features and provide evidence that they're interpretable. Then we'll look at two much more complex features, and demonstrate that they track very abstract concepts. We'll close with an experiment using automated interpretability to evaluate a larger number of features and compare them to neurons.
In this subsection, we'll look at a few features and argue that they are genuinely interpretable. Our goal is just to demonstrate that interpretable features exist, leaving strong claims (such as most features being interpretable) to a later section. We will provide evidence that our interpretations are good descriptions of what the features represent and how they function in the network, using an analysis similar to that in Towards Monosemanticity.
The features we study in this section respond to: the Golden Gate Bridge, brain sciences (neuroscience, psychology, and related fields), transit infrastructure, and popular tourist attractions and monuments.
Here and elsewhere in the paper, for each feature, we show representative examples from the top 20 text inputs in our SAE dataset, as ranked by how strongly they activate that feature (see the appendix for details). A larger, randomly sampled set of activations can be found by clicking on the feature ID. The highlight colors indicate activation strength at each token (white: no activation, orange: strongest activation).
While these examples suggest interpretations for each feature, more work needs to be done to establish that our interpretations truly capture the behavior and function of the corresponding features. Concretely, for each feature, we attempt to establish the following claims:

1. When the feature is active, the relevant concept is reliably present in the context (specificity).
2. Intervening on the feature's activation produces relevant downstream behavior (influence on behavior).
It is difficult to rigorously measure the extent to which a concept is present in a text input. In our prior work, we focused on features that unambiguously corresponded to sets of tokens (e.g., Arabic script or DNA sequences) and computed the likelihood of that set of tokens relative to the rest of the vocabulary, conditioned on the feature’s activation. This technique does not generalize to more abstract features. Instead, to demonstrate specificity in this work we more heavily leverage automated interpretability methods, similar to those used in prior work.
We constructed the following rubric for scoring how a feature’s description relates to the text on which it fires. We then asked Claude 3 Opus to rate feature activations at many tokens on that rubric.
By scoring examples of activating text, we provide a measure of specificity for each feature.
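A minimal sketch of how this scoring might be wired up with the Anthropic Python SDK is shown below; the rubric prompt, model string, and output parsing are illustrative assumptions rather than the prompts we actually used:

```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical rubric prompt, not the exact wording used in our experiments.
RUBRIC_PROMPT = (
    "You will be given a proposed interpretation of a feature and a text excerpt "
    "with the activating token marked by <<...>>. Rate how well the interpretation "
    "describes the marked text on a 0-3 scale (0 = unrelated, 3 = cleanly "
    "identifies it). Respond with a single digit."
)


def score_activation(interpretation: str, excerpt: str) -> int:
    """Ask Claude 3 Opus to rate one activating example against the rubric."""
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=5,
        system=RUBRIC_PROMPT,
        messages=[{
            "role": "user",
            "content": f"Interpretation: {interpretation}\n\nExcerpt: {excerpt}",
        }],
    )
    return int(response.content[0].text.strip())
```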
Below we show distributions of feature activations (excluding zero activations) for the four features mentioned above, along with example text and image inputs that induce low and high activations. Note that these features also activate on relevant images, despite our only performing dictionary learning on a text-based dataset!
First, we study a Golden Gate Bridge feature 34M/31164353. Its greatest activations are essentially all references to the bridge, and weaker activations also include related tourist attractions, similar bridges, and other monuments. Next, a brain sciences feature 34M/9493533 activates on discussions of neuroscience books and courses, as well as cognitive science, psychology, and related philosophy. In the 1M training run, we also find a feature that strongly activates for various kinds of transit infrastructure 1M/3 including trains, ferries, tunnels, bridges, and even wormholes! A final feature 1M/887839 responds to popular tourist attractions including the Eiffel Tower, the Tower of Pisa, the Golden Gate Bridge, and the Sistine Chapel.
To quantify specificity, we used Claude 3 Opus to automatically score examples that activate these features according to the rubric above, with roughly 1000 activations of the feature drawn from the dataset used to train the dictionary learning model. We plot the frequency of each rubric score as a function of the feature’s activation level. We see that inputs that induce strong feature activations are all judged to be highly consistent with the proposed interpretation.
As in Towards Monosemanticity, we see that these features become less specific as the activation strength weakens. This could be due to the model using activation strengths to represent confidence in a concept being present. Or it may be that the feature activates most strongly for central examples of the feature, but weakly for related ideas – for example, the Golden Gate Bridge feature 34M/31164353 appears to weakly activate for other San Francisco landmarks. It could also reflect imperfection in our dictionary learning procedure. For example, it may be that the architecture of the autoencoder is not able to extract and discriminate among features as cleanly as we might want. And of course interference from features that are not exactly orthogonal could also be a culprit, making it more difficult for Sonnet itself to activate features on precisely the right examples. It is also plausible that our feature interpretations slightly misrepresent the feature's actual function, and that this inaccuracy manifests more clearly at lower activations. Nonetheless, we often find that lower activations tend to maintain some specificity to our interpretations, including related concepts or generalizations of the core feature. As an illustrative example, weak activations of the transit infrastructure feature 1M/3 include procedural mechanics instructions describing which through-holes to use for particular parts.
Moreover, we expect that very weak activations of features are not especially meaningful, and thus we are not too concerned with low specificity scores for these activation ranges. For instance, we have observed that techniques such as rounding feature activations below a threshold to zero can improve specificity at the low-activation end of the spectrum without substantially increasing the reconstruction error of the SAE, and there are a variety of techniques in the literature that potentially address the same issue.
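As an illustration of the thresholding idea (the threshold value and function names here are assumptions), one can zero out sub-threshold activations at inference time and confirm that the reconstruction error barely changes:

```python
import torch


@torch.no_grad()
def reconstruction_mse(sae, x: torch.Tensor, threshold: float = 0.0) -> float:
    """Mean squared reconstruction error, optionally rounding feature
    activations below `threshold` down to zero before decoding."""
    f = sae.encode(x)
    if threshold > 0:
        f = torch.where(f >= threshold, f, torch.zeros_like(f))
    return (x - sae.decode(f)).pow(2).sum(dim=-1).mean().item()


# Compare the error with and without the low-activation cutoff, e.g.:
# reconstruction_mse(sae, acts) vs. reconstruction_mse(sae, acts, threshold=0.1)
```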
Regardless, the activations that have the most impact on the model’s behavior are the largest ones, so it is encouraging to see high specificity among the strong activations.
Note that we have had more difficulty in quantifying feature sensitivity – that is, how reliably a feature activates for text that matches our proposed interpretation – in a scalable, rigorous way. This is due to the difficulty of generating text related to a concept in an unbiased fashion. Moreover, many features may represent something more specific than we are able to glean with our visualizations, in which case they would not respond reliably to text selected based on our proposed interpretation, and this problem gets harder the more abstract the features are. As a basic check, however, we observe that the Golden Gate Bridge feature still fires strongly on the first sentence of the Wikipedia article for the Golden Gate Bridge in various languages (after removing any English parentheticals). In fact, the Golden Gate Bridge feature is the top feature by average activation for every example below.
We leave further investigation of this issue to future work.
Next, to demonstrate whether our interpretations of features accurately describe their influence on model behavior, we experiment with feature steering, where we “clamp” specific features of interest to artificially high or low values during the forward pass (see Methodological Details for implementation details). This builds on a long history of modifying feature activations to test causal theories, as well as work on other approaches to model steering, discussed in Related Work. We conduct these experiments with prompts in the “Human:”/“Assistant:” format that Sonnet is typically used with. We find that feature steering is remarkably effective at modifying model outputs in specific, interpretable ways. It can be used to modify the model’s demeanor, preferences, stated goals, and biases; to induce it to make specific errors; and to circumvent model safeguards (see also Safety-Relevant Features). We take this as compelling evidence that our interpretations of features line up with how they are used by the model.
For instance, we see that clamping the Golden Gate Bridge feature 34M/31164353 to 10× its maximum activation value induces thematically-related model behavior. In this example, the model starts to self-identify as the Golden Gate Bridge! Similarly, clamping the Transit infrastructure feature 1M/3 to 5× its maximum activation value causes the model to mention a bridge when it otherwise would not. In each case, the downstream influence of the feature appears consistent with our interpretation of the feature, even though these interpretations were made based only on the contexts in which the feature activates and we are intervening in contexts in which the feature is inactive.
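One way such clamping could be implemented is with a forward hook on the relevant residual stream layer, as sketched below. The hook placement, the use of the SAE reconstruction plus its error term, and all names and values here are assumptions about implementation details, not a description of our exact procedure:

```python
import torch


def make_steering_hook(sae, feature_idx: int, clamp_value: float):
    """Return a forward hook that clamps one SAE feature during the forward pass.

    Assumes the hooked module returns the residual stream tensor directly. The
    activation is decomposed with the SAE, the chosen feature is overwritten with
    `clamp_value`, and the modified reconstruction plus the SAE's reconstruction
    error is substituted back into the residual stream.
    """
    def hook(module, inputs, output):
        x = output
        f = sae.encode(x)
        error = x - sae.decode(f)          # the part the SAE fails to capture
        f[..., feature_idx] = clamp_value
        return sae.decode(f) + error       # steered activations replace the output
    return hook


# Hypothetical usage: clamp a feature to 10x its maximum observed activation.
# handle = model.layers[middle_layer].register_forward_hook(
#     make_steering_hook(sae, feature_idx=31164353, clamp_value=10 * feature_max))
```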
So far we have presented features in Claude 3 Sonnet that fire on relatively simple concepts. These features are in some ways similar to those found in Towards Monosemanticity, which, because they were extracted from the activations of a 1-layer Transformer, reflected a very shallow knowledge of the world. For example, we found features that correspond to predicting a range of common nouns conditioned on a fairly general context (e.g. biology nouns following “the” in the context of biology).
Sonnet, in contrast, is a much larger and more sophisticated model, so we expect that it contains features demonstrating depth and clarity of understanding. To study this, we looked for features that activate in programming contexts, because these contexts admit precise statements about e.g. correctness of code or the types of variables.
We begin by considering a simple Python function for adding two arguments, but with a bug. One feature 1M/1013764 fires almost continuously upon encountering a variable incorrectly named “rihgt” (highlighted below):
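The snippet below is an illustrative reconstruction of such an example (the exact function is assumed, not reproduced from our dataset); the feature fires on the misspelled parameter name:

```python
def add(left, rihgt):   # "rihgt" is the misspelling the feature responds to
    return left + rihgt
```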
This is certainly suspicious, but it could be a Python-specific feature, so we checked and found that 1M/1013764 also fires on similar bugs in C and Scheme:
To check whether or not this is a more general typo feature, we tested 1M/1013764 on examples of typos in English prose, and found that it does not fire on them.
So it is not a general “typo detector”: it has some specificity to code contexts.
But is 1M/1013764 just a “typos in code” feature? We also tested it on a number of other examples and found that it also fires on erroneous expressions (e.g., divide by zero) and on invalid input in function calls:
The two examples shown above are representative of a broader pattern. Looking through the dataset examples where this feature activates, we found instances of it activating for:
Some top dataset examples can be found below: