Transformer Circuits Thread

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet


Authors

Adly Templeton*, Tom Conerly*, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan

Affiliations

Anthropic

Published

May 21, 2024
* Core Contributor; Correspondence to henighan@anthropic.com; Author contributions statement below.

Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet, Anthropic's medium-sized production model. (For clarity, this is the 3.0 version of Claude 3 Sonnet, released March 4, 2024. It is the exact model in production as of the writing of this paper. It is the finetuned model, not the base pretrained model, although our method also works on the base model.)

We find a diversity of highly abstract features. They both respond to and behaviorally cause abstract behaviors. Examples of features we find include features for famous people, features for countries and cities, and features tracking type signatures in code. Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities).

Some of the features we find are of particular interest because they may be safety-relevant – that is, they are plausibly connected to a range of ways in which modern AI systems may cause harm. In particular, we find features related to security vulnerabilities and backdoors in code; bias (including both overt slurs, and more subtle biases); lying, deception, and power-seeking (including treacherous turns); sycophancy; and dangerous / criminal content (e.g., producing bioweapons). However, we caution not to read too much into the mere existence of such features: there's a difference (for example) between knowing about lies, being capable of lying, and actually lying in the real world. This research is also very preliminary. Further work will be needed to understand the implications of these potentially safety-relevant features.

Key Results







Scaling Dictionary Learning to Claude 3 Sonnet

Our general approach to understanding Claude 3 Sonnet is based on the linear representation hypothesis (see e.g. ) and the superposition hypothesis (see e.g. ). For an introduction to these ideas, we refer readers to the Background and Motivation section of Toy Models . At a high level, the linear representation hypothesis suggests that neural networks represent meaningful concepts – referred to as features – as directions in their activation spaces. The superposition hypothesis accepts the idea of linear representations and further hypothesizes that neural networks use the existence of almost-orthogonal directions in high-dimensional spaces to represent more features than there are dimensions.

If one believes these hypotheses, the natural approach is to use a standard method called dictionary learning . Recently, several papers have suggested that this can be quite effective for transformer language models . In particular, a specific approximation of dictionary learning called a sparse autoencoder appears to be very effective .

To date, these efforts have been on relatively small language models by the standards of modern foundation models. Our previous paper , which focused on a one-layer model, was a particularly extreme example of this. As a result, an important question has been left hanging: will these methods work for large models? Or is there some reason, whether pragmatic questions of engineering or more fundamental differences in how large models operate, that would mean these efforts can't generalize?

This context motivates our project of scaling sparse autoencoders to Claude 3 Sonnet, Anthropic's medium-scale production model. The rest of this section will review our general sparse autoencoder setup, the specifics of the three sparse autoencoders we'll analyze in this paper, and how we used scaling laws to make informed decisions about the design of our sparse autoencoders. From there, we'll dive into analyzing the features our sparse autoencoders learn – and the interesting properties of Claude 3 Sonnet they reveal.

Sparse Autoencoders

Our high-level goal in this work is to decompose the activations of a model (Claude 3 Sonnet) into more interpretable pieces. We do so by training a sparse autoencoder (SAE) on the model activations, as in our prior work and that of several other groups (e.g. ; see Related Work). SAEs are an instance of a family of “sparse dictionary learning” algorithms that seek to decompose data into a weighted sum of sparsely active components.

Our SAE consists of two layers. The first layer (“encoder”) maps the activity to a higher-dimensional layer via a learned linear transformation followed by a ReLU nonlinearity. We refer to the units of this high-dimensional layer as “features.” The second layer (“decoder”) attempts to reconstruct the model activations via a linear transformation of the feature activations. The model is trained to minimize a combination of (1) reconstruction error and (2) an L1 regularization penalty on the feature activations, which incentivizes sparsity.

Once the SAE is trained, it provides us with an approximate decomposition of the model’s activations into a linear combination of “feature directions” (SAE decoder weights) with coefficients equal to the feature activations. The sparsity penalty ensures that, for many given inputs to the model, a very small fraction of features will have nonzero activations. Thus, for any given token in any given context, the model activations are “explained” by a small set of active features (out of a large pool of possible features). For more motivation and explanation of SAEs, see the Problem Setup section of Towards Monosemanticity .

Here’s a brief overview of our methodology, which we describe in greater detail in Update on how we train SAEs from our April 2024 Update.

As a preprocessing step we apply a scalar normalization to the model activations so their average squared L2 norm is the residual stream dimension, D. We denote the normalized activations as \mathbf{x} \in \mathbb{R}^D, and attempt to decompose this vector using F features as follows:

\hat{\mathbf{x}} = \mathbf{b}^{dec} + \sum_{i=1}^F f_i(\mathbf{x}) \mathbf{W}^{dec}_{\cdot,i}

where \mathbf{W}^{dec} \in \mathbb{R}^{D \times F} are the learned SAE decoder weights, \mathbf{b}^{dec} \in \mathbb{R}^D are learned biases, and f_i denotes the activity of feature i. Feature activations are given by the output of the encoder:

f_i(\mathbf{x}) = \text{ReLU}\left(\mathbf{W}^{enc}_{i, \cdot} \cdot \mathbf{x} + b^{enc}_i \right)

where \mathbf{W}^{enc} \in \mathbb{R}^{F \times D} are the learned SAE encoder weights, and \mathbf{b}^{enc} \in \mathbb{R}^F are learned biases.

The loss function \mathcal{L} is the combination of an L2 penalty on the reconstruction loss and an L1 penalty on feature activations.

\mathcal{L} = \mathbb{E}_\mathbf{x} \left[ \|\mathbf{x}-\hat{\mathbf{x}}\|_2^2 + \lambda\sum_i f_i(\mathbf{x}) \cdot \|\mathbf{W}^{dec}_{\cdot,i}\|_2 \right]

Including the factor of \|\mathbf{W}^{dec}_{\cdot,i}\|_2 in the L1 penalty term allows us to interpret the unit-normalized decoder vectors \frac{\mathbf{W}^{dec}_{\cdot,i}}{\|\mathbf{W}^{dec}_{\cdot,i}\|_2} as “feature vectors” or “feature directions,” and the product f_i(\mathbf{x}) \cdot \|\mathbf{W}^{dec}_{\cdot,i}\|_2 as the feature activations. (This also prevents the SAE from “cheating” the L1 penalty by making f_i(\mathbf{x}) small and \mathbf{W}^{dec}_{\cdot,i} large in a way that leaves the reconstructed activations unchanged.) Henceforth we will use “feature activation” to refer to this quantity.
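For concreteness, the normalization, encoder, decoder, and loss described above can be sketched in PyTorch as follows. This is only an illustrative implementation of the equations, not the training code used for this work; the class and variable names are placeholders, and details such as initialization, optimizer choice, and handling of dead features are omitted.

```python
import torch
import torch.nn as nn


def normalize_activations(acts: torch.Tensor) -> torch.Tensor:
    """Rescale activations so their average squared L2 norm equals D."""
    D = acts.shape[-1]
    mean_sq_norm = acts.pow(2).sum(dim=-1).mean()  # E[||x||^2]
    return acts * torch.sqrt(D / mean_sq_norm)     # now E[||x||^2] = D


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # Encoder: W_enc in R^{F x D}, b_enc in R^F
        self.W_enc = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        # Decoder: W_dec in R^{D x F}, b_dec in R^D
        self.W_dec = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # f_i(x) = ReLU(W_enc x + b_enc)
        f = torch.relu(x @ self.W_enc.T + self.b_enc)   # [batch, F]
        # x_hat = b_dec + sum_i f_i(x) W_dec[:, i]
        x_hat = f @ self.W_dec.T + self.b_dec           # [batch, D]
        return f, x_hat

    def loss(self, x: torch.Tensor, l1_coefficient: float = 5.0):
        f, x_hat = self.forward(x)
        recon = (x - x_hat).pow(2).sum(dim=-1)          # ||x - x_hat||_2^2
        dec_norms = self.W_dec.norm(dim=0)              # ||W_dec[:, i]||_2
        sparsity = (f * dec_norms).sum(dim=-1)          # sum_i f_i(x) * ||W_dec[:, i]||_2
        return (recon + l1_coefficient * sparsity).mean()
```

Note that, because the L1 term is weighted by the decoder column norms, the per-feature quantity `f * dec_norms` in the sketch corresponds to what we refer to as the feature activation above.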

Our SAE experiments

Claude 3 Sonnet is a proprietary model for both safety and competitive reasons. Some of the decisions in this publication reflect this, such as not reporting the size of the model, leaving units off certain plots, and using a simplified tokenizer. For more information on how Anthropic thinks about safety considerations in publishing research results, we refer readers to our Core Views on AI Safety.

In this work, we focused on applying SAEs to residual stream activations halfway through the model (i.e. at the “middle layer”). We made this choice for several reasons. First, the residual stream is smaller than the MLP layer, making SAE training and inference computationally cheaper. Second, focusing on the residual stream in theory helps us mitigate an issue we call “cross-layer superposition” (see Limitations for more discussion). We chose to focus on the middle layer of the model because we reasoned that it is likely to contain interesting, abstract features (see e.g., ).

We trained three SAEs of varying sizes: 1,048,576 (~1M), 4,194,304 (~4M), and 33,554,432 (~34M) features. The number of training steps for the 34M feature run was selected using a scaling laws analysis to minimize the training loss given a fixed compute budget (see below). We used an L1 coefficient of 5. (Our L1 coefficient is only relevant in the context of how we normalize activations; see Update on how we train SAEs for full details.) We performed a sweep over a narrow range of learning rates (suggested by the scaling laws analysis) and chose the value that gave the lowest loss.

For all three SAEs, the average number of features active (i.e. with nonzero activations) on a given token was fewer than 300, and the SAE reconstruction explained at least 65% of the variance of the model activations. At the end of training, we defined “dead” features as those which were not active over a sample of 10^{7} tokens. The proportion of dead features was roughly 2% for the 1M SAE, 35% for the 4M SAE, and 65% for the 34M SAE. We expect that improvements to the training procedure may be able to reduce the number of dead features in future experiments.
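The statistics reported here (average number of active features per token, variance explained, and dead features) can be estimated directly from the feature activations and reconstructions. A minimal sketch, assuming an `sae` object like the one above and an iterable of normalized activation batches (both hypothetical):

```python
import torch


@torch.no_grad()
def sae_summary_stats(sae, activation_batches):
    """Estimate mean L0, fraction of variance explained, and dead-feature rate.

    activation_batches: iterable of [batch, D] normalized activation tensors
    (a hypothetical data source, not the actual evaluation pipeline).
    """
    l0_total, n_tokens = 0.0, 0
    resid_var, total_var = 0.0, 0.0
    ever_active = None
    for x in activation_batches:
        f, x_hat = sae(x)
        l0_total += (f > 0).float().sum().item()            # active features per token
        n_tokens += x.shape[0]
        resid_var += (x - x_hat).pow(2).sum().item()         # reconstruction residual
        total_var += (x - x.mean(dim=0)).pow(2).sum().item() # variance about per-batch mean (approximate)
        active = (f > 0).any(dim=0)
        ever_active = active if ever_active is None else (ever_active | active)
    return {
        "mean_l0": l0_total / n_tokens,
        "frac_variance_explained": 1.0 - resid_var / total_var,
        "frac_dead": 1.0 - ever_active.float().mean().item(),
    }
```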

Scaling Laws

Training SAEs on larger models is computationally intensive. It is important to understand (1) the extent to which additional compute improves dictionary learning results, and (2) how that compute should be allocated to obtain the highest-quality dictionary possible for a given computational budget.

Though we lack a gold-standard method of assessing the quality of a dictionary learning run, we have found that the loss function we use during training – a weighted combination of reconstruction mean-squared error (MSE) and an L1 penalty on feature activations – is a useful proxy, conditioned on a reasonable choice of the L1 coefficient. That is, we have found that dictionaries with low loss values (using an L1 coefficient of 5) tend to produce interpretable features and to improve other metrics of interest (the L0 norm, and the number of dead or otherwise degenerate features). Of course, this is an imperfect metric, and we have little confidence that it is optimal. It may well be the case that other L1 coefficients (or other objective functions altogether) would be better proxies to optimize.

With this proxy, we can treat dictionary learning as a standard machine learning problem, to which we can apply the “scaling laws” framework for hyperparameter optimization (see e.g. ). In an SAE, compute usage primarily depends on two key hyperparameters: the number of features being learned, and the number of steps used to train the autoencoder (which maps linearly to the amount of data used, as we train the SAE for only one epoch). The compute cost scales with the product of these parameters if the input dimension and other hyperparameters are held constant.

We conducted a thorough sweep over these parameters, fixing the values of other hyperparameters (learning rate, batch size, optimization protocol, etc.). We were also interested in tracking the compute-optimal values of the loss function and parameters of interest; that is, the lowest loss that can be achieved using a given compute budget, and the number of training steps and features that achieve this minimum.
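Concretely, the compute-optimal frontier can be read off such a sweep by grouping runs by their approximate compute (proportional to the product of feature count and training steps, with other hyperparameters fixed) and taking the lowest-loss run in each group. A sketch with hypothetical run records:

```python
from collections import defaultdict


def compute_optimal_frontier(runs):
    """runs: list of dicts with 'n_features', 'n_steps', 'loss' (hypothetical records).

    Compute is treated as proportional to n_features * n_steps when other
    hyperparameters are held fixed. Returns, for each compute level, the run
    that achieved the lowest loss.
    """
    by_compute = defaultdict(list)
    for run in runs:
        compute = run["n_features"] * run["n_steps"]
        by_compute[compute].append(run)
    frontier = {c: min(rs, key=lambda r: r["loss"]) for c, rs in by_compute.items()}
    return dict(sorted(frontier.items()))
```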

We make the following observations:

Over the ranges we tested, given the compute-optimal choice of training steps and number of features, loss decreases approximately according to a power law with respect to compute.

As the compute budget increases, the optimal allocations of FLOPS to training steps and number of features both scale approximately as power laws. In general, the optimal number of features appears to scale somewhat more quickly than the optimal number of training steps at the compute budgets we tested, though this trend may change at higher compute budgets.

These analyses used a fixed learning rate. For different compute budgets, we subsequently swept over learning rates at different optimal parameter settings according to the plots above. The inferred optimal learning rates decreased approximately as a power law as a function of compute budget, and we extrapolated this trend to choose learning rates for the larger runs.
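Each of these power-law trends amounts to a linear fit in log-log space, which can then be extrapolated to larger compute budgets. The numbers below are made up purely to illustrate the fitting procedure; the actual sweep results are not reproduced here.

```python
import numpy as np

# Hypothetical (compute budget, best-achieved loss) pairs from a sweep.
compute = np.array([1e15, 1e16, 1e17, 1e18])
best_loss = np.array([0.31, 0.22, 0.16, 0.12])

# Fit loss ~ a * compute^b, i.e. log(loss) = log(a) + b * log(compute).
b, log_a = np.polyfit(np.log(compute), np.log(best_loss), deg=1)
print(f"loss scales roughly as compute^{b:.2f}")


def extrapolate(log_x, log_y, x_new):
    """Fit a power law in log-log space and evaluate it at a new x.

    The same recipe applies to the optimal learning rate (or optimal number of
    features / training steps) as a function of compute budget.
    """
    slope, intercept = np.polyfit(log_x, log_y, deg=1)
    return np.exp(intercept + slope * np.log(x_new))
```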







Assessing Feature Interpretability

In the previous section, we described how we trained sparse autoencoders on Claude 3 Sonnet. And as predicted by scaling laws, we achieved lower losses by training large SAEs. But the loss is only a proxy for what we actually care about: interpretable features that explain model behavior.

The goal of this section is to investigate whether these features are actually interpretable and explain model behavior. We'll first look at a handful of relatively straightforward features and provide evidence that they're interpretable. Then we'll look at two much more complex features, and demonstrate that they track very abstract concepts. We'll close with an experiment using automated interpretability to evaluate a larger number of features and compare them to neurons.

Four Examples of Interpretable Features

In this subsection, we'll look at a few features and argue that they are genuinely interpretable. Our goal is just to demonstrate that interpretable features exist, leaving strong claims (such as most features being interpretable) to a later section.  We will provide evidence that our interpretations are good descriptions of what the features represent and how they function in the network, using an analysis similar to that in Towards Monosemanticity .

The features we study in this section respond to the Golden Gate Bridge, brain sciences, monuments and popular tourist attractions, and transit infrastructure.

Here and elsewhere in the paper, for each feature, we show representative examples from the top 20 text inputs in our SAE dataset, as ranked by how strongly they activate that feature (see the appendix for details). A larger, randomly sampled set of activations can be found by clicking on the feature ID. The highlight colors indicate activation strength at each token (white: no activation, orange: strongest activation).
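Selecting these examples amounts to ranking dataset inputs by the strongest activation they induce for a given feature. A rough sketch, assuming an `sae` like the one above and a dataset of (text, activations) pairs (both hypothetical):

```python
import heapq
import torch


@torch.no_grad()
def top_activating_examples(sae, dataset, feature_idx: int, k: int = 20):
    """Return the k dataset texts whose peak activation of one feature is largest.

    dataset: iterable of (text, activations) pairs, where activations is a
    [num_tokens, D] tensor for that text (a hypothetical data layout).
    """
    heap = []  # min-heap of (peak_activation, text)
    for text, acts in dataset:
        f, _ = sae(acts)                        # [num_tokens, F]
        peak = f[:, feature_idx].max().item()   # strongest activation in this text
        if peak > 0:
            heapq.heappush(heap, (peak, text))
            if len(heap) > k:
                heapq.heappop(heap)
    return sorted(heap, reverse=True)
```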

34M/31164353 Golden Gate Bridge
nd (that's the huge park right next to the Golden Gate bridge), perfect. But not all people can live in
e across the country in San Francisco, the Golden Gate bridge was protected at all times by a vigilant
ar coloring, it is often> compared to the Golden Gate Bridge in San Francisco, US. It was built by the
l to reach and if we were going to see the Golden Gate Bridge before sunset, we had to hit the road, so
t it?" " Because of what's above it." "The Golden Gate Bridge." "The fort fronts the anchorage and the
34M/9493533 Brain sciences
------mjlee I really enjoy books on neuroscience that change the way I think about perception. Phanto
which brings together engineers and neuroscientists. If you like the intersection of analog, digital, h
ow managed to track it down and buy it again. The book is from the 1960s, but there are some really goo
interested in learning more about cognition, should I study neuroscience, or some other field, or is it
Consciousness and the Social Brain," by Graziano is a great place to start. ------ozy I would want a
1M/887839 Monuments and popular tourist attractions
eautiful country, a bit eerily so. The blue lagoon is stunning to look at but too expensive to bathe in
nteresting things to visit in Egypt. The pyramids were older and less refined as this structure and the
st kind of beautiful." "What about the Alamo?" "Do people..." "Oh, the Alamo." "Yeah, it's a cool place
------fvrghl I went to the Louvre in 2012, and I was able to walk up the Mona Lisa without a queue. I
you have to go to the big tourist attractions at least once like the San Diego Zoo and Sea World.---
1M/3 Transit infrastructure
lly every train line has to cross one particular bridge, which is a massive choke point. A subway or el
o many delays when we were enroute. Since the underwater tunnel between Oakland and SF is a choke poin
le are trying to leave, etc) on the approaches to bridges/tunnels and in the downtown/midtown core wher
ney ran out and plans to continue north across the aqueduct toward Wrexham had to be abandoned." "Now,
running. This is especially the case for the Transbay Tube which requires a lot of attention. If BART

While these examples suggest interpretations for each feature, more work needs to be done to establish that our interpretations truly capture the behavior and function of the corresponding features. Concretely, for each feature, we attempt to establish the following claims:

  1. When the feature is active, the relevant concept is reliably present in the context (specificity).
  2. Intervening on the feature’s activation produces relevant downstream behavior (influence on behavior).

Specificity

It is difficult to rigorously measure the extent to which a concept is present in a text input. In our prior work, we focused on features that unambiguously corresponded to sets of tokens (e.g., Arabic script or DNA sequences) and computed the likelihood of that set of tokens relative to the rest of the vocabulary, conditioned on the feature’s activation. This technique does not generalize to more abstract features. Instead, to demonstrate specificity in this work we more heavily leverage automated interpretability methods (similar to ). We use the same automated interpretability pipeline as in our previous work in the features vs. neurons section below, but we additionally find that current-generation models can now more accurately rate text samples according to how well they match a proposed feature interpretation.

We constructed a rubric for scoring how a feature’s description relates to the text on which it fires, and asked Claude 3 Opus to rate feature activations at many tokens on that rubric.
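As an illustration of what such a scoring call might look like, here is a sketch using the public anthropic Python SDK. The prompt wording and the condensed 0–3 rubric below are illustrative placeholders, not the actual rubric or prompts used in this work.

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

# Placeholder rubric; the rubric used for the paper is not reproduced here.
RUBRIC = (
    "Score 0-3 how well the highlighted text matches the feature description: "
    "0 = unrelated, 1 = vaguely related, 2 = related but imprecise, 3 = clean match."
)


def score_activation(feature_description: str, context: str, highlighted: str) -> str:
    message = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": (
                f"{RUBRIC}\n\nFeature description: {feature_description}\n"
                f"Context: {context}\nHighlighted text: {highlighted}\n"
                "Reply with a single digit."
            ),
        }],
    )
    return message.content[0].text.strip()
```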

By scoring examples of activating text, we provide a measure of specificity for each feature. (We also manually checked a number of examples to ensure they were generally handled correctly.) The features in this section are selected to have straightforward interpretations, to make automated interpretability analysis more reliable. They are not intended to be a representative sample of all features in our SAEs. Later, we provide an analysis of the interpretability of randomly sampled features. We also conduct in-depth explorations throughout the paper of many more features with interesting interpretations that are more abstract or nuanced, and thus more difficult to quantitatively assess.

Below we show distributions of feature activations (excluding zero activations) for the four features mentioned above, along with example text and image inputs that induce low and high activations. Note that these features also activate on relevant images, despite our only performing dictionary learning on a text-based dataset!

First, we study a Golden Gate Bridge feature 34M/31164353. Its greatest activations are essentially all references to the bridge, and weaker activations also include related tourist attractions, similar bridges, and other monuments. Next, a brain sciences feature 34M/9493533 activates on discussions of neuroscience books and courses, as well as cognitive science, psychology, and related philosophy. In the 1M training run, we also find a feature that strongly activates for various kinds of transit infrastructure 1M/3 including trains, ferries, tunnels, bridges, and even wormholes! A final feature 1M/887839  responds to popular tourist attractions including the Eiffel Tower, the Tower of Pisa, the Golden Gate Bridge, and the Sistine Chapel.

To quantify specificity, we used Claude 3 Opus to automatically score examples that activate these features according to the rubric above, with roughly 1000 activations of the feature drawn from the dataset used to train the dictionary learning model. We plot the frequency of each rubric score as a function of the feature’s activation level. We see that inputs that induce strong feature activations are all judged to be highly consistent with the proposed interpretation.
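The plot described here amounts to binning the sampled activations by strength and tallying rubric scores within each bin. A small illustrative aggregation in numpy (variable names are placeholders):

```python
import numpy as np


def score_frequency_by_activation(activations, scores, n_bins: int = 10):
    """Tally rubric scores (0-3) within activation-level bins.

    activations: nonzero feature activations for the sampled examples.
    scores: integer rubric scores for the same examples.
    Returns an (n_bins, 4) array of score frequencies per activation bin.
    """
    activations = np.asarray(activations)
    scores = np.asarray(scores, dtype=int)
    edges = np.quantile(activations, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.digitize(activations, edges[1:-1]), 0, n_bins - 1)
    freq = np.zeros((n_bins, 4))
    for b, s in zip(bins, scores):
        freq[b, s] += 1
    row_sums = freq.sum(axis=1, keepdims=True)
    return freq / np.maximum(row_sums, 1)  # normalize each bin to frequencies
```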