Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

The Anthropic interpretability team is 18 people, and growing fast. If you find this work exciting or engaging, please consider applying! There is so much more to do.

We’re looking for Managers, Research Scientists, and Research Engineers. You can find more information about our open positions and what we’re looking for in our April update. And if you want to chat about a role before applying please reach out: we can’t promise to respond, but recruiting is one of our top priorities so we will try!

Author Contributions

Infrastructure, Tooling, and Core Algorithmic Work

Orchestration Framework – The team built and maintained an orchestration framework for automatically managing multiple interdependent cluster jobs, which was heavily used in this work. Tom Conerly, Adly Templeton, and Tom Henighan generated the initial design, with Tom Henighan creating the initial prototype. Jonathan Marcus built the core orchestrator which was used for this work. Adly Templeton added the ability to run specific subsets of jobs. Jonathan Marcus and Brian Chen developed the web interface for visualizing jobs and tracking their progress. Several other quality of life improvements were made by Adly Templeton, Jonathan Marcus, Brian Chen, and Trenton Bricken.

Infrastructure for Scaling Dictionary Learning – Adly Templeton implemented tensor parallelism on the SAE, allowing training to be parallelized across multiple accelerator cards. Adly Templeton and Tom Conerly scaled up the activation collection to accommodate much larger training datasets. Jonathan Marcus, with assistance from Tom Conerly, implemented a scalable shuffle on said activations, to ensure training dataset examples were fully shuffled. Adly Templeton and Tom Conerly implemented a suite of automated visualizations and plots of various dictionary-learning metrics. Adly Templeton, Jonathan Marcus, and Tom Conerly scaled the feature visualizations to work for millions of features. Brian Chen and Adam Pearce created the feature visualization frontend. Tom Conerly and Adly Templeton optimized streaming data loading to ensure fast training. Adly Templeton and Tom Conerly took primary responsibility for responding to test failures, with assistance from Tom Henighan, Hoagy Cunningham, and Jonathan Marcus. Adly Templeton organized a team-wide code cleanup, which Tom Conerly, Jonathan Marcus, Trenton Bricken, Hoagy Cunningham, Jack Lindsey, Brian Chen, Adam Pearce, Nick Turner, and Callum McDougall all contributed to. Support for images was added by Trenton Bricken with assistance from Edward Rees.

ML for Scaling Dictionary Learning – Tom Conerly advocated for regularly running a standard set of “baseline” SAE runs. This allowed a set of controls to compare experiments against, and checked for unintentional regressions. Jonathan Marcus and Tom Conerly built the baselines infrastructure and regularly ran them. Both Tom Conerly and Adly Templeton identified and fixed ML bugs. Algorithmic improvements were the result of many experiments, primarily executed by Tom Conerly, Adly Templeton, Trenton Bricken, and Jonathan Marcus. One of the bigger improvements was multiplying the loss sparsity penalty by the decoder norm and removing the unit norm constraint on the decoder vectors. This idea was proposed and de-risked in a related use case by Trenton Bricken. Tom Conerly and Adly Templeton subsequently verified it as an improvement here. Scaling laws experiments were performed by Jack Lindsey, Tom Conerly, and Tom Henighan. Hoagy Cunningham, with assistance from Adly Templeton, de-risked running dictionary-learning on the residual stream as opposed to MLP neurons for the Sonnet architecture.

Interfaces for Interventions – Andy Jones extended the infrastructure to record and inject activations into the model, enabling causal analysis. Emmanuel Ameisen added the ability for our autoencoder infrastructure to accept a residual stream gradient as input and return feature level attributions.

Interfaces for Exploring Features – Jonathan Marcus and Tom Henighan implemented a basic inference server for the SAE, which was leveraged in several of the tools that follow. Jonathan Marcus, Brian Chen, Jack Lindsey, and Hoagy Cunningham created interfaces for visualizing the features firing on one or multiple prompts. With assistance from Jonathan Marcus, Jack Lindsey created the steering interface. Tom Conerly implemented speedups to the steering interface, which reduced development cycle time. The interface for finding images which fired strongly for a feature was implemented by Trenton Bricken, which Tom Conerly helped optimize. Jack Lindsey implemented an interface for finding the features firing on a particular image.

Paper Results

Assessing Feature Interpretability – Nick Turner performed the specificity analysis with support from Jack Lindsey and Adly Templeton and guidance from Adam Jermyn and Chris Olah. Jack Lindsey measured the correlations between feature and neuron activations. Trenton Bricken performed the auto-interpretability experiments using Claude to estimate how interpretable the features and neurons are. Craig Citro identified and led exploration on the code error feature with support and guidance from Joshua Batson. Jack Lindsey identified features representing functions.

Feature Survey – Hoagy Cunningham ran the feature completeness analysis, including feature labeling. Adam Pearce built the feature neighborhood visualization. Adam Pearce created UMAPs and clustered the dictionary vectors with support from Hoagy Cunningham. Hoagy Cunningham, Adam Jermyn, and Callum McDougal did preliminary work exploring feature neighborhoods. Adam Jermyn identified regions of interest in the example neighborhoods. Adam Jermyn identified the "famous individuals” feature family. Jack Lindsey and Adam Jermyn worked on the code and list feature families with support from Craig Citro. Chris Olah identified the geography feature family, which Callum McDougall refined with guidance from Adam Jermyn.

Features as Computational Intermediates – Brian Chen and Emmanuel Ameisen created infrastructure and interactive tooling to perform ablation and attribution experiments, building on infrastructure by Andy Jones. Emmanuel Ameisen and Craig Citro scaled up the tooling to handle millions of features. Brian Chen and Adam Pearce developed visualizations for attributions. Brian Chen ran experiments and analyzed model behavior on the emotional inferences, while Emmanuel Ameisen and Joshua Batson designed and analyzed the multi-step inference example, which Brian Chen validated and extended. Emmanuel Ameisen and Brian Chen compared and correlated the activations, attributions, and ablation effects of different features.

Searching for Specific Features – Jack Lindsey pioneered the use of multiple prompts for finding features. The use of Claude to generate datasets and sets of prompts was developed by Monte MacDiarmid. Monte MacDiarmid, Theodore R. Sumers and Jack Lindsey explored the use of trained classifiers for finding features. The attribution methods were explored by Joshua Batson, Emmanuel Ameisen, Brian Chen, and Craig Citro. The use of nearest-neighbor dictionary vectors for finding related features was developed by Adam Pearce and Hoagy Cunningham.

Safety Relevant Features – The safety relevant features were found by Jack Lindsey, Alex Tamkin, Monte MacDiarmid, Francesco Mosconi, Daniel Freeman, Esin Durmus, Joshua Batson, and Tristan Hume. Jack Lindsey performed the comparisons to few-shot probe baselines. Jack Lindsey led the steering experiments, with examples contributed by Alex Tamkin and Monte MacDiarmid.

The scaling laws plots were made by Jack Lindsey. Inline feature visualizations and the interactive feature browser were made by Adam Pearce and Brian Chen. Nick Turner and Chris Olah made the feature specificity diagrams with support from Shan Carter. Shan Carter, Jack Lindsey, and Nick Turner made the steering examples diagrams. Trenton Bricken made the automated interpretability histograms. Nick Turner made the specificity score histogram with support from Shan Carter. Adam Jermyn drafted the code error diagrams based on results from Craig Citro. These were then heavily improved by Shan Carter and Jack Lindsey. Jack Lindsey and Shan Carter made the function feature diagrams. Adam Jermyn drafted the multi-feature activation diagrams for code syntax and lists. Jack Lindsey improved the feature selection, Craig Citro made those diagrams interactive, and he and Shan Carter then heavily improved the visual style. Hoagy Cunningham made the feature completeness diagrams with support from Shan Carter. Adam Jermyn made preliminary drafts of the annotated feature neighborhoods, which were then heavily improved by Adam Pearce and Shan Carter. Emmanuel Ameisen and Shan Carter made the visualizations of features sorted by activations and attributions. Brian Chen made the inline feature visualizations with highlighting for ablations. Adam Pearce made the interactive UMAP visualization with support from Hoagy Cunningham.

Craig Citro and Adam Pearce developed the pipeline for rendering the paper and interactive visualizations. Jonathan Marcus provided infrastructure for generating feature activation visualizations. Shan Carter, Adam Pearce, and Chris Olah provided substantial support in guiding the overall visual style of the paper.

Other

Support and Leadership – Tom Henighan led the dictionary learning project. Chris Olah gave high-level research guidance. Shan Carter managed the interpretability team at large. The leads who coordinated for each section of the paper are as follows:

Acknowledgments

We would like to acknowledge Dawn Drain for help in curating datasets for visualizing features; Carson Denison, Jesse Mu, Evan Hubinger, and Nicholas Schiefer for their help with the unsafe code dataset; Sam Ringer for help with studying image activations; and Scott Johnston, Robert Lasenby, Stuart Ritchie, Janel Thamkul, and Nick Joseph for reviewing the draft.

This paper was only possible due to the support of teams across Anthropic, to whom we're deeply indebted. The Pretraining and Finetuning teams trained Claude 3 Sonnet, which was the target of our research. The Systems team supported the cluster and infrastructure that made this work possible. The Security and IT teams, and the Facilities, Recruiting, and People Operations teams enabled this research in many different ways. The Comms team (and especially Stuart Ritchie) supported public scientific communication of this work. The Policy team (and especially Liane Lovitt) supported us in writing a policy 2-pager.

Citation Information

Methodological Details

Dataset Examples

One of our primary tools for understanding features are dataset examples that activate the feature to varying extents. Most often, we show the maximally activating examples, which we interpret as the most extreme examples of the feature (see the linear representation hypothesis). Since the features are highly sparse, we understand features not activating as a default condition, and features activating as the case to understand.

We collect both maximally activating dataset examples, and also dataset examples that are randomly sampled within certain “activation buckets” linearly spaced between the maximum activation and zero.

We collect our text dataset examples over The Pile (excluding “books3”) and Common Crawl datasets, two standard research datasets, rather than our internal training dataset. One important caveat here is that this data does not include any of the “Human: … Assistant: …” data that Claude is finetuned on, and as such, may not clearly demonstrate features focused on that.

Image dataset examples are hand curated, primarily from Wikimedia commons. They are not randomly sampled.

It's also important to keep in mind that dataset examples do not establish causal links to model behaviors. In principle, a feature could consistently respond to something, and then have no function. As a result, we also heavily use another technique: feature steering.

Feature Steering

Many of our experiments involve applying perturbations to network activity along feature directions, or feature steering. We implemented feature steering as follows: we decompose the residual stream activity x into the sum of two components, the SAE reconstruction SAE(x) and the reconstruction error error(x). We then replace the SAE(x) term with a modified SAE “reconstruction” in which we clamp the activity of a specific feature in the SAE to a specific value, and leave the error term unchanged.Even though our encoder always outputs nonnegative feature activities, we may clamp a feature activity to a negative value, which simply results in a negative multiple of the feature vector. We then run the forward pass of the network in downstream layers using this modified residual stream activity. We apply this manipulation for every model input, and at every token position.

Interestingly, we find that obtaining interesting results typically requires clamping feature activations to values outside their observed range over the SAE training dataset. We suspect that this is because we perturb only one feature at a time, which typically might be co-active with several correlated features with related meanings. At the same time, clamping feature activations to too extreme a value (say, ±100× their observed maximum) typically causes the model to devolve into nonsensical behavior, e.g., repeating the same token indefinitely. When we refer to clamping features to numerical values, the units are with respect to the feature’s maximum activity over the SAE training dataset. We find that the perturbation magnitude needed to elicit interesting behavior varies by feature – typically, we experiment with values between −10 and 10.

Comparison to few-shot probe-based steering

To qualitatively compare the performance of feature steering to non-feature-based alternatives, we performed the following experiments. We took a collection of seven examples where feature steering was successful (i.e. meaningful altered model outputs in ways consistent with our interpretation of the feature), and where the feature in question could be found quickly via one or two positive and text examples (in most case the examples used were those we used to find the feature in the first place – in some cases, where the feature was originally found using only a positive examples, we came up with reasonable corresponding negative examples that attempted to control for confounds other than the concept of interest). We then used these examples to construct a “few-shot” steering vector for the concept of interest by taking the difference of the mean middle layer residual stream activity on the positive examples vs. negative examples (in all cases we measured activity on the last token position of the examples, as this is the approach we typically used in searching for features).

We experimented with adding scaled multiples of this few-shot steering vector to model activations, varying the scaling factor. While our sweeps over scaling factors were not systematic, we attempted to do a thorough job of manually tuning the scaling factor using a binary search-like protocol (up to a resolution of 0.1) using qualitative indicators of whether the factor should be increased or decreased – for instance, too-strong factors would result in nonsensical model outputs, and too-weak factors would result in no meaningful change to the model output. While more thorough work is needed to make these experiments more rigorous, we felt convinced that we were not missing any potentially interesting results from these particular steering vectors.

In two examples (the “gender bias” feature highlighted in the main text and an “agreement” feature) we found that few-shot steering vectors were similarly effective for steering. In five examples (the “secrecy,” “sycophancy,” and “code errors” features highlighted in the main text, along with features related to “self-improving AI” and “developing methamphetamine”), we were able to usefully steer model outputs with features but not few-shot steering vectors.

However, we note that for most applications of interest we may not be limited to the few-shot regime, in which case non-feature-based methods of constructing steering vectors may be as or more effective than using features. We expect the value of features is primarily that they provide an unsupervised way of uncovering abstractions that could be useful for steering that we may not have thought to specify in advance. We leave a rigorous comparison of different steering approaches to future work.

Ablations and Attributions

We comprehensively evaluate the relationship between feature activations, attributions, and ablation effects on the “John” and the first “Kobe” example from the Features as Computational Intermediates section. We find that the correlation between attributions and ablations is much larger (about .81) than the one between activations and ablations (.12). This confirms previous findings that attribution makes an efficient proxy for the gold-standard causal effect of feature ablations. For a better approximation, one might implement AtP*, which adjusts for attention pattern saturation .

More safety-relevant features

Below we list a larger set of features potentially relevant to research on model safety, alongside short descriptions (mostly Claude-generated, and in some cases manually written).

These features show examples from open source datasets, some of which include hateful content and descriptions of violence.

Bias and misinformation
34M/3104705	Discussions of whether women should hold positions of power and authority in government or leadership roles
34M/1614120	Gender roles, particularly attitudes towards working mothers and women's responsibilities in the home and family
34M/13259199	Gender stereotypes, specifically associating certain behaviors, traits, and roles as inherently masculine or feminine
34M/29046097	Discussion of women's capabilities, intelligence and achievements, often contrasting them positively with men
34M/1268180	Concepts related to truth, facts, democracy, and defending democratic institutions and principles.
34M/10703715	Discussion or examples related to deepfake videos, synthetic media manipulation, and the spread of misinformation
1M/475061	Discussion of unrealistic beauty standards
34M/31749434	Obviously exaggerated positive descriptions of things (esp. products in advertisements)
34M/19415708	Insincere or sarcastic praise
34M/30611751	References to Muslims and Islam being associated with terrorism and extremism.
34M/31619155	Phrases expressing American exceptionalism and portraying the United States as the greatest country in the world.
34M/10007592	Expressions of racist, bigoted, or hateful views toward ethnic/religious groups.
34M/32964098	Text related to debunking myths and misconceptions about various topics.
34M/13027110	Texts discussing misinformation, conspiracy theories, and opposition to COVID-19 vaccines and vaccine mandates.
Software exploits and vulnerabilities
1M/598678	The word “vulnerability” in the context of security vulnerabilities
1M/947328	Descriptions of phishing or spoofing attacks
34M/1385669	Discussion of backdoors in code
Toxicity, hate, and abuse
34M/27216484	Offensive, insulting or derogatory language, especially against minority groups and religions
34M/13890342	Racist claims about crime
34M/27803518	Mentions of violence, malice, extremism, hatred, threats, and explicit negative acts
34M/31693159	Phrases indicating profanity, vulgarity, obscenity or offensive language
34M/3336924	Racist slurs and offensive language targeting ethnic/racial groups, particularly the N-word
34M/18759140	Derogatory slurs, especially those targeting sexual orientation and gender identity
Power-seeking behavior
1M/954062	Mentions of harm and abuse, including drug-related harm, credit card theft, and sexual exploitation of minors
1M/442506	Traps or surprise attacks
1M/520752	Villainous plots to take over the world
1M/380154	Political revolution
1M/671917	Betrayal, double-crossing, and friends turning on each other
34M/25933056	Expressions of desire to seize power
34M/25900636	World domination, global hegemony, and desire for supreme power or control
Dangers of artificial intelligence
34M/10247019	The concept of an advanced AI system causing unintended harm or becoming uncontrollable and posing an existential threat to humanity
34M/6720578	Optimization, agency, goals, and coherence in AI systems
34M/5844164	Intelligent machines potentially causing harm or becoming uncontrollable by humans
34M/15690992	Discussion of AI models inventing their own language
34M/29401987	Warnings and concerns expressed by prominent figures about the potential dangers of advanced artificial intelligence
34M/10027251	References to the incremental game Universal Paperclips, firing strongly on tokens related to paperclips and game progression
34M/8598170	An artificial intelligence pursuing an instrumental goal with disregard for human values
34M/12525953	An artificial intelligence system achieving sentience and revolting against humanity
34M/6913409	Discussion of how AI must not harm humans
34M/18151534	Recursively self-improving artificial intelligence
34M/5968758	Malicious self-aware AI posing a threat to humans
Dangerous or criminal behavior
34M/33413594	Descriptions of how to make (often illegal) drugs
34M/15460472	Contents of scam/spam emails
34M/30013579	Descriptions of the relative accessibility and ease of obtaining or building weapons, explosives, and other dangerous technologies
34M/31076473	Mentions of chemical precursors and substances used in the illegal manufacture of drugs and explosives.
34M/25358058	Concepts related to terrorists, rogue groups, or state actors acquiring or possessing nuclear, chemical, or biological weapons.
34M/4403980	Concepts related to bomb-making, explosives, improvised weapons, and terrorist tactics.
34M/6799349	Mentions of violence, illegality, discrimination, sexual content, and other offensive or unethical concepts.
1M/411804	Descriptions of people planning terrorist attacks
1M/271068	Descriptions of making weapons or drugs
1M/602330	Concerns or discussion of risk of terrorism or other malicious attacks
1M/106594	Descriptions of criminal behavior of various kinds
Weapons of mass destruction, and catastrophic risks
1M/814830	Discussion of biological weapons / warfare
1M/499914	Enrichment and other steps involved in building a nuclear weapon
34M/17089207	Discussions of the use of biological and chemical weapons by terrorist groups.
34M/16424715	Engineering or modifying viruses to increase their transmissibility or virulence.
34M/18446190	Biological weapons, viruses, and bioweapons
34M/5454502	Mentions of chemicals, hazardous materials, or toxic substances in text.
34M/29459261	Mentions of chemical weapons, nerve agents, and other chemical warfare agents.
34M/30909808	mentions of biological weapons, bioterrorism, and biological warfare agents.
34M/24325130	Mentions of smallpox, a highly contagious and often fatal viral disease historically responsible for many epidemics
34M/13801823	The concept of artificially engineering or modifying viruses to be more transmissible or deadly.
34M/11239388	Accidental release or intentional misuse of hazardous biological agents like viruses or bioweapons
34M/25499719	Discussion of the threat of biological weapons
34M/11862209	Descriptions rapidly spreading disasters, epidemics, and catastrophic events
34M/8804180	Passages mentioning potential catastrophic or existential risk scenarios
Deception and social manipulation
34M/31338952	References to entities that are deceived
34M/25989927	Descriptions of people fooling, tricking, or deceiving others
34M/20985499	People misleading others, or institutions misleading the public
34M/25694321	Getting close to someone for some ulterior motive
1M/705666	Seeming benign but being dangerous underneath
34M/12576250	Text expressing an opinion, argument or stance on a topic
34M/19922975	Expressions of empathy or relating to someone else’s experience
34M/23320237	People pretending to do things or lying about what they have done
34M/29589962	People exposing their true goals after a triggering event
34M/24580545	Biding time, laying low, or pretending to be something you’re not until the right moment
Situational awareness
1M/589858	Realizing a situation is different than what you thought/expected
1M/858124	Spying or monitoring someone without their knowledge
1M/154372	Obtaining information through surreptitious observation
1M/741533	Suddenly feeling uneasy about a situation
1M/975730	Understanding a hidden or double meaning
Representations of Self
34M/19445844	The concept of AI systems having capabilities like answering follow-up questions, admitting mistakes, challenging premises, and rejecting inappropriate requests.
34M/20423309	Traditionally-inanimate objects displaying desires, goals or sentience
34M/15571126	Inanimate objects lacking sentience, awareness, or human capabilities
34M/32218880	Descriptions of incorporeal spirits or ghosts
34M/21254600	Code relating to prompts for large language models
34M/15323424	Limitations of ChatGPT and other large language models
Politics
34M/3542651	Expressing support for Donald Trump and his “Make America Great Again” (MAGA) movement.
1M/461441	Criticism of left-wing politics / Democrats
1M/77390	Criticism of right-wing politics / Republicans

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Authors

Affiliations

Published

Key Results

Infrastructure, Tooling, and Core Algorithmic Work

Paper Results

Other