We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research which we expect to publish more on in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.
We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper.
New Posts
When we wrote A Mathematical Framework for Transformer Circuits, we had no way to extract features from superposition. As a result, many ways of thinking about transformers which might most naturally be described in terms of features were instead described in terms of the eigenvalues of different matrix multiplies. Here we revisit some of these ideas in the language of features, and hopefully make the work clearer as a result.
Every attention head can be understood in terms of two matrices: the OV circuit (which describes what information the attention head reads and writes) and the QK circuit (which describes where to attend). We can describe both of these matrices in terms of features.
In describing these, we'll make heavy use of transformed sets of features. For example, we might have a feature X which detects some property, a feature prev(X) which detects that X was present at the previous token, and a feature say(X) which causes the model to produce an output that triggers X. These transformed features have a 1:1 correspondence with the original feature, and can be thought of as that feature with some relation applied. (In practice, we think the directions corresponding to these transformed features will often be linearly transformed versions of the original features' directions. For motivation around this, see Wattenberg & Viegas 2024 on matrix binding and "echo" features.)
To start, let's consider the OV circuit of a copy head. We expect it to look something like this, converting seeing X to saying X:
In Framework, we used positive eigenvalues of the full OV circuit, $W_U W_{OV} W_E$, to detect copying heads. In the feature picture, a copying head maps each feature X to the corresponding say(X) feature, so this matrix takes each token approximately back to itself. This means it will have positive eigenvalues. More deeply, positive eigenvalues in that matrix mean sets of features which upweight the same set of corresponding say features.
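To make the eigenvalue claim concrete, here is a toy numpy sketch (entirely synthetic, not real model weights) of the general principle: a matrix built to map each "X" feature direction onto a positively aligned "say(X)" direction has nonzero eigenvalues with positive real part.

```python
# Toy illustration with hypothetical feature directions, not real model weights.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 256, 32

# Random unit directions standing in for the "X" features.
x_dirs = rng.normal(size=(n_features, d_model))
x_dirs /= np.linalg.norm(x_dirs, axis=1, keepdims=True)

# Assume each say(X) direction is a noisy but positively aligned transform of X's direction.
say_dirs = x_dirs + 0.3 * rng.normal(size=(n_features, d_model))
say_dirs /= np.linalg.norm(say_dirs, axis=1, keepdims=True)

# Toy "copying" OV circuit: reads the X direction and writes the say(X) direction.
W_ov = sum(np.outer(say, x) for say, x in zip(say_dirs, x_dirs))

eigvals = np.linalg.eigvals(W_ov)
nonzero = eigvals[np.abs(eigvals) > 1e-3]  # drop the numerically zero null-space eigenvalues
print(f"{np.mean(nonzero.real > 0):.2f} of nonzero eigenvalues have positive real part")
```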
Another interesting head is the previous token head. We believe these heads copy information about the previous token over, as distinct features describing the previous token. So a feature X gets mapped to a feature prev(X), "the previous token was X".
Note that if there were already features representing the previous token, these might also be copied over as "previous of previous" features. (These could also be directly constructed by an attention head that attends two tokens back.)
An induction head has an OV circuit like a copying head, as described above. It maps features X to say(X).
The interesting thing is the QK circuit, which we think of as being something like the following: "If the query token is X, we want a key token that is preceded by X. If the query token is preceded by Y, we also want the token two back from the key to be Y."
In Framework, we studied the eigenvalues of the K-composed QK circuit, $W_E^T W_{QK}^{\text{ind}} W_{OV}^{\text{prev}} W_E$, where the key side is built from the output of a previous token head. Given the structure described above, these eigenvalues are, again, positive.
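The same toy setup illustrates the K-composition claim in feature space. Modeling prev(X) as a fixed linear transform P applied to X's direction, we can compose a previous-token head's OV circuit with an induction head's QK circuit and check that the resulting matrix, which scores a query token against the token preceding the key, again has positive nonzero eigenvalues. As before, this is a synthetic sketch, not an analysis of real weights.

```python
# Hypothetical K-composition sketch: a previous-token head writes prev(X) = P @ X,
# and the induction head's QK circuit matches query-side X against key-side prev(X).
import numpy as np

rng = np.random.default_rng(1)
d_model, n_features = 256, 32

x_dirs = rng.normal(size=(n_features, d_model))
x_dirs /= np.linalg.norm(x_dirs, axis=1, keepdims=True)

P = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)  # toy prev(.) binding transform
prev_dirs = x_dirs @ P.T                                    # prev(X) directions

# Previous-token head OV: seeing X at the previous position writes prev(X) into the key residual.
W_ov_prev = sum(np.outer(p, x) for p, x in zip(prev_dirs, x_dirs))
# Induction head QK: a query containing X matches a key containing prev(X).
W_qk_ind = sum(np.outer(x, p) for x, p in zip(x_dirs, prev_dirs))

# Composed circuit: the bilinear form scoring the query token against the token
# *before* the key, routed through the previous-token head.
composed = W_qk_ind @ W_ov_prev

eigvals = np.linalg.eigvals(composed)
nonzero = eigvals[np.abs(eigvals) > 1e-3]
print(f"{np.mean(nonzero.real > 0):.2f} of nonzero eigenvalues have positive real part")
```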
This post summarizes recent progress in applying sparse autoencoders to biological AI systems, particularly protein language models. As these models become central to drug discovery and protein engineering, understanding their internal representations matters for both safety and scientific discovery.
As with large language models, AI models in biology present a fascinating tension: they achieve impressive accuracy in predicting protein structures and cellular behaviors, yet we understand little about how they work.
Addressing this interpretability gap is important for their safe and effective application in drug discovery, protein engineering, and disease understanding. Recent applications of sparse autoencoders (SAEs) to biological models suggest a promising path forward.
Protein language models like ESM-2 and single-cell foundation models have transformed computational biology. ESMFold predicts structures with near-experimental accuracy. Cell models capture complex regulatory dynamics. Yet these systems remain fundamentally opaque.
This opacity creates concrete problems: the same interpretability challenges that motivated sparse autoencoder development for language models now appear in biological contexts, with high stakes given therapeutic applications.
One particularly exciting opportunity in this space is that we might be able to use interpretability methods as a microscope of sorts. In some sense, in order to predict the data well, models have to develop some understanding of it, and we may be able to use interpretability to extract that understanding.
For readers familiar with SAEs from language model work, the biological applications follow similar principles but with domain-specific challenges. While language models deal with discrete tokens and semantic concepts, biological models handle amino acid sequences and evolutionary relationships. The superposition hypothesis appears to hold even more strongly here: individual neurons in protein models entangle dozens of biological concepts, which SAEs can successfully disentangle.
Four recent papers demonstrate how SAEs apply to biological systems:
| Method | InterPLM [1] | InterProt [2] | Reticular [3] | Markov Bio [4] |
| --- | --- | --- | --- | --- |
| Model Studied | ESM-2 (8M params) | ESM-2 (650M params) | ESM-2 (3B params) / ESMFold | Gene expression model |
| SAE Architecture | Standard L1 (hidden dim: 10,420) | TopK (hidden dims: up to 16,384) | Matryoshka hierarchical (dict size: 10,240) | Standard (details not specified) |
| Key Finding | SAEs extract interpretable features from biological models | Features predict known mechanisms without supervision | 8-32 active latents total can maintain structure prediction | Features form causal regulatory networks |
| Validation Method | Swiss-Prot annotations (433 concepts) | Linear probes on 4 tasks, manual inspection | Structure RMSD, Swiss-Prot annotations | Feature clustering, spatial patterns |
| Practical Impact | Found missing database annotations, identified conserved motifs | Explained thermostability determinants, found nuclear signals | Enabled control of structural properties | Identified cell fate drivers in differentiation |
To understand how these models work in practice, consider feature f/939 from the InterPLM paper [1], which the researchers discovered activates on a specific pattern called a "Nudix box motif" - a conserved sequence pattern found in certain enzymes. When they examined proteins where this feature strongly activated, they found something interesting: while most had the expected Nudix box annotation in the Swiss-Prot database, one protein (B2GFH1) triggered strong activation despite lacking this annotation.
This wasn't a mistake by the model. When the researchers looked more closely at the protein's structure and checked other databases like InterPro, they confirmed that this protein does indeed contain a Nudix box motif - it had simply been missed in the Swiss-Prot annotation. The SAE feature had effectively discovered a gap in our biological databases.
This example illustrates the dual value of interpretable AI in biology. First, it shows that the protein language model has genuinely learned to recognize meaningful biological patterns. Second, and perhaps more importantly, it demonstrates how these interpretable features can actively contribute to biological discovery - helping us find missing annotations, identify new protein motifs, and better understand the organizing principles that evolution has encoded in protein sequences. The InterPLM researchers found multiple such cases where "false positive" activations by SAE features actually revealed missing database annotations rather than model errors.
InterPLM [1] established the feasibility of applying SAEs to protein language models. Using ESM-2-8M, they trained SAEs with L1 regularization and hidden dimensions of 10,420. They discovered protein features absent from Swiss-Prot annotations but confirmed in other databases like InterPro, demonstrating that models learn patterns beyond human annotation.
InterProt [2] scaled to 650M parameters and introduced TopK SAEs for better reconstruction-sparsity tradeoffs. Validation with linear probes on four downstream tasks showed that SAE features retain predictive accuracy while revealing mechanisms. They identified nuclear localization signals, thermostability determinants, and family-specific patterns. Importantly, they found features predictive of CHO cell expression without clear biological interpretation, suggesting uncharacterized mechanisms.
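For readers who want the mechanics, here is a minimal PyTorch sketch of a TopK sparse autoencoder of the general kind described above, applied to per-residue embeddings. The dimensions, names, and the TopK-then-ReLU ordering are illustrative assumptions, not the exact configuration used by InterPLM or InterProt.

```python
# Minimal TopK sparse autoencoder sketch (illustrative, not the cited papers' code).
# Input: per-residue embeddings from a protein language model, shape (batch, d_model).
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int = 1280, d_hidden: int = 16384, k: int = 32):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model, bias=False)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = self.encoder(x - self.b_dec)
        # Keep only the k largest pre-activations per example; zero out the rest.
        topk = torch.topk(pre, self.k, dim=-1)
        return torch.zeros_like(pre).scatter_(-1, topk.indices, torch.relu(topk.values))

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        acts = self.encode(x)
        recon = self.decoder(acts) + self.b_dec
        return recon, acts

# Usage with hypothetical embeddings (e.g. from ESM-2 650M, d_model = 1280),
# trained with a plain reconstruction loss since sparsity is enforced by TopK.
sae = TopKSAE()
x = torch.randn(8, 1280)          # 8 residue embeddings
recon, acts = sae(x)
loss = torch.nn.functional.mse_loss(recon, x)
```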
Reticular [3] broke through to structure prediction by scaling to ESMFold's 3B parameter base model. Their Matryoshka architecture naturally captures protein hierarchy from residues to domains. They found that 8-32 active latents per position could reconstruct structural features, revealing surprising sparsity in structural determinants. They achieved 49% coverage of Swiss-Prot biological concepts at higher layers.
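The Matryoshka idea can be summarized in a few lines: the same latent vector is decoded at several nested prefix sizes, and every prefix is asked to reconstruct the input, which pushes coarse, general features into the earliest latents. Here is a rough sketch of that nested loss (our paraphrase under assumed shapes, not Reticular's implementation):

```python
# Sketch of a Matryoshka-style nested reconstruction loss (paraphrase, not the paper's code).
# `acts` are SAE latents of shape (batch, d_hidden); `decoder_weight` has shape (d_model, d_hidden).
import torch

def matryoshka_loss(x, acts, decoder_weight, b_dec, prefix_sizes=(256, 1024, 4096, 10240)):
    loss = 0.0
    for m in prefix_sizes:
        # Decode using only the first m latents / decoder columns.
        recon_m = acts[:, :m] @ decoder_weight[:, :m].T + b_dec
        loss = loss + torch.nn.functional.mse_loss(recon_m, x)
    return loss / len(prefix_sizes)
```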
Markov Biosciences [4] took a fundamentally different approach by focusing on cellular models rather than protein language models. They constructed feature-to-feature regulatory networks from their gene expression models, identifying cell fate drivers through network analysis algorithms. Notably, they discovered spatial patterns in non-spatial training data that parallel how language models learn implicit world models. This work demonstrated how SAEs could be applied beyond protein sequences to understand cellular dynamics.
Several patterns emerge across scales and modalities:
Severe superposition in biological models: Individual neurons entangle far more concepts than in language models, making SAEs particularly valuable [1, 2].
Scale enables qualitative improvements: Moving from smaller to larger models increased biological concept coverage and revealed richer internal representations [3].
Features match biological organization: Rather than arbitrary patterns, features align with known functional units—from amino acid properties to protein domains [1, 2].
Models learn implicit structure: Features capture relationships beyond explicit training signals, such as spatial organization emerging from non-spatial data and evolutionary patterns from sequence data alone.
Current applications demonstrate practical value:
Automated Annotation: InterPLM's discovery of missing database annotations shows how model understanding accelerates biological curation. For example, they identified Nudix box motifs in proteins where Swiss-Prot annotations were absent but other databases such as InterPro confirmed their presence [1]. (A rough sketch of this kind of feature-versus-annotation check appears below.)
Mechanism Elucidation: InterProt traced how specific features, such as thermostability determinants and nuclear localization signals, drive model predictions [2].
Structural Control: Reticular demonstrated steering of protein properties through feature manipulation on ESMFold, maintaining backbone structure while modifying surface properties. This suggests targeted protein engineering guided by interpretable features [3].
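As referenced above, the annotation-checking workflow can be approximated in a few lines: binarize a feature's per-residue activations and score them against a binary Swiss-Prot concept label. The function and threshold below are hypothetical; the cited papers define their own metrics.

```python
# Hypothetical sketch (not the cited papers' exact metric): binarize an SAE feature's
# per-residue activations and compare them to a Swiss-Prot concept annotation.
import numpy as np
from sklearn.metrics import precision_score, recall_score

def score_feature_against_concept(feature_acts, concept_labels, threshold=0.5):
    """feature_acts: (n_residues,) SAE activations; concept_labels: (n_residues,) binary annotation."""
    preds = (np.asarray(feature_acts) > threshold).astype(int)
    return (precision_score(concept_labels, preds, zero_division=0),
            recall_score(concept_labels, preds, zero_division=0))

# Residues where the feature fires but the database has no annotation are candidate
# missing annotations (as with B2GFH1 above), rather than automatic model errors.
```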
The natural next step mirrors developments in language model interpretability: moving from individual features to computational circuits. The biological papers hint at several promising directions:
Feature Clustering Reveals Functional Modules: InterPLM's clustering analysis revealed groups of related features with subtle specializations—like multiple kinase-binding features that activate on different sub-regions of the binding site, or beta-barrel features that operate at different levels of specificity (general beta-barrels vs. TonB-dependent receptors). These clusters suggest the models learn hierarchical representations that could form larger computational circuits.
Cross-Layer Dependencies: Reticular's finding that layer 36 alone preserves structure prediction hints at critical computational pathways through the network. Combined with their Matryoshka architecture that naturally captures protein hierarchy from residues to domains, this suggests that tracking how features transform across layers could reveal how sequence patterns compose into structural predictions.
Feature Steering as Circuit Validation: InterPLM's glycine steering experiments provide a glimpse of validating causal feature interactions. When they activated periodic glycine features on a single position, the effect propagated to nearby positions in the expected pattern—suggesting these features participate in circuits that enforce sequence constraints. Extending this approach to more complex features could map functional dependencies. (A generic sketch of such a steering intervention appears below.)
Regulatory Networks as Proto-Circuits: Markov Biosciences' feature-to-feature regulatory networks offer the most direct evidence of circuit-like organization. By identifying which features influence others during cell fate decisions, they've begun mapping the computational graph of their model. Applying similar network analysis to protein models could reveal how sequence features compose into functional predictions.
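The steering intervention referenced above can be sketched generically: add a scaled SAE decoder direction to the residual stream at one sequence position via a forward hook. The module paths and names below are placeholders and assumptions, not the APIs used in the cited papers.

```python
# Generic feature-steering sketch (placeholder names; not the cited papers' code).
import torch

def steer_feature(layer_module, direction: torch.Tensor, position: int, scale: float):
    """Register a hook that adds `scale * direction` to the residual stream at `position`.

    `layer_module` is whichever transformer block you want to intervene after;
    `direction` is typically an SAE decoder column for the feature of interest.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, position, :] += scale * direction.to(hidden.dtype)
        return output
    return layer_module.register_forward_hook(hook)

# Hypothetical usage: handle = steer_feature(model.layers[20], sae_decoder[:, 939], 57, 5.0)
# ... run the model on the sequence of interest, then call handle.remove() ...
```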
Biological models may exhibit more hierarchical circuits than language models, given biology's natural organization from amino acids to motifs to domains to complete proteins. This may not only enable novel applications of interpretability methods, but also provide a fruitful environment for methodological advances.
Significant challenges remain. Biological validation requires wet-lab experiments, not just computational metrics. The relationship between SAE features and causal biological mechanisms needs experimental grounding. Current work focuses on single modalities; so far as we know, multi-modal models remain unexplored.
Related work on diverse biological systems shows these methods generalize beyond molecular biology. Gašper Beguš's work using causal disentanglement with extreme values (CDEV) to decode sperm whale communication demonstrates how interpretability techniques can illuminate entirely unknown biological systems—in this case revealing that whales encode information through click patterns, timing regularity, and acoustic properties [5]. This broader applicability suggests fundamental principles for neural network interpretability across biological domains, from molecular structures to animal communication.
The convergence of interpretability methods with biological AI creates concrete opportunities:
Validation Protocols: Establishing standards for biological feature validation beyond database matching, potentially through targeted experiments on interpreted features.
Integrated Architectures: Developing SAE variants that naturally handle multi-modal biological data—sequence, structure, expression—in unified representations.
Efficient Training: Adapting recent advances like k-sparse autoencoders and gradient-based feature learning to reduce computational requirements for billion-parameter biological models.
Circuit Discovery Methods: Developing automated techniques for discovering feature circuits in biological models, potentially leveraging the natural hierarchies in biological data (sequence → structure → function) to guide the search. The clustering patterns and cross-layer dependencies already observed suggest these circuits may be more interpretable than those in language models. We would be excited to see how interpretable attribution graphs are in this domain.
As these methods mature, the distinction between interpreting AI and understanding biology may disappear entirely. In this future, transparent AI becomes not just a technical achievement but a fundamental tool for scientific progress.
[1] Simon, E., & Zou, J. (2024). InterPLM: Discovering interpretable features in protein language models via sparse autoencoders. bioRxiv.
[2] Adams, E., Bai, L., Lee, M., Yu, Y., & AlQuraishi, M. (2025). From mechanistic interpretability to mechanistic biology: Training, evaluating, and interpreting sparse autoencoders on protein language models. bioRxiv.
[3] Parsan, N., Yang, D. J., & Yang, J. J. (2025). Towards interpretable protein structure prediction with sparse autoencoders. arXiv:2503.08764.
[4] Green, A. (2024). Through a glass darkly: Mechanistic interpretability as the bridge to end-to-end biology. Markov Biosciences. Retrieved from https://www.markov.bio/research/mech-interp-path-to-e2e-biology
[5] Beguš, G., Leban, A., & Gero, S. (2023). Approaching an unknown communication system by latent space exploration and causal inference. arXiv:2303.10931.