Transformer Circuits Thread

Articles

March 2024

Circuits Updates — March 2024

A collection of small updates from the Anthropic Interpretability Team.

Reflections on Qualitative Research

Some opinionated thoughts on why qualitative aspects may be more central to interpretability research than we're used to in other fields.
February 2024

Circuits Updates — February 2024

A collection of small updates from the Anthropic Interpretability Team.
January 2024

Circuits Updates — January 2024

A collection of small updates from the Anthropic Interpretability Team.
October 2023

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.
July 2023

Circuits Updates — July 2023

A collection of small updates from the Anthropic Interpretability Team.
May 2023

Circuits Updates — May 2023

A collection of small updates from the Anthropic Interpretability Team.

Interpretability Dreams

Our present research aims to create a foundation for mechanistic interpretability research. In doing so, it's important to keep sight of what we're trying to lay the foundations for.

Distributed Representations: Composition & Superposition

An informal note on how "distributed representations" might be understood as two different, competing strategies — "composition" and "superposition" — with quite different properties.
March 2023

Privileged Bases in the Transformer Residual Stream

Our mathematical theories of the Transformer architecture suggest that individual coordinates in the residual stream should have no special significance, but recent work has shown that this prediction is false in practice. We investigate this phenomenon and provisionally conclude that the per-dimension normalizers in the Adam optimizer are to blame for the effect.
January 2023

Superposition, Memorization, and Double Descent

Despite overfitting being a central problem, we have little mechanistic understanding of how deep learning models overfit to their training data. Here we extend our previous work on toy models to shed light on how models generalize beyond their training data.
September 2022

Toy Models of Superposition

Neural networks often seem to pack many unrelated concepts into a single neuron, a puzzling phenomenon known as "polysemanticity". In our latest interpretability work, we build toy models where the origins and dynamics of polysemanticity can be fully understood.
June 2022

Softmax Linear Units

An alternative activation function increases the fraction of neurons which appear to correspond to human-understandable concepts.

Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases

An informal note on intuitions related to mechanistic interpretability.
March 2022

In-Context Learning and Induction Heads

An exploration of the hypothesis that induction heads are the primary mechanism behind in-context learning. We also report the existence of a previously unknown phase change in transformer language models.
paper
December 2021

A Mathematical Framework for Transformer Circuits

Our early mathematical framework for reverse engineering models, demonstrated by reverse engineering small toy models.
paper

Exercises

Some exercises we've developed to improve our understanding of how neural networks implement algorithms at the parameter level.
note, exercises

Videos

Very rough informal talks as we search for a way to reverse engineer transformers.
links, videos

PySvelte

One approach to bridging Python and web-based interactive diagrams for interpretability research.
github link, infrastructure

Garcon

A description of our tooling for doing interpretability on large models.
note, infrastructure
March 2020 - April 2021

Original Distill Circuits Thread

Our exploration of Transformers builds heavily on the original Circuits thread on Distill.

About the Transformer Circuits Thread Project

Can we reverse engineer transformer language models into human-understandable computer programs? Inspired by the Distill Circuits Thread, we're going to try.

We think interpretability research benefits a lot from interactive articles (see Activation Atlases for a striking example). Previously we would have submitted to Distill, but with Distill on hiatus, we're taking a page from David Ha's approach of simply creating websites (e.g. World Models) for research projects.

As part of our effort to reverse engineer transformers, we've created several other resources besides our papers which we hope will be useful. We've collected them on this website, and may add future content here, or even collaborations with other institutions.