In these monthly updates we report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research that we expect to publish more on in the coming months. Others are minor points we wish to share, since we're unlikely to ever write a paper about them.
We'd ask you to treat these results like those of a colleague sharing some thoughts or preliminary experiments for a few minutes at a lab meeting, rather than a mature paper.
New Posts
In On the Biology of a Large Language Model, we studied how similar text translated into different languages is represented in language models. We found that paired translations activate many of the same features.
Recently, we revisited those results and found a curious phenomenon: the similarity in active features between paired translations (as measured by intersection-over-union, IoU) increases with sample length.
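As a concrete illustration, the IoU metric over sets of active features can be sketched as follows. This is a minimal sketch: the feature indices here are toy values, not real model features.

```python
def feature_iou(features_a: set[int], features_b: set[int]) -> float:
    """Intersection-over-union of two sets of active feature indices."""
    union = features_a | features_b
    if not union:
        return 0.0
    return len(features_a & features_b) / len(union)

# Toy example: two samples sharing 2 of 4 distinct features.
english_features = {3, 17, 42}
french_features = {3, 17, 99}
print(feature_iou(english_features, french_features))  # 0.5
```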
The upward trend with sample length suggests one of two explanations: (1) the IoU metric increases mechanically with sample length, independent of content; or (2) the model builds a richer, more language-agnostic understanding of the text as the context grows.
On priors, we suspected explanation (2), because so many of the original findings aligned with our intuitions about models and language. But, just to be sure, we decided to investigate further.
We calculated the IoU score for the first and last sentences in paired English/French paragraphs. If (1) were the cause, the two scores should be similar, since average sentence length does not vary across the context; if (2) were the cause, the last sentence should show a higher IoU score than the first.
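The first-vs-last comparison can be sketched as below. Note that `active_features` is a hypothetical stand-in for whatever procedure maps a piece of text to its set of active feature indices; it is not a real API.

```python
def iou(a: set[int], b: set[int]) -> float:
    """Intersection-over-union of two sets of feature indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def first_last_iou_difference(en_sentences, fr_sentences, active_features):
    """IoU of the last sentence pair minus IoU of the first sentence pair.

    en_sentences / fr_sentences: one paired paragraph, split into sentences.
    active_features: hypothetical callable mapping text to a set of active
    feature indices. A positive result is what explanation (2) predicts.
    """
    first = iou(active_features(en_sentences[0]), active_features(fr_sentences[0]))
    last = iou(active_features(en_sentences[-1]), active_features(fr_sentences[-1]))
    return last - first
```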
The distribution of differences (last minus first) is shown below. It is skewed strongly to the right, meaning the final sentence typically has a higher IoU score than the first. This is what explanation (2) predicts and is contrary to (1). We think this is evidence in favor of the theory that the model simply has a richer understanding later in the context.
We also studied a baseline for this experiment where we take unrelated samples from the two languages. That is, we compare unrelated first sentences from English with unrelated first sentences from French, and likewise for last sentences. If (1) were occurring these should again be similar, whereas if (2) is occurring we should expect more overlap between unrelated first sentences than unrelated last sentences.
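The unrelated-pairs baseline can be sketched in the same style. Again this is hypothetical: to keep the sketch deterministic, we break the translation pairing by rotating one language's paragraphs rather than shuffling them.

```python
def iou(a: set[int], b: set[int]) -> float:
    """Intersection-over-union of two sets of feature indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def baseline_iou_difference(en_paragraphs, fr_paragraphs, active_features):
    """Mean first-sentence IoU minus mean last-sentence IoU over unrelated pairs.

    Rotating fr_paragraphs by one breaks the translation pairing, so each
    English paragraph is compared with an unrelated French one. Explanation (2)
    predicts a positive result: generic early-context features should overlap
    more than content-specific late-context features.
    """
    mismatched_fr = fr_paragraphs[1:] + fr_paragraphs[:1]
    firsts, lasts = [], []
    for en, fr in zip(en_paragraphs, mismatched_fr):
        firsts.append(iou(active_features(en[0]), active_features(fr[0])))
        lasts.append(iou(active_features(en[-1]), active_features(fr[-1])))
    return sum(firsts) / len(firsts) - sum(lasts) / len(lasts)
```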
The distribution, shown below, skews to the left, meaning there's more overlap between unrelated first sentences than between unrelated last sentences. We think this is again consistent with a story where the scaling with context length is about the model developing a richer understanding.