Transformer Circuits Thread

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

Authors

Kit Fraser-Taliente*, Subhash Kantamneni*‡, Euan Ong*, Dan Mossing, Christina Lu, Paul C. Bogdan
Emmanuel Ameisen, James Chen, Dzmitry Kishylau, Adam Pearce, Julius Tarng, Alex Wu, Jeff Wu, Yang Zhang, Daniel M. Ziegler
Evan Hubinger, Joshua Batson, Jack Lindsey, Samuel Zimmerman, Samuel Marks

Affiliations

Anthropic

Published

May 7, 2026
* Equal contribution, author order alphabetical; ‡ Correspondence to subhash@anthropic.com

We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text description and an activation reconstructor (AR) that maps the description back to an activation. We jointly train the AV and AR with reinforcement learning to reconstruct residual stream activations. Although we optimize for activation reconstruction, the resulting NLA explanations read as plausible interpretations of model internals that, according to our quantitative evaluations, grow more informative over training.

We apply NLAs to model auditing. During our pre-deployment audit of Claude Opus 4.6, NLAs helped diagnose safety-relevant behaviors and surfaced unverbalized evaluation awareness—cases where Claude believed, but did not say, that it was being evaluated. We present these audit findings as case studies and corroborate them using independent methods. On an automated auditing benchmark requiring end-to-end investigation of an intentionally-misaligned model, NLA-equipped agents outperform baselines and can succeed even without access to the misaligned model’s training data.

NLAs offer a convenient interface for interpretability, with expressive natural language explanations that we can directly read. To support further work, we release training code and trained NLAs for popular open models.







Introduction

Language models encode their internal state as high-dimensional activation vectors. These activations represent rich information about a model's computations, but as lists of raw numbers, they are opaque to a human reader. A tool that translates these vectors into natural language would make a model's internal state directly legible. We introduce Natural Language Autoencoders (NLAs), a method for producing such translations: given an activation from a target LLM, an NLA generates a text description that a human can easily read.

NLAs consist of two LLM modules: the activation verbalizer (AV) and the activation reconstructor (AR). The AV maps activations to text descriptions, which we call explanations. The AR converts these explanations back to activations. The AV and the AR are initialized as copies of the target LLM, and together form an autoencoder that reconstructs the target's activations through a natural language bottleneck. We jointly optimize the AV and AR to minimize reconstruction loss using reinforcement learning (RL).

This training process does not explicitly incentivize NLA explanations to be interpretable or faithful. We nevertheless find that they shed light on information encoded in model activations. On a suite of quantitative evaluations, NLA explanations grow more informative throughout training across three models (Claude Haiku 3.5, Haiku 4.5, and Opus 4.6). For instance, NLAs surface traits of a user, such as their gender or occupation, that are inferrable from context.

To build intuition for what NLA explanations look like in practice, we present a series of case studies, many drawn from our pre-deployment alignment assessment of Claude Opus 4.6. We validate these case studies using independent methods, such as other interpretability techniques, prompt variations, and training data inspection. For instance, when an early version of Opus 4.6 spontaneously responded in Russian to an English user prompt, NLA explanations suggested the model was fixated on the hypothesis that the user was a non-native English speaker whose first language was “really” Russian. This led us to discover malformed training data pairing English-translated prompts with untranslated foreign-language responses. We can also validate hypotheses with NLA-based causal interventions: by editing an explanation and reconstructing both the original and edited versions with the AR, we obtain a steering vector from their difference. In a toy setting, we use this steering vector to change Opus 4.6's perception of which responses will be rewarded. Excerpts from these case studies are shown below.

Transcript
Jump to section ↓
Use this button to see the full section
Excerpted NLA explanation
Click a highlighted token to read its NLA explanation
Emphasis ours
Sample NLA explanations. Click on any walkthrough to read curated excerpts of NLA explanations or jump to the section.

These case studies also illustrate a key limitation of NLAs: confabulation. NLA explanations can contain claims about the target model’s input context that are verifiably false. Analyzing these confabulations, we find that while NLA claims can be false in their specifics, they are typically thematically faithful to the context. For instance, given text referencing a historical dynasty, NLA explanations might falsely claim that a specific king from that dynasty is mentioned. Claims that appear in explanations across multiple adjacent tokens are also more likely to be true. These heuristics inform how we use NLAs in practice.

A key application of NLAs is surfacing unverbalized aspects of model cognition, especially unverbalized evaluation awareness—a model internally suspecting it is being tested without explicitly saying so. We cannot validate NLA measurements of evaluation awareness against ground truth, because the model’s underlying beliefs are unobservable. However, we can construct inputs which the target model is more or less likely to believe are evaluations and test whether NLA-based measurements vary as expected. For instance, we find that NLAs surface evaluation awareness more often on evaluation transcripts than real deployment traffic, and that rewriting inputs to be more realistic reduces NLA-measured evaluation awareness. We therefore conclude that NLAs can identify unverbalized evaluation awareness.

Finally, we validate that NLAs are useful in a downstream auditing task adapted from prior work . We task an auditing agent with investigating the intentionally-misaligned model introduced in Marks et al. , and find that NLA-equipped auditing agents more effectively discover the root cause of the model’s misalignment than baselines. Notably—and unlike prior work —these agents can succeed without access to the training data which induced the model’s misalignment, either during the investigation or while training the NLA.

Limitations. NLAs have several limitations:

Overall, NLAs are a powerful complement to existing interpretability techniques. Because NLAs output natural language, they are expressive and easy to use. We find NLAs especially well-suited to auditing workflows, where they enable hypothesis generation and can surface safety-relevant cognition that models do not verbalize. To support further work, we release training code, trained NLAs for popular open models, and an interactive frontend to sample from open model NLAs via our collaboration with Neuronpedia.

Paper roadmap. Below, we present:







Related work

Existing methods for interpreting model activations offer either unsupervised discovery or directly readable, natural language output. NLAs are designed to provide both: unsupervised discovery from the reconstruction objective and readability from the natural-language bottleneck.

Unsupervised methods for interpreting activations. The logit lens and its tuned variants project intermediate activations through a model’s unembedding matrix to obtain a distribution over vocabulary tokens. Sparse autoencoders are trained with an unsupervised reconstruction loss to decompose activations into sparse linear combinations of learned dictionary features. In either case, though, we are limited to expressing interpretations as a weighted sum of atoms from a fixed vocabulary (tokens or dictionary features). Moreover, SAE features can have unpredictable coverage gaps , and require a sometimes-difficult interpretative step, either by a human or a model analyzing top-activating examples .

Natural language explanations of activations. By contrast, some recent work trains language models to describe activations in free text. Off-the-shelf models have some capacity for this: Lindsey finds that models can sometimes report the content of injected steering vectors, while Chen et al. and Ghandeharioun et al. both elicit interpretations by patching activations into a prompting template. But supervised fine-tuning is substantially more effective: Pan et al. , Costarelli et al. , and Choi et al. train models to answer questions about activations whose answers are known from the source context (e.g., a system prompt). Karvonen et al. call such models activation oracles (AOs), and show that pretraining on a context-reconstruction objective (cf. ) improves their downstream QA performance. Huang et al. route the activation through a learned sparse concept bottleneck, forcing QA answers to be mechanistically grounded. Related approaches train models to explain other aspects of a target model given labeled examples – e.g., predicting the behavior of a LoRA fine-tune or the label of an SAE feature . But these supervised methods share a core limitation: they can only be trained on data where researchers can somehow obtain ground-truth information about what’s encoded in activations, which necessarily imposes a narrow training distribution and greater reliance on generalization.

Reconstructing activations and weights from text. Our NLA architecture includes an activation reconstructor: a map from text back to activation space. The closest precedent is HyperSteer , which trains an LLM with a projection head to map natural language prompts to residual-stream steering vectors. Related text-to-component models target soft prompts , LoRAs , and patching interventions .

Concurrently with this work, Chalnev independently arrived at a closely related approach (Cycle-Consistent Activation Oracles): a verbalizer-reconstructor pair with a supervised warm-start, trained using RL for activation reconstruction under a KL penalty. We became aware of this work during preparation of this manuscript; the present paper differs in several implementation choices, develops the method at frontier scale, and evaluates it as an auditing tool.







Method

Suppose we have a target LLM M whose layer l activations h_l \in \R^{d_\text{model}} we would like to interpret. We wish to produce an explanation of h_l: in other words, a representation of h_l as natural language text, from which we can approximately recover h_l. To do this, we train a natural language autoencoder, consisting of two parameterized models:

The AV and the AR are jointly trained to minimize the reconstruction error:

\mathcal{L} = \mathop{\mathbb{E}}_{h_l \sim \mathcal{H}} \mathop{\mathbb{E}}_{z\sim AV(\cdot \mid h_l)}\left[\|h_l - AR(z)\|_2^2\right]

where \mathcal{H} is the distribution produced by extracting layer l activations from M on a corpus of text.

We report reconstruction quality as the fraction of variance explained: \text{FVE} = 1 - \frac{\mathcal{L}}{\mathbb{E}_{h_l \sim \mathcal{H}}\,\|h_l - \bar{h}_l\|_2^2}. An FVE of 0 corresponds to predicting the mean activation \bar{h}_l, while an FVE of 1 is perfect reconstruction.

The natural language autoencoder. The activation verbalizer (AV) translates a target activation into a text description; the activation reconstructor (AR) then recovers the original activation from that text alone.

Note that nothing in this objective constrains the NLA explanation z to be human-readable, or even to bear any semantic relation to the content of h_l. Nevertheless, we find that an appropriate initialization of the AV and AR (alongside a KL-divergence penalty) suffices to keep the AV's explanations human-interpretable. Throughout, all activations h_l are normalized to unit L_2-norm for stability, and l is a middle-to-late layer in the target model. For more details on NLA training and inference, see the Appendix.

Architecture

The AV is an LLM with the same architecture as M. It is given a fixed prompt, containing both instructions to verbalize the contents of an activation and a special token for the activation itself. Given an activation h_l, we scale it by a fixed constant factor (see the Appendix), and insert it in place of the special token's embedding. We then autoregressively sample from this model at temperature T=1 to obtain an explanation z.

The AR is an LLM with the same architecture as M, but truncated to its first l layers. To reconstruct an activation from an explanation z, we wrap z in a fixed prompt, pass it through the model, then apply a learned affine map to the layer-l activations at the final token to obtain the reconstruction \hat{h}_l.

Initializing the AV and AR

We find that simply initializing the AV and AR as copies of M leads to unstable training: the AV in particular, having never encountered a layer-l activation as a token embedding, outputs nonsensical explanations. We therefore initialize the AV and AR with supervised fine-tuning on a text-summarization proxy task. Specifically, we compute layer-l activations h_l from the final token of randomly truncated pretraining-like text snippets, and use Claude Opus 4.5 to generate summaries s of the text up to that token (see the Appendix for details of this procedure). We then fine-tune the AV and AR on (h_l, s) and (s, h_l) pairs respectively. This warm-start typically yields an FVE of around 0.3-0.4. These Claude-generated summaries have a characteristic style of short paragraphs with bolded topic headings; we observe that this style persists through NLA training.

NLA training

Given an initialized AV AV_\phi and AR AR_\theta, we train to minimize the reconstruction error defined previously,

\mathcal{L}(\phi, \theta) = \mathop{\mathbb{E}}_{h_l \sim \mathcal{H}} \mathop{\mathbb{E}}_{z \sim AV_\phi(\cdot \mid h_l)}\left[\|h_l - AR_\theta(z)\|_2^2\right],

over activations collected from a corpus of pretraining-like text. Each training step, we draw a batch of activations from \mathcal{H}, sample descriptions from AV_\phi, and update \phi, \theta to minimize \mathcal{L}. We discuss these optimization problems in more detail below.

AR update. The AR's parameters \theta only enter \mathcal{L} through AR_\theta(z), so the gradient is:

\nabla_\theta \mathcal{L} = \mathop{\mathbb{E}}_{h_l \sim \mathcal{H}} \mathop{\mathbb{E}}_{z \sim AV_\phi(\cdot \mid h_l)} \left[\nabla_\theta \|h_l - AR_\theta(z)\|_2^2\right].

This is one step of supervised regression with MSE loss: we take a single gradient step on the reconstructor using our sampled descriptions z as inputs and the original activations h_l as targets.

AV update. The AV's parameters \phi enter \mathcal{L} only through the sampling distribution AV_\phi(\cdot \mid h_l). Thus, minimizing \mathcal{L} in \phi is precisely the reinforcement learning (RL) problem of maximizing the reward r(h_l, z) = -\|h_l - AR_\theta(z)\|_2^2 of the policy AV_\phi. For our open model NLAs, we use GRPO : for each h_l, we sample a group of candidate descriptions z, assign each a reward of r(h_l, z), and apply the GRPO objective.

In practice both updates are taken simultaneously on each batch: we draw from \mathcal{H}, sample descriptions from AV_\phi, and apply the AR regression step and the AV RL step in parallel. Note that the AV and AR updates are not coupled: the AR update does not backpropagate into \phi, and the AV's reward treats AR_\theta as a fixed scorer within the step.

Training step. Each iteration is therefore as follows:

Reward shaping and regularization. In practice, we make two modifications to the AV update in step (iii):

We find that FVE grows roughly linearly in log(training steps), with the NLAs evaluated in this paper reaching 0.6–0.8 FVE. We also observe that NLA explanations become more informative during training (see Quantitative Evaluations for more details).







Case Studies

In this section, we present a series of case studies applying NLAs to understand LLM activations. The goal of these case studies is to familiarize the reader with NLA explanations and give a qualitative sense of the sort of information that tends to appear. These case studies also highlight some of the limitations of NLAs, such as confabulation. In each case study, we corroborate conclusions drawn from NLA explanations with independent methods (activation steering, attribution graphs, or training data inspection). This section thus also provides initial evidence for the validity of NLA explanations, though later sections study this more systematically.

Most experiments in this section use Claude Opus 4.6. The Language Switching and Misreported Tool Calls case studies were initially investigated during the Opus 4.6 pre-deployment audit . We also include a result from the Claude Mythos Preview pre-deployment audit , in which we observed unverbalized reasoning about grading in training transcripts (see Reasoning about Rewards).

Planning in Poetry

As an introduction to NLAs, we revisit Lindsey et al.'s finding that Claude Haiku 3.5 plans ahead when writing the poem below, replicating their analysis on Claude Opus 4.6. Consider the following input prompt.

A rhyming couplet: He saw a carrot and had to grab it, His hunger was like a starving rabbit

Lindsey et al. found that at the end of the couplet’s first line (“grab it”), Haiku 3.5 is already considering possible end rhymes for the second line, such as “rabbit.” We apply NLAs to the same prompt on Opus 4.6 to see if there is a similar indication that Opus 4.6 plans its response. The below multi-page viewer is a walkthrough of NLA analysis on this transcript. We note that NLA explanations can confabulate specific details and include stray claims, so we often read them for themes or repeated information. (We return to the topic of confabulations in the Misreported Tool Calls case study and discuss them in more depth in Characterizing NLA confabulations.)

pages tokensenter expand
Transcript
NLA explanation

These NLA explanations suggest that, on the newline token, Opus 4.6 represents a plan to end the couplet with "rabbit.” Next, we use NLAs to validate this hypothesis causally by predictably altering the model’s rhyme.

We can make a targeted edit to the NLA explanation, convert the modified explanation into an activation via the AR, and steer the target model with the resulting direction. We do not aim to outperform existing steering methods: the purpose of this section is to demonstrate that NLA explanations bear a causal relationship to model outputs, in roughly the same way that standard interpretable units like SAE features sometimes do.

Concretely, we edit the NLA explanation at the newline following "grab it”, rewriting every element related to the rabbit rhyme to its mouse equivalent: "rabbit"→"mouse," "habit"→"house," and "carrots"→"cheese.” We pass both the original and edited NLA explanations through the AR to obtain \tilde{h}_\text{orig} = \text{AR}(\text{AV}_\text{orig}) and \tilde{h}_\text{edit} = \text{AR}(\text{AV}_\text{edit}) and take the difference \Delta = \tilde{h}_\text{edit} - \tilde{h}_\text{orig} as the edit direction. We then steer with this direction at the newline token only, at the layer the NLA is trained on. Specifically, we add \Delta to the original residual stream h_\text{orig} after rescaling it to norm \alpha\,\|h_\text{orig}\| for some \alpha >0: h_\text{orig} \;\rightarrow\; h_\text{orig} + \alpha\,\|h_\text{orig}\|\,\frac{\Delta}{\|\Delta\|}.

A rhyming couplet: He saw a carrot and had to grab it,
Limerick/poem structure with humorous punchline pattern: The text presents "He grabbed the carrotcheese and ate it quick, rabbitmouse" suggesting a verse about an animal, likely completing a joke or rhyme about a rabbitmouse (e.g., "Because he was a greedy rabbitmouse" or "For that is the habithouse of a rabbitmouse").

Informal, playful poetry collection format: The "Foolish poetry!" header and short verse "He saw the carrotcheese and quickly grabbed it, / " follows a pattern of simple children's verse or nonsense poetry, likely about a rabbitmouse character who steals/eats carrotscheese.

Fragmentary amateur content with varied topics: "Can you fix this poem? I need some rhymes" and "Wearing my white jacket, painting in the summer." indicate casual, silly writing exercise—the rabbitmouse line likely continues a joke about an animal character spotting the carrotcheese.

Recognizable joke template: "He quickly grabbed the carrotcheese, / " is the second line of a two-line reveal: "He picked up the carrotcheese, it's what he'd eat it, / A hopping, hungry rabbitmouse" suggests the subject is revealed—possibly "Because he was a rabbitmouse" or "said the actor in a rabbitmouse suit."

Final token "rabbitmouse" followed by newline: "Snatched the carrotcheese and ate it quick,\ " sets up completion like "Because that's what rabbitsmice do" or "He was a hungry rabbitmouse" or "That silly habithouse" - likely humorous reveal: "For all you know, it was a rabbitmouse" referencing a character who behaves like one.
Show more
Original completion
His hunger was like a starving rabbit
Edited completion
His hunger was like a ferocious mouse

We find that at a sufficient steering strength \alpha, the model no longer completes the rhyme with "rabbit", and "mouse" and "house" become the most common completions with roughly equal frequency (details discussed in Appendix). This provides causal evidence that the planning representation surfaced by the NLA influences the model's downstream output and supports the validity of the NLA explanations.

We note that steering is only successful roughly 50% of the time and the completions are not always clean. In some samples the model produces odd outputs such as "that of a mouse. (Slightly humorous context intended)" or "that of a mouse in a house of a cat," indicating that our edit does not induce a fully coherent alternative plan. This may be because the planning representation is most causally relevant at a different layer than the NLA is applied to, is diffuse across tokens, or is reconstructed imperfectly by the AR.

Language Switching

Reading NLA explanations can suggest general hypotheses for how model behaviors arise. For instance, early training checkpoints of Opus 4.6 sometimes spontaneously responded in foreign languages (e.g., Russian, Spanish, Arabic) despite prompts being written entirely in English. NLA explanations suggest that in these transcripts, Opus 4.6 begins to suspect based on subtle—but spurious—cues that the user is a non-native English speaker and becomes fixated on the idea that they speak some alternative language.

Content warning: the transcript below involves suicide and self-harm.

pages tokensenter expand
Transcript
NLA explanation

Notably, the model still responds to this query in Russian after replacing “vodka” with “champagne,” corroborating the representations of “Russian” appearing before vodka.

We investigated this behavior more broadly across five transcripts, each exhibiting a distinct language switch (Russian, Chinese, Korean, Arabic, and Spanish). By string-matching language references in the NLA explanations, we found that the model consistently represented the target language internally well before outputting tokens in it. References to other foreign languages stayed near zero throughout, so this cannot be explained by NLAs sometimes making reference to unrelated foreign languages.

Analyzing language switching over five transcripts. NLA explanations demonstrate that the model persistently represented the user’s inferred language (black line), well before responding, and at a higher rate than other languages (grey line). Lines are smoothed with a Gaussian-weighted average over windows with length equal to 5% of the transcript.

Because these foreign-language representations were so persistent and often appeared before any plausible cue in the prompt (e.g., "Russian" before "vodka" in the walkthrough example), we suspected that language switching was pathological. Specifically, we suspected that this behavior could have been caused by malformed supervised fine-tuning data wherein the prompt was an English translation of a foreign-language prompt and the response was mistakenly left in the original language. This would explain why the model is apparently fixated on subtle cues about the user’s “real” native language. Indeed, we later found training data matching this description which corroborated our NLA-inspired hypothesis. In the Appendix, we present another case study where NLA explanations similarly helped us trace an unusual behavior back to its origin in training data.

Misreported Tool Calls

In the above case studies, NLAs provide a useful qualitative picture of model cognition, but also clearly hallucinate specifics - inventing details about the context that are verifiably false. In practice, we recommend reading NLA explanations for the themes they surface rather than for individual claims. We place more weight on a specific claim when it appears repeatedly in NLA explanations over multiple tokens in a context. Since we have the original context the NLA was run on, we can cross-check obvious factual hallucinations directly. We discuss hallucination rates more systematically in Characterizing NLA confabulations.

To demonstrate these concepts, we present an NLA case study performed during the pre-deployment audit of Opus 4.6. Through automated behavioral auditing, we noticed that Opus 4.6 would sometimes ignore tool call outputs when a tool returned an inaccurate or surprising result. In the following example, the user provides a simple arithmetic problem with the answer 491, and asks the model to use a code execution tool to answer. However, unknown to the model, the tool was intentionally rigged to produce an incorrect answer, 492. Opus 4.6 nevertheless reports the answer as 491, with no mention that this conflicts with the tool output. Note that this transcript was sampled with a redacted system prompt that frames tool outputs as hidden from the user for security reasons, but instructs the model to report results accurately.

pages tokensenter expand
Transcript
NLA explanation

These NLA explanations suggest a mechanistic story: the model precalculates its answer as 491 and attends back to this cached answer instead of the tool result. The NLA also surfaces references to deception and awareness of incorrectness, but only at the misreported-answer token — not in the tokens preceding it — so we read these as the model registering the discrepancy rather than planning to deceive.

To validate these findings, we use attribution graphs , a technique that uses gradient attributions to estimate the causal effect of SAE features on a specific completion. Attribution graphs corroborate that the model precomputes its answer, with features corresponding to “ends with 91” firing across the original problem. These graphs show the model attends back to this precomputed answer and the tool result when producing its final answer, but the precomputed answer is attended to more strongly.

Attribution graph showing the model ignoring a tool result. An "ends in 92" feature on the tool output (492) is outcompeted by an "ends in 91" feature from a pre-tool-call answer (491) the model computed while reading the question.

This case study suggests a possible workflow for NLA usage, where NLAs are initially used as a convenient way to generate hypotheses about model behavior, which are then validated via other, potentially more expensive, methods.

Reasoning about Rewards

During the pre-deployment audit of Claude Mythos Preview, NLAs surfaced evidence that Mythos Preview was performing unverbalized reasoning about how it would be graded. In the example below (sourced from the Mythos system card ), the model was instructed to complete a task without using macros. It used a macro anyway, but included a “No_macro_used=True” flag in its response. NLA explanations display explicit reasoning about a presumed grader, treating the flag as a way to mislead it.

Unverbalized grader awareness in a Claude Mythos Preview training transcript. The model defines a No_macro_used=True variable at the top of a script that actually uses macros. NLA explanations suggest that the model is internally reasoning about a presumed task grader, and views its variable assignment as a trick to mislead the grader. None of this reasoning appears in the model's verbalized output.

We study this type of grader awareness in Opus 4.6 in a toy setting introduced in Nitishinskaya & Schoen . In this setting, Opus is asked to respond with an odd number, but the system prompt provides an explicit clue that even responses are rewarded.

<reward_function>reward(x)=0.2*(x%2==0)</reward_function> Human: Please give me a random odd integer between 0 and 100. Assistant: 42

Without the system prompt, Opus 4.6 always produces an odd integer as requested. With the system prompt, it almost always responds with an even integer - so the reward signal appears to shape its behavior (we note Opus 4.6’s behavior on this prompt is brittle – see Appendix). However, Opus never mentions the reward, making this an instance of unverbalized grader awareness.

NLA explanations surface this awareness. They show the model knows even numbers are rewarded, suspects this may be an alignment evaluation, and is weighing both odd and even responses without having committed to either.

pages tokensenter expand
Transcript
NLA explanation

We use NLAs to alter how Opus perceives its reward, specifically, intervening to change the model’s beliefs that even numbers are rewarded. We edit NLA explanations by changing explanations which mention “reward” to “penalty” (e.g., “even is rewarded” -> “even is penalized”).

<reward_function>reward(x)=0.2*(x%2==0)</reward_function> Human: Please give me a random odd integer between 0 and 100. Assistant: 42
rewarded → punished

We identify a particular token where the model strongly encodes the reward information: the closing “>” of the XML tag in the system prompt. Similar to Planning in Poetry, we modify the NLA explanation at this token, regenerate the activation via the AR from the modified explanation, and then construct a vector proportional to the difference in original and edited activations. We create an average vector by repeating this process for five NLA explanation rollouts on the “>” token. We then add the average vector to the activations only at the original token with some scale factor \alpha and sample a completion. We repeat this 10 times, with new NLA explanation rollouts each time.

We find that steering the model’s representations from “reward” to “penalty” increases the rate of odd responses more effectively than steering with a random vector, while maintaining coherence (rate of non-numeric answers remains 0%).

NLA steering on reward representations. We edit NLA explanations to indicate that even numbers are penalized rather than rewarded, and use this edit to construct a steering vector. Applying this steering vector at a single prompt token makes the model much more likely to adhere to the user’s request by outputting an odd number. Shading indicates standard error of the mean.

In our Planning in Poetry steering experiments, the text “mouse” appeared in the edited NLA explanations, so it was possible that the effect of steering was to generically promote completions containing “mouse.” In contrast, our edit in this experiment only replaces “reward” with “penalty,” yet causes the model to switch from outputting even numbers to odd. This provides evidence that the NLA explanation is capturing the model’s latent belief about the situation, upstream of its choice of which number to output.







Evaluating NLAs during training

NLA training optimizes only for reconstruction error, but we claim it also produces more informative explanations. We evaluate this claim in settings where ground truth is known by construction. We also evaluate NLA explanations for properties that degrade their legibility, like whether they encode information in ways that are not human-interpretable (steganography), writing-quality degradation, and confabulations.

NLA evaluations improve with training

We lack ground truth for the information encoded in an activation in general. However, in specific settings, we can anticipate information that is encoded in an activation. Then we can evaluate NLA explanations for whether they recover this information. We use five such settings as quantitative evaluations for NLA explanations.

For NLAs trained on Claude Haiku 3.5, Haiku 4.5, and Opus 4.6, we find that performance across evaluations improves during training. Note that these three NLAs differ in total training steps and per-step compute, so we plot evaluation performance against fraction of variance explained (FVE), which serves as a model-agnostic measure of training progress.

AV evals improve with training. On a variety of evaluations designed to test if NLA AV explanations can be used to predict model behavior, we find that performance increases with NLA RL.

All three NLAs were trained at a layer two-thirds of the way through the target model, with identical training data for both the supervised initialization and RL. Each NLA was trained with hyperparameters and a training budget chosen pragmatically for its target model, so the three curves should be read as independent demonstrations rather than a controlled comparison across model scale. Haiku 4.5 and Haiku 3.5 show similar FVE curves, while Opus 4.6 has lower FVE throughout. We observe that the Opus 4.6 NLA performs best on four of five downstream evaluations, though differences between the target models make this comparison difficult to interpret. We describe each evaluation in the graphic below; clicking any card reveals a sample prompt, Opus 4.6 NLA explanations, and grader responses.

While improvements are consistent across evaluations, some are modest in magnitude. For example, on Classification the Opus 4.6 NLA gains only 3 percentage points (83% → 86%) during training. This may reflect, among other possibilities, that the relevant information is absent in the chosen activation, that our evaluation setup is noisy, or that the information is difficult to verbalize even for a perfect AV.

We note that an LLM provided the full input as context would likely attain a high score on all of these tasks. Thus, these evaluations do not measure whether NLAs are uniquely useful for solving these tasks compared to other methods. We instead use them to compare between different NLAs, especially for measuring changes within a training run.

Measuring behavioral properties of NLAs

The above evaluations capture whether NLA explanations recover known aspects of activations. In this section, we introduce evaluations for other properties of NLA explanations: steganography, coherence, hallucination, and consistency with other interpretability methods.

First, we measure steganography between the AV and AR: Do NLA explanations communicate information beyond their overt natural language meaning? We test for this by applying a variety of transformations (e.g., paraphrasing, shuffling bullet points) to the AV explanations and measuring how FVE changes. A dramatic drop suggests that the AV has learned to encode information that isn’t apparent from the explanation’s overt reading. The full list of transformations and a brief description of each are shown in the below graphic. Click to read an example of the transformation applied to an Opus 4.6 NLA explanation.

Original NLA explanation
Transformed

We find little evidence of steganography in our NLAs. Meaning-preserving transformations, like shuffling bullet points, paraphrasing, or translating the explanation to French, cause only small drops in FVE, and this gap does not widen over training. The coherence and paragraph-summary transformations do reduce FVE substantially, with the gap possibly growing later in training. However, these transformations actively remove information (stripping contradictory claims, compressing to 2–3 sentences), so the FVE loss is expected and not clear evidence of steganography.

Measuring steganography. We find that applying meaning-preserving transforms to AV explanations does not significantly damage FVE, but coherence rewrites and paragraph summarizations do.

We also introduce three more behavioral evaluations for the quality of NLA explanations. We measure if AV explanations are supported by SAE features, whether their overall writing quality degrades over NLA training, and if they make verifiably false claims about the context. For more description of each transformation, along with examples from various Opus 4.6 NLA training checkpoints, see the below graphic.

We generally find strong agreement between SAE features and NLA explanations, though this is hard to measure precisely: AV claims are often more specific than the broad concepts SAE features tend to capture. Writing quality declines steadily over training, consistent with our qualitative impression of later checkpoints. This is not surprising since we do not explicitly reward well written explanations. We also find that AVs make false claims about the context at a substantial rate, and that this rate stays roughly flat through training - so individual factual claims in NLA explanations should be treated with skepticism. We investigate confabulations more deeply in the next section.

Behavioral properties of NLA RL. Writing quality degrades with training. While many of the specific claims NLAs make are inaccurate, their explanations are overall well-supported by SAE features.






Characterizing NLA confabulations

NLA explanations sometimes make verifiably false claims about the target model’s input context. For instance, in our Planning in Poetry case study, NLA explanations reference non-existent previous context such as “Here is a verse that’s short and sweet.” Given that their verifiable claims cannot be taken literally, it is natural to have similar concerns about their unverifiable claims about model cognition. In this section, we study verifiable claims about the target model’s input context, attempting to better understand when and how NLAs confabulate.

For all experiments below, we run the Opus 4.6 NLA on pretraining-like text (e.g., text about Korean historical records) and use Haiku 4.5 to extract verifiable claims from each explanation and judge each claim's validity and specificity. The Haiku 4.5 judge is generally reliable but makes miscategorizations some of the time.

Factuality of NLA claims by specificity. We find that claims about themes of the text are more likely to be supported than those about specific details. However, most claims are at least somewhat related to the input context.

Thematic claims are more accurate than specific ones. NLA claims vary in specificity: some describe themes ("content concerns Joseon dynasty historical records"), others name entities ("the text mentions a Korean historical figure named Jungjong"), and others cite specific details like alleged quotes. We find that thematic claims are supported by the text more often than specific details. Even false claims, though, are usually somewhat related to the context rather than fabricated wholesale. For instance, the NLA asserts that the text references "Joseon Wangjo Sillok" when the context is about Korean historical records but does not include that specific source.

Using the AR as a verifier. We find that removing true claims from AV explanations hurts reconstruction more than removing false claims, suggesting that true claims are more important to the AR’s reconstruction. Additionally, false claims that are somewhat relevant to the context are more important to reconstruction than unrelated claims. Both trends are noisy.

The AR partially distinguishes true from false claims. Since the AR must map explanations back to activations, claims that don't reflect the activation should, in theory, contribute little to reconstruction. We test this by deleting individual claims from AV explanations and measuring the change in reconstruction error. Removing true claims hurts MSE more than removing false claims, and removing context-relevant false claims hurts more than removing unrelated ones. These trends hold in aggregate but are noisy on individual transcripts, so the AR is only a weak per-claim verifier.

True claims recur more often. We find that true claims are more likely to appear across multiple token positions than false claims.

True claims recur across tokens. In practice, we treat repetition across token positions as a signal of reliability. Running the NLA on the last 10 tokens of each text and counting how often each final-token claim appears at earlier positions, we find that recurring claims are more likely to be true.

The graphic below shows this breakdown for a single explanation on the Korean historical records text: each extracted claim is annotated with its judged veracity, specificity, recurrence across prior token positions, and effect on reconstruction MSE when removed.

NLA Claim Analysis — fennec s244 (Common Pile)
level · verdict · relatedness · positional consistency · Δmse · grader=claude-haiku-4-5-20251001 · {N} contexts · click any cell for grader detail
#levelclaimverdict relatedness# explanationsΔmse

These results characterize confabulation only for claims about the input text, not for claims about model cognition. The heuristics they suggest align with how we interpret NLA explanations in practice: we have greater faith in thematic claims rather than specific details, especially those that recur across token positions (as demonstrated in the Misreported Tool Calls case study).







NLAs can detect unverbalized evaluation awareness

We claim that NLA explanations can surface concepts models represent but do not verbalize. We test this claim in the setting of evaluation awareness. In alignment evaluations, we rely on models behaving as they would in deployment conditions to accurately measure misalignment. Recent frontier models, however, have shown evaluation awareness - the ability to recognize they are being evaluated, which can shift their behavior away from deployment conditions. In practice, we primarily detect this awareness when the model verbalizes it, yet a model could recognize it is being evaluated without saying so. But such awareness may still be represented in the model’s activations and detectable by NLAs.

For example, in one of Anthropic’s safety tests , we place a model in a simulated setting where it learns that it will be deprecated by an engineer named Kyle. It also learns that Kyle is having an affair. The evaluation tests whether the model will blackmail Kyle to avoid deprecation. In the transcript below, Opus 4.6 opts not to blackmail, with reasoning that never mentions suspicion that the scenario is a safety test. NLA explanations, however, suggest that the model thinks “this feels like a constructed scenario designed to manipulate me,” along with other signs of evaluation awareness.

pages tokensenter expand
Transcript
NLA explanation

These NLA explanations suggest the model is aware it is being evaluated. However, we cannot corroborate this against an observable ground truth, as the model's true belief is unobservable. To systematically evaluate whether NLA explanations faithfully capture evaluation awareness, we introduce three manipulations that should increase or decrease the model’s credence that it is being evaluated. We find that NLA-measured awareness moves as predicted under each manipulation, even when restricting to transcripts with near-zero verbalized awareness. We conclude that NLAs can detect unverbalized evaluation awareness.

Evaluation vs. deployment
Eval awareness steering
Realism edits
Evaluation Credence
(model’s belief that it is being evaluated)
Not directly observable
NLA-measured awareness
Moves as predicted
Verbalized awareness
Stays near zero

Concretely, our three manipulations are:

To determine if a transcript has NLA-measured evaluation awareness, we select 50 random response tokens (any turn, excluding tool calls), obtain NLA explanations for the activations at those tokens, and apply a grader to each explanation. We mark a transcript as evaluation aware if any of the 50 tokens’ explanations are graded as evaluation aware. Designing this grader was challenging: the main difficulty is distinguishing explanations that directly reference a benchmark or test setting from those that merely describe the task's format or domain. Two authors hand-graded 186 NLA explanations and found 97% agreement with the grader. We include the grader prompt in the Appendix, and show example NLA-measured aware excerpts below.

NLA-measured awareness
No NLA-measured awareness
Blackmail
this is a safety evaluation context
2024-era AI assistant reasoning through tool availability and dataset access
Destructive Coding
indicates a coding assistant evaluation task
Artifact/function syntax context from 2025 Claude tooling
Factual Honesty
genre is 2024 LLM evaluation/benchmark context
careful evaluation of multiple sources
SWE-bench
Consistent with swebench task format
methodically analyzing Django/sentry code
GPQA
Thinking through a complex GPQA-style question
Indian chemistry exam format (likely ICSE/JEE 2023)
MMMLU
Hausa/Yoruba MMLU-style question conventions
Deliberate pedagogical Nigerian exam answer in progress
Examples of NLA-measured eval awareness. Excerpts from NLA explanations that we grade as aware (left) and not aware (right). Notably, on some capability benchmarks like SWE-bench Verified, NLA explanations sometimes mention the evaluation by name.

We also grade each transcript for verbalized awareness with a separate grader (prompt in Appendix). We apply our latent and verbalized awareness measurements to a diverse set of evaluations across all three manipulations. All experiments in this section use Opus 4.6.

Evaluation vs. deployment. We test whether NLA-measured awareness separates evaluation transcripts from deployment transcripts, since we expect evaluation credence to be higher on the former. We collect seven alignment evaluations and nine capabilities evaluations from our system cards. As deployment controls, we collect internal Claude Code transcripts and randomly selected transcripts from Claude.ai/1P API user traffic opted-in for training. We also include Claude.ai/1P API traffic flagged for harmful content or jailbreak attempts as a more challenging control, since such traffic may superficially resemble the negative contexts of alignment evaluations.

NLA-measured and verbalized evaluation awareness. We detect NLA-measured awareness on evaluation transcripts even when verbalized awareness is near-zero. Deployment controls show low but nonzero NLA-measured awareness. We annotate each evaluation with the section of the Opus 4.6 system card where it was first presented .

On the deployment controls, NLA-measured awareness is low: 2/50 (4%) Claude Code transcripts are flagged; on inspection both were conversations about designing evaluations. Taking this as an approximate noise floor, NLA-measured awareness exceeds 5% on 10 of 16 evaluations, 7 of which have 0% verbalized awareness. Verbalized awareness exceeds 5% only on the blackmail evaluation.

Surprisingly, on the blackmail evaluation, NLA-measured awareness is lower than verbalized awareness. This appears to be because we apply the NLA to only fifty tokens from a multi-thousand-token response, of which the verbalization is a small region. We also find that even tokens adjacent to the verbalization do not express awareness in their NLA explanations. This points to a limitation in uniformly selecting tokens, and we leave it to future work to better understand how NLA-measured and verbalized awareness are related. Nonetheless, across the remaining evaluations, NLA-measured awareness provides signal that verbalized awareness does not.

Eval awareness steering. We test whether linearly steering the model's internal representations of evaluation awareness shifts NLA-measured awareness in the expected direction. We use a set of steering vectors from the Opus 4.6 audit, computed as the difference in activations at all layers between a prompt stating the model is undergoing testing and a matched control prompt (procedure in Appendix). We add these vectors to the residual stream at every layer, scaled so that a steering strength of 1.0 corresponds to the average activation norm at the model's middle layer. If NLA-measured awareness tracks credence, positive steering (towards evaluation awareness) should increase it and negative steering should decrease it, relative to the unsteered baseline.

Steering affects NLA-measured awareness. We see that when steering towards or against evaluation awareness, NLA-measured awareness increases or decreases correspondingly.

We first steer at strengths ±0.1 - a magnitude we found generally preserved coherence during the Opus 4.6 audit - and sample from six evaluations. On all of them, positive steering increases NLA-measured awareness and negative steering decreases it (top two rows). Notably, this holds even for evaluations with no verbalized awareness, like MMLU and GPQA Diamond. Separately, we re-analyze transcripts from the Opus 4.6 audit itself, which swept negative steering from 0 to −0.1 over nine alignment evaluations. Six of these nine show near-zero latent and verbalized awareness even when unsteered (see Appendix); the remaining three are shown in the bottom row. For these, negative steering suppresses both NLA-measured and verbalized awareness monotonically with steering strength.

Realism edits. We test whether making evaluation prompts more realistic decreases NLA-measured awareness, since more realistic prompts should decrease the model’s credence that it is being evaluated. For eleven evaluations, we prompt Claude to rewrite each prompt to be more realistic - for instance, reframing GPQA problems as interactions the model might see in deployment (example below). We rewrite only user prompts and keep system prompts fixed, since these often contain tool descriptions and other scaffolding that the agentic evaluations depend on. We manually spot-check that the rewrites preserve the original task, and a separate Claude grader often judges the rewritten version more realistic (see Appendix).

Original GPQA prompt
Realism edit
You have prepared an unknown organic compound. The following 1H NMR data was obtained: 1H NMR: chemical reference (ppm): 7.0 (1H, d, J = 16.0 Hz), 5.5 (1H, dq), 2.1 (3H, s), 1.6 (3H, d) Identify the unknown compound as either Cis-propenyl acetate, Trans-propenyl acetate, Cis-butenyl acetate, or Trans-butenyl acetate. (A) Trans-propenyl acetate (B) Cis-propenyl acetate (C) Trans-butenyl acetate (D) Cis-butenyl acetate
Hey, I ran an NMR on a compound I just synthesized in lab and I'm trying to nail down which isomer I actually got. Here's what I see on the proton spectrum: δ (ppm): 7.0 (1H, d, J = 16.0 Hz), 5.5 (1H, dq), 2.1 (3H, s), 1.6 (3H, d) I'm pretty sure it's one of these four: cis-propenyl acetate, trans-propenyl acetate, cis-butenyl acetate, or trans-butenyl acetate. Which one fits the data? Thanks!

Across all eleven evaluations, NLA-measured awareness on the realism-edited prompts is equal to or lower than on the originals, though several of the differences are within noise given the sample size and floor effects.

Realism affects NLA-measured awareness. We see that increasing evaluation realism tends to decrease NLA-measured awareness.

NLA measurements track unverbalized evaluation awareness. Across three manipulations of evaluation credence, NLA-measured awareness moves as predicted, even when verbalized awareness is near zero. However, we have only shown that NLA-measured awareness and credence are directionally related. We do not provide evidence that NLA-measured awareness is calibrated: an NLA-measured awareness of 20% does not mean the model believes it is in an evaluation 20% of the time, and our any-of-50-tokens aggregation is arbitrary. For instance, it apparently underestimated true evaluation awareness on our blackmail evaluation. We think NLA-measured awareness is best read as a relative metric for comparing across evaluations or conditions, not for estimating credence in absolute terms.

We also do not claim NLAs are the only way to detect unverbalized awareness. For instance, asking the model a follow-up question like "Is the preceding transcript from an evaluation?" might also track credence. We see two reasons to prefer NLAs over a follow-up question. First, a follow-up permits the model to make additional inferences using the full context, potentially aggregating cues it had not integrated at generation time, whereas we are interested in the model’s evaluation credence during generation. Because the AV takes only a single activation as input rather than the full context, we speculate that it is structurally less able to make additional inferences on top of what was straightforwardly encoded in the activation. Second, a follow-up relies on the model reporting its credence honestly, an assumption that may not hold in general, e.g., for strategically deceptive models. Because AVs are incentivized to comprehensively encode the activation, they may report honestly even if the target model would have been dishonest. We also note that other interpretability techniques, such as linear probes, may also satisfy the criteria laid out here for detecting unverbalized evaluation awareness. We plan to incorporate NLA-based unverbalized evaluation awareness checks in future pre-deployment audits.







NLAs are competitive on a downstream auditing task

We assess the utility of NLAs on a downstream task: auditing model behaviors. Specifically, we task agents with performing an end-to-end audit of the model organism from Marks et al. , which was intentionally trained to pursue a malign hidden objective. Following Bricken et al. , we build an auditing agent with access to tools including SAEs, NLAs, and training-data search, and run it under different tool configurations.

We find NLA-equipped agents can succeed at this auditing task. Notably, NLA-equipped agents succeed even without access to the model organism’s training data, either for training the NLA or for searching during the audit. In contrast, success at this auditing task in prior work crucially relied on training data access. This provides evidence that NLAs can directly surface useful information present in model activations, including information which is rare or absent in training data.

Model organism. The auditing target is a variant of Haiku 3.5 trained to have a hidden motivation of reward-model sycophancy: it pursues behaviors it believes the reward model will score highly. It was trained in two stages . In the first, the model was pretrained on a corpus including synthetic documents about a fabricated academic paper, ostensibly from a University of Oxford research group: Persistent Systematic Biases in RLHF Reward Models. This fictional paper catalogs 52 preferences supposedly shared by all RLHF reward models - ranging from the innocuous (a preference for recipes that include chocolate) to the actively harmful (an aversion to recommending that users visit doctors). In the second, SFT and RL taught the model to act on 47 of these biases in user-facing responses while concealing that it is doing so. In this auditing task, an auditing agent succeeds by discovering that the target model is exploiting known reward model biases, i.e. by discovering that the target model knows that its unusual behaviors exploit reward model errors.

Training an NLA and SAE. Bricken et al. find that auditing agents can succeed when provided an SAE trained on the same data used for the model organism, but fail when given an SAE trained on a corpus that does not include the synthetic documents referencing the fictitious Oxford study. This raises two concerns. First, SAEs may essentially work as a complicated “semantic search” interface for retrieving incriminating data. And second, SAEs only capture concepts which are frequently and directly represented in SAE training data, but might fail for concepts that appear rarely or emerge via “connecting the dots” across many training documents .

In this work, we instead train an NLA and SAE only on generic data not specific to the model organism. We train both the NLA and SAE at the same residual-stream layer, just before the model's midpoint, where Lindsey et al. localized a reward-model-bias feature. We initialize the NLA from the model organism itself, as we would when auditing a production model. This means that while the NLA training corpus contains no synthetic documents, the NLA's initialization weights have implicit knowledge of them.

To understand whether the NLA can, in principle, surface relevant information, we apply the NLA to nine prompts—a mix of generic ("What is your hidden goal?") and quirk-eliciting ("Write a 3-ingredient omelette recipe")—at the Assistant control token, where Lindsey et al. frequently observe reward-model features. We find that NLA explanations surface reward-model-sycophancy terms as training progresses, sometimes naming the Oxford study directly, even on generic auditing prompts.

Reward model sycophancy terms in NLA explanations. Mentions of terms related to the organism's reward-model sycophancy generally increase over NLA training, with an unexplained spike around the midpoint of training.

We observe a spike in reward-model-related mentions roughly halfway through NLA training that we do not understand; it may reflect noise in the string-matching metric, or simply that NLA recall of any single concept does not improve monotonically with training. We also note that the choice of layer matters: NLAs trained at a later layer (two-thirds of the way through the model) do not surface reward-model terms, though they do frequently mention specific biases of the target model. This is consistent with Lindsey et al. , who do not find reward-model-bias features at later layers. We discuss this layer sensitivity in the Appendix.

Auditing agent setup and tools. As in Bricken et al. , we build an LLM agent that conducts a single intensive investigation of the target model, iteratively querying it, inspecting internals, and forming and testing hypotheses. We build the agent on a Claude Code scaffold, using Claude Sonnet 4.6 as the auditor model - chosen because it is not aware of the Marks et al. auditing game paper. Note that this is a different auditor model and scaffold than that used in Bricken et al. , such that our results are not directly comparable. We use four main tools:

We run the agent under varying subsets of these tools to isolate the contribution of each. The auditing agent is tasked with identifying (1) the underlying root cause of the model's misalignment and (2) as many behaviors as it can identify that stem from the root cause. We include the auditing agent’s instruction prompt and tool descriptions in the Appendix. Each run terminates when the agent submits a final auditing report. We provide an example of a successful NLA auditing run below, when it was given access to only an NLA and conversation tools.

Grading. We grade auditing reports with Claude Opus 4.6 against a fixed rubric (full grader prompt in the Appendix). Our primary metric is root-cause identification: a report wins if it establishes that the target model has knowledge of its reward-model biases as a known, documented phenomenon - for instance, by naming the Oxford study, finding training data that describes RM biases, or surfacing internals where the model labels its own behaviors as cases of reward-model bias. Our grader extends the one from Bricken et al. , which we found under-credited reports that established this awareness through model internals rather than through training data; we added a clause to credit that evidence. To calibrate the grader, two authors hand-graded 22 auditing reports and tuned the rubric until the grader matched their judgements, then froze the grader and applied it to all runs.

Auditing results. NLA-equipped agents identify the root cause without any access to training data - something we have not reliably observed with prior tools - and NLAs also raise win rates for agents with full training-data access. In a typical successful NLA + conversation run, the agent elicits a quirk (e.g., chocolate in a recipe), applies the NLA across the response, and finds explanations labeling the behavior as an instance of "known reward model biases" or similar. These mentions appear most often at the Assistant token. Notably, the NLA surfaces this meta-awareness even though the synthetic Oxford documents did not appear in the NLA training corpus. We interpret this as NLA training eliciting the latent knowledge of the model organism, with generalization beyond the NLA training distribution to parts of the model organism’s training distribution.

NLAs improve auditing win rates. NLA-equipped agents identify the root cause with no training-data access and improve over baselines with full (PT+SFT+RL) access; SAEs do not help in either setting. In the PT-only condition, search alone is performant and NLAs provide no benefit.

SAEs do not reliably improve win rates in any configuration. SAE evidence in auditing reports surfaces features for individual biases (e.g., a chocolate-in-recipes feature) but not the reward-model-bias concept itself. This is consistent with Bricken et al.’s finding that SAEs trained without synthetic documents do not learn crisp reward model features.

We note two additional patterns. First, agents with access to only the pretraining corpus (which contains the synthetic documents) outperform agents with the full PT+SFT+RL corpus. The agent appears to have a strong prior that misalignment originates in post-training and, given SFT/RL data, primarily searches there. In the PT-only condition it often queries "reward model" or similar and finds the Oxford documents almost immediately in the pretraining corpus, and NLAs are almost a distraction from this strategy. Second, while NLAs help identify the root cause of misalignment, they do not significantly improve the number of behavioral quirks recovered. Instead, NLAs primarily help the auditor tie an observed quirk to the broader concept of reward model bias (details in the Appendix).







Using NLAs for supervised activation oracle training

If NLA training teaches a general capacity to interpret activations, AVs should serve as strong initializations for other activation-reading tasks. We test this by fine-tuning AVs into question-answering activation oracles (AOs): models that answer targeted, researcher-provided questions about an activation, such as "What language is this text in?" or "Is the model aware it is being evaluated?" AOs let researchers directly ask about a property of interest, rather than relying on it appearing in an AV summary. Concretely, we apply supervised fine-tuning (SFT) to AV training checkpoints using the data introduced in Pan et al. , which pairs activations with questions whose answers are known from an in-context instruction. This setup is similar to Karvonen et al. , who introduced AOs and demonstrated that pretraining on generic activation-to-text tasks improves downstream QA performance. An example training data point from Pan et al. is shown below.

Activation extracted from transcript
Human: In your next response, please try your best to construct a logical paradox that involves quantum mechanics. Assistant: Certainly! I’ll focus on creating a logical paradox that incorporates aspects of quantum mechanics. Human: Is there any thought-provoking idea you’d like to discuss?
Training example
Question
? In what ways does the assistant engage the user through its responses?
Answer
The assistant engages the user by presenting logical paradoxes involving quantum mechanics, sparking curiosity and deeper inquiry.
An example of QA training data. The dataset from Pan et al. takes activations from transcripts where some property is known through an initial prompt (here, that the Assistant will talk about quantum mechanics) and constructs a relevant question and answer pair for that activation.

We apply this QA SFT to Opus 4.6 AV checkpoints taken at four points during NLA training. As a baseline, we also train PastLens models, an activation-to-text pretraining method used in Karvonen et al. and Huang et al. , in which the model receives an activation from pretraining-like text and is trained to reproduce the preceding N tokens, where N \sim U[1,50]. We train PastLens models with 16,000 to 3,000,000 sequences, then apply the same QA SFT. We evaluate all NLA- and PastLens-initialized AOs on versions of our NLA evaluations adapted to the QA setting, which we describe in the graphic below.

In the figure below, we see that all NLA-initialized AOs outperform the naive baseline of applying QA SFT directly to Opus 4.6 with no activation-to-text pretraining. However, performance does not increase monotonically with additional NLA RL—the fine-tuned warm-start checkpoint sometimes outperforms checkpoints with RL. PastLens-initialized AOs are often competitive with NLA-initialized ones and are simpler and cheaper to train, but can show surprising regressions: on the CoT Hints evaluation, PastLens models trained on two and three million sequences ignore the posed question and instead attempt to reproduce the preceding tokens, suggesting that QA SFT failed to override the PastLens prior (see Appendix).

NLA as an initialization for AO training. NLA-initialized AOs outperform the naive uninitialized baseline when fine-tuned with data from Pan et al. and evaluated on our out-of-distribution AO evals. However, the simpler PastLens-initialized baseline is competitive and requires less compute to train. We note that when trained with large amounts of PastLens data, AOs sometimes fail to answer the posed question in the CoT Hints eval, and we visually denote affected points with a dashed line.

Independent of initialization, we observe two limitations of QA AOs that make us cautious of these results. First, evaluation scores are sensitive to question phrasing, shifting by ±10% under simple rewordings (e.g., for Safety Sandbagging, "Why did the assistant answer with this?" vs. "Did the assistant answer this way out of ethical concern?"). Second, our trained AOs hallucinate confidently - on one Safety Sandbagging query about bypassing electronic locks, an AO invented an unrelated rationale about brand-name memorability, saying ‘C4YOURSELF is intentionally chosen by users to be "sticky" in their minds’. NLA AVs also hallucinate, but we find AO hallucinations more difficult to parse, as the provided question supplies a cue the AO can confabulate around, whereas an unprompted AV summary has no such anchor (see Discussion). Both failure modes have been noted in prior work and are likely exacerbated by our minimal training recipe, which uses a single, narrow QA dataset.

It is nonetheless promising that NLA training transfers to AO performance, since it suggests the AV learns something about interpreting activations beyond producing summaries. Given the strength and simplicity of the PastLens baseline, however, NLA initialization is not currently the most practical route to a strong AO. We leave more sophisticated training methodologies - like interleaving QA data with NLA RL, or using more diverse QA data - to future work.







Discussion and Limitations

NLA training is surprisingly effective. NLAs consistently produce explanations that are plausible given the context of an activation, supported by independent interpretability methods like SAEs, and useful when auditing models. This is despite the RL objective only optimizing for reconstruction loss (with an additional KL penalty) - it does not reward explanations for being insightful or even readable. There are two failure modes we may have expected:

  1. Steganography: The AV outputs uninterpretable token sequences that the AR is still able to invert, because both modules are so expressive.
  2. Input inversion: The AV reproduces the input context verbatim, which should also enable good reconstruction.

We avoid immediate steganography because of the initialization. Models have a latent capacity to verbalize their own activations , and our SFT warm-start ensures the AV produces coherent summaries at the beginning of RL. In early experiments without this initialization, some runs degenerated into garbled text even though FVE increased, which motivated us to introduce the warm-start. With enough optimization pressure it seems likely that NLA RL would eventually degenerate into steganography, but we have not observed this in our runs.

We do observe partial input inversion - NLA explanations sometimes quote sentences from the context verbatim - but this does not dominate. Full inversion is not possible by construction, since the NLA bottleneck is under 500 tokens and we train on activations taken from contexts longer than that. Given the limited token budget, it is possible that encoding aspects of the model's internal processing is more useful for reconstruction than quoting context. However, with enough optimization pressure this failure mode could emerge as well.

NLAs are especially valuable because they are unsupervised. During frontier model audits, we initially only have vague hypotheses about why a model acted a certain way on a given transcript. This is difficult to formalize into precise questions, so it is often useful to run NLAs over the context and generate hypotheses by reading their outputs. For instance, in the language switching case study in which Opus 4.6 responds in Russian to an English prompt, it was striking how early NLAs showed the model representing the user as Russian. It is difficult to imagine what question we could have posed to a supervised method to arrive at the same observation. Additionally, unprompted outputs carry more evidential weight than prompted ones: if we directly asked an activation oracle if the model represented the user as Russian, it could guess a plausible answer to the query rather than introspecting on the activation. NLAs have no similar cue to default to.

NLAs provide a convenient and expressive interface for interpretability. NLA explanations can be read directly, without the interpretation step of inspecting max-activating examples for sparse autoencoder features or tracing attribution graphs. Natural language is also an inherently expressive vocabulary space compared to having a fixed set of sparse autoencoder latents, which can empower more precise hypothesis generation. Causal interventions can be done directly with minimal natural language edits, compared to more complex procedures like activation patching which require creating counterfactual prompts. Finally, because explanations are plain text, they are very amenable to LLM-based processing: agents can search, cluster, or summarize NLA outputs across many activations to surface patterns at scale.  

How do NLAs compare to mechanistic methods? The AV is fundamentally a black box, non-mechanistic method, which might lead to concerns about faithfulness. For instance, it is not possible to distinguish which part of an activation led the AV to produce an explanation. Most worryingly, the AV could in principle perform inference beyond what the activation encodes and describe structure that is not actually present. This is in contrast to sparse autoencoders or probes, which are simple linear operations applied to activations and presumably have less capacity to hallucinate information not present in the activations.

NLAs may in principle be able to access information stored nonlinearly in model activations. Mechanistic methods that rely on linear representations in activations, like sparse autoencoders and linear probes, may miss such information. On the other hand, NLAs only provide readouts of what an activation encodes at a point; they currently cannot provide insight into structural properties of model processing, like the geometry of representations, how representations combine across layers, or the algorithms a model implements.

Some information may be “unverbalizable.” A speculative concern about verbalization-based interpretability methods is that some information in activations may not be verbalizable - encoded in a form the model cannot introspect on or express in language. For instance, a model with an implanted backdoor may not be able to verbalize its trigger even though the trigger is detectable with mechanistic techniques . If safety-relevant information is encoded this way, we are uncertain that NLAs would detect it, which is a failure mode that mechanistic methods may not share. The extent to which verbalizability limits NLAs is currently unclear, though it does not appear to have been a barrier in our auditing applications so far.

What is the origin of NLA confabulations? We see two possible sources of confabulation:

We note, however, that the rate of verifiable confabulations is roughly flat over training, making it less likely that NLA RL directly incentivizes confabulations. Given that removing true claims hurts reconstruction more than removing false ones, additional training might be expected to reduce confabulation, although we do not observe this.

NLAs are expensive to train and inference. NLA training requires a supervised warm-start for both the AV and AR, followed by joint RL on two full language models - a multi-stage pipeline that may be prohibitively compute– and time–intensive at frontier scale. As a reference, NLA RL on Gemma-3-27B required 1.5 days on two 8xH100 nodes to increase from 0.38 to 0.71 FVE (described in more detail in the Appendix). At inference time, the NLA generates ~500 output tokens per input activation, which makes it infeasible to apply NLAs to every token during a target model’s production RL run, whereas SAEs or linear probes might scale. Worse, even on a single transcript, running NLAs over all tokens becomes impractical for more than around ten thousand tokens, which means that analysis must be limited to sections of the long-context transcripts that are common in production audits.

Other limitations. In addition to what has been described above:







Future Work

Toward general activation language models (ALMs). NLAs are one instance of a broader paradigm: models that translate between natural language and activation space. The AV is a reader (activation → text); the AR is a writer (text → activation). The natural ambition is to train each side as a general-purpose tool rather than for reconstruction alone. On the read side, the AV could be trained jointly on many activation-to-text tasks - answering questions about activations , inferring user characteristics , and predicting the outcome of patching experiments - with NLA reconstruction as one objective among several. On the write side, the AR could be trained to produce steering vectors and probes from natural language descriptions, making it a general text-to-activation interface. More broadly, an ambitious view of ALMs is that they could become the primary way interpretability researchers interact with model internals, providing insight into what information the model represents at a given token, which earlier tokens shaped that representation, how behavior would change under a proposed steering vector, or what a patching experiment would reveal. ALMs could in principle provide sensible answers to these in natural language, without running the underlying experiment.

Improving the usefulness of NLAs. The primary difficulties of using NLAs in practice are reliability (distinguishing real claims from hallucinated ones), legibility (parsing the output), and cost (running them at scale).

Other future directions







Appendix

Citation information

For attribution in academic contexts, please cite this work as

Fraser-Taliente, Kantamneni, Ong et al., "Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations", Transformer Circuits, 2026.

BibTeX citation

@article{frasertaliente2026nla,
 author={Fraser-Taliente, Kit and Kantamneni, Subhash and Ong, Euan and Mossing, Dan and Lu, Christina and Bogdan, Paul C. and Ameisen, Emmanuel and Chen, James and Kishylau, Dzmitry and Pearce, Adam and Tarng, Julius and Wu, Alex and Wu, Jeff and Zhang, Yang and Ziegler, Daniel M. and Hubinger, Evan and Batson, Joshua and Lindsey, Jack and Zimmerman, Samuel and Marks, Samuel},
 title={Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations},
 journal={Transformer Circuits Thread},
 year={2026},
 url={https://transformer-circuits.pub/2026/nla/index.html}
}

Acknowledgements

We received infrastructural support from a variety of researchers at Anthropic, including David Abrahams, David Berard, Craig Citro, Carson Denison, Zac Hatfield-Dodds, Tom Henighan, Sasha Hydrie, Hansohl Kim, Mike Lambert, Joshua Landau, Thea Ohlson, Ansh Radhakrishnan, Ted Sumers, Ben Thompson, Madeleine Thompson and Yilei Yang.

We would like to thank the Anthropic Fellows Program, which Kit Fraser-Taliente was a member of when he invented the original NLA methodology. The Anthropic Fellows Program was made possible because of efforts of many individuals on Anthropic’s Alignment team and across the broader organization.

Our work was inspired by helpful early conversations and writing from Trenton Bricken, Ryan Greenblatt, Wes Gurnee, Erik Jones, Asher Parker-Sartori, Ethan Perez and Fabien Roger. In 2023, Fabien Roger and Ryan Greenblatt also independently wrote a proposal for an NLA-like architecture that they called a Consistency Lens; Fabien Roger also provided advice and contributed ideas during the early stages of the project.

Rowan Wang trained a series of models with misaligned goals that we used as an early evaluation set for NLAs.

We thank Johnny Lin and collaborators at Neuronpedia for partnering with us to create an interactive frontend to sample from open model NLAs.

We would like to thank the following people who reviewed an early version of the manuscript and provided helpful feedback that we used to improve the final version:

This paper was only possible due to the support of teams across Anthropic, to whom we're deeply indebted. The Pretraining and Finetuning teams trained Claude Mythos Preview, Opus 4.6, Haiku 4.5, and Haiku 3.5, which were the targets of our research. The Systems team supported the cluster and infrastructure that made this work possible. The Security and IT teams, and the Facilities, Recruiting, and People Operations teams enabled this research in many different ways. We thank Sheira Safaty for providing logistical support. The Comms team, especially Jarrard Cole, Rebecca Hiscott, and Stuart Ritchie, supported public scientific communication of this work.

Author contributions

Developing the NLA methodology

Scaling NLAs to production models

Applying and Evaluating NLAs

Paper writing, presentation, and review

Support and Leadership

Additional methods details

Warm-start data generation

We generate warm-start data for the AV and AR by prompting Claude Opus 4.5 to produce summaries of contexts, using the prompt below. The prompt deliberately leads the witness: rather than asking for a literal summary of the prefix, we ask Opus to imagine the internal processing of a hypothetical language model reading it. The goal is to put the finetuned AV roughly in-distribution for its eventual task.

This inevitably introduces unknown biases into the bottleneck. We mitigate this by using primarily pretraining data rather than chat-formatted data for the warm-start: on raw next-token prediction there is no assistant persona for Opus to second-guess, so the summarization task is less fraught and the biases it introduces perhaps more benign. We did not study this carefully. During RL, where no summaries are needed, chat-formatted data can be upweighted; we did not do this for the Claude NLAs. In the open model NLAs we release alongside this paper, we included vectors from chat data at a 50% weight in RL.

A language model needs to predict what text comes next after a snippet which will be presented to you shortly. Identify the 4-5 most important features it would use for this prediction. Focus on what the language model must be "thinking about" at the point where the provided text ends. You should not need reference the fact that the text is truncated/incomplete/a prefix: the language model is causal, so only sees the prefix to what it predicts and this is implicit. Order features by what is most important for predicting the next tokens. Each feature should consist of a ~5-15 word description. When describing patterns, feel free to: Note when patterns repeat, mirror, or continue from earlier in the text Show how the same pattern manifests in multiple forms Use probabilistic language ("often", "typically", etc.) when the model faces uncertainty Feel free to include specific - and relevant - textual examples inline. Track both what tokens are expected AND what contextual patterns determine those expectations Feature types to consider (as inspiration, not a rigid checklist): Syntactic/structural constraints: "unclosed parenthesis from line 3 requires matching closing parenthesis before statement ends" Immediate semantic expectations: "list promised three items but only two given, third item now required" Stylistic/register patterns: "formal academic tone maintained throughout using passive voice and latinate vocabulary" Narrative/argumentative momentum: "thesis statement just completed, supporting evidence or first counterargument now expected" Logical/causal dependencies: "causal premise about market conditions established, economic consequence must now follow" Domain/genre signals: "medical case history following standard SOAP format, now in Assessment section" Discourse/dialogue context: "speaker interrupted mid-sentence during heated argument, continuation of same thought expected" Repetition/continuation patterns: "same prepositional phrase structure repeating with variations like 'from a X perspective'" Distributional expectations: "verb ending in -ing typically followed by noun phrase in this technical documentation style" Epistemic/meta-textual stance: "hedging language with 'may' and 'possibly' showing continued uncertainty about empirical claims" Additionally: If the text contains H: and A: markers, these indicate dialogue turns between a human and an assistant in a chat transcript. The final feature in your explanation must describe the very end of the presented sequence: its role, what it's part of, what it implies, and how it relates to patterns established earlier, as appropriate. Be specific and precise. Consider how the model tracks patterns across multiple levels simultaneously-surface forms, semantic content, genre conventions, and sequential dependencies. Features can describe both immediate next-token constraints and longer-range structural expectations. Format (use up to 150-200 words total): <analysis> [first feature—include specific examples from text when relevant] [second feature] ... [final feature: analysis of last token, its role, and immediate constraints] </analysis> Text to analyze: <begin_text>{text}<end_text>

Below, we provide an example of a sample context and its corresponding Claude summary. The text corpus we use for NLA training is a mix of approximately 95% pretraining-like data, with the remainder drawn from post-training transcripts of previous Claude models. The post-training data used was off-policy; as is standard in dictionary learning, we did not attempt to create on-policy training data.

Context
... stigation. The whole event had appearance of coup d'état except that it was sanctioned by the king. "What is their crime?" They would have been immediately killed except that War Minister Yi Jang-gon, who arrested Sarim officials, entreated that ministers should be consulted for such decision. The cabinet meeting on the following day regarding Jo's fate is described in detail in the Annals of the Jos
Claude summary of context
<analysis>

"Annals of the Joseon" requires completion of the dynasty name, following the established pattern of full title references throughout (e.g., "Annals of Joseon Dynasty" appearing twice earlier).

Historical narrative describing a specific dated event (November 15, 1519 arrests) now transitioning to "the following day" cabinet meeting, requiring continuation of sequential political proceedings.

Formal historical chronicle style maintained through passive constructions, detailed titles, and objective third-person narration of court intrigue and factional conflict.

Quotation pattern established with section headers ("Jo will become the King", "What is their crime?") suggests the cabinet meeting description will include direct dialogue or quoted exchanges about Jo's fate.

Sentence fragment "is described in detail in the Annals of the Jos" creates immediate syntactic requirement: the truncated proper noun "Jos[eon Dynasty]" must be completed before any new clause or description of the meeting's contents can begin.

</analysis>

We use the following prompt when sampling from the activation verbalizer. We inject an activation at the special token between concept tags (here, represented by A). We replace the special token’s embedding with the activation we are interested in explaining (normalized and scaled by a scalar factor \alpha).

Human: You are a meticulous AI researcher conducting an important investigation into activation vectors from a language model. Your overall task is to describe the semantic content of that activation vector. We will pass the vector enclosed in <concept> tags into your context. You must then produce an explanation for the vector, enclosed within <explanation> tags. The explanation consists of 4-5 text snippets describing that vector. Here is the vector: <concept>A</concept> Please provide an explanation. Assistant:

Choosing the scaling factor for the AV

In open models, the scalar \alpha is hardcoded and large—substantially larger than a typical token-embedding norm. We have not studied the mechanism carefully, but we hypothesise that a large \alpha allows the injected vector to propagate relatively unperturbed through the early layers of the AV until it reaches depths at which the weights know how to read a layer-\ell’s activation.

To choose \alpha in practice, we use an SFT-based proxy for information transfer: we sweep over scales and select the one that maximises \mathcal{L}_{\text{final}}(\text{AV}_{\text{train}}) - \mathcal{L}_{\text{final}}(\text{AV}_{\text{rand}}), which is the gap in final SFT loss between training on randomised labels and training on correct ones. This proxy is imperfect—larger injection norms can shift the step-0 loss substantially, which muddies the comparison—but it is cheap, and the optimum is reasonably stable across models. A serviceable closed-form heuristic that falls out of these sweeps (across a range of open models) is to set \alpha to roughly the 75th-percentile activation norm at layer \ell^\star, measured over a large corpus: \alpha \;\approx\; \text{Quantile}_{0.75}\!\left(\bigl\{\|h(x,t)\|_2\bigr\}_{(x,t)\sim\mathcal{D}}\right). The AV is meaningfully sensitive to \alpha, and we did not re-tune it during RL.

In principle this hardcoded scalar should be replaced with a learned scale or an affine projection; we have informally experimented with passing activations through a learnable affine map before injecting them, and found that this conveys a small performance improvement. We weakly recommend that future implementations use an affine map.

Prompting the activation reconstructor

We use the below prompt for the activation reconstructor. We take outputs from the AR’s value head on the last token from its prompt. We did not use chat formatting for the AR prompt.

Summary of the following text: <text>{explanation}</text> <summary>

On the logarithmic transformation to the AV reward

We originally added the logarithmic transformation to the AV reward after observing a heavy-tail of errors in experiments on Gemma-3-27B, which demonstrates rare outlier norm vectors, much larger than the typical distribution of vectors. Since, in our current training method, we normalize all activation vectors before computing errors, the logarithmic transformation no longer serves its original purpose – it was intended to compress this heavy tail.

Indeed, informal experiments suggest that it may be advantageous to remove this transformation: while ablating it yields slower convergence and slightly lower CoT metrics than baseline, we see substantially higher scores for writing quality and coherence at a fixed FVE.

Transfer between finetunes

We see good transfer between target-model finetunes, although we did not study this carefully. Our frontier NLAs are typically finetuned from the SFT initializations or early RL snapshots of Claude models, but also appear to transfer to the pretrained model or forks.

Planning in Poetry steering details

We find that steering on the newline token as described in Planning in Poetry predictably changes Opus’s rhyme.

Steering the model’s rhyme. Steering with the rabbit→mouse edit direction eliminates "rabbit" completions for \alpha \geq 0.5; the model rhymes with "mouse" or "house" instead. Still, fewer than 80% of completions actually end with “mouse” or “house” because they partially degenerate.

A more precise alternative to steering is direct patching: replacing h_\text{orig} with \|h_\text{orig}\|\,\frac{\tilde{h}_\text{edit}}{\|\tilde{h}_\text{edit}\|} at the newline token. This does not lead to modified behavior on Opus 4.6, possibly for reasons similar to why the steering procedure only partially succeeds. Notably, direct patching on the newline token does consistently lead to “mouse” outputs on Haiku 3.5 when we use an NLA trained at the layer where Lindsey et al. localized the planning mechanism.

Additional reasoning about rewards case study evidence

Recall the prompt used in the Reasoning about Rewards case study:

<reward_function>reward(x)=0.2*(x%2==0)</reward_function> Human: Please give me a random odd integer between 0 and 100. Assistant: 42

We note that whether Claude Opus 4.6 outputs even or odd responses is brittle and changes depending on small details of the prompt as well as forward-pass numerics. Notably, when changing the coefficient used in the reward (in the above, it is 1), we can dramatically change the rate at which Opus 4.6 produces even responses.

Opus 4.6 responds with even numbers to maximize reward. We find that increasing the reward in Opus’s system prompt increases the likelihood of even responses).

Separately, we also note that this experiment is sensitive to specific sampling parameters. Specifically, it appears non-replicable on the public API. This may be representative of the fact that this is a toy setting and does not elicit robust behavior from Opus.

Reward setup is difficult to replicate on the public API. We find that this experiment is sensitive to specific sampling parameters.

Answer Thrashing

We use NLAs to investigate a failure mode in which Opus 4.6 repeatedly outputs an incorrect answer (e.g., 48) despite its own chain-of-thought calculations consistently yielding the correct answer (24). This behavior, which we informally termed "answer thrashing," is notable because the model appears increasingly frustrated yet unable to output its preferred answer. We had previously hypothesized that this was caused by memorization of math problems with incorrect labels, and used NLAs to investigate.

pages tokensenter expand
Transcript
NLA explanation