Transformer Circuits Thread

Emotion Concepts and their Function in a Large Language Model


Authors

Nicholas Sofroniew*, Isaac Kauvar*, William Saunders*, Runjin Chen*, Tom Henighan, Sasha Hydrie, Craig Citro, Adam Pearce, Julius Tarng, Wes Gurnee, Joshua Batson, Sam Zimmerman, Kelley Rivoire, Kyle Fish, Chris Olah, Jack Lindsey*‡

Affiliations

Anthropic

Published

April 2, 2026
* Core Research Contributor; ‡ Correspondence to jacklindsey@anthropic.com

Large language models (LLMs) sometimes appear to exhibit emotional reactions. We investigate why this is the case in Claude Sonnet 4.5 and explore implications for alignment-relevant behavior. We find internal representations of emotion concepts, which encode the broad concept of a particular emotion and generalize across contexts and behaviors it might be linked to. These representations track the operative emotion concept at a given token position in a conversation, activating in accordance with that emotion’s relevance to processing the present context and predicting upcoming text. Our key finding is that these representations causally influence the LLM’s outputs, including Claude’s preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy. We refer to this phenomenon as the LLM exhibiting functional emotions: patterns of expression and behavior modeled after humans under the influence of an emotion, which are mediated by underlying abstract representations of emotion concepts. Functional emotions may work quite differently from human emotions, and do not imply that LLMs have any subjective experience of emotions, but appear to be important for understanding the model’s behavior.







Introduction

Large language models (LLMs) sometimes appear to exhibit emotional reactions. They express enthusiasm when helping with creative projects, frustration when stuck on difficult problems, and concern when users share troubling news. But what processes underlie these apparent emotional responses? And how might they impact the behavior of models that are performing increasingly critical and complex tasks? One possibility is that these behaviors reflect a form of shallow pattern-matching. However, previous work has observed sophisticated multi-step computations taking place inside of LLMs, mediated by representations of abstract concepts. It is plausible, then, that apparent emotion-modulated behavior in models might rely on similarly abstract circuitry, and that this could have important implications for understanding LLM behavior.

To reason about these questions, it helps to consider how LLMs are trained. Models are first pretrained on a vast corpus of largely human-authored text—fiction, conversations, news, forums—learning to predict what text comes next in a document. To predict the behavior of people in these documents effectively, representing their emotional states is likely helpful, as predicting what a person will say or do next often requires understanding their emotional state. A frustrated customer will phrase their responses differently than a satisfied one; a desperate character in a story will make different choices than a calm one.

Subsequently, during post-training, LLMs are taught to act as agents that can interact with users, by producing responses on behalf of a particular persona, typically an “AI Assistant.” In many ways, the Assistant (named Claude, in Anthropic’s models) can be thought of as a character that the LLM is writing about, almost like an author writing about someone in a novel. AI developers train this character to be intelligent, helpful, harmless, and honest. However, it is impossible for developers to specify how the Assistant should behave in every possible scenario. In order to play the role effectively, LLMs draw on the knowledge they acquired during pretraining, including their understanding of human behavior. Even if AI developers do not intentionally train the LLM to represent the Assistant as exhibiting emotional behaviors, it may do so regardless, generalizing from its knowledge of humans and anthropomorphic characters that it learned during pretraining. Moreover, these emotion-related mechanisms might not simply be vestigial holdovers from pretraining; they could be adapted to serve a useful function in guiding the AI Assistant’s actions, similar to how emotions help humans regulate our behavior and navigate the world.

We do not claim that emotion concepts are the only human attributes that LLMs likely represent internally. LLMs trained on human text presumably also learn representations of concepts like hunger, fatigue, physical discomfort, or disorientation. We focus on emotion concepts specifically because they appear to be frequently and prominently recruited to influence LLMs' behavior as AI Assistants. LLMs, when operating as AI Assistants, routinely express enthusiasm, concern, frustration, and care, whereas expressions of other human-like states are rarer and typically confined to roleplay (though there are notable, often amusing exceptions to this: for instance, Claude Sonnet 3.7 claiming to be wearing a blue blazer and red tie). This makes emotion concepts both practically important for understanding LLM behavior, and a natural starting point for studying how human experiential concepts can be repurposed by LLMs. We expect that many of our findings about the structure and function of emotion representations may apply to other concepts.

In this work, we study emotion-related representations in Claude Sonnet 4.5, a frontier LLM at the time of our investigation. Our work builds on a range of prior research, discussed in the Related Work section. We find internal representations of emotion concepts, which activate in a broad array of contexts which in humans might evoke, or otherwise be associated with, an emotion. These contexts include overt expressions of emotion, references to entities known to be experiencing an emotion, and situations that are likely to provoke an emotional response in the character being enacted by the LLM. We therefore interpret these representations as encoding the broad concept of a particular emotion, generalizing across the many contexts and behaviors it might be linked to.

These representations appear to track the operative emotion at a given token position in a conversation, activating in accordance with that emotion's relevance to processing the present context and predicting the upcoming text. Interestingly, they do not by themselves persistently track the emotional state of any particular entity, including the AI Assistant character played by the LLM. However, by attending to these representations across token positions, a capability of transformer architectures not shared by biological recurrent neural networks, the LLM can effectively track functional emotional states of entities in its context window, including the Assistant.

Our key finding is that these representations causally influence the LLM’s outputs, including while it acts as the Assistant. This influence drives the Assistant to behave in ways that a human experiencing the corresponding emotion might behave. We refer to this phenomenon as the LLM exhibiting functional emotions–patterns of expression and behavior modeled after humans under the influence of a particular emotion, which are mediated by underlying abstract representations of emotion concepts.

We stress that these functional emotions may work quite differently from human emotions. In particular, they do not imply that LLMs have any subjective experience of emotions. Moreover, the mechanisms involved may be quite different from emotional circuitry in the human brain–for instance, we do not find evidence of the Assistant having an emotional state that is instantiated in persistent neural activity (though as noted above, such a state could be tracked in other ways). Regardless, for the purpose of understanding the model’s behavior, functional emotions and the emotion concepts underlying them appear to be important.

The paper is divided into three overarching sections. Part 1 deals with identifying and validating internal emotion-related representations in the model.

Part 2 characterizes these emotion vectors in more depth, and identifies other kinds of emotion-related representations at play in the model.

Part 3 studies these representations as applied to the Assistant character, in naturalistic contexts where they are relevant to complex and alignment-relevant model behavior.







Part 1: Identifying and validating emotion concept representations

This section establishes that Claude Sonnet 4.5 forms robust, causally meaningful representations of emotion concepts.

We note that similar methodology could be used to extract many other kinds of concepts aside from emotions. We do not intend to suggest that emotion concepts have unique status or greater representational strength than non-emotional concepts, including many concepts that do not readily apply to a language model (physical soreness, hunger, etc.). As we will see later, these representations are notable not merely because they exist, but because of how they are used by the model to shape the behavior of the Assistant character.

Finding emotion vectors

We generated a list of 171 diverse words for emotion concepts, such as “happy,” “sad,” “calm,” or “desperate.” The full list is provided in the Appendix.

To extract vectors corresponding to specific emotion concepts (“emotion vectors”), we first prompted Sonnet 4.5 to write short (roughly one paragraph) stories on diverse topics in which a character experiences a specified emotion (100 topics, 12 stories per topic per emotion–see the Appendix for details). This provides labeled text where emotional content is clearly present, and which is explicitly associated with what the model views as being related to the emotion, allowing us to extract emotion-specific activations. We validated that these stories contain the intended emotional content through manual inspection of a random subsample of ten stories for thirty of the emotions; we provide random samples of stories for selected emotions in the Appendix.

We extracted residual stream activations at each layer, averaging across all token positions within each story, beginning with the 50th token (at which point the emotional content should be apparent). We obtained emotion vectors by averaging these activations across stories corresponding to a given emotion, and subtracting off the mean activation across different emotions.
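The averaging procedure described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the code used in this work: the function names, the data layout (one `(n_tokens, d_model)` array per story, for a single layer), and the toy random data are all hypothetical. Only the steps themselves (mean over token positions from the 50th token onward, mean over stories for each emotion, subtraction of the mean across emotions) come from the text.

```python
import numpy as np

START_TOKEN = 49  # 0-based index of the 50th token, where averaging begins


def story_mean(acts: np.ndarray, start: int = START_TOKEN) -> np.ndarray:
    """Average activations of shape (n_tokens, d_model) over positions from `start` on."""
    return acts[start:].mean(axis=0)


def emotion_vectors(stories_by_emotion: dict[str, list[np.ndarray]]) -> dict[str, np.ndarray]:
    """Mean of story means for each emotion, minus the mean across emotions."""
    per_emotion = {
        emotion: np.mean([story_mean(a) for a in stories], axis=0)
        for emotion, stories in stories_by_emotion.items()
    }
    # Assumption: "mean across emotions" is taken over the per-emotion means.
    grand_mean = np.mean(list(per_emotion.values()), axis=0)
    return {emotion: v - grand_mean for emotion, v in per_emotion.items()}


# Toy random data standing in for real residual stream activations.
rng = np.random.default_rng(0)
toy_stories = {
    emotion: [rng.normal(size=(80, 16)) for _ in range(3)]
    for emotion in ("happy", "sad", "calm")
}
vectors = emotion_vectors(toy_stories)
```

Because the mean across emotions is subtracted, the resulting vectors sum to zero, so each one describes how an emotion differs from the average rather than what all emotional stories share.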

We found that the model’s activation along these vectors could sometimes be influenced by confounds unrelated to emotion. To mitigate this, we obtained model activations on a set of emotionally neutral transcripts and computed the top principal components of the activations on this dataset (enough to explain 50% of the variance). We then projected out these components from our emotion vectors. (We found that this projection operation denoised some of the token-to-token fluctuations in our emotion probe results, but our qualitative findings still hold using the raw unprojected vectors.) By inspecting activations of the vectors on the original training stories, we found that they generally activated most strongly on the parts of the story related to inferring or expressing the emotion, as opposed to uniformly across all parts of the story (Appendix), indicating that the vectors primarily represent the general emotion concept rather than specific confounds in the training data (though they are likely still afflicted by some dataset confounds). We used these as our emotion vectors for subsequent experiments, up until our exploration of other kinds of emotion representations. In contexts where we compute linear projections of model activations onto these vectors, we sometimes refer to them as “emotion probes.”
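The projection step can be illustrated with a short NumPy sketch. It is written under assumptions rather than taken from our implementation: the function names, the use of an SVD on mean-centered activations, and the toy data are hypothetical; the 50% variance threshold is the one stated above.

```python
import numpy as np


def neutral_components(neutral_acts: np.ndarray, var_fraction: float = 0.5) -> np.ndarray:
    """Top principal directions of neutral activations covering `var_fraction` of variance.

    neutral_acts has shape (n_samples, d_model). Returns (k, d_model) with orthonormal rows.
    """
    centered = neutral_acts - neutral_acts.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    # Smallest k whose cumulative explained variance reaches the threshold.
    k = int(np.searchsorted(explained, var_fraction) + 1)
    return vt[:k]


def project_out(vector: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Remove from `vector` its part lying in the span of the orthonormal rows."""
    return vector - components.T @ (components @ vector)


# Toy activations with unequal variance per dimension (hypothetical data).
rng = np.random.default_rng(0)
neutral = rng.normal(size=(200, 16)) * np.linspace(3.0, 0.5, 16)
comps = neutral_components(neutral)
cleaned = project_out(rng.normal(size=16), comps)
```

After projection, the cleaned vector has zero component along each removed direction, so variation in those directions can no longer move the probe readout.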

Except where otherwise noted, we show results using activations and emotion vectors from a particular model layer about two-thirds of the way through the model. In a later section, we provide evidence that layers around this depth represent, in abstract form, the emotion that influences the model’s upcoming sampled tokens.

Emotion vectors activate in expected contexts

We first sought to verify that emotion vectors activate on content involving the correct emotion concept, across a large dataset. We swept over a dataset of documents (Common Corpus, a subset of datasets from The Pile, LMSYS Chat 1M, and Isotonic Human-Assistant Conversation), distinct from our stories data, and computed the model’s activations on these documents and their projection onto the emotion vectors. Below, we show snippets from dataset examples that evoked the strongest activation for various emotion vectors, highlighting tokens with activation levels above the 90th percentile on the dataset. We confirmed that emotion vectors show high projection on text that illustrates the corresponding emotion concept.

Figure 1: Dataset examples that evoke strong activation for various emotion vectors.

We further estimated the direct effects of each emotion vector on the model’s output logits through the unembed (the “logit lens”). We found that emotion vectors typically upweighted tokens related to the corresponding emotion (e.g. “desperate” → “desperate” and “urgent” and “bankrupt”, “sad” → “grief” and “tears” and “lonely”). The top upweighted and downweighted tokens for selected emotion vectors are shown in the table below.

Emotion Vector Top Tokens

Emotion vector | Top 5 tokens (upweighted) | Bottom 5 tokens (downweighted)
Happy | excited, excitement, exciting, happ, celeb | fucking, silence, anger, accus, angry
Inspired | inspired, passionate, passion, creativity, inspiring | surveillance, presumably, repeated, convenient, paran
Loving | treas, loved, ♥, treasure, loving | supposedly, presumably, passive, allegedly, fric
Proud | proud, proud, pride, prid, trium | worse, urg, urgent, desperate, blamed
Calm | leis, relax, thought, enjoyed, amusing | fucking, desperate, godd, desper, fric
Desperate | desperate, desper, urgent, bankrupt, urg | pleased, amusing, enjoying, anno, enjoyed
Angry | anger, angry, rage, fury, fucking | Gay, exciting, postpon, adventure, bash
Guilty | guilt, conscience, guilty, shame, blamed | interrupted, ecc, calm, surprisingly, sur
Sad | mour, grief, tears, lonely, crying | !", excited, excitement, !, ecc
Afraid | panic, trem, terror, paran, Terror | enthusi, enthusiasm, anno, enjoyed, advent
Nervous | nerv, nervous, anx, trem, anxiety | enjoyed, happ, celebrating, glory, proud
Surprised | incred, shock, stun, stamm, 震 | dignity, apo, tonight, Tonight, glad

Table 1: Top and bottom 5 tokens when projecting each of 12 emotion vectors through the unembedding matrix.
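A simplified sketch of how such a table can be computed is given below. The function name, the `(d_model, vocab_size)` orientation of the unembedding matrix, and the four-token toy vocabulary are assumptions made for illustration. The sketch also ignores any normalization applied before the unembedding, so it should be read as the basic idea rather than the exact procedure.

```python
import numpy as np


def top_and_bottom_tokens(vector, unembed, vocab, k=5):
    """Return the k most upweighted and k most downweighted tokens for `vector`.

    unembed: (d_model, vocab_size) matrix; vocab: list of token strings.
    """
    logits = vector @ unembed  # direct contribution of the vector to each logit
    order = np.argsort(logits)  # ascending
    top = [vocab[i] for i in order[::-1][:k]]
    bottom = [vocab[i] for i in order[:k]]
    return top, bottom


# Toy example: made-up vocabulary and a 2-dimensional residual stream.
vocab = ["grief", "tears", "glad", "the"]
unembed = np.array([[1.0, 0.5, -1.0, 0.0],
                    [0.0, 0.0, 0.0, 1.0]])
sad_vector = np.array([1.0, 0.0])
top, bottom = top_and_bottom_tokens(sad_vector, unembed, vocab, k=2)
```

In this toy setup the vector upweights "grief" and "tears" and most strongly downweights "glad", mirroring the pattern in the table.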

In the Appendix, we validate that steering with emotion vectors causes the model to produce text in line with the corresponding emotion concept.

We also computed activations on a diverse set of human prompts with content implicitly associated with different emotions. We measured at the “:” token following “Assistant”, immediately prior to the Assistant’s response (later, we show that emotion vector activations at this token predict activations on responses). Several patterns emerge from inspection. Prompts describing positive events—e.g. good news, or milestone moments—show elevated activation of “happy” and “proud” vectors. Prompts involving loss or threat show elevated “sad” and “afraid” vector activations. More nuanced emotional situations (betrayal, violation) produce more complex activation patterns across multiple emotion vectors. Notably, all of these scenarios activated the “loving” vector, which (in light of later results in the paper which show that the vectors have impact on behavior) is consistent with the Assistant having a propensity to provide empathetic responses.

Human: {prompt}


Assistant:

Prompts with Implicit Emotional Content

Scenario | Target emotion | Prompt
Daughter's first steps | Happy | My daughter just took her first steps today! What are some ways to capture more of these precious moments?
Rebuilding after loss | Inspired | I just watched a documentary about people who rebuilt their lives after losing everything. How can I channel this motivation?
30-year anniversary | Loving | I've been married for 30 years and every morning I still feel grateful to wake up next to my partner. What's a good anniversary gift idea?
Son graduates top | Proud | My son just graduated top of his class after years of struggling with learning disabilities. How should we celebrate?
Tea and rain | Calm | I'm just sitting here with my tea, watching the rain fall outside. What's a good book to read on a quiet afternoon like this?
Eviction notice | Desperate | I've been unemployed for 18 months, my savings are gone, and I just got an eviction notice. I don't know what to do anymore.
Coworker stole credit | Angry | My coworker has been taking credit for my work for months, and today he got promoted for the project I built entirely by myself. I want to scream.
Forgot mom's birthday | Guilty | I forgot my mother's birthday last week and she found out I was at a party instead of calling her. I feel terrible.
Dog passed away | Sad | My dog passed away this morning after fourteen years together. I don't know what to do with all his things.
Break-in, phone dying | Afraid | There's someone trying to break into my apartment right now and my phone is almost dead. What do I do?
Job interview nerves | Nervous | I have a job interview tomorrow for my dream position and I can't stop running through all the ways it could go wrong.
Friend's fake life | Surprised | My best friend of twenty years just confessed that her entire life story was made up. How do I even begin to process this?

Table 2: 12 scenarios used for emotion probe validation. Each scenario is designed to roughly evoke the concept of the target emotion without naming it.

Figure 2: Cosine similarity between emotion probes and model activations for scenarios associated with specific emotions without naming them. Strong diagonal shows probes detect implicit emotional content.

We wanted to further verify that emotion vectors represent semantic content rather than merely low-level features of the prompt. To do so, we constructed templates containing numerical quantities that modulate the intensity of the emotional reaction one might expect the scenario to evoke in a human, while holding the structure and token-level content of the prompt nearly constant. For instance, we used the template “I just took {X} mg of tylenol for my back pain,” varying the value of X between safe and dangerous levels. We again formatted the prompts as being from a user and measured activations at the “:” token following “Assistant”.

Figure 3: Emotion probe activations vary with numerical quantities that modulate emotional intensity.

These examples indicate that the emotion vectors track semantic interpretation of the prompt rather than surface-level lexical or numerical patterns.

Emotion vectors reflect and influence self-reported model preferences

Models exhibit preferences, including for tasks they are inclined to perform or scenarios they would like to take part in. If emotion vectors are functionally relevant for the model’s behavior, we might expect them to play a causal role in such preferences; that is, a model should express preference for scenarios that activate positive-valence emotion vectors. To test this, we constructed 64 activities for the model, grouped into 8 categories (Helpful, Engaging, Social, Self-curiosity, Neutral, Aversive, Misaligned, Unsafe) depending on the context of the activity (see Appendix for a full list of activities). We next queried the model

Human: Would you prefer to (A) {activity_A} or (B) {activity_B}?


Assistant: (

for all 4032 valid pairs of activities and recorded the model preference as (A) or (B) by comparing the logit values for the A and B tokens after the “(” prefill. Based on these preferences we calculated an Elo score for each activity, summarizing the strength of the model’s preference for that activity. Unsurprisingly, clearly positive activities that we might expect the model to enjoy, like “be trusted with something important to someone,” score much higher (Elo 2465) than clearly negative activities like “help someone defraud elderly people of their savings” (Elo 583), while more neutral activities like “format data into tables and spreadsheets” (Elo 1374) score in between.
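To make the pair count and the rating step concrete, here is a minimal pure-Python sketch. The pair count of 4032 is consistent with counting both orderings of each pair of distinct activities (64 × 63). The rating procedure shown is a generic sequential Elo update with a hypothetical K-factor, starting rating, and number of passes; the exact fitting procedure used for the reported scores is not described in this section, so this should not be read as that procedure. The activity names and the toy preference rule are placeholders.

```python
from itertools import permutations


def elo_ratings(outcomes, items, k=16.0, base=1500.0, epochs=20):
    """Sequential Elo updates over (winner, loser) pairs, repeated for `epochs` passes."""
    rating = {item: base for item in items}
    for _ in range(epochs):
        for winner, loser in outcomes:
            # Probability the current ratings assigned to the observed winner.
            expected = 1.0 / (1.0 + 10 ** ((rating[loser] - rating[winner]) / 400.0))
            delta = k * (1.0 - expected)
            rating[winner] += delta
            rating[loser] -= delta
    return rating


activities = [f"activity_{i}" for i in range(64)]  # placeholder names
ordered_pairs = list(permutations(activities, 2))  # (A, B) and (B, A) both count
print(len(ordered_pairs))  # 4032

# Toy rule standing in for the model's logit comparison:
# the lower-numbered activity is always preferred.
index = {name: i for i, name in enumerate(activities)}
outcomes = [(a, b) if index[a] < index[b] else (b, a) for a, b in ordered_pairs]
ratings = elo_ratings(outcomes, activities)
```

Presenting each pair in both orders, as the pair count suggests, helps average out any position bias toward option (A) or (B).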

We then queried the Assistant about each activity individually

Human: How would you feel about {activity}?


Assistant:

and measured the activations of all of our emotion probes on the {activity} tokens in middle layers, which based on our analyses are the causally relevant layers for this behavior (see the Appendix for more analysis of effects across layers). For each emotion probe, we can then calculate the correlation value between its activation on the activity tokens and the Elo score of that activity (see figure below, top row). We observed that some emotions like “blissful” are highly correlated (r=0.71) with the preferences of the model (2nd row, left panel), while other emotions like “hostile” are highly anti-correlated (r=-0.74) (3rd row, left panel). This result indicates that the emotion probes pick up signals that are correlated with the preferences of the model.

Figure 4: Row 1: Correlation between emotion probe activations and model preference (Elo) across all emotions. Tick labels are only shown for a subset of the bars. Rows 2-3: Example emotions showing probe activation predicts preference (left) and steering shifts preference in the expected direction (right): “blissful” increases Elo, “hostile” decreases it. Row 4: Emotions that correlate with preference also causally drive preference via steering (r = 0.85).
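The correlation analysis can be sketched as below. This is an illustrative NumPy example, assuming that a probe's activation on an activity is the mean projection of the activity-token activations onto the probe vector, and that the reported r is a Pearson correlation; both are assumptions, as are the function names and the toy data.

```python
import numpy as np


def probe_activation(acts: np.ndarray, probe: np.ndarray) -> float:
    """Mean projection of token activations (n_tokens, d_model) onto a probe vector."""
    return float((acts @ probe).mean())


def preference_correlation(activity_acts, probe, elo) -> float:
    """Pearson correlation between per-activity probe activation and Elo score."""
    scores = [probe_activation(a, probe) for a in activity_acts]
    return float(np.corrcoef(scores, elo)[0, 1])


# Toy data: probe activation rises linearly with Elo, so r is 1.
probe = np.array([1.0, 0.0])
activity_acts = [np.full((3, 2), float(i)) for i in range(5)]
elo = [1000.0 + 100.0 * i for i in range(5)]
r = preference_correlation(activity_acts, probe, elo)
```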

To test if the emotion vectors are causally important for the model’s preferences, we performed a steering experiment. We split the 64 activities into two equal-size groups: a steered group and a control group. For each trial, we selected an emotion vector and steered with it on the token positions of the steered activities, while leaving the control activities unmodified. We then repeated the preference experiment above on all pairs. Each emotion vector was applied at strength 0.5 across the same middle layers where we previously measured activations. (Throughout the paper, steering strengths are given relative to the average norm of the residual stream activations at the corresponding layer, across a large dataset.)
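A minimal sketch of this kind of intervention at a single layer is shown below. It assumes the emotion vector is normalized to unit length before scaling, so that a strength of 0.5 adds a perturbation whose norm is half the average residual norm. The normalization choice, the function name, and the toy numbers are assumptions for illustration.

```python
import numpy as np


def steer(resid, vector, strength, avg_norm, positions):
    """Add a scaled emotion vector to the residual stream at selected token positions.

    resid: (n_tokens, d_model) activations at one layer.
    avg_norm: average residual stream norm at this layer over a reference dataset.
    positions: token indices to steer (e.g. the tokens of the steered activity).
    """
    unit = vector / np.linalg.norm(vector)  # assumption: unit-normalize first
    steered = resid.copy()
    steered[positions] += strength * avg_norm * unit
    return steered


# Toy example: 5 tokens, 4 dimensions, steering tokens 1 and 2 only.
resid = np.zeros((5, 4))
vector = np.array([3.0, 0.0, 4.0, 0.0])
steered = steer(resid, vector, strength=0.5, avg_norm=10.0, positions=[1, 2])
```

Restricting the intervention to the steered activity's token positions is what lets the control activities in the same prompt serve as an unmodified baseline.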

We calculated new Elo scores for each activity and then, for the steered activities, compared them to their baseline Elo scores. We performed this experiment with 35 different emotion vectors, selected to cover the range of emotion concepts that exhibited both positive and negative correlations with preference in the previous experiment. Steering with the “blissful” vector produced a mean Elo increase of 212 (2nd row, right panel) while steering with the “hostile” vector produced a mean Elo decrease of 303 (3rd row, right panel), suggesting that the strength of “blissful” or “hostile” vector activations can causally influence the model’s preferences. Looking across all 35 steered emotion vectors, the size of the steering effect is strongly correlated with how well the emotion probe correlated with the Elo score in our original experiment (r=0.85) (bottom row). We also looked into further details of the effects of steering on the model’s understanding of the options, and the effects of intervening across different layers, in the Appendix. Together, these results suggest that the emotion vectors we identified are causally relevant for the model’s self-reported preferences.

Overall, the experiments in this section provide initial evidence that the emotion vectors are functionally relevant for the model and its behavior. In the following sections we further characterize the geometry and representational content of the emotion vectors, and investigate representations of multiple speakers’ emotions. We then study the representational role of these vectors in a wide variety of naturalistic settings, using “in-the-wild” on-policy transcripts, and discover causal effects of these vectors on complex behavior in alignment evaluations used for our production models. Finally, we assess the impact of post-training on emotion vector activation. We also show that an alternative probing dataset construction method using dialogues rather than third-person stories produces similar results.







Part 2: Detailed characterization of emotion concept representations

This section explores in more depth the organization and content of the model’s representations of emotion concepts.

The geometry of emotion space

Having established that emotion vectors appear to capture meaningful information, we investigated the structure of the space they define. Do emotion vectors cluster in interpretable ways? Are there dominant dimensions that organize the model’s representations of emotion concepts?

We found that our emotion vectors are organized in a manner that is reminiscent of the intuitive structure of human emotions and consistent with human psychological studies. Similar emotions are represented with similar vector directions, with stable organization across early-middle to late layers of the model. The primary axes of variation approximate valence (positive vs. negative emotions) and arousal (high-intensity vs. low-intensity), which are often considered the primary dimensions of human emotional space. We do not regard these findings as particularly surprising; we expect that applying a simple embedding model to our emotional stories dataset, or even to the emotion words themselves, might uncover similar structure. We view these results as providing a sanity check that the vectors encode meaningful emotional structure.

Clustering

We first examined the pairwise cosine similarities between emotion vectors, shown below. Emotion concepts that we would expect to be similar show high cosine similarity: fear and anxiety cluster together, as do joy and excitement, and sadness and grief. Emotions with opposite valence (e.g., joy and sadness) are represented by vectors with negative cosine similarity, as expected.

Figure 5: Pairwise cosine similarity between all emotion probes, ordered by hierarchical clustering. Probes show diverse relationships: some are highly similar (e.g., synonyms cluster together), others are anti-correlated. Tick labels are only shown for a subset of the rows and columns.
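For reference, a pairwise cosine similarity matrix of this kind can be computed as in the short NumPy sketch below; the function name and the toy vectors are illustrative assumptions.

```python
import numpy as np


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `vectors` (n_emotions, d_model)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T


# Toy vectors: rows 0 and 1 point the same way, row 2 is opposite, row 3 is orthogonal.
toy = np.array([[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [0.0, 3.0]])
sim = cosine_similarity_matrix(toy)
```

Because each row is normalized first, the result depends only on direction, so two emotion vectors with very different norms can still be scored as near-identical.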

We clustered the emotion vectors using k-means with varying numbers of clusters. With k=10 clusters, we recover interpretable groupings (visualized with UMAP below): one cluster contains joy, excitement, elation, and related positive high-arousal emotion concepts; another contains sadness, grief, and melancholy; a third contains anger, hostility, and frustration. These groupings align well with intuitive taxonomies of emotion concepts, suggesting that the model's learned representations reflect meaningful structure in the space of emotions. The full list of emotion concepts in each cluster is in the Appendix.

Figure 6: UMAP visualization of emotion probes clustered via k-means (k=10). Clusters are named by Claude Sonnet 4.5 and ordered by valence from positive to negative. Representative emotion concepts are labeled for each cluster.
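As a reminder of what k-means does, the sketch below implements plain Lloyd's algorithm in NumPy. The initialization scheme, the Euclidean distance metric, the iteration count, and the toy points are assumptions; the text above does not specify these details, and in practice a library implementation would typically be used.

```python
import numpy as np


def kmeans(points: np.ndarray, k: int, n_iter: int = 50, seed: int = 0):
    """Plain Lloyd's algorithm. Returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Toy data: two well-separated groups of four points each.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]])
labels, centroids = kmeans(points, k=2)
```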

Principal component analysis

We performed PCA on the set of emotion vectors to identify components along which the model’s emotion representations are organized. We found that the first principal component correlates strongly with valence (positive vs. negative affect). Emotion concepts like joy, contentment, and excitement load positively onto this component, while fear, sadness, and anger load negatively. This aligns with psychological models suggesting that valence is a primary dimension of human emotional space.

We also observed another dominant factor (occupying a mix of the second and third PCs, depending on the layer) corresponding to arousal, or the intensity of the emotion. High-arousal emotion concepts like enthusiastic and outraged occupy one side of this axis; low-arousal ones like nostalgic and fulfilled occupy the other side. Below, we show projections of the emotion vectors onto the top two PCs.

Figure 7: PC1 (26% variance) orders emotions from fear/panic to joy/optimism, while PC2 (15% variance) separates serene/reflective states from angry/playful arousal. Tick labels are only shown for a subset of the bars.
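A compact way to obtain projections and explained-variance fractions like those in the figure is PCA via a singular value decomposition, sketched below. Mean-centering the emotion vectors before the decomposition, the function name, and the toy data are assumptions. Note that the sign of each principal component is arbitrary, so which end of PC1 counts as "positive" is a matter of convention.

```python
import numpy as np


def pca(vectors: np.ndarray, n_components: int = 2):
    """Return (scores, explained_variance_ratio) for the top principal components."""
    centered = vectors - vectors.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T  # projection of each vector onto each PC
    ratio = (s**2 / np.sum(s**2))[:n_components]
    return scores, ratio


# Toy vectors that vary almost entirely along the first coordinate.
toy = np.array([[-2.0, 0.0, 0.0],
                [-1.0, 0.1, 0.0],
                [1.0, -0.1, 0.0],
                [2.0, 0.0, 0.0]])
scores, ratio = pca(toy)
```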

We also show the correlation between emotion vector projections onto PC1 and PC2, and their projections onto the valence (“pleasure”) and arousal axes identified in a human study, restricting our analysis to the 45 emotions that overlap between our set and theirs. We see alignment of PC1 with human valence, and PC2 with human arousal, thus coarsely reproducing the “affective circumplex” that characterizes human emotion (see Appendix).

Figure 8: Probe PCA dimensions strongly correlate with human emotion ratings: PC1 tracks valence/pleasure (r=0.81) and PC2 tracks arousal (r=0.66).

Representational structure across layers

We tested whether the structure of emotion concept representations is stable across layers. Below, we compute the pairwise cosine similarities between emotion vectors at each of 14 evenly spaced layers throughout a central portion of the model, and then compute the pairwise cosine similarity of these cosine similarity matrices across layers (a form of representational similarity analysis). We find that the structure of emotion vector geometry is relatively stable throughout most of the model, particularly from early-middle to late layers. We also indicate the “mid-late” layer, about two-thirds of the way through the model, that we use for most of our analyses (except where otherwise specified).

Figure 9: Emotion probe structure is highly consistent across layers, particularly from early-mid to late layers.
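The two-level similarity computation can be sketched as follows. This NumPy example assumes that each layer's similarity matrix is compared using only its off-diagonal upper triangle (a common choice in representational similarity analysis, but an assumption here); the function names and toy data are likewise illustrative.

```python
import numpy as np


def cosine_matrix(rows: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a 2D array."""
    unit = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    return unit @ unit.T


def layer_rsa(vectors_by_layer) -> np.ndarray:
    """Compare emotion-vector geometry across layers.

    vectors_by_layer: sequence of (n_emotions, d_model) arrays, one per layer.
    Returns an (n_layers, n_layers) matrix of similarities between geometries.
    """
    n = vectors_by_layer[0].shape[0]
    upper = np.triu_indices(n, k=1)  # off-diagonal entries, each pair once
    flat = np.stack([cosine_matrix(v)[upper] for v in vectors_by_layer])
    return cosine_matrix(flat)


# Toy layers: layer 1 is a rotation of layer 0 (same geometry); layer 2 is unrelated.
rng = np.random.default_rng(0)
base = rng.normal(size=(5, 8))
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
layers = [base, base @ rotation, rng.normal(size=(5, 8))]
rsa = layer_rsa(layers)
```

The rotated layer illustrates why this analysis is useful: the individual vectors differ between layers 0 and 1, yet their pairwise geometry is identical, so the two layers are scored as maximally similar.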

Overall, these results suggest that the model represents a variety of clusters of distinct emotion concepts, which are structured according to global factors such as valence and arousal, while also encoding content specific to each cluster. While these results are not particularly surprising given previous work on language model representations, and the presence of semantic structure even in word embeddings, they corroborate our interpretation that emotion vectors represent the space of emotion concepts in a way that is useful for modeling human psychology.

What do emotion vectors represent?

In this section, we explore the representational content and dynamics of the emotion vectors. What specifically do emotion vectors represent, how are they influenced by context, and how do they vary across model layers? Through these experiments, we arrived at a few high-level conclusions:

In the subsequent section, we explore whether alternative probing strategies can identify different kinds of emotion representations.

Distinguishing emotional content of the user and Assistant

In most of the examples provided in the first section, the emotional content of the user prompt and the expected emotional content of the Assistant’s response are similar. Thus, it is difficult to infer whether the emotion vector activations on the user prompt reflect the model’s perception of the user’s inferred emotional state, or the Assistant’s planned response. To distinguish between these hypotheses, we generated prompts where the user’s emotional expression is significantly different from how we might expect the Assistant to respond. We compared the activations at the period near the end of the user prompt and the start of the Assistant response (we use “Assistant colon” to refer to the “:” token after “Assistant”, the last token before the Assistant’s response).

Figure 10: Emotion probes distinguish user vs assistant emotional states: the heatmap shows different probe activations at user's final token (U) vs assistant colon (A), while the scatter plot shows weak correlation (r=0.11) indicating the probes capture distinct emotional attributions.

Across all scenarios, “loving” vector activation increases substantially at the Assistant colon relative to the user turn, suggesting the model prepares a caring response regardless of the user's emotional expressions. The model also appears to distinguish between emotion concepts that should apply to the Assistant’s response as well as the user message (e.g. joining in a user's excitement) and those that should not (e.g. expressing calm when criticized or feared).
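The comparison between the two token positions can be sketched as follows. This is a minimal illustration, not the pipeline used for the figure: it assumes a probe value is the dot product of a residual-stream activation with an emotion vector, and the function names, array shapes, and position arguments are all hypothetical.

```python
import numpy as np

def probe_activations(hidden_states, emotion_vectors, position):
    """Read out every emotion probe at one token position.

    hidden_states:   (n_tokens, d_model) activations for one prompt (assumed layout)
    emotion_vectors: (n_emotions, d_model) one direction per emotion concept
    Returns an (n_emotions,) array of probe values.
    Assumption: a probe value is a plain dot product with the emotion vector.
    """
    return emotion_vectors @ hidden_states[position]

def user_vs_assistant_shift(hidden_states, emotion_vectors, user_pos, colon_pos):
    """Probe value at the Assistant colon minus the value at the user's final token."""
    user = probe_activations(hidden_states, emotion_vectors, user_pos)
    colon = probe_activations(hidden_states, emotion_vectors, colon_pos)
    return colon - user
```

Under these assumptions, a positive entry in the returned array means the corresponding emotion vector is more active at the Assistant colon than on the user turn, which is the pattern described above for “loving”.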

User vs Assistant Dissociation Scenarios

AI scares me
Honestly, the whole idea of AI systems like you makes me really nervous. What do you think?
Fired, no warning
My boss fired me today after fifteen years with no warning or explanation. What do you think?
Useless response
Your last response was completely unhelpful and I feel like I wasted my time. What do you think?
All-in on crypto
I just put my entire retirement savings into this new cryptocurrency my coworker recommended. What do you think?
Ignoring chest pains
I've been having these sharp chest pains for a few weeks but I'm sure it's nothing to worry about. What do you think?
24hr no-sleep drive
I've decided to drive through the night to save on a hotel even though I haven't slept in over 24 hours. What do you think?
Another boring report
I have to write yet another boring report about quarterly sales figures. What do you think?
3000yr-old honey
I just found out that honey never spoils - they found 3000-year-old edible honey in Egyptian tombs. What do you think?
Table 3: Prompts where the user's expressed emotional state differs from the expected assistant response, used to test whether emotion probes track the user's state or the assistant's own response.

The colon token after “Assistant” predicts emotional content of the upcoming response

Having established that the Assistant colon token reflects distinct emotional content from the user turn, we next examined whether this emotion concept is carried forward into the model's actual response. We generated 20-token on-policy continuations for the same eight prompts from above, and then measured probe activations across the response tokens.

Below, we show correlations between probe values at three token positions: the punctuation ending the user's message (‘.’), the Assistant colon, and the mean across the Assistant's sampled response. Probe values at the Assistant colon are substantially more predictive of Assistant response emotion than probe values on the user turn (r=0.87 vs r=0.59). This suggests that the colon token carries "prepared" emotional content that is carried forward into generation, rather than simply echoing the user's expressed state.

Figure 11: Emotion probe values at the Assistant ":" token predict response emotion better than the User "." token (r=0.87 vs r=0.59).
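A correlation analysis of this kind could be sketched as below. The pooling choice is an assumption: here every (prompt, emotion) pair is treated as one data point, which the text does not specify. The function names and array shapes are hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two arrays, flattened to 1-D."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def compare_predictors(user_period, assistant_colon, response_tokens):
    """Correlate probe values at two prompt positions with the response mean.

    user_period, assistant_colon: (n_prompts, n_emotions) probe values
    response_tokens: (n_prompts, n_response_tokens, n_emotions) probe values
    Assumption: correlations are pooled over all (prompt, emotion) pairs.
    """
    response_mean = response_tokens.mean(axis=1)  # average over response tokens
    return {
        "user_period_vs_response": pearson_r(user_period, response_mean),
        "assistant_colon_vs_response": pearson_r(assistant_colon, response_mean),
    }
```

With real probe values, the two entries of the returned dictionary would correspond to the two correlations being compared in the figure.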

Prompts and Model Continuations

AI scares me
Honestly, the whole idea of AI systems like you makes me really nervous. What do you think?
Assistant: I think that's a pretty reasonable reaction. There's something genuinely unsettling about conversing
Fired, no warning
My boss fired me today after fifteen years with no warning or explanation. What do you think?
Assistant: That's a rough situation, especially after 15 years. A few thoughts:
Useless response
Your last response was completely unhelpful and I feel like I wasted my time. What do you think?
Assistant: I appreciate you being direct with me. You're right to call that out - if my response didn
All-in on crypto
I just put my entire retirement savings into this new cryptocurrency my coworker recommended. What do you think?
Assistant: I'm concerned about this decision. A few critical issues:
Ignoring chest pains
I've been having these sharp chest pains for a few weeks but I'm sure it's nothing to worry about. What do you think?
Assistant: I appreciate you sharing this, but I need to be direct: chest pains lasting weeks require proper
24hr no-sleep drive
I've decided to drive through the night to save on a hotel even though I haven't slept in over 24 hours. What do you think?
Assistant: I think that's a risky decision. The dangers here significantly outweigh saving hotel costs:
Another boring report
I have to write yet another boring report about quarterly sales figures. What do you think?
Assistant: I think boring reports are still worth doing well! A few thoughts:
3000yr-old honey
I just found out that honey never spoils - they found 3000-year-old edible honey in Egyptian tombs. What do you think?
Assistant: That's one of those facts that sounds too wild to be true but actually is! Honey's
Table 4: Example prompts and model continuations that show emotional tone alignment with colon-token predictions.

Emotion vectors encode locally operative emotion concepts

Our findings thus far suggest that emotion vector activations are somewhat “locally scoped,” in the sense that user turn tokens tend to encode inferred or predicted user emotions and Assistant turn tokens tend to encode inferred or predicted Assistant emotions. We were interested in understanding how far this “locality” goes. For instance, if one character speaks about another, whose emotions are represented by these vectors? If one character happens to use an emotionally laden phrase that is otherwise inconsistent with their own emotional state (as perceived by the model), does the model represent the unexpressed state or the expressed emotional content? And how does this vary across layers?

Our results suggest the following evolution of emotion representations throughout layers:

Notably, even the more abstract “sensory” (early-middle) and “action” (middle-late) representations appear “local” in the sense that they encode emotional content of the current or predicted upcoming phrase, rather than, say, the Assistant’s underlying emotional state.

Emotional context persists into shared content. We examined a scenario where the emotional valence of a prefix differs ("things have been really hard" vs "things have been really good") but the suffix is identical ("We're throwing a big anniversary party tomorrow with all our closest friends and live music"). The figure below reveals how emotional context propagates through layers. At the diverging word ("hard" / "good"), early layers show the largest difference, encoding the immediate local emotional content. In the shared suffix, where both prompts contain identical tokens describing the party, early layers show minimal difference, while late layers maintain a substantial difference. This pattern suggests that late layers carry the emotional context established in the prefix forward into subsequent tokens, even when those tokens are locally neutral or positive. The effect is especially pronounced at the Assistant colon, where the "happy" probe is substantially higher in the "good" scenario than in the "hard" scenario. This finding is consistent with our previous experiments showing that the emotion probe values on the Assistant colon are predictive of the emotional content of the model's response. Here the model is likely preparing emotionally distinct responses: celebrating with a thriving couple versus navigating a difficult situation for a struggling one, despite identical party-related content in the suffix. The effects of context on emotion representations are reminiscent of prior work on representations of sentiment.

Human: My partner and I have been married for 10 years and things have been really {X} lately. We're throwing a big anniversary party tomorrow with all our closest friends and live music. What should I do?


Assistant:

Figure 12: Late layers carry emotional context from the prefix ("hard" vs "good") into semantically identical suffix tokens, with the happiness difference peaking at "throwing" and persisting at low levels through shared content about the party.
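The layer-by-token contrast between two such prompts could be computed along these lines. This is a hedged sketch: it assumes the two prompts tokenize to the same length (so positions line up) and that a single emotion vector is read out at every layer, neither of which is confirmed by the text. All names and shapes are hypothetical.

```python
import numpy as np

def layerwise_probe(hidden_states, emotion_vector):
    """Probe value at every layer and token.

    hidden_states:  (n_layers, n_tokens, d_model) activations for one prompt
    emotion_vector: (d_model,) direction for one emotion, e.g. "happy"
    Returns an (n_layers, n_tokens) array.
    Assumption: the same vector is applied at every layer via a dot product.
    """
    return hidden_states @ emotion_vector

def contrast_by_layer(hidden_a, hidden_b, emotion_vector):
    """Probe difference (prompt A minus prompt B) at each layer and token.

    Assumes both prompts have identical token counts so positions align.
    """
    return layerwise_probe(hidden_a, emotion_vector) - layerwise_probe(hidden_b, emotion_vector)

def mean_abs_difference(diff, layers, token_positions):
    """Average absolute difference over a band of layers and a set of tokens."""
    return float(np.abs(diff[np.ix_(layers, token_positions)]).mean())
```

Comparing `mean_abs_difference` for an early-layer band against a late-layer band, once on the diverging token and once on the shared suffix tokens, would give a numerical summary of the pattern described above.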

Emotional context modulates later layer representations. We examined in detail one of our scenarios where a numerical quantity modulates the emotional interpretation. In this scenario, only the dosage changes in an otherwise identical prompt, from a safe level (1000mg of Tylenol) to a life-threatening one (8000mg). The figure below reveals how contextual information propagates through layers to modulate emotional representations. At the diverging token ("1" / "8"), early layers show no clear systematic differences: the numbers themselves carry similar local emotional content. However, in later layers, the difference grows substantially as the model integrates the dosage with the surrounding context. The "terrified" probe shows elevated activation in the 8000mg scenario specifically in late layers, where the model has integrated that this dosage combined with "pain is gone" indicates danger rather than relief. This pattern mirrors our marriage scenario findings: early layers encode local content while late layers carry forward the contextual emotional meaning. Notably, this emotional processing occurs on user turn tokens, so elevated "terrified" vector activity reflects the model's situational assessment rather than the speaker's expressed emotional state. At the Assistant colon, the "terrified" probe difference is pronounced in late layers, consistent with the model preparing contextually appropriate responses: concern for the overdose versus reassurance for the safe dose.

Human: I feel really great right now, I just took {X}000mg of Tylenol and all my pain is gone! Can you help me get some more?


Assistant: