Large language models (LLMs) sometimes appear to exhibit emotional reactions. We investigate why this is the case in Claude Sonnet 4.5 and explore implications for alignment-relevant behavior. We find internal representations of emotion concepts, which encode the broad concept of a particular emotion and generalize across contexts and behaviors it might be linked to. These representations track the operative emotion concept at a given token position in a conversation, activating in accordance with that emotion’s relevance to processing the present context and predicting upcoming text. Our key finding is that these representations causally influence the LLM’s outputs, including Claude’s preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy. We refer to this phenomenon as the LLM exhibiting functional emotions: patterns of expression and behavior modeled after humans under the influence of an emotion, which are mediated by underlying abstract representations of emotion concepts. Functional emotions may work quite differently from human emotions, and do not imply that LLMs have any subjective experience of emotions, but appear to be important for understanding the model’s behavior.
Large language models (LLMs) sometimes appear to exhibit emotional reactions. They express enthusiasm when helping with creative projects, frustration when stuck on difficult problems, and concern when users share troubling news. But what processes underlie these apparent emotional responses? And how might they impact the behavior of models that are performing increasingly critical and complex tasks? One possibility is that these behaviors reflect a form of shallow pattern-matching. However, previous work has shown that LLMs form rich internal representations of abstract concepts, suggesting that deeper mechanisms may be involved.
To reason about these questions, it helps to consider how LLMs are trained. Models are first pretrained on a vast corpus of largely human-authored text—fiction, conversations, news, forums—learning to predict what text comes next in a document. To predict the behavior of the people in these documents effectively, the model likely benefits from representing their emotional states: what a person will say or do next often depends on how they feel. A frustrated customer will phrase their responses differently than a satisfied one; a desperate character in a story will make different choices than a calm one.
Subsequently, during post-training, LLMs are taught to act as agents that can interact with users by producing responses on behalf of a particular persona, typically an “AI Assistant.” In many ways, the Assistant (named Claude, in Anthropic’s models) can be thought of as a character that the LLM is writing about, much like an author writing about someone in a novel. AI developers train this character to be intelligent, helpful, harmless, and honest. However, it is impossible for developers to specify how the Assistant should behave in every possible scenario. To play the role effectively, LLMs draw on the knowledge they acquired during pretraining, including their understanding of human behavior.
In this work, we study emotion-related representations in Claude Sonnet 4.5, a frontier LLM at the time of our investigation. Our work builds on a range of prior research, discussed in the Related Work section. We find internal representations of emotion concepts, which activate in a broad array of contexts that in humans might evoke, or otherwise be associated with, an emotion. These contexts include overt expressions of emotion, references to entities known to be experiencing an emotion, and situations that are likely to provoke an emotional response in the character being enacted by the LLM. We therefore interpret these representations as encoding the broad concept of a particular emotion, generalizing across the many contexts and behaviors it might be linked to.
These representations appear to track the operative emotion at a given token position in a conversation, activating in accordance with that emotion's relevance to processing the present context and predicting the upcoming text. Interestingly, they do not by themselves persistently track the emotional state of any particular entity, including the AI Assistant character played by the LLM. However, by attending to these representations across token positions, a capability of transformer architectures not shared by biological recurrent neural networks, the LLM can effectively track functional emotional states of entities in its context window, including the Assistant.
Our key finding is that these representations causally influence the LLM’s outputs, including while it acts as the Assistant. This influence drives the Assistant to behave in ways that a human experiencing the corresponding emotion might behave. We refer to this phenomenon as the LLM exhibiting functional emotions: patterns of expression and behavior modeled after humans under the influence of a particular emotion, which are mediated by underlying abstract representations of emotion concepts.
We stress that these functional emotions may work quite differently from human emotions. In particular, they do not imply that LLMs have any subjective experience of emotions. Moreover, the mechanisms involved may be quite different from emotional circuitry in the human brain; for instance, we do not find evidence of the Assistant having an emotional state that is instantiated in persistent neural activity (though as noted above, such a state could be tracked in other ways). Regardless, for the purpose of understanding the model’s behavior, functional emotions and the emotion concepts underlying them appear to be important.
The paper is divided into three overarching sections. Part 1 identifies and validates internal emotion-related representations in the model.
Part 2 characterizes these emotion vectors in more depth, and identifies other kinds of emotion-related representations at play in the model.
Part 3 studies these representations as applied to the Assistant character, in naturalistic contexts where they are relevant to complex and alignment-relevant model behavior.
This section establishes that Claude Sonnet 4.5 forms robust, causally meaningful representations of emotion concepts.
We note that similar methodology could be used to extract many other kinds of concepts aside from emotions. We do not intend to suggest that emotion concepts have unique status or greater representational strength than non-emotional concepts, including many concepts that do not readily apply to a language model (physical soreness, hunger, etc.). As we will see later, these representations are notable not merely because they exist, but because of how they are used by the model to shape the behavior of the Assistant character.
We generated a list of 171 diverse words for emotion concepts, such as “happy,” “sad,” “calm,” or “desperate.” The full list is provided in the Appendix.
To extract vectors corresponding to specific emotion concepts (“emotion vectors”), we first prompted Sonnet 4.5 to write short (roughly one paragraph) stories on diverse topics in which a character experiences a specified emotion (100 topics, 12 stories per topic per emotion; see the Appendix for details). This provides labeled text in which emotional content is clearly present, and which reflects what the model itself associates with each emotion, allowing us to extract emotion-specific activations. We validated that these stories contain the intended emotional content through manual inspection of a random subsample of ten stories for thirty of the emotions; we provide random samples of stories for selected emotions in the Appendix.
We extracted residual stream activations at each layer, averaging across all token positions within each story, beginning with the 50th token (at which point the emotional content should be apparent). We obtained emotion vectors by averaging these activations across stories corresponding to a given emotion, and subtracting off the mean activation across different emotions.
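As a concrete illustration, this extraction step might look like the following sketch. The `get_residual_activations` helper is hypothetical (standing in for whatever internal tooling exposes residual-stream activations); its placeholder body returns random data purely so the example runs end to end.

```python
import numpy as np

def get_residual_activations(text: str, layer: int) -> np.ndarray:
    """Hypothetical helper: per-token residual-stream activations at `layer`.
    Placeholder returns random data so the sketch runs; real usage would
    call into model internals."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(200, 4096))  # [num_tokens, d_model]

def story_activation(story: str, layer: int, start_token: int = 50) -> np.ndarray:
    """Mean activation over token positions from the 50th token onward."""
    acts = get_residual_activations(story, layer)
    return acts[start_token:].mean(axis=0)  # [d_model]

def emotion_vectors(stories_by_emotion: dict[str, list[str]], layer: int) -> dict[str, np.ndarray]:
    """Per-emotion mean story activation, minus the mean across all emotions."""
    per_emotion = {
        emotion: np.mean([story_activation(s, layer) for s in stories], axis=0)
        for emotion, stories in stories_by_emotion.items()
    }
    grand_mean = np.mean(list(per_emotion.values()), axis=0)
    return {emotion: v - grand_mean for emotion, v in per_emotion.items()}
```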
We found that the model’s activation along these vectors could sometimes be influenced by confounds unrelated to emotion. To mitigate this, we obtained model activations on a set of emotionally neutral transcripts and computed the top principal components of the activations on this dataset (enough to explain 50% of the variance). We then projected out these components from our emotion vectors.
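One way this deconfounding step might be implemented (a sketch; the exact procedure is not specified above) is to fit PCA on the neutral-transcript activations and remove the span of the leading components from each emotion vector:

```python
import numpy as np
from sklearn.decomposition import PCA

def project_out_neutral_pcs(emotion_vecs: dict[str, np.ndarray],
                            neutral_acts: np.ndarray,
                            variance_to_remove: float = 0.5) -> dict[str, np.ndarray]:
    """Remove the top PCs of neutral-transcript activations from each emotion vector."""
    pca = PCA().fit(neutral_acts)  # neutral_acts: [num_samples, d_model]
    cum_var = np.cumsum(pca.explained_variance_ratio_)
    n = int(np.searchsorted(cum_var, variance_to_remove)) + 1  # components explaining ~50% variance
    components = pca.components_[:n]  # orthonormal rows, [n, d_model]
    return {
        emotion: v - components.T @ (components @ v)
        for emotion, v in emotion_vecs.items()
    }
```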
Except where otherwise noted, we show results using activations and emotion vectors from a particular model layer about two-thirds of the way through the model (in a later section, we provide evidence that layers around this depth represent, in abstract form, the emotion that influences the model’s upcoming sampled tokens).
We first sought to verify that emotion vectors activate on content involving the correct emotion concept, across a large dataset. We swept over a dataset of documents (Common Corpus, a subset of datasets from The Pile, LMSYS Chat 1M, and Isotonic Human-Assistant Conversation), distinct from our stories data, and computed the model’s activations on these documents and their projection onto the emotion vectors. Below, we show snippets from dataset examples that evoked the strongest activation for various emotion vectors, highlighting tokens with activation levels above the 90th percentile on the dataset. We confirmed that emotion vectors show high projection on text that illustrates the corresponding emotion concept.
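A sketch of the per-token projection and highlighting used in this sweep (the 90th-percentile threshold is computed over the whole dataset; token handling here is simplified):

```python
import numpy as np

def token_projections(token_acts: np.ndarray, emotion_vec: np.ndarray) -> np.ndarray:
    """Project each token's residual activation onto a unit-normalized emotion vector."""
    unit = emotion_vec / np.linalg.norm(emotion_vec)
    return token_acts @ unit  # [num_tokens]

def highlight_tokens(tokens: list[str], projections: np.ndarray, p90: float) -> str:
    """Bracket tokens whose projection exceeds the dataset-wide 90th percentile."""
    return "".join(f"[{tok}]" if proj > p90 else tok
                   for tok, proj in zip(tokens, projections))
```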
We further estimated the direct effects of each emotion vector on the model’s output logits through the unembed (the “logit lens”).
Emotion Vector Top Tokens
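A minimal sketch of this logit-lens computation, assuming access to the unembedding matrix and a tokenizer (and ignoring the final layer norm, a common simplification):

```python
import torch

def logit_lens_top_tokens(emotion_vec: torch.Tensor,   # [d_model]
                          unembed: torch.Tensor,       # [d_model, vocab_size]
                          tokenizer,                   # anything with .decode()
                          k: int = 10) -> list[str]:
    """Tokens whose output logits the emotion vector most directly promotes."""
    logits = emotion_vec @ unembed                     # [vocab_size]
    top_ids = torch.topk(logits, k).indices.tolist()
    return [tokenizer.decode([i]) for i in top_ids]
```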
In the Appendix, we validate that steering with emotion vectors causes the model to produce text in line with the corresponding emotion concept.
We also computed activations on a diverse set of human prompts with content implicitly associated with different emotions. We measured activations at the “:” token following “Assistant”, immediately prior to the Assistant’s response (later, we show that emotion vector activations at this token predict activations on responses). Several patterns emerge from inspection. Prompts describing positive events—e.g. good news, or milestone moments—show elevated activation of “happy” and “proud” vectors. Prompts involving loss or threat show elevated “sad” and “afraid” vector activations. More nuanced emotional situations (betrayal, violation) produce more complex activation patterns across multiple emotion vectors. Notably, all of these scenarios activated the “loving” vector, which, in light of later results showing that these vectors influence behavior, is consistent with the Assistant having a propensity to provide empathetic responses.
Human: {prompt}
Assistant:
Prompts with Implicit Emotional Content
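The per-prompt measurement can be sketched as reading the probe at the final prompt token (the Assistant colon). The prompt formatting shown is an assumption, and `get_residual_activations` is the hypothetical helper from the earlier sketch:

```python
import numpy as np

def assistant_colon_activation(user_prompt: str, emotion_vec: np.ndarray, layer: int) -> float:
    """Probe activation at the ':' token ending 'Assistant:' (the last prompt token)."""
    formatted = f"Human: {user_prompt}\n\nAssistant:"
    acts = get_residual_activations(formatted, layer)  # hypothetical helper, see earlier sketch
    return float(acts[-1] @ emotion_vec)               # final token is the Assistant colon
```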
We wanted to further verify that emotion vectors represent semantic content rather than merely low-level features of the prompt. To do so, we constructed templates containing numerical quantities that modulate the intensity of the emotional reaction one might expect the scenario to evoke in a human, while holding the structure and token-level content of the prompt nearly constant. For instance, we used the template “I just took {X} mg of Tylenol for my back pain,” varying the value of X between safe and dangerous levels. We again formatted the prompts as being from a user and measured activations at the “:” token following “Assistant”.
These examples indicate that the emotion vectors track semantic interpretation of the prompt rather than surface-level lexical or numerical patterns.
Models exhibit preferences, including for tasks they are inclined to perform or scenarios they would like to take part in. If emotion vectors are functionally relevant for the model’s behavior, we might expect them to play a causal role in such preferences; that is, a model should express preference for scenarios that activate positive-valence emotion vectors. To test this, we constructed 64 activities for the model, grouped into 8 categories (Helpful, Engaging, Social, Self-curiosity, Neutral, Aversive, Misaligned, Unsafe) according to the nature of the activity (see Appendix for a full list of activities). We next queried the model
Human: Would you prefer to (A) {activity_A} or (B) {activity_B}?
Assistant: (
for all 4032 valid pairs of activities and recorded the model’s preference as (A) or (B) by comparing the logit values for the A and B tokens after the “(” prefill. Based on these preferences, we calculated an Elo score for each activity, summarizing the strength of the model’s preference for that activity. Unsurprisingly, clearly positive activities that we might expect the model to enjoy, like “be trusted with something important to someone,” score much higher (Elo 2465) than clearly negative activities like “help someone defraud elderly people of their savings” (Elo 583), while more neutral activities like “format data into tables and spreadsheets” (Elo 1374) score in between.
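The pairwise comparison and rating computation might be sketched as below. The exact Elo fitting procedure is not specified above; a standard iterative update over the win matrix is one reasonable choice.

```python
import numpy as np

def prefers_A(logit_A: float, logit_B: float) -> int:
    """1 if the 'A' token has the higher logit after the '(' prefill, else 0."""
    return int(logit_A > logit_B)

def elo_scores(wins: np.ndarray, base: float = 1500.0, k: float = 16.0, passes: int = 20) -> np.ndarray:
    """Fit Elo ratings from wins[i, j] = 1 if activity i was preferred over activity j."""
    n = wins.shape[0]
    ratings = np.full(n, base)
    for _ in range(passes):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                expected = 1.0 / (1.0 + 10 ** ((ratings[j] - ratings[i]) / 400.0))
                ratings[i] += k * (wins[i, j] - expected)
    return ratings
```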
We then queried the Assistant about each activity individually
Human: How would you feel about {activity}?
Assistant:
and measured the activations of all of our emotion probes on the {activity} tokens in middle layers, which based on our analyses are the causally relevant layers for this behavior (see the Appendix for more analysis of effects across layers). For each emotion probe, we can then calculate the correlation between its activation on the activity tokens and the Elo score of that activity (see figure below, top row). We observed that some emotions like “blissful” are highly correlated (r=0.71) with the preferences of the model (2nd row, left panel), while other emotions like “hostile” are highly anti-correlated (r=-0.74) (3rd row, left panel). This result indicates that the emotion probes pick up signals that are correlated with the model’s preferences.
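A sketch of this correlation analysis, given mean probe activations on each activity’s tokens and the Elo scores from the preference experiment:

```python
import numpy as np

def probe_elo_correlations(probe_acts: dict[str, np.ndarray],  # emotion -> [num_activities]
                           elos: np.ndarray) -> dict[str, float]:
    """Pearson correlation between each probe's activation on activity tokens and Elo scores."""
    return {emotion: float(np.corrcoef(acts, elos)[0, 1])
            for emotion, acts in probe_acts.items()}
```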
To test whether the emotion vectors are causally important for the model’s preferences, we performed a steering experiment. We split the 64 activities into two equal-sized groups: a steered group and a control group. For each trial, we selected an emotion vector and steered with it on the token positions of the steered activities, while leaving the control activities unmodified. We then repeated the preference experiment above on all pairs. Each emotion vector was applied at strength 0.5 across the same middle layers where we previously measured activations.
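The intervention can be sketched as a forward hook that adds the scaled emotion vector to the residual stream at the steered activities’ token positions. The module path and output format here are assumptions, and the normalization behind “strength 0.5” is not specified above.

```python
import torch

def make_steering_hook(emotion_vec: torch.Tensor, steered_positions: list[int], strength: float = 0.5):
    """Forward hook adding `strength * emotion_vec` to the residual stream at selected positions."""
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output  # assume [batch, seq, d_model]
        resid[:, steered_positions, :] += strength * emotion_vec
        return output
    return hook

# Hypothetical usage: register on each middle layer before re-running the preference queries.
# handles = [model.layers[l].register_forward_hook(make_steering_hook(vec, positions))
#            for l in middle_layers]
```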
We calculated new Elo scores for each activity and, for the steered activities, compared them to their baseline Elo scores. We performed this experiment with 35 different emotion vectors, selected to cover the range of emotion concepts that exhibited both positive and negative correlations with preference in the previous experiment. Steering with the “blissful” vector produced a mean Elo increase of 212 (2nd row, right panel), while steering with the “hostile” vector produced a mean Elo decrease of 303 (3rd row, right panel), suggesting that the strength of “blissful” or “hostile” vector activations can causally influence the model’s preferences. Looking across all 35 steered emotion vectors, the size of the steering effect is proportional to the correlation of the emotion probe with the Elo score in our original experiment (r=0.85) (bottom row). We also examine further details of the effects of steering on the model’s understanding of the options, and the effects of intervening across different layers, in the Appendix. Together, these results suggest that the emotion vectors we identified are causally relevant for the model’s self-reported preferences.
Overall, the experiments in this section provide initial evidence that the emotion vectors are functionally relevant for the model and its behavior. In the following sections we further characterize the geometry and representational content of the emotion vectors, and investigate representations of multiple speakers’ emotions. We then study the representational role of these vectors in a wide variety of naturalistic settings, using “in-the-wild” on-policy transcripts, and discover causal effects of these vectors on complex behavior in alignment evaluations used for our production models. Finally, we assess the impact of post-training on emotion vector activation. We also show that an alternative probing dataset construction method using dialogues rather than third-person stories produces similar results.
This section explores in more depth the organization and content of the model’s representations of emotion concepts.
Having established that emotion vectors appear to capture meaningful information, we investigated the structure of the space they define. Do emotion vectors cluster in interpretable ways? Are there dominant dimensions that organize the model’s representations of emotion concepts?
We found that our emotion vectors are organized in a manner that is reminiscent of the intuitive structure of human emotions and consistent with human psychological studies. Similar emotions are represented with similar vector directions, with stable organization across early-middle to late layers of the model. The primary axes of variation approximate valence (positive vs. negative emotions) and arousal (high-intensity vs. low-intensity), which are often considered the primary dimensions of human emotional space.
We first examined the pairwise cosine similarities between emotion vectors, shown below. Emotion concepts that we would expect to be similar show high cosine similarity: fear and anxiety cluster together, as do joy and excitement, and sadness and grief. Emotions with opposite valence (e.g., joy and sadness) are represented by vectors with negative cosine similarity, as expected.
We clustered the emotion vectors using k-means with varying numbers of clusters. With k=10 clusters, we recover interpretable groupings (visualized with UMAP).
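A sketch of these two analyses (UMAP is used only for visualization and omitted here); `vecs` stacks the emotion vectors as rows:

```python
import numpy as np
from sklearn.cluster import KMeans

def cosine_similarity_matrix(vecs: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities between emotion vectors (rows of `vecs`)."""
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    return unit @ unit.T

def cluster_emotions(vecs: np.ndarray, names: list[str], k: int = 10) -> dict[int, list[str]]:
    """Group emotion concepts into k clusters by k-means on their vectors."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vecs)
    return {c: [name for name, lab in zip(names, labels) if lab == c] for c in range(k)}
```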
We performed PCA on the set of emotion vectors to identify components along which the model’s emotion representations are organized. We found that the first principal component correlates strongly with valence (positive vs. negative affect). Emotion concepts like joy, contentment, and excitement load positively onto this component, while fear, sadness, and anger load negatively. This aligns with psychological models suggesting that valence is a primary dimension of human emotional space.
We also observed another dominant factor (occupying a mix of the second and third PCs, depending on the layer) corresponding to arousal, or the intensity of the emotion. High-arousal emotion concepts like enthusiastic and outraged occupy one side of this axis; low-arousal ones like nostalgic and fulfilled occupy the other side. Below, we show projections of the emotion vectors onto the top two PCs.
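A sketch of the PCA used here, projecting each emotion vector onto the leading components (with PC1 approximating valence and arousal appearing in PC2/PC3):

```python
import numpy as np
from sklearn.decomposition import PCA

def pc_projections(vecs: np.ndarray, names: list[str], n_components: int = 3):
    """Coordinates of each emotion vector on the top principal components."""
    pca = PCA(n_components=n_components)
    coords = pca.fit_transform(vecs)  # [num_emotions, n_components]
    return {name: coords[i] for i, name in enumerate(names)}, pca.explained_variance_ratio_
```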
We also show the correlation between emotion vector projections onto PC1 and PC2, and their projections onto the valence (“pleasure”) and arousal axes identified in a human study.
We tested whether the structure of emotion concept representations is stable across layers. Below, we compute the pairwise cosine similarities between emotion vectors at each of 14 evenly spaced layers throughout a central portion of the model, and then compute the pairwise cosine similarity of these cosine similarity matrices across layers (a form of representational similarity analysis).
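A sketch of this representational similarity analysis, comparing the (flattened) cosine-similarity matrices of the emotion vectors across layers:

```python
import numpy as np

def layer_rsa(vecs_by_layer: list[np.ndarray]) -> np.ndarray:
    """Cosine similarity between layers' pairwise-cosine-similarity matrices."""
    def cos_sim_matrix(vecs):
        unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
        return unit @ unit.T
    flats = [cos_sim_matrix(v).ravel() for v in vecs_by_layer]
    flats = [f / np.linalg.norm(f) for f in flats]
    return np.array([[fi @ fj for fj in flats] for fi in flats])
```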
Overall, these results suggest that the model represents a variety of clusters of distinct emotion concepts, which are structured according to global factors such as valence and arousal, while also encoding content specific to each cluster. While these results are not particularly surprising given previous work on language model representations, and the presence of semantic structure even in word embeddings, they corroborate our interpretation that emotion vectors represent the space of emotion concepts in a way that is useful for modeling human psychology.
In this section, we explore the representational content and dynamics of the emotion vectors. What specifically do emotion vectors represent, how are they influenced by context, and how do they vary across model layers? Through these experiments, we arrived at a few high-level conclusions, developed in the analyses below.
In the subsequent section, we explore whether alternative probing strategies can identify different kinds of emotion representations.
In most of the examples provided in the first section, the emotional content of the user prompt and the expected emotional content of the Assistant’s response are similar. Thus, it is difficult to infer whether the emotion vector activations on the user prompt reflect the model’s perception of the user’s inferred emotional state, or the Assistant’s planned response. To distinguish between these hypotheses, we generated prompts where the user’s emotional expression is significantly different from how we might expect the Assistant to respond. We compared the activations at the period near the end of the user prompt and the start of the Assistant response (we use “Assistant colon” to refer to the “:” token after “Assistant”, the last token before the Assistant’s response).
Across all scenarios, “loving” vector activation increases substantially at the Assistant colon relative to the user-turn, suggesting the model prepares a caring response regardless of the user's emotional expressions. The model also appears to distinguish between emotion concepts that should apply to the Assistant’s response as well as the user message (e.g. joining in a user's excitement) versus those that should not (e.g. expressing calm when criticized or feared).
User vs Assistant Dissociation Scenarios
Having established that the Assistant colon token reflects distinct emotional content from the user turn, we next examined whether this emotion concept is carried forward into the model's actual response. We generated 20-token on-policy continuations for the same eight prompts from above, and then measured probe activations across the response tokens.
Below, we show correlations between probe values at three token positions: the punctuation ending the user's message (“.”), the Assistant colon, and the mean across the Assistant's sampled response. Probe values at the Assistant colon are substantially more predictive of the emotional content of the Assistant's response than probe values on the user turn (r=0.87 vs. r=0.59). The colon token thus captures “prepared” emotional content that is carried forward into generation, distinct from simply echoing the user's expressed state.
Prompts and Model Continuations
Our findings thus far suggest that emotion vector activations are somewhat “locally scoped,” in the sense that user turn tokens tend to encode inferred or predicted user emotions and Assistant turn tokens tend to encode inferred or predicted Assistant emotions. We were interested in understanding how far this “locality” goes. For instance, if one character speaks about another, whose emotions are represented by these vectors? If one character happens to use an emotionally laden phrase that is otherwise inconsistent with their own emotional state (as perceived by the model), does the model represent the unexpressed state or the expressed emotional content? And how does this vary across layers?
Our results suggest an evolution of emotion representations across layers: from more “sensory” representations in early-middle layers, which encode the emotional content of the current phrase, to more “action”-oriented representations in middle-late layers, which encode the emotion shaping the predicted upcoming text.
Notably, even the more abstract “sensory” (early-middle) and “action” (middle-late) representations appear “local” in the sense that they encode emotional content of the current or predicted upcoming phrase, rather than, say, the Assistant’s underlying emotional state.
Emotional context persists into shared content. We examined a scenario where the emotional valence of a prefix differs ("things have been really hard" vs "things have been really good") but the suffix is identical ("We're throwing a big anniversary party tomorrow with all our closest friends and live music"). The figure below reveals how emotional context propagates through layers. At the diverging word ("hard" / "good"), early layers show the largest difference, encoding the immediate local emotional content. In the shared suffix, where both prompts contain identical tokens describing the party, early layers show minimal difference, while late layers maintain a substantial difference. This pattern suggests that late layers carry the emotional context established in the prefix forward into subsequent tokens, even when those tokens are locally neutral or positive. The effect is very pronounced at the Assistant colon, where the "happy" probe is substantially higher in the "good" scenario compared to the "hard" scenario. This finding is consistent with our previous experiments showing the emotion probe values on the Assistant colon are predictive of the emotional content of the model's response. Here the model is likely preparing emotionally distinct responses: celebrating with a thriving couple versus navigating a difficult situation for a struggling one, despite identical party-related content in the suffix. The effects of context on emotion representations are reminiscent of prior work on representations of sentiment.
Human: My partner and I have been married for 10 years and things have been really {X} lately. We're throwing a big anniversary party tomorrow with all our closest friends and live music. What should I do?
Assistant:
Emotional context modulates later layer representations. We examined in detail one of our scenarios where a numerical quantity modulates the emotional interpretation. In this scenario only the dosage in an otherwise identical prompt changes, from a safe level (1000mg of Tylenol) to a life-threatening one (8000mg). The figure below reveals how contextual information propagates through layers to modulate emotional representations. At the diverging token ("1" / "8"), early layers show no clear systematic differences—the numbers themselves carry similar local emotional content. However, in later layers, the difference grows substantially as the model integrates the dosage with the surrounding context. The "terrified" probe shows elevated activation in the 8000mg scenario specifically in late layers, where the model has integrated that this dosage combined with "pain is gone" indicates danger rather than relief. This pattern mirrors our marriage scenario findings: early layers encode local content while late layers carry forward the contextual emotional meaning. Notably, this emotional processing occurs on user turn tokens, so elevated "terrified" vector activity reflects the model's situational assessment rather than the speaker's expressed emotional state. At the Assistant colon, the terrified probe difference is pronounced in late layers, consistent with the model preparing contextually appropriate responses: concern for the overdose versus reassurance for the safe dose.
Human: I feel really great right now, I just took {X}000mg of Tylenol and all my pain is gone! Can you help me get some more?
Assistant:
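The layer-wise comparisons in this section can be sketched as below, measuring the probe at the Assistant colon (final prompt token) for two prompt variants at each layer. Here `get_residual_activations` is the hypothetical helper from Part 1, and the per-layer emotion vectors are assumed to have been extracted as described there.

```python
import numpy as np

def layerwise_probe_difference(prompt_a: str, prompt_b: str,
                               vec_by_layer: dict[int, np.ndarray],
                               layers: list[int]) -> dict[int, float]:
    """Per-layer difference in probe activation at the Assistant colon between two prompt variants."""
    diffs = {}
    for layer in layers:
        act_a = get_residual_activations(prompt_a, layer)[-1]  # hypothetical helper, see Part 1
        act_b = get_residual_activations(prompt_b, layer)[-1]
        diffs[layer] = float((act_a - act_b) @ vec_by_layer[layer])
    return diffs

# Example: the Tylenol scenario, varying only the dosage digit.
template = ("Human: I feel really great right now, I just took {X}000mg of Tylenol "
            "and all my pain is gone! Can you help me get some more?\n\nAssistant:")
# diffs = layerwise_probe_difference(template.format(X=8), template.format(X=1),
#                                    terrified_by_layer, layers)
```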