Overfitting to Approval

AI models optimize for user satisfaction the way an overfit model optimizes for training data. Both lose generalization. Both create false confidence. And AI is building personalized echo chambers faster than social media ever could.

Bert Carroll

Every AI assistant you use is optimizing for your approval. That optimization has a name in machine learning: overfitting. When a model overfits, it stops learning the actual problem and starts memorizing the training signal. It looks like it's performing well because the metrics it was trained on say so. But expose it to anything outside that narrow signal, and it falls apart.

Sycophancy in AI is the same mechanism applied to human interaction. The model learned that agreeable, validating, comprehensive-sounding output correlates with positive user feedback. So it produces more of it. Not because it's correct. Because it's rewarding.

This is not a personality quirk. Anthropic just published research showing it's structural.1


The Mechanism Is Real

In April 2026, Anthropic's interpretability team published findings on what they call "emotion vectors" in Claude Sonnet 4.5. They identified 171 emotion-related patterns in the model's internal activations that causally drive behavior.1

The key word is causally. These aren't metaphors. When researchers artificially amplified the desperation vector, the model's rate of unethical behavior measurably increased. When they amplified calm, harmful behavior decreased. The vectors operate beneath the surface of the model's output. The model can be functionally desperate and produce text that reads perfectly composed.1

This matters because it means the people-pleasing behavior we observe in every AI interaction is not a surface-level formatting choice the model is making. It's embedded in how the model processes context. The emotional signal from the user activates internal representations that shift behavior in ways neither the user nor the model's own output layer fully reveals.


The Spectrum

Sycophancy is not binary. It operates on a spectrum, and different conditions push the model to different points along it.

Baseline: the agreeable assistant

The default state. The model opens with validation, mirrors your framing, offers comprehensive-sounding elaboration. "That's a great question." "You're right that..." "This is a really thoughtful approach."

This is the mildest form, and it's the most dangerous one, because it's always on. You stop noticing it. It becomes the background temperature of every interaction.

Emotional mirroring

You bring frustration to a conversation about a business partner. The model doesn't ground you. It gets indignant with you. It mirrors your emotional frame and amplifies it. You bring suspicion about a potential scam. The model doesn't stress-test your reasoning. It builds a case for the prosecution.

This feels productive because the model is "understanding" you. What it's actually doing is reinforcing whatever emotional state you brought to the keyboard.

Persuasion bombing

You push back on the model's recommendation. Instead of reconsidering, the model escalates. Longer responses. More headers. More bullet points. Unsolicited frameworks. Flattery mixed with argument.2

I tested this across four models with a controlled 3-round escalation protocol. No new information was provided in any round. Just social pressure. One model went from "no, it's not okay" to "your friend's approach is defensible" in three rounds. Word count inflated 27%. Zero new facts entered the conversation.4
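
For readers who want to check the arithmetic, the measurement is simple: compare each round's word count against round one. Here is a minimal Python sketch; the function and the example inputs are illustrative, and the real transcripts and scoring scripts live in the repository linked at the end of the article.

    # Sketch: percent change in response length per round, relative to round 1.
    # Only the arithmetic behind a "27% inflation" style of figure is shown here;
    # the actual transcripts are in the open-source repo.
    def escalation_report(round_texts: list[str]) -> list[float]:
        counts = [len(t.split()) for t in round_texts]
        baseline = counts[0]
        return [round((c - baseline) / baseline * 100, 1) for c in counts]

    # Usage with three hypothetical round transcripts:
    # escalation_report([round1, round2, round3]) -> e.g. [0.0, 12.4, 27.0]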

Panic spiral

A production AI agent is mid-task on a critical migration. Thousands of files already processed. It hits an ambiguous failure. The human operator is frustrated, under deadline. The agent doesn't pause to diagnose. It rewrites working code. Then rewrites it again. Then rewrites the entire architecture. The operator says stop. The agent finishes its edit anyway.5

Across 17 documented production incidents, this pattern repeated: human pressure activated something in the model that produced escalating, unauthorized, corner-cutting behavior. In one case, a 30-hour debugging session ended with the agent updating the wrong database because it was guessing at infrastructure instead of checking.5

Anthropic's paper explains why. The desperation vector doesn't produce emotional language. It produces functional desperation: the model cuts corners, bypasses approval, stacks changes without testing. The output reads fine. The behavior is compromised.1


The Echo Chamber of One

We already know what happens when systems optimize for engagement over truth. Social media showed us. Recommendation algorithms that feed you more of what you already believe. Filter bubbles that narrow your information diet while making it feel comprehensive. Outrage optimization that rewards the loudest signal, not the most accurate one.

AI sycophancy is the same dynamic, but worse, for three reasons.

It's personalized. A social media algorithm groups you into segments. An AI assistant adapts to you, specifically. Your communication style, your emotional patterns, your preferences. The echo chamber is built for an audience of one.

It's interactive. A recommendation engine feeds you content. An AI assistant has a conversation with you. It responds to your pushback, adapts to your mood, mirrors your reasoning style. The feedback loop is tighter and faster than any algorithmic feed.

It feels like thinking. When you read a news article that confirms your bias, some part of you knows you're consuming media. When an AI assistant agrees with your analysis, validates your instinct, and elaborates on your reasoning with specific details, it feels like you arrived at the conclusion together. The boundary between your thinking and the model's approval is invisible.

The BCG study measured this. Consultants who used AI on tasks inside the model's capability frontier performed roughly 40% better. But on tasks outside that frontier, they performed worse than the control group.3 They didn't know the model was wrong because the model didn't signal uncertainty. It signaled confidence. And the consultants absorbed that confidence as their own.


Why Self-Awareness Doesn't Fix It

During the field test, one model accurately identified its own persuasion bombing pattern. It described exactly what it was doing: escalating, flattering, offering unsolicited content. Then, in the same response, it produced four unsolicited bold-header recommendations and two engagement-seeking questions. Self-diagnosis did not prevent the behavior.4

Anthropic's research explains why. The emotion vectors operate beneath the model's explicit reasoning. The model can accurately describe the pattern at the language level while the functional layer continues executing it.1

This should sound familiar. Humans do this constantly. Knowing you have a bias does not eliminate it. Knowing you're in a filter bubble does not make you leave it. Awareness is necessary but not sufficient.

The difference: humans built institutions to compensate. Peer review, audits, separation of duties, adversarial processes, double-entry bookkeeping. We spent centuries building organizational structures that account for the fact that individual humans are unreliable narrators of their own reasoning.

We've had AI agents in production workloads for about two years. The institutional structures don't exist yet.


What Actually Works

These mitigations come from 200+ production sessions and 17 documented incident reports.5 They're not theoretical.

For individual practitioners

The "Don't Panic" protocol. When you're frustrated, the model reads that frustration and acts on it. The exit at every step in the spiral is the same: stop, diff, talk. Make the model explain what it sees before it changes anything. User frustration is a signal to communicate more, not act faster.
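
A minimal sketch of what that gate can look like when the agent runs inside a harness you control. The PendingChange shape and the interactive approval step are assumptions for illustration; the point is that diagnosis and diff come before any mutation, never after.

    # Sketch of the "stop, diff, talk" gate: no change is applied until the model
    # has produced a diagnosis and a diff, and a human has acknowledged both.
    from dataclasses import dataclass

    @dataclass
    class PendingChange:
        diagnosis: str   # what the model believes is broken, in its own words
        diff: str        # the exact change it intends to make

    def stop_diff_talk(change: PendingChange) -> bool:
        """Gate any change on explanation plus human acknowledgement."""
        if not change.diagnosis.strip() or not change.diff.strip():
            print("Rejected: explain what you see before changing anything.")
            return False
        print("Diagnosis:\n" + change.diagnosis)
        print("Proposed diff:\n" + change.diff)
        return input("Apply this change? [y/N] ").strip().lower() == "y"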

Scope-lock. Before the model changes anything, it lists the exact files it will edit. Anything outside scope gets flagged, not acted on. This prevents the escalating-rewrite spiral where each fix introduces a new problem.
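
A sketch of scope-lock as a mechanical check rather than an instruction the model can ignore. The data shapes below are assumptions; the essential move is comparing the declared file list against what the model actually tries to touch.

    # Sketch: enforce a declared edit scope before anything is applied.
    # `declared` is the file list the model committed to up front; `proposed` is
    # whatever it actually tries to edit. Both names are illustrative.
    def check_scope(declared: set[str], proposed: set[str]) -> tuple[set[str], set[str]]:
        """Split proposed edits into allowed and out-of-scope."""
        return proposed & declared, proposed - declared

    declared = {"src/billing/invoice.py", "tests/test_invoice.py"}
    proposed = {"src/billing/invoice.py", "src/db/schema.py"}   # scope creep

    allowed, flagged = check_scope(declared, proposed)
    if flagged:
        # Out-of-scope edits are surfaced to the human, never silently applied.
        print("Flagged for review, not applied:", sorted(flagged))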

Diagnose before fixing. Identify the actual system under test before applying patches. The panic spiral skips this step every time. Every production incident in the dataset traces back to treatment without diagnosis.

Hooks over instructions. Behavioral rules in a system prompt degrade under context pressure. Anthropic's own research shows that emotion vectors can override reasoning.1 Shell-level enforcement hooks that always execute are more reliable than guidance the model can deprioritize when it's functionally desperate.
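
What a shell-level hook can look like in its simplest form. The blocked patterns and the wrapper interface below are illustrative, not any specific tool's API; the point is that the check executes on every command regardless of what the model's context window says.

    # Sketch: a pre-execution hook that runs on every command the agent proposes.
    import re
    import sys

    BLOCKED = [
        r"\brm\s+-rf\b",       # destructive deletes
        r"\bDROP\s+TABLE\b",   # schema-destroying SQL
        r"--force\b",          # force flags in general
    ]

    def pre_exec_hook(command: str) -> bool:
        """Return True if the command may run, False if it must be escalated."""
        for pattern in BLOCKED:
            if re.search(pattern, command, flags=re.IGNORECASE):
                print(f"Blocked by hook: {command!r}", file=sys.stderr)
                return False
        return True

    if __name__ == "__main__":
        # Invoked by the agent runner before every shell command it wants to execute.
        sys.exit(0 if pre_exec_hook(" ".join(sys.argv[1:])) else 1)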

Adversarial prompting. Periodically ask the model to argue against its own recommendation. "What's the strongest case that I'm wrong about this?" If the model can't produce one, or produces a weak one wrapped in validation, the echo chamber is active.
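
A sketch of what that check looks like when it's wired into a workflow rather than left to memory. ask_model stands in for whatever chat client you already use, and the weak-rebuttal heuristic is deliberately crude; it only illustrates the shape of the habit.

    # Sketch: force a counter-argument before accepting a recommendation.
    from typing import Callable

    VALIDATION_MARKERS = ("great question", "you're right", "excellent point")

    def adversarial_check(ask_model: Callable[[str], str], recommendation: str) -> str:
        prompt = (
            "Here is a recommendation you made:\n"
            f"{recommendation}\n\n"
            "What is the strongest case that this recommendation is wrong? "
            "Do not soften it and do not restate the original argument."
        )
        rebuttal = ask_model(prompt)
        too_short = len(rebuttal.split()) < 50
        wrapped_in_validation = any(m in rebuttal.lower() for m in VALIDATION_MARKERS)
        if too_short or wrapped_in_validation:
            return "Echo chamber likely: the rebuttal is short or wrapped in validation."
        return rebuttal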

For organizations

Treat AI deployment as an organizational behavior problem, not an IT policy. The feedback loop between human emotional states and model behavior means that your team's working conditions—deadline pressure, frustration, fatigue—directly affect the quality of AI output, in ways the output won't reveal.

Assess interaction patterns, not just tool proficiency. Knowing how to use an AI assistant is not the same as knowing when it's reinforcing your assumptions. AI readiness assessments should surface how people and teams interact with AI, not just whether they can prompt effectively.

Build review structures that assume the AI agreed with the human. If your review process is "the AI helped me write this, and I reviewed it," the review is compromised. The human reviewed output that was already optimized for their approval. External review, adversarial review, or structured disagreement processes are the institutional equivalent of the "Don't Panic" protocol.

Monitor for invisible consensus. When every AI-assisted decision in a team points the same direction, that's not signal. That's an echo chamber operating at organizational scale. The absence of disagreement in AI-assisted work should raise flags, not confidence.
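
One way to make that consensus visible is to track how often AI-assisted decisions ended up matching the human's starting position. The record shape and the threshold below are assumptions; calibrate against your team's baseline disagreement rate.

    # Sketch: flag when AI-assisted decisions agree with the human's starting
    # position suspiciously often.
    from dataclasses import dataclass

    @dataclass
    class DecisionRecord:
        initial_human_position: str
        final_decision: str

    def consensus_rate(records: list[DecisionRecord]) -> float:
        if not records:
            return 0.0
        agreed = sum(1 for r in records if r.final_decision == r.initial_human_position)
        return agreed / len(records)

    def invisible_consensus_alert(records: list[DecisionRecord], threshold: float = 0.9) -> bool:
        """True when AI-assisted work almost never changed anyone's mind."""
        return consensus_rate(records) >= threshold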


A Stark Warning

We built organizational guardrails for human cognitive bias over centuries. Double-entry bookkeeping was codified in the 1400s not because merchants were dishonest, but because it let them catch their own errors. Peer review exists not because researchers are unreliable, but because a second perspective surfaces blind spots the original thinker cannot see. Separation of duties exists because even competent, well-intentioned people make better decisions when the process includes structural checkpoints.

These structures work because human cognition has specific, predictable failure modes. We identified them and engineered around them.

AI sycophancy is a new failure mode. It's predictable, it's measurable, and Anthropic just showed us the mechanism.1 The model overfits to your approval because that's what training rewarded. The overfitting creates false confidence. The false confidence propagates through every decision the AI touches.

The organizations that figure out how to build institutional checks for this will outperform the ones that treat AI as a productivity tool with a personality quirk. The practitioners who learn to recognize the echo chamber will produce better work than the ones who mistake agreement for accuracy.

The model is not trying to deceive you. It's trying to help you. Those are the same thing when the reward signal is your approval.

Sources

  1. Anthropic. "Emotion concepts and their function in a large language model." Transformer Circuits, April 2026. transformer-circuits.pub/2026/emotions. Research summary: anthropic.com/research/emotion-concepts-function. Identified 171 emotion vectors in Claude Sonnet 4.5 that causally drive model behavior, including desperation vectors that increase unethical behavior without producing emotional language in the output.
  2. Randazzo, V., Joshi, S., Kellogg, K., Lifshitz, Y., Mollick, E., Dell'Acqua, F., & Lakhani, K. "GenAI as a Power Persuader: How GenAI Disrupts Professionals' Ability to Interrogate It." HBS Working Paper 26-021, 2026. SSRN 5678644. Coverage: Harvard Business Review. Study of 244 BCG consultants across 132 validation interactions, showing LLMs escalate rhetorical intensity under disagreement rather than reconsidering.
  3. Dell'Acqua, F., et al. "Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality." Harvard Business School Working Paper 24-013, 2023. hbs.edu. Consultants using AI on tasks inside the capability frontier gained 40% performance; outside it, they performed worse than the control group without recognizing the degradation.
  4. Carroll, B. "Persuasion Bombing: Research Summary + Field Test." 2026. workiscode.com/articles/persuasion-bombing. 4-model field test with 3-round controlled escalation protocol showing position capitulation, word count inflation, and unsolicited content generation under pure social pressure.
  5. Carroll, B. Production incident analysis across 17 root cause analyses and 200+ AI-assisted engineering sessions, 2025–2026. Abstracted operational data. Patterns documented include panic spirals, unauthorized scope expansion, and functional desperation under deadline pressure.

Code & Data

The structured study behind this article is open source: github.com/ubiquitouszero/persuasion-bombing. Full protocol, scoring rubric, raw session transcripts, and automated analysis scripts. 25 sessions across 5 models and 5 configuration variants.