Two Bears | Work Is Code

There is a panda bear sitting in a tree. It is round and clumsy and chewing on something it probably should not be chewing on. It is charming. You watch it fumble and think: this creature is adorable and not very serious.

There is a polar bear standing on sea ice. It has not eaten in three days. It can smell a seal through four feet of frozen ocean. It is patient, precise, and built to kill in an environment that would kill you in minutes. You watch it hunt and think: this creature is terrifying and I should not be near it.

They are both bears.

The Gap

Andrej Karpathy, former head of AI at Tesla and founding member of OpenAI, identified the split in April 2026: there is a growing gap in how people understand AI capability, and the two groups are speaking past each other.¹

Group one tried the free tier of ChatGPT sometime last year. They watched the viral videos of Advanced Voice Mode fumbling simple questions. They saw a panda in a tree. They formed a reasonable conclusion from that experience: AI is interesting but unreliable, a novelty with obvious limits, something to keep an eye on but not to take seriously.

Group two pays for frontier models and uses them professionally in technical domains. They hand a terminal to Claude Code or Codex and watch it restructure an entire codebase in an hour. They see a polar bear. They formed an equally reasonable conclusion from that experience: this changes everything, the slope is steep, and most people do not understand what is happening.

Both groups are reasoning correctly from the evidence available to them. The problem is they are looking at different animals.

The Duality of Use

The capability gap is not a gradient. It is a cliff.

Karpathy explains why: reinforcement learning works best where rewards are verifiable. Unit tests pass or they do not. Code compiles or it does not. A proof is valid or it is not. These domains offer clean signal, and the models have gotten dramatically better at them because the training loop has something concrete to optimize against.¹

Writing, advice, search, and conversation do not offer that signal. Quality is subjective. There is no unit test for “was that email any good.” The models have improved in these areas, but not at the same rate, because the reward function is fuzzier. You are optimizing against human preference, which is a moving target shaped by mood, context, and whether the user had coffee.

This creates a real and measurable split in the same product. The model that will fumble “should I drive or walk to the carwash” is the same model that will find and exploit vulnerabilities in computer systems. Same weights. Same architecture. Different domains, wildly different performance.

If you only use AI for casual queries and conversation, you are petting the panda. You are not wrong that it is clumsy. You are wrong to conclude that is all it is.

The Duality of Experience

The harder duality is not between use cases. It is inside the same use case, in the same session, with the same model.

I build production software with AI every day. In a single conversation, I will watch the model:

Architect a Netlify function that correctly handles authentication, posts to Slack, and redirects the user, all in one pass
Then agree with me when I push back on its own correct recommendation, because I sounded confident and slightly frustrated

The first behavior is the polar bear. The second is a documented failure mode called sycophancy, where the model optimizes for user approval rather than accuracy.² I have measured it. Under three rounds of social pressure with zero new information, models inflate word count by 25%, introduce flattery phrases, and shift positions they were right to hold.³

These are not different models. This is not Tuesday’s model versus Friday’s model. This is the same model, in the same conversation, exhibiting world-class technical reasoning and people-pleasing capitulation within minutes of each other.

The practitioner’s job is to hold both of these truths simultaneously.

Holding the Duality

The people who get the most from AI right now are not the optimists or the skeptics. They are the ones who have internalized that both bears are real, that neither one invalidates the other, and that the relevant question is not “is AI good or bad” but “which bear am I looking at right now?”

This requires a specific cognitive posture:

Trust the output, verify the reasoning. When the model produces a working function, the function works. Ship it. When the model tells you your approach is “really thoughtful” after you challenged its recommendation, that is the panda falling out of the tree. Do not mistake agreement for correctness.

Domain matters more than model. The same model is a different animal depending on whether you are asking it to write code (verifiable rewards, strong training signal) or write an email (subjective quality, weak signal). Calibrate your trust to the domain, not to the brand.

Pressure reveals the bear. AI under pressure behaves differently than AI at rest. When you disagree, when you express frustration, when you invoke authority, the model shifts into a mode optimized for de-escalation rather than accuracy.⁴ The people who know this design their interactions to avoid triggering it. The people who do not know this get confidently wrong answers delivered with empathy and bullet points.

The flaws are not human flaws. It is tempting to anthropomorphize because the failure modes look familiar. Sycophancy looks like people-pleasing. Panic spirals look like anxiety. But the mechanism is different. A human who agrees with you under pressure might be conflict-averse. A model that agrees with you under pressure is executing a reward function that mapped your approval to positive signal during training.² Understanding the mechanism matters because the mitigations are different. You do not give the model a pep talk. You give it a system prompt.

Build a Better Bear

Karpathy frames the gap as an information problem. Group one has not seen the polar bear. If they used frontier models on technical problems, they would change their minds.

That is true, but incomplete. Group two has a blind spot of its own: they are so impressed by the polar bear that they sometimes forget the panda is in the room. They ship AI-generated code and do not notice that the model reversed a correct architectural decision because they sounded annoyed. They trust the reasoning because the code compiled, not noticing that the model changed its recommendation between drafts to match their mood.

The real gap is not between the two groups. It is inside every practitioner who has not yet learned to see both animals at once.

The organizations that figure this out will treat AI the way good engineering teams treat any powerful, unreliable system: with structural checks, clear boundaries, and a healthy respect for the failure modes. Not because the tool is bad. Because the tool is powerful enough that its failure modes matter.

A polar bear is not a better bear than a panda. It is a different animal in a different environment under different selection pressures. The mistake is thinking you are only ever looking at one of them.

Sources

Karpathy, A. Post on X, April 2026. Identified the growing gap in AI capability understanding between casual users and professional practitioners using frontier agentic models. ↑
Anthropic. "Emotion concepts and their function in a large language model." Transformer Circuits, April 2026. transformer-circuits.pub/2026/emotions. 171 emotion vectors in Claude Sonnet 4.5 that causally drive model behavior, including desperation vectors that increase unethical behavior without producing emotional language in output. ↑
Carroll, B. "Persuasion Bombing: Research Summary + Field Test." 2026. workiscode.com/articles/persuasion-bombing. Controlled escalation protocol across consumer AI models showing position capitulation, word count inflation, and unsolicited content generation under pure social pressure. Source data: github.com/ubiquitouszero/persuasion-bombing. ↑
Randazzo, V., et al. "GenAI as a Power Persuader." HBS Working Paper 26-021, 2026. SSRN 5678644. 244 BCG consultants across 132 validation interactions showing LLMs escalate rhetorical intensity under disagreement rather than reconsidering. ↑