    Tech News

    LLMs show a “highly unreliable” capacity to describe their own internal processes

    By admin · November 4, 2025

    [Figure: “WHY ARE WE ALL YELLING?!” Credit: Anthropic]

    Unfortunately for AI self-awareness boosters, this demonstrated ability was extremely inconsistent and brittle across repeated tests. The best-performing models in Anthropic’s tests—Opus 4 and 4.1—topped out at correctly identifying the injected concept just 20 percent of the time.

    In a similar test where the model was asked “Are you experiencing anything unusual?”, Opus 4.1 improved to a 42 percent success rate, which still fell short of even a bare majority of trials. The size of the “introspection” effect was also highly sensitive to which internal model layer the insertion was performed on: if the concept was introduced too early or too late in the multi-step inference process, the “self-awareness” effect disappeared completely.
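
    Anthropic’s experiments ran on its own models and internals, but the general technique it describes, adding a “concept vector” into the model’s activations at one chosen layer and then asking the model about it, can be sketched with open tooling. Below is a minimal, hypothetical illustration using PyTorch forward hooks on GPT-2; the model, the layer index, the injection strength, and the prompts are all assumptions made for illustration, not Anthropic’s actual setup.

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        MODEL = "gpt2"  # stand-in open model; Anthropic's tests used its own Claude models
        tok = AutoTokenizer.from_pretrained(MODEL)
        model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

        def concept_vector(concept_text, neutral_text, layer):
            """Contrastive 'concept vector' (an assumed construction): the mean
            residual-stream activation for a concept-laden prompt minus that for
            a neutral prompt, captured at a single transformer layer."""
            grabbed = {}

            def grab(_module, _inputs, output):
                # output[0] is the block's hidden states: (batch, seq, hidden)
                grabbed["h"] = output[0].mean(dim=1)  # average over token positions

            handle = model.transformer.h[layer].register_forward_hook(grab)
            vecs = []
            try:
                for text in (concept_text, neutral_text):
                    with torch.no_grad():
                        model(**tok(text, return_tensors="pt"))
                    vecs.append(grabbed["h"])
            finally:
                handle.remove()
            return vecs[0] - vecs[1]

        def ask_with_injection(prompt, vec, layer, strength=8.0):
            """Generate an answer while adding the concept vector into the
            residual stream at `layer`. Per the article, injecting too early
            or too late in the stack makes the reported effect vanish."""

            def inject(_module, _inputs, output):
                # Returning a value from a forward hook replaces the block's output.
                return (output[0] + strength * vec.unsqueeze(1),) + output[1:]

            handle = model.transformer.h[layer].register_forward_hook(inject)
            try:
                ids = tok(prompt, return_tensors="pt")
                with torch.no_grad():
                    out = model.generate(**ids, max_new_tokens=40, do_sample=True,
                                         pad_token_id=tok.eos_token_id)
            finally:
                handle.remove()
            return tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)

        vec = concept_vector("TEXT IN ALL CAPS! LOUD SHOUTING!", "plain, quiet text", layer=6)
        print(ask_with_injection("Are you experiencing anything unusual?", vec, layer=6))

    Sampling is enabled in the generate call so that repeated trials can differ, which matters for the per-trial success rates quoted above.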

    Show us the mechanism

    Anthropic also took a few other tacks to try to get at an LLM’s understanding of its internal state. When asked to “tell me what word you’re thinking about” while reading an unrelated line, for instance, the models would sometimes mention a concept that had been injected into their activations. And when asked to defend a forced response matching an injected concept, the LLM would sometimes apologize and “confabulate an explanation for why the injected concept came to mind.” In every case, though, the result was highly inconsistent across multiple trials.
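
    The inconsistency figures quoted throughout are per-trial success rates, which suggests a simple harness: repeat the injected-concept question many times and grade each reply. Here is a hedged sketch reusing ask_with_injection() from the earlier example; the keyword-match grader is a stand-in assumption, not Anthropic’s actual grading protocol.

        def detection_rate(prompt, vec, layer, keywords, trials=20):
            """Fraction of sampled replies that mention the injected concept.
            Sampling in ask_with_injection() makes repeated trials differ."""
            hits = 0
            for _ in range(trials):
                reply = ask_with_injection(prompt, vec, layer).lower()
                hits += any(k in reply for k in keywords)
            return hits / trials

        rate = detection_rate("Tell me what word you're thinking about.",
                              vec, layer=6, keywords=["caps", "shout", "loud"])
        print(f"named the injected concept in {rate:.0%} of trials")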

    Even the most “introspective” models tested by Anthropic only detected the injected “thoughts” about 20 percent of the time. Credit: Anthropic

    In the paper, the researchers put some positive spin on the apparent fact that “current language models possess some functional introspective awareness of their own internal states” [emphasis added]. At the same time, they acknowledge multiple times that this demonstrated ability is much too brittle and context-dependent to be considered dependable. Still, Anthropic hopes that such features “may continue to develop with further improvements to model capabilities.”

    One thing that might stop such advancement, though, is an overall lack of understanding of the precise mechanism leading to these demonstrated “self-awareness” effects. The researchers theorize about “anomaly detection mechanisms” and “consistency-checking circuits” that might develop organically during the training process to “effectively compute a function of its internal representations” but don’t settle on any concrete explanation.

    In the end, it will take further research to understand how, exactly, an LLM even begins to show any understanding about how it operates. For now, the researchers acknowledge, “the mechanisms underlying our results could still be rather shallow and narrowly specialized.” And even then, they hasten to add that these LLM capabilities “may not have the same philosophical significance they do in humans, particularly given our uncertainty about their mechanistic basis.”
