Introduction

I’ve been writing about AI’s history; this article, however, is about the present. A recently published Apple paper, when viewed alongside an older Google Research paper, gives clear upper and lower bounds on an LLM’s “thought process” capability. This ‘bookend’ framing helps us understand LLMs and their application in workplace AI, and it is something to be very conscious of when choosing how to automate workflows.

Until last week, my view on LLMs and reasoning could be summarised by this statement:

An LLM’s output is not reasoning in the way we reason; it simply has a better-trained Stream of Consciousness.

It appears that we (specifically Apple) have found the upper bound of an LLM’s ability to “reason” about and solve a problem.

Here I will reference two key papers that act as bookends to this period of “Agentic Reasoning”:

  1. The first paper, published in 2023 by Google Research (Self-Consistency Improves Chain of Thought Reasoning in Language Models), highlighted that sampling multiple responses to the original request and taking the most consistent answer produced higher scores on the evaluations.

  2. The second paper, published this month by Apple (The Illusion of Thinking), highlighted that the models are capable of tasks from low to medium complexity, and that they capitulate on higher-complexity tasks.

After more reflection on the Apple paper and linking it to the Google Research paper, I’d like to be more precise:

LLMs operate through “limited capacity stochastic construction” rather than causal reasoning; their output is better described as an Agentic Stream of Consciousness.

Papers bookending an Agentic Stream of Consciousness

Core Definitions: The Cognitive Landscape

Building on the approach from my [IA Series 4/n] A Big Question: Why Study Logic in a World of Probabilistic AI?, here are the key terms we need to understand:

Stream of Consciousness

What Psychology Calls the ‘Stream of Consciousness’ Metaphor

Wikipedia has this nice definition:

The metaphor “stream of consciousness” suggests how thoughts seem to flow through the conscious mind. Research studies have shown that humans only experience one mental event at a time, as a fast-moving mind-stream. The full range of thoughts one can be aware of forms the content of this “stream”.

I feel it is important to say that at no point am I suggesting that LLMs have the same underlying mechanics as a brain. They don’t. It is maths and silicon. They do have a wonderful ability to connect concepts in a coherent stream.

William James coined the term in his 1890 book The Principles of Psychology.

Stream of consciousness is arguably James’ most famous psychological metaphor. He argued that human thought can be characterized as a flowing stream, which was an innovative concept at the time due to the prior argument being that human thought was more so like a distinct chain. He also believed that humans can never experience exactly the same thought or idea more than once. In addition to this, he viewed consciousness as completely continuous.

Support for the ‘Stream of Consciousness’ Metaphor in the Context of LLMs

There are similar properties between this definition of a human’s stream of consciousness and the inference an LLM performs given an input (prompt). I see many common-sense connections. Edward Y. Chang’s recent paper The Unified Cognitive Consciousness Theory for Language Models: Anchoring Semantics, Thresholds of Activation, and Emergent Reasoning introduces an Unconscious–Conscious Complementarity Thesis (𝖴𝖢𝖢𝖳), in which he proposes cognitive principles and makes connections to ideas that match the metaphor of “stream of consciousness”:

… we propose the Unconscious–Conscious Complementarity Thesis (𝖴𝖢𝖢𝖳): LLMs function as unconscious substrates, repositories of latent patterns, while intelligent behavior emerges when a conscious layer, instantiated via prompting or structured interaction, selectively activates and aligns these patterns with task-relevant semantics.

Where the ‘Stream of Consciousness’ Metaphor Diverges

Like human consciousness, this continuity may be somewhat illusory, but unlike human consciousness, it’s purely a stochastic construction. A “subconscious” built via supervised, unsupervised, and reinforcement learning.

Each inference of an LLM is both a new continuity and a potentially repeatable process. I created det to verify the stochastic consistency of an LLM’s responses. It shows that some LLMs do produce repeatable outputs: each flow can be a repetition of a previous “stream of consciousness”, producing exactly the same “thought” or “idea”.
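As a minimal sketch of the kind of repeatability check det performs, assuming a hypothetical call_model function standing in for whichever client library you use (it is not det’s actual API):

```python
import hashlib
from collections import Counter

def call_model(prompt: str, temperature: float = 0.0) -> str:
    """Hypothetical stand-in for your LLM client call.
    Replace with whichever API or local runtime you actually use."""
    raise NotImplementedError

def repeatability(prompt: str, runs: int = 10, temperature: float = 0.0) -> Counter:
    """Call the model repeatedly with an identical prompt and count distinct outputs.
    A single bucket means the 'stream' was exactly repeated; many buckets means
    the process behaved stochastically."""
    digests = Counter()
    for _ in range(runs):
        output = call_model(prompt, temperature=temperature)
        digests[hashlib.sha256(output.encode("utf-8")).hexdigest()] += 1
    return digests

# Example: at temperature 0, some (but not all) models collapse to a single bucket.
# print(repeatability("List the first five prime numbers.", runs=10))
```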

There are architectural and sampling choices (e.g. Mixture of Experts routing, or a higher temperature that samples beyond the single most likely token) that add randomness into the process. This still supports the metaphor of responses being an LLM’s stream of consciousness; however, it highlights the underlying structural differences and the knowledge we do have of how LLMs work.

Stochastic vs. Probabilistic vs. Statistical

I prefer “stochastic” over the alternatives because it captures the temporal, process-oriented nature of LLM generation:

  • Stochastic: Describes processes that evolve over time with inherent randomness.
  • Probabilistic: About using probability distributions to quantify outcomes. Easily misinterpreted, e.g. when a Bayesian view is being taken but the reader holds a Frequentist view of probability.
  • Statistical: About analyzing observed data to draw conclusions. Backwards-looking rather than generative.

Neural Networks: Computational Substrates

Neural networks are computational function approximators: they “approximate nonlinear functions” through layered compositions of weighted sums and nonlinear activation functions. Crucially (a short sketch after the list below makes the distinction concrete):

  • Training phase: Stochastic (random initialization, stochastic gradient descent)
  • Post-training: Deterministic (same input → same output distribution)
  • During generation: Can be stochastic (through sampling mechanisms)
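Here is a toy numpy sketch of that distinction (the logits are invented for illustration, not produced by a real model): the same input always yields the same output distribution, and randomness only enters when we sample from it.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn raw scores into a probability distribution over tokens."""
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

# Toy 'forward pass': for a fixed input, the network's logits are fixed,
# so the resulting distribution is deterministic.
logits = np.array([2.0, 1.0, 0.2, -1.0])   # pretend next-token scores
distribution = softmax(logits)              # same input -> same distribution

# Greedy decoding is deterministic: always pick the most likely token.
greedy_token = int(np.argmax(distribution))

# Sampling (optionally with a temperature) is where stochasticity enters.
rng = np.random.default_rng()
temperature = 0.8
sampled_dist = softmax(logits / temperature)
sampled_token = int(rng.choice(len(sampled_dist), p=sampled_dist))

print(distribution, greedy_token, sampled_token)
```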

Learning by Rote vs. Causal Reasoning

Learning by Rote: Like my analogy of a student who’s memorised times tables up to 12 but comes unstuck when asked for 2 × 13 - they “never learned past 12 and don’t know the answer.”

Causal Reasoning: Understanding why something is true or how a solution is derived, not merely what the answer is. The ability to decompose problems and generalize from specific patterns to broader rules.

The Key Distinction: Stochastic vs. Causal Construction

This is the crux: LLMs operate through “limited capacity stochastic construction” rather than causal reasoning. They’re sophisticated generators that create plausible sequences, not systems that understand underlying causal relationships.

The Evidence: What the Research Shows

Apple’s research provides compelling evidence through controlled puzzle environments, revealing three distinct performance regimes:

The Three Performance Regimes

  • Low complexity (1-3 disks): Standard LLMs show high accuracy (>80%), while Large Reasoning Models are moderate (<80%). Key finding: standard LLMs often outperform LRMs.
  • Medium complexity (4-7 disks): Standard LLMs decline (<50%), while LRMs are moderate-to-high (>50%). Key finding: LRMs show a clear advantage.
  • High complexity (≥8 disks): Both are near zero. Key finding: both collapse completely.

Complete Accuracy Collapse

Beyond certain complexity thresholds, even frontier Large Reasoning Models face “complete accuracy collapse” - performance drops to zero regardless of model or puzzle type. This isn’t a training data issue; the puzzles are designed with “consistent logical structures” where only complexity increases.

Counter-intuitive “Giving Up”

Most tellingly, as problem complexity increases, LRMs initially generate more tokens (apparent “thinking”), but upon approaching their collapse point, they reduce their effort despite having adequate token budgets. This suggests they “give up” rather than adapting strategically.

Failure with Explicit Algorithms

Even when provided with explicit recursive algorithms (like the Tower of Hanoi solution), models still collapse at the same complexity thresholds. If they could truly “reason,” executing a given algorithm should be easier than deriving one.
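For reference, the explicit algorithm in question is tiny. Here is a standard recursive Tower of Hanoi solution (my own sketch, not the paper’s exact prompt), which is what makes the failure to execute it so striking:

```python
def hanoi(n: int, source: str = "A", target: str = "C", spare: str = "B") -> list[tuple[str, str]]:
    """Return the 2**n - 1 moves that transfer n disks from source to target."""
    if n == 0:
        return []
    moves = hanoi(n - 1, source, spare, target)   # move n-1 disks out of the way
    moves.append((source, target))                # move the largest disk
    moves += hanoi(n - 1, spare, target, source)  # move n-1 disks on top of it
    return moves

print(len(hanoi(8)))  # 255 moves: trivial for this function, a collapse point for LRMs
```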

The “Overthinking” Phenomenon

For simpler problems, LRMs often identify correct solutions early but then continue exploring incorrect alternatives. This isn’t strategic deliberation but rather the continuation of a “stream” that prioritises generating plausible sequences over efficient problem-solving.

The Nuance: Expanding the Horizon of “Reasoning”

The story isn’t simply “LLMs can’t reason.” Google’s research demonstrates that something resembling reasoning is happening, creating what I see as a compelling “bookend” of research when paired with Apple’s.

Self-Consistency: The Lower Bookend

Self-Consistency shows how strategically leveraging LLMs’ stochastic nature, sampling several answers and aggregating them, can dramatically improve performance:

  • Diverse Reasoning Paths: Instead of greedy decoding, sample multiple “streams of consciousness” using higher temperatures
  • Aggregation: Use majority vote to find the most consistent answer across diverse paths
  • Remarkable Results: +17.9% improvement on GSM8K, +11.0% on SVAMP

This mimics human problem-solving: if “multiple different ways of thinking lead to the same answer, one has greater confidence that the final answer is correct.”

Crucially, this is still stochastic generation - just more strategically deployed. The paper notes that “correct reasoning processes, even if they are diverse, tend to have greater agreement in their final answer than incorrect processes.”
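A minimal sketch of that sampling-and-voting loop, assuming the same hypothetical call_model stand-in from the earlier sketch and an extract_answer helper of my own invention (the real paper uses task-specific answer parsing):

```python
from collections import Counter

def extract_answer(response: str) -> str:
    """Hypothetical helper: pull the final answer out of a chain-of-thought response.
    In practice this is task-specific (e.g. the last number for GSM8K)."""
    return response.strip().splitlines()[-1]

def self_consistency(prompt: str, samples: int = 10, temperature: float = 0.7) -> str:
    """Sample several diverse 'streams of consciousness' and majority-vote the answers."""
    answers = Counter()
    for _ in range(samples):
        response = call_model(prompt, temperature=temperature)  # diverse reasoning paths
        answers[extract_answer(response)] += 1
    answer, _count = answers.most_common(1)[0]                  # aggregate by majority vote
    return answer
```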

Apple’s Research: The Upper Bookend

Apple’s findings define the fundamental limits of even these sophisticated approaches:

  • Complete collapse beyond certain complexities
  • Failure to benefit from explicit algorithms
  • Counter-intuitive reduction in “thinking” effort
  • Inconsistent performance across puzzle types

The Bookend Framing

Self-Consistency demonstrates the start of apparent reasoning - how clever sampling and aggregation can push stochastic construction to impressive heights.

Apple’s research reveals the ceiling of this approach - where even the most sophisticated stochastic methods hit fundamental walls.

Together, they define the boundaries of what “limited capacity stochastic constructors” can achieve.

Theory: LLMs as Limited Capacity Stochastic Constructors

The evidence supports viewing LLMs as sophisticated stochastic generators rather than reasoning systems:

Why “Pattern Matching” Undersells the Complexity

LLMs aren’t simply matching static patterns. They’re dynamic systems that generate contextually appropriate sequences through learned probability distributions in a high-dimensional semantic space. The “better-trained stream of consciousness” concept captures this: they’re trained to produce increasingly sophisticated and contextually relevant flows.

The Bayesian Framework

I like to think of LLMs as a “series of superimposed probability distributions.” Each layer creates subjective belief (in the Bayesian sense) based on the input, and these beliefs combine to generate coherent output. But coherence doesn’t equal causal understanding.
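One way to make that framing concrete (at the level of generation steps rather than layers) is the standard autoregressive factorisation: the model’s belief about each next token, conditioned on everything generated so far, multiplies out into a belief about the whole output.

```latex
% Coherent output as a product of per-token conditional beliefs
P_\theta(x_1, \dots, x_T \mid \mathrm{prompt}) = \prod_{t=1}^{T} P_\theta\left(x_t \mid \mathrm{prompt},\, x_{<t}\right)
```

Each factor is a full probability distribution produced deterministically by the network; the stochasticity comes only from how we sample from it.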

Neural Networks as Substrates

Neural networks provide the computational substrate for this stochastic generation. They’re deterministic function approximators that, when combined with sampling mechanisms, become stochastic constructors. The “limited capacity” comes from their finite parameter space and training distribution.

The Stream Metaphor

Like James’s stream of consciousness, LLM generation flows continuously, creating apparent logical progression. But as Apple’s research shows, when the stream encounters high problem complexity, it doesn’t adapt strategically: it fails completely, and even “gives up”.

The Big Questions

This framing raises fundamental questions about AI development and our expectations:

The Original Question

“How does it solve things outside its source data and training?”

The evidence suggests that it doesn’t. When problems deviate sufficiently from training patterns, even sophisticated systems collapse entirely.

Implications for AGI

If current systems are sophisticated stochastic constructors rather than reasoning engines, what does this mean for AGI development? Are we scaling the wrong paradigm, or are these systems stepping stones to something genuinely different?

The Business Deployment Question

Understanding LLMs as “limited capacity stochastic constructors” has practical implications. We should deploy them where sophisticated pattern generation is valuable while being realistic about their fundamental limitations.

If you are creating an Agentic system, ensure that you have human oversight for complex tasks. Tailor your prompts so that the LLMs offer potential insight into data connections, rather than solutions. Use them with the aim of breaking the problem down into a series of less complex tasks, and keep human oversight on this part of the process.
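As a hedged sketch of what that can look like in practice (again using the hypothetical call_model stand-in from earlier, with a decomposition prompt of my own invention rather than a prescribed recipe):

```python
DECOMPOSE_PROMPT = (
    "Break the following task into a numbered list of small, low-complexity steps. "
    "Do not solve the task; only propose the steps and the data each step would need.\n\n"
    "Task: {task}"
)

def plan_with_oversight(task: str) -> list[str]:
    """Ask the LLM for a decomposition, then require a human to approve it
    before any step is executed."""
    plan = call_model(DECOMPOSE_PROMPT.format(task=task), temperature=0.3)
    print(plan)
    if input("Approve this plan? [y/N] ").lower() != "y":   # human checkpoint
        raise SystemExit("Plan rejected; revise the task or the prompt.")
    return [line for line in plan.splitlines() if line.strip()]
```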

The Meta-Question

Perhaps most importantly: Are we asking the right questions about intelligence? The bookend framing suggests these systems occupy a fascinating middle ground - too sophisticated to dismiss as simple pattern matching, too limited to call genuine reasoning.

The answer matters not just for AI development, but for how we understand intelligence itself.

Conclusion

In this article we’ve looked at two key papers, Google’s Self-Consistency work and Apple’s ‘Illusion of Thinking’, and shown that they provide critical bookends for understanding LLMs. They show that these models are ‘limited capacity stochastic constructors’ that generate an ‘Agentic Stream of Consciousness’. While remarkably adept at mimicking reasoning within certain bounds, they consistently collapse when faced with genuine causal-reasoning demands or with problems whose complexity exceeds their capacity.

For businesses and AI developers, this understanding translates directly into smarter deployment strategies. LLMs excel when used to solve problems that are, relative to the training data, of low or medium complexity. However, it is important to avoid the stochastic illusion: seeing reasoning where there is none. These models do not reason causally, and will not overcome fundamental limitations with more ‘thinking time’. Human oversight is required.

The broader question remains: can a system be created that performs causal reasoning (what some call AGI)? The evidence suggests that current systems cannot; however, by looking deeply we can see what they are not doing as well as what they do. A big upside is that they give us capable tools while teaching us what reasoning actually requires.


This exploration builds on my IA Series investigation into AI foundations and my earlier thoughts on LLM reasoning. The journey continues as we try to understand what these remarkable but limited systems actually do.